Autotuner: Comprehensive Deployment Architecture Analysis¶
Executive Summary¶
The autotuner is a prototype LLM inference parameter optimization system that uses Kubernetes-native deployment through the OME (Open Model Engine) framework. It supports two benchmark execution modes and dynamically deploys InferenceServices with different parameter configurations for automated tuning.
1. Current Deployment Architecture¶
1.1 Overall Architecture Pattern¶
┌─────────────────────────────────────────────────────────────┐
│                     Autotuner (Python)                      │
│             src/run_autotuner.py (Orchestrator)             │
└──────────────────────────────┬──────────────────────────────┘
                               │
              ┌────────────────┴────────────────────────┐
              │                                         │
      ┌───────▼───────┐                      ┌──────────▼──────────┐
      │ OMEController │                      │ BenchmarkController │
      │  (K8s Native) │                      │     (Two Modes)     │
      └───────┬───────┘                      └──────────┬──────────┘
              │                                         │
          ┌───┴───────────┬────────────┐                │
          │               │            │                │
┌─────────▼────────┐ ┌────▼────┐ ┌─────▼─────┐  ┌───────▼────────┐
│ InferenceService │ │  Model  │ │  Runtime  │  │   Direct CLI   │
│   (Custom CRD)   │ │  (CRD)  │ │   (CRD)   │  │ (Port Forward) │
└─────────┬────────┘ └────┬────┘ └─────┬─────┘  └───────┬────────┘
          │               │            │                │
          └───────────────┴────┬───────┴────────────────┘
                               │
                      ┌────────▼────────┐
                      │   Kubernetes    │
                      │  (K8s Cluster)  │
                      └────────┬────────┘
                               │
                ┌──────────────┴──────────────┐
                │                             │
       ┌────────▼───────┐            ┌────────▼───────┐
       │ SGLang Runtime │            │  Genai-Bench   │
       │      Pods      │            │  Benchmarking  │
       └────────────────┘            └────────────────┘
1.2 Deployment Modes: Two Benchmark Execution Pathways¶
Mode 1: Kubernetes BenchmarkJob (Default)¶
Uses OME’s BenchmarkJob Custom Resource Definition
Runs genai-bench inside Kubernetes pods
Results stored in PersistentVolumeClaim (PVC)
Command:
python src/run_autotuner.py examples/simple_task.json
Mode 2: Direct CLI Mode (Recommended)¶
Uses locally installed genai-bench executable
Automatic kubectl port-forward to the InferenceService
Bypasses Docker image dependencies
Command:
python src/run_autotuner.py examples/simple_task.json --direct
2. Key Deployment Components¶
2.1 OME (Open Model Engine) - The Core Infrastructure¶
OME is a required prerequisite that provides:
Custom Resource Definitions (CRDs):
InferenceService.ome.io/v1beta1 - Model serving endpoints
BenchmarkJob.ome.io/v1beta1 - Performance testing jobs
ClusterBaseModel.ome.io/v1beta1 - Model metadata and storage
ClusterServingRuntime.ome.io/v1beta1 - Runtime configurations
Supporting Components:
OME Controller Manager (Kubernetes Deployment)
Model Agent DaemonSet (model distribution)
RBAC configurations
Webhooks (validation and mutation)
Installation:
Via Helm: oci://ghcr.io/moirai-internal/charts/ome-crd and charts/ome-resources
Dependencies: cert-manager, KEDA
Namespace: ome
2.2 InferenceService Deployment¶
What It Is:
OME custom resource that wraps SGLang inference engines
Dynamically created per experiment with unique parameter configurations
Template-Based Generation:
File: /src/templates/inference_service.yaml.j2 (Jinja2 template)
Contains:
Namespace declaration
InferenceService metadata with experiment labels
Model specification
SGLang engine configuration with tunable parameters
GPU resource requests
Template Variables:
{{ namespace }} - K8s namespace (e.g., "autotuner")
{{ isvc_name }} - Unique service name (e.g., "simple-tune-exp1")
{{ task_name }} - Task identifier (e.g., "simple-tune")
{{ experiment_id }} - Experiment number (1, 2, 3, ...)
{{ model_name }} - Base model (e.g., "llama-3-2-1b-instruct")
{{ runtime_name }} - Runtime config (e.g., "llama-3-2-1b-instruct-rt")
{{ tp_size }} - Tensor parallelism (GPU count)
{{ mem_frac }} - Memory fraction (0.6-0.95)
{{ max_total_tokens }} - Optional token limit
{{ schedule_policy }} - Optional scheduling (lpm, random, fcfs)
SGLang Container Args:
- --host=0.0.0.0
- --port=8080
- --model-path=/mnt/data/models/{model_name}
- --tp-size={tp_size}
- --mem-frac={mem_frac}
GPU Resources:
limits:
  nvidia.com/gpu: {{ tp_size }}
requests:
  nvidia.com/gpu: {{ tp_size }}
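For illustration, the sketch below shows how the template could be rendered with the variables listed above. The loader path and variable names come from this document; the actual helper code in the repository may be structured differently.
# Illustrative rendering of the InferenceService template (not the repo's exact code).
import yaml
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("src/templates"))
template = env.get_template("inference_service.yaml.j2")

manifest_yaml = template.render(
    namespace="autotuner",
    isvc_name="simple-tune-exp1",
    task_name="simple-tune",
    experiment_id=1,
    model_name="llama-3-2-1b-instruct",
    runtime_name="llama-3-2-1b-instruct-rt",
    tp_size=1,
    mem_frac=0.9,
    max_total_tokens=8192,     # optional
    schedule_policy="lpm",     # optional
)

# Parse into a dict so it can be posted through the Kubernetes CustomObjectsApi.
manifest = yaml.safe_load(manifest_yaml)
print(manifest["metadata"]["name"])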
2.3 BenchmarkJob Deployment¶
When Used (K8s Mode Only):
After InferenceService becomes ready
For performance evaluation
Template-Based Generation:
File: /src/templates/benchmark_job.yaml.j2
Unique per experiment
Key Configuration:
podOverride:
  image: "kllambda/genai-bench:v251014"
endpoint:
  url: "http://{isvc_name}.{namespace}.svc.cluster.local"
  apiFormat: "openai"
outputLocation:
  storageUri: "pvc://benchmark-results-pvc/{benchmark_name}"
2.4 Persistent Storage¶
PersistentVolumeClaim:
File: /config/benchmark-pvc.yaml
Name: benchmark-results-pvc
Namespace: autotuner
Size: 1Gi
Access: ReadWriteOnce
Purpose: Store benchmark results from BenchmarkJobs
3. Deployment Logic Implementation¶
3.1 Main Orchestrator: src/run_autotuner.py¶
Class: AutotunerOrchestrator
Constructor (__init__):
def __init__(self, kubeconfig_path: str = None, use_direct_benchmark: bool = False)
Initializes OMEController for InferenceService management
Selects benchmark controller based on mode:
DirectBenchmarkController (--direct flag)
BenchmarkController (default)
Prints active mode to console
Main Execution Flow (run_task):
Load JSON task configuration
Generate parameter grid (Cartesian product of all "choice" parameter values; see the sketch below)
For each parameter combination:
Call run_experiment() with the parameters
Append the result to the results list
Find the best result by objective score
Save a summary to results/{task_name}_results.json
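A minimal sketch of the grid-generation step, assuming a standalone helper; the repository's actual function may live elsewhere and differ in detail.
# Cartesian product over every "choice" parameter's values list.
from itertools import product

def generate_parameter_grid(parameters: dict) -> list:
    names = list(parameters.keys())
    value_lists = [parameters[name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

grid = generate_parameter_grid({
    "tp_size": {"type": "choice", "values": [1, 2]},
    "mem_frac": {"type": "choice", "values": [0.8, 0.9]},
})
# 2 x 2 = 4 combinations:
# [{'tp_size': 1, 'mem_frac': 0.8}, {'tp_size': 1, 'mem_frac': 0.9},
#  {'tp_size': 2, 'mem_frac': 0.8}, {'tp_size': 2, 'mem_frac': 0.9}]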
Per-Experiment Flow (run_experiment):
Step 1: Deploy InferenceService
└─> OMEController.deploy_inference_service()
└─> Creates unique InferenceService with parameters
Step 2: Wait for Ready
└─> OMEController.wait_for_ready()
└─> Polls status.conditions for Ready=True
└─> Timeout: configurable (default 600s)
Step 3: Run Benchmark
├─> If --direct mode:
│ └─> DirectBenchmarkController.run_benchmark()
│ └─> Sets up kubectl port-forward
│ └─> Runs genai-bench CLI locally
│
└─> Else (K8s mode):
└─> BenchmarkController.create_benchmark_job()
└─> BenchmarkController.wait_for_completion()
Step 4: Collect Results
└─> Extract metrics from benchmark output
└─> Calculate objective score (minimize_latency / maximize_throughput)
Cleanup: Remove experiment resources
└─> Delete InferenceService
└─> Delete BenchmarkJob (if applicable)
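The objective score computed in Step 4 is a single "lower is better" number per experiment. The sketch below illustrates one plausible way to compute and compare it; the metric key names are assumptions for illustration, not confirmed genai-bench field names.
# Hypothetical scoring helper; metric keys are illustrative only.
def objective_score(metrics: dict, objective: str) -> float:
    if objective == "minimize_latency":
        return metrics["mean_e2e_latency_ms"]          # lower latency wins
    if objective == "maximize_throughput":
        return -metrics["output_tokens_per_second"]    # negate so min() picks the highest
    raise ValueError(f"unsupported objective: {objective}")

results = [
    {"experiment_id": 1, "status": "success", "metrics": {"mean_e2e_latency_ms": 125.3}},
    {"experiment_id": 2, "status": "success", "metrics": {"mean_e2e_latency_ms": 89.2}},
]
best = min(
    (r for r in results if r["status"] == "success"),
    key=lambda r: objective_score(r["metrics"], "minimize_latency"),
)
print(best["experiment_id"])  # 2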
3.2 OME Controller: src/controllers/ome_controller.py¶
Class: OMEController
Kubernetes API Initialization:
from kubernetes import client, config
# Try kubeconfig, then in-cluster config
config.load_kube_config(config_file=kubeconfig_path)
# Or: config.load_incluster_config()
self.custom_api = client.CustomObjectsApi() # For CRDs
self.core_api = client.CoreV1Api() # For core resources
Key Methods:
deploy_inference_service()
Renders Jinja2 template with parameters
Creates namespace if needed
Posts InferenceService CRD to Kubernetes
Returns InferenceService name
CRD group: ome.io, version: v1beta1, plural: inferenceservices
wait_for_ready()
Polls the .status.conditions[].type == "Ready" field
Checks .status.conditions[].status == "True"
Logs status messages and reasons
Returns True on Ready, False on timeout
delete_inference_service()
Deletes the InferenceService CRD
Ignores 404 (already deleted)
Returns True on success
create_namespace() / get_service_url()
Helper methods for namespace management and service URL lookup
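A condensed sketch of the deploy/wait pattern these methods implement, using the kubernetes Python client; error handling, logging, and namespace creation from the real controller are omitted.
# Minimal deploy / wait-for-ready pattern against the ome.io CRDs.
import time
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() inside a pod
custom_api = client.CustomObjectsApi()

GROUP, VERSION, PLURAL = "ome.io", "v1beta1", "inferenceservices"

def deploy_inference_service(manifest: dict) -> str:
    custom_api.create_namespaced_custom_object(
        group=GROUP, version=VERSION,
        namespace=manifest["metadata"]["namespace"],
        plural=PLURAL, body=manifest,
    )
    return manifest["metadata"]["name"]

def wait_for_ready(name: str, namespace: str, timeout: int = 600) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        obj = custom_api.get_namespaced_custom_object(GROUP, VERSION, namespace, PLURAL, name)
        for cond in obj.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") == "True":
                return True
        time.sleep(10)               # poll interval from Section 6.3
    return False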
3.3 Benchmark Controllers¶
BenchmarkController (K8s Mode): src/controllers/benchmark_controller.py¶
Key Methods:
create_benchmark_job()
Renders BenchmarkJob template
Posts to Kubernetes
CRD group: ome.io, version: v1beta1, plural: benchmarkjobs
wait_for_completion()
Polls the .status.state field
Returns True when state == "Complete"
Returns False when state == "Failed" or on timeout
get_benchmark_results()
Reads .status.results from the BenchmarkJob
Returns metrics dictionary
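The completion poll follows the same shape as the readiness check; a sketch under the field names described above (the real controller adds logging and result extraction):
# Poll .status.state until Complete, Failed, or timeout.
import time
from kubernetes import client, config

config.load_kube_config()
custom_api = client.CustomObjectsApi()

def wait_for_completion(name: str, namespace: str, timeout: int = 1800) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = custom_api.get_namespaced_custom_object(
            "ome.io", "v1beta1", namespace, "benchmarkjobs", name)
        state = job.get("status", {}).get("state")
        if state == "Complete":
            return True
        if state == "Failed":
            return False
        time.sleep(15)               # poll interval from Section 6.3
    return False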
DirectBenchmarkController (CLI Mode): src/controllers/direct_benchmark_controller.py¶
Key Functionality:
setup_port_forward()
kubectl port-forward [pod|svc]/[name] 8080:8000 -n [namespace]
Finds the pod via label selector: serving.kserve.io/inferenceservice={service_name}
Falls back to the service name if no pods are found
Subprocess-based port-forward in the background
Returns the http://localhost:8080 endpoint URL
run_benchmark()
Builds the genai-bench command with all parameters
Executes: env/bin/genai-bench benchmark --api-backend openai --api-base {endpoint} ...
Captures output and parses JSON results
Cleans up the port-forward in a finally block
_parse_results()
Reads JSON result files from the benchmark output directory
Extracts metrics such as latency, throughput, TTFT, and TPOT
cleanup_results()
Removes local benchmark result directories
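A simplified sketch of the port-forward plus CLI invocation pattern described above; only the genai-bench flags quoted in this document are shown, and the sleep-based readiness wait is a placeholder for the real controller's handling.
# Port-forward the service locally, run genai-bench against it, then clean up.
import subprocess
import time

def run_direct_benchmark(service_name: str, namespace: str) -> None:
    pf = subprocess.Popen(
        ["kubectl", "port-forward", f"svc/{service_name}", "8080:8000", "-n", namespace],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    try:
        time.sleep(3)  # crude wait for the tunnel to come up
        subprocess.run(
            ["env/bin/genai-bench", "benchmark",
             "--api-backend", "openai",
             "--api-base", "http://localhost:8080"],
            check=True,
        )
    finally:
        pf.terminate()  # mirrors the finally-block cleanup described above
        pf.wait()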
4. Configuration Files¶
4.1 Task Configuration Schema¶
File: examples/simple_task.json (User-provided)
{
"task_name": "simple-tune",
"description": "Description",
"model": {
"name": "llama-3-2-1b-instruct",
"namespace": "autotuner"
},
"base_runtime": "llama-3-2-1b-instruct-rt",
"parameters": {
"tp_size": {"type": "choice", "values": [1, 2]},
"mem_frac": {"type": "choice", "values": [0.8, 0.9]},
"max_total_tokens": {"type": "choice", "values": [4096, 8192]},
"schedule_policy": {"type": "choice", "values": ["lpm", "fcfs"]}
},
"optimization": {
"strategy": "grid_search",
"objective": "minimize_latency",
"max_iterations": 4,
"timeout_per_iteration": 600
},
"benchmark": {
"task": "text-to-text",
"model_name": "llama-3-2-1b-instruct",
"model_tokenizer": "meta-llama/Llama-3.2-1B-Instruct",
"traffic_scenarios": ["D(100,100)"],
"num_concurrency": [1, 4],
"max_time_per_iteration": 10,
"max_requests_per_iteration": 50,
"additional_params": {"temperature": "0.0"}
}
}
Key Constraints:
model.name: Must exist as a ClusterBaseModel
base_runtime: Must exist as a ClusterServingRuntime
parameters: Only "choice" type supported (Cartesian product)
optimization.strategy: Only "grid_search" supported
optimization.objective: "minimize_latency" or "maximize_throughput"
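A lightweight validation sketch for these constraints, illustrative only; it is not the repository's actual validator and cannot check that the referenced ClusterBaseModel or ClusterServingRuntime exist without querying the cluster.
# Check the constraints that can be verified from the JSON alone.
import json

SUPPORTED_OBJECTIVES = {"minimize_latency", "maximize_throughput"}

def validate_task(task: dict) -> None:
    if task["optimization"]["strategy"] != "grid_search":
        raise ValueError("only grid_search is supported")
    if task["optimization"]["objective"] not in SUPPORTED_OBJECTIVES:
        raise ValueError("objective must be minimize_latency or maximize_throughput")
    for name, spec in task["parameters"].items():
        if spec.get("type") != "choice":
            raise ValueError(f"parameter {name}: only 'choice' type is supported")

with open("examples/simple_task.json") as f:
    validate_task(json.load(f))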
4.2 Example Resource Configurations¶
ClusterBaseModel Example:
File: /config/examples/clusterbasemodel-llama-3.2-1b.yaml
apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-2-1b-instruct
spec:
  vendor: meta
  version: "3.2"
  modelType: llama
  modelParameterSize: "1B"
  maxTokens: 8192
  modelCapabilities: [text-to-text]
  modelFormat:
    name: safetensors
    version: "1.0.0"
  storage:
    storageUri: hf://meta-llama/Llama-3.2-1B-Instruct
    path: /mnt/data/models/llama-3.2-1b-instruct
ClusterServingRuntime Example:
File: /config/examples/clusterservingruntime-sglang.yaml
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: llama-3-2-1b-instruct-rt
spec:
  engineConfig:
    runner:
      name: ome-container
      image: docker.io/lmsysorg/sglang:v0.5.2-cu126
      command: [python3, -m, sglang.launch_server]
      args:
        - --host=0.0.0.0
        - --port=8080
        - --model-path=/mnt/data/models/llama-3.2-1b-instruct
        - --tp-size=1
        - --enable-metrics
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
PVC Configuration:
File: /config/benchmark-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: benchmark-results-pvc
  namespace: autotuner
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
5. Deployment Entry Points¶
5.1 Primary Entry Point: run_autotuner.py¶
CLI Interface:
python src/run_autotuner.py <task_json_file> [kubeconfig_path] [--direct]
Arguments:
task_json_file - Path to task configuration JSON
kubeconfig_path - Optional: Path to kubeconfig (defaults to ~/.kube/config)
--direct - Optional: Use direct genai-bench CLI instead of K8s BenchmarkJob
Examples:
python src/run_autotuner.py examples/simple_task.json
python src/run_autotuner.py examples/simple_task.json /path/to/kubeconfig
python src/run_autotuner.py examples/simple_task.json --direct
python src/run_autotuner.py examples/simple_task.json ~/.kube/config --direct
Environment Activation:
# Must activate virtual environment first
source env/bin/activate
# Then run
python src/run_autotuner.py examples/simple_task.json --direct
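Hypothetical argument handling that matches the CLI contract above and the mode-selection snippet in Section 6.2; the actual script may use argparse instead of raw sys.argv inspection.
# Positional task file, optional positional kubeconfig, optional --direct flag.
def parse_cli(argv: list) -> tuple:
    positional = [a for a in argv[1:] if a != "--direct"]
    use_direct = "--direct" in argv
    task_file = positional[0]
    kubeconfig = positional[1] if len(positional) > 1 else None  # falls back to ~/.kube/config
    return task_file, kubeconfig, use_direct

print(parse_cli(["run_autotuner.py", "examples/simple_task.json", "--direct"]))
# ('examples/simple_task.json', None, True)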
5.2 Installation and Setup: install.sh¶
CLI Interface:
./install.sh [OPTIONS]
Options:
--skip-venv Skip virtual environment creation
--skip-k8s Skip Kubernetes resource creation
--install-ome Install OME operator automatically
--venv-path PATH Custom virtual environment path
--help Show help message
Installation Flow:
1. Verify prerequisites (Python, pip, kubectl, git)
2. Initialize git submodules (OME, genai-bench)
3. Create Python virtual environment
4. Install dependencies from requirements.txt
5. Install genai-bench in editable mode
6. Create result/benchmark_results directories
7. Create K8s namespace and PVC
8. Install OME (if --install-ome flag)
9. Verify OME installation and CRDs
Key Dependencies Installed:
kubernetes>=28.1.0
pyyaml>=6.0
jinja2>=3.1.0
genai-bench (from third_party/genai-bench)
6. Deployment Assumptions and Constraints¶
6.1 Hardcoded Assumptions¶
Kubernetes Assumptions:
Cluster version: v1.28+
Namespace for autotuner: autotuner
OME namespace: ome
Kubernetes API available via kubeconfig or in-cluster config
OME Assumptions:
OME CRDs installed: InferenceService, BenchmarkJob, ClusterBaseModel, ClusterServingRuntime
OME controller pod running in the ome namespace
KEDA and cert-manager installed as OME dependencies
Model/Runtime Assumptions:
Model stored in K8s cluster accessible path
Model path: /mnt/data/models/{model_name}
Runtime image: docker.io/lmsysorg/sglang:v0.5.2-cu126 (SGLang)
Port: 8080 (InferenceService)
Port: 8000 (SGLang server before port-forward)
Hardware Assumptions:
GPU available (nvidia.com/gpu resource)
GPU count specified by the tp_size parameter
Model fits in GPU memory with the configured mem_frac setting
Storage Assumptions:
PVC named benchmark-results-pvc exists in the autotuner namespace
Storage backend supports a 1Gi allocation
AccessMode: ReadWriteOnce
6.2 Deployment Mode Selection Logic¶
# Determined by CLI flag
if "--direct" in sys.argv:
use_direct_benchmark = True
benchmark_controller = DirectBenchmarkController()
print("Using direct genai-bench CLI execution")
else:
use_direct_benchmark = False
benchmark_controller = BenchmarkController(kubeconfig_path)
print("Using Kubernetes BenchmarkJob CRD")
Default Behavior: K8s BenchmarkJob mode (requires working Docker image and PVC)
6.3 Status Polling and Timeouts¶
InferenceService Ready Check:
Poll interval: 10 seconds
Default timeout: 600 seconds (10 minutes)
Condition checked: .status.conditions[].type == "Ready"
Status field: .status.conditions[].status (must be "True")
BenchmarkJob Completion Check:
Poll interval: 15 seconds
Default timeout: 1800 seconds (30 minutes)
Status field: .status.state
States: "Complete", "Failed", or polling continues
Configurable Per Task:
"optimization": {
"timeout_per_iteration": 600 // seconds
}
7. Results Output and Storage¶
7.1 Results Persistence¶
Location: results/{task_name}_results.json
Structure:
{
"task_name": "simple-tune",
"total_experiments": 4,
"successful_experiments": 4,
"elapsed_time": 1245.3,
"best_result": {
"experiment_id": 2,
"parameters": {"tp_size": 1, "mem_frac": 0.9},
"status": "success",
"metrics": {...},
"objective_score": 89.2
},
"all_results": [
{
"experiment_id": 1,
"parameters": {"tp_size": 1, "mem_frac": 0.8},
"status": "success",
"metrics": {...},
"objective_score": 125.3
},
...
]
}
7.2 Direct CLI Mode Results¶
Storage Location: benchmark_results/{task_name}-exp{id}/
File Pattern: genai-bench creates various output files:
*_results.json - Benchmark metrics
Various metadata and logging files
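A rough sketch of collecting metrics from one experiment's output directory; the glob follows the *_results.json naming noted above, and merging every file into a single dict is an assumption about the real _parse_results() behavior.
# Merge every *_results.json file under the experiment directory into one metrics dict.
import json
from pathlib import Path

def parse_results(output_dir: str) -> dict:
    metrics = {}
    for path in sorted(Path(output_dir).glob("*_results.json")):
        with path.open() as f:
            metrics.update(json.load(f))
    return metrics

metrics = parse_results("benchmark_results/simple-tune-exp1")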
8. Deployment Workflow Diagram¶
User Provides: examples/simple_task.json
                      │
                      ▼
┌───────────────────────────────────────────┐
│ run_autotuner.py main()                   │
│  - Parse CLI args (kubeconfig, --direct)  │
│  - Create AutotunerOrchestrator           │
└─────────────────────┬─────────────────────┘
                      │
                      ▼
            load_task(task_file)
            Parse JSON config
                      │
                      ▼
            generate_parameter_grid()
            All combinations from "choice" params
                      │
                      ▼
┌──────────────────────────────────────────────────────────────────┐
│ For each parameter set (loop):                                   │
│   run_experiment():                                              │
│     Step 1: Deploy InferenceService                              │
│     Step 2: Wait for Ready                                       │
│     Step 3: Run Benchmark (two modes)                            │
│       Direct mode: port-forward + local genai-bench CLI          │
│       K8s mode: BenchmarkJob, wait for completion, read PVC      │
│     Step 4: Collect Results                                      │
│       Parse benchmark output, calculate objective_score          │
│   Cleanup: delete InferenceService (+ BenchmarkJob in K8s mode)  │
│   Repeat for the next parameter combination                      │
└─────────────────────┬────────────────────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────┐
│ Find Best Result                         │
│  - Compare objective_scores              │
│  - Select minimum (or inverted max)      │
└─────────────────────┬────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────┐
│ Save Results                             │
│ results/{task_name}_results.json         │
└──────────────────────────────────────────┘
9. Key Files and Their Roles¶
| File | Lines | Purpose |
|---|---|---|
| src/run_autotuner.py | 305 | Main orchestrator, experiment flow control |
| src/controllers/ome_controller.py | 232 | InferenceService CRUD operations |
| src/controllers/benchmark_controller.py | 229 | BenchmarkJob CRD management (K8s mode) |
| src/controllers/direct_benchmark_controller.py | 335 | Direct genai-bench CLI execution |
| src/templates/inference_service.yaml.j2 | 36 | InferenceService manifest template |
| src/templates/benchmark_job.yaml.j2 | 41 | BenchmarkJob manifest template |
| | 67 | Parameter grid generation, objective scoring |
| install.sh | 460 | Full environment setup and validation |
| examples/simple_task.json | 37 | Example task configuration |
| config/benchmark-pvc.yaml | 12 | PersistentVolumeClaim for K8s mode |
| requirements.txt | 4 | Python dependencies |
| | 705 | Complete user documentation |
10. Deployment Scenarios¶
Scenario 1: Direct CLI Mode (Recommended)¶
prerequisites:
✓ Kubernetes cluster with OME
✓ OME InferenceService CRD
✓ genai-bench CLI installed locally
✓ kubectl port-forward capability
flow:
1. Deploy InferenceService (OME)
2. Wait for pod ready
3. kubectl port-forward to expose service
4. Run genai-bench CLI → http://localhost:8080
5. Parse JSON results from local filesystem
6. Clean up port-forward
7. Delete InferenceService
Scenario 2: Kubernetes BenchmarkJob Mode¶
prerequisites:
✓ Kubernetes cluster with OME
✓ OME BenchmarkJob CRD
✓ PersistentVolumeClaim for results
✓ Working genai-bench Docker image
✓ Kubernetes RBAC permissions
flow:
1. Deploy InferenceService (OME)
2. Wait for pod ready
3. Create BenchmarkJob CRD
4. Wait for BenchmarkJob to complete
5. Read results from PVC/BenchmarkJob status
6. Delete BenchmarkJob
7. Delete InferenceService
11. Summary of Deployment Mechanisms¶
Current Deployment Stack:¶
Orchestration: Python (run_autotuner.py)
Deployment Engine: Kubernetes + OME CRDs
Model Serving: SGLang (via OME InferenceService)
Benchmarking: genai-bench (CLI or K8s Pod)
Configuration: JSON task files, YAML templates
Storage: Kubernetes PersistentVolumeClaim
Deployment Triggers:¶
Manual: python src/run_autotuner.py <task.json> [--direct]
Installation: ./install.sh [--install-ome]
No automated triggers (deployment is manual or scheduled externally)
Infrastructure Code:¶
Kubernetes API calls (via kubernetes-client Python library)
kubectl port-forward (subprocess-based)
Helm charts for OME installation (shell script)
No Docker build or Docker Compose
Resource Lifecycle:¶
Created: InferenceService per experiment, BenchmarkJob per experiment (K8s mode)
Managed: Namespace autotuner, PVC benchmark-results-pvc
Deleted: Each resource cleaned up after its experiment completes
12. Hardcoded Configuration Values¶
| Setting | Value | Location | Modifiable |
|---|---|---|---|
| OME CRD Group | ome.io | controllers/*.py | Code change |
| OME CRD Version | v1beta1 | controllers/*.py | Code change |
| Namespace | autotuner | install.sh | Yes (shell var) |
| OME Namespace | ome | install.sh | Code change |
| PVC Name | benchmark-results-pvc | benchmark_job.yaml.j2 | Code change |
| SGLang Port | 8080 | multiple files | Code change |
| Port Forward Remote | 8000 | direct_benchmark_controller.py | Code change |
| Port Forward Local | 8080 | direct_benchmark_controller.py | Code change |
| Poll Interval (ISVC) | 10s | ome_controller.py | Code change |
| Poll Interval (Benchmark) | 15s | benchmark_controller.py | Code change |
| Default Timeout | 600s, 1800s | run_autotuner.py | Task JSON |
| Genai-bench Image | kllambda/genai-bench:v251014 | benchmark_job.yaml.j2 | Code change |
| SGLang Image | docker.io/lmsysorg/sglang:v0.5.2-cu126 | config examples | YAML |