# Autotuner: Comprehensive Deployment Architecture Analysis

## Executive Summary

The autotuner is a prototype LLM inference parameter optimization system that uses **Kubernetes-native deployment** through the OME (Open Model Engine) framework. It supports two benchmark execution modes and dynamically deploys InferenceServices with different parameter configurations for automated tuning.

---

## 1. Current Deployment Architecture

### 1.1 Overall Architecture Pattern

```
┌─────────────────────────────────────────────┐
│             Autotuner (Python)              │
│      src/run_autotuner.py (Orchestrator)    │
└──────────────────────┬──────────────────────┘
           ┌───────────┴────────────┐
           ▼                        ▼
 ┌──────────────────┐    ┌──────────────────────┐
 │  OMEController   │    │ BenchmarkController  │
 │  (K8s Native)    │    │     (Two Modes)      │
 └────────┬─────────┘    └──────────┬───────────┘
          ▼                         ▼
 ┌──────────────────┐    ┌──────────────────────┐
 │ InferenceService │    │  BenchmarkJob (CRD)  │
 │ Model / Runtime  │    │  or Direct CLI       │
 │      (CRDs)      │    │  (Port Forward)      │
 └────────┬─────────┘    └──────────┬───────────┘
          └────────────┬────────────┘
                       ▼
              ┌────────────────┐
              │   Kubernetes   │
              │  (K8s Cluster) │
              └───────┬────────┘
           ┌──────────┴─────────────┐
           ▼                        ▼
 ┌──────────────────┐    ┌──────────────────────┐
 │  SGLang Runtime  │    │      Genai-Bench     │
 │       Pods       │    │     Benchmarking     │
 └──────────────────┘    └──────────────────────┘
```

### 1.2 Deployment Modes: Two Benchmark Execution Pathways

#### Mode 1: Kubernetes BenchmarkJob (Default)
- Uses OME's BenchmarkJob Custom Resource Definition
- Runs genai-bench inside Kubernetes pods
- Results stored in a PersistentVolumeClaim (PVC)
- Command: `python src/run_autotuner.py examples/simple_task.json`

#### Mode 2: Direct CLI Mode (Recommended)
- Uses a locally installed genai-bench executable
- Automatic `kubectl port-forward` to the InferenceService
- Bypasses Docker image dependencies
- Command: `python src/run_autotuner.py examples/simple_task.json --direct`

---

## 2. Key Deployment Components

### 2.1 OME (Open Model Engine) - The Core Infrastructure

OME is a **required prerequisite** that provides:

**Custom Resource Definitions (CRDs):**
- `InferenceService.ome.io/v1beta1` - Model serving endpoints
- `BenchmarkJob.ome.io/v1beta1` - Performance testing jobs
- `ClusterBaseModel.ome.io/v1beta1` - Model metadata and storage
- `ClusterServingRuntime.ome.io/v1beta1` - Runtime configurations

**Supporting Components:**
- OME Controller Manager (Kubernetes Deployment)
- Model Agent DaemonSet (model distribution)
- RBAC configurations
- Webhooks (validation and mutation)

**Installation:**
- Via Helm: `oci://ghcr.io/moirai-internal/charts/ome-crd` and `charts/ome-resources`
- Dependencies: cert-manager, KEDA
- Namespace: `ome`

### 2.2 InferenceService Deployment

**What It Is:**
- OME custom resource that wraps SGLang inference engines
- Dynamically created per experiment with a unique parameter configuration

**Template-Based Generation:**
- File: `/src/templates/inference_service.yaml.j2` (Jinja2 template)
- Contains:
  - Namespace declaration
  - InferenceService metadata with experiment labels
  - Model specification
  - SGLang engine configuration with tunable parameters
  - GPU resource requests

**Template Variables:**
```
{{ namespace }}        - K8s namespace (e.g., "autotuner")
{{ isvc_name }}        - Unique service name (e.g., "simple-tune-exp1")
{{ task_name }}        - Task identifier (e.g., "simple-tune")
{{ experiment_id }}    - Experiment number (1, 2, 3, ...)
{{ model_name }}       - Base model (e.g., "llama-3-2-1b-instruct")
{{ runtime_name }}     - Runtime config (e.g., "llama-3-2-1b-instruct-rt")
{{ tp_size }}          - Tensor parallelism (GPU count)
{{ mem_frac }}         - Memory fraction (0.6-0.95)
{{ max_total_tokens }} - Optional token limit
{{ schedule_policy }}  - Optional scheduling (lpm, random, fcfs)
```
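To make the template flow concrete, here is a minimal rendering sketch using Jinja2 (already listed in `requirements.txt`). The variable values are illustrative and the loader path assumes the `src/templates` directory named above; the actual rendering happens inside the OME controller, not in this standalone form:

```python
import yaml
from jinja2 import Environment, FileSystemLoader

# Load the template directory referenced in this section.
env = Environment(loader=FileSystemLoader("src/templates"))
template = env.get_template("inference_service.yaml.j2")

# Illustrative values for a single experiment; real values come from the task JSON.
manifest_yaml = template.render(
    namespace="autotuner",
    isvc_name="simple-tune-exp1",
    task_name="simple-tune",
    experiment_id=1,
    model_name="llama-3-2-1b-instruct",
    runtime_name="llama-3-2-1b-instruct-rt",
    tp_size=1,
    mem_frac=0.9,
    max_total_tokens=8192,
    schedule_policy="lpm",
)

# Parse into a dict ready to be posted via the Kubernetes CustomObjectsApi.
manifest = yaml.safe_load(manifest_yaml)
```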
**SGLang Container Args:**
```yaml
- --host=0.0.0.0
- --port=8080
- --model-path=/mnt/data/models/{model_name}
- --tp-size={tp_size}
- --mem-frac={mem_frac}
```

**GPU Resources:**
```yaml
limits:
  nvidia.com/gpu: {{ tp_size }}
requests:
  nvidia.com/gpu: {{ tp_size }}
```

### 2.3 BenchmarkJob Deployment

**When Used (K8s Mode Only):**
- After the InferenceService becomes ready
- For performance evaluation

**Template-Based Generation:**
- File: `/src/templates/benchmark_job.yaml.j2`
- Unique per experiment

**Key Configuration:**
```yaml
podOverride:
  image: "kllambda/genai-bench:v251014"
endpoint:
  url: "http://{isvc_name}.{namespace}.svc.cluster.local"
  apiFormat: "openai"
outputLocation:
  storageUri: "pvc://benchmark-results-pvc/{benchmark_name}"
```

### 2.4 Persistent Storage

**PersistentVolumeClaim:**
- File: `/config/benchmark-pvc.yaml`
- Name: `benchmark-results-pvc`
- Namespace: `autotuner`
- Size: 1Gi
- Access: ReadWriteOnce
- Purpose: Store benchmark results from BenchmarkJobs

---

## 3. Deployment Logic Implementation

### 3.1 Main Orchestrator: `src/run_autotuner.py`

**Class: `AutotunerOrchestrator`**

**Constructor (`__init__`):**
```python
def __init__(self, kubeconfig_path: str = None, use_direct_benchmark: bool = False)
```
- Initializes OMEController for InferenceService management
- Selects the benchmark controller based on mode:
  - DirectBenchmarkController (`--direct` flag)
  - BenchmarkController (default)
- Prints the active mode to the console

**Main Execution Flow (`run_task`):**
1. Load the JSON task configuration
2. Generate the parameter grid (Cartesian product of all parameter combinations)
3. For each parameter combination:
   - Call `run_experiment()` with the parameters
   - Append results to a list
4. Find the best result by objective score
5. Save a summary to `results/{task_name}_results.json`

**Per-Experiment Flow (`run_experiment`):**
```
Step 1: Deploy InferenceService
  └─> OMEController.deploy_inference_service()
        └─> Creates a unique InferenceService with the experiment's parameters

Step 2: Wait for Ready
  └─> OMEController.wait_for_ready()
        └─> Polls status.conditions for Ready=True
        └─> Timeout: configurable (default 600s)

Step 3: Run Benchmark
  ├─> If --direct mode:
  │     └─> DirectBenchmarkController.run_benchmark()
  │           └─> Sets up kubectl port-forward
  │           └─> Runs genai-bench CLI locally
  │
  └─> Else (K8s mode):
        └─> BenchmarkController.create_benchmark_job()
        └─> BenchmarkController.wait_for_completion()

Step 4: Collect Results
  └─> Extract metrics from benchmark output
  └─> Calculate objective score (minimize_latency / maximize_throughput)

Cleanup: Remove experiment resources
  └─> Delete InferenceService
  └─> Delete BenchmarkJob (if applicable)
```

### 3.2 OME Controller: `src/controllers/ome_controller.py`

**Class: `OMEController`**

**Kubernetes API Initialization:**
```python
from kubernetes import client, config

# Try kubeconfig first, then fall back to in-cluster config
config.load_kube_config(config_file=kubeconfig_path)
# Or, when running inside a pod: config.load_incluster_config()

self.custom_api = client.CustomObjectsApi()  # For CRDs
self.core_api = client.CoreV1Api()           # For core resources
```
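Given those two API clients, the create call at the heart of `deploy_inference_service()` plausibly reduces to a single `create_namespaced_custom_object` call. A hedged sketch (the helper name is illustrative; `manifest` is the rendered template dict from Section 2.2, and the group/version/plural values are the ones documented in the method list below):

```python
from kubernetes import client

def create_inference_service(custom_api: client.CustomObjectsApi,
                             namespace: str, manifest: dict) -> str:
    """Post a rendered InferenceService custom resource to the cluster.

    Sketch only: namespace creation and error handling are omitted.
    """
    obj = custom_api.create_namespaced_custom_object(
        group="ome.io",
        version="v1beta1",
        namespace=namespace,
        plural="inferenceservices",
        body=manifest,
    )
    return obj["metadata"]["name"]
```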
**Key Methods:**

1. **`deploy_inference_service()`**
   - Renders the Jinja2 template with experiment parameters
   - Creates the namespace if needed
   - Posts the rendered InferenceService custom resource to Kubernetes
   - Returns the InferenceService name
   - CRD group: `ome.io`, version: `v1beta1`, plural: `inferenceservices`

2. **`wait_for_ready()`**
   - Polls the `.status.conditions[].type == "Ready"` field
   - Checks `.status.conditions[].status == "True"`
   - Logs status messages and reasons
   - Returns True on Ready, False on timeout

3. **`delete_inference_service()`**
   - Deletes the InferenceService custom resource
   - Ignores 404 (already deleted)
   - Returns True on success

4. **`create_namespace()` / `get_service_url()`**
   - Helper methods for namespace management

### 3.3 Benchmark Controllers

#### BenchmarkController (K8s Mode): `src/controllers/benchmark_controller.py`

**Key Methods:**

1. **`create_benchmark_job()`**
   - Renders the BenchmarkJob template
   - Posts the custom resource to Kubernetes
   - CRD group: `ome.io`, version: `v1beta1`, plural: `benchmarkjobs`

2. **`wait_for_completion()`**
   - Polls the `.status.state` field
   - Returns True when state == "Complete"
   - Returns False when state == "Failed" or on timeout

3. **`get_benchmark_results()`**
   - Reads `.status.results` from the BenchmarkJob
   - Returns a metrics dictionary

#### DirectBenchmarkController (CLI Mode): `src/controllers/direct_benchmark_controller.py`

**Key Functionality:**

1. **`setup_port_forward()`**
   ```bash
   kubectl port-forward [pod|svc]/[name] 8080:8000 -n [namespace]
   ```
   - Finds the pod via label selector: `serving.kserve.io/inferenceservice={service_name}`
   - Falls back to the service name if no pods are found
   - Runs the port-forward as a background subprocess (see the sketch at the end of this section)
   - Returns `http://localhost:8080` as the endpoint URL

2. **`run_benchmark()`**
   - Builds the genai-bench command with all parameters
   - Executes: `env/bin/genai-bench benchmark --api-backend openai --api-base {endpoint} ...`
   - Captures output and parses JSON results
   - Cleans up the port-forward in a finally block

3. **`_parse_results()`**
   - Reads JSON result files from the benchmark output directory
   - Extracts metrics such as latency, throughput, TTFT, TPOT

4. **`cleanup_results()`**
   - Removes local benchmark result directories
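The subprocess-based port-forward described above might look like the following minimal sketch. Names are hypothetical; the real implementation in `direct_benchmark_controller.py` adds pod discovery via the label selector and more robust readiness checks:

```python
import subprocess
import time

def start_port_forward(target: str, namespace: str,
                       local_port: int = 8080, remote_port: int = 8000):
    """Start `kubectl port-forward` in the background and return (process, URL).

    Sketch only: assumes `kubectl` is on PATH and the target exists.
    """
    proc = subprocess.Popen(
        ["kubectl", "port-forward", target,
         f"{local_port}:{remote_port}", "-n", namespace],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    time.sleep(2)  # crude wait for the tunnel to come up
    if proc.poll() is not None:
        raise RuntimeError("kubectl port-forward exited early")
    return proc, f"http://localhost:{local_port}"

# Usage mirrors the finally-block cleanup described for run_benchmark().
proc, endpoint = start_port_forward("svc/simple-tune-exp1", "autotuner")
try:
    pass  # run genai-bench against `endpoint` here
finally:
    proc.terminate()
```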
---

## 4. Configuration Files

### 4.1 Task Configuration Schema

**File:** `examples/simple_task.json` (User-provided)

```json
{
  "task_name": "simple-tune",
  "description": "Description",
  "model": {
    "name": "llama-3-2-1b-instruct",
    "namespace": "autotuner"
  },
  "base_runtime": "llama-3-2-1b-instruct-rt",
  "parameters": {
    "tp_size": {"type": "choice", "values": [1, 2]},
    "mem_frac": {"type": "choice", "values": [0.8, 0.9]},
    "max_total_tokens": {"type": "choice", "values": [4096, 8192]},
    "schedule_policy": {"type": "choice", "values": ["lpm", "fcfs"]}
  },
  "optimization": {
    "strategy": "grid_search",
    "objective": "minimize_latency",
    "max_iterations": 4,
    "timeout_per_iteration": 600
  },
  "benchmark": {
    "task": "text-to-text",
    "model_name": "llama-3-2-1b-instruct",
    "model_tokenizer": "meta-llama/Llama-3.2-1B-Instruct",
    "traffic_scenarios": ["D(100,100)"],
    "num_concurrency": [1, 4],
    "max_time_per_iteration": 10,
    "max_requests_per_iteration": 50,
    "additional_params": {"temperature": "0.0"}
  }
}
```

**Key Constraints:**
- `model.name`: Must exist as a ClusterBaseModel
- `base_runtime`: Must exist as a ClusterServingRuntime
- `parameters`: Only the "choice" type is supported (Cartesian product; see the sketch at the end of this section)
- `optimization.strategy`: Only "grid_search" is supported
- `optimization.objective`: "minimize_latency" or "maximize_throughput"

### 4.2 Example Resource Configurations

**ClusterBaseModel Example:** File: `/config/examples/clusterbasemodel-llama-3.2-1b.yaml`

```yaml
apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-2-1b-instruct
spec:
  vendor: meta
  version: "3.2"
  modelType: llama
  modelParameterSize: "1B"
  maxTokens: 8192
  modelCapabilities: [text-to-text]
  modelFormat:
    name: safetensors
    version: "1.0.0"
  storage:
    storageUri: hf://meta-llama/Llama-3.2-1B-Instruct
    path: /mnt/data/models/llama-3.2-1b-instruct
```

**ClusterServingRuntime Example:** File: `/config/examples/clusterservingruntime-sglang.yaml`

```yaml
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: llama-3-2-1b-instruct-rt
spec:
  engineConfig:
    runner:
      name: ome-container
      image: docker.io/lmsysorg/sglang:v0.5.2-cu126
      command: [python3, -m, sglang.launch_server]
      args:
        - --host=0.0.0.0
        - --port=8080
        - --model-path=/mnt/data/models/llama-3.2-1b-instruct
        - --tp-size=1
        - --enable-metrics
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
```

**PVC Configuration:** File: `/config/benchmark-pvc.yaml`

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: benchmark-results-pvc
  namespace: autotuner
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
```
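The "choice"-type parameters above expand into a Cartesian product. A minimal sketch of that grid generation (the real logic lives in `src/utils/optimizer.py`; the function name here is illustrative):

```python
from itertools import product

def generate_parameter_grid(parameters: dict) -> list[dict]:
    """Expand "choice"-type parameter specs into all combinations."""
    names = list(parameters.keys())
    value_lists = [parameters[n]["values"] for n in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

params = {
    "tp_size": {"type": "choice", "values": [1, 2]},
    "mem_frac": {"type": "choice", "values": [0.8, 0.9]},
}
grid = generate_parameter_grid(params)
# -> 4 combinations: (1, 0.8), (1, 0.9), (2, 0.8), (2, 0.9)
```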
---

## 5. Deployment Entry Points

### 5.1 Primary Entry Point: `run_autotuner.py`

**CLI Interface:**
```bash
python src/run_autotuner.py task_json_file [kubeconfig_path] [--direct]

Arguments:
  task_json_file   - Path to the task configuration JSON
  kubeconfig_path  - Optional: path to kubeconfig (defaults to ~/.kube/config)
  --direct         - Optional: use the direct genai-bench CLI instead of a K8s BenchmarkJob

Examples:
  python src/run_autotuner.py examples/simple_task.json
  python src/run_autotuner.py examples/simple_task.json /path/to/kubeconfig
  python src/run_autotuner.py examples/simple_task.json --direct
  python src/run_autotuner.py examples/simple_task.json ~/.kube/config --direct
```

**Environment Activation:**
```bash
# Must activate the virtual environment first
source env/bin/activate

# Then run
python src/run_autotuner.py examples/simple_task.json --direct
```

### 5.2 Installation and Setup: `install.sh`

**CLI Interface:**
```bash
./install.sh [OPTIONS]

Options:
  --skip-venv       Skip virtual environment creation
  --skip-k8s        Skip Kubernetes resource creation
  --install-ome     Install the OME operator automatically
  --venv-path PATH  Custom virtual environment path
  --help            Show help message

Installation Flow:
  1. Verify prerequisites (Python, pip, kubectl, git)
  2. Initialize git submodules (OME, genai-bench)
  3. Create the Python virtual environment
  4. Install dependencies from requirements.txt
  5. Install genai-bench in editable mode
  6. Create the results/ and benchmark_results/ directories
  7. Create the K8s namespace and PVC
  8. Install OME (if --install-ome flag)
  9. Verify the OME installation and CRDs
```

**Key Dependencies Installed:**
```
kubernetes>=28.1.0
pyyaml>=6.0
jinja2>=3.1.0
genai-bench (from third_party/genai-bench)
```

---

## 6. Deployment Assumptions and Constraints

### 6.1 Hardcoded Assumptions

**Kubernetes Assumptions:**
- Cluster version: v1.28+
- Namespace for the autotuner: `autotuner`
- OME namespace: `ome`
- Kubernetes API available via kubeconfig or in-cluster config

**OME Assumptions:**
- OME CRDs installed: InferenceService, BenchmarkJob, ClusterBaseModel, ClusterServingRuntime
- OME controller pod running in the `ome` namespace
- KEDA and cert-manager installed as OME dependencies

**Model/Runtime Assumptions:**
- Model stored at a path accessible inside the K8s cluster
- Model path: `/mnt/data/models/{model_name}`
- Runtime image: `docker.io/lmsysorg/sglang:v0.5.2-cu126` (SGLang)
- Port: 8080 (InferenceService)
- Port: 8000 (SGLang server before port-forward)

**Hardware Assumptions:**
- GPU available (nvidia.com/gpu resource)
- GPU count specified by the `tp_size` parameter
- Model fits in GPU memory at the configured `mem_frac` setting

**Storage Assumptions:**
- A PVC named `benchmark-results-pvc` exists in the `autotuner` namespace
- The storage backend supports a 1Gi allocation
- AccessMode: ReadWriteOnce

### 6.2 Deployment Mode Selection Logic

```python
# Determined by CLI flag
if "--direct" in sys.argv:
    use_direct_benchmark = True
    benchmark_controller = DirectBenchmarkController()
    print("Using direct genai-bench CLI execution")
else:
    use_direct_benchmark = False
    benchmark_controller = BenchmarkController(kubeconfig_path)
    print("Using Kubernetes BenchmarkJob CRD")
```

**Default Behavior:** K8s BenchmarkJob mode (requires a working Docker image and PVC)

### 6.3 Status Polling and Timeouts

**InferenceService Ready Check:**
- Poll interval: 10 seconds
- Default timeout: 600 seconds (10 minutes)
- Condition checked: `.status.conditions[].type == "Ready"`
- Status field: `.status.conditions[].status` (must be "True")

**BenchmarkJob Completion Check:**
- Poll interval: 15 seconds
- Default timeout: 1800 seconds (30 minutes)
- Status field: `.status.state`
- States: "Complete", "Failed", or polling continues
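A hedged sketch of the Ready-condition polling loop described above, using this section's interval and timeout values (the helper name is illustrative; the real method is `OMEController.wait_for_ready()`):

```python
import time
from kubernetes import client

def wait_for_ready(custom_api: client.CustomObjectsApi, name: str,
                   namespace: str, timeout: int = 600, interval: int = 10) -> bool:
    """Poll an InferenceService's status until Ready=True or timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        isvc = custom_api.get_namespaced_custom_object(
            group="ome.io", version="v1beta1",
            namespace=namespace, plural="inferenceservices", name=name,
        )
        for cond in isvc.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") == "True":
                return True
        time.sleep(interval)
    return False  # timed out
```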
**Configurable Per Task** (value in seconds):
```json
"optimization": {
  "timeout_per_iteration": 600
}
```

---

## 7. Results Output and Storage

### 7.1 Results Persistence

**Location:** `results/{task_name}_results.json`

**Structure:**
```json
{
  "task_name": "simple-tune",
  "total_experiments": 4,
  "successful_experiments": 4,
  "elapsed_time": 1245.3,
  "best_result": {
    "experiment_id": 2,
    "parameters": {"tp_size": 1, "mem_frac": 0.9},
    "status": "success",
    "metrics": {...},
    "objective_score": 89.2
  },
  "all_results": [
    {
      "experiment_id": 1,
      "parameters": {"tp_size": 1, "mem_frac": 0.8},
      "status": "success",
      "metrics": {...},
      "objective_score": 125.3
    },
    ...
  ]
}
```

### 7.2 Direct CLI Mode Results

**Storage Location:** `benchmark_results/{task_name}-exp{id}/`

**File Pattern:** Genai-bench creates various output files:
- `*_results.json` - Benchmark metrics
- Various metadata and logging files

---

## 8. Deployment Workflow Diagram

```
User provides: examples/simple_task.json
        │
        ▼
run_autotuner.py main()
  - Parse CLI args (kubeconfig, --direct)
  - Create AutotunerOrchestrator
        │
        ▼
load_task(task_file)         Parse JSON config
        │
        ▼
generate_parameter_grid()    All combinations from "choice" params
        │
        ▼
For each parameter set: run_experiment()
  Step 1: Deploy InferenceService
  Step 2: Wait for Ready
  Step 3: Run Benchmark (two modes)
    ├─ Direct mode: port-forward, local genai-bench CLI
    └─ K8s mode:    BenchmarkJob, wait for completion, read PVC
  Step 4: Collect Results
    - Parse benchmark output
    - Calculate objective_score
  Cleanup: delete InferenceService (and BenchmarkJob in K8s mode)
        │
        │  (loop repeats for the next parameter combination)
        ▼
Find Best Result
  - Compare objective_scores
  - Select minimum (or inverted max)
        │
        ▼
Save Results
  results/{task_name}_results.json
```
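A minimal sketch of the best-result selection at the end of the flow, assuming (as the objective names suggest) that `minimize_latency` picks the smallest score and `maximize_throughput` the largest; the exact scoring lives in `src/utils/optimizer.py`. The sample scores below are the ones from Section 7.1:

```python
def find_best_result(results: list[dict], objective: str) -> dict:
    """Pick the best successful experiment by objective_score."""
    ok = [r for r in results if r["status"] == "success"]
    if not ok:
        raise ValueError("no successful experiments")
    if objective == "minimize_latency":
        return min(ok, key=lambda r: r["objective_score"])
    return max(ok, key=lambda r: r["objective_score"])

all_results = [
    {"experiment_id": 1, "status": "success", "objective_score": 125.3},
    {"experiment_id": 2, "status": "success", "objective_score": 89.2},
]
best = find_best_result(all_results, "minimize_latency")
print(best["experiment_id"])  # -> 2
```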
---

## 9. Key Files and Their Roles

| File | Lines | Purpose |
|------|-------|---------|
| `src/run_autotuner.py` | 305 | Main orchestrator, experiment flow control |
| `src/controllers/ome_controller.py` | 232 | InferenceService CRUD operations |
| `src/controllers/benchmark_controller.py` | 229 | BenchmarkJob CRD management (K8s mode) |
| `src/controllers/direct_benchmark_controller.py` | 335 | Direct genai-bench CLI execution |
| `src/templates/inference_service.yaml.j2` | 36 | InferenceService manifest template |
| `src/templates/benchmark_job.yaml.j2` | 41 | BenchmarkJob manifest template |
| `src/utils/optimizer.py` | 67 | Parameter grid generation, objective scoring |
| `install.sh` | 460 | Full environment setup and validation |
| `examples/simple_task.json` | 37 | Example task configuration |
| `config/benchmark-pvc.yaml` | 12 | PersistentVolumeClaim for K8s mode |
| `requirements.txt` | 4 | Python dependencies |
| `README.md` | 705 | Complete user documentation |

---

## 10. Deployment Scenarios

### Scenario 1: Direct CLI Mode (Recommended)

```
prerequisites:
  ✓ Kubernetes cluster with OME
  ✓ OME InferenceService CRD
  ✓ genai-bench CLI installed locally
  ✓ kubectl port-forward capability

flow:
  1. Deploy InferenceService (OME)
  2. Wait for pod ready
  3. kubectl port-forward to expose the service
  4. Run genai-bench CLI → http://localhost:8080
  5. Parse JSON results from the local filesystem
  6. Clean up the port-forward
  7. Delete the InferenceService
```

### Scenario 2: Kubernetes BenchmarkJob Mode

```
prerequisites:
  ✓ Kubernetes cluster with OME
  ✓ OME BenchmarkJob CRD
  ✓ PersistentVolumeClaim for results
  ✓ Working genai-bench Docker image
  ✓ Kubernetes RBAC permissions

flow:
  1. Deploy InferenceService (OME)
  2. Wait for pod ready
  3. Create the BenchmarkJob custom resource
  4. Wait for the BenchmarkJob to complete
  5. Read results from the PVC / BenchmarkJob status
  6. Delete the BenchmarkJob
  7. Delete the InferenceService
```

---

## 11. Summary of Deployment Mechanisms

### Current Deployment Stack:
1. **Orchestration:** Python (run_autotuner.py)
2. **Deployment Engine:** Kubernetes + OME CRDs
3. **Model Serving:** SGLang (via OME InferenceService)
4. **Benchmarking:** genai-bench (CLI or K8s pod)
5. **Configuration:** JSON task files, YAML templates
6. **Storage:** Kubernetes PersistentVolumeClaim

### Deployment Triggers:
1. **Manual:** `python src/run_autotuner.py [--direct]`
2. **Installation:** `./install.sh [--install-ome]`
3. **No automated triggers** (deployment is manual or scheduled externally)

### Infrastructure Code:
- Kubernetes API calls (via the kubernetes-client Python library)
- kubectl port-forward (subprocess-based)
- Helm charts for OME installation (shell script)
- No Docker build or Docker Compose

### Resource Lifecycle:
- **Created:** One InferenceService per experiment, one BenchmarkJob per experiment (K8s mode)
- **Managed:** Namespace `autotuner`, PVC `benchmark-results-pvc`
- **Deleted:** Each resource is cleaned up after its experiment completes
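The per-experiment cleanup in the lifecycle above tolerates already-deleted resources. A hedged sketch of that 404-ignoring delete, mirroring the `delete_inference_service()` behavior described in Section 3.2 (the standalone function form is illustrative):

```python
from kubernetes import client
from kubernetes.client.rest import ApiException

def delete_inference_service(custom_api: client.CustomObjectsApi,
                             name: str, namespace: str) -> bool:
    """Delete an InferenceService custom resource, ignoring 404s."""
    try:
        custom_api.delete_namespaced_custom_object(
            group="ome.io", version="v1beta1",
            namespace=namespace, plural="inferenceservices", name=name,
        )
    except ApiException as e:
        if e.status != 404:  # already deleted is fine
            raise
    return True
```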
---

## 12. Hardcoded Configuration Values

| Setting | Value | Location | Modifiable |
|---------|-------|----------|------------|
| OME CRD Group | `ome.io` | controllers/*.py | Code change |
| OME CRD Version | `v1beta1` | controllers/*.py | Code change |
| Namespace | `autotuner` | install.sh | Yes (shell var) |
| OME Namespace | `ome` | install.sh | Code change |
| PVC Name | `benchmark-results-pvc` | benchmark_job.yaml.j2 | Code change |
| SGLang Port | 8080 | multiple files | Code change |
| Port Forward Remote | 8000 | direct_benchmark_controller.py | Code change |
| Port Forward Local | 8080 | direct_benchmark_controller.py | Code change |
| Poll Interval (ISVC) | 10s | ome_controller.py | Code change |
| Poll Interval (Benchmark) | 15s | benchmark_controller.py | Code change |
| Default Timeout | 600s, 1800s | run_autotuner.py | Task JSON |
| Genai-bench Image | `kllambda/genai-bench:v251014` | benchmark_job.yaml.j2 | Code change |
| SGLang Image | `docker.io/lmsysorg/sglang:v0.5.2-cu126` | config examples | YAML |
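For quick reference, the table's values expressed as a hypothetical Python constants module (no such file exists in the repository; every value is copied from the table above):

```python
# hypothetical_constants.py: illustrative consolidation of the table above.
OME_CRD_GROUP = "ome.io"
OME_CRD_VERSION = "v1beta1"
AUTOTUNER_NAMESPACE = "autotuner"
OME_NAMESPACE = "ome"
PVC_NAME = "benchmark-results-pvc"
SGLANG_PORT = 8080
PORT_FORWARD_REMOTE = 8000
PORT_FORWARD_LOCAL = 8080
ISVC_POLL_INTERVAL_S = 10
BENCHMARK_POLL_INTERVAL_S = 15
DEFAULT_ISVC_TIMEOUT_S = 600
DEFAULT_BENCHMARK_TIMEOUT_S = 1800
GENAI_BENCH_IMAGE = "kllambda/genai-bench:v251014"
SGLANG_IMAGE = "docker.io/lmsysorg/sglang:v0.5.2-cu126"
```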