Autotuner: Comprehensive Deployment Architecture Analysis¶
Executive Summary¶
The autotuner is a prototype LLM inference parameter optimization system that uses Kubernetes-native deployment through the OME (Open Model Engine) framework. It supports two benchmark execution modes and dynamically deploys InferenceServices with different parameter configurations for automated tuning.
1. Current Deployment Architecture¶
1.1 Overall Architecture Pattern¶
┌─────────────────────────────────────────────────────────────┐
│                     Autotuner (Python)                      │
│             src/run_autotuner.py (Orchestrator)             │
└──────────────────────────────┬──────────────────────────────┘
                               │
              ┌────────────────┴────────────────────────┐
              │                                         │
      ┌───────▼───────┐                      ┌──────────▼──────────┐
      │ OMEController │                      │ BenchmarkController │
      │  (K8s Native) │                      │     (Two Modes)     │
      └───────┬───────┘                      └──────────┬──────────┘
              │                                         │
          ┌───┴───────────┬────────────┐                │
          │               │            │                │
┌─────────▼────────┐ ┌────▼────┐ ┌─────▼─────┐  ┌───────▼────────┐
│ InferenceService │ │  Model  │ │  Runtime  │  │   Direct CLI   │
│   (Custom CRD)   │ │  (CRD)  │ │   (CRD)   │  │ (Port Forward) │
└─────────┬────────┘ └────┬────┘ └─────┬─────┘  └───────┬────────┘
          │               │            │                │
          └───────────────┴────┬───────┴────────────────┘
                               │
                      ┌────────▼────────┐
                      │   Kubernetes    │
                      │  (K8s Cluster)  │
                      └────────┬────────┘
                               │
                ┌──────────────┴──────────────┐
                │                             │
       ┌────────▼───────┐            ┌────────▼───────┐
       │ SGLang Runtime │            │  Genai-Bench   │
       │      Pods      │            │  Benchmarking  │
       └────────────────┘            └────────────────┘
1.2 Deployment Modes: Two Benchmark Execution Pathways¶
Mode 1: Kubernetes BenchmarkJob (Default)¶
Uses OME’s BenchmarkJob Custom Resource Definition
Runs genai-bench inside Kubernetes pods
Results stored in PersistentVolumeClaim (PVC)
Command:
python src/run_autotuner.py examples/simple_task.json
Mode 2: Direct CLI Mode (Recommended)¶
Uses locally installed genai-bench executable
Automatic kubectl port-forward to the InferenceService
Bypasses Docker image dependencies
Command:
python src/run_autotuner.py examples/simple_task.json --direct
2. Key Deployment Components¶
2.1 OME (Open Model Engine) - The Core Infrastructure¶
OME is a required prerequisite that provides:
Custom Resource Definitions (CRDs):
InferenceService.ome.io/v1beta1 - Model serving endpoints
BenchmarkJob.ome.io/v1beta1 - Performance testing jobs
ClusterBaseModel.ome.io/v1beta1 - Model metadata and storage
ClusterServingRuntime.ome.io/v1beta1 - Runtime configurations
Supporting Components:
OME Controller Manager (Kubernetes Deployment)
Model Agent DaemonSet (model distribution)
RBAC configurations
Webhooks (validation and mutation)
Installation:
Via Helm: oci://ghcr.io/moirai-internal/charts/ome-crd and charts/ome-resources
Dependencies: cert-manager, KEDA
Namespace: ome
2.2 InferenceService Deployment¶
What It Is:
OME custom resource that wraps SGLang inference engines
Dynamically created per experiment with unique parameter configurations
Template-Based Generation:
File: /src/templates/inference_service.yaml.j2 (Jinja2 template)
Contains:
Namespace declaration
InferenceService metadata with experiment labels
Model specification
SGLang engine configuration with tunable parameters
GPU resource requests
Template Variables:
{{ namespace }} - K8s namespace (e.g., "autotuner")
{{ isvc_name }} - Unique service name (e.g., "simple-tune-exp1")
{{ task_name }} - Task identifier (e.g., "simple-tune")
{{ experiment_id }} - Experiment number (1, 2, 3, ...)
{{ model_name }} - Base model (e.g., "llama-3-2-1b-instruct")
{{ runtime_name }} - Runtime config (e.g., "llama-3-2-1b-instruct-rt")
{{ tp_size }} - Tensor parallelism (GPU count)
{{ mem_frac }} - Memory fraction (0.6-0.95)
{{ max_total_tokens }} - Optional token limit
{{ schedule_policy }} - Optional scheduling (lpm, random, fcfs)
SGLang Container Args:
- --host=0.0.0.0
- --port=8080
- --model-path=/mnt/data/models/{model_name}
- --tp-size={tp_size}
- --mem-frac={mem_frac}
GPU Resources:
limits:
  nvidia.com/gpu: {{ tp_size }}
requests:
  nvidia.com/gpu: {{ tp_size }}
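For illustration, the sketch below shows how the template could be rendered with the variables listed above. The loader path and variable names come from this document; the actual helper code in the repository may be structured differently.
# Illustrative rendering of the InferenceService template (not the repo's exact code).
import yaml
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("src/templates"))
template = env.get_template("inference_service.yaml.j2")

manifest_yaml = template.render(
    namespace="autotuner",
    isvc_name="simple-tune-exp1",
    task_name="simple-tune",
    experiment_id=1,
    model_name="llama-3-2-1b-instruct",
    runtime_name="llama-3-2-1b-instruct-rt",
    tp_size=1,
    mem_frac=0.9,
    max_total_tokens=8192,     # optional
    schedule_policy="lpm",     # optional
)

# Parse into a dict so it can be posted through the Kubernetes CustomObjectsApi.
manifest = yaml.safe_load(manifest_yaml)
print(manifest["metadata"]["name"])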
2.3 BenchmarkJob Deployment¶
When Used (K8s Mode Only):
After InferenceService becomes ready
For performance evaluation
Template-Based Generation:
File: /src/templates/benchmark_job.yaml.j2
Unique per experiment
Key Configuration:
podOverride:
  image: "kllambda/genai-bench:v251014"
endpoint:
  url: "http://{isvc_name}.{namespace}.svc.cluster.local"
  apiFormat: "openai"
outputLocation:
  storageUri: "pvc://benchmark-results-pvc/{benchmark_name}"
2.4 Persistent Storage¶
PersistentVolumeClaim:
File: /config/benchmark-pvc.yaml
Name: benchmark-results-pvc
Namespace: autotuner
Size: 1Gi
Access: ReadWriteOnce
Purpose: Store benchmark results from BenchmarkJobs
3. Deployment Logic Implementation¶
3.1 Main Orchestrator: src/run_autotuner.py¶
Class: AutotunerOrchestrator
Constructor (__init__):
def __init__(self, kubeconfig_path: str = None, use_direct_benchmark: bool = False)
Initializes OMEController for InferenceService management
Selects benchmark controller based on mode:
DirectBenchmarkController (--direct flag)
BenchmarkController (default)
Prints active mode to console
Main Execution Flow (run_task):
Load JSON task configuration
Generate parameter grid (Cartesian product of all "choice" parameter values; see the sketch below)
For each parameter combination:
Call run_experiment() with the parameters
Append the result to the results list
Find the best result by objective score
Save a summary to results/{task_name}_results.json
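A minimal sketch of the grid-generation step, assuming a standalone helper; the repository's actual function may live elsewhere and differ in detail.
# Cartesian product over every "choice" parameter's values list.
from itertools import product

def generate_parameter_grid(parameters: dict) -> list:
    names = list(parameters.keys())
    value_lists = [parameters[name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

grid = generate_parameter_grid({
    "tp_size": {"type": "choice", "values": [1, 2]},
    "mem_frac": {"type": "choice", "values": [0.8, 0.9]},
})
# 2 x 2 = 4 combinations:
# [{'tp_size': 1, 'mem_frac': 0.8}, {'tp_size': 1, 'mem_frac': 0.9},
#  {'tp_size': 2, 'mem_frac': 0.8}, {'tp_size': 2, 'mem_frac': 0.9}]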
Per-Experiment Flow (run_experiment):
Step 1: Deploy InferenceService
└─> OMEController.deploy_inference_service()
└─> Creates unique InferenceService with parameters
Step 2: Wait for Ready
└─> OMEController.wait_for_ready()
└─> Polls status.conditions for Ready=True
└─> Timeout: configurable (default 600s)
Step 3: Run Benchmark
├─> If --direct mode:
│ └─> DirectBenchmarkController.run_benchmark()
│ └─> Sets up kubectl port-forward
│ └─> Runs genai-bench CLI locally
│
└─> Else (K8s mode):
└─> BenchmarkController.create_benchmark_job()
└─> BenchmarkController.wait_for_completion()
Step 4: Collect Results
└─> Extract metrics from benchmark output
└─> Calculate objective score (minimize_latency / maximize_throughput)
Cleanup: Remove experiment resources
└─> Delete InferenceService
└─> Delete BenchmarkJob (if applicable)
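The objective score computed in Step 4 is a single "lower is better" number per experiment. The sketch below illustrates one plausible way to compute and compare it; the metric key names are assumptions for illustration, not confirmed genai-bench field names.
# Hypothetical scoring helper; metric keys are illustrative only.
def objective_score(metrics: dict, objective: str) -> float:
    if objective == "minimize_latency":
        return metrics["mean_e2e_latency_ms"]          # lower latency wins
    if objective == "maximize_throughput":
        return -metrics["output_tokens_per_second"]    # negate so min() picks the highest
    raise ValueError(f"unsupported objective: {objective}")

results = [
    {"experiment_id": 1, "status": "success", "metrics": {"mean_e2e_latency_ms": 125.3}},
    {"experiment_id": 2, "status": "success", "metrics": {"mean_e2e_latency_ms": 89.2}},
]
best = min(
    (r for r in results if r["status"] == "success"),
    key=lambda r: objective_score(r["metrics"], "minimize_latency"),
)
print(best["experiment_id"])  # 2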
3.2 OME Controller: src/controllers/ome_controller.py¶
Class: OMEController
Kubernetes API Initialization:
from kubernetes import client, config
# Try kubeconfig, then in-cluster config
config.load_kube_config(config_file=kubeconfig_path)
# Or: config.load_incluster_config()
self.custom_api = client.CustomObjectsApi() # For CRDs
self.core_api = client.CoreV1Api() # For core resources
Key Methods:
deploy_inference_service()
Renders Jinja2 template with parameters
Creates namespace if needed
Posts InferenceService CRD to Kubernetes
Returns InferenceService name
CRD group: ome.io, version: v1beta1, plural: inferenceservices
wait_for_ready()
Polls the .status.conditions[].type == "Ready" field
Checks .status.conditions[].status == "True"
Logs status messages and reasons
Returns True on Ready, False on timeout
delete_inference_service()
Deletes the InferenceService CRD
Ignores 404 (already deleted)
Returns True on success
create_namespace() / get_service_url()
Helper methods for namespace management and service URL lookup
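A condensed sketch of the deploy/wait pattern these methods implement, using the kubernetes Python client; error handling, logging, and namespace creation from the real controller are omitted.
# Minimal deploy / wait-for-ready pattern against the ome.io CRDs.
import time
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() inside a pod
custom_api = client.CustomObjectsApi()

GROUP, VERSION, PLURAL = "ome.io", "v1beta1", "inferenceservices"

def deploy_inference_service(manifest: dict) -> str:
    custom_api.create_namespaced_custom_object(
        group=GROUP, version=VERSION,
        namespace=manifest["metadata"]["namespace"],
        plural=PLURAL, body=manifest,
    )
    return manifest["metadata"]["name"]

def wait_for_ready(name: str, namespace: str, timeout: int = 600) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        obj = custom_api.get_namespaced_custom_object(GROUP, VERSION, namespace, PLURAL, name)
        for cond in obj.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") == "True":
                return True
        time.sleep(10)               # poll interval from Section 6.3
    return False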
3.3 Benchmark Controllers¶
BenchmarkController (K8s Mode): src/controllers/benchmark_controller.py¶
Key Methods:
create_benchmark_job()
Renders BenchmarkJob template
Posts to Kubernetes
CRD group: ome.io, version: v1beta1, plural: benchmarkjobs
wait_for_completion()
Polls the .status.state field
Returns True when state == "Complete"
Returns False when state == "Failed" or on timeout
get_benchmark_results()
Reads .status.results from the BenchmarkJob
Returns metrics dictionary
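The completion poll follows the same shape as the readiness check; a sketch under the field names described above (the real controller adds logging and result extraction):
# Poll .status.state until Complete, Failed, or timeout.
import time
from kubernetes import client, config

config.load_kube_config()
custom_api = client.CustomObjectsApi()

def wait_for_completion(name: str, namespace: str, timeout: int = 1800) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = custom_api.get_namespaced_custom_object(
            "ome.io", "v1beta1", namespace, "benchmarkjobs", name)
        state = job.get("status", {}).get("state")
        if state == "Complete":
            return True
        if state == "Failed":
            return False
        time.sleep(15)               # poll interval from Section 6.3
    return False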
DirectBenchmarkController (CLI Mode): src/controllers/direct_benchmark_controller.py¶
Key Functionality:
setup_port_forward()
kubectl port-forward [pod|svc]/[name] 8080:8000 -n [namespace]
Finds the pod via label selector: serving.kserve.io/inferenceservice={service_name}
Falls back to the service name if no pods are found
Subprocess-based port-forward in the background
Returns the http://localhost:8080 endpoint URL
run_benchmark()
Builds the genai-bench command with all parameters
Executes: env/bin/genai-bench benchmark --api-backend openai --api-base {endpoint} ...
Captures output and parses JSON results
Cleans up the port-forward in a finally block
_parse_results()
Reads JSON result files from the benchmark output directory
Extracts metrics such as latency, throughput, TTFT, and TPOT
cleanup_results()
Removes local benchmark result directories
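A simplified sketch of the port-forward plus CLI invocation pattern described above; only the genai-bench flags quoted in this document are shown, and the sleep-based readiness wait is a placeholder for the real controller's handling.
# Port-forward the service locally, run genai-bench against it, then clean up.
import subprocess
import time

def run_direct_benchmark(service_name: str, namespace: str) -> None:
    pf = subprocess.Popen(
        ["kubectl", "port-forward", f"svc/{service_name}", "8080:8000", "-n", namespace],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    try:
        time.sleep(3)  # crude wait for the tunnel to come up
        subprocess.run(
            ["env/bin/genai-bench", "benchmark",
             "--api-backend", "openai",
             "--api-base", "http://localhost:8080"],
            check=True,
        )
    finally:
        pf.terminate()  # mirrors the finally-block cleanup described above
        pf.wait()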
4. Configuration Files¶
4.1 Task Configuration Schema¶
File: examples/simple_task.json (User-provided)
{
"task_name": "simple-tune",
"description": "Description",
"model": {
"name": "llama-3-2-1b-instruct",
"namespace": "autotuner"
},
"base_runtime": "llama-3-2-1b-instruct-rt",
"parameters": {
"tp_size": {"type": "choice", "values": [1, 2]},
"mem_frac": {"type": "choice", "values": [0.8, 0.9]},
"max_total_tokens": {"type": "choice", "values": [4096, 8192]},
"schedule_policy": {"type": "choice", "values": ["lpm", "fcfs"]}
},
"optimization": {
"strategy": "grid_search",
"objective": "minimize_latency",
"max_iterations": 4,
"timeout_per_iteration": 600
},
"benchmark": {
"task": "text-to-text",
"model_name": "llama-3-2-1b-instruct",
"model_tokenizer": "meta-llama/Llama-3.2-1B-Instruct",
"traffic_scenarios": ["D(100,100)"],
"num_concurrency": [1, 4],
"max_time_per_iteration": 10,
"max_requests_per_iteration": 50,
"additional_params": {"temperature": "0.0"}
}
}
Key Constraints:
model.name: Must exist as a ClusterBaseModel
base_runtime: Must exist as a ClusterServingRuntime
parameters: Only "choice" type supported (Cartesian product)
optimization.strategy: Only "grid_search" supported
optimization.objective: "minimize_latency" or "maximize_throughput"
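A lightweight validation sketch for these constraints, illustrative only; it is not the repository's actual validator and cannot check that the referenced ClusterBaseModel or ClusterServingRuntime exist without querying the cluster.
# Check the constraints that can be verified from the JSON alone.
import json

SUPPORTED_OBJECTIVES = {"minimize_latency", "maximize_throughput"}

def validate_task(task: dict) -> None:
    if task["optimization"]["strategy"] != "grid_search":
        raise ValueError("only grid_search is supported")
    if task["optimization"]["objective"] not in SUPPORTED_OBJECTIVES:
        raise ValueError("objective must be minimize_latency or maximize_throughput")
    for name, spec in task["parameters"].items():
        if spec.get("type") != "choice":
            raise ValueError(f"parameter {name}: only 'choice' type is supported")

with open("examples/simple_task.json") as f:
    validate_task(json.load(f))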
4.2 Example Resource Configurations¶
ClusterBaseModel Example:
File: /config/examples/clusterbasemodel-llama-3.2-1b.yaml
apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-2-1b-instruct
spec:
  vendor: meta
  version: "3.2"
  modelType: llama
  modelParameterSize: "1B"
  maxTokens: 8192
  modelCapabilities: [text-to-text]
  modelFormat:
    name: safetensors
    version: "1.0.0"
  storage:
    storageUri: hf://meta-llama/Llama-3.2-1B-Instruct
    path: /mnt/data/models/llama-3.2-1b-instruct
ClusterServingRuntime Example:
File: /config/examples/clusterservingruntime-sglang.yaml
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: llama-3-2-1b-instruct-rt
spec:
  engineConfig:
    runner:
      name: ome-container
      image: docker.io/lmsysorg/sglang:v0.5.2-cu126
      command: [python3, -m, sglang.launch_server]
      args:
        - --host=0.0.0.0
        - --port=8080
        - --model-path=/mnt/data/models/llama-3.2-1b-instruct
        - --tp-size=1
        - --enable-metrics
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
PVC Configuration:
File: /config/benchmark-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: benchmark-results-pvc
  namespace: autotuner
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
5. Deployment Entry Points¶
5.1 Primary Entry Point: run_autotuner.py¶
CLI Interface:
python src/run_autotuner.py <task_json_file> [kubeconfig_path] [--direct]
Arguments:
task_json_file - Path to task configuration JSON
kubeconfig_path - Optional: Path to kubeconfig (defaults to ~/.kube/config)
--direct - Optional: Use direct genai-bench CLI instead of K8s BenchmarkJob
Examples:
python src/run_autotuner.py examples/simple_task.json
python src/run_autotuner.py examples/simple_task.json /path/to/kubeconfig
python src/run_autotuner.py examples/simple_task.json --direct
python src/run_autotuner.py examples/simple_task.json ~/.kube/config --direct
Environment Activation:
# Must activate virtual environment first
source env/bin/activate
# Then run
python src/run_autotuner.py examples/simple_task.json --direct
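Hypothetical argument handling that matches the CLI contract above and the mode-selection snippet in Section 6.2; the actual script may use argparse instead of raw sys.argv inspection.
# Positional task file, optional positional kubeconfig, optional --direct flag.
def parse_cli(argv: list) -> tuple:
    positional = [a for a in argv[1:] if a != "--direct"]
    use_direct = "--direct" in argv
    task_file = positional[0]
    kubeconfig = positional[1] if len(positional) > 1 else None  # falls back to ~/.kube/config
    return task_file, kubeconfig, use_direct

print(parse_cli(["run_autotuner.py", "examples/simple_task.json", "--direct"]))
# ('examples/simple_task.json', None, True)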
5.2 Installation and Setup: install.sh¶
CLI Interface:
./install.sh [OPTIONS]
Options:
--skip-venv Skip virtual environment creation
--skip-k8s Skip Kubernetes resource creation
--install-ome Install OME operator automatically
--venv-path PATH Custom virtual environment path
--help Show help message
Installation Flow:
1. Verify prerequisites (Python, pip, kubectl, git)
2. Initialize git submodules (OME, genai-bench)
3. Create Python virtual environment
4. Install dependencies from requirements.txt
5. Install genai-bench in editable mode
6. Create result/benchmark_results directories
7. Create K8s namespace and PVC
8. Install OME (if --install-ome flag)
9. Verify OME installation and CRDs
Key Dependencies Installed:
kubernetes>=28.1.0
pyyaml>=6.0
jinja2>=3.1.0
genai-bench (from third_party/genai-bench)
6. Deployment Assumptions and Constraints¶
6.1 Hardcoded Assumptions¶
Kubernetes Assumptions:
Cluster version: v1.28+
Namespace for autotuner: autotuner
OME namespace: ome
Kubernetes API available via kubeconfig or in-cluster config
OME Assumptions:
OME CRDs installed: InferenceService, BenchmarkJob, ClusterBaseModel, ClusterServingRuntime
OME controller pod running in the ome namespace
KEDA and cert-manager installed as OME dependencies
Model/Runtime Assumptions:
Model stored in K8s cluster accessible path
Model path: /mnt/data/models/{model_name}
Runtime image: docker.io/lmsysorg/sglang:v0.5.2-cu126 (SGLang)
Port: 8080 (InferenceService)
Port: 8000 (SGLang server before port-forward)
Hardware Assumptions:
GPU available (nvidia.com/gpu resource)
GPU count specified by the tp_size parameter
Model fits in GPU memory with the configured mem_frac setting
Storage Assumptions:
PVC named benchmark-results-pvc exists in the autotuner namespace
Storage backend supports a 1Gi allocation
AccessMode: ReadWriteOnce
6.2 Deployment Mode Selection Logic¶
# Determined by CLI flag
if "--direct" in sys.argv:
use_direct_benchmark = True
benchmark_controller = DirectBenchmarkController()
print("Using direct genai-bench CLI execution")
else:
use_direct_benchmark = False
benchmark_controller = BenchmarkController(kubeconfig_path)
print("Using Kubernetes BenchmarkJob CRD")
Default Behavior: K8s BenchmarkJob mode (requires working Docker image and PVC)
6.3 Status Polling and Timeouts¶
InferenceService Ready Check:
Poll interval: 10 seconds
Default timeout: 600 seconds (10 minutes)
Condition checked: .status.conditions[].type == "Ready"
Status field: .status.conditions[].status (must be "True")
BenchmarkJob Completion Check:
Poll interval: 15 seconds
Default timeout: 1800 seconds (30 minutes)
Status field: .status.state
States: "Complete", "Failed", or polling continues
Configurable Per Task:
"optimization": {
"timeout_per_iteration": 600 // seconds
}
7. Results Output and Storage¶
7.1 Results Persistence¶
Location: results/{task_name}_results.json
Structure:
{
"task_name": "simple-tune",
"total_experiments": 4,
"successful_experiments": 4,
"elapsed_time": 1245.3,
"best_result": {
"experiment_id": 2,
"parameters": {"tp_size": 1, "mem_frac": 0.9},
"status": "success",
"metrics": {...},
"objective_score": 89.2
},
"all_results": [
{
"experiment_id": 1,
"parameters": {"tp_size": 1, "mem_frac": 0.8},
"status": "success",
"metrics": {...},
"objective_score": 125.3
},
...
]
}
7.2 Direct CLI Mode Results¶
Storage Location: benchmark_results/{task_name}-exp{id}/
File Pattern: genai-bench creates various output files:
*_results.json - Benchmark metrics
Various metadata and logging files
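A rough sketch of collecting metrics from one experiment's output directory; the glob follows the *_results.json naming noted above, and merging every file into a single dict is an assumption about the real _parse_results() behavior.
# Merge every *_results.json file under the experiment directory into one metrics dict.
import json
from pathlib import Path

def parse_results(output_dir: str) -> dict:
    metrics = {}
    for path in sorted(Path(output_dir).glob("*_results.json")):
        with path.open() as f:
            metrics.update(json.load(f))
    return metrics

metrics = parse_results("benchmark_results/simple-tune-exp1")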
8. Deployment Workflow Diagram¶
User Provides: examples/simple_task.json
                      │
                      ▼
┌───────────────────────────────────────────┐
│ run_autotuner.py main()                   │
│  - Parse CLI args (kubeconfig, --direct)  │
│  - Create AutotunerOrchestrator           │
└─────────────────────┬─────────────────────┘
                      │
                      ▼
            load_task(task_file)
            Parse JSON config
                      │
                      ▼
            generate_parameter_grid()
            All combinations from "choice" params
                      │
                      ▼
┌──────────────────────────────────────────────────────────────────┐
│ For each parameter set (loop):                                   │
│   run_experiment():                                              │
│     Step 1: Deploy InferenceService                              │
│     Step 2: Wait for Ready                                       │
│     Step 3: Run Benchmark (two modes)                            │
│       Direct mode: port-forward + local genai-bench CLI          │
│       K8s mode: BenchmarkJob, wait for completion, read PVC      │
│     Step 4: Collect Results                                      │
│       Parse benchmark output, calculate objective_score          │
│   Cleanup: delete InferenceService (+ BenchmarkJob in K8s mode)  │
│   Repeat for the next parameter combination                      │
└─────────────────────┬────────────────────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────┐
│ Find Best Result                         │
│  - Compare objective_scores              │
│  - Select minimum (or inverted max)      │
└─────────────────────┬────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────┐
│ Save Results                             │
│ results/{task_name}_results.json         │
└──────────────────────────────────────────┘
9. Key Files and Their Roles¶
| File | Lines | Purpose |
|---|---|---|
| src/run_autotuner.py | 305 | Main orchestrator, experiment flow control |
| src/controllers/ome_controller.py | 232 | InferenceService CRUD operations |
| src/controllers/benchmark_controller.py | 229 | BenchmarkJob CRD management (K8s mode) |
| src/controllers/direct_benchmark_controller.py | 335 | Direct genai-bench CLI execution |
| src/templates/inference_service.yaml.j2 | 36 | InferenceService manifest template |
| src/templates/benchmark_job.yaml.j2 | 41 | BenchmarkJob manifest template |
| | 67 | Parameter grid generation, objective scoring |
| install.sh | 460 | Full environment setup and validation |
| examples/simple_task.json | 37 | Example task configuration |
| config/benchmark-pvc.yaml | 12 | PersistentVolumeClaim for K8s mode |
| requirements.txt | 4 | Python dependencies |
| | 705 | Complete user documentation |
10. Deployment Scenarios¶
Scenario 1: Direct CLI Mode (Recommended)¶
prerequisites:
✓ Kubernetes cluster with OME
✓ OME InferenceService CRD
✓ genai-bench CLI installed locally
✓ kubectl port-forward capability
flow:
1. Deploy InferenceService (OME)
2. Wait for pod ready
3. kubectl port-forward to expose service
4. Run genai-bench CLI → http://localhost:8080
5. Parse JSON results from local filesystem
6. Clean up port-forward
7. Delete InferenceService
Scenario 2: Kubernetes BenchmarkJob Mode¶
prerequisites:
✓ Kubernetes cluster with OME
✓ OME BenchmarkJob CRD
✓ PersistentVolumeClaim for results
✓ Working genai-bench Docker image
✓ Kubernetes RBAC permissions
flow:
1. Deploy InferenceService (OME)
2. Wait for pod ready
3. Create BenchmarkJob CRD
4. Wait for BenchmarkJob to complete
5. Read results from PVC/BenchmarkJob status
6. Delete BenchmarkJob
7. Delete InferenceService
11. Summary of Deployment Mechanisms¶
Current Deployment Stack:¶
Orchestration: Python (run_autotuner.py)
Deployment Engine: Kubernetes + OME CRDs
Model Serving: SGLang (via OME InferenceService)
Benchmarking: genai-bench (CLI or K8s Pod)
Configuration: JSON task files, YAML templates
Storage: Kubernetes PersistentVolumeClaim
Deployment Triggers:¶
Manual: python src/run_autotuner.py <task.json> [--direct]
Installation: ./install.sh [--install-ome]
No automated triggers (deployment is manual or scheduled externally)
Infrastructure Code:¶
Kubernetes API calls (via kubernetes-client Python library)
kubectl port-forward (subprocess-based)
Helm charts for OME installation (shell script)
No Docker build or Docker Compose
Resource Lifecycle:¶
Created: InferenceService per experiment, BenchmarkJob per experiment (K8s mode)
Managed: Namespace autotuner, PVC benchmark-results-pvc
Deleted: Each resource cleaned up after its experiment completes
12. Hardcoded Configuration Values¶
| Setting | Value | Location | Modifiable |
|---|---|---|---|
| OME CRD Group | ome.io | controllers/*.py | Code change |
| OME CRD Version | v1beta1 | controllers/*.py | Code change |
| Namespace | autotuner | install.sh | Yes (shell var) |
| OME Namespace | ome | install.sh | Code change |
| PVC Name | benchmark-results-pvc | benchmark_job.yaml.j2 | Code change |
| SGLang Port | 8080 | multiple files | Code change |
| Port Forward Remote | 8000 | direct_benchmark_controller.py | Code change |
| Port Forward Local | 8080 | direct_benchmark_controller.py | Code change |
| Poll Interval (ISVC) | 10s | ome_controller.py | Code change |
| Poll Interval (Benchmark) | 15s | benchmark_controller.py | Code change |
| Default Timeout | 600s, 1800s | run_autotuner.py | Task JSON |
| Genai-bench Image | kllambda/genai-bench:v251014 | benchmark_job.yaml.j2 | Code change |
| SGLang Image | docker.io/lmsysorg/sglang:v0.5.2-cu126 | config examples | YAML |