Autotuner: Comprehensive Deployment Architecture Analysis

Executive Summary

The autotuner is a prototype LLM inference parameter optimization system that uses Kubernetes-native deployment through the OME (Open Model Engine) framework. It supports two benchmark execution modes and dynamically deploys InferenceServices with different parameter configurations for automated tuning.


1. Current Deployment Architecture

1.1 Overall Architecture Pattern

     ┌─────────────────────────────────────────────────────────────────┐
     │                       Autotuner (Python)                        │
     │               src/run_autotuner.py (Orchestrator)               │
     └─────────────────────────────────────────────────────────────────┘
                                      │
                   ┌──────────────────┴──────────────────┐
                   │                                     │
           ┌───────▼───────┐                  ┌──────────▼──────────┐
           │ OMEController │                  │ BenchmarkController │
           │ (K8s Native)  │                  │     (Two Modes)     │
           └───────┬───────┘                  └──────────┬──────────┘
                   │                                     │
         ┌─────────┴──────────┐                 ┌────────┴─────────┐
         │                    │                 │                  │
┌────────▼─────────┐  ┌───────▼───────┐  ┌──────▼───────┐  ┌───────▼────────┐
│ InferenceService │  │ Model/Runtime │  │ BenchmarkJob │  │   Direct CLI   │
│   (Custom CRD)   │  │    (CRDs)     │  │    (CRD)     │  │ (Port Forward) │
└────────┬─────────┘  └───────┬───────┘  └──────┬───────┘  └───────┬────────┘
         │                    │                 │                  │
         └────────────────────┴───────┬─────────┴──────────────────┘
                                      │
                              ┌───────▼───────┐
                              │  Kubernetes   │
                              │ (K8s Cluster) │
                              └───────┬───────┘
                                      │
                        ┌─────────────┴─────────────┐
                        │                           │
                ┌───────▼────────┐           ┌──────▼───────┐
                │ SGLang Runtime │           │ Genai-Bench  │
                │      Pods      │           │ Benchmarking │
                └────────────────┘           └──────────────┘

1.2 Deployment Modes: Two Benchmark Execution Pathways

Mode 1: Kubernetes BenchmarkJob (Default)

  • Uses OME’s BenchmarkJob Custom Resource Definition

  • Runs genai-bench inside Kubernetes pods

  • Results stored in PersistentVolumeClaim (PVC)

  • Command: python src/run_autotuner.py examples/simple_task.json

Mode 2: Direct genai-bench CLI (--direct)

  • Sets up kubectl port-forward to the InferenceService

  • Runs the genai-bench CLI locally from the project virtual environment

  • Results stored in the local benchmark_results/ directory

  • Command: python src/run_autotuner.py examples/simple_task.json --direct


2. Key Deployment Components

2.1 OME (Open Model Engine) - The Core Infrastructure

OME is a required prerequisite that provides:

Custom Resource Definitions (CRDs):

  • InferenceService.ome.io/v1beta1 - Model serving endpoints

  • BenchmarkJob.ome.io/v1beta1 - Performance testing jobs

  • ClusterBaseModel.ome.io/v1beta1 - Model metadata and storage

  • ClusterServingRuntime.ome.io/v1beta1 - Runtime configurations

Supporting Components:

  • OME Controller Manager (Kubernetes Deployment)

  • Model Agent DaemonSet (model distribution)

  • RBAC configurations

  • Webhooks (validation and mutation)

Installation:

  • Via Helm: oci://ghcr.io/moirai-internal/charts/ome-crd and charts/ome-resources

  • Dependencies: cert-manager, KEDA

  • Namespace: ome

2.2 InferenceService Deployment

What It Is:

  • OME custom resource that wraps SGLang inference engines

  • Dynamically created per experiment with unique parameter configurations

Template-Based Generation:

  • File: /src/templates/inference_service.yaml.j2 (Jinja2 template)

  • Contains:

    • Namespace declaration

    • InferenceService metadata with experiment labels

    • Model specification

    • SGLang engine configuration with tunable parameters

    • GPU resource requests

Template Variables:

{{ namespace }}          - K8s namespace (e.g., "autotuner")
{{ isvc_name }}         - Unique service name (e.g., "simple-tune-exp1")
{{ task_name }}         - Task identifier (e.g., "simple-tune")
{{ experiment_id }}     - Experiment number (1, 2, 3, ...)
{{ model_name }}        - Base model (e.g., "llama-3-2-1b-instruct")
{{ runtime_name }}      - Runtime config (e.g., "llama-3-2-1b-instruct-rt")
{{ tp_size }}           - Tensor parallelism (GPU count)
{{ mem_frac }}          - Memory fraction (0.6-0.95)
{{ max_total_tokens }}  - Optional token limit
{{ schedule_policy }}   - Optional scheduling (lpm, random, fcfs)

SGLang Container Args:

- --host=0.0.0.0
- --port=8080
- --model-path=/mnt/data/models/{model_name}
- --tp-size={tp_size}
- --mem-frac={mem_frac}

GPU Resources:

limits:
  nvidia.com/gpu: {{ tp_size }}
requests:
  nvidia.com/gpu: {{ tp_size }}
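
The controller renders this template into a manifest before posting it to the cluster. Below is a minimal sketch of that rendering step, assuming the template path and variable names listed above; the helper name and exact call structure in ome_controller.py are illustrative assumptions, not the repository's actual code.

# Sketch: render the InferenceService template with experiment parameters.
import yaml
from jinja2 import Environment, FileSystemLoader

def render_inference_service(params: dict, experiment_id: int) -> dict:
    env = Environment(loader=FileSystemLoader("src/templates"))
    template = env.get_template("inference_service.yaml.j2")
    rendered = template.render(
        namespace="autotuner",
        isvc_name=f"simple-tune-exp{experiment_id}",
        task_name="simple-tune",
        experiment_id=experiment_id,
        model_name="llama-3-2-1b-instruct",
        runtime_name="llama-3-2-1b-instruct-rt",
        tp_size=params["tp_size"],
        mem_frac=params["mem_frac"],
        max_total_tokens=params.get("max_total_tokens"),
        schedule_policy=params.get("schedule_policy"),
    )
    # Parse the rendered YAML into the dict that CustomObjectsApi expects.
    return yaml.safe_load(rendered)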

2.3 BenchmarkJob Deployment

When Used (K8s Mode Only):

  • After InferenceService becomes ready

  • For performance evaluation

Template-Based Generation:

  • File: /src/templates/benchmark_job.yaml.j2

  • Unique per experiment

Key Configuration:

podOverride:
  image: "kllambda/genai-bench:v251014"
endpoint:
  url: "http://{isvc_name}.{namespace}.svc.cluster.local"
  apiFormat: "openai"
outputLocation:
  storageUri: "pvc://benchmark-results-pvc/{benchmark_name}"

2.4 Persistent Storage

PersistentVolumeClaim:

  • File: /config/benchmark-pvc.yaml

  • Name: benchmark-results-pvc

  • Namespace: autotuner

  • Size: 1Gi

  • Access: ReadWriteOnce

  • Purpose: Store benchmark results from BenchmarkJobs


3. Deployment Logic Implementation

3.1 Main Orchestrator: src/run_autotuner.py

Class: AutotunerOrchestrator

Constructor (__init__):

def __init__(self, kubeconfig_path: str = None, use_direct_benchmark: bool = False)

  • Initializes OMEController for InferenceService management

  • Selects benchmark controller based on mode:

    • DirectBenchmarkController (--direct flag)

    • BenchmarkController (default)

  • Prints active mode to console

Main Execution Flow (run_task):

  1. Load JSON task configuration

  2. Generate parameter grid (Cartesian product of all parameter combinations; see the sketch after this list)

  3. For each parameter combination:

    • Call run_experiment() with parameters

    • Append results to list

  4. Find best result by objective score

  5. Save summary to results/{task_name}_results.json
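
Step 2 is a plain Cartesian product over the "choice" values. A minimal sketch of generate_parameter_grid() as described, assuming the parameter schema from section 4.1 (the actual helper in src/utils/optimizer.py may be organized differently):

# Sketch: expand "choice" parameters into the full Cartesian-product grid.
import itertools

def generate_parameter_grid(parameters):
    names = list(parameters.keys())
    value_lists = [parameters[name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]

# Example: 2 tp_size values x 2 mem_frac values -> 4 combinations
grid = generate_parameter_grid({
    "tp_size": {"type": "choice", "values": [1, 2]},
    "mem_frac": {"type": "choice", "values": [0.8, 0.9]},
})
# [{'tp_size': 1, 'mem_frac': 0.8}, {'tp_size': 1, 'mem_frac': 0.9},
#  {'tp_size': 2, 'mem_frac': 0.8}, {'tp_size': 2, 'mem_frac': 0.9}]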

Per-Experiment Flow (run_experiment):

Step 1: Deploy InferenceService
  └─> OMEController.deploy_inference_service()
  └─> Creates unique InferenceService with parameters

Step 2: Wait for Ready
  └─> OMEController.wait_for_ready()
  └─> Polls status.conditions for Ready=True
  └─> Timeout: configurable (default 600s)

Step 3: Run Benchmark
  ├─> If --direct mode:
  │   └─> DirectBenchmarkController.run_benchmark()
  │       └─> Sets up kubectl port-forward
  │       └─> Runs genai-bench CLI locally
  │
  └─> Else (K8s mode):
      └─> BenchmarkController.create_benchmark_job()
      └─> BenchmarkController.wait_for_completion()

Step 4: Collect Results
  └─> Extract metrics from benchmark output
  └─> Calculate objective score (minimize_latency / maximize_throughput)

Cleanup: Remove experiment resources
  └─> Delete InferenceService
  └─> Delete BenchmarkJob (if applicable)
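
Put together, the per-experiment flow looks roughly like the following. This is a sketch only: the controller method signatures mirror the descriptions above but are illustrative, and the metric keys used for scoring ("latency", "throughput") are assumptions, not confirmed names from the benchmark output.

# Sketch: one experiment end to end, with cleanup guaranteed by finally.
def run_experiment(ome, benchmark, task, params, experiment_id, timeout=600):
    isvc_name = ome.deploy_inference_service(task, params, experiment_id)   # Step 1
    try:
        if not ome.wait_for_ready(isvc_name, timeout=timeout):              # Step 2
            return {"experiment_id": experiment_id, "parameters": params,
                    "status": "failed"}
        metrics = benchmark.run_benchmark(isvc_name, task, experiment_id)   # Step 3
        objective = task["optimization"]["objective"]                       # Step 4
        if objective == "maximize_throughput":
            # Invert so that a lower objective_score is always better.
            score = -metrics.get("throughput", 0.0)
        else:  # minimize_latency
            score = metrics.get("latency", float("inf"))
        return {"experiment_id": experiment_id, "parameters": params,
                "status": "success", "metrics": metrics,
                "objective_score": score}
    finally:
        ome.delete_inference_service(isvc_name)                             # Cleanup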

3.2 OME Controller: src/controllers/ome_controller.py

Class: OMEController

Kubernetes API Initialization:

from kubernetes import client, config

# Try kubeconfig, then in-cluster config
config.load_kube_config(config_file=kubeconfig_path)
# Or: config.load_incluster_config()

self.custom_api = client.CustomObjectsApi()  # For CRDs
self.core_api = client.CoreV1Api()            # For core resources

Key Methods:

  1. deploy_inference_service()

    • Renders Jinja2 template with parameters

    • Creates namespace if needed

    • Posts InferenceService CRD to Kubernetes

    • Returns InferenceService name

    • CRD group: ome.io, version: v1beta1, plural: inferenceservices

  2. wait_for_ready()

    • Polls .status.conditions[] for an entry with type == "Ready" (see the sketch after this list)

    • Checks .status.conditions[].status == "True"

    • Logs status messages and reasons

    • Returns True on Ready, False on timeout

  3. delete_inference_service()

    • Deletes InferenceService CRD

    • Ignores 404 (already deleted)

    • Returns True on success

  4. create_namespace() / get_service_url()

    • Helper methods for namespace management
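
A minimal sketch of the Ready polling described in wait_for_ready() above, using the CRD coordinates listed in item 1 (group ome.io, version v1beta1, plural inferenceservices); the function signature is illustrative.

# Sketch: poll the InferenceService until a Ready=True condition appears.
import time
from kubernetes import client

def wait_for_ready(custom_api: client.CustomObjectsApi, name: str,
                   namespace: str = "autotuner",
                   timeout: int = 600, poll_interval: int = 10) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        obj = custom_api.get_namespaced_custom_object(
            group="ome.io", version="v1beta1", namespace=namespace,
            plural="inferenceservices", name=name)
        for cond in obj.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready" and cond.get("status") == "True":
                return True
        time.sleep(poll_interval)
    return False  # timed out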

3.3 Benchmark Controllers

BenchmarkController (K8s Mode): src/controllers/benchmark_controller.py

Key Methods:

  1. create_benchmark_job()

    • Renders BenchmarkJob template

    • Posts to Kubernetes

    • CRD group: ome.io, version: v1beta1, plural: benchmarkjobs

  2. wait_for_completion()

    • Polls .status.state field

    • Returns True when state == "Complete"

    • Returns False when state == "Failed" or on timeout

  3. get_benchmark_results()

    • Reads .status.results from BenchmarkJob

    • Returns metrics dictionary
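
A minimal sketch of the two K8s-mode steps above, using the benchmarkjobs CRD coordinates from item 1; the function signatures and defaults are illustrative.

# Sketch: create a BenchmarkJob from a rendered manifest, then poll its state.
import time
from kubernetes import client

def create_benchmark_job(custom_api: client.CustomObjectsApi,
                         manifest: dict, namespace: str = "autotuner") -> str:
    created = custom_api.create_namespaced_custom_object(
        group="ome.io", version="v1beta1", namespace=namespace,
        plural="benchmarkjobs", body=manifest)
    return created["metadata"]["name"]

def wait_for_completion(custom_api: client.CustomObjectsApi, name: str,
                        namespace: str = "autotuner",
                        timeout: int = 1800, poll_interval: int = 15) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        obj = custom_api.get_namespaced_custom_object(
            group="ome.io", version="v1beta1", namespace=namespace,
            plural="benchmarkjobs", name=name)
        state = obj.get("status", {}).get("state")
        if state == "Complete":
            return True
        if state == "Failed":
            return False
        time.sleep(poll_interval)
    return False  # timed out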

DirectBenchmarkController (CLI Mode): src/controllers/direct_benchmark_controller.py

Key Functionality:

  1. setup_port_forward()

    kubectl port-forward [pod|svc]/[name] 8080:8000 -n [namespace]
    
    • Finds pod via label selector: serving.kserve.io/inferenceservice={service_name}

    • Falls back to service name if no pods found

    • Subprocess-based port-forward in background

    • Returns http://localhost:8080 endpoint URL

  2. run_benchmark()

    • Builds genai-bench command with all parameters

    • Executes: env/bin/genai-bench benchmark --api-backend openai --api-base {endpoint} ...

    • Captures output and parses JSON results

    • Cleanup port-forward in finally block

  3. _parse_results()

    • Reads JSON result files from benchmark output directory

    • Extracts metrics like latency, throughput, TTFT, TPOT

  4. cleanup_results()

    • Removes local benchmark result directories
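
A minimal sketch of the port-forward plus local CLI flow described above. It forwards directly to the service (the fallback path), uses only the genai-bench flags named in this document, and substitutes an illustrative fixed sleep for a real readiness check.

# Sketch: background kubectl port-forward, then run genai-bench locally.
import subprocess
import time

def run_direct_benchmark(service_name: str, namespace: str = "autotuner"):
    # Forward local 8080 to remote 8000, as in the command shown above.
    pf = subprocess.Popen(
        ["kubectl", "port-forward", f"svc/{service_name}", "8080:8000",
         "-n", namespace],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    time.sleep(3)  # crude wait for the tunnel to come up
    try:
        # Remaining genai-bench options (scenarios, concurrency, limits)
        # would be appended from the task JSON's "benchmark" section.
        subprocess.run(
            ["env/bin/genai-bench", "benchmark",
             "--api-backend", "openai",
             "--api-base", "http://localhost:8080"],
            check=True)
    finally:
        pf.terminate()  # always tear the port-forward down
        pf.wait()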


4. Configuration Files

4.1 Task Configuration Schema

File: examples/simple_task.json (User-provided)

{
  "task_name": "simple-tune",
  "description": "Description",
  "model": {
    "name": "llama-3-2-1b-instruct",
    "namespace": "autotuner"
  },
  "base_runtime": "llama-3-2-1b-instruct-rt",
  "parameters": {
    "tp_size": {"type": "choice", "values": [1, 2]},
    "mem_frac": {"type": "choice", "values": [0.8, 0.9]},
    "max_total_tokens": {"type": "choice", "values": [4096, 8192]},
    "schedule_policy": {"type": "choice", "values": ["lpm", "fcfs"]}
  },
  "optimization": {
    "strategy": "grid_search",
    "objective": "minimize_latency",
    "max_iterations": 4,
    "timeout_per_iteration": 600
  },
  "benchmark": {
    "task": "text-to-text",
    "model_name": "llama-3-2-1b-instruct",
    "model_tokenizer": "meta-llama/Llama-3.2-1B-Instruct",
    "traffic_scenarios": ["D(100,100)"],
    "num_concurrency": [1, 4],
    "max_time_per_iteration": 10,
    "max_requests_per_iteration": 50,
    "additional_params": {"temperature": "0.0"}
  }
}

Key Constraints:

  • model.name: Must exist as a ClusterBaseModel

  • base_runtime: Must exist as a ClusterServingRuntime

  • parameters: Only "choice" type supported (Cartesian product)

  • optimization.strategy: Only "grid_search" supported

  • optimization.objective: "minimize_latency" or "maximize_throughput"
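
A small sketch of a loader that enforces these constraints before any cluster resources are created; the helper name and error messages are illustrative, not necessarily what run_autotuner.py prints.

# Sketch: load the task JSON and enforce the constraints listed above.
import json

def load_task(path: str) -> dict:
    with open(path) as f:
        task = json.load(f)
    if task["optimization"]["strategy"] != "grid_search":
        raise ValueError("only grid_search is supported")
    if task["optimization"]["objective"] not in (
            "minimize_latency", "maximize_throughput"):
        raise ValueError("objective must be minimize_latency or maximize_throughput")
    for name, spec in task["parameters"].items():
        if spec.get("type") != "choice":
            raise ValueError(f"parameter {name!r}: only 'choice' type is supported")
    return task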

4.2 Example Resource Configurations

ClusterBaseModel Example: File: /config/examples/clusterbasemodel-llama-3.2-1b.yaml

apiVersion: ome.io/v1beta1
kind: ClusterBaseModel
metadata:
  name: llama-3-2-1b-instruct
spec:
  vendor: meta
  version: "3.2"
  modelType: llama
  modelParameterSize: "1B"
  maxTokens: 8192
  modelCapabilities: [text-to-text]
  modelFormat:
    name: safetensors
    version: "1.0.0"
  storage:
    storageUri: hf://meta-llama/Llama-3.2-1B-Instruct
    path: /mnt/data/models/llama-3.2-1b-instruct

ClusterServingRuntime Example: File: /config/examples/clusterservingruntime-sglang.yaml

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: llama-3-2-1b-instruct-rt
spec:
  engineConfig:
    runner:
      name: ome-container
      image: docker.io/lmsysorg/sglang:v0.5.2-cu126
      command: [python3, -m, sglang.launch_server]
      args:
        - --host=0.0.0.0
        - --port=8080
        - --model-path=/mnt/data/models/llama-3.2-1b-instruct
        - --tp-size=1
        - --enable-metrics
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1

PVC Configuration: File: /config/benchmark-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: benchmark-results-pvc
  namespace: autotuner
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi

5. Deployment Entry Points

5.1 Primary Entry Point: run_autotuner.py

CLI Interface:

python src/run_autotuner.py <task_json_file> [kubeconfig_path] [--direct]

Arguments:
  task_json_file     - Path to task configuration JSON
  kubeconfig_path    - Optional: Path to kubeconfig (defaults to ~/.kube/config)
  --direct           - Optional: Use direct genai-bench CLI instead of K8s BenchmarkJob

Examples:
  python src/run_autotuner.py examples/simple_task.json
  python src/run_autotuner.py examples/simple_task.json /path/to/kubeconfig
  python src/run_autotuner.py examples/simple_task.json --direct
  python src/run_autotuner.py examples/simple_task.json ~/.kube/config --direct

Environment Activation:

# Must activate virtual environment first
source env/bin/activate

# Then run
python src/run_autotuner.py examples/simple_task.json --direct

5.2 Installation and Setup: install.sh

CLI Interface:

./install.sh [OPTIONS]

Options:
  --skip-venv         Skip virtual environment creation
  --skip-k8s          Skip Kubernetes resource creation
  --install-ome       Install OME operator automatically
  --venv-path PATH    Custom virtual environment path
  --help              Show help message

Installation Flow:
1. Verify prerequisites (Python, pip, kubectl, git)
2. Initialize git submodules (OME, genai-bench)
3. Create Python virtual environment
4. Install dependencies from requirements.txt
5. Install genai-bench in editable mode
6. Create results/ and benchmark_results/ directories
7. Create K8s namespace and PVC
8. Install OME (if --install-ome flag)
9. Verify OME installation and CRDs

Key Dependencies Installed:

kubernetes>=28.1.0
pyyaml>=6.0
jinja2>=3.1.0
genai-bench (from third_party/genai-bench)

6. Deployment Assumptions and Constraints

6.1 Hardcoded Assumptions

Kubernetes Assumptions:

  • Cluster version: v1.28+

  • Namespace for autotuner: autotuner

  • OME namespace: ome

  • Kubernetes API available via kubeconfig or in-cluster config

OME Assumptions:

  • OME CRDs installed: InferenceService, BenchmarkJob, ClusterBaseModel, ClusterServingRuntime

  • OME controller pod running in ome namespace

  • KEDA and cert-manager installed as OME dependencies

Model/Runtime Assumptions:

  • Model stored in K8s cluster accessible path

  • Model path: /mnt/data/models/{model_name}

  • Runtime image: docker.io/lmsysorg/sglang:v0.5.2-cu126 (SGLang)

  • Port: 8080 (InferenceService)

  • Port: 8000 (SGLang server before port-forward)

Hardware Assumptions:

  • GPU available (nvidia.com/gpu resource)

  • GPU count specified by tp_size parameter

  • Model can fit in GPU memory with mem_frac setting

Storage Assumptions:

  • PVC named benchmark-results-pvc exists in autotuner namespace

  • Storage backend supports 1Gi allocation

  • AccessMode: ReadWriteOnce

6.2 Deployment Mode Selection Logic

# Determined by CLI flag
if "--direct" in sys.argv:
    use_direct_benchmark = True
    benchmark_controller = DirectBenchmarkController()
    print("Using direct genai-bench CLI execution")
else:
    use_direct_benchmark = False
    benchmark_controller = BenchmarkController(kubeconfig_path)
    print("Using Kubernetes BenchmarkJob CRD")

Default Behavior: K8s BenchmarkJob mode (requires working Docker image and PVC)

6.3 Status Polling and Timeouts

InferenceService Ready Check:

  • Poll interval: 10 seconds

  • Default timeout: 600 seconds (10 minutes)

  • Condition checked: .status.conditions[].type == "Ready"

  • Status field: .status.conditions[].status (must be "True")

BenchmarkJob Completion Check:

  • Poll interval: 15 seconds

  • Default timeout: 1800 seconds (30 minutes)

  • Status field: .status.state

  • States: "Complete", "Failed", or polling continues

Configurable Per Task:

"optimization": {
  "timeout_per_iteration": 600  // seconds
}

7. Results Output and Storage

7.1 Results Persistence

Location: results/{task_name}_results.json

Structure:

{
  "task_name": "simple-tune",
  "total_experiments": 4,
  "successful_experiments": 4,
  "elapsed_time": 1245.3,
  "best_result": {
    "experiment_id": 2,
    "parameters": {"tp_size": 1, "mem_frac": 0.9},
    "status": "success",
    "metrics": {...},
    "objective_score": 89.2
  },
  "all_results": [
    {
      "experiment_id": 1,
      "parameters": {"tp_size": 1, "mem_frac": 0.8},
      "status": "success",
      "metrics": {...},
      "objective_score": 125.3
    },
    ...
  ]
}
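
How best_result might be selected and the summary written, assuming the convention from the workflow diagram that a lower objective_score is better (maximize_throughput objectives are inverted before comparison); the helper names are illustrative.

# Sketch: pick the best successful experiment and write the summary file.
import json
from pathlib import Path

def find_best_result(all_results):
    successful = [r for r in all_results if r.get("status") == "success"]
    # Lower objective_score wins ("select minimum, or inverted max").
    return min(successful, key=lambda r: r["objective_score"]) if successful else None

def save_summary(task_name, all_results, elapsed_time):
    summary = {
        "task_name": task_name,
        "total_experiments": len(all_results),
        "successful_experiments": sum(r.get("status") == "success" for r in all_results),
        "elapsed_time": elapsed_time,
        "best_result": find_best_result(all_results),
        "all_results": all_results,
    }
    out = Path("results") / f"{task_name}_results.json"
    out.parent.mkdir(exist_ok=True)
    out.write_text(json.dumps(summary, indent=2))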

7.2 Direct CLI Mode Results

Storage Location: benchmark_results/{task_name}-exp{id}/

File Pattern: Genai-bench creates various output files:

  • *_results.json - Benchmark metrics

  • Various metadata and logging files
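
A minimal sketch of how the *_results.json files for one experiment might be folded into a metrics dictionary; the real _parse_results() may extract specific keys rather than merging everything.

# Sketch: gather metrics from genai-bench result files for one experiment.
import glob
import json
import os

def parse_direct_results(task_name: str, experiment_id: int) -> dict:
    result_dir = f"benchmark_results/{task_name}-exp{experiment_id}"
    metrics = {}
    for path in sorted(glob.glob(os.path.join(result_dir, "*_results.json"))):
        with open(path) as f:
            data = json.load(f)
        if isinstance(data, dict):
            # Merge metric keys (latency, throughput, TTFT, TPOT, ...).
            metrics.update(data)
    return metrics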


8. Deployment Workflow Diagram

User Provides:
  examples/simple_task.json
           │
           ▼
┌─────────────────────────────────────────┐
│ run_autotuner.py main()                 │
│ - Parse CLI args (kubeconfig, --direct) │
│ - Create AutotunerOrchestrator          │
└──────────────────┬──────────────────────┘
                   │
                   ▼
         load_task(task_file)
           Parse JSON config
                   │
                   ▼
      generate_parameter_grid()
      All combinations from "choice" params
                   │
    ┌──────────────┴──────────────┐
    │ For each parameter set:     │
    │                             │
    ▼                             ▼
┌────────────────────┐    ┌────────────────┐
│ run_experiment()   │───▶│ Step 1:        │
│                    │    │ Deploy         │
│                    │    │InferenceService│
└────────────────────┘    └────────┬───────┘
         ▲                         │
         │         ┌───────────────┘
         │         │
         │         ▼
         │    ┌─────────────────┐
         │    │ Step 2:         │
         │    │ Wait for Ready  │
         │    └────────┬────────┘
         │             │
         │             ▼
         │    ┌──────────────────────┐
         │    │ Step 3:              │
         │    │ Run Benchmark        │
         │    │ (Two modes)          │
         │    └─────────┬────────────┘
         │              │
         │    ┌─────────┴─────────┐
         │    │                   │
         │ ┌──▼──────────┐  ┌─────▼──────────┐
         │ │Direct Mode: │  │K8s Mode:       │
         │ │- Port Fwd   │  │- BenchmarkJob  │
         │ │- Local CLI  │  │- Wait complete │
         │ └─────────────┘  │- Read PVC      │
         │                  └────────────────┘
         │                          │
         ▼                          ▼
    ┌────────────────────────────────────┐
    │ Step 4: Collect Results            │
    │ - Parse benchmark output           │
    │ - Calculate objective_score        │
    └────────────────┬───────────────────┘
                     │
                     ▼
    ┌────────────────────────────────────┐
    │ Cleanup: Delete InferenceService   │
│        Delete BenchmarkJob (if K8s)│
    └────────────────┬───────────────────┘
                     │
                     └─────────────────┐
                                       │ Repeat for
                                       │ next parameter
                                       │ combination
                                       │
                                    (Loop continues)
                                       │
                                       ▼
    ┌────────────────────────────────────┐
    │ Find Best Result                   │
    │ - Compare objective_scores         │
    │ - Select minimum (or inverted max) │
    └────────────────┬───────────────────┘
                     │
                     ▼
    ┌────────────────────────────────────┐
    │ Save Results                       │
    │ results/{task_name}_results.json   │
    └────────────────────────────────────┘

9. Key Files and Their Roles

File                                              Lines  Purpose
src/run_autotuner.py                                305  Main orchestrator, experiment flow control
src/controllers/ome_controller.py                   232  InferenceService CRUD operations
src/controllers/benchmark_controller.py             229  BenchmarkJob CRD management (K8s mode)
src/controllers/direct_benchmark_controller.py      335  Direct genai-bench CLI execution
src/templates/inference_service.yaml.j2              36  InferenceService manifest template
src/templates/benchmark_job.yaml.j2                  41  BenchmarkJob manifest template
src/utils/optimizer.py                               67  Parameter grid generation, objective scoring
install.sh                                          460  Full environment setup and validation
examples/simple_task.json                            37  Example task configuration
config/benchmark-pvc.yaml                            12  PersistentVolumeClaim for K8s mode
requirements.txt                                      4  Python dependencies
README.md                                           705  Complete user documentation


10. Deployment Scenarios

Scenario 1: Direct CLI Mode (--direct)

prerequisites:
  ✓ Kubernetes cluster with OME
  ✓ Local virtual environment with genai-bench installed
  ✓ kubectl available for port-forwarding
  ✓ Kubernetes RBAC permissions

flow:
  1. Deploy InferenceService (OME)
  2. Wait for pod ready
  3. Set up kubectl port-forward (local port 8080)
  4. Run genai-bench CLI locally
  5. Parse local JSON results
  6. Tear down port-forward
  7. Delete InferenceService

Scenario 2: Kubernetes BenchmarkJob Mode

prerequisites:
  ✓ Kubernetes cluster with OME
  ✓ OME BenchmarkJob CRD
  ✓ PersistentVolumeClaim for results
  ✓ Working genai-bench Docker image
  ✓ Kubernetes RBAC permissions

flow:
  1. Deploy InferenceService (OME)
  2. Wait for pod ready
  3. Create BenchmarkJob CRD
  4. Wait for BenchmarkJob to complete
  5. Read results from PVC/BenchmarkJob status
  6. Delete BenchmarkJob
  7. Delete InferenceService

11. Summary of Deployment Mechanisms

Current Deployment Stack:

  1. Orchestration: Python (run_autotuner.py)

  2. Deployment Engine: Kubernetes + OME CRDs

  3. Model Serving: SGLang (via OME InferenceService)

  4. Benchmarking: genai-bench (CLI or K8s Pod)

  5. Configuration: JSON task files, YAML templates

  6. Storage: Kubernetes PersistentVolumeClaim

Deployment Triggers:

  1. Manual: python src/run_autotuner.py <task.json> [--direct]

  2. Installation: ./install.sh [--install-ome]

  3. No automated triggers (deployment is manual or scheduled externally)

Infrastructure Code:

  • Kubernetes API calls (via kubernetes-client Python library)

  • kubectl port-forward (subprocess-based)

  • Helm charts for OME installation (shell script)

  • No Docker build or Docker Compose

Resource Lifecycle:

  • Created: InferenceService per experiment, BenchmarkJob per experiment (K8s mode)

  • Managed: Namespace autotuner, PVC benchmark-results-pvc

  • Deleted: Each resource cleaned up after experiment completes


12. Hardcoded Configuration Values

Setting                    Value                                    Location                        Modifiable
OME CRD Group              ome.io                                   controllers/*.py                Code change
OME CRD Version            v1beta1                                  controllers/*.py                Code change
Namespace                  autotuner                                install.sh                      Yes (shell var)
OME Namespace              ome                                      install.sh                      Code change
PVC Name                   benchmark-results-pvc                    benchmark_job.yaml.j2           Code change
SGLang Port                8080                                     multiple files                  Code change
Port Forward Remote        8000                                     direct_benchmark_controller.py  Code change
Port Forward Local         8080                                     direct_benchmark_controller.py  Code change
Poll Interval (ISVC)       10s                                      ome_controller.py               Code change
Poll Interval (Benchmark)  15s                                      benchmark_controller.py         Code change
Default Timeout            600s, 1800s                              run_autotuner.py                Task JSON
Genai-bench Image          kllambda/genai-bench:v251014             benchmark_job.yaml.j2           Code change
SGLang Image               docker.io/lmsysorg/sglang:v0.5.2-cu126   config examples                 YAML