API Reference¶
Auto-generated API documentation from Python docstrings.
Core Components¶
Orchestrator¶
Autotuner Orchestrator
Main orchestration logic for running parameter tuning experiments. Coordinates deployment controllers and benchmark execution.
- class orchestrator.AutotunerOrchestrator[source]¶
  Bases: object
  Main orchestrator for the autotuning process.
- __init__(deployment_mode='ome', kubeconfig_path=None, use_direct_benchmark=False, docker_model_path='/mnt/data/models', verbose=False, http_proxy='', https_proxy='', no_proxy='', hf_token='')[source]¶
Initialize the orchestrator.
- Parameters:
  - deployment_mode (str) – Deployment mode: 'ome' (Kubernetes), 'docker' (standalone), or 'local' (subprocess)
  - kubeconfig_path (str) – Path to kubeconfig file (for OME mode)
  - use_direct_benchmark (bool) – If True, use direct genai-bench CLI instead of K8s BenchmarkJob
  - docker_model_path (str) – Base path for models in Docker/Local mode
  - verbose (bool) – If True, stream genai-bench output in real-time
  - http_proxy (str) – HTTP proxy URL for containers (optional)
  - https_proxy (str) – HTTPS proxy URL for containers (optional)
  - no_proxy (str) – Comma-separated list of hosts to bypass proxy (optional)
  - hf_token (str)
- run_experiment(task, experiment_id, parameters, on_benchmark_start=None)[source]¶
Run a single tuning experiment.
- Parameters:
- Return type:
- Returns:
Experiment results dictionary
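Example (an illustrative sketch; the contents of the task dictionary and the parameter values below are assumptions, not part of the generated docs):

    from orchestrator import AutotunerOrchestrator

    # Hypothetical task definition; the exact fields expected in `task`
    # follow the task JSON schema and are assumed here.
    task = {
        "task_name": "llama-3-2-1b-tuning",
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "runtime": "sglang",
    }

    orchestrator = AutotunerOrchestrator(
        deployment_mode="docker",             # 'ome', 'docker', or 'local'
        docker_model_path="/mnt/data/models",
        verbose=True,
    )

    results = orchestrator.run_experiment(
        task=task,
        experiment_id=1,
        parameters={"tp-size": 1, "mem-fraction-static": 0.8},
    )
    print(results)  # experiment results dictionary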
Controllers¶
Base Controller¶
Base Controller Interface
Abstract base class for model deployment controllers. Supports multiple deployment modes (OME/Kubernetes, Docker, etc.)
- class controllers.base_controller.BaseModelController[source]¶
  Bases: ABC
  Abstract base class for model deployment controllers.
- abstractmethod deploy_inference_service(task_name, experiment_id, namespace, model_name, runtime_name, parameters)[source]¶
Deploy a model inference service with specified parameters.
- Parameters:
  - task_name (str) – Autotuning task name
  - experiment_id (int) – Unique experiment identifier
  - namespace (str) – Namespace/resource group identifier
  - model_name (str) – Model name/path
  - runtime_name (str) – Runtime identifier (e.g., 'sglang')
  - parameters (Dict[str, Any]) – Deployment parameters (tp_size, mem_frac, etc.)
- Return type:
- Returns:
Service identifier (name/ID) if successful, None otherwise
- abstractmethod wait_for_ready(service_id, namespace, timeout=600, poll_interval=10)[source]¶
Wait for the inference service to become ready.
- Parameters:
- Return type:
- Returns:
True if service is ready, False if timeout or error
- abstractmethod delete_inference_service(service_id, namespace)[source]¶
Delete an inference service.
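Example (a minimal sketch of implementing the interface; MyController and its method bodies are hypothetical):

    from typing import Any, Dict, Optional

    from controllers.base_controller import BaseModelController

    class MyController(BaseModelController):
        """Hypothetical controller illustrating the abstract interface."""

        def deploy_inference_service(self, task_name, experiment_id, namespace,
                                     model_name, runtime_name,
                                     parameters) -> Optional[str]:
            # Start the service and return its identifier, or None on failure.
            return f"{task_name}-{experiment_id}"

        def wait_for_ready(self, service_id, namespace,
                           timeout=600, poll_interval=10) -> bool:
            # Poll the service until it responds, or until timeout.
            return True

        def delete_inference_service(self, service_id, namespace):
            # Tear the service down during cleanup.
            pass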
Docker Controller¶
Docker Deployment Controller
Manages the lifecycle of model inference services using standalone Docker containers. No Kubernetes required - direct Docker container management.
- class controllers.docker_controller.DockerController[source]¶
  Bases: BaseModelController
  Controller for managing standalone Docker container deployments.
- __init__(model_base_path='/mnt/data/models', http_proxy='', https_proxy='', no_proxy='', hf_token='')[source]¶
Initialize the Docker controller.
- Parameters:
  - model_base_path (str) – Base path where models are stored on the host
  - http_proxy (str) – HTTP proxy URL (optional)
  - https_proxy (str) – HTTPS proxy URL (optional)
  - no_proxy (str) – Comma-separated list of hosts to bypass proxy (optional)
  - hf_token (str) – HuggingFace access token for gated models (optional)
Note
Container logs are retrieved before deletion and saved to task log file. Containers are manually removed during cleanup phase.
- deploy_inference_service(task_name, experiment_id, namespace, model_name, runtime_name, parameters, image_tag=None)[source]¶
Deploy a model inference service using Docker.
- Parameters:
  - task_name (str) – Autotuning task name
  - experiment_id (int) – Unique experiment identifier
  - namespace (str) – Namespace identifier (used for container naming)
  - model_name (str) – Model name (HuggingFace model ID or local path)
  - runtime_name (str) – Runtime identifier (e.g., 'sglang', 'vllm')
  - parameters (Dict[str, Any]) – SGLang/runtime parameters (tp_size, mem_frac, etc.)
  - image_tag (Optional[str]) – Optional Docker image tag (e.g., 'v0.5.2-cu126')
- Return type:
- Returns:
Container ID if successful, None otherwise
- wait_for_ready(service_id, namespace, timeout=600, poll_interval=5)[source]¶
Wait for the Docker container service to become ready.
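Example (illustrative usage; the model name, parameter values, and image tag are assumptions):

    from controllers.docker_controller import DockerController

    controller = DockerController(
        model_base_path="/mnt/data/models",
        hf_token="",  # set for gated HuggingFace models
    )

    container_id = controller.deploy_inference_service(
        task_name="demo-task",
        experiment_id=1,
        namespace="autotuner",
        model_name="meta-llama/Llama-3.2-1B-Instruct",
        runtime_name="sglang",
        parameters={"tp-size": 1, "mem-fraction-static": 0.8},
        image_tag="v0.5.2-cu126",
    )

    if container_id and controller.wait_for_ready(container_id, "autotuner", timeout=600):
        print("service is ready")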
OME Controller¶
OME Deployment Controller
Manages the lifecycle of InferenceService resources for autotuning experiments.
- class controllers.ome_controller.OMEController[source]¶
  Bases: BaseModelController
  Controller for managing OME InferenceService deployments.
- deploy_inference_service(task_name, experiment_id, namespace, model_name, runtime_name, parameters, storage=None, enable_gpu_selection=True)[source]¶
Deploy an InferenceService with specified parameters.
- Parameters:
  - task_name (str) – Autotuning task name
  - experiment_id (int) – Unique experiment identifier (will be converted to string internally)
  - namespace (str) – K8s namespace
  - model_name (str) – Model name
  - runtime_name (str) – ServingRuntime name
  - parameters (Dict[str, Any]) – SGLang parameters (tp_size, mem_frac, etc.)
  - storage (Optional[Dict[str, Any]]) – Optional storage configuration for PVC support, e.g.:
    {
      'type': 'pvc',
      'pvc_name': 'model-storage-pvc',
      'pvc_subpath': 'meta/llama-3-2-1b-instruct',
      'mount_path': '/raid/models/meta/llama-3-2-1b-instruct'
    }
  - enable_gpu_selection (bool) – If True, intelligently select node with idle GPUs (default: True)
- Return type:
- Returns:
InferenceService name if successful, None otherwise
- wait_for_ready(isvc_name, namespace, timeout=600, poll_interval=10)[source]¶
Wait for InferenceService to become ready.
- ensure_clusterbasemodel(name, spec, labels=None, annotations=None)[source]¶
Ensure ClusterBaseModel exists, create if missing.
- Parameters:
- Return type:
- Returns:
True if exists or created successfully, False otherwise
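Example (illustrative usage with PVC storage; the default construction, namespace, and runtime name are assumptions):

    from controllers.ome_controller import OMEController

    controller = OMEController()  # assumed default construction

    storage = {
        "type": "pvc",
        "pvc_name": "model-storage-pvc",
        "pvc_subpath": "meta/llama-3-2-1b-instruct",
        "mount_path": "/raid/models/meta/llama-3-2-1b-instruct",
    }

    isvc_name = controller.deploy_inference_service(
        task_name="demo-task",
        experiment_id=1,
        namespace="autotuner",
        model_name="llama-3-2-1b-instruct",
        runtime_name="sglang-runtime",
        parameters={"tp-size": 1, "mem-fraction-static": 0.8},
        storage=storage,
        enable_gpu_selection=True,
    )

    if isvc_name:
        controller.wait_for_ready(isvc_name, "autotuner", timeout=600)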
Local Controller¶
Local Subprocess Controller
Manages the lifecycle of model inference services using local subprocess. No Docker or Kubernetes required - direct process management.
- class controllers.local_controller.LocalController[source]¶
  Bases: BaseModelController
  Controller for managing local subprocess deployments.
- __init__(model_base_path='/mnt/data/models', python_path='python3', http_proxy='', https_proxy='', no_proxy='', hf_token='')[source]¶
Initialize the local subprocess controller.
- Parameters:
  - model_base_path (str) – Base path where models are stored
  - python_path (str) – Path to python executable with sglang installed
  - http_proxy (str) – HTTP proxy URL (optional)
  - https_proxy (str) – HTTPS proxy URL (optional)
  - no_proxy (str) – Comma-separated list of hosts to bypass proxy (optional)
  - hf_token (str) – HuggingFace access token for gated models (optional)
- deploy_inference_service(task_name, experiment_id, namespace, model_name, runtime_name, parameters, image_tag=None)[source]¶
Deploy a model inference service using local subprocess.
- Parameters:
  - task_name (str) – Autotuning task name
  - experiment_id (int) – Unique experiment identifier
  - namespace (str) – Namespace identifier (used for naming)
  - model_name (str) – Model name (HuggingFace model ID or local path)
  - runtime_name (str) – Runtime identifier (e.g., 'sglang', 'vllm')
  - parameters (Dict[str, Any]) – Runtime parameters (tp_size, mem_frac, etc.)
  - image_tag (Optional[str]) – Unused in local mode, kept for compatibility
- Return type:
- Returns:
Service ID if successful, None otherwise
- wait_for_ready(service_id, namespace, timeout=600, poll_interval=5)[source]¶
Wait for the local subprocess service to become ready.
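Example (illustrative usage; the python_path and model name are assumptions):

    from controllers.local_controller import LocalController

    controller = LocalController(
        model_base_path="/mnt/data/models",
        python_path="/opt/sglang-venv/bin/python3",  # python with sglang installed
    )

    service_id = controller.deploy_inference_service(
        task_name="demo-task",
        experiment_id=1,
        namespace="local",
        model_name="meta-llama/Llama-3.2-1B-Instruct",
        runtime_name="sglang",
        parameters={"tp-size": 1},
    )
    if service_id:
        controller.wait_for_ready(service_id, "local", timeout=600)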
Benchmark Controllers¶
Direct GenAI-Bench Controller
Runs genai-bench directly using the CLI instead of Kubernetes BenchmarkJob CRDs. This bypasses the genai-bench v251014 image issues by using the local installation.
- class controllers.direct_benchmark_controller.DirectBenchmarkController[source]¶
  Bases: object
  Controller for running genai-bench directly via CLI.
- __init__(genai_bench_path='env/bin/genai-bench', verbose=False)[source]¶
Initialize the direct benchmark controller.
- setup_port_forward(service_name, namespace, remote_port=8080, local_port=8080)[source]¶
Setup kubectl port-forward for accessing InferenceService.
- Parameters:
- Return type:
- Returns:
Local endpoint URL if successful, None otherwise
- run_benchmark(task_name, experiment_id, service_name, namespace, benchmark_config, timeout=1800, local_port=8080, endpoint_url=None, gpu_indices=None)[source]¶
Run benchmark against an inference endpoint with automatic port forwarding.
- Parameters:
  - task_name (str) – Autotuning task name
  - experiment_id (int) – Unique experiment identifier
  - service_name (str) – K8s service name (or Docker container name)
  - namespace (str) – K8s namespace (ignored in Docker mode)
  - benchmark_config (Dict[str, Any]) – Benchmark configuration from input JSON
  - timeout (int) – Maximum execution time in seconds
  - local_port (int) – Local port for port forwarding (ignored if endpoint_url is provided)
  - endpoint_url (Optional[str]) – Optional direct endpoint URL (skips port-forward setup for Docker mode)
  - gpu_indices (Optional[List[int]]) – Optional list of GPU indices to monitor during benchmark
- Return type:
- Returns:
Dict containing benchmark metrics and GPU statistics, or None if failed
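Example (illustrative usage in Docker mode; the benchmark_config keys and endpoint URL are assumptions):

    from controllers.direct_benchmark_controller import DirectBenchmarkController

    controller = DirectBenchmarkController(verbose=True)

    # Hypothetical benchmark configuration; the real keys come from the
    # benchmark section of the input task JSON.
    benchmark_config = {"task": "text-to-text", "max_time_per_run": 300}

    metrics = controller.run_benchmark(
        task_name="demo-task",
        experiment_id=1,
        service_name="demo-container",
        namespace="autotuner",                 # ignored in Docker mode
        benchmark_config=benchmark_config,
        endpoint_url="http://localhost:8080",  # skips kubectl port-forward
        gpu_indices=[0, 1],
    )
    if metrics:
        print(metrics)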
GenAI-Bench Wrapper
Manages BenchmarkJob resources and collects metrics.
- class controllers.benchmark_controller.BenchmarkController[source]¶
  Bases: object
  Controller for managing OME BenchmarkJob resources.
- create_benchmark_job(task_name, experiment_id, namespace, isvc_name, benchmark_config)[source]¶
Create a BenchmarkJob to evaluate an InferenceService.
- Parameters:
- Return type:
- Returns:
BenchmarkJob name if successful, None otherwise
- wait_for_completion(benchmark_name, namespace, timeout=1800, poll_interval=15)[source]¶
Wait for BenchmarkJob to complete.
Utilities¶
Optimizer¶
Utility functions and classes for parameter optimization.
- utils.optimizer.generate_parameter_grid(parameter_spec)[source]¶
Generate all parameter combinations for grid search.
Supports two formats:
  1. Simple format: {"param_name": [value1, value2]} – direct list of values for each parameter
  2. Structured format: {"param_name": {"type": "choice", "values": [value1, value2]}} – legacy format with explicit type specification
- Parameters:
  - parameter_spec (Dict[str, Any]) – Dict mapping parameter names to their values or specifications. Simple: {"tp-size": [1, 2], "mem-fraction-static": [0.7, 0.8]}; Structured: {"tp_size": {"type": "choice", "values": [1, 2]}}
- Return type:
- Returns:
List of parameter dictionaries, one for each combination
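Example (illustrative; the ordering of the returned combinations is an assumption):

    from utils.optimizer import generate_parameter_grid

    # Simple format: a direct list of values per parameter
    grid = generate_parameter_grid(
        {"tp-size": [1, 2], "mem-fraction-static": [0.7, 0.8]}
    )
    # 4 combinations, e.g. {"tp-size": 1, "mem-fraction-static": 0.7}, ...

    # Structured (legacy) format with explicit type specification
    grid = generate_parameter_grid(
        {"tp_size": {"type": "choice", "values": [1, 2]}}
    )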
- utils.optimizer.calculate_slo_penalty(metrics, slo_config=None)[source]¶
Calculate SLO penalty with exponential curve near boundaries.
Implements tiered enforcement:
  - Minor violations: exponential penalty only
  - Severe violations: mark as hard failure
- Parameters:
  - slo_config (Optional[Dict[str, Any]]) – SLO configuration from task JSON. Format:
    {
      "latency": {
        "p50": {"threshold": 2.0, "weight": 1.0, "hard_fail": false},
        "p90": {"threshold": 5.0, "weight": 2.0, "hard_fail": true, "fail_ratio": 0.2}
      },
      "ttft": {"threshold": 1.0, "weight": 2.0, "hard_fail": false},
      "steepness": 0.1  # Controls exponential slope (lower = steeper)
    }
- Return type:
- Returns:
Tuple of (total_penalty_multiplier, is_hard_failure, violation_details):
  - penalty_multiplier: Value to multiply base score by (1.0 = no penalty)
  - is_hard_failure: True if experiment should be marked as failed
  - violation_details: Dict with per-metric violation info
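Example (illustrative call; the structure of the metrics dictionary is an assumption and may not match the real benchmark output keys):

    from utils.optimizer import calculate_slo_penalty

    slo_config = {
        "latency": {
            "p50": {"threshold": 2.0, "weight": 1.0, "hard_fail": False},
            "p90": {"threshold": 5.0, "weight": 2.0, "hard_fail": True, "fail_ratio": 0.2},
        },
        "ttft": {"threshold": 1.0, "weight": 2.0, "hard_fail": False},
        "steepness": 0.1,  # controls exponential slope (lower = steeper)
    }

    # Hypothetical metrics mirroring the SLO config layout.
    metrics = {"latency": {"p50": 1.8, "p90": 4.2}, "ttft": 0.9}

    penalty, is_hard_failure, violations = calculate_slo_penalty(metrics, slo_config)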
- utils.optimizer.check_batch_slo_compliance(batch_metrics, slo_config=None)[source]¶
Check if a single batch (concurrency level) meets SLO requirements.
This is used to filter out batches that violate SLO constraints before aggregation.
- Parameters:
- Return type:
- Returns:
Tuple of (is_compliant, violation_details):
  - is_compliant: True if batch meets all SLO requirements
  - violation_details: Dict with violation information for logging
- utils.optimizer.calculate_objective_score(results, objective='minimize_latency', slo_config=None)[source]¶
Calculate objective score from benchmark results with optional SLO penalties.
- Parameters:
- Return type:
- Returns:
Objective score with SLO penalties applied (lower is better for minimization, higher for maximization). Note: for hard SLO violations, returns the worst possible score (inf or -inf).
- class utils.optimizer.OptimizationStrategy[source]¶
  Bases: ABC
  Abstract base class for optimization strategies.
- abstractmethod tell_result(parameters, objective_score, metrics)[source]¶
Update strategy with experiment result.
- should_stop()[source]¶
Check if optimization should stop early.
- Return type:
- Returns:
True if strategy has converged or no more suggestions
- class utils.optimizer.GridSearchStrategy[source]¶
  Bases: OptimizationStrategy
  Grid search optimization - exhaustive evaluation of all combinations.
- __init__(parameter_spec, objective='minimize_latency', max_iterations=None)[source]¶
Initialize grid search strategy.
- class utils.optimizer.BayesianStrategy[source]¶
  Bases: OptimizationStrategy
  Bayesian optimization using Optuna's TPE sampler.
- __init__(parameter_spec, objective='minimize_latency', max_iterations=100, n_initial_random=5, study_name=None, storage=None)[source]¶
Initialize Bayesian optimization strategy.
- Parameters:
  - parameter_spec (Dict[str, Any]) – Parameter specification dictionary
  - objective (str) – Optimization objective
  - max_iterations (int) – Maximum number of trials
  - n_initial_random (int) – Number of random trials before Bayesian optimization
  - storage (Optional[str]) – Optional Optuna storage URL (e.g., sqlite:///optuna.db)
- suggest_parameters()[source]¶
Suggest next parameter configuration using Optuna.
Ensures no duplicate parameter combinations are tried by adding random perturbation if sampler suggests a duplicate.
- class utils.optimizer.RandomSearchStrategy[source]¶
  Bases: OptimizationStrategy
  Random search - random sampling from parameter space.
- __init__(parameter_spec, objective='minimize_latency', max_iterations=100, seed=None)[source]¶
Initialize random search strategy.
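Example (a minimal ask/tell loop sketch; run_and_benchmark is a hypothetical stand-in for deploying and benchmarking a configuration):

    from utils.optimizer import BayesianStrategy, calculate_objective_score

    def run_and_benchmark(params):
        # Hypothetical: deploy the service with `params`, run genai-bench,
        # and return the benchmark results dictionary.
        return {}

    strategy = BayesianStrategy(
        parameter_spec={"tp-size": [1, 2], "mem-fraction-static": [0.7, 0.8, 0.9]},
        objective="minimize_latency",
        max_iterations=20,
        n_initial_random=5,
    )

    while not strategy.should_stop():
        params = strategy.suggest_parameters()
        results = run_and_benchmark(params)
        score = calculate_objective_score(results, objective="minimize_latency")
        strategy.tell_result(params, score, results)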
GPU Discovery¶
GPU Discovery Utility
Provides functions to discover and select idle GPUs across the Kubernetes cluster.
- class utils.gpu_discovery.NodeGPUSummary[source]¶
  Bases: object
  Summary of GPU resources on a node.
- __init__(node_name, total_gpus, allocatable_gpus, gpus_with_metrics, avg_utilization, avg_memory_usage, idle_gpu_count)¶
- utils.gpu_discovery.get_cluster_gpu_status()[source]¶
Query cluster-wide GPU status using kubectl and nvidia-smi.
- Return type:
  List[ClusterGPUInfo]
- Returns:
  List of GPUInfo objects for all GPUs in the cluster
- utils.gpu_discovery.get_node_gpu_summaries()[source]¶
Get GPU summaries grouped by node.
- Return type:
- Returns:
Dictionary mapping node name to NodeGPUSummary
- utils.gpu_discovery.find_best_node_for_deployment(required_gpus=1, utilization_threshold=30.0, memory_threshold=50.0)[source]¶
Find the best node for deploying a new inference service.
Selection criteria (in order):
  1. Must have enough allocatable GPUs
  2. Prefer nodes with idle GPUs (low utilization and memory usage)
  3. Among idle nodes, prefer the one with most idle GPUs
  4. If no idle nodes, prefer node with lowest average utilization
- Parameters:
- Return type:
- Returns:
Node name to deploy to, or None if no suitable node found
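Example (illustrative; the thresholds shown are the documented defaults):

    from utils.gpu_discovery import find_best_node_for_deployment, get_node_gpu_summaries

    summaries = get_node_gpu_summaries()   # {node_name: NodeGPUSummary}

    node = find_best_node_for_deployment(
        required_gpus=2,
        utilization_threshold=30.0,
        memory_threshold=50.0,
    )
    if node is None:
        print("no suitable node found")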
GPU Monitor¶
GPU Monitoring and Management Utilities
Provides comprehensive GPU tracking, allocation, and monitoring capabilities for the LLM Autotuner.
- class utils.gpu_monitor.GPUMonitor[source]¶
  Bases: object
  Monitor and manage GPU resources.
- query_gpus(use_cache=True)[source]¶
Query all GPU information.
- Parameters:
  - use_cache (bool) – Whether to use cached results if available
- Return type:
- Returns:
GPUSnapshot or None if query fails
- get_available_gpus(min_memory_mb=None, max_utilization=50)[source]¶
Get list of available GPU indices.
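Example (illustrative; whether GPUMonitor takes constructor arguments is not shown here and default construction is assumed):

    from utils.gpu_monitor import GPUMonitor

    monitor = GPUMonitor()
    snapshot = monitor.query_gpus(use_cache=False)
    if snapshot is not None:
        idle = monitor.get_available_gpus(min_memory_mb=8000, max_utilization=50)
        print(idle)  # available GPU indices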
GPU Pool¶
GPU Resource Pool for parallel experiment execution.
This module provides a GPU resource pool that manages allocation and deallocation of GPU resources across concurrent experiments. It ensures:
  - No GPU allocation conflicts
  - Fair FIFO allocation order
  - Automatic cleanup on failure
  - Integration with existing gpu_monitor infrastructure
- class utils.gpu_pool.GPUResourcePool[source]¶
  Bases: object
  GPU resource pool for managing concurrent experiment execution.
  Features:
    - FIFO queue for fair allocation
    - Atomic acquire/release operations
    - Integration with gpu_monitor for availability checking
    - Automatic cleanup via context manager
  Example

      async with GPUResourcePool(max_parallel=3) as pool:
          allocation = await pool.acquire(required_gpus=2, experiment_id=1)
          try:
              # Run experiment with allocation.gpu_indices
              pass
          finally:
              await pool.release(allocation)
- __init__(max_parallel=1)[source]¶
Initialize GPU resource pool.
- Parameters:
  - max_parallel (int) – Maximum number of concurrent experiments
- async acquire(required_gpus, min_memory_mb=8000, experiment_id=None, params=None, timeout=None)[source]¶
Acquire GPU resources for an experiment.
This method:
  1. Waits for available slot if at max_parallel capacity
  2. Selects optimal GPUs using availability scoring
  3. Returns GPUAllocation object
- Parameters:
- Return type:
- Returns:
GPUAllocation object with selected GPU indices
- Raises:
asyncio.TimeoutError – If timeout expires
RuntimeError – If insufficient GPUs available
- async release(allocation)[source]¶
Release GPU resources.
- Parameters:
  - allocation (GPUAllocation) – GPUAllocation object from acquire()
- Return type:
- async utils.gpu_pool.estimate_and_acquire(pool, task_config, experiment_id=None, params=None, timeout=None)[source]¶
Helper function to estimate GPU requirements and acquire resources.
This combines estimate_gpu_requirements() from gpu_scheduler with the resource pool acquisition.
- Parameters:
- Return type:
- Returns:
GPUAllocation object
Example
    async with GPUResourcePool(max_parallel=3) as pool:
        allocation = await estimate_and_acquire(
            pool, task_config, experiment_id=1
        )
        try:
            # Run experiment
            pass
        finally:
            await pool.release(allocation)
Web Application¶
FastAPI Application¶
Database Models¶
Database models for tasks, experiments, and parameter presets.
- class web.db.models.TaskStatus[source]¶
  Task status enum.
- PENDING = 'pending'¶
- RUNNING = 'running'¶
- COMPLETED = 'completed'¶
- FAILED = 'failed'¶
- CANCELLED = 'cancelled'¶
- __new__(value)¶
- class web.db.models.Task[source]¶
  Bases: Base
  Autotuning task model.
- id¶
- task_name¶
- description¶
- status¶
- model_config¶
- base_runtime¶
- runtime_image_tag¶
- parameters¶
- optimization_config¶
- benchmark_config¶
- slo_config¶
- quant_config¶
- parallel_config¶
- clusterbasemodel_config¶
- clusterservingruntime_config¶
- created_clusterbasemodel¶
- created_clusterservingruntime¶
- deployment_mode¶
- task_metadata¶
- total_experiments¶
- successful_experiments¶
- best_experiment_id¶
- created_at¶
- started_at¶
- completed_at¶
- elapsed_time¶
- experiments¶
- best_experiment¶
- to_dict(include_full_config=False)[source]¶
Convert task to dictionary.
- Parameters:
include_full_config – If True, include all configuration details. If False, return summary view (for list endpoints).
- __init__(**kwargs)¶
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs. Only keys that are present as attributes of the instance's class are allowed. These could be, for example, any mapped columns or relationships.
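Example (illustrative; which kwargs are valid depends on the mapped columns, and whether the status column accepts the enum member or its string value is an assumption):

    from web.db.models import Task, TaskStatus

    task = Task(
        task_name="llama-3-2-1b-tuning",
        description="Grid search over tp-size and mem-fraction-static",
        status=TaskStatus.PENDING,
        deployment_mode="docker",
    )
    summary = task.to_dict(include_full_config=False)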
- class web.db.models.ExperimentStatus[source]¶
  Experiment status enum.
- PENDING = 'pending'¶
- DEPLOYING = 'deploying'¶
- BENCHMARKING = 'benchmarking'¶
- SUCCESS = 'success'¶
- FAILED = 'failed'¶
- __new__(value)¶
- class web.db.models.Experiment[source]¶
  Bases: Base
  Individual experiment (single parameter configuration) model.
- id¶
- task_id¶
- experiment_id¶
- parameters¶
- status¶
- error_message¶
- metrics¶
- objective_score¶
- gpu_info¶
- service_name¶
- service_url¶
- created_at¶
- started_at¶
- completed_at¶
- elapsed_time¶
- task¶
- to_dict(include_logs=False)[source]¶
Convert experiment to dictionary.
- Parameters:
include_logs – If True, include benchmark_logs (can be large).
- __init__(**kwargs)¶
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs. Only keys that are present as attributes of the instance's class are allowed. These could be, for example, any mapped columns or relationships.
- class web.db.models.ParameterPreset[source]¶
  Bases: Base
  Parameter preset model for reusable parameter configurations.
- id¶
- name¶
- description¶
- category¶
- runtime¶
- is_system¶
- parameters¶
- preset_metadata¶
- created_at¶
- updated_at¶
- __init__(**kwargs)¶
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs. Only keys that are present as attributes of the instance's class are allowed. These could be, for example, any mapped columns or relationships.
- class web.db.models.MessageRole[source]¶
  Chat message role enum.
- USER = 'user'¶
- ASSISTANT = 'assistant'¶
- SYSTEM = 'system'¶
- __new__(value)¶
- class web.db.models.ChatSession[source]¶
  Bases: Base
  Agent chat session model.
- id¶
- session_id¶
- user_id¶
- title¶
- context_summary¶
- is_active¶
- session_metadata¶
- created_at¶
- updated_at¶
- messages¶
- subscriptions¶
- __init__(**kwargs)¶
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs. Only keys that are present as attributes of the instance's class are allowed. These could be, for example, any mapped columns or relationships.
- class web.db.models.ChatMessage[source]¶
  Bases: Base
  Agent chat message model.
- id¶
- session_id¶
- role¶
- content¶
- tool_calls¶
- message_metadata¶
- token_count¶
- created_at¶
- session¶
- __init__(**kwargs)¶
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs. Only keys that are present as attributes of the instance's class are allowed. These could be, for example, any mapped columns or relationships.
- class web.db.models.AgentEventSubscription[source]¶
  Bases: Base
  Agent event subscription model for auto-triggering analysis.
- id¶
- session_id¶
- task_id¶
- event_types¶
- is_active¶
- created_at¶
- expires_at¶
- session¶
- task¶
- __init__(**kwargs)¶
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs. Only keys that are present as attributes of the instance's class are allowed. These could be, for example, any mapped columns or relationships.