Docker Deployment Mode

This document describes the standalone Docker deployment mode for the Autotuner.

Overview

The Docker mode allows you to run autotuning experiments using standalone Docker containers instead of Kubernetes. This is useful for:

  • Development and testing without full Kubernetes setup

  • Single-node deployments where Kubernetes overhead is unnecessary

  • Quick prototyping with direct GPU access

  • CI/CD pipelines where Docker is available but Kubernetes is not

Architecture

Docker Mode

Autotuner Orchestrator
    ↓
Docker Controller
    ↓
Docker Containers (with GPUs)
    ↓
Direct Benchmark (genai-bench CLI)

Comparison: OME vs Docker Mode

Feature               | OME Mode                   | Docker Mode
----------------------|----------------------------|---------------------------
Infrastructure        | Kubernetes + OME           | Docker only
Setup Complexity      | High                       | Low
Resource Requirements | K8s cluster                | Docker + GPU
Model Deployment      | InferenceService CRD       | Docker containers
Benchmark Execution   | BenchmarkJob or Direct CLI | Direct CLI only
Use Case              | Production, multi-node     | Development, single-node

Prerequisites

Required

  1. Docker with GPU support

    docker --version
    # Docker version 20.10+
    
  2. NVIDIA Docker Runtime

    docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
    
  3. Python 3.8+ with dependencies

    pip install docker requests
    pip install genai-bench  # For benchmarking
    
  4. Model files downloaded locally

    # Example: Download Llama model
    huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
      --local-dir /mnt/data/models/llama-3-2-1b-instruct
    

Optional

  • Multiple GPUs for tensor parallelism testing

Installation

1. Install Docker Dependencies

# Install Docker SDK for Python
pip install docker

# Or use the virtual environment
source env/bin/activate
pip install docker

2. Verify GPU Access

# Check GPU availability
nvidia-smi

# Test Docker GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

3. Download Models

# Create model directory
mkdir -p /mnt/data/models

# Download model (example)
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
  --local-dir /mnt/data/models/llama-3-2-1b-instruct

Usage

Basic Usage

# Run autotuning in Docker mode
python src/run_autotuner.py examples/docker_task.json --mode docker

Advanced Options

# Custom model path
python src/run_autotuner.py examples/docker_task.json \
  --mode docker \
  --model-path /data/models

# With verbose logging (--verbose is a planned future enhancement)
python src/run_autotuner.py examples/docker_task.json \
  --mode docker \
  --verbose

Task Configuration

Create a task JSON file (see examples/docker_task.json):

{
  "task_name": "docker-simple-tune",
  "description": "Docker deployment test",
  "deployment_mode": "docker",
  "model": {
    "name": "llama-3-2-1b-instruct",
    "namespace": "autotuner"
  },
  "base_runtime": "sglang",
  "parameters": {
    "tp_size": {"type": "choice", "values": [1, 2]},
    "mem_frac": {"type": "choice", "values": [0.7, 0.8]}
  },
  "optimization": {
    "strategy": "grid_search",
    "objective": "minimize_latency",
    "max_iterations": 4,
    "timeout_per_iteration": 600
  },
  "benchmark": {
    "task": "text-to-text",
    "traffic_scenarios": ["D(100,100)"],
    "num_concurrency": [1, 4],
    "max_time_per_iteration": 10,
    "max_requests_per_iteration": 50
  }
}
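
With the grid_search strategy, the number of experiments is the Cartesian product of the parameter value lists, so the 2 × 2 grid above matches max_iterations: 4. A minimal sketch of that enumeration, assuming the example file above (this is illustrative, not the Autotuner's internal API):

import itertools
import json

# Load the task definition shown above (path is the example file)
with open("examples/docker_task.json") as f:
    task = json.load(f)

# One experiment per combination of parameter values
names = list(task["parameters"])
value_lists = [task["parameters"][n]["values"] for n in names]
for combo in itertools.product(*value_lists):
    print(dict(zip(names, combo)))
# {'tp_size': 1, 'mem_frac': 0.7}
# {'tp_size': 1, 'mem_frac': 0.8}
# {'tp_size': 2, 'mem_frac': 0.7}
# {'tp_size': 2, 'mem_frac': 0.8}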

Workflow

For each experiment, the controller runs the following steps (a code sketch follows the list):

  1. Deploy Model Container

    • Pull Docker image (e.g., lmsysorg/sglang:v0.5.2-cu126)

    • Mount model volume

    • Allocate GPUs

    • Start container with tuning parameters

  2. Wait for Service Ready

    • Poll health endpoint

    • Timeout after 600s (configurable)

  3. Run Benchmark

    • Execute genai-bench CLI

    • Target: http://localhost:<port>

    • Collect metrics

  4. Cleanup

    • Stop and remove container

    • Release GPU resources
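
A minimal sketch of this cycle using the Docker SDK for Python. The launch flags, the /health endpoint, and run_genai_bench are assumptions for illustration (run_genai_bench is a hypothetical helper wrapping the genai-bench CLI); the controller's actual implementation may differ.

import time

import docker
import requests

client = docker.from_env()

def run_experiment(image, model_host_path, host_port, params, timeout=600):
    # 1. Deploy: start the runtime container with this trial's parameters
    container = client.containers.run(
        image,
        command=["python3", "-m", "sglang.launch_server",
                 "--model-path", "/model", "--host", "0.0.0.0", "--port", "8080",
                 "--tp-size", str(params["tp_size"]),
                 "--mem-fraction-static", str(params["mem_frac"])],
        volumes={model_host_path: {"bind": "/model", "mode": "ro"}},
        ports={"8080/tcp": host_port},
        device_requests=[docker.types.DeviceRequest(
            count=params["tp_size"], capabilities=[["gpu"]])],
        detach=True,
    )
    try:
        # 2. Wait for service ready: poll the health endpoint until timeout
        url = f"http://localhost:{host_port}"
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                if requests.get(f"{url}/health", timeout=5).ok:
                    break
            except requests.RequestException:
                pass
            time.sleep(5)
        else:
            raise TimeoutError("service did not become ready in time")

        # 3. Run benchmark: run_genai_bench is a hypothetical helper that
        #    invokes the genai-bench CLI against the local endpoint
        return run_genai_bench(target=url)
    finally:
        # 4. Cleanup: stop and remove the container, releasing its GPUs
        container.stop()
        container.remove()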

Supported Runtimes

SGLang (Default)

{
  "base_runtime": "sglang"
}

Docker Image: lmsysorg/sglang:v0.5.2-cu126

Parameters:

  • tp_size: Tensor parallelism size

  • mem_frac: GPU memory fraction

vLLM

{
  "base_runtime": "vllm"
}

Docker Image: vllm/vllm-openai:latest

Parameters:

  • tp_size: Tensor parallel size

  • mem_frac: GPU memory utilization
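
Both runtimes expose the same two tuning parameters, but the launch flags differ. A minimal sketch of the mapping, assuming recent SGLang and vLLM flag names (verify against the image versions you pull):

# Map the generic tuning parameters to runtime-specific launch flags.
# Flag names assume recent SGLang / vLLM releases; check your image version.
def build_runtime_args(runtime, tp_size, mem_frac):
    if runtime == "sglang":
        return ["--tp-size", str(tp_size),
                "--mem-fraction-static", str(mem_frac)]
    if runtime == "vllm":
        return ["--tensor-parallel-size", str(tp_size),
                "--gpu-memory-utilization", str(mem_frac)]
    raise ValueError(f"unsupported runtime: {runtime}")

print(build_runtime_args("vllm", 2, 0.8))
# ['--tensor-parallel-size', '2', '--gpu-memory-utilization', '0.8']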

Configuration Details

Model Path Mapping

Host path → Container path:

/mnt/data/models/llama-3-2-1b-instruct → /model

The controller automatically (see the sketch below):

  1. Resolves the model name to a host path

  2. Mounts it as a read-only volume

  3. Sets the MODEL_PATH=/model environment variable
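
A minimal sketch of the resulting mount and environment, expressed as keyword arguments for the Docker SDK's containers.run (the variable names here are illustrative):

host_path = "/mnt/data/models/llama-3-2-1b-instruct"

container_kwargs = {
    # Read-only bind mount of the host model directory to /model
    "volumes": {host_path: {"bind": "/model", "mode": "ro"}},
    # MODEL_PATH points at the in-container mount point
    "environment": {"MODEL_PATH": "/model"},
}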

GPU Selection

The controller automatically (see the sketch below):

  1. Queries GPU availability using nvidia-smi

  2. Selects the GPUs with the most free memory

  3. Sets the CUDA_VISIBLE_DEVICES environment variable

Example for tp_size=2:

# Selects GPUs 0 and 1 (most free memory)
CUDA_VISIBLE_DEVICES=0,1
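
A minimal sketch of that selection logic based on nvidia-smi's CSV query output; it approximates the controller's behavior rather than reproducing it:

import subprocess

def pick_gpus(n):
    # Query free memory per GPU as CSV lines, e.g. "0, 40000"
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        index, free_mib = (int(x) for x in line.split(","))
        gpus.append((free_mib, index))
    # Keep the n GPUs with the most free memory
    chosen = sorted(gpus, reverse=True)[:n]
    return ",".join(str(index) for _, index in chosen)

print(pick_gpus(2))  # e.g. "0,1" -> exported as CUDA_VISIBLE_DEVICES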

Port Management

The controller (see the sketch below):

  1. Scans ports 8000-8100 for availability

  2. Assigns the first available port

  3. Maps container port 8080 to the host port

Example:

Container port 8080 → Host port 8001
Service URL: http://localhost:8001
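
A minimal sketch of a first-free-port scan over that range (the controller's actual check may differ):

import socket

def find_free_port(start=8000, end=8100):
    # Return the first host port in [start, end] that can be bound
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("0.0.0.0", port))
                return port
            except OSError:
                continue
    raise RuntimeError(f"No available ports in range {start}-{end}")

host_port = find_free_port()
print(f"Container port 8080 -> host port {host_port}")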

Troubleshooting

1. Docker Connection Failed

Error:

Failed to connect to Docker daemon

Solution:

# Check Docker service
sudo systemctl status docker

# Start Docker
sudo systemctl start docker

# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker

2. GPU Not Available

Error:

Failed to allocate N GPU(s)

Solution:

# Check GPU status
nvidia-smi

# Check Docker GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Install/configure NVIDIA Docker runtime
# See: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

3. Container Startup Failed

Error:

Container status: exited

Solution:

# Check container logs (preserved for debugging)
docker logs <container-name>

# Common issues:
# - Model path not found
# - Insufficient GPU memory
# - Image pull errors

4. Model Path Not Found

Error:

Model path /mnt/data/models/llama-3-2-1b-instruct does not exist

Solution:

# Verify model path
ls -l /mnt/data/models/llama-3-2-1b-instruct

# Update model path in CLI
python src/run_autotuner.py task.json \
  --mode docker \
  --model-path /correct/path/to/models

5. Port Already in Use

Error:

No available ports in range 8000-8100

Solution:

# Check port usage
netstat -tulpn | grep -E ':80[0-9]{2}|:8100'

# Kill conflicting processes or wait for cleanup
docker ps | grep autotuner
docker stop <container-id>

6. Out of Memory

Error:

torch.OutOfMemoryError: CUDA out of memory

Solution:

{
  "parameters": {
    "mem_frac": {"type": "choice", "values": [0.6]}
  }
}

Adjust mem_frac based on GPU memory and model size.

Performance Tips

1. Pre-pull Images

# Pre-pull Docker images to avoid delays
docker pull lmsysorg/sglang:v0.5.2-cu126
docker pull vllm/vllm-openai:latest

2. Use Local Models

  • Store models on fast SSD

  • Avoid network-mounted storage for better performance

3. GPU Selection

For multi-GPU systems, the controller automatically selects the GPUs with the most free memory. You can influence selection by:

  • Running cleanup between experiments

  • Monitoring GPU usage with nvidia-smi

4. Reduce Timeout for Testing

{
  "optimization": {
    "timeout_per_iteration": 300
  }
}

Limitations

Current Limitations

  1. Single-node only - No distributed deployment

  2. Sequential execution - One experiment at a time

  3. Basic GPU allocation - No advanced scheduling

  4. Limited runtime support - SGLang and vLLM only

Future Enhancements

  • Parallel experiment execution

  • Advanced GPU scheduling and allocation

  • Support for more runtimes (TensorRT-LLM, etc.)

  • Container resource limits (CPU, memory)

  • Better error recovery and retry logic

  • Docker Compose integration for complex setups

Comparison with OME Mode

When to Use Docker Mode

Use Docker mode when:

  • Developing and testing locally

  • Running on a single node

  • Quick prototyping without K8s overhead

  • CI/CD pipeline with Docker

  • Direct GPU access needed

Don’t use Docker mode when:

  • You need multi-node distributed deployment

  • You require Kubernetes orchestration features

  • You need a production deployment with HA requirements

  • You need complex networking or a service mesh

Migration Path

Development → Production:

  1. Develop with Docker mode

    python src/run_autotuner.py task.json --mode docker
    
  2. Test with OME mode locally

    minikube start
    ./install.sh --install-ome
    python src/run_autotuner.py task.json --mode ome --direct
    
  3. Deploy to production with OME

    python src/run_autotuner.py task.json --mode ome
    

The task configuration remains compatible across modes!

Examples

Example 1: Basic Test

# Download model
mkdir -p /tmp/models
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
  --local-dir /tmp/models/llama-3-2-1b-instruct

# Run autotuning
python src/run_autotuner.py examples/docker_task.json \
  --mode docker \
  --model-path /tmp/models

Example 2: Multi-GPU Test

{
  "parameters": {
    "tp_size": {"type": "choice", "values": [1, 2, 4]},
    "mem_frac": {"type": "choice", "values": [0.8]}
  }
}
python src/run_autotuner.py task.json --mode docker

Example 3: Quick Iteration

Minimal config for fast testing:

{
  "parameters": {
    "mem_frac": {"type": "choice", "values": [0.7, 0.8]}
  },
  "optimization": {
    "timeout_per_iteration": 300
  },
  "benchmark": {
    "max_requests_per_iteration": 20
  }
}

See Also