# Quick Start Guide
Get started with the LLM Autotuner in 5 minutes.
## Option 1: Docker Demo Image (Recommended)
The fastest way to get started is using the pre-built Docker image.
### Prerequisites
- **Docker with GPU support** - verify that the NVIDIA Container Toolkit is working:

    ```bash
    docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
    ```

    If this fails, install `nvidia-container-toolkit` first.

- **A model downloaded locally:**

    ```bash
    pip install huggingface_hub
    huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
        --local-dir /mnt/data/models/llama-3-2-1b-instruct
    ```
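Before launching the container, it can help to verify that the files landed where the volume mount expects them (the exact listing depends on the model snapshot):

```bash
# Expect config.json, tokenizer files, and *.safetensors weights here.
ls /mnt/data/models/llama-3-2-1b-instruct
```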
### Run the Container
```bash
docker run -d --name autotuner \
  -p 8000:8000 \
  -v /mnt/data/models:/mnt/data/models \
  -v autotuner-data:/data \
  -e HF_TOKEN=your_huggingface_token \
  --gpus all \
  novitalabs/autotuner-demo:v0.2.1
```
**Environment Variables:**

| Variable | Description | Required |
|---|---|---|
| `HF_TOKEN` | HuggingFace token (for gated models like Llama) | For gated models |
| `HF_ENDPOINT` | Proxy for HuggingFace downloads | Optional |
| `HTTPS_PROXY` | Proxy for HTTPS connections | Optional |
| `TZ` | Timezone (e.g. `UTC`) | Optional |
**Volume Mounts:**

| Mount | Description |
|---|---|
| `/mnt/data/models` | Directory containing your models |
| `/data` | Persistent storage for database and results |
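Before opening the UI, it is worth confirming the container started cleanly. These are standard Docker checks; the demo image's health endpoint is not documented here, so the last command simply probes the web port:

```bash
# The autotuner container should be listed as running.
docker ps --filter name=autotuner

# Tail the startup logs; model loading can take a little while.
docker logs --tail 50 autotuner

# Expect HTTP 200 once the web UI is serving.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000
```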
### Access the Web UI
Open http://localhost:8000 in your browser.
### Create Your First Task
1. Click **Tasks** in the sidebar
2. Click **Create Task**
3. Fill in the configuration:
    - **Task Name**: `my-tune`
    - **Model ID**: `llama-3-2-1b-instruct`
    - **Runtime**: SGLang
    - Add parameter `mem-fraction-static` with values `0.8, 0.9`
4. Click **Create Task**
5. Click **Start Task** to begin autotuning
Or create via API:
```bash
curl -X POST http://localhost:8000/api/tasks/ \
  -H "Content-Type: application/json" \
  -d '{
    "task_name": "my-tune",
    "model": {"id_or_path": "llama-3-2-1b-instruct"},
    "base_runtime": "sglang",
    "parameters": {"mem-fraction-static": [0.8, 0.9]},
    "optimization": {"strategy": "grid_search", "objective": "maximize_throughput"},
    "benchmark": {"num_concurrency": [1, 4]}
  }'

# Start the task
curl -X POST http://localhost:8000/api/tasks/1/start
```
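Both calls return immediately; tuning runs in the background. One simple way to follow progress is to poll the task endpoint (the same one used under Results below):

```bash
# Re-fetch the task status every 10 seconds; Ctrl-C to stop.
watch -n 10 "curl -s http://localhost:8000/api/tasks/1"
```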
## Option 2: Development Installation
For development or customization, install from source.
### Prerequisites
- **Docker with GPU support** - verify that the NVIDIA Container Toolkit is working:

    ```bash
    docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
    ```

    If this fails, install `nvidia-container-toolkit` first.

- **Python 3.8+ and Redis** - start Redis with Docker (a connectivity check follows after this list):

    ```bash
    docker run -d -p 6379:6379 redis:alpine
    ```

- **A model downloaded locally:**

    ```bash
    pip install huggingface_hub
    huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
        --local-dir /mnt/data/models/llama-3-2-1b-instruct
    ```
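A quick way to confirm Redis is reachable on its default port, using a throwaway `redis-cli` container (host networking, so this sketch assumes Linux):

```bash
# Should print PONG if the Redis container is accepting connections.
docker run --rm --network host redis:alpine redis-cli -h 127.0.0.1 -p 6379 ping
```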
### Installation
```bash
git clone <repository-url>
cd autotuner
pip install -r requirements.txt
pip install genai-bench
```
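As a sanity check that the dependencies resolved, you can invoke the runner's help output (assuming it exposes a standard `--help` flag, which this guide does not confirm):

```bash
# Should print usage information without import errors.
python src/run_autotuner.py --help
```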
### Configuration
Copy the example environment file and customize as needed:
```bash
cp .env.example .env
```
Key settings in `.env`:

| Variable | Description | Default |
|---|---|---|
| `BACKEND_PORT` | Backend API port | `8000` |
| `FRONTEND_PORT` | Frontend dev server port | `5173` |
| `MODELS_DIR` | Host path to models | `/mnt/data/models` |
| `HF_TOKEN` | HuggingFace token (for gated models) | - |
| `HF_ENDPOINT` | Proxy for HuggingFace downloads | - |
See .env.example for all available options including Agent and GitHub integration settings.
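For reference, a minimal `.env` could look like the sketch below; the variable names follow the table above, but treat `.env.example` as the authoritative source:

```bash
# Minimal .env sketch - verify the names against .env.example.
BACKEND_PORT=8000
FRONTEND_PORT=5173
MODELS_DIR=/mnt/data/models

# Only needed for gated models such as Llama:
HF_TOKEN=your_huggingface_token
```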
### Run Your First Task
Command Line:
```bash
python src/run_autotuner.py examples/docker_task.yaml --mode docker
```
Or use the Web UI:
```bash
# Terminal 1
./scripts/start_dev.sh

# Terminal 2
cd frontend && npm install && npm run dev
```
Then open http://localhost:5173 to create and monitor tasks.
Tip: Drag and drop a YAML file onto the New Task page to quickly import a task configuration.
### Example Configuration
```yaml
task_name: my-tune
model:
  id_or_path: llama-3-2-1b-instruct
base_runtime: sglang
parameters:
  mem-fraction-static: [0.8, 0.9]
optimization:
  strategy: grid_search
  objective: maximize_throughput
benchmark:
  num_concurrency: [1, 4]
```
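Assuming the file is saved as `my-tune.yaml` (any path works), it runs through the same command-line runner shown earlier:

```bash
# Launch the saved configuration with the CLI runner in Docker mode.
python src/run_autotuner.py my-tune.yaml --mode docker
```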
### Results
View results in the Web UI Dashboard or via API:
```bash
# Get task status
curl http://localhost:8000/api/tasks/1

# Get best experiment results
curl http://localhost:8000/api/experiments/{best_experiment_id}
```
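If the task payload exposes the best experiment's ID, the two calls can be chained; `best_experiment_id` as a JSON field name is an assumption here (check the actual response schema), and `jq` must be installed:

```bash
# Extract the (assumed) best_experiment_id field and fetch its results.
BEST_ID=$(curl -s http://localhost:8000/api/tasks/1 | jq -r '.best_experiment_id')
curl http://localhost:8000/api/experiments/$BEST_ID
```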
## Next Steps
- **Docker Mode** - Full Docker documentation
- **SLO Scoring** - Add SLO constraints
- **Bayesian Optimization** - Smarter optimization