Troubleshooting Guide¶
This document covers common issues and their solutions when using the LLM Autotuner.
Common Issues and Solutions¶
1. InferenceService Creation Fails: “cannot unmarshal number into Go struct field”¶
Error:
cannot unmarshal number into Go struct field ObjectMeta.labels of type string
Cause: Label values in Kubernetes must be strings, but numeric values were provided.
Solution: Already fixed in templates. Labels are now quoted:
labels:
autotuner.io/experiment-id: "{{ experiment_id }}" # Quoted
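If you render manifests yourself, a server-side dry run surfaces this kind of type error before anything is created (the manifest path below is illustrative):
# Validate a rendered manifest without applying it
kubectl apply --dry-run=server -f rendered/inferenceservice.yaml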
2. Deployment Fails: “spec.template.spec.containers[0].name: Required value”¶
Error:
spec.template.spec.containers[0].name: Required value
Cause: Container name was missing in the runner specification.
Solution: Already fixed. Template now includes:
runner:
name: ome-container # Required field
3. SGLang Fails: “Can’t load the configuration of ‘$MODEL_PATH’”¶
Error:
OSError: Can't load the configuration of '$MODEL_PATH'
Cause: Environment variable not being expanded in the args list.
Solution: Already fixed. Template uses K8s env var syntax:
args:
- --model-path
- $(MODEL_PATH) # Proper K8s env var expansion
4. BenchmarkJob Creation Fails: “spec.outputLocation: Required value”¶
Error:
spec.outputLocation: Required value
Cause: OME BenchmarkJob CRD requires an output storage location.
Solution: Already fixed. Template includes:
outputLocation:
storageUri: "pvc://benchmark-results-pvc/{{ benchmark_name }}"
Make sure the PVC exists:
kubectl apply -f config/benchmark-pvc.yaml
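Then confirm the claim is bound (assuming it lives in the autotuner namespace):
kubectl get pvc benchmark-results-pvc -n autotuner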
5. BenchmarkJob Fails: “unknown storage type for URI: local:///”¶
Error:
unknown storage type for URI: local:///tmp/...
Cause: OME only supports pvc:// (Persistent Volume Claims) and oci:// (Object Storage).
Solution: Use PVC storage (already configured):
# Create PVC first
kubectl apply -f config/benchmark-pvc.yaml
# Template automatically uses pvc://benchmark-results-pvc/
6. InferenceService Not Becoming Ready¶
Symptoms:
InferenceService shows Ready=False
Status: “ComponentNotReady: Target service not ready for ingress creation”
Debugging Steps:
# Check pod status
kubectl get pods -n autotuner
# Check pod logs
kubectl logs <pod-name> -n autotuner --tail=50
# Check InferenceService events
kubectl describe inferenceservice <name> -n autotuner
Common Causes:
Model not found or not ready
Runtime mismatch with model
Insufficient GPU resources
Container image pull errors
Typical Wait Time: 60-90 seconds for model loading and CUDA graph capture
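Two extra checks can save time here: confirm that the nodes actually advertise GPUs, and wait on the Ready condition instead of polling by hand (resource names below follow the commands above; adjust as needed):
# Check advertised GPU capacity on the nodes
kubectl describe nodes | grep -i nvidia.com/gpu
# Block until the InferenceService reports Ready (or the timeout expires)
kubectl wait --for=condition=Ready inferenceservice/<name> -n autotuner --timeout=180s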
7. GPU Resource Issues in Minikube¶
Problem: Minikube with the Docker driver cannot access host GPUs
Symptoms:
Pods pending with Insufficient nvidia.com/gpu
NVIDIA device plugin shows: "No devices found. Waiting indefinitely"
GPUs remain unavailable even with the minikube start --gpus=all flag
Root Cause: Nested containerization architecture prevents GPU access:
Host (with GPUs) → Docker → Minikube Container → Inner Docker → K8s Pods
The inner Docker daemon cannot see host GPUs even when outer Docker has GPU access.
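A quick confirmation is to look for NVIDIA device files inside the Minikube node; if none are visible, pods cannot get GPUs no matter how the cluster is configured:
# Expect no /dev/nvidia* entries with the Docker driver
minikube ssh "ls -l /dev/nvidia* 2>/dev/null || echo 'no NVIDIA devices visible'"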
Solutions:
Option A: Use Minikube with --driver=none (Requires bare metal)
# CAUTION: This runs Kubernetes directly on host (no container isolation)
minikube start --driver=none
Option B: Use proper Kubernetes cluster
Production K8s with NVIDIA GPU Operator
Kind with GPU support
K3s with proper GPU configuration
Option C: Direct Docker deployment (Development/Testing)
For quick testing without Kubernetes orchestration:
# Download model
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
--local-dir /tmp/llama-3.2-1b-instruct
# Run SGLang directly with Docker
docker run --gpus '"device=0"' -d --name sglang-llama \
-p 8000:8080 \
-v /tmp/llama-3.2-1b-instruct:/model \
lmsysorg/sglang:v0.5.2-cu126 \
python3 -m sglang.launch_server \
--model-path /model \
--host 0.0.0.0 \
--port 8080 \
--mem-frac 0.6
# Verify deployment
curl http://localhost:8000/health
Important Notes:
Check GPU availability first with nvidia-smi (see the query below)
Select a GPU with sufficient free memory
Adjust --mem-frac based on available GPU memory
Use device=N to select a specific GPU (0-7)
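To pick a GPU with enough free memory, list per-GPU memory first:
# Show free and total memory for each GPU index
nvidia-smi --query-gpu=index,name,memory.free,memory.total --format=csv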
8. SGLang CPU Backend Issues¶
Problem: SGLang CPU version crashes in containers
Symptoms:
Pod logs stop at “Load weight end”
Scheduler subprocess becomes defunct (zombie process)
Server never starts or responds
Root Cause:
SGLang CPU backend (lmsysorg/sglang:v0.5.3.post3-xeon) has subprocess management issues in containerized environments.
Solution: Use GPU-based deployment instead. CPU inference is not recommended for production or testing.
9. Model Download and Transfer Issues¶
Problem A: Gated Model Access Denied¶
Error:
401 Client Error: Unauthorized
Access to model meta-llama/Llama-3.2-1B-Instruct is restricted
Solution:
# 1. Accept license on HuggingFace website
# Visit: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
# 2. Create access token on HuggingFace
# Visit: https://huggingface.co/settings/tokens
# 3. Create Kubernetes secret (for OME)
kubectl create secret generic hf-token \
--from-literal=token=<your-token> \
-n ome
# 4. Or login locally (for direct download)
huggingface-cli login --token <your-token>
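To confirm the token is valid before starting a download:
# Prints the account the CLI is authenticated as
huggingface-cli whoami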
Problem B: Transferring Large Model Files to Minikube¶
Failed Methods:
minikube cp /dir → “Is a directory” error
minikube cp large.tar.gz → “scp: Broken pipe” (files > 1GB)
cat file | minikube ssh → Signal INT
rsync → “protocol version mismatch”
Working Solution:
# Compress model files
tar czf /tmp/model.tar.gz -C /tmp llama-3.2-1b-instruct
# Transfer using SCP with Minikube SSH key
scp -i $(minikube ssh-key) /tmp/model.tar.gz \
docker@$(minikube ip):~/
# Extract inside Minikube
minikube ssh "sudo mkdir -p /mnt/data/models && \
sudo tar xzf ~/model.tar.gz -C /mnt/data/models/"
# Verify
minikube ssh "ls -lh /mnt/data/models/llama-3.2-1b-instruct"
Size Reference:
Llama 3.2 1B: ~2.4GB uncompressed, ~887MB compressed
Transfer time: ~30-60 seconds depending on disk speed
10. Docker GPU Out of Memory¶
Symptoms:
Container starts but crashes during model loading
Error: torch.OutOfMemoryError: CUDA out of memory
CUDA graph capture fails
Debugging:
# Check GPU status and memory usage
nvidia-smi
# Look for existing workloads
nvidia-smi --query-compute-apps=pid,process_name,used_memory \
--format=csv
Solutions:
A. Select a different GPU:
# Use GPU 1 instead of GPU 0
docker run --gpus '"device=1"' ...
B. Reduce memory allocation:
# Reduce --mem-frac parameter
--mem-frac 0.6 # Instead of 0.8
C. Stop competing workloads:
# Identify process using GPU
ps aux | grep <pid-from-nvidia-smi>
# Kill if safe to do so
kill <pid>
Memory Allocation Guide:
Small models (1-3B): --mem-frac 0.6-0.7
Medium models (7-13B): --mem-frac 0.8-0.85
Large models (70B+): --mem-frac 0.9-0.95
Always leave 10-20% GPU memory free for activations and temporary tensors.
11. Wrong Model or Runtime Name¶
Symptoms:
InferenceService fails to create
Error about model or runtime not found
Solution:
# List available models
kubectl get clusterbasemodels
# List available runtimes
kubectl get clusterservingruntimes
# Update examples/simple_task.json with correct names
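To cross-check quickly, print the exact resource names and compare them against the task file (the grep pattern is only a guess at the field names; adjust it to your task schema):
# Exact names as the cluster knows them
kubectl get clusterbasemodels,clusterservingruntimes -o name
# Names referenced in the task file (pattern is illustrative)
grep -iE 'model|runtime' examples/simple_task.json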
12. Network and Proxy Configuration¶
Problem: Images can’t be pulled, models can’t be downloaded in Minikube
Symptoms:
ImagePullBackOff errors
OME model agent can’t download models
Connection timeouts
Solution: Configure Docker proxy in Minikube
# SSH into Minikube
minikube ssh
# Create proxy configuration
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf <<EOF
[Service]
Environment="HTTP_PROXY=http://YOUR_PROXY:PORT"
Environment="HTTPS_PROXY=http://YOUR_PROXY:PORT"
Environment="NO_PROXY=localhost,127.0.0.1,10.96.0.0/12"
EOF
# Reload and restart Docker
sudo systemctl daemon-reload
sudo systemctl restart docker
# Exit Minikube SSH
exit
# Restart Minikube to apply changes
minikube stop && minikube start
Configure OME Model Agent:
# Patch model agent DaemonSet
kubectl set env daemonset/ome-model-agent \
-n ome \
HTTP_PROXY=http://YOUR_PROXY:PORT \
HTTPS_PROXY=http://YOUR_PROXY:PORT \
NO_PROXY=localhost,127.0.0.1,10.96.0.0/12
# Wait for pods to restart
kubectl rollout status daemonset/ome-model-agent -n ome
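To verify the settings took effect (the jsonpath assumes the proxy variables were set on the first container of the DaemonSet):
# Docker daemon inside Minikube should now report the proxy
minikube ssh "docker info | grep -i proxy"
# Model agent pod spec should carry the proxy variables
kubectl get daemonset/ome-model-agent -n ome \
  -o jsonpath='{.spec.template.spec.containers[0].env}'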
13. BenchmarkJob Stays in “Running” Status¶
Symptoms:
BenchmarkJob doesn’t complete
No error messages
Debugging:
# Check benchmark pod logs
kubectl get pods -n autotuner | grep bench
kubectl logs <benchmark-pod> -n autotuner
# Check BenchmarkJob status
kubectl describe benchmarkjob <name> -n autotuner
Common Causes:
InferenceService endpoint not reachable
Traffic scenarios too demanding
Timeout settings too low
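For the first cause above, a throwaway curl pod can confirm whether the InferenceService endpoint is reachable from inside the cluster (the service hostname and /health path are assumptions; substitute the actual endpoint):
# One-off in-cluster connectivity check
kubectl run curl-test --rm -it --restart=Never -n autotuner \
  --image=curlimages/curl -- \
  curl -sS http://<inferenceservice-name>.autotuner.svc.cluster.local/health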
Monitoring Tips¶
Watch Resources in Real-Time¶
# All resources in autotuner namespace
watch kubectl get inferenceservices,benchmarkjobs,pods -n autotuner
# Just InferenceServices
kubectl get inferenceservices -n autotuner -w
# Pod logs
kubectl logs -f <pod-name> -n autotuner
Check OME Controller Logs¶
kubectl logs -n ome deployment/ome-controller-manager --tail=100
Performance Tips¶
Reduce timeout values for faster iteration during development:
"optimization": { "timeout_per_iteration": 300 // 5 minutes instead of 10 }
Use smaller benchmark workloads for testing:
"benchmark": { "traffic_scenarios": ["D(100,100)"], // Lighter load "max_requests_per_iteration": 50 // Fewer requests }
Limit parameter grid for initial testing:
"parameters": { "mem_frac": {"type": "choice", "values": [0.85, 0.9]} // Just 2 values }
Viewing GenAI-Bench Logs¶
Verbose Mode for Real-Time Output¶
Use the --verbose or -v flag to stream genai-bench output in real-time:
python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose
Benefits:
Real-time feedback during benchmarks
See progress during long runs
Useful for debugging connection/API issues
Detect problems early
Default mode (no flag) shows output only after completion.
Usage Examples¶
Debugging connection issues:
python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose
Long-running benchmarks with log file:
python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose 2>&1 | tee autotuner.log
Manual Log Inspection¶
View benchmark results directly:
# List all benchmark results
ls -R benchmark_results/
# View specific experiment metadata
cat benchmark_results/docker-simple-tune-exp1/experiment_metadata.json
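For a quick readable view of any JSON file under benchmark_results/ without extra tooling:
# Pretty-print the metadata file
python3 -m json.tool benchmark_results/docker-simple-tune-exp1/experiment_metadata.json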
When to Use Verbose Mode¶
Use verbose for:
Initial testing of new configurations
Debugging connection/API issues
Long-running benchmarks (>5 minutes)
Monitoring progress
Use default mode for:
Production runs
CI/CD pipelines
Multiple parallel experiments