Troubleshooting Guide¶
This document covers common issues and their solutions when using the LLM Autotuner.
Common Issues and Solutions¶
1. InferenceService Creation Fails: “cannot unmarshal number into Go struct field”¶
Error:
cannot unmarshal number into Go struct field ObjectMeta.labels of type string
Cause: Label values in Kubernetes must be strings, but numeric values were provided.
Solution: Already fixed in templates. Labels are now quoted:
labels:
autotuner.io/experiment-id: "{{ experiment_id }}" # Quoted
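If you render manifests yourself, a server-side dry run surfaces this kind of type error before anything is created (the manifest path below is illustrative):
# Validate a rendered manifest without applying it
kubectl apply --dry-run=server -f rendered/inferenceservice.yaml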
2. Deployment Fails: “spec.template.spec.containers[0].name: Required value”¶
Error:
spec.template.spec.containers[0].name: Required value
Cause: Container name was missing in the runner specification.
Solution: Already fixed. Template now includes:
runner:
name: ome-container # Required field
3. SGLang Fails: “Can’t load the configuration of ‘$MODEL_PATH’”¶
Error:
OSError: Can't load the configuration of '$MODEL_PATH'
Cause: Environment variable not being expanded in the args list.
Solution: Already fixed. Template uses K8s env var syntax:
args:
- --model-path
- $(MODEL_PATH) # Proper K8s env var expansion
4. BenchmarkJob Creation Fails: “spec.outputLocation: Required value”¶
Error:
spec.outputLocation: Required value
Cause: OME BenchmarkJob CRD requires an output storage location.
Solution: Already fixed. Template includes:
outputLocation:
storageUri: "pvc://benchmark-results-pvc/{{ benchmark_name }}"
Make sure the PVC exists:
kubectl apply -f config/benchmark-pvc.yaml
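Then confirm the claim is bound (assuming it lives in the autotuner namespace):
kubectl get pvc benchmark-results-pvc -n autotuner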
5. BenchmarkJob Fails: “unknown storage type for URI: local:///”¶
Error:
unknown storage type for URI: local:///tmp/...
Cause: OME only supports pvc:// (Persistent Volume Claims) and oci:// (Object Storage).
Solution: Use PVC storage (already configured):
# Create PVC first
kubectl apply -f config/benchmark-pvc.yaml
# Template automatically uses pvc://benchmark-results-pvc/
6. InferenceService Not Becoming Ready¶
Symptoms:
InferenceService shows Ready=False
Status: “ComponentNotReady: Target service not ready for ingress creation”
Debugging Steps:
# Check pod status
kubectl get pods -n autotuner
# Check pod logs
kubectl logs <pod-name> -n autotuner --tail=50
# Check InferenceService events
kubectl describe inferenceservice <name> -n autotuner
Common Causes:
Model not found or not ready
Runtime mismatch with model
Insufficient GPU resources
Container image pull errors
Typical Wait Time: 60-90 seconds for model loading and CUDA graph capture
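Two extra checks can save time here: confirm that the nodes actually advertise GPUs, and wait on the Ready condition instead of polling by hand (resource names below follow the commands above; adjust as needed):
# Check advertised GPU capacity on the nodes
kubectl describe nodes | grep -i nvidia.com/gpu
# Block until the InferenceService reports Ready (or the timeout expires)
kubectl wait --for=condition=Ready inferenceservice/<name> -n autotuner --timeout=180s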
7. GPU Resource Issues in Minikube¶
Problem: Minikube with the Docker driver cannot access host GPUs
Symptoms:
Pods pending with Insufficient nvidia.com/gpu
NVIDIA device plugin shows: "No devices found. Waiting indefinitely"
GPUs remain unavailable even with the minikube start --gpus=all flag
Root Cause: Nested containerization architecture prevents GPU access:
Host (with GPUs) → Docker → Minikube Container → Inner Docker → K8s Pods
The inner Docker daemon cannot see host GPUs even when outer Docker has GPU access.
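A quick confirmation is to look for NVIDIA device files inside the Minikube node; if none are visible, pods cannot get GPUs no matter how the cluster is configured:
# Expect no /dev/nvidia* entries with the Docker driver
minikube ssh "ls -l /dev/nvidia* 2>/dev/null || echo 'no NVIDIA devices visible'"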
Solutions:
Option A: Use Minikube with --driver=none (Requires bare metal)
# CAUTION: This runs Kubernetes directly on host (no container isolation)
minikube start --driver=none
Option B: Use proper Kubernetes cluster
Production K8s with NVIDIA GPU Operator
Kind with GPU support
K3s with proper GPU configuration
Option C: Direct Docker deployment (Development/Testing)
For quick testing without Kubernetes orchestration:
# Download model
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
--local-dir /tmp/llama-3.2-1b-instruct
# Run SGLang directly with Docker
docker run --gpus '"device=0"' -d --name sglang-llama \
-p 8000:8080 \
-v /tmp/llama-3.2-1b-instruct:/model \
lmsysorg/sglang:v0.5.2-cu126 \
python3 -m sglang.launch_server \
--model-path /model \
--host 0.0.0.0 \
--port 8080 \
--mem-frac 0.6
# Verify deployment
curl http://localhost:8000/health
Important Notes:
Check GPU availability first with nvidia-smi (see the query below)
Select a GPU with sufficient free memory
Adjust --mem-frac based on available GPU memory
Use device=N to select a specific GPU (0-7)
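To pick a GPU with enough free memory, list per-GPU memory first:
# Show free and total memory for each GPU index
nvidia-smi --query-gpu=index,name,memory.free,memory.total --format=csv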
8. SGLang CPU Backend Issues¶
Problem: SGLang CPU version crashes in containers
Symptoms:
Pod logs stop at “Load weight end”
Scheduler subprocess becomes defunct (zombie process)
Server never starts or responds
Root Cause:
SGLang CPU backend (lmsysorg/sglang:v0.5.3.post3-xeon) has subprocess management issues in containerized environments.
Solution: Use GPU-based deployment instead. CPU inference is not recommended for production or testing.
9. Model Download and Transfer Issues¶
Problem A: Gated Model Access Denied¶
Error:
401 Client Error: Unauthorized
Access to model meta-llama/Llama-3.2-1B-Instruct is restricted
Solution:
# 1. Accept license on HuggingFace website
# Visit: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
# 2. Create access token on HuggingFace
# Visit: https://huggingface.co/settings/tokens
# 3. Create Kubernetes secret (for OME)
kubectl create secret generic hf-token \
--from-literal=token=<your-token> \
-n ome
# 4. Or login locally (for direct download)
huggingface-cli login --token <your-token>
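To confirm the token is valid before starting a download:
# Prints the account the CLI is authenticated as
huggingface-cli whoami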
Problem B: Transferring Large Model Files to Minikube¶
Failed Methods:
minikube cp /dir → “Is a directory” error
minikube cp large.tar.gz → “scp: Broken pipe” (files > 1GB)
cat file | minikube ssh → Signal INT
rsync → “protocol version mismatch”
Working Solution:
# Compress model files
tar czf /tmp/model.tar.gz -C /tmp llama-3.2-1b-instruct
# Transfer using SCP with Minikube SSH key
scp -i $(minikube ssh-key) /tmp/model.tar.gz \
docker@$(minikube ip):~/
# Extract inside Minikube
minikube ssh "sudo mkdir -p /mnt/data/models && \
sudo tar xzf ~/model.tar.gz -C /mnt/data/models/"
# Verify
minikube ssh "ls -lh /mnt/data/models/llama-3.2-1b-instruct"
Size Reference:
Llama 3.2 1B: ~2.4GB uncompressed, ~887MB compressed
Transfer time: ~30-60 seconds depending on disk speed
10. Docker GPU Out of Memory¶
Symptoms:
Container starts but crashes during model loading
Error: torch.OutOfMemoryError: CUDA out of memory
CUDA graph capture fails
Debugging:
# Check GPU status and memory usage
nvidia-smi
# Look for existing workloads
nvidia-smi --query-compute-apps=pid,process_name,used_memory \
--format=csv
Solutions:
A. Select a different GPU:
# Use GPU 1 instead of GPU 0
docker run --gpus '"device=1"' ...
B. Reduce memory allocation:
# Reduce --mem-frac parameter
--mem-frac 0.6 # Instead of 0.8
C. Stop competing workloads:
# Identify process using GPU
ps aux | grep <pid-from-nvidia-smi>
# Kill if safe to do so
kill <pid>
Memory Allocation Guide:
Small models (1-3B): --mem-frac 0.6-0.7
Medium models (7-13B): --mem-frac 0.8-0.85
Large models (70B+): --mem-frac 0.9-0.95
Always leave 10-20% GPU memory free for activations and temporary tensors.
11. Wrong Model or Runtime Name¶
Symptoms:
InferenceService fails to create
Error about model or runtime not found
Solution:
# List available models
kubectl get clusterbasemodels
# List available runtimes
kubectl get clusterservingruntimes
# Update examples/simple_task.json with correct names
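To cross-check quickly, print the exact resource names and compare them against the task file (the grep pattern is only a guess at the field names; adjust it to your task schema):
# Exact names as the cluster knows them
kubectl get clusterbasemodels,clusterservingruntimes -o name
# Names referenced in the task file (pattern is illustrative)
grep -iE 'model|runtime' examples/simple_task.json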
12. Network and Proxy Configuration¶
Problem: Images can’t be pulled, models can’t be downloaded in Minikube
Symptoms:
ImagePullBackOff errors
OME model agent can’t download models
Connection timeouts
Solution: Configure Docker proxy in Minikube
# SSH into Minikube
minikube ssh
# Create proxy configuration
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf <<EOF
[Service]
Environment="HTTP_PROXY=http://YOUR_PROXY:PORT"
Environment="HTTPS_PROXY=http://YOUR_PROXY:PORT"
Environment="NO_PROXY=localhost,127.0.0.1,10.96.0.0/12"
EOF
# Reload and restart Docker
sudo systemctl daemon-reload
sudo systemctl restart docker
# Exit Minikube SSH
exit
# Restart Minikube to apply changes
minikube stop && minikube start
Configure OME Model Agent:
# Patch model agent DaemonSet
kubectl set env daemonset/ome-model-agent \
-n ome \
HTTP_PROXY=http://YOUR_PROXY:PORT \
HTTPS_PROXY=http://YOUR_PROXY:PORT \
NO_PROXY=localhost,127.0.0.1,10.96.0.0/12
# Wait for pods to restart
kubectl rollout status daemonset/ome-model-agent -n ome
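To verify the settings took effect (the jsonpath assumes the proxy variables were set on the first container of the DaemonSet):
# Docker daemon inside Minikube should now report the proxy
minikube ssh "docker info | grep -i proxy"
# Model agent pod spec should carry the proxy variables
kubectl get daemonset/ome-model-agent -n ome \
  -o jsonpath='{.spec.template.spec.containers[0].env}'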
13. BenchmarkJob Stays in “Running” Status¶
Symptoms:
BenchmarkJob doesn’t complete
No error messages
Debugging:
# Check benchmark pod logs
kubectl get pods -n autotuner | grep bench
kubectl logs <benchmark-pod> -n autotuner
# Check BenchmarkJob status
kubectl describe benchmarkjob <name> -n autotuner
Common Causes:
InferenceService endpoint not reachable
Traffic scenarios too demanding
Timeout settings too low
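For the first cause above, a throwaway curl pod can confirm whether the InferenceService endpoint is reachable from inside the cluster (the service hostname and /health path are assumptions; substitute the actual endpoint):
# One-off in-cluster connectivity check
kubectl run curl-test --rm -it --restart=Never -n autotuner \
  --image=curlimages/curl -- \
  curl -sS http://<inferenceservice-name>.autotuner.svc.cluster.local/health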
Monitoring Tips¶
Watch Resources in Real-Time¶
# All resources in autotuner namespace
watch kubectl get inferenceservices,benchmarkjobs,pods -n autotuner
# Just InferenceServices
kubectl get inferenceservices -n autotuner -w
# Pod logs
kubectl logs -f <pod-name> -n autotuner
Check OME Controller Logs¶
kubectl logs -n ome deployment/ome-controller-manager --tail=100
Performance Tips¶
Reduce timeout values for faster iteration during development:
"optimization": { "timeout_per_iteration": 300 // 5 minutes instead of 10 }
Use smaller benchmark workloads for testing:
"benchmark": { "traffic_scenarios": ["D(100,100)"], // Lighter load "max_requests_per_iteration": 50 // Fewer requests }
Limit parameter grid for initial testing:
"parameters": { "mem_frac": {"type": "choice", "values": [0.85, 0.9]} // Just 2 values }
Viewing GenAI-Bench Logs¶
Verbose Mode for Real-Time Output¶
Use the --verbose or -v flag to stream genai-bench output in real-time:
python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose
Benefits:
Real-time feedback during benchmarks
See progress during long runs
Useful for debugging connection/API issues
Detect problems early
Default mode (no flag) shows output only after completion.
Usage Examples¶
Debugging connection issues:
python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose
Long-running benchmarks with log file:
python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose 2>&1 | tee autotuner.log
Manual Log Inspection¶
View benchmark results directly:
# List all benchmark results
ls -R benchmark_results/
# View specific experiment metadata
cat benchmark_results/docker-simple-tune-exp1/experiment_metadata.json
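For a quick readable view of any JSON file under benchmark_results/ without extra tooling:
# Pretty-print the metadata file
python3 -m json.tool benchmark_results/docker-simple-tune-exp1/experiment_metadata.json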
When to Use Verbose Mode¶
Use verbose for:
Initial testing of new configurations
Debugging connection/API issues
Long-running benchmarks (>5 minutes)
Monitoring progress
Use default mode for:
Production runs
CI/CD pipelines
Multiple parallel experiments