Troubleshooting Guide

This document covers common issues and their solutions when using the LLM Autotuner.

Common Issues and Solutions

1. InferenceService Creation Fails: “cannot unmarshal number into Go struct field”

Error:

cannot unmarshal number into Go struct field ObjectMeta.labels of type string

Cause: Label values in Kubernetes must be strings, but numeric values were provided.

Solution: Already fixed in templates. Labels are now quoted:

labels:
  autotuner.io/experiment-id: "{{ experiment_id }}"  # Quoted
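
To confirm the rendered labels really are strings on a live object, you can print them directly (the service name and the autotuner namespace below are assumptions; substitute your own):

# All label values should print as quoted strings, e.g. "autotuner.io/experiment-id":"exp1"
kubectl get inferenceservice <name> -n autotuner -o jsonpath='{.metadata.labels}'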

2. Deployment Fails: “spec.template.spec.containers[0].name: Required value”

Error:

spec.template.spec.containers[0].name: Required value

Cause: Container name was missing in the runner specification.

Solution: Already fixed. Template now includes:

runner:
  name: ome-container  # Required field
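
If you still hit this error with a custom template, a quick check is to print the container names of the rendered Deployment; none should be empty (the resource name and namespace below are assumptions):

# Every container in the pod template must have a non-empty name
kubectl get deployment <inference-deployment> -n autotuner \
  -o jsonpath='{.spec.template.spec.containers[*].name}'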

3. SGLang Fails: “Can’t load the configuration of ‘$MODEL_PATH’”

Error:

OSError: Can't load the configuration of '$MODEL_PATH'

Cause: The environment variable was not expanded in the args list; Kubernetes only interpolates the $(VAR) form in container args, not shell-style $VAR.

Solution: Already fixed. Template uses K8s env var syntax:

args:
  - --model-path
  - $(MODEL_PATH)  # Proper K8s env var expansion
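
To verify the variable actually resolves at runtime, you can check inside the running pod (pod name and namespace are assumptions):

# MODEL_PATH should print a real path and the directory should list model files
kubectl exec <pod-name> -n autotuner -- sh -c 'echo "$MODEL_PATH" && ls "$MODEL_PATH"'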

4. BenchmarkJob Creation Fails: “spec.outputLocation: Required value”

Error:

spec.outputLocation: Required value

Cause: OME BenchmarkJob CRD requires an output storage location.

Solution: Already fixed. Template includes:

outputLocation:
  storageUri: "pvc://benchmark-results-pvc/{{ benchmark_name }}"

Make sure the PVC exists:

kubectl apply -f config/benchmark-pvc.yaml
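
After applying it, confirm the claim is bound before launching benchmarks (the namespace is an assumption; use whichever namespace your BenchmarkJobs run in):

# STATUS should be "Bound"; "Pending" usually means no matching StorageClass or PV
kubectl get pvc benchmark-results-pvc -n autotuner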

5. BenchmarkJob Fails: “unknown storage type for URI: local:///”

Error:

unknown storage type for URI: local:///tmp/...

Cause: OME only supports pvc:// (Persistent Volume Claims) and oci:// (Object Storage).

Solution: Use PVC storage (already configured):

# Create PVC first
kubectl apply -f config/benchmark-pvc.yaml

# Template automatically uses pvc://benchmark-results-pvc/

6. InferenceService Not Becoming Ready

Symptoms:

  • InferenceService shows Ready=False

  • Status: “ComponentNotReady: Target service not ready for ingress creation”

Debugging Steps:

# Check pod status
kubectl get pods -n autotuner

# Check pod logs
kubectl logs <pod-name> -n autotuner --tail=50

# Check InferenceService events
kubectl describe inferenceservice <name> -n autotuner

Common Causes:

  • Model not found or not ready

  • Runtime mismatch with model

  • Insufficient GPU resources

  • Container image pull errors

Typical Wait Time: 60-90 seconds for model loading and CUDA graph capture
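
Rather than polling manually, you can block until the service reports Ready (a sketch; the name, namespace, and timeout are assumptions):

# Returns once the Ready condition is True, or fails after the timeout
kubectl wait inferenceservice/<name> -n autotuner \
  --for=condition=Ready --timeout=300s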

7. GPU Resource Issues in Minikube

Problem: Minikube with the Docker driver cannot access host GPUs

Symptoms:

  • Pods pending with: Insufficient nvidia.com/gpu

  • NVIDIA device plugin shows: "No devices found. Waiting indefinitely"

  • Occurs even with the minikube start --gpus=all flag

Root Cause: Nested containerization architecture prevents GPU access:

Host (with GPUs) → Docker → Minikube Container → Inner Docker → K8s Pods

The inner Docker daemon cannot see host GPUs even when outer Docker has GPU access.
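
A quick way to confirm this on your cluster is to check whether the node advertises any GPU resources (a sketch; the node name "minikube" is the default and may differ):

# With the Docker driver this typically shows no nvidia.com/gpu entry
kubectl get node minikube -o jsonpath='{.status.allocatable}'

# nvidia-smi is usually not available inside the Minikube container, which confirms the issue
minikube ssh "nvidia-smi"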

Solutions:

Option A: Use Minikube with --driver=none (Requires bare metal)

# CAUTION: This runs Kubernetes directly on host (no container isolation)
minikube start --driver=none

Option B: Use proper Kubernetes cluster

  • Production K8s with NVIDIA GPU Operator

  • Kind with GPU support

  • K3s with proper GPU configuration

Option C: Direct Docker deployment (Development/Testing)

For quick testing without Kubernetes orchestration:

# Download model
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
  --local-dir /tmp/llama-3.2-1b-instruct

# Run SGLang directly with Docker
docker run --gpus '"device=0"' -d --name sglang-llama \
  -p 8000:8080 \
  -v /tmp/llama-3.2-1b-instruct:/model \
  lmsysorg/sglang:v0.5.2-cu126 \
  python3 -m sglang.launch_server \
  --model-path /model \
  --host 0.0.0.0 \
  --port 8080 \
  --mem-frac 0.6

# Verify deployment
curl http://localhost:8000/health

Important Notes:

  • Check GPU availability first: nvidia-smi

  • Select a GPU with sufficient free memory (see the query sketch after this list)

  • Adjust --mem-frac based on available GPU memory

  • Use device=N to select a specific GPU (e.g., 0-7 on an 8-GPU host)
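
A minimal query for picking a GPU, assuming a recent NVIDIA driver:

# List free/total memory per GPU, then pass the chosen index as device=N
nvidia-smi --query-gpu=index,name,memory.free,memory.total --format=csv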

8. SGLang CPU Backend Issues

Problem: SGLang CPU version crashes in containers

Symptoms:

  • Pod logs stop at “Load weight end”

  • Scheduler subprocess becomes defunct (zombie process)

  • Server never starts or responds

Root Cause: SGLang CPU backend (lmsysorg/sglang:v0.5.3.post3-xeon) has subprocess management issues in containerized environments.

Solution: Use GPU-based deployment instead. CPU inference is not recommended for production or testing.

9. Model Download and Transfer Issues

Problem A: Gated Model Access Denied

Error:

401 Client Error: Unauthorized
Access to model meta-llama/Llama-3.2-1B-Instruct is restricted

Solution:

# 1. Accept license on HuggingFace website
# Visit: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

# 2. Create access token on HuggingFace
# Visit: https://huggingface.co/settings/tokens

# 3. Create Kubernetes secret (for OME)
kubectl create secret generic hf-token \
  --from-literal=token=<your-token> \
  -n ome

# 4. Or login locally (for direct download)
huggingface-cli login --token <your-token>
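
To sanity-check the token before running the autotuner, you can verify the login and fetch just the model's config file (a sketch; the local path is illustrative):

# Should print your HuggingFace username
huggingface-cli whoami

# Downloading only config.json is a cheap way to confirm gated access works
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct config.json \
  --local-dir /tmp/hf-access-check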

Problem B: Transferring Large Model Files to Minikube

Failed Methods:

  • minikube cp /dir → “Is a directory” error

  • minikube cp large.tar.gz → “scp: Broken pipe” (files > 1GB)

  • cat file | minikube ssh → Signal INT

  • rsync → “protocol version mismatch”

Working Solution:

# Compress model files
tar czf /tmp/model.tar.gz -C /tmp llama-3.2-1b-instruct

# Transfer using SCP with Minikube SSH key
scp -i $(minikube ssh-key) /tmp/model.tar.gz \
  docker@$(minikube ip):~/

# Extract inside Minikube
minikube ssh "sudo mkdir -p /mnt/data/models && \
  sudo tar xzf ~/model.tar.gz -C /mnt/data/models/"

# Verify
minikube ssh "ls -lh /mnt/data/models/llama-3.2-1b-instruct"

Size Reference:

  • Llama 3.2 1B: ~2.4GB uncompressed, ~887MB compressed

  • Transfer time: ~30-60 seconds depending on disk speed

10. Docker GPU Out of Memory

Symptoms:

  • Container starts but crashes during model loading

  • Error: torch.OutOfMemoryError: CUDA out of memory

  • CUDA graph capture fails

Debugging:

# Check GPU status and memory usage
nvidia-smi

# Look for existing workloads
nvidia-smi --query-compute-apps=pid,process_name,used_memory \
  --format=csv

Solutions:

A. Select a different GPU:

# Use GPU 1 instead of GPU 0
docker run --gpus '"device=1"' ...

B. Reduce memory allocation:

# Reduce --mem-frac parameter
--mem-frac 0.6  # Instead of 0.8

C. Stop competing workloads:

# Identify process using GPU
ps aux | grep <pid-from-nvidia-smi>

# Kill if safe to do so
kill <pid>

Memory Allocation Guide:

  • Small models (1-3B): --mem-frac 0.6-0.7

  • Medium models (7-13B): --mem-frac 0.8-0.85

  • Large models (70B+): --mem-frac 0.9-0.95

Always leave 10-20% GPU memory free for activations and temporary tensors.
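
As a rough sizing sketch (assuming FP16 weights at ~2 bytes per parameter, and that --mem-frac reserves that fraction of GPU memory for weights plus KV cache), you can estimate the headroom before picking a value:

# Hypothetical example: 7B model on an 80 GB GPU with --mem-frac 0.85
PARAMS_B=7      # model parameters, in billions
GPU_GB=80       # total GPU memory
MEM_FRAC=0.85

awk -v p="$PARAMS_B" -v g="$GPU_GB" -v f="$MEM_FRAC" 'BEGIN {
  weights = p * 2                # ~2 GB per billion params in FP16
  pool    = g * f                # memory reserved by --mem-frac
  printf "weights ~%.0f GB, reserved pool ~%.1f GB, headroom ~%.1f GB\n",
         weights, pool, g - pool
}'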

11. Wrong Model or Runtime Name

Symptoms:

  • InferenceService fails to create

  • Error about model or runtime not found

Solution:

# List available models
kubectl get clusterbasemodels

# List available runtimes
kubectl get clusterservingruntimes

# Update examples/simple_task.json with correct names

12. Network and Proxy Configuration

Problem: Images can't be pulled and models can't be downloaded from inside Minikube

Symptoms:

  • ImagePullBackOff errors

  • OME model agent can’t download models

  • Connection timeouts

Solution: Configure Docker proxy in Minikube

# SSH into Minikube
minikube ssh

# Create proxy configuration
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf <<EOF
[Service]
Environment="HTTP_PROXY=http://YOUR_PROXY:PORT"
Environment="HTTPS_PROXY=http://YOUR_PROXY:PORT"
Environment="NO_PROXY=localhost,127.0.0.1,10.96.0.0/12"
EOF

# Reload and restart Docker
sudo systemctl daemon-reload
sudo systemctl restart docker

# Exit Minikube SSH
exit

# Restart Minikube to apply changes
minikube stop && minikube start

Configure OME Model Agent:

# Patch model agent DaemonSet
kubectl set env daemonset/ome-model-agent \
  -n ome \
  HTTP_PROXY=http://YOUR_PROXY:PORT \
  HTTPS_PROXY=http://YOUR_PROXY:PORT \
  NO_PROXY=localhost,127.0.0.1,10.96.0.0/12

# Wait for pods to restart
kubectl rollout status daemonset/ome-model-agent -n ome
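
To confirm the proxy settings actually reached the running pods (the DaemonSet name and namespace are taken from above; adjust if yours differ):

# Should print the HTTP_PROXY/HTTPS_PROXY/NO_PROXY values set above
kubectl exec daemonset/ome-model-agent -n ome -- env | grep -i proxy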

13. BenchmarkJob Stays in “Running” Status

Symptoms:

  • BenchmarkJob doesn’t complete

  • No error messages

Debugging:

# Check benchmark pod logs
kubectl get pods -n autotuner | grep bench
kubectl logs <benchmark-pod> -n autotuner

# Check BenchmarkJob status
kubectl describe benchmarkjob <name> -n autotuner

Common Causes:

  • InferenceService endpoint not reachable (see the in-cluster check after this list)

  • Traffic scenarios too demanding

  • Timeout settings too low
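
To rule out the first cause, curl the InferenceService from inside the cluster (a sketch; the service name, port, and /health path are assumptions — check kubectl get svc -n autotuner for the real values):

# Run a throwaway curl pod inside the cluster and hit the service's health endpoint
kubectl run curl-check --rm -it --restart=Never -n autotuner \
  --image=curlimages/curl -- \
  curl -sv http://<inferenceservice-svc>.autotuner.svc.cluster.local:8080/health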

Monitoring Tips

Watch Resources in Real-Time

# All resources in autotuner namespace
watch kubectl get inferenceservices,benchmarkjobs,pods -n autotuner

# Just InferenceServices
kubectl get inferenceservices -n autotuner -w

# Pod logs
kubectl logs -f <pod-name> -n autotuner

Check OME Controller Logs

kubectl logs -n ome deployment/ome-controller-manager --tail=100

Performance Tips

  1. Reduce timeout values for faster iteration during development:

    "optimization": {
      "timeout_per_iteration": 300  // 5 minutes instead of 10
    }
    
  2. Use smaller benchmark workloads for testing:

    "benchmark": {
      "traffic_scenarios": ["D(100,100)"],  // Lighter load
      "max_requests_per_iteration": 50      // Fewer requests
    }
    
  3. Limit parameter grid for initial testing:

    "parameters": {
      "mem_frac": {"type": "choice", "values": [0.85, 0.9]}  // Just 2 values
    }
    

Viewing GenAI-Bench Logs

Verbose Mode for Real-Time Output

Use the --verbose or -v flag to stream genai-bench output in real-time:

python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose

Benefits:

  • Real-time feedback during benchmarks

  • See progress during long runs

  • Useful for debugging connection/API issues

  • Detect problems early

Default mode (no flag) shows output only after completion.

Usage Examples

Debugging connection issues:

python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose

Long-running benchmarks with log file:

python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose 2>&1 | tee autotuner.log

Manual Log Inspection

View benchmark results directly:

# List all benchmark results
ls -R benchmark_results/

# View specific experiment metadata
cat benchmark_results/docker-simple-tune-exp1/experiment_metadata.json

When to Use Verbose Mode

Use verbose for:

  • Initial testing of new configurations

  • Debugging connection/API issues

  • Long-running benchmarks (>5 minutes)

  • Monitoring progress

Use default mode for:

  • Production runs

  • CI/CD pipelines

  • Multiple parallel experiments