# Troubleshooting Guide

This document covers common issues and their solutions when using the LLM Autotuner.

## Common Issues and Solutions

### 1. InferenceService Creation Fails: "cannot unmarshal number into Go struct field"

**Error:**

```
cannot unmarshal number into Go struct field ObjectMeta.labels of type string
```

**Cause:** Label values in Kubernetes must be strings, but numeric values were provided.

**Solution:** Already fixed in templates. Labels are now quoted:

```yaml
labels:
  autotuner.io/experiment-id: "{{ experiment_id }}"  # Quoted
```

### 2. Deployment Fails: "spec.template.spec.containers[0].name: Required value"

**Error:**

```
spec.template.spec.containers[0].name: Required value
```

**Cause:** The container name was missing in the runner specification.

**Solution:** Already fixed. The template now includes:

```yaml
runner:
  name: ome-container  # Required field
```

### 3. SGLang Fails: "Can't load the configuration of '$MODEL_PATH'"

**Error:**

```
OSError: Can't load the configuration of '$MODEL_PATH'
```

**Cause:** The environment variable was not being expanded in the args list.

**Solution:** Already fixed. The template uses Kubernetes env var syntax:

```yaml
args:
  - --model-path
  - $(MODEL_PATH)  # Proper K8s env var expansion
```

### 4. BenchmarkJob Creation Fails: "spec.outputLocation: Required value"

**Error:**

```
spec.outputLocation: Required value
```

**Cause:** The OME BenchmarkJob CRD requires an output storage location.

**Solution:** Already fixed. The template includes:

```yaml
outputLocation:
  storageUri: "pvc://benchmark-results-pvc/{{ benchmark_name }}"
```

Make sure the PVC exists:

```bash
kubectl apply -f config/benchmark-pvc.yaml
```

### 5. BenchmarkJob Fails: "unknown storage type for URI: local:///"

**Error:**

```
unknown storage type for URI: local:///tmp/...
```

**Cause:** OME only supports `pvc://` (Persistent Volume Claims) and `oci://` (Object Storage).

**Solution:** Use PVC storage (already configured):

```bash
# Create the PVC first
kubectl apply -f config/benchmark-pvc.yaml

# The template automatically uses pvc://benchmark-results-pvc/
```

### 6. InferenceService Not Becoming Ready

**Symptoms:**
- InferenceService shows `Ready=False`
- Status: "ComponentNotReady: Target service not ready for ingress creation"

**Debugging Steps:**

```bash
# Check pod status
kubectl get pods -n autotuner

# Check pod logs
kubectl logs <pod-name> -n autotuner --tail=50

# Check InferenceService events
kubectl describe inferenceservice <name> -n autotuner
```

**Common Causes:**
- Model not found or not ready
- Runtime mismatch with the model
- Insufficient GPU resources
- Container image pull errors

**Typical Wait Time:** 60-90 seconds for model loading and CUDA graph capture

### 7. GPU Resource Issues in Minikube

**Problem:** Minikube with the Docker driver cannot access host GPUs.

**Symptoms:**
- Pods pending with: `Insufficient nvidia.com/gpu`
- NVIDIA device plugin shows: `"No devices found. Waiting indefinitely"`
- Occurs even with the `minikube start --gpus=all` flag

**Root Cause:** The nested containerization architecture prevents GPU access:

```
Host (with GPUs) → Docker → Minikube Container → Inner Docker → K8s Pods
```

The inner Docker daemon cannot see host GPUs even when the outer Docker has GPU access.
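To confirm this on your own setup, a quick hedged check (assuming `nvidia-smi` on the host and a Minikube node started with the Docker driver):

```bash
# On the host: GPUs are listed
nvidia-smi -L

# Inside the Minikube node: with the Docker driver, no NVIDIA devices are exposed
minikube ssh "ls /dev/nvidia* 2>/dev/null || echo 'no NVIDIA devices visible'"
```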
**Solutions:**

**Option A: Use Minikube with --driver=none** (requires bare metal)

```bash
# CAUTION: This runs Kubernetes directly on the host (no container isolation)
minikube start --driver=none
```

**Option B: Use a proper Kubernetes cluster**
- Production K8s with the NVIDIA GPU Operator
- Kind with GPU support
- K3s with proper GPU configuration

**Option C: Direct Docker deployment** (development/testing)

For quick testing without Kubernetes orchestration:

```bash
# Download the model
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
  --local-dir /tmp/llama-3.2-1b-instruct

# Run SGLang directly with Docker
docker run --gpus '"device=0"' -d --name sglang-llama \
  -p 8000:8080 \
  -v /tmp/llama-3.2-1b-instruct:/model \
  lmsysorg/sglang:v0.5.2-cu126 \
  python3 -m sglang.launch_server \
    --model-path /model \
    --host 0.0.0.0 \
    --port 8080 \
    --mem-frac 0.6

# Verify the deployment
curl http://localhost:8000/health
```

**Important Notes:**
- Check GPU availability first: `nvidia-smi`
- Select a GPU with sufficient free memory
- Adjust `--mem-frac` based on available GPU memory
- Use `device=N` to select a specific GPU (0-7)

### 8. SGLang CPU Backend Issues

**Problem:** The SGLang CPU version crashes in containers.

**Symptoms:**
- Pod logs stop at "Load weight end"
- The scheduler subprocess becomes defunct (zombie process)
- The server never starts or responds

**Root Cause:** The SGLang CPU backend (`lmsysorg/sglang:v0.5.3.post3-xeon`) has subprocess management issues in containerized environments.

**Solution:** Use a GPU-based deployment instead. CPU inference is not recommended for production or testing.

### 9. Model Download and Transfer Issues

#### Problem A: Gated Model Access Denied

**Error:**

```
401 Client Error: Unauthorized
Access to model meta-llama/Llama-3.2-1B-Instruct is restricted
```

**Solution:**

```bash
# 1. Accept the license on the HuggingFace website
#    Visit: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

# 2. Create an access token on HuggingFace
#    Visit: https://huggingface.co/settings/tokens

# 3. Create a Kubernetes secret (for OME)
kubectl create secret generic hf-token \
  --from-literal=token=<your-hf-token> \
  -n ome

# 4. Or log in locally (for direct download)
huggingface-cli login --token <your-hf-token>
```

#### Problem B: Transferring Large Model Files to Minikube

**Failed Methods:**
- `minikube cp /dir` → "Is a directory" error
- `minikube cp large.tar.gz` → "scp: Broken pipe" (files > 1GB)
- `cat file | minikube ssh` → Signal INT
- `rsync` → "protocol version mismatch"

**Working Solution:**

```bash
# Compress the model files
tar czf /tmp/model.tar.gz -C /tmp llama-3.2-1b-instruct

# Transfer using SCP with the Minikube SSH key
scp -i $(minikube ssh-key) /tmp/model.tar.gz \
  docker@$(minikube ip):~/

# Extract inside Minikube
minikube ssh "sudo mkdir -p /mnt/data/models && \
  sudo tar xzf ~/model.tar.gz -C /mnt/data/models/"

# Verify
minikube ssh "ls -lh /mnt/data/models/llama-3.2-1b-instruct"
```

**Size Reference:**
- Llama 3.2 1B: ~2.4GB uncompressed, ~887MB compressed
- Transfer time: ~30-60 seconds depending on disk speed

### 10. Docker GPU Out of Memory

**Symptoms:**
- Container starts but crashes during model loading
- Error: `torch.OutOfMemoryError: CUDA out of memory`
- CUDA graph capture fails

**Debugging:**

```bash
# Check GPU status and memory usage
nvidia-smi

# Look for existing workloads
nvidia-smi --query-compute-apps=pid,process_name,used_memory \
  --format=csv
```

**Solutions:**

**A. Select a different GPU:**

```bash
# Use GPU 1 instead of GPU 0
docker run --gpus '"device=1"' ...
```
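If it is not obvious which GPU has headroom, a minimal helper for picking the emptiest one (a sketch assuming `nvidia-smi` is installed; not part of the autotuner itself):

```bash
# Print GPU indices sorted by free memory (MiB), most free first
nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits \
  | sort -t, -k2 -nr
```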
**B. Reduce memory allocation:**

```bash
# Reduce the --mem-frac parameter
--mem-frac 0.6  # Instead of 0.8
```

**C. Stop competing workloads:**

```bash
# Identify the process using the GPU
ps aux | grep <pid>

# Kill it if safe to do so
kill <pid>
```

**Memory Allocation Guide:**
- Small models (1-3B): `--mem-frac 0.6-0.7`
- Medium models (7-13B): `--mem-frac 0.8-0.85`
- Large models (70B+): `--mem-frac 0.9-0.95`

Always leave 10-20% of GPU memory free for activations and temporary tensors.

### 11. Wrong Model or Runtime Name

**Symptoms:**
- InferenceService fails to create
- Error about model or runtime not found

**Solution:**

```bash
# List available models
kubectl get clusterbasemodels

# List available runtimes
kubectl get clusterservingruntimes

# Update examples/simple_task.json with the correct names
```

### 12. Network and Proxy Configuration

**Problem:** Images can't be pulled and models can't be downloaded in Minikube.

**Symptoms:**
- `ImagePullBackOff` errors
- The OME model agent can't download models
- Connection timeouts

**Solution: Configure the Docker proxy in Minikube**

```bash
# SSH into Minikube
minikube ssh

# Create the proxy configuration
# (proxy addresses below are placeholders; substitute your environment's values)
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf <<EOF
[Service]
Environment="HTTP_PROXY=http://<proxy-host>:<proxy-port>"
Environment="HTTPS_PROXY=http://<proxy-host>:<proxy-port>"
Environment="NO_PROXY=localhost,127.0.0.1"
EOF

# Reload and restart Docker so the proxy settings take effect
sudo systemctl daemon-reload
sudo systemctl restart docker
```

### 13. BenchmarkJob Fails or Times Out

**Debugging:**

```bash
# Check benchmark pod logs
kubectl logs <benchmark-pod-name> -n autotuner

# Check BenchmarkJob status
kubectl describe benchmarkjob <benchmark-name> -n autotuner
```

**Common Causes:**
- InferenceService endpoint not reachable
- Traffic scenarios too demanding
- Timeout settings too low

## Monitoring Tips

### Watch Resources in Real-Time

```bash
# All resources in the autotuner namespace
watch kubectl get inferenceservices,benchmarkjobs,pods -n autotuner

# Just InferenceServices
kubectl get inferenceservices -n autotuner -w

# Pod logs
kubectl logs -f <pod-name> -n autotuner
```

### Check OME Controller Logs

```bash
kubectl logs -n ome deployment/ome-controller-manager --tail=100
```

## Performance Tips

1. **Reduce timeout values** for faster iteration during development:

   ```json
   "optimization": {
     "timeout_per_iteration": 300  // 5 minutes instead of 10
   }
   ```

2. **Use smaller benchmark workloads** for testing:

   ```json
   "benchmark": {
     "traffic_scenarios": ["D(100,100)"],   // Lighter load
     "max_requests_per_iteration": 50       // Fewer requests
   }
   ```

3. **Limit the parameter grid** for initial testing:

   ```json
   "parameters": {
     "mem_frac": {"type": "choice", "values": [0.85, 0.9]}  // Just 2 values
   }
   ```
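After trimming the config with the tips above, it is worth a quick sanity check that the edited task file still parses (a minimal sketch assuming `python3` is on the PATH and the file path matches your setup):

```bash
# Fails loudly if the edited task definition is not valid JSON
python3 -m json.tool examples/simple_task.json > /dev/null && echo "valid JSON"
```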
---

## Viewing GenAI-Bench Logs

### Verbose Mode for Real-Time Output

Use the `--verbose` or `-v` flag to stream genai-bench output in real time:

```bash
python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose
```

**Benefits**:
- Real-time feedback during benchmarks
- See progress during long runs
- Useful for debugging connection/API issues
- Detect problems early

**Default mode** (no flag) shows output only after completion.

### Usage Examples

**Debugging connection issues:**

```bash
python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose
```

**Long-running benchmarks with a log file:**

```bash
python src/run_autotuner.py examples/docker_task.json --mode docker --direct --verbose 2>&1 | tee autotuner.log
```

### Manual Log Inspection

View benchmark results directly:

```bash
# List all benchmark results
ls -R benchmark_results/

# View a specific experiment's metadata
cat benchmark_results/docker-simple-tune-exp1/experiment_metadata.json
```

### When to Use Verbose Mode

**Use verbose for**:
- Initial testing of new configurations
- Debugging connection/API issues
- Long-running benchmarks (>5 minutes)
- Monitoring progress

**Use default mode for**:
- Production runs
- CI/CD pipelines
- Multiple parallel experiments
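Complementing the manual log inspection above, a small helper to pretty-print every experiment's metadata under `benchmark_results/` at once (a sketch assuming `python3` is available; adjust the directory to your output location):

```bash
# Print each metadata file path, then its pretty-printed contents
find benchmark_results/ -name experiment_metadata.json -print \
  -exec python3 -m json.tool {} \;
```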