# GPU Resource Tracking

Comprehensive GPU monitoring, allocation, and scheduling system for the LLM Autotuner.

## Table of Contents

- [Overview](#overview)
- [Architecture](#architecture)
- [GPU Monitoring](#gpu-monitoring)
- [GPU Allocation](#gpu-allocation)
- [GPU Scheduling](#gpu-scheduling)
- [Real-Time Monitoring](#real-time-monitoring)
- [Frontend Visualization](#frontend-visualization)
- [API Reference](#api-reference)
- [Configuration](#configuration)
- [Troubleshooting](#troubleshooting)
- [Performance Considerations](#performance-considerations)

## Overview

The GPU tracking system provides end-to-end visibility and intelligent management of GPU resources throughout the autotuning workflow:

- **Monitoring**: Real-time collection of GPU metrics (utilization, memory, temperature, power)
- **Allocation**: Intelligent GPU selection for experiments based on availability scoring
- **Scheduling**: Task-level GPU requirement estimation and availability checking
- **Visualization**: Rich frontend charts and tables for GPU metrics analysis

### Key Features

- Automatic GPU detection via nvidia-smi
- Smart GPU allocation using composite scoring (memory + utilization)
- GPU-aware task scheduling with timeout-based waiting
- Real-time GPU monitoring during benchmark execution
- Frontend visualization with Recharts
- Detailed GPU information in experiment results

### Supported Modes

- **Docker Mode**: Full GPU tracking, allocation, and scheduling support
- **OME/Kubernetes Mode**: GPU monitoring and visualization only (allocation handled by the K8s scheduler)

## Architecture

### System Components

```
┌──────────────────────────────────────────────────────────────┐
│                       Frontend (React)                        │
│                                                                │
│ ┌────────────────┐  ┌──────────────────────────────────┐     │
│ │ GPU Metrics    │  │ Experiments Page                 │     │
│ │ Chart          │  │ - GPU count column               │     │
│ │ - Utilization  │  │ - GPU model info                 │     │
│ │ - Memory       │  │ - Monitoring data indicator      │     │
│ │ - Temperature  │  │                                  │     │
│ │ - Power        │  │                                  │     │
│ └────────────────┘  └──────────────────────────────────┘     │
└──────────────────────────────────────────────────────────────┘
                               ▲
                               │ REST API (JSON)
                               │
┌──────────────────────────────────────────────────────────────┐
│                       Backend (Python)                        │
│                                                                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ ARQ Worker (autotuner_worker.py)                       │  │
│  │ - GPU requirement estimation                           │  │
│  │ - Availability checking before task start              │  │
│  │ - Wait for GPU availability (timeout)                  │  │
│  └────────────────────────────────────────────────────────┘  │
│                               │                                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Orchestrator                                           │  │
│  │ - Coordinates experiments                              │  │
│  │ - Passes GPU indices                                   │  │
│  └────────────────────────────────────────────────────────┘  │
│                               │                                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Controllers                                            │  │
│  │                                                        │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │ DockerController                                 │  │  │
│  │  │ - Smart GPU allocation (select_gpus_for_task)    │  │  │
│  │  │ - Device requests with specific GPU IDs          │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  │                                                        │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │ DirectBenchmarkController                        │  │  │
│  │  │ - Real-time GPU monitoring during benchmark      │  │  │
│  │  │ - Aggregates stats (min/max/mean)                │  │  │
│  │  │ - Returns monitoring data with metrics           │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────┘  │
│                               │                                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Utilities                                              │  │
│  │                                                        │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │ gpu_monitor.py                                   │  │  │
│  │  │ - nvidia-smi wrapper                             │  │  │
│  │  │ - GPU availability scoring                       │  │  │
│  │  │ - Continuous monitoring thread                   │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  │                                                        │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │ gpu_scheduler.py                                 │  │  │
│  │  │ - GPU requirement estimation                     │  │  │
│  │  │ - Availability checking                          │  │  │
│  │  │ - Wait-for-availability with timeout             │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                               ▲
                               │ nvidia-smi
                               │
                          ┌────┴────┐
                          │  GPUs   │
                          └─────────┘
```

### Data Flow

1. **Task Submission**: User creates task via frontend or API
2. **Task Scheduling**: ARQ worker checks GPU availability before starting
3. **GPU Allocation**: DockerController selects optimal GPUs for experiment
4. **Experiment Execution**: DirectBenchmarkController monitors GPUs during benchmark
5. **Data Collection**: GPU metrics aggregated and stored with experiment results
6. **Visualization**: Frontend displays GPU info and monitoring charts

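
The snippet below is a condensed, illustrative walkthrough of this flow using the public helpers documented in the following sections (`estimate_gpu_requirements`, `wait_for_gpu_availability`, `select_gpus_for_task`, and the GPU monitor). It is a sketch only: in the real system these steps run inside the ARQ worker and the controllers, and the task configuration shown here is a placeholder.

```python
from src.controllers.docker_controller import DockerController
from src.utils.gpu_monitor import get_gpu_monitor
from src.utils.gpu_scheduler import estimate_gpu_requirements, wait_for_gpu_availability

# Placeholder task configuration (step 1: task submission)
task_config = {
    "model": {"id_or_path": "llama-3-70b"},
    "parameters": {"tp-size": [1, 2, 4]},
}

# Step 2: estimate requirements and wait (with timeout) for enough free GPUs
required_gpus, estimated_memory_mb = estimate_gpu_requirements(task_config)
is_available, message = wait_for_gpu_availability(
    required_gpus=required_gpus,
    min_memory_mb=estimated_memory_mb,
    timeout_seconds=300,
    check_interval=30,
)
if not is_available:
    raise RuntimeError(f"GPUs unavailable: {message}")

# Step 3: let the Docker controller pick the best-scoring GPUs
controller = DockerController(docker_model_path="/mnt/data/models", verbose=True)
gpu_indices = controller.select_gpus_for_task(
    required_gpus=required_gpus,
    min_memory_mb=estimated_memory_mb,
)

# Steps 4-5: sample GPU metrics while the benchmark runs, then collect the stats
monitor = get_gpu_monitor()
monitor.start_monitoring(interval_seconds=1.0)
# ... deploy the serving container on gpu_indices and run genai-bench here ...
gpu_monitoring = monitor.stop_monitoring()
```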

## GPU Monitoring

### Overview

The `gpu_monitor.py` module provides a singleton-based GPU monitoring system with caching and continuous monitoring capabilities.

### Core Components

#### GPUMonitor Singleton

```python
from src.utils.gpu_monitor import get_gpu_monitor

# Get the global GPU monitor instance
gpu_monitor = get_gpu_monitor()

# Check if nvidia-smi is available
if gpu_monitor.is_available():
    print("GPU monitoring available")
```

#### Query GPU Status

```python
# Get current GPU snapshot (uses cache if recent)
snapshot = gpu_monitor.query_gpus()

# Force fresh query (bypass cache)
snapshot = gpu_monitor.query_gpus(use_cache=False)

# Access GPU data
for gpu in snapshot.gpus:
    print(f"GPU {gpu.index}: {gpu.name}")
    print(f"  Memory: {gpu.memory_used_mb}/{gpu.memory_total_mb} MB")
    print(f"  Utilization: {gpu.utilization_percent}%")
    print(f"  Temperature: {gpu.temperature_c}°C")
    print(f"  Power: {gpu.power_draw_w}W")
```

#### Find Available GPUs

```python
# Get GPUs with <50% utilization and at least 8GB free
available_gpus = gpu_monitor.get_available_gpus(
    min_memory_mb=8000,
    max_utilization=50
)

print(f"Available GPU indices: {available_gpus}")
```

#### GPU Availability Scoring

The system uses a composite scoring algorithm to rank GPUs:

```python
score = 0.6 × memory_score + 0.4 × utilization_score

where:
  memory_score      = memory_free_mb / memory_total_mb
  utilization_score = (100 - utilization_percent) / 100
```

This prioritizes GPUs with:

- More free memory (60% weight)
- Lower utilization (40% weight)

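
For reference, the composite score can be reproduced from the per-GPU fields shown above. This is a minimal sketch, assuming free memory is derived as `memory_total_mb - memory_used_mb`; the helper name `availability_score` is illustrative, not the actual function in `gpu_monitor.py`.

```python
def availability_score(gpu) -> float:
    """Composite availability score in [0, 1]; higher means more available."""
    memory_free_mb = gpu.memory_total_mb - gpu.memory_used_mb  # assumed derivation
    memory_score = memory_free_mb / gpu.memory_total_mb
    utilization_score = (100 - gpu.utilization_percent) / 100
    return 0.6 * memory_score + 0.4 * utilization_score

# Rank GPUs from most to least available
snapshot = gpu_monitor.query_gpus(use_cache=False)
for gpu in sorted(snapshot.gpus, key=availability_score, reverse=True):
    print(f"GPU {gpu.index}: score={availability_score(gpu):.2f}")
```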

### Continuous Monitoring

For real-time monitoring during long-running operations:

```python
# Start monitoring thread (samples every 1 second)
gpu_monitor.start_monitoring(interval_seconds=1.0)

# ... run your workload ...

# Stop monitoring and get aggregated stats
stats = gpu_monitor.stop_monitoring()

# Access aggregated data
print(f"Monitoring duration: {stats['monitoring_duration_seconds']}s")
print(f"Sample count: {stats['sample_count']}")

for gpu_index, gpu_stats in stats['gpu_stats'].items():
    print(f"\nGPU {gpu_index}:")
    print(f"  Utilization: {gpu_stats['utilization']['mean']:.1f}%")
    print(f"  Range: {gpu_stats['utilization']['min']:.0f}% - {gpu_stats['utilization']['max']:.0f}%")
    print(f"  Memory: {gpu_stats['memory_used_mb']['mean']:.0f} MB")
    print(f"  Temperature: {gpu_stats['temperature_c']['mean']:.0f}°C")
    print(f"  Power: {gpu_stats['power_draw_w']['mean']:.1f}W")
```

### Cache Behavior

- Default cache TTL: 5 seconds
- Cache cleared on manual refresh (`use_cache=False`)
- Cache invalidated when monitoring starts/stops

## GPU Allocation

### Overview

The DockerController implements intelligent GPU allocation for experiments using the monitoring system.

### Allocation Strategy

```python
def select_gpus_for_task(self, required_gpus: int, min_memory_mb: int = 8000) -> List[int]:
    """
    Select optimal GPUs for task execution.

    Args:
        required_gpus: Number of GPUs needed
        min_memory_mb: Minimum free memory per GPU (default: 8GB)

    Returns:
        List of GPU indices (e.g., [0, 1])

    Raises:
        RuntimeError: If insufficient GPUs available
    """
```

### Allocation Examples

#### Single GPU Allocation

```python
from src.controllers.docker_controller import DockerController

controller = DockerController(
    docker_model_path="/mnt/data/models",
    verbose=True
)

# Allocate 1 GPU with at least 8GB free
gpu_indices = controller.select_gpus_for_task(
    required_gpus=1,
    min_memory_mb=8000
)
# Result: [2]  # GPU 2 had the highest availability score
```

#### Multi-GPU Allocation

```python
# Allocate 4 GPUs for tensor parallelism
gpu_indices = controller.select_gpus_for_task(
    required_gpus=4,
    min_memory_mb=10000
)
# Result: [1, 2, 3, 5]  # Best 4 GPUs by composite score
```

### Allocation Process

1. **Query GPUs**: Get current status via `gpu_monitor.query_gpus(use_cache=False)`
2. **Filter GPUs**: Remove GPUs with insufficient memory
3. **Score GPUs**: Calculate composite score (memory 60% + utilization 40%)
4. **Sort & Select**: Return top N GPUs by score
5. **Validate**: Raise error if insufficient GPUs available

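
Put together, the selection logic is roughly equivalent to the sketch below. This is illustrative only (not the actual `DockerController` code) and assumes the per-GPU fields shown in the GPU Monitoring section, with free memory derived as total minus used.

```python
from typing import List

from src.utils.gpu_monitor import get_gpu_monitor

gpu_monitor = get_gpu_monitor()

def select_gpus_sketch(required_gpus: int, min_memory_mb: int = 8000) -> List[int]:
    # 1. Query fresh GPU status (bypass the cache)
    snapshot = gpu_monitor.query_gpus(use_cache=False)

    # 2. Filter out GPUs without enough free memory
    candidates = [
        gpu for gpu in snapshot.gpus
        if (gpu.memory_total_mb - gpu.memory_used_mb) >= min_memory_mb
    ]

    # 3. Score candidates: memory 60%, utilization 40%
    def score(gpu) -> float:
        memory_score = (gpu.memory_total_mb - gpu.memory_used_mb) / gpu.memory_total_mb
        utilization_score = (100 - gpu.utilization_percent) / 100
        return 0.6 * memory_score + 0.4 * utilization_score

    # 4. Sort by score and take the top N
    candidates.sort(key=score, reverse=True)
    selected = [gpu.index for gpu in candidates[:required_gpus]]

    # 5. Validate that enough GPUs were found
    if len(selected) < required_gpus:
        raise RuntimeError(
            f"Only {len(selected)} of {required_gpus} requested GPUs are available"
        )
    return selected
```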

### Docker Integration

Selected GPUs are passed to Docker via `device_requests`:

```python
device_request = docker.types.DeviceRequest(
    device_ids=[str(idx) for idx in gpu_indices],
    capabilities=[['gpu']]
)

container = client.containers.run(
    image=image_name,
    device_requests=[device_request],
    # ... other params
)
```

**IMPORTANT**: Do NOT set the `CUDA_VISIBLE_DEVICES` environment variable when using `device_requests`. Docker handles GPU visibility automatically.

## GPU Scheduling

### Overview

The `gpu_scheduler.py` module provides task-level GPU resource management with intelligent waiting.

### GPU Requirement Estimation

```python
from src.utils.gpu_scheduler import estimate_gpu_requirements

task_config = {
    "model": {"id_or_path": "llama-3-70b"},
    "parameters": {
        "tp-size": [1, 2, 4],  # Tensor parallelism
        "pp-size": [1],        # Pipeline parallelism
        "dp-size": [1]         # Data parallelism
    }
}

required_gpus, estimated_memory_mb = estimate_gpu_requirements(task_config)
# Result: (4, 20000)  # 4 GPUs needed, ~20GB per GPU for 70B model
```

### World Size Calculation

```python
world_size = tp × pp × max(dp, dcp, cp)

where:
  tp  = tensor_parallel_size
  pp  = pipeline_parallel_size
  dp  = data_parallel_size
  cp  = context_parallel_size
  dcp = decode_context_parallel_size
```

### Memory Estimation Heuristics

- **70B/65B models**: 20,000 MB per GPU
- **13B/7B models**: 12,000 MB per GPU
- **Unknown/small models**: 8,000 MB per GPU (base)

### Parameter Name Formats

The estimator supports multiple parameter naming conventions:

```python
# All these are equivalent for tensor parallelism:
"tensor-parallel-size": [1, 2, 4]
"tp-size": [1, 2, 4]
"tp_size": [1, 2, 4]
"tp": [1, 2, 4]
```

Supported parameters:

- Tensor Parallel: `tensor-parallel-size`, `tp-size`, `tp_size`, `tp`
- Pipeline Parallel: `pipeline-parallel-size`, `pp-size`, `pp_size`, `pp`
- Data Parallel: `data-parallel-size`, `dp-size`, `dp_size`, `dp`
- Context Parallel: `context-parallel-size`, `cp-size`, `cp_size`, `cp`
- Decode Context Parallel: `decode-context-parallel-size`, `dcp-size`, `dcp_size`, `dcp`

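
The heuristics above can be combined into a small estimator. The sketch below is illustrative (the real `estimate_gpu_requirements` may differ in detail); it assumes that the largest value in each parameter grid determines the GPU requirement, which matches the 70B example above (`tp-size: [1, 2, 4]` gives 4 GPUs).

```python
from typing import Any, Dict, Tuple

def estimate_requirements_sketch(task_config: Dict[str, Any]) -> Tuple[int, int]:
    params = task_config.get("parameters", {})

    def max_value(*names: str) -> int:
        # Accept any of the naming variants listed above; grids are lists of candidates.
        for name in names:
            if name in params:
                value = params[name]
                return max(value) if isinstance(value, list) else int(value)
        return 1

    tp = max_value("tensor-parallel-size", "tp-size", "tp_size", "tp")
    pp = max_value("pipeline-parallel-size", "pp-size", "pp_size", "pp")
    dp = max_value("data-parallel-size", "dp-size", "dp_size", "dp")
    cp = max_value("context-parallel-size", "cp-size", "cp_size", "cp")
    dcp = max_value("decode-context-parallel-size", "dcp-size", "dcp_size", "dcp")

    # world_size = tp × pp × max(dp, dcp, cp)
    required_gpus = tp * pp * max(dp, dcp, cp)

    # Per-GPU memory heuristic keyed off the model name
    model_name = str(task_config.get("model", {}).get("id_or_path", "")).lower()
    if "70b" in model_name or "65b" in model_name:
        estimated_memory_mb = 20000
    elif "13b" in model_name or "7b" in model_name:
        estimated_memory_mb = 12000
    else:
        estimated_memory_mb = 8000

    return required_gpus, estimated_memory_mb

print(estimate_requirements_sketch({
    "model": {"id_or_path": "llama-3-70b"},
    "parameters": {"tp-size": [1, 2, 4]},
}))  # -> (4, 20000)
```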

### Availability Checking

```python
from src.utils.gpu_scheduler import check_gpu_availability

# Check if 4 GPUs with 10GB free are available
is_available, message = check_gpu_availability(
    required_gpus=4,
    min_memory_mb=10000
)

if is_available:
    print(f"GPUs available: {message}")
else:
    print(f"GPUs unavailable: {message}")
```

### Wait for Availability

```python
from src.utils.gpu_scheduler import wait_for_gpu_availability

# Wait up to 5 minutes for GPUs to become available
is_available, message = wait_for_gpu_availability(
    required_gpus=4,
    min_memory_mb=10000,
    timeout_seconds=300,  # 5 minutes
    check_interval=30     # Check every 30 seconds
)

if is_available:
    print(f"GPUs became available: {message}")
else:
    print(f"Timeout: {message}")
```

### ARQ Worker Integration

The GPU scheduler is integrated into the task execution workflow:

```python
# In autotuner_worker.py:

if task.deployment_mode == "docker":
    # 1. Estimate GPU requirements
    required_gpus, estimated_memory_mb = estimate_gpu_requirements(task_config)

    # 2. Check immediate availability
    is_available, message = check_gpu_availability(
        required_gpus=required_gpus,
        min_memory_mb=estimated_memory_mb
    )

    # 3. Wait if not immediately available
    if not is_available:
        is_available, message = wait_for_gpu_availability(
            required_gpus=required_gpus,
            min_memory_mb=estimated_memory_mb,
            timeout_seconds=300,  # 5 minutes
            check_interval=30
        )

    # 4. Fail task if still unavailable
    if not is_available:
        task.status = TaskStatus.FAILED
        # ... update database and broadcast event
        return {"status": "failed", "error": message}
```

## Real-Time Monitoring

### Overview

The DirectBenchmarkController monitors GPU metrics during benchmark execution.

### Monitoring Process

1. **Start Monitoring**: Thread begins sampling GPUs every 1 second
2. **Run Benchmark**: genai-bench executes while monitoring collects data
3. **Stop Monitoring**: Thread stops and aggregates statistics
4. **Return Results**: Monitoring data included in experiment metrics

### Implementation

```python
# In direct_benchmark_controller.py:

def run_benchmark_job(self, endpoint_url: str,
                      benchmark_spec: Dict[str, Any],
                      gpu_indices: Optional[List[int]] = None) -> Dict[str, Any]:
    # Start GPU monitoring
    gpu_monitor = get_gpu_monitor()
    if gpu_monitor.is_available():
        gpu_monitor.start_monitoring(interval_seconds=1.0)

    # Run benchmark
    result = self._run_genai_bench(endpoint_url, benchmark_spec)

    # Stop monitoring and get stats
    monitoring_data = None
    if gpu_monitor.is_available():
        monitoring_data = gpu_monitor.stop_monitoring()

    # Include in results
    result["gpu_monitoring"] = monitoring_data
    return result
```

### Monitoring Data Structure

```python
{
    "monitoring_duration_seconds": 45.2,
    "sample_count": 45,
    "gpu_stats": {
        "0": {
            "name": "NVIDIA A100-SXM4-80GB",
            "utilization": {
                "min": 78.0,
                "max": 95.0,
                "mean": 87.3,
                "samples": 45
            },
            "memory_used_mb": {
                "min": 15234.0,
                "max": 15678.0,
                "mean": 15456.2
            },
            "memory_usage_percent": {
                "min": 18.5,
                "max": 19.1,
                "mean": 18.8
            },
            "temperature_c": {
                "min": 56.0,
                "max": 62.0,
                "mean": 59.1
            },
            "power_draw_w": {
                "min": 245.0,
                "max": 280.0,
                "mean": 265.3
            }
        }
    }
}
```

## Frontend Visualization

### Experiments Page

The Experiments page displays GPU information for each experiment:

#### GPU Column

Shows GPU count and model for experiments:

```tsx