GPU Resource Tracking

Comprehensive GPU monitoring, allocation, and scheduling system for the LLM Autotuner.

Overview

The GPU tracking system provides end-to-end visibility and intelligent management of GPU resources throughout the autotuning workflow:

  • Monitoring: Real-time collection of GPU metrics (utilization, memory, temperature, power)

  • Allocation: Intelligent GPU selection for experiments based on availability scoring

  • Scheduling: Task-level GPU requirement estimation and availability checking

  • Visualization: Rich frontend charts and tables for GPU metrics analysis

Key Features

  • Automatic GPU detection via nvidia-smi

  • Smart GPU allocation using composite scoring (memory + utilization)

  • GPU-aware task scheduling with timeout-based waiting

  • Real-time GPU monitoring during benchmark execution

  • Frontend visualization with Recharts

  • Detailed GPU information in experiment results

Supported Modes

  • Docker Mode: Full GPU tracking, allocation, and scheduling support

  • OME/Kubernetes Mode: GPU monitoring and visualization only (allocation handled by K8s scheduler)

Architecture

System Components

┌─────────────────────────────────────────────────────────────┐
│                      Frontend (React)                        │
│  ┌────────────────┐  ┌──────────────────────────────────┐   │
│  │ GPU Metrics    │  │ Experiments Page                 │   │
│  │ Chart          │  │ - GPU count column               │   │
│  │ - Utilization  │  │ - GPU model info                 │   │
│  │ - Memory       │  │ - Monitoring data indicator      │   │
│  │ - Temperature  │  │                                  │   │
│  │ - Power        │  │                                  │   │
│  └────────────────┘  └──────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                             ▲
                             │ REST API (JSON)
                             │
┌─────────────────────────────────────────────────────────────┐
│                    Backend (Python)                          │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ ARQ Worker (autotuner_worker.py)                     │   │
│  │ - GPU requirement estimation                         │   │
│  │ - Availability checking before task start            │   │
│  │ - Wait for GPU availability (timeout)                │   │
│  └──────────────────────────────────────────────────────┘   │
│                             │                                │
│  ┌──────────────────────────┼────────────────────────────┐  │
│  │ Orchestrator             │                            │  │
│  │ - Coordinates experiments│                            │  │
│  │ - Passes GPU indices     │                            │  │
│  └──────────────────────────┼────────────────────────────┘  │
│                             │                                │
│  ┌──────────────────────────┴────────────────────────────┐  │
│  │ Controllers                                           │  │
│  │                                                       │  │
│  │ ┌─────────────────────────────────────────────────┐  │  │
│  │ │ DockerController                                │  │  │
│  │ │ - Smart GPU allocation (select_gpus_for_task)   │  │  │
│  │ │ - Device requests with specific GPU IDs         │  │  │
│  │ └─────────────────────────────────────────────────┘  │  │
│  │                                                       │  │
│  │ ┌─────────────────────────────────────────────────┐  │  │
│  │ │ DirectBenchmarkController                       │  │  │
│  │ │ - Real-time GPU monitoring during benchmark     │  │  │
│  │ │ - Aggregates stats (min/max/mean)               │  │  │
│  │ │ - Returns monitoring data with metrics          │  │  │
│  │ └─────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
│                             │                                │
│  ┌──────────────────────────┴────────────────────────────┐  │
│  │ Utilities                                             │  │
│  │                                                       │  │
│  │ ┌─────────────────────────────────────────────────┐  │  │
│  │ │ gpu_monitor.py                                  │  │  │
│  │ │ - nvidia-smi wrapper                            │  │  │
│  │ │ - GPU availability scoring                      │  │  │
│  │ │ - Continuous monitoring thread                  │  │  │
│  │ └─────────────────────────────────────────────────┘  │  │
│  │                                                       │  │
│  │ ┌─────────────────────────────────────────────────┐  │  │
│  │ │ gpu_scheduler.py                                │  │  │
│  │ │ - GPU requirement estimation                    │  │  │
│  │ │ - Availability checking                         │  │  │
│  │ │ - Wait-for-availability with timeout            │  │  │
│  │ └─────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                             ▲
                             │ nvidia-smi
                             │
                        ┌────┴────┐
                        │   GPUs  │
                        └─────────┘

Data Flow

  1. Task Submission: User creates task via frontend or API

  2. Task Scheduling: ARQ worker checks GPU availability before starting

  3. GPU Allocation: DockerController selects optimal GPUs for experiment

  4. Experiment Execution: DirectBenchmarkController monitors GPUs during benchmark

  5. Data Collection: GPU metrics aggregated and stored with experiment results

  6. Visualization: Frontend displays GPU info and monitoring charts
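
As a rough orientation, the sketch below condenses steps 2 and 3 into code using the APIs documented later in this section. It is illustrative only: the real wiring lives in the ARQ worker, the orchestrator, and the controllers, and error handling and database updates are omitted.

# Illustrative composition of the Docker-mode flow; not the actual worker code.
from src.controllers.docker_controller import DockerController
from src.utils.gpu_scheduler import estimate_gpu_requirements, wait_for_gpu_availability

def schedule_docker_task(task_config: dict) -> dict:
    # Step 2: estimate requirements and wait for capacity before starting
    required_gpus, min_memory_mb = estimate_gpu_requirements(task_config)
    is_available, message = wait_for_gpu_availability(required_gpus, min_memory_mb)
    if not is_available:
        return {"status": "failed", "error": message}

    # Step 3: pick concrete GPU indices for the experiment
    controller = DockerController(docker_model_path="/mnt/data/models")
    gpu_indices = controller.select_gpus_for_task(required_gpus, min_memory_mb)

    # Steps 4-6 (execution, data collection, visualization) are handled by the
    # orchestrator, DirectBenchmarkController, and the frontend.
    return {"status": "scheduled", "gpu_indices": gpu_indices}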

GPU Monitoring

Overview

The gpu_monitor.py module provides a singleton GPU monitor with query caching and an optional continuous monitoring thread.

Core Components

GPUMonitor Singleton

from src.utils.gpu_monitor import get_gpu_monitor

# Get the global GPU monitor instance
gpu_monitor = get_gpu_monitor()

# Check if nvidia-smi is available
if gpu_monitor.is_available():
    print("GPU monitoring available")

Query GPU Status

# Get current GPU snapshot (uses cache if recent)
snapshot = gpu_monitor.query_gpus()

# Force fresh query (bypass cache)
snapshot = gpu_monitor.query_gpus(use_cache=False)

# Access GPU data
for gpu in snapshot.gpus:
    print(f"GPU {gpu.index}: {gpu.name}")
    print(f"  Memory: {gpu.memory_used_mb}/{gpu.memory_total_mb} MB")
    print(f"  Utilization: {gpu.utilization_percent}%")
    print(f"  Temperature: {gpu.temperature_c}°C")
    print(f"  Power: {gpu.power_draw_w}W")

Find Available GPUs

# Get GPUs with <50% utilization and at least 8GB free
available_gpus = gpu_monitor.get_available_gpus(
    min_memory_mb=8000,
    max_utilization=50
)

print(f"Available GPU indices: {available_gpus}")

GPU Availability Scoring

The system uses a composite scoring algorithm to rank GPUs:

score = 0.6 × memory_score + 0.4 × utilization_score

where:
  memory_score = memory_free_mb / memory_total_mb
  utilization_score = (100 - utilization_percent) / 100

This prioritizes GPUs with:

  • More free memory (60% weight)

  • Lower utilization (40% weight)
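
For illustration, a standalone version of this scoring is sketched below. The production logic lives in _calculate_gpu_score() inside gpu_monitor.py; the helper name here is made up and simply mirrors the formula above.

def availability_score(memory_free_mb: int, memory_total_mb: int,
                       utilization_percent: int) -> float:
    """Composite availability score in [0.0, 1.0]; higher means more available."""
    memory_score = memory_free_mb / memory_total_mb
    utilization_score = (100 - utilization_percent) / 100
    return 0.6 * memory_score + 0.4 * utilization_score

# Example: 60,000 MB free out of 80,000 MB at 20% utilization
# -> 0.6 * 0.75 + 0.4 * 0.80 = 0.77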

Continuous Monitoring

For real-time monitoring during long-running operations:

# Start monitoring thread (samples every 1 second)
gpu_monitor.start_monitoring(interval_seconds=1.0)

# ... run your workload ...

# Stop monitoring and get aggregated stats
stats = gpu_monitor.stop_monitoring()

# Access aggregated data
print(f"Monitoring duration: {stats['monitoring_duration_seconds']}s")
print(f"Sample count: {stats['sample_count']}")

for gpu_index, gpu_stats in stats['gpu_stats'].items():
    print(f"\nGPU {gpu_index}:")
    print(f"  Utilization: {gpu_stats['utilization']['mean']:.1f}%")
    print(f"    Range: {gpu_stats['utilization']['min']:.0f}% - {gpu_stats['utilization']['max']:.0f}%")
    print(f"  Memory: {gpu_stats['memory_used_mb']['mean']:.0f} MB")
    print(f"  Temperature: {gpu_stats['temperature_c']['mean']:.0f}°C")
    print(f"  Power: {gpu_stats['power_draw_w']['mean']:.1f}W")

Cache Behavior

  • Default cache TTL: 5 seconds

  • Cache cleared on manual refresh (use_cache=False)

  • Cache invalidated when monitoring starts/stops

GPU Allocation

Overview

The DockerController implements intelligent GPU allocation for experiments using the monitoring system.

Allocation Strategy

def select_gpus_for_task(self, required_gpus: int, min_memory_mb: int = 8000) -> List[int]:
    """
    Select optimal GPUs for task execution.

    Args:
        required_gpus: Number of GPUs needed
        min_memory_mb: Minimum free memory per GPU (default: 8GB)

    Returns:
        List of GPU indices (e.g., [0, 1])

    Raises:
        RuntimeError: If insufficient GPUs available
    """

Allocation Examples

Single GPU Allocation

from src.controllers.docker_controller import DockerController

controller = DockerController(
    docker_model_path="/mnt/data/models",
    verbose=True
)

# Allocate 1 GPU with at least 8GB free
gpu_indices = controller.select_gpus_for_task(
    required_gpus=1,
    min_memory_mb=8000
)
# Result: [2]  # GPU 2 had the highest availability score

Multi-GPU Allocation

# Allocate 4 GPUs for tensor parallelism
gpu_indices = controller.select_gpus_for_task(
    required_gpus=4,
    min_memory_mb=10000
)
# Result: [1, 2, 3, 5]  # Best 4 GPUs by composite score

Allocation Process

  1. Query GPUs: Get current status via gpu_monitor.query_gpus(use_cache=False)

  2. Filter GPUs: Remove GPUs with insufficient memory

  3. Score GPUs: Calculate composite score (memory 60% + utilization 40%)

  4. Sort & Select: Return top N GPUs by score

  5. Validate: Raise error if insufficient GPUs available
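
The sketch below is a simplified stand-in for select_gpus_for_task that walks through these five steps using the GPUMonitor API described earlier; the real implementation may differ in detail.

from typing import List

from src.utils.gpu_monitor import get_gpu_monitor

def select_gpus_sketch(required_gpus: int, min_memory_mb: int = 8000) -> List[int]:
    # 1. Query fresh GPU state (bypass the cache for allocation decisions)
    snapshot = get_gpu_monitor().query_gpus(use_cache=False)
    if snapshot is None:
        raise RuntimeError("GPU query failed (is nvidia-smi available?)")

    # 2. Filter out GPUs without enough free memory
    candidates = [g for g in snapshot.gpus if g.memory_free_mb >= min_memory_mb]

    # 3-4. Sort by composite availability score and take the top N
    candidates.sort(key=lambda g: g.score, reverse=True)
    selected = [g.index for g in candidates[:required_gpus]]

    # 5. Validate that enough GPUs met the requirements
    if len(selected) < required_gpus:
        raise RuntimeError(
            f"Only {len(selected)} suitable GPUs found, {required_gpus} required"
        )
    return selected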

Docker Integration

Selected GPUs are passed to Docker via device_requests:

device_request = docker.types.DeviceRequest(
    device_ids=[str(idx) for idx in gpu_indices],
    capabilities=[['gpu']]
)

container = client.containers.run(
    image=image_name,
    device_requests=[device_request],
    # ... other params
)

IMPORTANT: Do NOT set the CUDA_VISIBLE_DEVICES environment variable when using device_requests; Docker handles GPU visibility automatically.

GPU Scheduling

Overview

The gpu_scheduler.py module provides task-level GPU resource management with intelligent waiting.

GPU Requirement Estimation

from src.utils.gpu_scheduler import estimate_gpu_requirements

task_config = {
    "model": {"id_or_path": "llama-3-70b"},
    "parameters": {
        "tp-size": [1, 2, 4],      # Tensor parallelism
        "pp-size": [1],             # Pipeline parallelism
        "dp-size": [1]              # Data parallelism
    }
}

required_gpus, estimated_memory_mb = estimate_gpu_requirements(task_config)
# Result: (4, 20000)  # 4 GPUs needed, ~20GB per GPU for 70B model

World Size Calculation

world_size = tp × pp × max(dp, dcp, cp)

where:
  tp = tensor_parallel_size
  pp = pipeline_parallel_size
  dp = data_parallel_size
  cp = context_parallel_size
  dcp = decode_context_parallel_size

Memory Estimation Heuristics

  • 70B/65B models: 20,000 MB per GPU

  • 13B/7B models: 12,000 MB per GPU

  • Unknown/small models: 8,000 MB per GPU (base)
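
A minimal sketch of how these heuristics combine with the world-size formula above. For brevity it only reads the hyphenated parameter names; the real estimate_gpu_requirements also accepts the aliases listed in the next subsection.

from typing import Any, Dict, Tuple

def estimate_requirements_sketch(task_config: Dict[str, Any]) -> Tuple[int, int]:
    """Illustrative re-statement of the estimation heuristics; not the actual code."""
    params = task_config.get("parameters", {})

    def max_value(name: str, default: int = 1) -> int:
        values = params.get(name, default)
        return max(values) if isinstance(values, list) else int(values)

    # world_size = tp x pp x max(dp, dcp, cp), using the largest value to be tuned
    world_size = (
        max_value("tp-size")
        * max_value("pp-size")
        * max(max_value("dp-size"), max_value("dcp-size"), max_value("cp-size"))
    )

    # Per-GPU memory heuristic keyed off the model name
    model_id = str(task_config.get("model", {}).get("id_or_path", "")).lower()
    if "70b" in model_id or "65b" in model_id:
        memory_mb = 20_000
    elif "13b" in model_id or "7b" in model_id:
        memory_mb = 12_000
    else:
        memory_mb = 8_000

    return world_size, memory_mb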

Parameter Name Formats

The estimator supports multiple parameter naming conventions:

# All these are equivalent for tensor parallelism:
"tensor-parallel-size": [1, 2, 4]
"tp-size": [1, 2, 4]
"tp_size": [1, 2, 4]
"tp": [1, 2, 4]

Supported parameters:

  • Tensor Parallel: tensor-parallel-size, tp-size, tp_size, tp

  • Pipeline Parallel: pipeline-parallel-size, pp-size, pp_size, pp

  • Data Parallel: data-parallel-size, dp-size, dp_size, dp

  • Context Parallel: context-parallel-size, cp-size, cp_size, cp

  • Decode Context Parallel: decode-context-parallel-size, dcp-size, dcp_size, dcp

Availability Checking

from src.utils.gpu_scheduler import check_gpu_availability

# Check if 4 GPUs with 10GB free are available
is_available, message = check_gpu_availability(
    required_gpus=4,
    min_memory_mb=10000
)

if is_available:
    print(f"GPUs available: {message}")
else:
    print(f"GPUs unavailable: {message}")

Wait for Availability

from src.utils.gpu_scheduler import wait_for_gpu_availability

# Wait up to 5 minutes for GPUs to become available
is_available, message = wait_for_gpu_availability(
    required_gpus=4,
    min_memory_mb=10000,
    timeout_seconds=300,    # 5 minutes
    check_interval=30       # Check every 30 seconds
)

if is_available:
    print(f"GPUs became available: {message}")
else:
    print(f"Timeout: {message}")

ARQ Worker Integration

The GPU scheduler is integrated into the task execution workflow:

# In autotuner_worker.py:

if task.deployment_mode == "docker":
    # 1. Estimate GPU requirements
    required_gpus, estimated_memory_mb = estimate_gpu_requirements(task_config)

    # 2. Check immediate availability
    is_available, message = check_gpu_availability(
        required_gpus=required_gpus,
        min_memory_mb=estimated_memory_mb
    )

    # 3. Wait if not immediately available
    if not is_available:
        is_available, message = wait_for_gpu_availability(
            required_gpus=required_gpus,
            min_memory_mb=estimated_memory_mb,
            timeout_seconds=300,  # 5 minutes
            check_interval=30
        )

    # 4. Fail task if still unavailable
    if not is_available:
        task.status = TaskStatus.FAILED
        # ... update database and broadcast event
        return {"status": "failed", "error": message}

Real-Time Monitoring

Overview

The DirectBenchmarkController monitors GPU metrics during benchmark execution.

Monitoring Process

  1. Start Monitoring: Thread begins sampling GPUs every 1 second

  2. Run Benchmark: genai-bench executes while monitoring collects data

  3. Stop Monitoring: Thread stops and aggregates statistics

  4. Return Results: Monitoring data included in experiment metrics

Implementation

# In direct_benchmark_controller.py:

def run_benchmark_job(self, endpoint_url: str, benchmark_spec: Dict[str, Any],
                      gpu_indices: Optional[List[int]] = None) -> Dict[str, Any]:

    # Start GPU monitoring
    gpu_monitor = get_gpu_monitor()
    if gpu_monitor.is_available():
        gpu_monitor.start_monitoring(interval_seconds=1.0)

    # Run benchmark
    result = self._run_genai_bench(endpoint_url, benchmark_spec)

    # Stop monitoring and get stats
    monitoring_data = None
    if gpu_monitor.is_available():
        monitoring_data = gpu_monitor.stop_monitoring()

    # Include in results
    result["gpu_monitoring"] = monitoring_data
    return result

Monitoring Data Structure

{
    "monitoring_duration_seconds": 45.2,
    "sample_count": 45,
    "gpu_stats": {
        "0": {
            "name": "NVIDIA A100-SXM4-80GB",
            "utilization": {
                "min": 78.0,
                "max": 95.0,
                "mean": 87.3,
                "samples": 45
            },
            "memory_used_mb": {
                "min": 15234.0,
                "max": 15678.0,
                "mean": 15456.2
            },
            "memory_usage_percent": {
                "min": 18.5,
                "max": 19.1,
                "mean": 18.8
            },
            "temperature_c": {
                "min": 56.0,
                "max": 62.0,
                "mean": 59.1
            },
            "power_draw_w": {
                "min": 245.0,
                "max": 280.0,
                "mean": 265.3
            }
        }
    }
}

Frontend Visualization

Experiments Page

The Experiments page displays GPU information for each experiment:

GPU Column

Shows GPU count and model for experiments:

<td className="whitespace-nowrap px-3 py-4 text-sm text-gray-700">
  {experiment.gpu_info ? (
    <div className="flex items-center gap-1">
      <svg className="w-4 h-4 text-green-600">...</svg>
      <span className="font-medium">{experiment.gpu_info.count}</span>
      <span className="text-gray-500 text-xs">
        {experiment.gpu_info.model ? `(${experiment.gpu_info.model.split(' ')[0]})` : ''}
      </span>
      {experiment.metrics?.gpu_monitoring && (
        <span className="ml-1 inline-flex items-center text-xs text-blue-600"
              title="GPU monitoring data available">
          📊
        </span>
      )}
    </div>
  ) : (
    <span className="text-gray-400">N/A</span>
  )}
</td>

GPU Metrics Chart Component

The GPUMetricsChart component visualizes monitoring data:

import GPUMetricsChart from '@/components/GPUMetricsChart';

// In experiment details modal:
{selectedExperiment.metrics?.gpu_monitoring && (
  <div>
    <h3 className="text-sm font-medium text-gray-900 mb-3">
      GPU Monitoring
    </h3>
    <GPUMetricsChart gpuMonitoring={selectedExperiment.metrics.gpu_monitoring} />
  </div>
)}

Chart Features

  1. Monitoring Summary

    • Duration and sample count

    • Displayed in blue info box

  2. GPU Stats Table

    • Per-GPU statistics

    • Columns: GPU ID, Model, Utilization, Memory, Temperature, Power

    • Shows mean values with min-max ranges

  3. Interactive Charts (Recharts LineChart)

    • GPU Utilization (%)

    • Memory Usage (%)

    • Temperature (°C)

    • Power Draw (W)

    • Responsive design (adapts to container width)

TypeScript Types

// frontend/src/types/api.ts

export interface Experiment {
  // ... other fields
  gpu_info?: {
    model: string;
    count: number;
    device_ids?: string[];
    world_size?: number;
    gpu_info?: {
      count: number;
      indices: number[];
      allocation_method: string;
      details?: Array<{
        index: number;
        name: string;
        memory_free_mb: number;
        utilization_percent: number;
        availability_score: number;
      }>;
    };
  };
  metrics?: {
    // ... other metrics
    gpu_monitoring?: {
      monitoring_duration_seconds: number;
      sample_count: number;
      gpu_stats: {
        [gpu_index: string]: {
          name: string;
          utilization: { min: number; max: number; mean: number; samples: number };
          memory_used_mb: { min: number; max: number; mean: number };
          memory_usage_percent: { min: number; max: number; mean: number };
          temperature_c: { min: number; max: number; mean: number };
          power_draw_w: { min: number; max: number; mean: number };
        };
      };
    };
  };
}

API Reference

gpu_monitor.py

GPUMonitor Class

get_gpu_monitor() -> GPUMonitor

  • Returns the global GPUMonitor singleton instance

  • Thread-safe initialization

is_available() -> bool

  • Check if nvidia-smi is available on the system

  • Returns False if nvidia-smi not found or execution fails

query_gpus(use_cache: bool = True) -> Optional[GPUSnapshot]

  • Query current GPU status

  • Args:

    • use_cache: Use cached data if available and recent (default: True)

  • Returns: GPUSnapshot with GPU data or None if query fails

  • Cache TTL: 5 seconds

get_available_gpus(min_memory_mb: Optional[int] = None, max_utilization: int = 50) -> List[int]

  • Get list of available GPU indices

  • Args:

    • min_memory_mb: Minimum free memory required (optional)

    • max_utilization: Maximum utilization percentage (default: 50)

  • Returns: List of GPU indices sorted by availability score (descending)

get_gpu_info(gpu_index: int) -> Optional[GPUInfo]

  • Get information for a specific GPU

  • Args:

    • gpu_index: GPU index (0-based)

  • Returns: GPUInfo object or None if not found
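
Since get_gpu_info is not demonstrated elsewhere in this document, a brief usage sketch follows (GPU index 0 is assumed to exist):

from src.utils.gpu_monitor import get_gpu_monitor

gpu_monitor = get_gpu_monitor()
info = gpu_monitor.get_gpu_info(0)
if info is not None:
    print(f"GPU {info.index}: {info.name}")
    print(f"  Free memory: {info.memory_free_mb} MB, score: {info.score:.2f}")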

start_monitoring(interval_seconds: float = 1.0) -> None

  • Start continuous GPU monitoring thread

  • Args:

    • interval_seconds: Sampling interval (default: 1.0)

  • Clears any previous monitoring data

stop_monitoring() -> Optional[Dict[str, Any]]

  • Stop monitoring thread and return aggregated statistics

  • Returns: Dictionary with monitoring data (see data structure above)

  • Returns None if monitoring was not started

Data Classes

GPUInfo

@dataclass
class GPUInfo:
    index: int                    # GPU index (0-based)
    name: str                     # GPU model name
    memory_total_mb: int          # Total memory in MB
    memory_used_mb: int           # Used memory in MB
    memory_free_mb: int           # Free memory in MB
    utilization_percent: int      # GPU utilization (0-100)
    temperature_c: int            # Temperature in Celsius
    power_draw_w: float           # Power draw in Watts
    score: float                  # Availability score (0.0-1.0)

GPUSnapshot

@dataclass
class GPUSnapshot:
    timestamp: datetime           # When snapshot was taken
    gpus: List[GPUInfo]          # List of GPU information

gpu_scheduler.py

estimate_gpu_requirements(task_config: Dict[str, Any]) -> Tuple[int, int]

  • Estimate GPU requirements from task configuration

  • Args:

    • task_config: Task configuration dictionary

  • Returns: Tuple of (min_gpus_required, estimated_memory_mb_per_gpu)

  • Calculation: world_size = tp × pp × max(dp, dcp, cp)

check_gpu_availability(required_gpus: int, min_memory_mb: Optional[int] = None) -> Tuple[bool, str]

  • Check if sufficient GPUs are available

  • Args:

    • required_gpus: Number of GPUs required

    • min_memory_mb: Minimum memory per GPU (optional)

  • Returns: Tuple of (is_available, message)

  • Message contains detailed status or error information

wait_for_gpu_availability(required_gpus: int, min_memory_mb: Optional[int] = None, timeout_seconds: int = 300, check_interval: int = 30) -> Tuple[bool, str]

  • Wait for sufficient GPUs to become available

  • Args:

    • required_gpus: Number of GPUs required

    • min_memory_mb: Minimum memory per GPU (optional)

    • timeout_seconds: Maximum wait time (default: 300 = 5 minutes)

    • check_interval: Polling interval (default: 30 seconds)

  • Returns: Tuple of (is_available, message)

  • Logs check attempts and status periodically

docker_controller.py

select_gpus_for_task(required_gpus: int, min_memory_mb: int = 8000) -> List[int]

  • Select optimal GPUs for task execution

  • Args:

    • required_gpus: Number of GPUs needed

    • min_memory_mb: Minimum free memory per GPU (default: 8000)

  • Returns: List of GPU indices

  • Raises: RuntimeError if insufficient GPUs available

direct_benchmark_controller.py

run_benchmark_job(endpoint_url: str, benchmark_spec: Dict[str, Any], gpu_indices: Optional[List[int]] = None) -> Dict[str, Any]

  • Run benchmark with GPU monitoring

  • Args:

    • endpoint_url: Inference service endpoint

    • benchmark_spec: Benchmark configuration

    • gpu_indices: GPU indices being used (optional, for logging)

  • Returns: Dictionary with benchmark results and gpu_monitoring field

Configuration

Environment Variables

No specific environment variables required. GPU monitoring uses nvidia-smi from PATH.

Task Configuration

Specify parallel configuration in task JSON:

{
  "parameters": {
    "tp-size": [1, 2, 4],              // Tensor parallelism
    "pp-size": [1],                     // Pipeline parallelism
    "dp-size": [1],                     // Data parallelism
    "cp-size": [1],                     // Context parallelism
    "dcp-size": [1]                     // Decode context parallelism
  }
}

Scheduler Configuration

Modify timeout and interval in autotuner_worker.py:

# Wait for GPUs with custom timeout
is_available, message = wait_for_gpu_availability(
    required_gpus=required_gpus,
    min_memory_mb=estimated_memory_mb,
    timeout_seconds=600,     # 10 minutes (default: 300)
    check_interval=60        # Check every minute (default: 30)
)

Monitoring Configuration

Adjust sampling interval for continuous monitoring:

# Sample every 2 seconds instead of 1
gpu_monitor.start_monitoring(interval_seconds=2.0)

Troubleshooting

nvidia-smi Not Found

Symptom: Warnings like “nvidia-smi not available”

Cause: nvidia-smi not in PATH or NVIDIA drivers not installed

Solution:

  • System proceeds without GPU monitoring (graceful degradation)

  • Install NVIDIA drivers and CUDA toolkit

  • Verify that the nvidia-smi command works in a terminal

No GPUs Available

Symptom: Task fails with “Insufficient GPUs after waiting”

Cause: All GPUs are busy or don’t meet memory requirements

Solutions:

  1. Wait for running tasks to complete

  2. Reduce min_memory_mb requirement

  3. Reduce parallel configuration (tp-size, pp-size)

  4. Increase timeout: timeout_seconds=600

Incorrect GPU Count Estimation

Symptom: Task requests wrong number of GPUs

Cause: Parameter names not recognized or misconfigured

Solutions:

  1. Prefer the standard hyphenated format: tp-size rather than tp_size

  2. Check parameter values are lists: [1, 2, 4] not 1

  3. Verify task config JSON structure

  4. Check logs for “Estimated requirements” message

GPU Allocation Failures

Symptom: RuntimeError during GPU selection

Cause: Insufficient GPUs with required memory

Solutions:

  1. Lower memory requirement: min_memory_mb=6000

  2. Free up GPU memory (stop other processes)

  3. Use fewer GPUs (reduce tp-size)

Monitoring Data Not Appearing

Symptom: No GPU charts in experiment details

Cause: Monitoring not enabled or failed to collect data

Solutions:

  1. Verify nvidia-smi works by running it in a terminal

  2. Check that the experiment's metrics in the database include a gpu_monitoring field

  3. Ensure DirectBenchmarkController is being used (Docker mode)

  4. Check worker logs for monitoring errors

Docker Container Can’t Access GPUs

Symptom: “No accelerator available” in container logs

Cause: Incorrect Docker GPU configuration

Solutions:

  1. Verify device_requests is used, not CUDA_VISIBLE_DEVICES

  2. Check that nvidia-container-toolkit is installed (e.g., docker run --gpus all ... succeeds)

  3. Verify GPU indices are valid: nvidia-smi -L

  4. Don’t mix device_requests and CUDA_VISIBLE_DEVICES

Frontend Not Showing GPU Info

Symptom: “N/A” in GPU column

Cause: Experiment doesn’t have gpu_info

Solutions:

  1. Verify task uses Docker mode (OME mode has limited GPU tracking)

  2. Check that the experiment record in the database has a gpu_info field

  3. Ensure DockerController’s select_gpus_for_task was called

  4. Verify frontend TypeScript types are up to date

Performance Considerations

Cache Usage

  • Query Cache: 5-second TTL reduces nvidia-smi overhead

  • Recommendation: Use default cache for frequent queries

  • Force Refresh: Use use_cache=False for critical decisions (GPU allocation)

Monitoring Overhead

  • Sampling Rate: Default 1 second is good for most workloads

  • Overhead: Minimal (~1% CPU per GPU monitored)

  • Recommendation: Increase interval to 2-5 seconds for very long benchmarks (>10 minutes)

Scheduler Polling

  • Default Interval: 30 seconds balances responsiveness and overhead

  • Recommendation: Use a shorter interval (10-15s) for high-priority tasks

  • Recommendation: Use a longer interval (60s) when many tasks are queued

GPU Allocation Strategy

  • Scoring Algorithm: Prioritizes memory (60%) over utilization (40%)

  • Rationale: Memory is a hard constraint; utilization is a soft one

  • Recommendation: Adjust if workload is compute-bound rather than memory-bound

Database Storage

  • GPU Info: Stored as JSON in experiment record (~1-2 KB per experiment)

  • Monitoring Data: Can be large for long benchmarks (~50 KB for 1000 samples)

  • Recommendation: Consider cleanup policy for old experiment monitoring data

Best Practices

  1. Use Docker Mode: Full GPU tracking support (OME mode has limited support)

  2. Set Realistic Memory Requirements: Over-estimation causes unnecessary waits

  3. Configure Timeouts Appropriately:

    • Short tasks (< 5 min): 300s timeout

    • Long tasks (> 10 min): 600-900s timeout

  4. Monitor System Load: Use watch -n 1 nvidia-smi to understand GPU usage patterns

  5. Tune Scoring Algorithm: Adjust weights in _calculate_gpu_score() for your workload

  6. Archive Monitoring Data: Consider moving old monitoring data to separate storage

  7. Use Graceful Degradation: System works without nvidia-smi, but with reduced visibility

  8. Check Logs: Worker logs contain detailed GPU scheduling information

  9. Test Parallel Configs: Verify world_size calculation matches your expectation

  10. Frontend Caching: TanStack Query caches experiment data to reduce API calls

Future Enhancements

Potential areas for improvement:

  • Multi-Node GPU Scheduling: Support for distributed GPU allocation across nodes

  • Predictive Scheduling: ML-based prediction of task duration and GPU requirements

  • Dynamic Reallocation: Move tasks between GPUs based on load

  • WebSocket Updates: Real-time GPU metrics streaming to frontend

  • GPU Affinity: Pin specific experiments to specific GPUs

  • Power Capping: Enforce power limits for energy efficiency

  • Historical Analytics: Track GPU utilization trends over time


Intelligent GPU Allocation (OME/Kubernetes)

For Kubernetes deployments, the system includes cluster-wide GPU discovery and intelligent node selection.

Features

  1. Cluster-wide GPU Discovery

    • Queries all nodes in Kubernetes cluster

    • Collects GPU capacity, utilization, memory, temperature

    • Node-level GPU availability tracking

  2. Intelligent Node Selection

    • Determines GPU requirements from task parameters (tp-size)

    • Ranks nodes based on idle GPU availability

    • Idle criteria: <30% utilization AND <50% memory usage

    • Applies node affinity to InferenceService deployments

  3. Automatic Fallback

    • Falls back to K8s scheduler if no metrics available

    • Graceful degradation if no idle GPUs found

    • Can be disabled with enable_gpu_selection=False

Implementation

See src/controllers/gpu_allocator.py:

  • get_cluster_gpu_status(): Cluster-wide discovery

  • select_best_node(): Node ranking algorithm (illustrated in the sketch below)

  • Integrates with OMEController for deployments
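
The sketch below illustrates only the idle-GPU criterion and node ordering described above; NodeGPUStatus and both function names are hypothetical stand-ins for the structures defined in gpu_allocator.py.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NodeGPUStatus:
    """Hypothetical per-node summary of GPU metrics."""
    node_name: str
    utilization_percent: List[float]   # one entry per GPU on the node
    memory_usage_percent: List[float]  # one entry per GPU on the node

def count_idle_gpus(node: NodeGPUStatus) -> int:
    # A GPU counts as idle when it is under 30% utilization AND under 50% memory usage
    return sum(
        1
        for util, mem in zip(node.utilization_percent, node.memory_usage_percent)
        if util < 30 and mem < 50
    )

def select_node_sketch(nodes: List[NodeGPUStatus], required_gpus: int) -> Optional[str]:
    """Return the node with the most idle GPUs that can fit the task,
    or None to fall back to the default Kubernetes scheduler."""
    eligible = [n for n in nodes if count_idle_gpus(n) >= required_gpus]
    if not eligible:
        return None  # graceful fallback: let the K8s scheduler decide
    return max(eligible, key=count_idle_gpus).node_name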

Benefits

  • Balanced GPU utilization across cluster

  • Avoids overloaded nodes

  • Reduces deployment failures from resource contention

  • Works with dynamic Kubernetes clusters