GPU Resource Tracking¶
Comprehensive GPU monitoring, allocation, and scheduling system for the LLM Autotuner.
Overview¶
The GPU tracking system provides end-to-end visibility and intelligent management of GPU resources throughout the autotuning workflow:
Monitoring: Real-time collection of GPU metrics (utilization, memory, temperature, power)
Allocation: Intelligent GPU selection for experiments based on availability scoring
Scheduling: Task-level GPU requirement estimation and availability checking
Visualization: Rich frontend charts and tables for GPU metrics analysis
Key Features¶
Automatic GPU detection via nvidia-smi
Smart GPU allocation using composite scoring (memory + utilization)
GPU-aware task scheduling with timeout-based waiting
Real-time GPU monitoring during benchmark execution
Frontend visualization with Recharts
Detailed GPU information in experiment results
Supported Modes¶
Docker Mode: Full GPU tracking, allocation, and scheduling support
OME/Kubernetes Mode: GPU monitoring and visualization only (allocation handled by K8s scheduler)
Architecture¶
System Components¶
┌─────────────────────────────────────────────────────────────┐
│ Frontend (React) │
│ ┌────────────────┐ ┌──────────────────────────────────┐ │
│ │ GPU Metrics │ │ Experiments Page │ │
│ │ Chart │ │ - GPU count column │ │
│ │ - Utilization │ │ - GPU model info │ │
│ │ - Memory │ │ - Monitoring data indicator │ │
│ │ - Temperature │ │ │ │
│ │ - Power │ │ │ │
│ └────────────────┘ └──────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
▲
│ REST API (JSON)
│
┌─────────────────────────────────────────────────────────────┐
│ Backend (Python) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ ARQ Worker (autotuner_worker.py) │ │
│ │ - GPU requirement estimation │ │
│ │ - Availability checking before task start │ │
│ │ - Wait for GPU availability (timeout) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────┼────────────────────────────┐ │
│ │ Orchestrator │ │ │
│ │ - Coordinates experiments│ │ │
│ │ - Passes GPU indices │ │ │
│ └──────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────┴────────────────────────────┐ │
│ │ Controllers │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ DockerController │ │ │
│ │ │ - Smart GPU allocation (select_gpus_for_task) │ │ │
│ │ │ - Device requests with specific GPU IDs │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ DirectBenchmarkController │ │ │
│ │ │ - Real-time GPU monitoring during benchmark │ │ │
│ │ │ - Aggregates stats (min/max/mean) │ │ │
│ │ │ - Returns monitoring data with metrics │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────┴────────────────────────────┐ │
│ │ Utilities │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ gpu_monitor.py │ │ │
│ │ │ - nvidia-smi wrapper │ │ │
│ │ │ - GPU availability scoring │ │ │
│ │ │ - Continuous monitoring thread │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ gpu_scheduler.py │ │ │
│ │ │ - GPU requirement estimation │ │ │
│ │ │ - Availability checking │ │ │
│ │ │ - Wait-for-availability with timeout │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
▲
│ nvidia-smi
│
┌────┴────┐
│ GPUs │
└─────────┘
Data Flow¶
Task Submission: User creates task via frontend or API
Task Scheduling: ARQ worker checks GPU availability before starting
GPU Allocation: DockerController selects optimal GPUs for experiment
Experiment Execution: DirectBenchmarkController monitors GPUs during benchmark
Data Collection: GPU metrics aggregated and stored with experiment results
Visualization: Frontend displays GPU info and monitoring charts
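For Docker mode, this flow can be sketched end to end with the APIs documented in the sections below. This is a simplified illustration, not the worker's exact code:

# Simplified end-to-end sketch of the Docker-mode flow; the real orchestration
# lives in autotuner_worker.py and the controllers documented below.
from src.controllers.docker_controller import DockerController
from src.utils.gpu_monitor import get_gpu_monitor
from src.utils.gpu_scheduler import (
    check_gpu_availability,
    estimate_gpu_requirements,
    wait_for_gpu_availability,
)

def run_docker_experiment_sketch(task_config: dict) -> dict:
    # 2. Scheduling: estimate requirements and wait for capacity
    required_gpus, memory_mb = estimate_gpu_requirements(task_config)
    ok, msg = check_gpu_availability(required_gpus, min_memory_mb=memory_mb)
    if not ok:
        ok, msg = wait_for_gpu_availability(required_gpus, min_memory_mb=memory_mb,
                                            timeout_seconds=300, check_interval=30)
    if not ok:
        return {"status": "failed", "error": msg}

    # 3. Allocation: pick concrete GPU indices
    controller = DockerController(docker_model_path="/mnt/data/models")
    gpu_indices = controller.select_gpus_for_task(required_gpus, min_memory_mb=memory_mb)

    # 4-5. Execution + collection: monitor GPUs while the benchmark runs
    monitor = get_gpu_monitor()
    if monitor.is_available():
        monitor.start_monitoring(interval_seconds=1.0)
    # ... deploy the container on gpu_indices and run the benchmark here ...
    gpu_monitoring = monitor.stop_monitoring() if monitor.is_available() else None
    return {"status": "completed", "gpu_indices": gpu_indices,
            "gpu_monitoring": gpu_monitoring}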
GPU Monitoring¶
Overview¶
The gpu_monitor.py module provides a singleton-based GPU monitoring system with caching and continuous monitoring capabilities.
Core Components¶
GPUMonitor Singleton¶
from src.utils.gpu_monitor import get_gpu_monitor
# Get the global GPU monitor instance
gpu_monitor = get_gpu_monitor()
# Check if nvidia-smi is available
if gpu_monitor.is_available():
    print("GPU monitoring available")
Query GPU Status¶
# Get current GPU snapshot (uses cache if recent)
snapshot = gpu_monitor.query_gpus()
# Force fresh query (bypass cache)
snapshot = gpu_monitor.query_gpus(use_cache=False)
# Access GPU data
for gpu in snapshot.gpus:
    print(f"GPU {gpu.index}: {gpu.name}")
    print(f"  Memory: {gpu.memory_used_mb}/{gpu.memory_total_mb} MB")
    print(f"  Utilization: {gpu.utilization_percent}%")
    print(f"  Temperature: {gpu.temperature_c}°C")
    print(f"  Power: {gpu.power_draw_w}W")
Find Available GPUs¶
# Get GPUs with <50% utilization and at least 8GB free
available_gpus = gpu_monitor.get_available_gpus(
    min_memory_mb=8000,
    max_utilization=50
)
print(f"Available GPU indices: {available_gpus}")
GPU Availability Scoring¶
The system uses a composite scoring algorithm to rank GPUs:
score = 0.6 × memory_score + 0.4 × utilization_score
where:
memory_score = memory_free_mb / memory_total_mb
utilization_score = (100 - utilization_percent) / 100
This prioritizes GPUs with:
More free memory (60% weight)
Lower utilization (40% weight)
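As a minimal sketch, the scoring above can be expressed in a few lines (the shipped logic lives in _calculate_gpu_score() in gpu_monitor.py; this helper is an illustrative approximation, not the exact implementation):

def composite_score(memory_free_mb: int, memory_total_mb: int,
                    utilization_percent: int) -> float:
    """Approximate availability score in [0.0, 1.0]; higher is better."""
    memory_score = memory_free_mb / memory_total_mb        # fraction of memory free
    utilization_score = (100 - utilization_percent) / 100  # fraction of compute idle
    return 0.6 * memory_score + 0.4 * utilization_score

# Example: 60 GB free of 80 GB at 20% utilization -> 0.6*0.75 + 0.4*0.80 = 0.77
print(composite_score(60_000, 80_000, 20))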
Continuous Monitoring¶
For real-time monitoring during long-running operations:
# Start monitoring thread (samples every 1 second)
gpu_monitor.start_monitoring(interval_seconds=1.0)
# ... run your workload ...
# Stop monitoring and get aggregated stats
stats = gpu_monitor.stop_monitoring()
# Access aggregated data
print(f"Monitoring duration: {stats['monitoring_duration_seconds']}s")
print(f"Sample count: {stats['sample_count']}")
for gpu_index, gpu_stats in stats['gpu_stats'].items():
    print(f"\nGPU {gpu_index}:")
    print(f"  Utilization: {gpu_stats['utilization']['mean']:.1f}%")
    print(f"  Range: {gpu_stats['utilization']['min']:.0f}% - {gpu_stats['utilization']['max']:.0f}%")
    print(f"  Memory: {gpu_stats['memory_used_mb']['mean']:.0f} MB")
    print(f"  Temperature: {gpu_stats['temperature_c']['mean']:.0f}°C")
    print(f"  Power: {gpu_stats['power_draw_w']['mean']:.1f}W")
Cache Behavior¶
Default cache TTL: 5 seconds
Cache cleared on manual refresh (use_cache=False)
Cache invalidated when monitoring starts/stops
GPU Allocation¶
Overview¶
The DockerController implements intelligent GPU allocation for experiments using the monitoring system.
Allocation Strategy¶
def select_gpus_for_task(self, required_gpus: int, min_memory_mb: int = 8000) -> List[int]:
    """
    Select optimal GPUs for task execution.

    Args:
        required_gpus: Number of GPUs needed
        min_memory_mb: Minimum free memory per GPU (default: 8GB)

    Returns:
        List of GPU indices (e.g., [0, 1])

    Raises:
        RuntimeError: If insufficient GPUs available
    """
Allocation Examples¶
Single GPU Allocation¶
from src.controllers.docker_controller import DockerController
controller = DockerController(
    docker_model_path="/mnt/data/models",
    verbose=True
)

# Allocate 1 GPU with at least 8GB free
gpu_indices = controller.select_gpus_for_task(
    required_gpus=1,
    min_memory_mb=8000
)
# Result: [2]  # GPU 2 had the highest availability score
Multi-GPU Allocation¶
# Allocate 4 GPUs for tensor parallelism
gpu_indices = controller.select_gpus_for_task(
    required_gpus=4,
    min_memory_mb=10000
)
# Result: [1, 2, 3, 5]  # Best 4 GPUs by composite score
Allocation Process¶
Query GPUs: Get current status via gpu_monitor.query_gpus(use_cache=False)
Filter GPUs: Remove GPUs with insufficient memory
Score GPUs: Calculate composite score (memory 60% + utilization 40%)
Sort & Select: Return top N GPUs by score
Validate: Raise error if insufficient GPUs available
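A minimal sketch of these five steps, assuming the GPUMonitor API shown earlier (the shipped implementation is DockerController.select_gpus_for_task; this is an illustration, not the exact code):

from typing import List

from src.utils.gpu_monitor import get_gpu_monitor

def select_gpus_sketch(required_gpus: int, min_memory_mb: int = 8000) -> List[int]:
    """Illustrative version of the allocation steps above (not the shipped code)."""
    monitor = get_gpu_monitor()
    snapshot = monitor.query_gpus(use_cache=False)        # 1. fresh query
    if snapshot is None:
        raise RuntimeError("GPU status unavailable (nvidia-smi query failed)")

    candidates = [g for g in snapshot.gpus
                  if g.memory_free_mb >= min_memory_mb]   # 2. filter by free memory
    candidates.sort(key=lambda g: g.score, reverse=True)  # 3-4. rank by composite score

    if len(candidates) < required_gpus:                   # 5. validate
        raise RuntimeError(
            f"Need {required_gpus} GPUs with >= {min_memory_mb} MB free, "
            f"found {len(candidates)}"
        )
    return [g.index for g in candidates[:required_gpus]]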
Docker Integration¶
Selected GPUs are passed to Docker via device_requests:
import docker

client = docker.from_env()

# gpu_indices comes from select_gpus_for_task(...)
device_request = docker.types.DeviceRequest(
    device_ids=[str(idx) for idx in gpu_indices],
    capabilities=[['gpu']]
)
container = client.containers.run(
    image=image_name,
    device_requests=[device_request],
    # ... other params
)
IMPORTANT: Do NOT set CUDA_VISIBLE_DEVICES environment variable when using device_requests. Docker handles GPU visibility automatically.
GPU Scheduling¶
Overview¶
The gpu_scheduler.py module provides task-level GPU resource management with intelligent waiting.
GPU Requirement Estimation¶
from src.utils.gpu_scheduler import estimate_gpu_requirements
task_config = {
    "model": {"id_or_path": "llama-3-70b"},
    "parameters": {
        "tp-size": [1, 2, 4],   # Tensor parallelism
        "pp-size": [1],         # Pipeline parallelism
        "dp-size": [1]          # Data parallelism
    }
}
required_gpus, estimated_memory_mb = estimate_gpu_requirements(task_config)
# Result: (4, 20000)  # 4 GPUs needed, ~20GB per GPU for 70B model
World Size Calculation¶
world_size = tp × pp × max(dp, dcp, cp)
where:
tp = tensor_parallel_size
pp = pipeline_parallel_size
dp = data_parallel_size
cp = context_parallel_size
dcp = decode_context_parallel_size
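As a quick worked example, the values from the estimation example above (tp up to 4, pp = 1, dp = 1, no cp/dcp configured) give:

# Worked instance of the formula above (values from the earlier estimation example).
tp, pp, dp, cp, dcp = 4, 1, 1, 1, 1
world_size = tp * pp * max(dp, dcp, cp)
print(world_size)  # -> 4, matching required_gpus returned by estimate_gpu_requirements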
Memory Estimation Heuristics¶
70B/65B models: 20,000 MB per GPU
13B/7B models: 12,000 MB per GPU
Unknown/small models: 8,000 MB per GPU (base)
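A minimal sketch of this heuristic; how the size tag is matched against the model name or path is an assumption here, not the exact shipped logic:

def estimate_memory_mb_sketch(model_id: str) -> int:
    """Rough per-GPU memory estimate keyed on a size tag in the model name."""
    name = model_id.lower()
    if "70b" in name or "65b" in name:
        return 20_000
    if "13b" in name or "7b" in name:
        return 12_000
    return 8_000  # unknown/small models (base)

print(estimate_memory_mb_sketch("llama-3-70b"))  # -> 20000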
Parameter Name Formats¶
The estimator supports multiple parameter naming conventions:
# All these are equivalent for tensor parallelism:
"tensor-parallel-size": [1, 2, 4]
"tp-size": [1, 2, 4]
"tp_size": [1, 2, 4]
"tp": [1, 2, 4]
Supported parameters:
Tensor Parallel: tensor-parallel-size, tp-size, tp_size, tp
Pipeline Parallel: pipeline-parallel-size, pp-size, pp_size, pp
Data Parallel: data-parallel-size, dp-size, dp_size, dp
Context Parallel: context-parallel-size, cp-size, cp_size, cp
Decode Context Parallel: decode-context-parallel-size, dcp-size, dcp_size, dcp
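A sketch of how such aliases could be resolved to a single value per parallelism axis; it assumes candidate values are lists and that the largest candidate drives the estimate, as the worked example above suggests (the exact resolution logic in gpu_scheduler.py may differ):

TP_ALIASES = ("tensor-parallel-size", "tp-size", "tp_size", "tp")

def resolve_axis(parameters: dict, aliases: tuple, default: int = 1) -> int:
    """Return the largest candidate value found under any accepted alias."""
    for name in aliases:
        values = parameters.get(name)
        if values:
            return max(values)
    return default

print(resolve_axis({"tp_size": [1, 2, 4]}, TP_ALIASES))  # -> 4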
Availability Checking¶
from src.utils.gpu_scheduler import check_gpu_availability
# Check if 4 GPUs with 10GB free are available
is_available, message = check_gpu_availability(
    required_gpus=4,
    min_memory_mb=10000
)
if is_available:
    print(f"GPUs available: {message}")
else:
    print(f"GPUs unavailable: {message}")
Wait for Availability¶
from src.utils.gpu_scheduler import wait_for_gpu_availability
# Wait up to 5 minutes for GPUs to become available
is_available, message = wait_for_gpu_availability(
    required_gpus=4,
    min_memory_mb=10000,
    timeout_seconds=300,  # 5 minutes
    check_interval=30     # Check every 30 seconds
)
if is_available:
    print(f"GPUs became available: {message}")
else:
    print(f"Timeout: {message}")
ARQ Worker Integration¶
The GPU scheduler is integrated into the task execution workflow:
# In autotuner_worker.py:
if task.deployment_mode == "docker":
    # 1. Estimate GPU requirements
    required_gpus, estimated_memory_mb = estimate_gpu_requirements(task_config)

    # 2. Check immediate availability
    is_available, message = check_gpu_availability(
        required_gpus=required_gpus,
        min_memory_mb=estimated_memory_mb
    )

    # 3. Wait if not immediately available
    if not is_available:
        is_available, message = wait_for_gpu_availability(
            required_gpus=required_gpus,
            min_memory_mb=estimated_memory_mb,
            timeout_seconds=300,  # 5 minutes
            check_interval=30
        )

    # 4. Fail task if still unavailable
    if not is_available:
        task.status = TaskStatus.FAILED
        # ... update database and broadcast event
        return {"status": "failed", "error": message}
Real-Time Monitoring¶
Overview¶
The DirectBenchmarkController monitors GPU metrics during benchmark execution.
Monitoring Process¶
Start Monitoring: Thread begins sampling GPUs every 1 second
Run Benchmark: genai-bench executes while monitoring collects data
Stop Monitoring: Thread stops and aggregates statistics
Return Results: Monitoring data included in experiment metrics
Implementation¶
# In direct_benchmark_controller.py:
def run_benchmark_job(self, endpoint_url: str, benchmark_spec: Dict[str, Any],
                      gpu_indices: Optional[List[int]] = None) -> Dict[str, Any]:
    # Start GPU monitoring
    gpu_monitor = get_gpu_monitor()
    if gpu_monitor.is_available():
        gpu_monitor.start_monitoring(interval_seconds=1.0)

    # Run benchmark
    result = self._run_genai_bench(endpoint_url, benchmark_spec)

    # Stop monitoring and get stats
    monitoring_data = None
    if gpu_monitor.is_available():
        monitoring_data = gpu_monitor.stop_monitoring()

    # Include in results
    result["gpu_monitoring"] = monitoring_data
    return result
Monitoring Data Structure¶
{
  "monitoring_duration_seconds": 45.2,
  "sample_count": 45,
  "gpu_stats": {
    "0": {
      "name": "NVIDIA A100-SXM4-80GB",
      "utilization": {
        "min": 78.0,
        "max": 95.0,
        "mean": 87.3,
        "samples": 45
      },
      "memory_used_mb": {
        "min": 15234.0,
        "max": 15678.0,
        "mean": 15456.2
      },
      "memory_usage_percent": {
        "min": 18.5,
        "max": 19.1,
        "mean": 18.8
      },
      "temperature_c": {
        "min": 56.0,
        "max": 62.0,
        "mean": 59.1
      },
      "power_draw_w": {
        "min": 245.0,
        "max": 280.0,
        "mean": 265.3
      }
    }
  }
}
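For reference, a small consumer of this payload, assuming it is passed exactly as shown (e.g. from the experiment's metrics["gpu_monitoring"]):

def summarize_gpu_monitoring(monitoring: dict) -> None:
    """Print a one-line summary per GPU from the structure above."""
    duration = monitoring["monitoring_duration_seconds"]
    print(f"{monitoring['sample_count']} samples over {duration:.1f}s")
    for index, stats in monitoring["gpu_stats"].items():
        util = stats["utilization"]
        mem = stats["memory_used_mb"]
        print(f"GPU {index} ({stats['name']}): "
              f"util mean {util['mean']:.1f}% (peak {util['max']:.0f}%), "
              f"memory peak {mem['max']:.0f} MB")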
Frontend Visualization¶
Experiments Page¶
The Experiments page displays GPU information for each experiment:
GPU Column¶
Shows GPU count and model for experiments:
<td className="whitespace-nowrap px-3 py-4 text-sm text-gray-700">
  {experiment.gpu_info ? (
    <div className="flex items-center gap-1">
      <svg className="w-4 h-4 text-green-600">...</svg>
      <span className="font-medium">{experiment.gpu_info.count}</span>
      <span className="text-gray-500 text-xs">
        {experiment.gpu_info.model ? `(${experiment.gpu_info.model.split(' ')[0]})` : ''}
      </span>
      {experiment.metrics?.gpu_monitoring && (
        <span className="ml-1 inline-flex items-center text-xs text-blue-600"
              title="GPU monitoring data available">
          📊
        </span>
      )}
    </div>
  ) : (
    <span className="text-gray-400">N/A</span>
  )}
</td>
GPU Metrics Chart Component¶
The GPUMetricsChart component visualizes monitoring data:
import GPUMetricsChart from '@/components/GPUMetricsChart';

// In experiment details modal:
{selectedExperiment.metrics?.gpu_monitoring && (
  <div>
    <h3 className="text-sm font-medium text-gray-900 mb-3">
      GPU Monitoring
    </h3>
    <GPUMetricsChart gpuMonitoring={selectedExperiment.metrics.gpu_monitoring} />
  </div>
)}
Chart Features¶
Monitoring Summary: duration and sample count, displayed in a blue info box
GPU Stats Table: per-GPU statistics (columns: GPU ID, Model, Utilization, Memory, Temperature, Power) showing mean values with min-max ranges
Interactive Charts (Recharts LineChart): GPU Utilization (%), Memory Usage (%), Temperature (°C), Power Draw (W)
Responsive design (adapts to container width)
TypeScript Types¶
// frontend/src/types/api.ts
export interface Experiment {
  // ... other fields
  gpu_info?: {
    model: string;
    count: number;
    device_ids?: string[];
    world_size?: number;
    gpu_info?: {
      count: number;
      indices: number[];
      allocation_method: string;
      details?: Array<{
        index: number;
        name: string;
        memory_free_mb: number;
        utilization_percent: number;
        availability_score: number;
      }>;
    };
  };
  metrics?: {
    // ... other metrics
    gpu_monitoring?: {
      monitoring_duration_seconds: number;
      sample_count: number;
      gpu_stats: {
        [gpu_index: string]: {
          name: string;
          utilization: { min: number; max: number; mean: number; samples: number };
          memory_used_mb: { min: number; max: number; mean: number };
          memory_usage_percent: { min: number; max: number; mean: number };
          temperature_c: { min: number; max: number; mean: number };
          power_draw_w: { min: number; max: number; mean: number };
        };
      };
    };
  };
}
API Reference¶
gpu_monitor.py¶
GPUMonitor Class¶
get_gpu_monitor() -> GPUMonitor
Returns the global GPUMonitor singleton instance
Thread-safe initialization
is_available() -> bool
Check if nvidia-smi is available on the system
Returns False if nvidia-smi not found or execution fails
query_gpus(use_cache: bool = True) -> Optional[GPUSnapshot]
Query current GPU status
Args:
use_cache: Use cached data if available and recent (default: True)
Returns: GPUSnapshot with GPU data or None if query fails
Cache TTL: 5 seconds
get_available_gpus(min_memory_mb: Optional[int] = None, max_utilization: int = 50) -> List[int]
Get list of available GPU indices
Args:
min_memory_mb: Minimum free memory required (optional)
max_utilization: Maximum utilization percentage (default: 50)
Returns: List of GPU indices sorted by availability score (descending)
get_gpu_info(gpu_index: int) -> Optional[GPUInfo]
Get information for a specific GPU
Args:
gpu_index: GPU index (0-based)
Returns: GPUInfo object or None if not found
start_monitoring(interval_seconds: float = 1.0) -> None
Start continuous GPU monitoring thread
Args:
interval_seconds: Sampling interval (default: 1.0)
Clears any previous monitoring data
stop_monitoring() -> Optional[Dict[str, Any]]
Stop monitoring thread and return aggregated statistics
Returns: Dictionary with monitoring data (see data structure above)
Returns None if monitoring was not started
Data Classes¶
GPUInfo
@dataclass
class GPUInfo:
    index: int                 # GPU index (0-based)
    name: str                  # GPU model name
    memory_total_mb: int       # Total memory in MB
    memory_used_mb: int        # Used memory in MB
    memory_free_mb: int        # Free memory in MB
    utilization_percent: int   # GPU utilization (0-100)
    temperature_c: int         # Temperature in Celsius
    power_draw_w: float        # Power draw in Watts
    score: float               # Availability score (0.0-1.0)
GPUSnapshot
@dataclass
class GPUSnapshot:
    timestamp: datetime    # When snapshot was taken
    gpus: List[GPUInfo]    # List of GPU information
gpu_scheduler.py¶
estimate_gpu_requirements(task_config: Dict[str, Any]) -> Tuple[int, int]
Estimate GPU requirements from task configuration
Args:
task_config: Task configuration dictionary
Returns: Tuple of (min_gpus_required, estimated_memory_mb_per_gpu)
Calculation:
world_size = tp × pp × max(dp, dcp, cp)
check_gpu_availability(required_gpus: int, min_memory_mb: Optional[int] = None) -> Tuple[bool, str]
Check if sufficient GPUs are available
Args:
required_gpus: Number of GPUs required
min_memory_mb: Minimum memory per GPU (optional)
Returns: Tuple of (is_available, message)
Message contains detailed status or error information
wait_for_gpu_availability(required_gpus: int, min_memory_mb: Optional[int] = None, timeout_seconds: int = 300, check_interval: int = 30) -> Tuple[bool, str]
Wait for sufficient GPUs to become available
Args:
required_gpus: Number of GPUs required
min_memory_mb: Minimum memory per GPU (optional)
timeout_seconds: Maximum wait time (default: 300 = 5 minutes)
check_interval: Polling interval (default: 30 seconds)
Returns: Tuple of (is_available, message)
Logs check attempts and status periodically
docker_controller.py¶
select_gpus_for_task(required_gpus: int, min_memory_mb: int = 8000) -> List[int]
Select optimal GPUs for task execution
Args:
required_gpus: Number of GPUs needed
min_memory_mb: Minimum free memory per GPU (default: 8000)
Returns: List of GPU indices
Raises: RuntimeError if insufficient GPUs available
direct_benchmark_controller.py¶
run_benchmark_job(endpoint_url: str, benchmark_spec: Dict[str, Any], gpu_indices: Optional[List[int]] = None) -> Dict[str, Any]
Run benchmark with GPU monitoring
Args:
endpoint_url: Inference service endpoint
benchmark_spec: Benchmark configuration
gpu_indices: GPU indices being used (optional, for logging)
Returns: Dictionary with benchmark results and a gpu_monitoring field
Configuration¶
Environment Variables¶
No specific environment variables required. GPU monitoring uses nvidia-smi from PATH.
Task Configuration¶
Specify parallel configuration in task JSON:
{
  "parameters": {
    "tp-size": [1, 2, 4],   // Tensor parallelism
    "pp-size": [1],         // Pipeline parallelism
    "dp-size": [1],         // Data parallelism
    "cp-size": [1],         // Context parallelism
    "dcp-size": [1]         // Decode context parallelism
  }
}
Scheduler Configuration¶
Modify timeout and interval in autotuner_worker.py:
# Wait for GPUs with custom timeout
is_available, message = wait_for_gpu_availability(
    required_gpus=required_gpus,
    min_memory_mb=estimated_memory_mb,
    timeout_seconds=600,  # 10 minutes (default: 300)
    check_interval=60     # Check every minute (default: 30)
)
Monitoring Configuration¶
Adjust sampling interval for continuous monitoring:
# Sample every 2 seconds instead of 1
gpu_monitor.start_monitoring(interval_seconds=2.0)
Troubleshooting¶
nvidia-smi Not Found¶
Symptom: Warnings like “nvidia-smi not available”
Cause: nvidia-smi not in PATH or NVIDIA drivers not installed
Solution:
System proceeds without GPU monitoring (graceful degradation)
Install NVIDIA drivers and CUDA toolkit
Verify the nvidia-smi command works in the terminal
No GPUs Available¶
Symptom: Task fails with “Insufficient GPUs after waiting”
Cause: All GPUs are busy or don’t meet memory requirements
Solutions:
Wait for running tasks to complete
Reduce the min_memory_mb requirement
Reduce the parallel configuration (tp-size, pp-size)
Increase the timeout: timeout_seconds=600
Incorrect GPU Count Estimation¶
Symptom: Task requests wrong number of GPUs
Cause: Parameter names not recognized or misconfigured
Solutions:
Use the standard hyphenated format: tp-size, not tp_size
Check that parameter values are lists: [1, 2, 4], not 1
Verify the task config JSON structure
Check logs for “Estimated requirements” message
GPU Allocation Failures¶
Symptom: RuntimeError during GPU selection
Cause: Insufficient GPUs with required memory
Solutions:
Lower the memory requirement: min_memory_mb=6000
Free up GPU memory (stop other processes)
Use fewer GPUs (reduce tp-size)
Monitoring Data Not Appearing¶
Symptom: No GPU charts in experiment details
Cause: Monitoring not enabled or failed to collect data
Solutions:
Verify nvidia-smi works in the terminal
Check that the experiment metrics in the database include a gpu_monitoring field
Ensure DirectBenchmarkController is being used (Docker mode)
Check worker logs for monitoring errors
Docker Container Can’t Access GPUs¶
Symptom: “No accelerator available” in container logs
Cause: Incorrect Docker GPU configuration
Solutions:
Verify device_requests is used, not CUDA_VISIBLE_DEVICES
Check that nvidia-container-toolkit is installed: docker run --gpus all ...
Verify GPU indices are valid: nvidia-smi -L
Don't mix device_requests and CUDA_VISIBLE_DEVICES
Frontend Not Showing GPU Info¶
Symptom: “N/A” in GPU column
Cause: Experiment doesn’t have gpu_info
Solutions:
Verify task uses Docker mode (OME mode has limited GPU tracking)
Check that the experiment record in the database has a gpu_info field
Ensure DockerController's select_gpus_for_task was called
Verify frontend TypeScript types are up to date
Performance Considerations¶
Cache Usage¶
Query Cache: 5-second TTL reduces nvidia-smi overhead
Recommendation: Use default cache for frequent queries
Force Refresh: Use use_cache=False for critical decisions (GPU allocation)
Monitoring Overhead¶
Sampling Rate: Default 1 second is good for most workloads
Overhead: Minimal (~1% CPU per GPU monitored)
Recommendation: Increase interval to 2-5 seconds for very long benchmarks (>10 minutes)
Scheduler Polling¶
Default Interval: 30 seconds balances responsiveness and overhead
Recommendation: Use shorter interval (10-15s) for high-priority tasks
Recommendation: Use longer interval (60s) when many tasks in queue
GPU Allocation Strategy¶
Scoring Algorithm: Prioritizes memory (60%) over utilization (40%)
Rationale: Memory is hard constraint, utilization is soft
Recommendation: Adjust if workload is compute-bound rather than memory-bound
Database Storage¶
GPU Info: Stored as JSON in experiment record (~1-2 KB per experiment)
Monitoring Data: Can be large for long benchmarks (~50 KB for 1000 samples)
Recommendation: Consider cleanup policy for old experiment monitoring data
Best Practices¶
Use Docker Mode: Full GPU tracking support (OME mode has limited support)
Set Realistic Memory Requirements: Over-estimation causes unnecessary waits
Configure Timeouts Appropriately:
Short tasks (< 5 min): 300s timeout
Long tasks (> 10 min): 600-900s timeout
Monitor System Load: Use watch -n 1 nvidia-smi to understand GPU usage patterns
Tune Scoring Algorithm: Adjust weights in _calculate_gpu_score() for your workload
Archive Monitoring Data: Consider moving old monitoring data to separate storage
Use Graceful Degradation: System works without nvidia-smi, but with reduced visibility
Check Logs: Worker logs contain detailed GPU scheduling information
Test Parallel Configs: Verify world_size calculation matches your expectation
Frontend Caching: TanStack Query caches experiment data to reduce API calls
Future Enhancements¶
Potential areas for improvement:
Multi-Node GPU Scheduling: Support for distributed GPU allocation across nodes
Predictive Scheduling: ML-based prediction of task duration and GPU requirements
Dynamic Reallocation: Move tasks between GPUs based on load
WebSocket Updates: Real-time GPU metrics streaming to frontend
GPU Affinity: Pin specific experiments to specific GPUs
Power Capping: Enforce power limits for energy efficiency
Historical Analytics: Track GPU utilization trends over time
Intelligent GPU Allocation (OME/Kubernetes)¶
For Kubernetes deployments, the system includes cluster-wide GPU discovery and intelligent node selection.
Features¶
Cluster-wide GPU Discovery
Queries all nodes in Kubernetes cluster
Collects GPU capacity, utilization, memory, temperature
Node-level GPU availability tracking
Intelligent Node Selection
Determines GPU requirements from task parameters (tp-size)
Ranks nodes based on idle GPU availability
Idle criteria: <30% utilization AND <50% memory
Applies node affinity to InferenceService deployments
Automatic Fallback
Falls back to K8s scheduler if no metrics available
Graceful degradation if no idle GPUs found
Can be disabled with enable_gpu_selection=False
Implementation¶
See src/controllers/gpu_allocator.py:
get_cluster_gpu_status(): Cluster-wide discovery
select_best_node(): Node ranking algorithm
Integrates with OMEController for deployments
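A hedged sketch of the node-ranking idea described above; the real get_cluster_gpu_status()/select_best_node() signatures and data shapes may differ, and the per-node metric dictionaries below are a hypothetical shape used only for illustration:

# Illustrative node-ranking sketch only; not the shipped gpu_allocator.py code.
from typing import Dict, List, Optional

def count_idle_gpus(gpus: List[dict]) -> int:
    """Idle criteria from above: <30% utilization AND <50% memory used."""
    return sum(1 for g in gpus
               if g["utilization_percent"] < 30 and g["memory_used_percent"] < 50)

def select_best_node_sketch(cluster: Dict[str, List[dict]], required_gpus: int) -> Optional[str]:
    """Return the node with the most idle GPUs that meets the requirement,
    or None to fall back to the default Kubernetes scheduler."""
    ranked = sorted(cluster.items(), key=lambda kv: count_idle_gpus(kv[1]), reverse=True)
    for node, gpus in ranked:
        if count_idle_gpus(gpus) >= required_gpus:
            return node
    return None  # graceful fallback: let the K8s scheduler decide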
Benefits¶
Balanced GPU utilization across cluster
Avoids overloaded nodes
Reduces deployment failures from resource contention
Works with dynamic Kubernetes clusters