Parallel Experiment Execution¶
Design document for the parallel execution feature, which runs multiple experiments concurrently instead of one at a time.
Overview¶
The parallel execution feature enables running multiple experiments simultaneously instead of sequentially, dramatically reducing total task completion time. With intelligent GPU resource management, the system can maximize hardware utilization while preventing resource conflicts.
Benefits¶
5-10x speedup: For tasks with many experiments and available GPUs
Better GPU utilization: Run multiple small experiments on different GPUs
Configurable concurrency: Control parallelism based on system capacity
Error isolation: Failed experiments don’t block others
Architecture¶
Current Sequential Flow¶
Task Started → Worker Loop:
Exp 1: suggest → create → deploy → benchmark → record → |
Exp 2: suggest → create → deploy → benchmark → record → |
Exp 3: suggest → create → deploy → benchmark → record → |
...
→ Task Complete
Total time: sum(experiment_durations)
Proposed Parallel Flow¶
Task Started → Worker Loop:
Batch 1 (parallel):
├─ Exp 1: suggest → create → deploy → benchmark → record
├─ Exp 2: suggest → create → deploy → benchmark → record
└─ Exp 3: suggest → create → deploy → benchmark → record
Batch 2 (parallel):
├─ Exp 4: suggest → create → deploy → benchmark → record
└─ Exp 5: suggest → create → deploy → benchmark → record
...
→ Task Complete
Total time: sum(max(experiment durations in batch) for each batch)
Key Components¶
1. GPU Resource Pool¶
import asyncio
from typing import List

class GPUResourcePool:
    """Manages GPU allocation for concurrent experiments."""

    def __init__(self, max_parallel: int):
        self.max_parallel = max_parallel
        self.available_gpus = asyncio.Queue()
        self.in_use = set()

    async def acquire(self, required_gpus: int) -> List[int]:
        """Acquire GPU resources for an experiment.

        Waits until required_gpus are available, then returns
        the list of allocated GPU indices.
        """
        ...

    async def release(self, gpu_indices: List[int]):
        """Release GPU resources back to the pool."""
        ...
2. Async Experiment Executor¶
async def run_experiment_async(
    orchestrator,
    task_config,
    iteration,
    params,
    gpu_pool: GPUResourcePool,
    db: AsyncSession
):
    """Run a single experiment with async GPU allocation."""
    # Estimate GPU requirements
    required_gpus = estimate_gpu_requirements(task_config)
    # Acquire GPUs from pool (blocks if unavailable)
    gpu_indices = await gpu_pool.acquire(required_gpus)
    try:
        # Run experiment with allocated GPUs
        result = await run_experiment_with_timeout(...)
        # Update database
        await update_experiment_record(db, iteration, result)
    finally:
        # Always release GPUs
        await gpu_pool.release(gpu_indices)
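estimate_gpu_requirements is referenced but not defined above; one plausible sketch, assuming the task config is dict-like and carries a tensor-parallel size (the key names are illustrative, not the project's actual schema):

def estimate_gpu_requirements(task_config) -> int:
    """Sketch: derive the GPU count from the tensor-parallel degree."""
    params = task_config.get("parameters", {})
    # An experiment running with tensor parallelism TP=n needs n GPUs.
    tensor_parallel = params.get("tensor_parallel_size", 1)
    return max(1, int(tensor_parallel))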
3. Parallel Batch Executor¶
async def run_experiments_parallel(
    strategy,
    orchestrator,
    task_config,
    max_parallel: int,
    db: AsyncSession
):
    """Run experiments in parallel batches."""
    gpu_pool = GPUResourcePool(max_parallel)
    tasks = []
    iteration = 0
    while not strategy.should_stop():
        # Suggest parameters for the next experiment
        params = strategy.suggest_parameters()
        if params is None:
            break
        iteration += 1
        # Create an async task for the experiment
        task = asyncio.create_task(
            run_experiment_async(
                orchestrator, task_config, iteration, params,
                gpu_pool, db
            )
        )
        tasks.append(task)
        # Limit the number of in-flight tasks
        if len(tasks) >= max_parallel:
            # Wait for at least one to complete
            done, pending = await asyncio.wait(
                tasks, return_when=asyncio.FIRST_COMPLETED
            )
            tasks = list(pending)
    # Wait for remaining tasks; collect failures instead of raising
    await asyncio.gather(*tasks, return_exceptions=True)
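The in-flight cap relies on a standard sliding-window pattern; the self-contained toy below exercises just that pattern, with fake_experiment standing in for a real run:

import asyncio
import random

async def fake_experiment(i: int) -> str:
    # Stand-in for a real experiment: sleep a random duration.
    await asyncio.sleep(random.uniform(0.1, 0.3))
    return f"experiment {i} done"

async def sliding_window(n_experiments: int, max_parallel: int):
    tasks: list[asyncio.Task] = []
    for i in range(n_experiments):
        tasks.append(asyncio.create_task(fake_experiment(i)))
        if len(tasks) >= max_parallel:
            # Same pattern as above: block until one slot frees up.
            done, pending = await asyncio.wait(
                tasks, return_when=asyncio.FIRST_COMPLETED
            )
            tasks = list(pending)
    await asyncio.gather(*tasks)

asyncio.run(sliding_window(10, 3))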
Implementation Plan¶
Phase 1: Database Preparation ✅¶
Goal: Enable concurrent database writes
Tasks:
Enable SQLite WAL (Write-Ahead Logging) mode
Test concurrent writes from multiple coroutines (see the smoke-test sketch below)
Add database connection pooling if needed
Changes:
# src/web/db/session.py
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    settings.database_url,
    echo=False,
    connect_args={
        "check_same_thread": False,
        "timeout": 30
    }
)

# Enable WAL mode
async def init_db():
    async with engine.begin() as conn:
        await conn.execute(text("PRAGMA journal_mode=WAL"))
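The concurrent-write test mentioned above could look like the following smoke test, reusing engine and init_db from the snippet; the wal_smoke table is invented for this example:

import asyncio
from sqlalchemy import text

async def _write(i: int):
    # Each coroutine runs its own transaction; WAL plus the 30 s busy
    # timeout serializes the writes without "database is locked" errors.
    async with engine.begin() as conn:
        await conn.execute(
            text("INSERT INTO wal_smoke (iteration) VALUES (:i)"), {"i": i}
        )

async def smoke_test_concurrent_writes():
    await init_db()  # sets PRAGMA journal_mode=WAL
    async with engine.begin() as conn:
        await conn.execute(
            text("CREATE TABLE IF NOT EXISTS wal_smoke (iteration INTEGER)")
        )
    await asyncio.gather(*(_write(i) for i in range(20)))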
Phase 2: GPU Resource Pool ✅¶
Goal: Implement GPU allocation/deallocation system
Tasks:
Create GPUResourcePool class
Implement acquire/release with asyncio primitives
Add GPU availability checking
Integrate with existing gpu_monitor
File: src/utils/gpu_pool.py (new)
Phase 3: Async Experiment Execution ✅¶
Goal: Convert experiment execution to async
Tasks:
Wrap orchestrator.run_experiment in async executor
Update experiment record creation/updates for async
Add error handling and isolation
Implement cleanup on failure
File: src/web/workers/autotuner_worker.py (modified)
Phase 4: Configuration ✅¶
Goal: Add user-configurable concurrency settings
Tasks:
Add max_parallel_experiments to optimization config
Update Task model with new field
Add UI controls in NewTask.tsx
Add validation and defaults
Changes:
{
  "optimization": {
    "strategy": "grid_search",
    "max_iterations": 100,
    "max_parallel_experiments": 3  // NEW
  }
}
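Validation and defaults might be expressed with Pydantic (an assumption; the actual validation layer may differ):

from pydantic import BaseModel, Field

class OptimizationConfig(BaseModel):
    strategy: str = "grid_search"
    max_iterations: int = Field(default=100, ge=1)
    # New field: clamp to a sane range so a typo can't spawn dozens of containers.
    max_parallel_experiments: int = Field(default=1, ge=1, le=16)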
Phase 5: Testing & Validation ✅¶
Goal: Verify parallel execution works correctly
Tasks:
Unit tests for GPUResourcePool (see the test sketch below)
Integration tests with mock experiments
Load testing with real GPUs
Verify no database conflicts
Check GPU resource tracking
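As a sketch of the first task, one such unit test, written against the Condition-based GPUResourcePool sketch earlier and assuming pytest-asyncio is available:

import asyncio
import pytest

@pytest.mark.asyncio
async def test_pool_blocks_until_release():
    pool = GPUResourcePool(gpu_indices=[0, 1, 2, 3])
    first = await pool.acquire(3)                   # leaves one GPU free
    waiter = asyncio.create_task(pool.acquire(2))   # must wait: only 1 free
    await asyncio.sleep(0.01)
    assert not waiter.done()                        # still blocked
    await pool.release(first)
    second = await asyncio.wait_for(waiter, timeout=1)
    assert len(second) == 2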
Configuration¶
Task JSON¶
{
  "optimization": {
    "strategy": "grid_search",
    "objective": "minimize_latency",
    "max_iterations": 20,
    "max_parallel_experiments": 3,
    "timeout_per_iteration": 600
  }
}
Recommended Settings¶
| Scenario | max_parallel_experiments | Rationale |
|---|---|---|
| Single GPU system | 1 | No benefit to parallelism |
| 4 GPU system, small models | 4 | One experiment per GPU |
| 8 GPU system, large models (TP=4) | 2 | Two experiments, each using 4 GPUs |
| Limited CPU/memory | 2-3 | Avoid system overload |
| Fast experiments (<2 min) | 1 | Overhead not worth it |
| Slow experiments (>10 min) | 4-8 | Maximize parallelism |
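These recommendations follow from a simple bound; a small helper to sanity-check a setting (names are illustrative):

def effective_parallelism(total_gpus: int, gpus_per_experiment: int,
                          max_parallel_experiments: int) -> int:
    """Upper bound on how many experiments can actually run at once."""
    return min(max_parallel_experiments, total_gpus // gpus_per_experiment)

# 8 GPUs, TP=4 models: min(8, 8 // 4) == 2, matching the table above.
assert effective_parallelism(8, 4, 8) == 2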
GPU Resource Management¶
Allocation Strategy¶
Available GPUs: [0, 1, 2, 3, 4, 5, 6, 7]
Experiment 1 needs 2 GPUs → Allocate [0, 1] → In use: {0, 1}
Experiment 2 needs 4 GPUs → Allocate [2, 3, 4, 5] → In use: {0, 1, 2, 3, 4, 5}
Experiment 3 needs 2 GPUs → Wait (only 2 GPUs free) → Queued
Experiment 1 completes → Release [0, 1] → In use: {2, 3, 4, 5}
Experiment 3 proceeds → Allocate [0, 1] → In use: {0, 1, 2, 3, 4, 5}
Resource Pool Properties¶
Fair allocation: FIFO queue for waiting experiments
Deadlock prevention: Acquire all GPUs atomically or wait
Automatic cleanup: GPUs released even if experiment fails
Smart selection: Prefer least-utilized GPUs (from gpu_monitor)
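The gpu_monitor interface is not shown in this document; assuming it can report per-GPU utilization as a mapping, smart selection could look like this sketch:

def pick_least_utilized(free_gpus: set[int], utilization: dict[int, float],
                        required: int) -> list[int]:
    """Prefer the least-utilized free GPUs (utilization in [0, 1])."""
    ranked = sorted(free_gpus, key=lambda g: utilization.get(g, 0.0))
    return ranked[:required]

# Example: GPUs 2 and 5 are the coolest, so a 2-GPU request gets them.
print(pick_least_utilized({0, 2, 5}, {0: 0.9, 2: 0.1, 5: 0.2}, 2))  # [2, 5]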
Error Handling¶
Isolation¶
Each experiment runs in an independent async task:
Exceptions caught and logged
Failed experiment marked in database
Other experiments continue unaffected
GPUs properly released on failure
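Concretely, because each experiment is its own task, a failure arrives as a value rather than a crash that cancels siblings; a minimal self-contained sketch:

import asyncio

async def flaky(i: int) -> int:
    if i == 1:
        raise RuntimeError("experiment 1 exploded")
    return i

async def main():
    results = await asyncio.gather(
        *(flaky(i) for i in range(3)), return_exceptions=True
    )
    # results == [0, RuntimeError(...), 2]: the siblings finish unaffected.
    for i, r in enumerate(results):
        if isinstance(r, Exception):
            print(f"experiment {i} failed: {r}")

asyncio.run(main())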
Recovery¶
# Inside run_experiment_async, after GPUs have been acquired:
try:
    result = await run_experiment_with_timeout(...)
    await update_experiment_record(db, iteration, result)
except asyncio.TimeoutError:
    logger.error(f"Experiment {iteration} timed out")
    await mark_experiment_failed(db, iteration, "Timeout")
except Exception as e:
    logger.error(f"Experiment {iteration} failed: {e}")
    await mark_experiment_failed(db, iteration, str(e))
finally:
    # Always return GPUs, whether the experiment succeeded or failed
    await gpu_pool.release(gpu_indices)
Performance Impact¶
Expected Speedup¶
Assuming 3 experiments run in parallel:
Sequential:
  Exp 1: 10 min
  Exp 2: 10 min
  Exp 3: 10 min
  Total: 30 min

Parallel (max_parallel=3):
  Exp 1, 2, 3 (concurrent): 10 min
  Total: 10 min

Speedup: 3x (linear with concurrency)
Realistic Scenarios¶
Small models, many experiments:
20 experiments × 5 min each = 100 min sequential
With max_parallel=4: 25 min (4x speedup)
Large models, few experiments:
10 experiments × 30 min each = 300 min sequential
With max_parallel=2: 150 min (2x speedup)
Mixed workload:
Some experiments fail fast, others run full duration
Speedup varies: 2-5x typical
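The arithmetic behind these figures, as a tiny helper (uniform experiment durations assumed):

import math

def parallel_wall_time(n_experiments: int, minutes_each: float,
                       max_parallel: int) -> float:
    """Estimated wall time when experiments run in full batches."""
    return math.ceil(n_experiments / max_parallel) * minutes_each

assert parallel_wall_time(20, 5, 4) == 25    # small-model scenario
assert parallel_wall_time(10, 30, 2) == 150  # large-model scenario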
Limitations¶
SQLite Constraints¶
WAL mode required for concurrent writes
Database on NFS may have issues (use local disk)
WAL permits many concurrent readers but only one writer at a time; our write volume is low enough that the 30 s busy timeout absorbs the contention
GPU Constraints¶
Cannot run more experiments than available GPUs
Multi-GPU experiments (TP>1) reduce effective parallelism
GPU memory fragmentation may limit concurrency
System Constraints¶
CPU/memory overhead for multiple containers
Network bandwidth for concurrent downloads (HuggingFace models)
Disk I/O for logs and database writes
Best Practices¶
Start conservative: Begin with max_parallel=2, increase gradually
Monitor GPU usage: Use watch -n 1 nvidia-smi during the task
Check logs: Ensure no “GPU unavailable” errors
Adjust for model size: Large models → lower concurrency
Consider experiment duration: Short experiments → lower concurrency (overhead)
Troubleshooting¶
Problem: No speedup observed¶
Symptoms: Experiments still run sequentially
Solutions:
Check max_parallel_experiments > 1 in task config
Verify sufficient GPUs available
Check logs for “Waiting for GPUs” messages
Ensure WAL mode enabled:
sqlite3 autotuner.db "PRAGMA journal_mode"
Problem: GPU allocation errors¶
Symptoms: “No GPUs available” despite free GPUs
Solutions:
Check GPU resource pool initialization
Verify gpu_monitor is working
Look for GPU leak (not releasing properly)
Restart task to reset pool
Problem: Database lock errors¶
Symptoms: “database is locked” errors in logs
Solutions:
Enable WAL mode: PRAGMA journal_mode=WAL
Increase timeout: connect_args={"timeout": 30}
Reduce max_parallel_experiments
Check database not on NFS
Future Enhancements¶
Dynamic concurrency: Adjust based on GPU availability
Priority queuing: High-priority experiments skip queue
Cross-task parallelism: Multiple tasks share GPU pool
Distributed execution: Run experiments across multiple nodes
Smart batching: Group experiments with similar GPU requirements
References¶
SQLite WAL mode: https://www.sqlite.org/wal.html
AsyncIO task management: https://docs.python.org/3/library/asyncio-task.html
GPU resource management: docs/GPU_TRACKING.md
Implementation Status¶
Phase 1: Database Preparation ✅ COMPLETE¶
Goal: Enable concurrent database writes using SQLite WAL mode
Implementation (src/web/db/session.py):
Added check_same_thread=False and timeout=30 to engine config
Enabled PRAGMA journal_mode=WAL in init_db()
WAL mode allows concurrent readers and a single writer
Benefits:
Multiple readers can access database concurrently
Writer doesn’t block readers during commits
Improved throughput for parallel experiment updates
Phase 2: GPU Tracking ✅ COMPLETE¶
Goal: Track GPU availability and allocate to experiments
Implementation (src/controllers/gpu_tracker.py):
Real-time GPU monitoring via nvidia-smi
Thread-safe allocation tracking
Automatic cleanup on experiment completion
See: GPU_TRACKING.md for detailed documentation
Phase 3: Parallel Orchestrator 🚧 IN PROGRESS¶
Goal: Execute multiple experiments concurrently
Status: ~60% complete
Concurrent experiment scheduling
GPU-aware task distribution
Progress tracking and error handling
Expected Performance¶
With parallel execution:
Grid search: 5-10x speedup with 4+ GPUs
Bayesian optimization: 2-3x speedup (sequential dependencies limit parallelism)
Limited by: GPU count, memory per GPU, parameter space size