# LLM Autotuner - Product Roadmap
> **Last Updated**: 2025/12/22
> **Project Status**: Production-Ready with Active Development
> **Current Version**: v0.2.0 (Milestone 5 Complete)
---
## Executive Summary
The LLM Autotuner is a comprehensive system for automatically optimizing Large Language Model inference parameters. The project has completed five major milestones, delivering tri-mode deployment (Kubernetes/OME, Docker, and Local), a full-stack web application, runtime-agnostic configuration, and an AI-powered Agent System.
**Key Achievements:**
- ✅ 28 tasks executed, 408 experiments run, 312 successful results
- ✅ Bayesian optimization achieving 80-87% reduction in experiments vs grid search
- ✅ Full-stack web application with React frontend and FastAPI backend
- ✅ Runtime-agnostic quantization and parallelism configuration
- ✅ GPU-aware optimization with per-GPU efficiency metrics
- ✅ YAML import/export for configuration management
- ✅ Real-time WebSocket updates and auto-update notifications
- ✅ Per-batch SLO filtering with graceful OOM handling
- ✅ Documentation refinement (66→15 files, 77% reduction)
- ✅ SLO-aware scoring with exponential penalty functions
- ✅ **Agent System** with LLM-powered task management, tool execution, and conversational interface
- ✅ **Local Deployment Mode** for subprocess-based model execution without containers
- ✅ **Remote Dataset Support** via URL fetching with caching
---
## Milestone Overview
| | M1: Core Foundation | M2: Web Interface | M3: Runtime-Agnostic | M4: UI/UX Polish | M5: Agent System | M6+: Future |
|---|---------------------|-------------------|----------------------|------------------|------------------|-------------|
| **Date** | 2025/10/24 | 2025/10/30 | 2025/11/14 | 2025/12/01 | 2025/12/22 | Planned |
| **Status** | ✅ Done | ✅ Done | ✅ Done | ✅ Done | ✅ Done | 🔵 Planned |
| **Features** | ✅ Grid/Random Search<br>✅ Docker Mode<br>✅ OME/K8s Mode<br>✅ Benchmark Parsing<br>✅ Scoring Algorithms<br>✅ CLI Interface | ✅ REST API<br>✅ React Frontend<br>✅ Task Queue (ARQ)<br>✅ Log Streaming<br>✅ Container Monitor<br>✅ Preset System | ✅ Bayesian Optimization<br>✅ Quantization Config<br>✅ Parallel Config<br>✅ GPU-Aware Scoring<br>✅ SLO-Aware Scoring<br>✅ Per-GPU Metrics | ✅ WebSocket Updates<br>✅ YAML Import/Export<br>✅ Auto-Update Notif<br>✅ Multi-Exp Comparison<br>✅ Sphinx Docs Site<br>✅ SLO Filtering | ✅ Agent Chat UI<br>✅ Tool Framework<br>✅ Local Deploy Mode<br>✅ Dataset URL Support<br>✅ Streaming Markdown<br>✅ GitHub/HF Tools | 🔵 Distributed Workers<br>🔵 Multi-User Auth<br>🔵 Cloud Deployment<br>🔵 CI/CD Integration<br>🔵 Advanced Analytics |
---
## Milestone Timeline
```
2025/10/24 ────► Milestone 1: Core Autotuner Foundation
2025/10/30 ────► Milestone 2: Complete Web Interface & Parameter Preset System
2025/11/14 ────► Milestone 3: Runtime-Agnostic Configuration & GPU-Aware Optimization
2025/12/01 ────► Milestone 4: UI/UX Polish, SLO Filtering & Documentation
2025/12/22 ────► Milestone 5: Agent System & Local Deployment Mode
```
---
## 🎉 Milestone 1: Core Autotuner Foundation
**Date**: 2025/10/24 (tag: `milestone-1`)
**Status**: ✅ COMPLETED
**Objective**: Establish solid foundation for LLM inference parameter autotuning with complete functionality, proper documentation, and code standards
### Key Accomplishments
#### 1.1 Architecture & Implementation ✅
- [x] Multi-tier architecture with clear separation of concerns
- [x] OME controller for Kubernetes InferenceService lifecycle
- [x] Docker controller for standalone deployment
- [x] Benchmark controller (OME BenchmarkJob + Direct CLI modes)
- [x] Parameter grid generator and optimizer utilities
- [x] Main orchestrator with JSON input
**Technical Specs:**
- Controllers: `ome_controller.py`, `docker_controller.py`, `benchmark_controller.py`, `direct_benchmark_controller.py`
- Utilities: `optimizer.py` (grid search, scoring algorithms)
- Templates: Jinja2 for Kubernetes resources
#### 1.2 Benchmark Results Parsing & Scoring ✅
- [x] Fixed critical bug in genai-bench result file parsing
- [x] Enhanced `DirectBenchmarkController._parse_results()`
- [x] Reads correct result files (D*.json pattern)
- [x] Handles multiple concurrency levels
- [x] Aggregates metrics across all runs
- [x] Extracts 15+ performance metrics
**Completed `calculate_objective_score()` with 4 objectives:**
- `minimize_latency` - E2E latency optimization
- `maximize_throughput` - Token throughput optimization
- `minimize_ttft` - Time to First Token optimization
- `minimize_tpot` - Time Per Output Token optimization
**Comprehensive Metrics:**
- Latency: mean/min/max/p50/p90/p99 E2E latency
- Throughput: output and total token throughput
- Request statistics: success rate, error tracking
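For orientation, a minimal sketch of how the four objectives can collapse the aggregated metrics into a single score (higher is better); the metric keys used here are assumptions, not the exact fields produced by `optimizer.py`:
```python
# Illustrative sketch only; actual metric keys and sign conventions live in optimizer.py.
def calculate_objective_score(metrics: dict, objective: str) -> float:
    """Return a score where higher is always better, regardless of objective."""
    if objective == "minimize_latency":
        return -metrics["mean_e2e_latency"]        # lower latency -> higher score
    if objective == "maximize_throughput":
        return metrics["output_token_throughput"]  # tokens/s, higher is better
    if objective == "minimize_ttft":
        return -metrics["mean_ttft"]               # time to first token
    if objective == "minimize_tpot":
        return -metrics["mean_tpot"]               # time per output token
    raise ValueError(f"Unknown objective: {objective}")
```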
#### 1.3 Code Quality & Standards ✅
- [x] Integrated **black-with-tabs** formatter
- [x] Formatted entire codebase (7 Python files, 1957+ lines)
- [x] Configuration: 120-char lines, tab indentation
- [x] PEP 8 compliance with 2 blank lines between top-level definitions
- [x] IDE integration guides (VS Code, PyCharm)
#### 1.4 CLI Usability Improvements ✅
- [x] Made `--direct` flag automatic when using `--mode docker`
- [x] Simplified command-line interface
- [x] Updated help text and usage examples
- [x] Better default behaviors for common use cases
#### 1.5 Documentation Structure ✅
- [x] Separated 420+ line Troubleshooting into `docs/TROUBLESHOOTING.md`
- [x] Created `docs/DEVELOPMENT.md` comprehensive guide
- [x] Established documentation conventions
- [x] Improved README readability
**Documentation Files Created:**
- `README.md` - User guide with installation and usage
- `CLAUDE.md` - Project overview and development guidelines
- `docs/TROUBLESHOOTING.md` - 13 common issues and solutions
- `docs/DEVELOPMENT.md` - Code formatting and contribution guide
- `docs/DOCKER_MODE.md` - Docker deployment guide
- `docs/OME_INSTALLATION.md` - Kubernetes/OME setup
#### 1.6 Web Integration Readiness ✅
- [x] Comprehensive codebase analysis: Zero blockers found
- [x] Created detailed readiness assessment
- [x] Verified all controllers fully implemented (no placeholder functions)
- [x] Confirmed orchestrator is programmatically importable
- [x] Documented data structures (input/output formats)
- [x] Technology stack recommendations (FastAPI, React/Vue)
- [x] API endpoint specifications
- [x] Implementation roadmap with effort estimates
### Technical Achievements
**Code Quality:**
- 1,957 lines of production Python code
- 100% method implementation (no placeholders in critical paths)
- Comprehensive error handling and logging
- Clean separation of concerns (controllers, orchestrator, utilities)
**Functionality:**
- ✅ Full Docker mode support (standalone, no K8s required)
- ✅ OME/Kubernetes mode support
- ✅ Grid search parameter optimization
- ✅ Multi-concurrency benchmark execution
- ✅ Comprehensive result aggregation and scoring
- ✅ Automatic resource cleanup
**Test Results:**
- Successfully parsed real benchmark data
- Concurrency levels: [1, 4]
- Mean E2E Latency: 0.1892s
- Mean Throughput: 2,304.82 tokens/s
---
## 🎉 Milestone 2: Complete Web Interface & Parameter Preset System
**Date**: 2025/10/30 (tag: `milestone-2`)
**Status**: ✅ COMPLETED
**Objective**: Build full-stack web application for task management, visualization, and introduce parameter preset system
### Key Accomplishments
#### 2.1 Backend API Infrastructure ✅
- [x] FastAPI application with async support
- [x] SQLAlchemy ORM with SQLite backend (moved to `~/.local/share/`)
- [x] Database models (Task, Experiment)
- [x] REST API endpoints (10+ routes)
- [x] ARQ background task queue (Redis integration)
- [x] Pydantic schemas for validation
- [x] Streaming log API endpoints
- [x] Health check improvements
**API Endpoints:**
```
POST /api/tasks/ - Create task
POST /api/tasks/{id}/start - Start task execution
GET /api/tasks/ - List tasks
GET /api/tasks/{id} - Get task details
GET /api/tasks/{id}/logs - Stream logs (SSE)
GET /api/experiments/task/{id} - Get experiments
GET /api/docker/containers - List containers
GET /api/system/health - Health check
```
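For orientation, a hedged example of driving these endpoints from a Python client (the base URL and request payload are illustrative; the authoritative schema is defined by the Pydantic models):
```python
import requests

BASE = "http://localhost:8000"  # assumed development address; adjust to your deployment

# Create a task (payload fields are illustrative, not the exact schema)
task = requests.post(f"{BASE}/api/tasks/", json={"name": "demo-tune"}).json()
task_id = task["id"]

# Start execution and check its status
requests.post(f"{BASE}/api/tasks/{task_id}/start")
print(requests.get(f"{BASE}/api/tasks/{task_id}").json())
```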
**Database Migration:**
- Moved from local `autotuner.db` to XDG-compliant `~/.local/share/autotuner/`
- SQLite WAL mode for concurrent writes
- Proper session management with async context
#### 2.2 React Frontend Application ✅
- [x] React 18 with TypeScript
- [x] Vite build tooling with hot module replacement
- [x] React Router for navigation
- [x] TanStack Query (React Query) for API state
- [x] Tailwind CSS styling
- [x] Recharts for metrics visualization
- [x] React Hot Toast for notifications
**Pages Implemented:**
- **Dashboard** - System overview and statistics
- **Tasks** - Task list with create/list/monitor/restart
- **NewTask** - Task creation wizard with form validation
- **Experiments** - Results visualization with charts
- **Containers** - Docker container monitoring (Docker mode)
**Key Components:**
- `TaskResults.tsx` - Results visualization with Recharts
- `LogViewer.tsx` - Real-time log streaming viewer
- `Layout.tsx` - Main layout with navigation
- Form components with validation
**UI Features:**
- Task creation wizard with parameter presets
- Real-time status monitoring (polling-based)
- Experiment results table with sorting/filtering
- Performance graphs (throughput, latency, TPOT, TTFT)
- Container stats (CPU, memory, GPU)
- Log streaming with follow mode
- URL-based navigation with hash routing
- Error notifications with toast messages
#### 2.3 ARQ Worker Integration ✅
- [x] Background task processing with Redis queue
- [x] Worker configuration (max_jobs=5, timeout=2h)
- [x] Log redirection to task-specific files
- [x] Graceful shutdown handling
- [x] Worker management scripts
**Log Management:**
- Task logs: `~/.local/share/autotuner/logs/task_.log`
- Worker logs: `logs/worker.log`
- Python logging library integration
- StreamToLogger for real-time capture
#### 2.4 Task Management Features ✅
- [x] Task creation UI with form builder
- [x] Task restart functionality
- [x] Task edit capability
- [x] Task status tracking
- [x] Real-time log viewing
- [x] Environment variable configuration for Docker
#### 2.5 Parameter Preset System (Backend) ✅
- [x] Parameter preset API (CRUD operations)
- [x] Preset merge functionality
- [x] Import/export capabilities
- [x] System preset seeding
**Note**: Frontend integration for preset system completed in later sprints.
### Bug Fixes & Improvements
**Critical Fixes:**
- Fixed best experiment selection bug
- Fixed model name field linking
- Fixed health check 503 errors
- Fixed data display in task view
- Refined task restart logic
- Enhanced container log viewing
**Code Organization:**
- Reorganized web backend code structure
- Separated orchestrator from web modules
- Formatted code with Prettier
- Improved error handling and validation
### Technical Stack
| Component | Technology |
|-----------|-----------|
| **Frontend** | React 18, TypeScript, Vite 5 |
| **State Management** | TanStack Query 5 |
| **Styling** | Tailwind CSS 3 |
| **Charts** | Recharts 2 |
| **Backend** | FastAPI, Python 3.10+ |
| **Database** | SQLite 3 with SQLAlchemy 2 |
| **Task Queue** | ARQ 0.26 + Redis 7 |
| **API Docs** | Swagger UI (OpenAPI) |
### Statistics
- **Commits since Milestone 1**: 40+
- **Frontend Components**: 20+ React components
- **API Endpoints**: 15+ routes
- **Database Tables**: 2 (tasks, experiments)
- **Lines of Code**: ~12,000 total (5,000 backend + 7,000 frontend)
---
## 🎉 Milestone 3: Runtime-Agnostic Configuration Architecture & GPU-Aware Optimization
**Date**: 2025/11/14 (tag: `milestone-3`)
**Status**: ✅ COMPLETED
**Timeline**: 2025/11/10 → 2025/11/14
**Objective**: Unified configuration abstraction for quantization and parallelism across multiple runtimes, plus GPU-aware optimization
### Overview
Milestone 3 achieved **two major architectural breakthroughs**:
1. **Runtime-Agnostic Configuration System** - Unified abstraction for quantization and parallel execution across vLLM, SGLang, and TensorRT-LLM
2. **GPU-Aware Optimization** - Per-GPU efficiency metrics enabling fair comparison across different parallelism strategies
These foundational changes enable **portable, efficiency-aware autotuning** where users specify high-level intent and the system automatically maps to runtime-specific implementations while optimizing for per-GPU efficiency.
### Part 1: Runtime-Agnostic Configuration System
#### 1.1 Quantization Configuration Abstraction ✅
**Problem Solved:**
Different inference runtimes use incompatible CLI syntax for quantization. Users had to learn runtime-specific arguments and rewrite configurations when switching engines.
**Solution: Three-Layer Abstraction Architecture**
**Four-Field Normalized Schema:**
```python
{
"gemm_dtype": "fp8", # Weight/activation quantization
"kvcache_dtype": "fp8_e5m2", # KV cache compression
"attention_dtype": "auto", # Attention compute precision
"moe_dtype": "auto" # MoE expert quantization
}
```
**Modules Created:**
1. **`quantization_mapper.py`** (450 lines)
- Runtime-specific CLI argument mapping
- 5 production presets: `default`, `kv-cache-fp8`, `dynamic-fp8`, `bf16-stable`, `aggressive-moe`
- Validation with dtype compatibility checking
- Automatic detection of offline quantization (AWQ, GPTQ, GGUF)
2. **`quantization_integration.py`** (350 lines)
- Orchestrator integration layer
- Experiment parameter preparation
- Conflict resolution between user params and quant config
**Runtime Mapping Example:**
```
User Config vLLM Args SGLang Args
────────────────────────────────────────────────────────────────────────────
gemm_dtype: "fp8" → --quantization fp8 --quantization fp8
kvcache_dtype: "fp8_e5m2" → --kv-cache-dtype fp8_e5m2 --kv-cache-dtype fp8_e5m2
attention_dtype: "fp8" → (inferred from gemm) --attention-backend fp8
```
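A minimal sketch of the mapping idea, following the table above (the function name and return shape are illustrative, not the actual `quantization_mapper.py` API):
```python
# Illustrative only: map the normalized schema to runtime-specific CLI args.
def map_quant_config(runtime: str, config: dict) -> list[str]:
    args: list[str] = []
    gemm = config.get("gemm_dtype", "auto")
    kv = config.get("kvcache_dtype", "auto")
    if gemm != "auto":
        args += ["--quantization", gemm]       # same flag for vLLM and SGLang
    if kv != "auto":
        args += ["--kv-cache-dtype", kv]
    return args


print(map_quant_config("vllm", {"gemm_dtype": "fp8", "kvcache_dtype": "fp8_e5m2"}))
# ['--quantization', 'fp8', '--kv-cache-dtype', 'fp8_e5m2']
```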
**Grid Expansion with `__quant__` Prefix:**
```json
{
"quant_config": {
"gemm_dtype": ["auto", "fp8"],
"kvcache_dtype": ["auto", "fp8_e5m2"]
}
}
```
Expands to 4 experiments (2×2):
- `__quant__gemm_dtype=auto, __quant__kvcache_dtype=auto`
- `__quant__gemm_dtype=auto, __quant__kvcache_dtype=fp8_e5m2`
- `__quant__gemm_dtype=fp8, __quant__kvcache_dtype=auto`
- `__quant__gemm_dtype=fp8, __quant__kvcache_dtype=fp8_e5m2`
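A sketch of how a list-valued `quant_config` might be expanded into `__quant__`-prefixed parameter sets (illustrative only; the real expansion lives in `quantization_integration.py`):
```python
from itertools import product


# Illustrative: expand list-valued fields into __quant__-prefixed parameter sets.
def expand_quant_grid(quant_config: dict) -> list[dict]:
    keys = list(quant_config)
    values = [v if isinstance(v, list) else [v] for v in quant_config.values()]
    return [
        {f"__quant__{k}": v for k, v in zip(keys, combo)}
        for combo in product(*values)
    ]


grid = expand_quant_grid({"gemm_dtype": ["auto", "fp8"], "kvcache_dtype": ["auto", "fp8_e5m2"]})
print(len(grid))  # 4 experiments (2×2), matching the example above
```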
**Frontend Integration:**
- **`QuantizationConfigForm.tsx`** (612 lines)
- Preset mode vs. Custom mode toggle
- Real-time preview of generated parameters
- Combination count calculation
- Validation feedback
#### 1.2 Parallel Configuration Abstraction ✅
**Normalized Parameter Schema:**
```python
{
"tp": 4, # Tensor parallelism
"pp": 1, # Pipeline parallelism
"dp": 2, # Data parallelism
"dcp": 1, # Decode context parallelism (vLLM)
"cp": 1, # Context parallelism (TensorRT-LLM)
"ep": 1, # Expert parallelism (MoE)
"moe_tp": 1, # MoE tensor parallelism
"moe_ep": 1 # MoE expert parallelism
}
```
**Modules Created:**
1. **`parallel_mapper.py`** (520 lines)
- 18 runtime-specific presets (6 per engine)
- Constraint validation (e.g., SGLang: `tp % dp == 0`, TensorRT-LLM: no DP support)
- world_size calculation: `world_size = tp × pp × dp`
2. **`parallel_integration.py`** (280 lines)
- Parameter grid expansion
- Orchestrator integration
- GPU allocation coordination
**Presets Per Engine:**
```
vLLM (6 presets):
- single-gpu, high-throughput, large-model-tp, large-model-tp-pp
- moe-optimized, long-context (with dcp), balanced
SGLang (6 presets):
- single-gpu, high-throughput, large-model-tp, large-model-tp-pp
- moe-optimized (with moe_dense_tp), balanced
- Constraint: tp % dp == 0
TensorRT-LLM (6 presets):
- single-gpu, large-model-tp, large-model-tp-pp
- moe-optimized (with moe_tp, moe_ep), long-context (with cp)
- Constraint: No data parallelism support (dp must be 1)
```
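A hedged sketch of the constraint checks and world-size calculation noted in the presets above (the real validation lives in `parallel_mapper.py`):
```python
# Illustrative constraint validation for a normalized parallel config.
def validate_parallel_config(runtime: str, cfg: dict) -> int:
    tp, pp, dp = cfg.get("tp", 1), cfg.get("pp", 1), cfg.get("dp", 1)
    if runtime == "sglang" and tp % dp != 0:
        raise ValueError("SGLang requires tp % dp == 0")
    if runtime == "tensorrt-llm" and dp != 1:
        raise ValueError("TensorRT-LLM has no data-parallel support (dp must be 1)")
    return tp * pp * dp  # world_size = tp × pp × dp


print(validate_parallel_config("vllm", {"tp": 4, "pp": 1, "dp": 2}))  # 8 GPUs required
```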
**Runtime Mapping Example:**
```
User Config vLLM Args SGLang Args
─────────────────────────────────────────────────────────────────────────────────
tp: 4 → --tensor-parallel-size 4 --tp-size 4
pp: 1 → --pipeline-parallel-size 1 (not supported)
dp: 2 → --distributed-executor-backend ray --dp-size 2
--num-gpu-blocks-override
```
**Grid Expansion with `__parallel__` Prefix:**
```json
{
"parallel_config": {
"tp": [2, 4],
"pp": 1,
"dp": [1, 2]
}
}
```
Expands to 4 experiments (2×2).
**Frontend Integration:**
- **`ParallelConfigForm.tsx`** (similar to QuantizationConfigForm)
- Preset mode with 18 runtime-specific presets
- Custom mode with constraint validation
- GPU requirement calculation
- Real-time parameter preview
### Part 2: GPU-Aware Optimization
#### 2.1 Per-GPU Efficiency Metrics ✅
**Problem Solved:**
Traditional throughput metrics favor higher parallelism blindly. A configuration using 8 GPUs with 100 tokens/s looks better than 2 GPUs with 60 tokens/s, but the latter is 2.4× more efficient per GPU.
**Solution: Per-GPU Throughput Calculation**
**Formula:**
```
per_gpu_throughput = total_throughput / gpu_count
```
**Example Comparison:**
```
Config A: TP=2, throughput=661.36 tokens/s → 330.68 tokens/s/GPU
Config B: TP=4, throughput=628.22 tokens/s → 157.06 tokens/s/GPU
Winner: Config A (2.1× more efficient)
```
**Implementation:**
- GPU info recorded in database: `gpu_info` JSON field
- Contains: `model`, `count`, `device_ids`, `world_size`
- Automatic calculation during scoring
- Frontend displays both total and per-GPU metrics
#### 2.2 GPU Information Tracking ✅
**Database Schema:**
```python
gpu_info = {
"model": "NVIDIA A100",
"count": 2,
"device_ids": [0, 1],
"world_size": 2
}
```
**Recording Logic:**
- Captured during experiment setup
- Stored in `experiments.gpu_info` column (JSON)
- Used for per-GPU metric calculation
- Displayed in results table
#### 2.3 Enhanced Result Visualization ✅
**Frontend Enhancements:**
- Added "GPUs" column to experiment table
- Display: `2 (A100)` or `4 (H100)`
- Tooltip shows device IDs and world size
- Per-GPU throughput column
- Color coding for efficiency comparison
**Charts:**
- Per-GPU efficiency scatter plot
- GPU count vs throughput line chart
- Pareto frontier with GPU cost consideration
### Technical Achievements
**Code Additions:**
- **Quantization System**: 800 lines (mapper + integration)
- **Parallel System**: 800 lines (mapper + integration)
- **GPU Tracking**: 200 lines (backend + frontend)
- **Frontend Forms**: 1,200 lines (Quant + Parallel components)
- **Documentation**: 3 new docs (QUANTIZATION, PARALLEL, GPU_TRACKING)
**Total**: ~3,000 lines of new production code
**Functionality:**
- ✅ Support for 3 inference runtimes (vLLM, SGLang, TensorRT-LLM)
- ✅ 5 quantization presets + custom mode
- ✅ 18 parallelism presets (6 per runtime)
- ✅ Automatic runtime-specific CLI mapping
- ✅ Constraint validation and conflict resolution
- ✅ Per-GPU efficiency metrics
- ✅ GPU information persistence
**Documentation:**
- `docs/QUANTIZATION_CONFIGURATION.md`
- `docs/PARALLEL_CONFIGURATION.md`
- `docs/GPU_TRACKING.md`
---
## 🎉 Milestone 4: UI/UX Polish, SLO Filtering & Documentation
**Date**: 2025-12-01 (tag: `milestone-4`)
**Status**: ✅ COMPLETED
**Timeline**: 2025-11-15 → 2025-12-01
**Objective**: Transform from functional prototype to production-ready platform with professional UI, SLO filtering, and comprehensive documentation
### Key Accomplishments
#### 4.1 Frontend UI/UX Enhancements ✅
- [x] Real-time WebSocket updates (<100ms latency)
- [x] YAML import/export for task configurations
- [x] Auto-update notification system (GitHub releases)
- [x] Enhanced result visualization with SLO reference lines
- [x] Custom logo and branding (SVG icon + favicon)
- [x] Protected completed tasks (hidden edit/cleanup buttons)
- [x] Clickable task names for details view
- [x] UI refinements (width-limited controls, placeholder cleanup)
**YAML Import/Export System:**
```typescript
// Import: full-page drag-and-drop zone (component props shown are illustrative)
<TaskYAMLImport onImport={(config) => populateForm(config)} />
// Export: single-click download via TaskYAMLExport
```
**Auto-Update Notifications:**
- Automatic version checking against GitHub releases
- Notification banner when updates available
- Build timestamp tracking: `v1.0.0+20251203T195130Z`
#### 4.2 SLO-Aware Benchmarking ✅
- [x] Per-batch SLO filtering (filter non-compliant batches before aggregation)
- [x] Graceful OOM handling (partial success support)
- [x] Visual SLO indicators (reference lines on performance charts)
- [x] Detailed compliance logging per batch
**Per-Batch Filtering Example:**
```
[Benchmark] Filtering 4 batches by SLO compliance...
[Benchmark] ✗ Batch concurrency=8 violated SLO: {'p90': {'threshold': 5.0, 'actual': 6.2}}
[Benchmark] ✓ 3/4 batches passed SLO
[Benchmark] Max throughput: 145.2 req/s (from 3 SLO-compliant batches)
```
**Graceful Degradation:**
- Experiments succeed if at least one batch completes
- Partial results better than no results
- OOM at high concurrency doesn't invalidate low-concurrency data
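A hedged sketch of the per-batch filtering step (the actual `check_batch_slo_compliance()` is richer; the SLO and batch field names here are assumptions):
```python
# Illustrative: drop non-compliant batches before aggregating benchmark metrics.
def filter_batches_by_slo(batches: list[dict], slo: dict) -> list[dict]:
    compliant = []
    for batch in batches:
        violations = {
            metric: {"threshold": limit, "actual": batch[metric]}
            for metric, limit in slo.items()
            if batch.get(metric, 0.0) > limit
        }
        if violations:
            print(f"[Benchmark] ✗ Batch concurrency={batch['concurrency']} violated SLO: {violations}")
        else:
            compliant.append(batch)
    print(f"[Benchmark] ✓ {len(compliant)}/{len(batches)} batches passed SLO")
    return compliant
```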
#### 4.3 Documentation Refinement ✅
- [x] Aggressive cleanup (66 → 15 files, 77% reduction)
- [x] Content merges (GENAI_BENCH_LOGS → TROUBLESHOOTING, etc.)
- [x] Reference fixes (zero broken links across all docs)
- [x] Focus on long-term maintainability
**15 Essential Files Kept:**
- **User Guides (4)**: QUICKSTART, DOCKER_MODE, OME_INSTALLATION, TROUBLESHOOTING
- **Architecture (3)**: DEPLOYMENT_ARCHITECTURE, GPU_TRACKING, ROADMAP
- **Features (4)**: BAYESIAN_OPTIMIZATION, SLO_SCORING, PARALLEL_EXECUTION, WEBSOCKET_IMPLEMENTATION
- **Configuration (4)**: UNIFIED_QUANTIZATION_PARAMETERS, PARALLEL_PARAMETERS, PRESET_QUICK_REFERENCE, PVC_STORAGE
#### 4.4 Bug Fixes & Infrastructure ✅
- [x] Template parameter fix (OME InferenceService: `params=parameters` instead of `**parameters`)
- [x] API proxy configuration (fixed hardcoded URLs in service files)
- [x] Pydantic settings fix (added `extra='ignore'` for VITE_* variables)
#### 4.5 Documentation Website ✅
- [x] Sphinx documentation with Furo theme
- [x] GitHub Actions workflow for automated deployment
- [x] MyST Parser for Markdown support
- [x] Auto-generated API documentation (autodoc)
- [x] Organized directory structure (getting-started, user-guide, features, api)
- [x] Dark mode support with custom branding
### Technical Achievements
**Code Statistics:**
- **Frontend**: ~800 lines (YAML I/O, auto-update, SLO visualization)
- **Backend**: ~400 lines (SLO filtering, OOM handling, fixes)
- **Total New Code**: ~1,200 lines
- **Documentation**: Sphinx site with 15+ pages, GitHub Actions CI/CD
**Components Created:**
- `TaskYAMLImport.tsx` (180 lines) - Drag-and-drop import with validation
- `TaskYAMLExport.tsx` (80 lines) - Single-click YAML export
- `UpdateNotification.tsx` (110 lines) - Auto-update banner with GitHub integration
- `versionService.ts` (60 lines) - Version checking service
- `check_batch_slo_compliance()` (133 lines) - Per-batch SLO validation
- `docs/conf.py` - Sphinx configuration with Furo theme
- `.github/workflows/docs.yml` - GitHub Pages deployment workflow
**Files Modified:**
- Frontend: Tasks.tsx, TaskResults.tsx, NewTask.tsx, Logo.tsx (10+ files)
- Backend: optimizer.py, direct_benchmark_controller.py, config.py (5 files)
- Documentation: README.md, CLAUDE.md, ROADMAP.md (reference fixes)
### Performance Impact
| Metric | Before M4 | After M4 | Improvement |
|--------|-----------|----------|-------------|
| **UI Response Time** | 2-5s polling | <100ms WebSocket | 20-50x faster |
| **Config Reusability** | Manual JSON edit | YAML import/export | Instant |
| **Update Awareness** | Manual check | Auto-notification | Automatic |
| **SLO Visibility** | Numbers only | Visual ref lines | Intuitive |
| **OOM Resilience** | Experiment fails | Partial success | Graceful |
| **Doc Files** | 66 files | 15 files | 77% reduction |
### Impact Summary
**For Users:**
- ✅ Faster feedback: WebSocket real-time updates
- ✅ Better visualization: SLO reference lines, enhanced charts
- ✅ Config management: YAML import/export workflow
- ✅ Stay updated: Automatic version checking
- ✅ Fewer failures: Graceful OOM handling
- ✅ Cleaner UI: Protected actions, clickable names
- ✅ Professional branding: Custom logo and favicon
**For Operators:**
- ✅ Easier troubleshooting: Per-batch SLO logging
- ✅ Better resource utilization: Partial success support
- ✅ Clearer documentation: 15 essential files vs 66
- ✅ No broken links: All references verified
**For Developers:**
- ✅ Maintainable docs: Focused, merged content
- ✅ Working examples: Templates verified
- ✅ Clear architecture: Essential docs only
- ✅ Build tracking: Timestamp in version display
---
## 🎉 Milestone 5: Agent System & Local Deployment Mode
**Date**: 2025-12-22 (tag: `milestone-5`)
**Status**: ✅ COMPLETED
**Timeline**: 2025-12-01 → 2025-12-22
**Objective**: Introduce LLM-powered Agent System for conversational task management and add Local Deployment Mode for faster development iteration
### Key Accomplishments
#### 5.1 Agent Chat Interface ✅
- [x] Full-featured chat UI with streaming markdown responses (`/agent`)
- [x] Session management with persistent conversation history
- [x] Editable session titles with auto-generation from first message
- [x] IndexedDB-based message storage with backend sync
- [x] Server-Sent Events (SSE) for real-time streaming responses
**Architecture:**
```
┌─────────────────────┐ SSE Stream ┌──────────────────┐
│ AgentChat.tsx │◄───────────────────►│ /api/agent/chat │
│ (React Frontend) │ │ (FastAPI) │
└─────────────────────┘ └────────┬─────────┘
│ │
│ Markdown │ OpenAI API
▼ ▼
┌─────────────────────┐ ┌──────────────────┐
│ StreamingMarkdown │ │ LLM Backend │
│ (react-markdown) │ │ (Configurable) │
└─────────────────────┘ └──────────────────┘
```
#### 5.2 Tool Execution Framework ✅
- [x] Authorization system with `AuthorizationScope` enum (none, privileged, dangerous)
- [x] Privileged tools require user approval before execution
- [x] Auto-execute pending tools after authorization granted
- [x] Clear visual indicators for tool status in chat UI
**Tool Categories Implemented:**
| Category | Tools | Authorization |
|----------|-------|---------------|
| Task Management | `create_task`, `start_task`, `get_task_status`, `get_task_logs` | None |
| Worker Control | `restart_arq_worker` | Privileged |
| System Utilities | `sleep`, `get_current_time` | None |
| GitHub Integration | `search_github_issues`, `create_github_issue`, `comment_github_issue` | None |
| HuggingFace CLI | `hf_cache_scan`, `hf_download`, `hf_repo_info` | None |
| Experiment Analysis | `get_experiment_logs` | None |
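A minimal sketch of the authorization concept (scope values follow the description above; the registry and helper shown are assumptions, not the agent framework's actual API):
```python
from enum import Enum


class AuthorizationScope(Enum):
    NONE = "none"              # runs immediately
    PRIVILEGED = "privileged"  # requires user approval in the chat UI
    DANGEROUS = "dangerous"    # requires explicit, per-call confirmation


# Illustrative tool registry; the real tools are LangChain tools with richer metadata.
TOOL_SCOPES = {
    "create_task": AuthorizationScope.NONE,
    "restart_arq_worker": AuthorizationScope.PRIVILEGED,
}


def requires_approval(tool_name: str) -> bool:
    return TOOL_SCOPES.get(tool_name, AuthorizationScope.NONE) is not AuthorizationScope.NONE
```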
#### 5.3 Agent Backend Architecture ✅
- [x] LangChain framework for flexible model support
- [x] Support for Claude (Anthropic) and open-source models
- [x] Max iterations increased from 10 to 100 for complex tasks
- [x] Automatic tool result handling for multi-step operations
**API Endpoints:**
- `POST /api/agent/chat` - SSE streaming chat endpoint
- `GET /api/agent/sessions` - List all sessions
- `POST /api/agent/sessions` - Create new session
- `PUT /api/agent/sessions/{id}` - Update session title
- `DELETE /api/agent/sessions/{id}` - Delete session
- `GET /api/agent/sessions/{id}/messages` - Get session messages
#### 5.4 Streaming Markdown Component ✅
- [x] Paragraph-aware streaming that preserves atomic elements
- [x] GitHub Flavored Markdown support (tables, code blocks, task lists)
- [x] Copy buttons for code blocks and tables (copies source, not rendered HTML)
- [x] Tailwind typography styling for consistent appearance
#### 5.5 Local Deployment Mode ✅
- [x] New `LocalController` for subprocess-based model execution
- [x] Direct vLLM/SGLang server launch via `python -m` commands
- [x] Automatic port allocation (30000-30100 range)
- [x] Process lifecycle management with graceful shutdown
- [x] Log capture and streaming to task log files
**Runtime Support:**
- vLLM local environment: `.venv-vllm/` with CUDA 12
- SGLang local environment: `.venv-sglang/` (SM86 limitation)
- Automatic environment detection and activation
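A hedged sketch of the port allocation and subprocess launch described above (the vLLM module path and flags are standard, but the helper names and log handling are illustrative; `LocalController` also handles graceful shutdown and log streaming):
```python
import socket
import subprocess


# Illustrative: pick a free port in the documented 30000-30100 range.
def allocate_port(start: int = 30000, end: int = 30100) -> int:
    for port in range(start, end + 1):
        with socket.socket() as s:
            if s.connect_ex(("127.0.0.1", port)) != 0:  # non-zero means nothing is listening
                return port
    raise RuntimeError("No free port in range")


def launch_vllm(model: str, port: int, log_path: str) -> subprocess.Popen:
    cmd = ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", model, "--port", str(port)]
    log = open(log_path, "a")  # captured output is streamed to the task log file
    return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
```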
#### 5.6 Dataset URL Support ✅
- [x] Remote URL dataset loading (CSV, JSONL, compressed archives)
- [x] Automatic format detection and conversion
- [x] URL-hash based caching in `~/.local/share/autotuner/datasets/`
- [x] Deduplication option for prompt datasets
- [x] genai-bench submodule updated to fork with URL support
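A sketch of the URL-hash caching scheme under `~/.local/share/autotuner/datasets/` (helper names are illustrative, not the actual `DatasetController` API):
```python
import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".local/share/autotuner/datasets"


# Illustrative: cache a remote dataset keyed by the hash of its URL.
def fetch_dataset(url: str) -> Path:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    suffix = Path(url).suffix or ".dat"
    target = CACHE_DIR / f"{key}{suffix}"
    if not target.exists():
        urllib.request.urlretrieve(url, target)  # re-downloads only on cache miss
    return target
```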
#### 5.7 Additional Improvements ✅
- [x] HuggingFace offline mode: Fixed cache path handling for air-gapped environments
- [x] GitHub Issue #3 fix: Resolved `analyze_slo_violations()` AttributeError
- [x] Foldable experiments list: Collapsible UI for large experiment sets
- [x] Comprehensive parameter presets: Runtime-specific vLLM and SGLang presets
- [x] Project rebranding: Logo redesign inspired by Novita.ai style
### Technical Achievements
**Code Statistics:**
- **Agent Frontend**: ~800 lines (AgentChat.tsx, StreamingMarkdown.tsx, AgentMessage.tsx)
- **Agent Backend**: ~600 lines (routes/agent.py, schemas/agent.py, services/agent_service.py)
- **Tools**: ~500 lines (task_tools.py, worker_tools.py, github_tools.py, hf_tools.py)
- **Local Mode**: ~400 lines (local_controller.py, autotuner_worker.py updates)
- **Dataset**: ~200 lines (dataset_controller.py)
- **Total New Code**: ~2,500 lines
**Key Components Created:**
- `AgentChat.tsx` (~400 lines) - Main chat interface with message history
- `StreamingMarkdown.tsx` (~350 lines) - Paragraph-aware markdown renderer
- `LocalController` (~300 lines) - Subprocess-based deployment controller
- `DatasetController` (~200 lines) - Remote dataset fetching and caching
- 12 agent tools across 6 categories
### Performance Impact
| Metric | Before M5 | After M5 | Improvement |
|--------|-----------|----------|-------------|
| **Task Creation** | Form-only | Conversational + Form | Natural language |
| **Deployment Setup** | Docker/K8s required | Local subprocess option | Faster iteration |
| **Dataset Loading** | Local files only | Remote URL support | More flexible |
| **Agent Iterations** | N/A | Up to 100 steps | Complex workflows |
### Impact Summary
**For Users:**
- ✅ Natural language task creation and management via Agent
- ✅ Faster local development without Docker/Kubernetes overhead
- ✅ Remote dataset support for production workload testing
- ✅ GitHub integration for issue tracking and collaboration
- ✅ HuggingFace integration for model management
**For Developers:**
- ✅ Local deployment mode for rapid iteration
- ✅ Extensible tool framework for custom integrations
- ✅ Comprehensive logging and debugging support
- ✅ Session-based conversation history
## Current Status: Production-Ready v0.2.0 ✅
### What Works Today
**Core Functionality:**
- ✅ Grid search, random search, Bayesian optimization (Optuna TPE)
- ✅ Docker mode deployment (recommended)
- ✅ Kubernetes/OME mode deployment
- ✅ **Local mode deployment** (subprocess-based, no containers)
- ✅ Runtime-agnostic quantization configuration (vLLM, SGLang, TensorRT-LLM)
- ✅ Runtime-agnostic parallelism configuration (18 presets)
- ✅ SLO-aware scoring with exponential penalties
- ✅ GPU intelligent scheduling with per-GPU efficiency metrics
- ✅ Checkpoint mechanism for fault tolerance
- ✅ Multi-objective Pareto optimization
- ✅ Model caching optimization
- ✅ Full-stack web UI with real-time monitoring
- ✅ **Agent System** with LLM-powered conversational interface
- ✅ **Remote dataset support** via URL fetching
**Performance:**
- ✅ 28 tasks executed successfully
- ✅ 408 total experiments run
- ✅ 312 successful experiments (76.5% success rate)
- ✅ Average experiment duration: 303.6 seconds
- ✅ Bayesian optimization: 80-87% reduction vs grid search
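The Bayesian search mentioned above uses Optuna's TPE sampler; a minimal sketch of the idea, with an illustrative search space and a dummy objective standing in for a real deployment-plus-benchmark run:
```python
import optuna


def run_experiment_and_score(params: dict) -> float:
    # Placeholder for: deploy with params, run genai-bench, compute the objective score.
    return -abs(params["gpu-memory-utilization"] - 0.90)  # dummy score for illustration


def objective(trial: optuna.Trial) -> float:
    params = {
        "max-num-seqs": trial.suggest_categorical("max-num-seqs", [64, 128, 256]),
        "gpu-memory-utilization": trial.suggest_float("gpu-memory-utilization", 0.80, 0.95),
    }
    return run_experiment_and_score(params)


study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=20)  # far fewer trials than an exhaustive grid
print(study.best_params)
```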
**Infrastructure:**
- ✅ FastAPI backend with async support
- ✅ React 18 frontend with TypeScript
- ✅ WebSocket real-time communication (backend + frontend)
- ✅ SQLite database with WAL mode (XDG-compliant location)
- ✅ Redis task queue with ARQ worker
- ✅ Docker container management
- ✅ Kubernetes resource management
- ✅ **LangChain-based agent framework** with 12 tools
---
## Future Roadmap
### 🔵 Phase 6: Distributed Architecture & Parallel Execution (Planned)
**Priority**: High
**Effort**: 3-4 weeks
**Value**: ⭐⭐⭐⭐⭐
#### 6.1 Distributed Worker Architecture
- [ ] **Central Web Manager**: Single control plane for multiple workers
- [ ] **Worker Registration**: Auto-discovery and registration via Redis
- [ ] **Heartbeat Monitoring**: Worker health checks and failure detection
- [ ] **Work Stealing**: Dynamic task redistribution across workers
- [ ] **Worker Pools**: Group workers by capabilities (GPU type, region, etc.)
**Architecture Design:**
```
┌─────────────────────┐
│ Central Web Manager│
│ (FastAPI + Redis) │
└──────────┬──────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Worker 1 │ │ Worker 2 │ │ Worker 3 │
│ 8×A100 GPUs│ │ 8×H100 GPUs│ │ 4×L40S GPUs│
│ Node: gpu-1│ │ Node: gpu-2│ │ Node: gpu-3│
└─────────────┘ └─────────────┘ └─────────────┘
```
**Components:**
- **Manager**:
- Task queue management
- Worker registry with capabilities
- Experiment distribution algorithm
- Result aggregation service
- Centralized logging
- **Worker**:
- Capability advertisement (GPU count, model, memory)
- Experiment execution engine
- Result reporting via REST API
- Local checkpoint storage
- Worker-level parallelism (max_parallel per worker)
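Because this phase is still planned, the following is only a sketch under assumed design choices (Redis key names and fields are invented for illustration) of how worker registration and heartbeats could work:
```python
import socket
import time

import redis

r = redis.Redis()  # assumes the same Redis instance already used by ARQ


# Illustrative: a worker advertises its capabilities and refreshes a TTL'd heartbeat key.
def register_worker(gpu_model: str, gpu_count: int) -> str:
    worker_id = f"{socket.gethostname()}-{gpu_model}"
    r.hset(f"workers:{worker_id}", mapping={
        "gpu_model": gpu_model,
        "gpu_count": gpu_count,
        "registered_at": time.time(),
    })
    return worker_id


def heartbeat(worker_id: str, ttl: int = 30) -> None:
    r.set(f"workers:{worker_id}:alive", "1", ex=ttl)  # manager treats key expiry as worker failure
```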
**Benefits:**
- **Horizontal Scaling**: Add workers to increase throughput
- **Resource Isolation**: Different workers for different GPU types
- **Fault Tolerance**: Worker failures don't affect others
- **Geographic Distribution**: Workers in different data centers
- **Cost Optimization**: Use spot instances for workers
**Implementation Plan:**
1. Week 1: Worker registration and discovery
2. Week 2: Task distribution and scheduling
3. Week 3: Result aggregation and monitoring
4. Week 4: Frontend dashboard and testing
#### 6.2 Advanced Parallel Execution
- [ ] User-configurable max_parallel setting (currently hardcoded at 5)
- [ ] Per-worker parallelism configuration
- [ ] Dynamic parallelism based on GPU availability
- [ ] Experiment dependency graph
- [ ] Priority-based scheduling (high/normal/low priority tasks)
- [ ] Resource reservation system (reserve GPUs for specific tasks)
**Benefits:**
- Faster task completion (5-10x speedup with multiple workers)
- Better GPU utilization across cluster
- Configurable resource allocation per task
- Fair scheduling with priority queues
#### 6.3 Task Sharding & Load Balancing
- [ ] Automatic task splitting across workers
- [ ] Load-aware scheduling (balance by GPU count)
- [ ] Locality-aware scheduling (prefer same-node experiments)
- [ ] Cross-worker result aggregation
- [ ] Consistent hashing for worker selection
---
### 🔵 Phase 7: Advanced Optimization & Runtime Features (Planned)
**Priority**: Medium
**Effort**: 2-4 weeks
**Value**: ⭐⭐⭐⭐
#### 7.0 Agent Charting Tool
- [ ] Add chart generation tool for Agent to visualize experiment results
- [ ] Candidates: Matplotlib (static images), Plotly (interactive HTML)
- [ ] Chart types: bar charts, line plots, scatter plots, heatmaps
- [ ] Use cases:
- Compare throughput/latency across experiments
- Visualize parameter sensitivity
- Generate Pareto frontier plots
- Create SLO compliance charts
- [ ] Output: Save charts to files or display inline in chat
**Implementation Options:**
| Library | Pros | Cons |
|---------|------|------|
| Matplotlib | Simple, widely used, static images | Not interactive |
| Plotly | Interactive, HTML export, beautiful | Larger dependency |
| Seaborn | Statistical plots, built on Matplotlib | Limited interactivity |
#### 7.1 Runtime-Specific Optimizations
**SGLang Radix Cache Management:**
- [ ] **Reset radix cache at experiment start**: Clear cache before each experiment
- [ ] **Benchmark purity**: Ensure fair comparison without cache pollution
- [ ] **Cache warming option**: Optional pre-fill for production scenarios
- [ ] **Cache statistics tracking**: Monitor hit rate and memory usage
**Implementation:**
```python
import logging

import requests

logger = logging.getLogger(__name__)


# Planned: reset the radix cache before each experiment so runs start cold.
def reset_sglang_radix_cache(port: int) -> None:
    """Reset the SGLang radix cache via its HTTP API (endpoint shape is the planned design)."""
    response = requests.post(
        f"http://localhost:{port}/reset_cache",
        json={"cache_type": "radix"},
    )
    logger.info(f"Radix cache reset: {response.json()}")
```
**Benefits:**
- Fair experiment comparisons (no cached KV states)
- Reproducible benchmark results
- Accurate TTFT measurements
- Option to test both cold-start and warm-cache scenarios
**Additional Runtime Features:**
- [ ] vLLM prefix caching control
- [ ] TensorRT-LLM engine rebuild triggers
- [ ] Runtime-specific profiling hooks
- [ ] Memory defragmentation between experiments
#### 7.2 Multi-Fidelity Optimization
- [ ] Progressive benchmark complexity
- [ ] Early stopping for poor configurations
- [ ] Hyperband algorithm integration
- [ ] Adaptive resource allocation
- [ ] Quick validation runs (low concurrency, short duration)
- [ ] Full benchmark only for promising configs
#### 7.3 Transfer Learning
- [ ] Model similarity detection (architecture, size, quantization)
- [ ] Cross-model parameter transfer
- [ ] Historical performance database (SQLite → PostgreSQL)
- [ ] Meta-learning for initialization
- [ ] Warmstart Bayesian optimization with historical data
#### 7.4 Enhanced Multi-Objective Optimization
- [ ] NSGA-II algorithm for Pareto frontier
- [ ] 3+ objective support (latency, throughput, cost, energy, memory)
- [ ] Interactive trade-off exploration
- [ ] User preference learning
- [ ] Weighted objective combination
- [ ] Pareto frontier approximation with surrogate models
---
#### 7.5 Enhanced Export & Data Portability
- [ ] Export experiment results to CSV
- [ ] Export results to JSON for analysis
- [ ] Export results to Excel (.xlsx) format
- [ ] Batch import multiple task configs
- [ ] Template library (export/import task templates)
- [ ] Share configurations via file or URL
- [ ] YAML parser with schema validation
- [ ] Automatic conversion between JSON ↔ YAML
- [ ] YAML syntax highlighting in frontend
**Benefits:**
- Data portability for external analysis tools (Excel, Python, R)
- Batch operations for managing multiple tasks
- Configuration templates for common use cases
- Team collaboration via shared configs
- Integration with data science workflows
**Export Formats:**
- Experiment Results: `.csv`, `.json`, `.xlsx`
- Task Configs: `.yaml`, `.json`
- Templates: Zip archive with metadata
#### 7.6 Custom Dataset Support for GenAI-Bench
- [ ] Fetch datasets from user-specified URLs
- [ ] Support CSV format parsing
- [ ] Support JSONL (JSON Lines) format parsing
- [ ] Conversion script to genai-bench format
- [ ] Dataset validation and preprocessing
- [ ] Automatic schema detection
- [ ] Support for custom prompt templates
- [ ] Integration with task configuration
**Supported Input Formats:**
```csv
# CSV format
prompt,max_tokens,temperature
"Explain quantum computing",100,0.7
"Write a story about AI",200,0.9
```
```jsonl
# JSONL format
{"prompt": "Explain quantum computing", "max_tokens": 100, "temperature": 0.7}
{"prompt": "Write a story about AI", "max_tokens": 200, "temperature": 0.9}
```
**Conversion Pipeline:**
```bash
# Download and convert custom dataset
python scripts/prepare_custom_dataset.py \
  --url https://example.com/dataset.csv \
  --format csv \
  --output ./data/custom_benchmark.json
```
Use in task configuration:
```json
{
  "benchmark": {
    "custom_dataset": "./data/custom_benchmark.json",
    "task": "text-to-text"
  }
}
```
**Features:**
- HTTP/HTTPS URL fetching with authentication support
- Automatic format detection (CSV/JSONL)
- Field mapping configuration (map CSV columns to genai-bench schema)
- Data validation (check required fields, token limits)
- Sampling strategies (random, stratified, sequential)
- Dataset caching to avoid re-downloading
**Benefits:**
- Use real production workloads for benchmarking
- Test with domain-specific prompts
- Reproducible benchmarks with versioned datasets
- Support for custom evaluation scenarios
- Integration with existing data pipelines
**GenAI-Bench Schema Mapping:**
```python
# Required fields for genai-bench
{
"prompt": str, # Input text
"output_len": int, # Expected output length
"input_len": int, # Input length (auto-calculated if not provided)
"temperature": float, # Optional: sampling temperature
"top_p": float, # Optional: nucleus sampling
"max_tokens": int # Optional: max output tokens
}
```
**Implementation Components:**
1. **Dataset Fetcher** (`src/utils/dataset_fetcher.py`)
- URL download with retries
- Authentication headers support
- Local file caching
2. **Format Converters** (`src/utils/dataset_converters/`)
- `csv_converter.py`: CSV → genai-bench JSON
- `jsonl_converter.py`: JSONL → genai-bench JSON
- Base converter interface for extensibility
3. **Validation Module** (`src/utils/dataset_validator.py`)
- Schema validation
- Token limit checking
- Duplicate detection
4. **CLI Tool** (`scripts/prepare_custom_dataset.py`)
- Standalone conversion utility
- Preview mode (show first N records)
- Statistics reporting
### 🔵 Phase 8: Enterprise Features (Planned)
**Priority**: Low-Medium
**Effort**: 3-5 weeks
**Value**: ⭐⭐⭐
#### 8.1 Multi-User Support
- [ ] User authentication (OAuth2)
- [ ] Role-based access control (RBAC)
- [ ] Task ownership and sharing
- [ ] Team workspaces
#### 8.2 Advanced Monitoring
- [ ] Prometheus metrics exporter
- [ ] Grafana dashboard templates
- [ ] Alert rules for failures
- [ ] Performance analytics
#### 8.3 CI/CD Integration
- [ ] GitHub Actions workflow
- [ ] Automated benchmarking on PR
- [ ] Performance regression detection
- [ ] Automated deployment
#### 8.4 Cloud Deployment
- [ ] AWS deployment guide (EKS)
- [ ] GCP deployment guide (GKE)
- [ ] Azure deployment guide (AKS)
- [ ] Terraform modules
- [ ] Helm charts
---
### 🟢 Phase 9: Research & Innovation (Future)
**Priority**: Low
**Effort**: Variable
**Value**: ⭐⭐⭐
#### 9.1 Auto-Scaling Integration
- [ ] Horizontal Pod Autoscaler (HPA) optimization
- [ ] Vertical Pod Autoscaler (VPA) tuning
- [ ] Knative Serving integration
- [ ] Cost-aware scaling
#### 9.2 Advanced Benchmarking
- [ ] Custom benchmark scenario editor
- [ ] Real-world traffic replay
- [ ] Synthetic load generation
- [ ] Multi-modal benchmarking
#### 9.3 Model-Specific Optimization
- [ ] Architecture-aware parameter tuning
- [ ] Quantization-aware optimization
- [ ] Attention mechanism tuning
- [ ] Memory layout optimization
---
## Maintenance & Technical Debt
### Recently Fixed (2025/11/25) ✅
**Database Schema Mismatch:**
- ❌ Missing columns: `clusterbasemodel_config`, `clusterservingruntime_config`, `created_clusterbasemodel`, `created_clusterservingruntime`
- ✅ Fixed: Added ALTER TABLE statements
- ✅ Verified: All endpoints working, HTTP 500 errors resolved
### Known Issues
1. **Worker Restart Required**
- ⚠️ ARQ worker doesn't hot-reload code changes
- Manual restart needed after editing `orchestrator.py`, `controllers/`
- **Future**: Add file watcher for auto-restart
2. **Polling-Based UI Updates** (resolved)
   - ⚠️ Frontend previously polled every 2-5 seconds, which was inefficient for idle states
   - ✅ Resolved by the WebSocket migration delivered in Milestone 4
### Technical Improvements
1. **Testing Coverage**
- Current: Manual testing only
- Future: Unit tests, integration tests, E2E tests
- Target: 80% code coverage
2. **Error Handling**
- Current: Basic try-catch blocks
- Future: Comprehensive error taxonomy, retry logic, graceful degradation
3. **Database Migration**
- Current: Manual SQL commands
- Future: Alembic migrations
- Version-controlled schema changes
---
## Success Metrics
### Current Performance (Milestone 5)
| Metric | Value | Target |
|--------|-------|--------|
| **Total Tasks** | 28 | - |
| **Total Experiments** | 408 | - |
| **Success Rate** | 76.5% | >80% |
| **Avg Experiment Duration** | 303.6s | <300s |
| **Bayesian Efficiency** | 80-87% reduction | >70% |
| **UI Response Time** | <200ms | <100ms |
| **API Latency (P95)** | <500ms | <200ms |
| **Supported Runtimes** | 3 (vLLM, SGLang, TRT-LLM) | - |
| **Deployment Modes** | 3 (Docker, OME, Local) | - |
| **Agent Tools** | 12 (across 6 categories) | - |
### Future Targets (v2.0)
- **Experiment Success Rate**: >90%
- **Avg Experiment Duration**: <240s (20% improvement)
- **UI Response Time**: <100ms (WebSocket)
- **Concurrent Experiments**: >10 parallel
- **Cost Reduction**: 50% fewer experiments vs grid search
- **Multi-Runtime Support**: Add Triton, others
---
**End of Roadmap** | Last Updated: 2025/12/22 | Version: 0.2.0 (Milestone 5 Complete)