LLM Autotuner - Product Roadmap¶
Last Updated: 2025/12/22 | Project Status: Production-Ready with Active Development | Current Version: v0.2.0 (Milestone 5 Complete)
Executive Summary¶
The LLM Autotuner is a comprehensive system for automatically optimizing Large Language Model inference parameters. The project has successfully completed five major milestones including tri-mode deployment (Kubernetes/OME, Docker, and Local), full-stack web application, runtime-agnostic configuration, and an AI-powered Agent System.
Key Achievements:
✅ 28 tasks executed, 408 experiments run, 312 successful results
✅ Bayesian optimization achieving 80-87% reduction in experiments vs grid search
✅ Full-stack web application with React frontend and FastAPI backend
✅ Runtime-agnostic quantization and parallelism configuration
✅ GPU-aware optimization with per-GPU efficiency metrics
✅ YAML import/export for configuration management
✅ Real-time WebSocket updates and auto-update notifications
✅ Per-batch SLO filtering with graceful OOM handling
✅ Documentation refinement (66→15 files, 77% reduction)
✅ SLO-aware scoring with exponential penalty functions
✅ Agent System with LLM-powered task management, tool execution, and conversational interface
✅ Local Deployment Mode for subprocess-based model execution without containers
✅ Remote Dataset Support via URL fetching with caching
Milestone Overview¶
| | M1: Core Foundation | M2: Web Interface | M3: Runtime-Agnostic | M4: UI/UX Polish | M5: Agent System | M6+: Future |
|---|---|---|---|---|---|---|
| Date | 2025/10/24 | 2025/10/30 | 2025/11/14 | 2025/12/01 | 2025/12/22 | Planned |
| Status | ✅ Done | ✅ Done | ✅ Done | ✅ Done | ✅ Done | 🔵 Planned |
| Features | ✅ Grid/Random Search | ✅ REST API | ✅ Bayesian Optimization | ✅ WebSocket Updates | ✅ Agent Chat UI | 🔵 Distributed Workers |
Milestone Timeline¶
2025/10/24 ────► Milestone 1: Core Autotuner Foundation
2025/10/30 ────► Milestone 2: Complete Web Interface & Parameter Preset System
2025/11/14 ────► Milestone 3: Runtime-Agnostic Configuration & GPU-Aware Optimization
2025/12/01 ────► Milestone 4: UI/UX Polish, SLO Filtering & Documentation
2025/12/22 ────► Milestone 5: Agent System & Local Deployment Mode
🎉 Milestone 1: Core Autotuner Foundation¶
Date: 2025/10/24 (tag: milestone-1)
Status: ✅ COMPLETED
Objective: Establish solid foundation for LLM inference parameter autotuning with complete functionality, proper documentation, and code standards
Key Accomplishments¶
1.1 Architecture & Implementation ✅¶
Multi-tier architecture with clear separation of concerns
OME controller for Kubernetes InferenceService lifecycle
Docker controller for standalone deployment
Benchmark controller (OME BenchmarkJob + Direct CLI modes)
Parameter grid generator and optimizer utilities
Main orchestrator with JSON input
Technical Specs:
Controllers: `ome_controller.py`, `docker_controller.py`, `benchmark_controller.py`, `direct_benchmark_controller.py`
Utilities: `optimizer.py` (grid search, scoring algorithms)
Templates: Jinja2 for Kubernetes resources
1.2 Benchmark Results Parsing & Scoring ✅¶
Fixed critical bug in genai-bench result file parsing
Enhanced `DirectBenchmarkController._parse_results()`:
Reads correct result files (D*.json pattern)
Handles multiple concurrency levels
Aggregates metrics across all runs
Extracts 15+ performance metrics
Completed calculate_objective_score() with 4 objectives:
`minimize_latency` - E2E latency optimization
`maximize_throughput` - Token throughput optimization
`minimize_ttft` - Time to First Token optimization
`minimize_tpot` - Time Per Output Token optimization
Comprehensive Metrics:
Latency: mean/min/max/p50/p90/p99 E2E latency
Throughput: output and total token throughput
Request statistics: success rate, error tracking
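To make the parsing and scoring flow concrete, here is a minimal sketch; the D*.json field names and the two-metric aggregation are simplified assumptions, not the actual genai-bench schema or the full `calculate_objective_score()` implementation.

```python
# Simplified sketch of result aggregation and objective scoring.
# Field names ("stats", "mean_e2e_latency", "output_throughput") are assumptions.
import glob
import json
import statistics


def aggregate_results(result_dir: str) -> dict:
	"""Aggregate metrics across all D*.json files (one per concurrency level)."""
	latencies, throughputs = [], []
	for path in glob.glob(f"{result_dir}/D*.json"):
		with open(path) as f:
			data = json.load(f)
		latencies.append(data["stats"]["mean_e2e_latency"])
		throughputs.append(data["stats"]["output_throughput"])
	return {
		"mean_e2e_latency": statistics.mean(latencies),
		"mean_throughput": statistics.mean(throughputs),
	}


def calculate_objective_score(metrics: dict, objective: str) -> float:
	"""Lower score is better for every objective."""
	if objective == "minimize_latency":
		return metrics["mean_e2e_latency"]
	if objective == "maximize_throughput":
		return -metrics["mean_throughput"]  # negate so minimization applies
	raise ValueError(f"Unknown objective: {objective}")
```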
1.3 Code Quality & Standards ✅¶
Integrated black-with-tabs formatter
Formatted entire codebase (7 Python files, 1957+ lines)
Configuration: 120-char lines, tab indentation
PEP 8 compliance with 2 blank lines between top-level definitions
IDE integration guides (VS Code, PyCharm)
1.4 CLI Usability Improvements ✅¶
Made `--direct` flag automatic when using `--mode docker`
Simplified command-line interface
Updated help text and usage examples
Better default behaviors for common use cases
1.5 Documentation Structure ✅¶
Separated 420+ line Troubleshooting into `docs/TROUBLESHOOTING.md`
Created `docs/DEVELOPMENT.md` comprehensive guide
Established documentation conventions
Improved README readability
Documentation Files Created:
`README.md` - User guide with installation and usage
`CLAUDE.md` - Project overview and development guidelines
`docs/TROUBLESHOOTING.md` - 13 common issues and solutions
`docs/DEVELOPMENT.md` - Code formatting and contribution guide
`docs/DOCKER_MODE.md` - Docker deployment guide
`docs/OME_INSTALLATION.md` - Kubernetes/OME setup
1.6 Web Integration Readiness ✅¶
Comprehensive codebase analysis: Zero blockers found
Created detailed readiness assessment
Verified all controllers fully implemented (no placeholder functions)
Confirmed orchestrator is programmatically importable
Documented data structures (input/output formats)
Technology stack recommendations (FastAPI, React/Vue)
API endpoint specifications
Implementation roadmap with effort estimates
Technical Achievements¶
Code Quality:
1,957 lines of production Python code
100% method implementation (no placeholders in critical paths)
Comprehensive error handling and logging
Clean separation of concerns (controllers, orchestrator, utilities)
Functionality:
✅ Full Docker mode support (standalone, no K8s required)
✅ OME/Kubernetes mode support
✅ Grid search parameter optimization
✅ Multi-concurrency benchmark execution
✅ Comprehensive result aggregation and scoring
✅ Automatic resource cleanup
Test Results:
Successfully parsed real benchmark data
Concurrency levels: [1, 4]
Mean E2E Latency: 0.1892s
Mean Throughput: 2,304.82 tokens/s
🎉 Milestone 2: Complete Web Interface & Parameter Preset System¶
Date: 2025/10/30 (tag: milestone-2)
Status: ✅ COMPLETED
Objective: Build full-stack web application for task management, visualization, and introduce parameter preset system
Key Accomplishments¶
2.1 Backend API Infrastructure ✅¶
FastAPI application with async support
SQLAlchemy ORM with SQLite backend (moved to `~/.local/share/`)
Database models (Task, Experiment)
REST API endpoints (10+ routes)
ARQ background task queue (Redis integration)
Pydantic schemas for validation
Streaming log API endpoints
Health check improvements
API Endpoints:
POST /api/tasks/ - Create task
POST /api/tasks/{id}/start - Start task execution
GET /api/tasks/ - List tasks
GET /api/tasks/{id} - Get task details
GET /api/tasks/{id}/logs - Stream logs (SSE)
GET /api/experiments/task/{id} - Get experiments
GET /api/docker/containers - List containers
GET /api/system/health - Health check
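For illustration, creating and starting a task through these endpoints could look like the sketch below; the base URL, port, and payload fields are assumptions rather than the exact Task schema.

```python
# Hypothetical client sketch for the task API; payload fields are simplified.
import requests

BASE = "http://localhost:8000"  # assumed default FastAPI port

# Create a task, then start it
task = requests.post(f"{BASE}/api/tasks/", json={
	"name": "llama3-8b-tuning",
	"mode": "docker",
	"parameters": {"max_num_seqs": [64, 128, 256]},
}).json()

requests.post(f"{BASE}/api/tasks/{task['id']}/start")

# Poll task status
status = requests.get(f"{BASE}/api/tasks/{task['id']}").json()
print(status["status"])
```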
Database Migration:
Moved from local `autotuner.db` to XDG-compliant `~/.local/share/autotuner/`
SQLite WAL mode for concurrent writes
Proper session management with async context
2.2 React Frontend Application ✅¶
React 18 with TypeScript
Vite build tooling with hot module replacement
React Router for navigation
TanStack Query (React Query) for API state
Tailwind CSS styling
Recharts for metrics visualization
React Hot Toast for notifications
Pages Implemented:
Dashboard - System overview and statistics
Tasks - Task list with create/list/monitor/restart
NewTask - Task creation wizard with form validation
Experiments - Results visualization with charts
Containers - Docker container monitoring (Docker mode)
Key Components:
`TaskResults.tsx` - Results visualization with Recharts
`LogViewer.tsx` - Real-time log streaming viewer
`Layout.tsx` - Main layout with navigation
Form components with validation
UI Features:
Task creation wizard with parameter presets
Real-time status monitoring (polling-based)
Experiment results table with sorting/filtering
Performance graphs (throughput, latency, TPOT, TTFT)
Container stats (CPU, memory, GPU)
Log streaming with follow mode
URL-based navigation with hash routing
Error notifications with toast messages
2.3 ARQ Worker Integration ✅¶
Background task processing with Redis queue
Worker configuration (max_jobs=5, timeout=2h)
Log redirection to task-specific files
Graceful shutdown handling
Worker management scripts
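A minimal ARQ worker configuration matching the settings above (max_jobs=5, 2-hour timeout) might look like this sketch; the job function name is a placeholder, not the actual worker code.

```python
# Sketch of an ARQ worker configuration; run_autotuning_task is a placeholder.
from arq.connections import RedisSettings


async def run_autotuning_task(ctx, task_id: int) -> None:
	"""Placeholder job: load the task from the DB and run the orchestrator."""
	...


class WorkerSettings:
	functions = [run_autotuning_task]
	redis_settings = RedisSettings(host="localhost", port=6379)
	max_jobs = 5                # concurrent jobs per worker
	job_timeout = 2 * 60 * 60   # 2 hours, in seconds
```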
Log Management:
Task logs: `~/.local/share/autotuner/logs/task_<id>.log`
Worker logs: `logs/worker.log`
Python logging library integration
StreamToLogger for real-time capture
2.4 Task Management Features ✅¶
Task creation UI with form builder
Task restart functionality
Task edit capability
Task status tracking
Real-time log viewing
Environment variable configuration for Docker
2.5 Parameter Preset System (Backend) ✅¶
Parameter preset API (CRUD operations)
Preset merge functionality
Import/export capabilities
System preset seeding
Note: Frontend integration for preset system completed in later sprints.
Bug Fixes & Improvements¶
Critical Fixes:
Fixed best experiment selection bug
Fixed model name field linking
Fixed health check 503 errors
Fixed data display in task view
Refined task restart logic
Enhanced container log viewing
Code Organization:
Reorganized web backend code structure
Separated orchestrator from web modules
Formatted code with Prettier
Improved error handling and validation
Technical Stack¶
| Component | Technology |
|---|---|
| Frontend | React 18, TypeScript, Vite 5 |
| State Management | TanStack Query 5 |
| Styling | Tailwind CSS 3 |
| Charts | Recharts 2 |
| Backend | FastAPI, Python 3.10+ |
| Database | SQLite 3 with SQLAlchemy 2 |
| Task Queue | ARQ 0.26 + Redis 7 |
| API Docs | Swagger UI (OpenAPI) |
Statistics¶
Commits since Milestone 1: 40+
Frontend Components: 20+ React components
API Endpoints: 15+ routes
Database Tables: 2 (tasks, experiments)
Lines of Code: ~12,000 total (5,000 backend + 7,000 frontend)
🎉 Milestone 3: Runtime-Agnostic Configuration Architecture & GPU-Aware Optimization¶
Date: 2025/11/14 (tag: milestone-3)
Status: ✅ COMPLETED
Timeline: 2025/11/10 → 2025/11/14
Objective: Unified configuration abstraction for quantization and parallelism across multiple runtimes, plus GPU-aware optimization
Overview¶
Milestone 3 achieved two major architectural breakthroughs:
Runtime-Agnostic Configuration System - Unified abstraction for quantization and parallel execution across vLLM, SGLang, and TensorRT-LLM
GPU-Aware Optimization - Per-GPU efficiency metrics enabling fair comparison across different parallelism strategies
These foundational changes enable portable, efficiency-aware autotuning where users specify high-level intent and the system automatically maps to runtime-specific implementations while optimizing for per-GPU efficiency.
Part 1: Runtime-Agnostic Configuration System¶
1.1 Quantization Configuration Abstraction ✅¶
Problem Solved: Different inference runtimes use incompatible CLI syntax for quantization. Users had to learn runtime-specific arguments and rewrite configurations when switching engines.
Solution: Three-Layer Abstraction Architecture
Four-Field Normalized Schema:
{
"gemm_dtype": "fp8", # Weight/activation quantization
"kvcache_dtype": "fp8_e5m2", # KV cache compression
"attention_dtype": "auto", # Attention compute precision
"moe_dtype": "auto" # MoE expert quantization
}
Modules Created:
`quantization_mapper.py` (450 lines)
Runtime-specific CLI argument mapping
5 production presets: `default`, `kv-cache-fp8`, `dynamic-fp8`, `bf16-stable`, `aggressive-moe`
Validation with dtype compatibility checking
Automatic detection of offline quantization (AWQ, GPTQ, GGUF)
`quantization_integration.py` (350 lines)
Orchestrator integration layer
Experiment parameter preparation
Conflict resolution between user params and quant config
Runtime Mapping Example:
User Config vLLM Args SGLang Args
────────────────────────────────────────────────────────────────────────────
gemm_dtype: "fp8" → --quantization fp8 --quantization fp8
kvcache_dtype: "fp8_e5m2" → --kv-cache-dtype fp8_e5m2 --kv-cache-dtype fp8_e5m2
attention_dtype: "fp8" → (inferred from gemm) --attention-backend fp8
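A minimal sketch of the normalized-schema-to-CLI mapping shown in the table above; the real `quantization_mapper.py` handles validation, presets, and more fields, so this is illustrative only.

```python
# Sketch of normalized quantization schema → runtime CLI arguments,
# based on the mapping table above; not the actual mapper implementation.
def map_quant_config(config: dict, runtime: str) -> list[str]:
	args = []
	if config.get("gemm_dtype", "auto") != "auto":
		args += ["--quantization", config["gemm_dtype"]]      # vLLM and SGLang
	if config.get("kvcache_dtype", "auto") != "auto":
		args += ["--kv-cache-dtype", config["kvcache_dtype"]]
	if runtime == "sglang" and config.get("attention_dtype", "auto") != "auto":
		args += ["--attention-backend", config["attention_dtype"]]
	return args


print(map_quant_config({"gemm_dtype": "fp8", "kvcache_dtype": "fp8_e5m2"}, "vllm"))
# ['--quantization', 'fp8', '--kv-cache-dtype', 'fp8_e5m2']
```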
Grid Expansion with __quant__ Prefix:
{
"quant_config": {
"gemm_dtype": ["auto", "fp8"],
"kvcache_dtype": ["auto", "fp8_e5m2"]
}
}
Expands to 4 experiments (2×2):
`__quant__gemm_dtype=auto, __quant__kvcache_dtype=auto`
`__quant__gemm_dtype=auto, __quant__kvcache_dtype=fp8_e5m2`
`__quant__gemm_dtype=fp8, __quant__kvcache_dtype=auto`
`__quant__gemm_dtype=fp8, __quant__kvcache_dtype=fp8_e5m2`
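The expansion itself is a plain Cartesian product; a small sketch of how the prefixed grid could be generated:

```python
# Expand a quant_config block into prefixed grid parameters (2 × 2 = 4 combos).
from itertools import product

quant_config = {
	"gemm_dtype": ["auto", "fp8"],
	"kvcache_dtype": ["auto", "fp8_e5m2"],
}

keys = list(quant_config)
combos = [
	{f"__quant__{k}": v for k, v in zip(keys, values)}
	for values in product(*(quant_config[k] for k in keys))
]
print(len(combos))  # 4 experiments
```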
Frontend Integration:
`QuantizationConfigForm.tsx` (612 lines)
Preset mode vs. Custom mode toggle
Real-time preview of generated parameters
Combination count calculation
Validation feedback
1.2 Parallel Configuration Abstraction ✅¶
Normalized Parameter Schema:
{
"tp": 4, # Tensor parallelism
"pp": 1, # Pipeline parallelism
"dp": 2, # Data parallelism
"dcp": 1, # Decode context parallelism (vLLM)
"cp": 1, # Context parallelism (TensorRT-LLM)
"ep": 1, # Expert parallelism (MoE)
"moe_tp": 1, # MoE tensor parallelism
"moe_ep": 1 # MoE expert parallelism
}
Modules Created:
`parallel_mapper.py` (520 lines)
18 runtime-specific presets (6 per engine)
Constraint validation (e.g., SGLang: `tp % dp == 0`; TensorRT-LLM: no DP support)
world_size calculation: `world_size = tp × pp × dp`
`parallel_integration.py` (280 lines)
Parameter grid expansion
Orchestrator integration
GPU allocation coordination
Presets Per Engine:
vLLM (6 presets):
- single-gpu, high-throughput, large-model-tp, large-model-tp-pp
- moe-optimized, long-context (with dcp), balanced
SGLang (6 presets):
- single-gpu, high-throughput, large-model-tp, large-model-tp-pp
- moe-optimized (with moe_dense_tp), balanced
- Constraint: tp % dp == 0
TensorRT-LLM (6 presets):
- single-gpu, large-model-tp, large-model-tp-pp
- moe-optimized (with moe_tp, moe_ep), long-context (with cp)
- Constraint: No data parallelism support (dp must be 1)
Runtime Mapping Example:
User Config vLLM Args SGLang Args
─────────────────────────────────────────────────────────────────────────────────
tp: 4 → --tensor-parallel-size 4 --tp-size 4
pp: 1 → --pipeline-parallel-size 1 (not supported)
dp: 2 → --distributed-executor-backend ray --dp-size 2
--num-gpu-blocks-override
Grid Expansion with __parallel__ Prefix:
{
"parallel_config": {
"tp": [2, 4],
"pp": 1,
"dp": [1, 2]
}
}
Expands to 4 experiments (2×2).
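A small sketch of the constraint checks and world_size calculation described above; runtime names and error messages are illustrative, not the exact `parallel_mapper.py` behavior.

```python
# Sketch of per-runtime parallelism constraint validation.
def validate_parallel_config(config: dict, runtime: str) -> int:
	tp = config.get("tp", 1)
	pp = config.get("pp", 1)
	dp = config.get("dp", 1)
	if runtime == "sglang" and tp % dp != 0:
		raise ValueError(f"SGLang requires tp % dp == 0 (got tp={tp}, dp={dp})")
	if runtime == "tensorrt-llm" and dp != 1:
		raise ValueError("TensorRT-LLM does not support data parallelism (dp must be 1)")
	return tp * pp * dp  # world_size: GPUs required for this experiment


print(validate_parallel_config({"tp": 2, "pp": 1, "dp": 2}, "vllm"))  # 4
```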
Frontend Integration:
`ParallelConfigForm.tsx` (similar to QuantizationConfigForm)
Preset mode with 18 runtime-specific presets
Custom mode with constraint validation
GPU requirement calculation
Real-time parameter preview
Part 2: GPU-Aware Optimization¶
2.1 Per-GPU Efficiency Metrics ✅¶
Problem Solved: Traditional throughput metrics favor higher parallelism blindly. A configuration using 8 GPUs with 100 tokens/s looks better than 2 GPUs with 60 tokens/s, but the latter is 2.4× more efficient per GPU.
Solution: Per-GPU Throughput Calculation
Formula:
per_gpu_throughput = total_throughput / gpu_count
Example Comparison:
Config A: TP=2, throughput=661.36 tokens/s → 330.68 tokens/s/GPU
Config B: TP=4, throughput=628.22 tokens/s → 157.06 tokens/s/GPU
Winner: Config A (2.1× more efficient)
Implementation:
GPU info recorded in database: `gpu_info` JSON field
Contains: `model`, `count`, `device_ids`, `world_size`
Automatic calculation during scoring
Frontend displays both total and per-GPU metrics
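The per-GPU calculation from the formula and example above, as a tiny sketch:

```python
# Per-GPU efficiency comparison using the numbers from the example above.
def per_gpu_throughput(total_throughput: float, gpu_count: int) -> float:
	return total_throughput / gpu_count

config_a = per_gpu_throughput(661.36, 2)  # 330.68 tokens/s/GPU
config_b = per_gpu_throughput(628.22, 4)  # 157.06 tokens/s/GPU
print(round(config_a / config_b, 1))      # ~2.1x more efficient per GPU
```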
2.2 GPU Information Tracking ✅¶
Database Schema:
gpu_info = {
"model": "NVIDIA A100",
"count": 2,
"device_ids": [0, 1],
"world_size": 2
}
Recording Logic:
Captured during experiment setup
Stored in `experiments.gpu_info` column (JSON)
Used for per-GPU metric calculation
Displayed in results table
2.3 Enhanced Result Visualization ✅¶
Frontend Enhancements:
Added “GPUs” column to experiment table
Display: `2 (A100)` or `4 (H100)`
Tooltip shows device IDs and world size
Per-GPU throughput column
Color coding for efficiency comparison
Charts:
Per-GPU efficiency scatter plot
GPU count vs throughput line chart
Pareto frontier with GPU cost consideration
Technical Achievements¶
Code Additions:
Quantization System: 800 lines (mapper + integration)
Parallel System: 800 lines (mapper + integration)
GPU Tracking: 200 lines (backend + frontend)
Frontend Forms: 1,200 lines (Quant + Parallel components)
Documentation: 3 new docs (QUANTIZATION, PARALLEL, GPU_TRACKING)
Total: ~3,000 lines of new production code
Functionality:
✅ Support for 3 inference runtimes (vLLM, SGLang, TensorRT-LLM)
✅ 5 quantization presets + custom mode
✅ 18 parallelism presets (6 per runtime)
✅ Automatic runtime-specific CLI mapping
✅ Constraint validation and conflict resolution
✅ Per-GPU efficiency metrics
✅ GPU information persistence
Documentation:
`docs/QUANTIZATION_CONFIGURATION.md`
`docs/PARALLEL_CONFIGURATION.md`
`docs/GPU_TRACKING.md`
🎉 Milestone 4: UI/UX Polish, SLO Filtering & Documentation¶
Date: 2025-12-01 (tag: milestone-4)
Status: ✅ COMPLETED
Timeline: 2025-11-15 → 2025-12-01
Objective: Transform from functional prototype to production-ready platform with professional UI, SLO filtering, and comprehensive documentation
Key Accomplishments¶
4.1 Frontend UI/UX Enhancements ✅¶
Real-time WebSocket updates (<100ms latency)
YAML import/export for task configurations
Auto-update notification system (GitHub releases)
Enhanced result visualization with SLO reference lines
Custom logo and branding (SVG icon + favicon)
Protected completed tasks (hidden edit/cleanup buttons)
Clickable task names for details view
UI refinements (width-limited controls, placeholder cleanup)
YAML Import/Export System:
// Import: Full-page drag-and-drop zone
<TaskYAMLImport onImport={(config) => populateForm(config)} />
// Export: Single-click download
<button onClick={() => exportTaskAsYAML(task)}>Export YAML</button>
Auto-Update Notifications:
Automatic version checking against GitHub releases
Notification banner when updates available
Build timestamp tracking:
v1.0.0+20251203T195130Z
4.2 SLO-Aware Benchmarking ✅¶
Per-batch SLO filtering (filter non-compliant batches before aggregation)
Graceful OOM handling (partial success support)
Visual SLO indicators (reference lines on performance charts)
Detailed compliance logging per batch
Per-Batch Filtering Example:
[Benchmark] Filtering 4 batches by SLO compliance...
[Benchmark] ✗ Batch concurrency=8 violated SLO: {'p90': {'threshold': 5.0, 'actual': 6.2}}
[Benchmark] ✓ 3/4 batches passed SLO
[Benchmark] Max throughput: 145.2 req/s (from 3 SLO-compliant batches)
Graceful Degradation:
Experiments succeed if at least one batch completes
Partial results better than no results
OOM at high concurrency doesn’t invalidate low-concurrency data
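A simplified sketch of the per-batch filtering idea, consistent with the log output above; the batch and SLO field names are assumptions, not the actual `check_batch_slo_compliance()` signature.

```python
# Sketch: keep only batches whose latency percentiles meet every SLO threshold.
def filter_batches_by_slo(batches: list[dict], slo: dict[str, float]) -> list[dict]:
	compliant = []
	for batch in batches:
		violations = {
			name: {"threshold": limit, "actual": batch[name]}
			for name, limit in slo.items()
			if batch[name] > limit
		}
		if violations:
			print(f"✗ Batch concurrency={batch['concurrency']} violated SLO: {violations}")
		else:
			compliant.append(batch)
	return compliant


# Max throughput is then taken only from SLO-compliant batches:
# max_throughput = max(b["throughput"] for b in filter_batches_by_slo(batches, {"p90": 5.0}))
```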
4.3 Documentation Refinement ✅¶
Aggressive cleanup (66 → 15 files, 77% reduction)
Content merges (GENAI_BENCH_LOGS → TROUBLESHOOTING, etc.)
Reference fixes (zero broken links across all docs)
Focus on long-term maintainability
15 Essential Files Kept:
User Guides (4): QUICKSTART, DOCKER_MODE, OME_INSTALLATION, TROUBLESHOOTING
Architecture (3): DEPLOYMENT_ARCHITECTURE, GPU_TRACKING, ROADMAP
Features (4): BAYESIAN_OPTIMIZATION, SLO_SCORING, PARALLEL_EXECUTION, WEBSOCKET_IMPLEMENTATION
Configuration (4): UNIFIED_QUANTIZATION_PARAMETERS, PARALLEL_PARAMETERS, PRESET_QUICK_REFERENCE, PVC_STORAGE
4.4 Bug Fixes & Infrastructure ✅¶
Template parameter fix (OME InferenceService: `params=parameters` instead of `**parameters`)
API proxy configuration (fixed hardcoded URLs in service files)
Pydantic settings fix (added `extra='ignore'` for VITE_* variables)
4.5 Documentation Website ✅¶
Sphinx documentation with Furo theme
GitHub Actions workflow for automated deployment
MyST Parser for Markdown support
Auto-generated API documentation (autodoc)
Organized directory structure (getting-started, user-guide, features, api)
Dark mode support with custom branding
Technical Achievements¶
Code Statistics:
Frontend: ~800 lines (YAML I/O, auto-update, SLO visualization)
Backend: ~400 lines (SLO filtering, OOM handling, fixes)
Total New Code: ~1,200 lines
Documentation: Sphinx site with 15+ pages, GitHub Actions CI/CD
Components Created:
`TaskYAMLImport.tsx` (180 lines) - Drag-and-drop import with validation
`TaskYAMLExport.tsx` (80 lines) - Single-click YAML export
`UpdateNotification.tsx` (110 lines) - Auto-update banner with GitHub integration
`versionService.ts` (60 lines) - Version checking service
`check_batch_slo_compliance()` (133 lines) - Per-batch SLO validation
`docs/conf.py` - Sphinx configuration with Furo theme
`.github/workflows/docs.yml` - GitHub Pages deployment workflow
Files Modified:
Frontend: Tasks.tsx, TaskResults.tsx, NewTask.tsx, Logo.tsx (10+ files)
Backend: optimizer.py, direct_benchmark_controller.py, config.py (5 files)
Documentation: README.md, CLAUDE.md, ROADMAP.md (reference fixes)
Performance Impact¶
| Metric | Before M4 | After M4 | Improvement |
|---|---|---|---|
| UI Response Time | 2-5s polling | <100ms WebSocket | 20-50x faster |
| Config Reusability | Manual JSON edit | YAML import/export | Instant |
| Update Awareness | Manual check | Auto-notification | Automatic |
| SLO Visibility | Numbers only | Visual ref lines | Intuitive |
| OOM Resilience | Experiment fails | Partial success | Graceful |
| Doc Files | 66 files | 15 files | 77% reduction |
Impact Summary¶
For Users:
✅ Faster feedback: WebSocket real-time updates
✅ Better visualization: SLO reference lines, enhanced charts
✅ Config management: YAML import/export workflow
✅ Stay updated: Automatic version checking
✅ Fewer failures: Graceful OOM handling
✅ Cleaner UI: Protected actions, clickable names
✅ Professional branding: Custom logo and favicon
For Operators:
✅ Easier troubleshooting: Per-batch SLO logging
✅ Better resource utilization: Partial success support
✅ Clearer documentation: 15 essential files vs 66
✅ No broken links: All references verified
For Developers:
✅ Maintainable docs: Focused, merged content
✅ Working examples: Templates verified
✅ Clear architecture: Essential docs only
✅ Build tracking: Timestamp in version display
🎉 Milestone 5: Agent System & Local Deployment Mode¶
Date: 2025-12-22 (tag: milestone-5)
Status: ✅ COMPLETED
Timeline: 2025-12-01 → 2025-12-22
Objective: Introduce LLM-powered Agent System for conversational task management and add Local Deployment Mode for faster development iteration
Key Accomplishments¶
5.1 Agent Chat Interface ✅¶
Full-featured chat UI with streaming markdown responses (`/agent`)
Session management with persistent conversation history
Editable session titles with auto-generation from first message
IndexedDB-based message storage with backend sync
Server-Sent Events (SSE) for real-time streaming responses
Architecture:
┌─────────────────────┐ SSE Stream ┌──────────────────┐
│ AgentChat.tsx │◄───────────────────►│ /api/agent/chat │
│ (React Frontend) │ │ (FastAPI) │
└─────────────────────┘ └────────┬─────────┘
│ │
│ Markdown │ OpenAI API
▼ ▼
┌─────────────────────┐ ┌──────────────────┐
│ StreamingMarkdown │ │ LLM Backend │
│ (react-markdown) │ │ (Configurable) │
└─────────────────────┘ └──────────────────┘
5.2 Tool Execution Framework ✅¶
Authorization system with `AuthorizationScope` enum (none, privileged, dangerous)
Privileged tools require user approval before execution
Auto-execute pending tools after authorization granted
Clear visual indicators for tool status in chat UI
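A minimal sketch of the authorization scopes described above; exact member names and the approval flow in the real code base may differ.

```python
# Sketch of the three authorization scopes and an approval check.
from enum import Enum


class AuthorizationScope(Enum):
	NONE = "none"              # runs immediately, no approval needed
	PRIVILEGED = "privileged"  # requires explicit user approval in the chat UI
	DANGEROUS = "dangerous"    # requires approval and extra confirmation


def requires_approval(scope: AuthorizationScope) -> bool:
	return scope is not AuthorizationScope.NONE
```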
Tool Categories Implemented:
| Category | Tools | Authorization |
|---|---|---|
| Task Management | | None |
| Worker Control | | Privileged |
| System Utilities | | None |
| GitHub Integration | | None |
| HuggingFace CLI | | None |
| Experiment Analysis | | None |
5.3 Agent Backend Architecture ✅¶
LangChain framework for flexible model support
Support for Claude (Anthropic) and open-source models
Max iterations increased from 10 to 100 for complex tasks
Automatic tool result handling for multi-step operations
API Endpoints:
`POST /api/agent/chat` - SSE streaming chat endpoint
`GET /api/agent/sessions` - List all sessions
`POST /api/agent/sessions` - Create new session
`PUT /api/agent/sessions/{id}` - Update session title
`DELETE /api/agent/sessions/{id}` - Delete session
`GET /api/agent/sessions/{id}/messages` - Get session messages
5.4 Streaming Markdown Component ✅¶
Paragraph-aware streaming that preserves atomic elements
GitHub Flavored Markdown support (tables, code blocks, task lists)
Copy buttons for code blocks and tables (copies source, not rendered HTML)
Tailwind typography styling for consistent appearance
5.5 Local Deployment Mode ✅¶
New `LocalController` for subprocess-based model execution (see the sketch after the runtime support list below)
Direct vLLM/SGLang server launch via `python -m` commands
Automatic port allocation (30000-30100 range)
Process lifecycle management with graceful shutdown
Log capture and streaming to task log files
Runtime Support:
vLLM local environment: `.venv-vllm/` with CUDA 12
SGLang local environment: `.venv-sglang/` (SM86 limitation)
Automatic environment detection and activation
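A minimal sketch of the subprocess-launch idea behind `LocalController`, assuming the venv layout above; the CLI flags and log path are illustrative, not the exact implementation.

```python
# Sketch: find a free port in 30000-30100 and launch a local vLLM server.
import socket
import subprocess


def find_free_port(start: int = 30000, end: int = 30100) -> int:
	for port in range(start, end + 1):
		with socket.socket() as s:
			if s.connect_ex(("127.0.0.1", port)) != 0:  # nothing listening → free
				return port
	raise RuntimeError("No free port in range 30000-30100")


def launch_vllm(model: str, extra_args: list[str]) -> tuple[subprocess.Popen, int]:
	port = find_free_port()
	cmd = [
		".venv-vllm/bin/python", "-m", "vllm.entrypoints.openai.api_server",
		"--model", model, "--port", str(port), *extra_args,
	]
	with open(f"logs/local_{port}.log", "w") as log_file:
		proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
	return proc, port
```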
5.6 Dataset URL Support ✅¶
Remote URL dataset loading (CSV, JSONL, compressed archives)
Automatic format detection and conversion
URL-hash based caching in `~/.local/share/autotuner/datasets/`
Deduplication option for prompt datasets
genai-bench submodule updated to fork with URL support
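A small sketch of URL-hash based caching using the cache directory mentioned above; the hashing scheme and file naming are assumptions.

```python
# Sketch: download a remote dataset once, then reuse the cached copy.
import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".local/share/autotuner/datasets"


def fetch_dataset(url: str) -> Path:
	CACHE_DIR.mkdir(parents=True, exist_ok=True)
	key = hashlib.sha256(url.encode()).hexdigest()[:16]
	cached = CACHE_DIR / f"{key}_{Path(url).name}"
	if not cached.exists():
		urllib.request.urlretrieve(url, cached)
	return cached
```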
5.7 Additional Improvements ✅¶
HuggingFace offline mode: Fixed cache path handling for air-gapped environments
GitHub Issue #3 fix: Resolved `analyze_slo_violations()` AttributeError
Foldable experiments list: Collapsible UI for large experiment sets
Comprehensive parameter presets: Runtime-specific vLLM and SGLang presets
Project rebranding: Logo redesign inspired by Novita.ai style
Technical Achievements¶
Code Statistics:
Agent Frontend: ~800 lines (AgentChat.tsx, StreamingMarkdown.tsx, AgentMessage.tsx)
Agent Backend: ~600 lines (routes/agent.py, schemas/agent.py, services/agent_service.py)
Tools: ~500 lines (task_tools.py, worker_tools.py, github_tools.py, hf_tools.py)
Local Mode: ~400 lines (local_controller.py, autotuner_worker.py updates)
Dataset: ~200 lines (dataset_controller.py)
Total New Code: ~2,500 lines
Key Components Created:
`AgentChat.tsx` (~400 lines) - Main chat interface with message history
`StreamingMarkdown.tsx` (~350 lines) - Paragraph-aware markdown renderer
`LocalController` (~300 lines) - Subprocess-based deployment controller
`DatasetController` (~200 lines) - Remote dataset fetching and caching
12 agent tools across 6 categories
Performance Impact¶
| Metric | Before M5 | After M5 | Improvement |
|---|---|---|---|
| Task Creation | Form-only | Conversational + Form | Natural language |
| Deployment Setup | Docker/K8s required | Local subprocess option | Faster iteration |
| Dataset Loading | Local files only | Remote URL support | More flexible |
| Agent Iterations | N/A | Up to 100 steps | Complex workflows |
Impact Summary¶
For Users:
✅ Natural language task creation and management via Agent
✅ Faster local development without Docker/Kubernetes overhead
✅ Remote dataset support for production workload testing
✅ GitHub integration for issue tracking and collaboration
✅ HuggingFace integration for model management
For Developers:
✅ Local deployment mode for rapid iteration
✅ Extensible tool framework for custom integrations
✅ Comprehensive logging and debugging support
✅ Session-based conversation history
Current Status: Production-Ready v0.2.0 ✅¶
What Works Today¶
Core Functionality:
✅ Grid search, random search, Bayesian optimization (Optuna TPE)
✅ Docker mode deployment (recommended)
✅ Kubernetes/OME mode deployment
✅ Local mode deployment (subprocess-based, no containers)
✅ Runtime-agnostic quantization configuration (vLLM, SGLang, TensorRT-LLM)
✅ Runtime-agnostic parallelism configuration (18 presets)
✅ SLO-aware scoring with exponential penalties
✅ GPU intelligent scheduling with per-GPU efficiency metrics
✅ Checkpoint mechanism for fault tolerance
✅ Multi-objective Pareto optimization
✅ Model caching optimization
✅ Full-stack web UI with real-time monitoring
✅ Agent System with LLM-powered conversational interface
✅ Remote dataset support via URL fetching
Performance:
✅ 28 tasks executed successfully
✅ 408 total experiments run
✅ 312 successful experiments (76.5% success rate)
✅ Average experiment duration: 303.6 seconds
✅ Bayesian optimization: 80-87% reduction vs grid search
Infrastructure:
✅ FastAPI backend with async support
✅ React 18 frontend with TypeScript
✅ WebSocket real-time communication (backend + frontend)
✅ SQLite database with WAL mode (XDG-compliant location)
✅ Redis task queue with ARQ worker
✅ Docker container management
✅ Kubernetes resource management
✅ LangChain-based agent framework with 12 tools
Future Roadmap¶
🔵 Phase 6: Distributed Architecture & Parallel Execution (Planned)¶
Priority: High Effort: 3-4 weeks Value: ⭐⭐⭐⭐⭐
6.1 Distributed Worker Architecture¶
Central Web Manager: Single control plane for multiple workers
Worker Registration: Auto-discovery and registration via Redis
Heartbeat Monitoring: Worker health checks and failure detection
Work Stealing: Dynamic task redistribution across workers
Worker Pools: Group workers by capabilities (GPU type, region, etc.)
Architecture Design:
┌─────────────────────┐
│ Central Web Manager│
│ (FastAPI + Redis) │
└──────────┬──────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Worker 1 │ │ Worker 2 │ │ Worker 3 │
│ 8×A100 GPUs│ │ 8×H100 GPUs│ │ 4×L40S GPUs│
│ Node: gpu-1│ │ Node: gpu-2│ │ Node: gpu-3│
└─────────────┘ └─────────────┘ └─────────────┘
Components:
Manager:
Task queue management
Worker registry with capabilities
Experiment distribution algorithm
Result aggregation service
Centralized logging
Worker:
Capability advertisement (GPU count, model, memory)
Experiment execution engine
Result reporting via REST API
Local checkpoint storage
Worker-level parallelism (max_parallel per worker)
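As a sketch of how worker registration and heartbeats could work over Redis (key names and payload fields are illustrative only):

```python
# Sketch: register a worker's capabilities and refresh a TTL heartbeat key.
# The manager treats an expired heartbeat as a failed worker and redistributes work.
import json
import socket
import time

import redis

r = redis.Redis(host="manager-host", port=6379)

worker_id = socket.gethostname()
capabilities = {"gpu_model": "A100", "gpu_count": 8, "max_parallel": 2}

r.hset("workers", worker_id, json.dumps(capabilities))  # one-time registration
while True:
	r.set(f"heartbeat:{worker_id}", int(time.time()), ex=30)  # expires in 30s
	time.sleep(10)
```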
Benefits:
Horizontal Scaling: Add workers to increase throughput
Resource Isolation: Different workers for different GPU types
Fault Tolerance: Worker failures don’t affect others
Geographic Distribution: Workers in different data centers
Cost Optimization: Use spot instances for workers
Implementation Plan:
Week 1: Worker registration and discovery
Week 2: Task distribution and scheduling
Week 3: Result aggregation and monitoring
Week 4: Frontend dashboard and testing
6.2 Advanced Parallel Execution¶
User-configurable max_parallel setting (currently hardcoded at 5)
Per-worker parallelism configuration
Dynamic parallelism based on GPU availability
Experiment dependency graph
Priority-based scheduling (high/normal/low priority tasks)
Resource reservation system (reserve GPUs for specific tasks)
Benefits:
Faster task completion (5-10x speedup with multiple workers)
Better GPU utilization across cluster
Configurable resource allocation per task
Fair scheduling with priority queues
6.3 Task Sharding & Load Balancing¶
Automatic task splitting across workers
Load-aware scheduling (balance by GPU count)
Locality-aware scheduling (prefer same-node experiments)
Cross-worker result aggregation
Consistent hashing for worker selection
🔵 Phase 7: Advanced Optimization & Runtime Features (Planned)¶
Priority: Medium Effort: 2-4 weeks Value: ⭐⭐⭐⭐
7.0 Agent Charting Tool¶
Add chart generation tool for Agent to visualize experiment results
Candidates: Matplotlib (static images), Plotly (interactive HTML)
Chart types: bar charts, line plots, scatter plots, heatmaps
Use cases:
Compare throughput/latency across experiments
Visualize parameter sensitivity
Generate Pareto frontier plots
Create SLO compliance charts
Output: Save charts to files or display inline in chat
Implementation Options:
| Library | Pros | Cons |
|---|---|---|
| Matplotlib | Simple, widely used, static images | Not interactive |
| Plotly | Interactive, HTML export, beautiful | Larger dependency |
| Seaborn | Statistical plots, built on Matplotlib | Limited interactivity |
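As an illustration of the Matplotlib option, a minimal chart-generation sketch that an agent tool could run server-side (example data only):

```python
# Sketch: render a throughput comparison chart to a PNG file.
import matplotlib

matplotlib.use("Agg")  # headless rendering for a server-side agent tool
import matplotlib.pyplot as plt

experiments = ["exp-1", "exp-2", "exp-3"]
throughput = [612.4, 661.4, 628.2]  # tokens/s (example values)

plt.bar(experiments, throughput)
plt.ylabel("Throughput (tokens/s)")
plt.title("Throughput by experiment")
plt.savefig("throughput_comparison.png", dpi=150)
```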
7.1 Runtime-Specific Optimizations¶
SGLang Radix Cache Management:
Reset radix cache at experiment start: Clear cache before each experiment
Benchmark purity: Ensure fair comparison without cache pollution
Cache warming option: Optional pre-fill for production scenarios
Cache statistics tracking: Monitor hit rate and memory usage
Implementation:
```python
# Before each experiment (proposed; endpoint name and payload subject to change)
import logging
import requests

logger = logging.getLogger(__name__)

def reset_sglang_radix_cache(port: int) -> None:
	"""Reset SGLang radix cache via HTTP API"""
	response = requests.post(
		f"http://localhost:{port}/reset_cache",
		json={"cache_type": "radix"},
	)
	logger.info(f"Radix cache reset: {response.json()}")
```
Benefits:
Fair experiment comparisons (no cached KV states)
Reproducible benchmark results
Accurate TTFT measurements
Option to test both cold-start and warm-cache scenarios
Additional Runtime Features:
vLLM prefix caching control
TensorRT-LLM engine rebuild triggers
Runtime-specific profiling hooks
Memory defragmentation between experiments
7.2 Multi-Fidelity Optimization¶
Progressive benchmark complexity
Early stopping for poor configurations
Hyperband algorithm integration
Adaptive resource allocation
Quick validation runs (low concurrency, short duration)
Full benchmark only for promising configs
7.3 Transfer Learning¶
Model similarity detection (architecture, size, quantization)
Cross-model parameter transfer
Historical performance database (SQLite → PostgreSQL)
Meta-learning for initialization
Warmstart Bayesian optimization with historical data
7.4 Enhanced Multi-Objective Optimization¶
NSGA-II algorithm for Pareto frontier
3+ objective support (latency, throughput, cost, energy, memory)
Interactive trade-off exploration
User preference learning
Weighted objective combination
Pareto frontier approximation with surrogate models
7.5 Enhanced Export & Data Portability¶
Export experiment results to CSV
Export results to JSON for analysis
Export results to Excel (.xlsx) format
Batch import multiple task configs
Template library (export/import task templates)
Share configurations via file or URL
YAML parser with schema validation
Automatic conversion between JSON ↔ YAML
YAML syntax highlighting in frontend
Benefits:
Data portability for external analysis tools (Excel, Python, R)
Batch operations for managing multiple tasks
Configuration templates for common use cases
Team collaboration via shared configs
Integration with data science workflows
Export Formats:
Experiment Results:
.csv,.json,.xlsxTask Configs:
.yaml,.jsonTemplates: Zip archive with metadata
7.6 Custom Dataset Support for GenAI-Bench¶
Fetch datasets from user-specified URLs
Support CSV format parsing
Support JSONL (JSON Lines) format parsing
Conversion script to genai-bench format
Dataset validation and preprocessing
Automatic schema detection
Support for custom prompt templates
Integration with task configuration
Supported Input Formats:
# CSV format
prompt,max_tokens,temperature
"Explain quantum computing",100,0.7
"Write a story about AI",200,0.9
# JSONL format
{"prompt": "Explain quantum computing", "max_tokens": 100, "temperature": 0.7}
{"prompt": "Write a story about AI", "max_tokens": 200, "temperature": 0.9}
Conversion Pipeline:
# Download and convert custom dataset
python scripts/prepare_custom_dataset.py \
--url https://example.com/dataset.csv \
--format csv \
--output ./data/custom_benchmark.json
# Use in task configuration
{
"benchmark": {
"custom_dataset": "./data/custom_benchmark.json",
"task": "text-to-text"
}
}
Features:
HTTP/HTTPS URL fetching with authentication support
Automatic format detection (CSV/JSONL)
Field mapping configuration (map CSV columns to genai-bench schema)
Data validation (check required fields, token limits)
Sampling strategies (random, stratified, sequential)
Dataset caching to avoid re-downloading
Benefits:
Use real production workloads for benchmarking
Test with domain-specific prompts
Reproducible benchmarks with versioned datasets
Support for custom evaluation scenarios
Integration with existing data pipelines
GenAI-Bench Schema Mapping:
# Required fields for genai-bench
{
"prompt": str, # Input text
"output_len": int, # Expected output length
"input_len": int, # Input length (auto-calculated if not provided)
"temperature": float, # Optional: sampling temperature
"top_p": float, # Optional: nucleus sampling
"max_tokens": int # Optional: max output tokens
}
Implementation Components:
Dataset Fetcher (`src/utils/dataset_fetcher.py`)
URL download with retries
Authentication headers support
Local file caching
Format Converters (`src/utils/dataset_converters/`)
`csv_converter.py`: CSV → genai-bench JSON
`jsonl_converter.py`: JSONL → genai-bench JSON
Base converter interface for extensibility
Validation Module (`src/utils/dataset_validator.py`)
Schema validation
Token limit checking
Duplicate detection
CLI Tool (`scripts/prepare_custom_dataset.py`)
Standalone conversion utility
Preview mode (show first N records)
Statistics reporting
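A simplified sketch of the CSV conversion step, consistent with the schema mapping above; default values and the output layout are assumptions.

```python
# Sketch: convert a CSV prompt dataset into a genai-bench-style JSON file.
import csv
import json


def convert_csv(path: str, output: str) -> None:
	records = []
	with open(path, newline="") as f:
		for row in csv.DictReader(f):
			records.append({
				"prompt": row["prompt"],
				"output_len": int(row.get("max_tokens") or 100),
				"temperature": float(row.get("temperature") or 1.0),
			})
	with open(output, "w") as f:
		json.dump(records, f, indent=2)
```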
🔵 Phase 8: Enterprise Features (Planned)¶
Priority: Low-Medium Effort: 3-5 weeks Value: ⭐⭐⭐
8.1 Multi-User Support¶
User authentication (OAuth2)
Role-based access control (RBAC)
Task ownership and sharing
Team workspaces
8.2 Advanced Monitoring¶
Prometheus metrics exporter
Grafana dashboard templates
Alert rules for failures
Performance analytics
8.3 CI/CD Integration¶
GitHub Actions workflow
Automated benchmarking on PR
Performance regression detection
Automated deployment
8.4 Cloud Deployment¶
AWS deployment guide (EKS)
GCP deployment guide (GKE)
Azure deployment guide (AKS)
Terraform modules
Helm charts
🟢 Phase 9: Research & Innovation (Future)¶
Priority: Low Effort: Variable Value: ⭐⭐⭐
9.1 Auto-Scaling Integration¶
Horizontal Pod Autoscaler (HPA) optimization
Vertical Pod Autoscaler (VPA) tuning
Knative Serving integration
Cost-aware scaling
9.2 Advanced Benchmarking¶
Custom benchmark scenario editor
Real-world traffic replay
Synthetic load generation
Multi-modal benchmarking
9.3 Model-Specific Optimization¶
Architecture-aware parameter tuning
Quantization-aware optimization
Attention mechanism tuning
Memory layout optimization
Maintenance & Technical Debt¶
Recently Fixed (2025/11/25) ✅¶
Database Schema Mismatch:
❌ Missing columns: `clusterbasemodel_config`, `clusterservingruntime_config`, `created_clusterbasemodel`, `created_clusterservingruntime`
✅ Fixed: Added ALTER TABLE statements
✅ Verified: All endpoints working, HTTP 500 errors resolved
Known Issues¶
Worker Restart Required
⚠️ ARQ worker doesn’t hot-reload code changes
Manual restart needed after editing `orchestrator.py` or `controllers/`
Future: Add file watcher for auto-restart
Polling-Based UI Updates
⚠️ Frontend polls every 2-5 seconds
Inefficient for idle states
Future: WebSocket migration (Phase 4)
Technical Improvements¶
Testing Coverage
Current: Manual testing only
Future: Unit tests, integration tests, E2E tests
Target: 80% code coverage
Error Handling
Current: Basic try-catch blocks
Future: Comprehensive error taxonomy, retry logic, graceful degradation
Database Migration
Current: Manual SQL commands
Future: Alembic migrations
Version-controlled schema changes
Success Metrics¶
Current Performance (Milestone 5)¶
| Metric | Value | Target |
|---|---|---|
| Total Tasks | 28 | - |
| Total Experiments | 408 | - |
| Success Rate | 76.5% | >80% |
| Avg Experiment Duration | 303.6s | <300s |
| Bayesian Efficiency | 80-87% reduction | >70% |
| UI Response Time | <200ms | <100ms |
| API Latency (P95) | <500ms | <200ms |
| Supported Runtimes | 3 (vLLM, SGLang, TRT-LLM) | - |
| Deployment Modes | 3 (Docker, OME, Local) | - |
| Agent Tools | 12 (across 6 categories) | - |
Future Targets (v2.0)¶
Experiment Success Rate: >90%
Avg Experiment Duration: <240s (20% improvement)
UI Response Time: <100ms (WebSocket)
Concurrent Experiments: >10 parallel
Cost Reduction: 50% fewer experiments vs grid search
Multi-Runtime Support: Add Triton, others
End of Roadmap | Last Updated: 2025/12/22 | Version: 0.2.0 (Milestone 5 Complete)