LLM Autotuner - Product Roadmap

Last Updated: 2025/12/22
Project Status: Production-Ready with Active Development
Current Version: v0.2.0 (Milestone 5 Complete)


Executive Summary

The LLM Autotuner is a comprehensive system for automatically optimizing Large Language Model inference parameters. The project has successfully completed five major milestones, delivering tri-mode deployment (Kubernetes/OME, Docker, and Local), a full-stack web application, runtime-agnostic configuration, and an AI-powered Agent System.

Key Achievements:

  • ✅ 28 tasks executed, 408 experiments run, 312 successful results

  • ✅ Bayesian optimization achieving 80-87% reduction in experiments vs grid search

  • ✅ Full-stack web application with React frontend and FastAPI backend

  • ✅ Runtime-agnostic quantization and parallelism configuration

  • ✅ GPU-aware optimization with per-GPU efficiency metrics

  • ✅ YAML import/export for configuration management

  • ✅ Real-time WebSocket updates and auto-update notifications

  • ✅ Per-batch SLO filtering with graceful OOM handling

  • ✅ Documentation refinement (66→15 files, 77% reduction)

  • ✅ SLO-aware scoring with exponential penalty functions

  • ✅ Agent System with LLM-powered task management, tool execution, and conversational interface

  • ✅ Local Deployment Mode for subprocess-based model execution without containers

  • ✅ Remote Dataset Support via URL fetching with caching


Milestone Overview

Milestone             Date        Status      Features
────────────────────  ──────────  ──────────  ────────────────────────────────────────────────────────────────────────────────────────────
M1: Core Foundation   2025/10/24  ✅ Done     Grid/Random Search, Docker Mode, OME/K8s Mode, Benchmark Parsing, Scoring Algorithms, CLI Interface
M2: Web Interface     2025/10/30  ✅ Done     REST API, React Frontend, Task Queue (ARQ), Log Streaming, Container Monitor, Preset System
M3: Runtime-Agnostic  2025/11/14  ✅ Done     Bayesian Optimization, Quantization Config, Parallel Config, GPU-Aware Scoring, SLO-Aware Scoring, Per-GPU Metrics
M4: UI/UX Polish      2025/12/01  ✅ Done     WebSocket Updates, YAML Import/Export, Auto-Update Notif, Multi-Exp Comparison, Sphinx Docs Site, SLO Filtering
M5: Agent System      2025/12/22  ✅ Done     Agent Chat UI, Tool Framework, Local Deploy Mode, Dataset URL Support, Streaming Markdown, GitHub/HF Tools
M6+: Future           Planned     🔵 Planned  Distributed Workers, Multi-User Auth, Cloud Deployment, CI/CD Integration, Advanced Analytics


Milestone Timeline

2025/10/24 ────► Milestone 1: Core Autotuner Foundation
2025/10/30 ────► Milestone 2: Complete Web Interface & Parameter Preset System
2025/11/14 ────► Milestone 3: Runtime-Agnostic Configuration & GPU-Aware Optimization
2025/12/01 ────► Milestone 4: UI/UX Polish, SLO Filtering & Documentation
2025/12/22 ────► Milestone 5: Agent System & Local Deployment Mode

🎉 Milestone 1: Core Autotuner Foundation

Date: 2025/10/24 (tag: milestone-1)
Status: ✅ COMPLETED
Objective: Establish a solid foundation for LLM inference parameter autotuning with complete functionality, proper documentation, and code standards

Key Accomplishments

1.1 Architecture & Implementation ✅

  • Multi-tier architecture with clear separation of concerns

  • OME controller for Kubernetes InferenceService lifecycle

  • Docker controller for standalone deployment

  • Benchmark controller (OME BenchmarkJob + Direct CLI modes)

  • Parameter grid generator and optimizer utilities

  • Main orchestrator with JSON input

Technical Specs:

  • Controllers: ome_controller.py, docker_controller.py, benchmark_controller.py, direct_benchmark_controller.py

  • Utilities: optimizer.py (grid search, scoring algorithms)

  • Templates: Jinja2 for Kubernetes resources

1.2 Benchmark Results Parsing & Scoring ✅

  • Fixed critical bug in genai-bench result file parsing

  • Enhanced DirectBenchmarkController._parse_results()

  • Reads correct result files (D*.json pattern)

  • Handles multiple concurrency levels

  • Aggregates metrics across all runs

  • Extracts 15+ performance metrics

Completed calculate_objective_score() with 4 objectives:

  • minimize_latency - E2E latency optimization

  • maximize_throughput - Token throughput optimization

  • minimize_ttft - Time to First Token optimization

  • minimize_tpot - Time Per Output Token optimization
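
A minimal sketch of how such a dispatch might look; the metric keys and the inversion scheme (so that higher is always better) are illustrative assumptions, not the exact optimizer.py implementation:

def calculate_objective_score(metrics: dict, objective: str) -> float:
	"""Return a score where higher is better, for any objective."""
	if objective == "maximize_throughput":
		return metrics["output_throughput_tokens_per_s"]
	if objective == "minimize_latency":
		return 1.0 / max(metrics["mean_e2e_latency_s"], 1e-9)
	if objective == "minimize_ttft":
		return 1.0 / max(metrics["mean_ttft_s"], 1e-9)
	if objective == "minimize_tpot":
		return 1.0 / max(metrics["mean_tpot_s"], 1e-9)
	raise ValueError(f"Unknown objective: {objective}")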

Comprehensive Metrics:

  • Latency: mean/min/max/p50/p90/p99 E2E latency

  • Throughput: output and total token throughput

  • Request statistics: success rate, error tracking

1.3 Code Quality & Standards ✅

  • Integrated black-with-tabs formatter

  • Formatted entire codebase (7 Python files, 1957+ lines)

  • Configuration: 120-char lines, tab indentation

  • PEP 8 compliance with 2 blank lines between top-level definitions

  • IDE integration guides (VS Code, PyCharm)

1.4 CLI Usability Improvements ✅

  • Made --direct flag automatic when using --mode docker

  • Simplified command-line interface

  • Updated help text and usage examples

  • Better default behaviors for common use cases

1.5 Documentation Structure ✅

  • Separated 420+ line Troubleshooting into docs/TROUBLESHOOTING.md

  • Created docs/DEVELOPMENT.md comprehensive guide

  • Established documentation conventions

  • Improved README readability

Documentation Files Created:

  • README.md - User guide with installation and usage

  • CLAUDE.md - Project overview and development guidelines

  • docs/TROUBLESHOOTING.md - 13 common issues and solutions

  • docs/DEVELOPMENT.md - Code formatting and contribution guide

  • docs/DOCKER_MODE.md - Docker deployment guide

  • docs/OME_INSTALLATION.md - Kubernetes/OME setup

1.6 Web Integration Readiness ✅

  • Comprehensive codebase analysis: Zero blockers found

  • Created detailed readiness assessment

  • Verified all controllers fully implemented (no placeholder functions)

  • Confirmed orchestrator is programmatically importable

  • Documented data structures (input/output formats)

  • Technology stack recommendations (FastAPI, React/Vue)

  • API endpoint specifications

  • Implementation roadmap with effort estimates

Technical Achievements

Code Quality:

  • 1,957 lines of production Python code

  • 100% method implementation (no placeholders in critical paths)

  • Comprehensive error handling and logging

  • Clean separation of concerns (controllers, orchestrator, utilities)

Functionality:

  • ✅ Full Docker mode support (standalone, no K8s required)

  • ✅ OME/Kubernetes mode support

  • ✅ Grid search parameter optimization

  • ✅ Multi-concurrency benchmark execution

  • ✅ Comprehensive result aggregation and scoring

  • ✅ Automatic resource cleanup

Test Results:

  • Successfully parsed real benchmark data

  • Concurrency levels: [1, 4]

  • Mean E2E Latency: 0.1892s

  • Mean Throughput: 2,304.82 tokens/s


🎉 Milestone 2: Complete Web Interface & Parameter Preset System

Date: 2025/10/30 (tag: milestone-2)
Status: ✅ COMPLETED
Objective: Build a full-stack web application for task management and visualization, and introduce a parameter preset system

Key Accomplishments

2.1 Backend API Infrastructure ✅

  • FastAPI application with async support

  • SQLAlchemy ORM with SQLite backend (moved to ~/.local/share/)

  • Database models (Task, Experiment)

  • REST API endpoints (10+ routes)

  • ARQ background task queue (Redis integration)

  • Pydantic schemas for validation

  • Streaming log API endpoints

  • Health check improvements

API Endpoints:

POST   /api/tasks/          - Create task
POST   /api/tasks/{id}/start - Start task execution
GET    /api/tasks/          - List tasks
GET    /api/tasks/{id}      - Get task details
GET    /api/tasks/{id}/logs - Stream logs (SSE)
GET    /api/experiments/task/{id} - Get experiments
GET    /api/docker/containers - List containers
GET    /api/system/health   - Health check

Database Migration:

  • Moved from local autotuner.db to XDG-compliant ~/.local/share/autotuner/

  • SQLite WAL mode for concurrent writes

  • Proper session management with async context

2.2 React Frontend Application ✅

  • React 18 with TypeScript

  • Vite build tooling with hot module replacement

  • React Router for navigation

  • TanStack Query (React Query) for API state

  • Tailwind CSS styling

  • Recharts for metrics visualization

  • React Hot Toast for notifications

Pages Implemented:

  • Dashboard - System overview and statistics

  • Tasks - Task list with create/list/monitor/restart

  • NewTask - Task creation wizard with form validation

  • Experiments - Results visualization with charts

  • Containers - Docker container monitoring (Docker mode)

Key Components:

  • TaskResults.tsx - Results visualization with Recharts

  • LogViewer.tsx - Real-time log streaming viewer

  • Layout.tsx - Main layout with navigation

  • Form components with validation

UI Features:

  • Task creation wizard with parameter presets

  • Real-time status monitoring (polling-based)

  • Experiment results table with sorting/filtering

  • Performance graphs (throughput, latency, TPOT, TTFT)

  • Container stats (CPU, memory, GPU)

  • Log streaming with follow mode

  • URL-based navigation with hash routing

  • Error notifications with toast messages

2.3 ARQ Worker Integration ✅

  • Background task processing with Redis queue

  • Worker configuration (max_jobs=5, timeout=2h)

  • Log redirection to task-specific files

  • Graceful shutdown handling

  • Worker management scripts

Log Management:

  • Task logs: ~/.local/share/autotuner/logs/task_<id>.log

  • Worker logs: logs/worker.log

  • Python logging library integration

  • StreamToLogger for real-time capture

2.4 Task Management Features ✅

  • Task creation UI with form builder

  • Task restart functionality

  • Task edit capability

  • Task status tracking

  • Real-time log viewing

  • Environment variable configuration for Docker

2.5 Parameter Preset System (Backend) ✅

  • Parameter preset API (CRUD operations)

  • Preset merge functionality

  • Import/export capabilities

  • System preset seeding

Note: Frontend integration for the preset system was completed in later sprints.

Bug Fixes & Improvements

Critical Fixes:

  • Fixed best experiment selection bug

  • Fixed model name field linking

  • Fixed health check 503 errors

  • Fixed data display in task view

  • Refined task restart logic

  • Enhanced container log viewing

Code Organization:

  • Reorganized web backend code structure

  • Separated orchestrator from web modules

  • Formatted code with Prettier

  • Improved error handling and validation

Technical Stack

Component         Technology
────────────────  ────────────────────────────
Frontend          React 18, TypeScript, Vite 5
State Management  TanStack Query 5
Styling           Tailwind CSS 3
Charts            Recharts 2
Backend           FastAPI, Python 3.10+
Database          SQLite 3 with SQLAlchemy 2
Task Queue        ARQ 0.26 + Redis 7
API Docs          Swagger UI (OpenAPI)

Statistics

  • Commits since Milestone 1: 40+

  • Frontend Components: 20+ React components

  • API Endpoints: 15+ routes

  • Database Tables: 2 (tasks, experiments)

  • Lines of Code: ~12,000 total (5,000 backend + 7,000 frontend)


🎉 Milestone 3: Runtime-Agnostic Configuration Architecture & GPU-Aware Optimization

Date: 2025/11/14 (tag: milestone-3)
Status: ✅ COMPLETED
Timeline: 2025/11/10 → 2025/11/14
Objective: Unified configuration abstraction for quantization and parallelism across multiple runtimes, plus GPU-aware optimization

Overview

Milestone 3 achieved two major architectural breakthroughs:

  1. Runtime-Agnostic Configuration System - Unified abstraction for quantization and parallel execution across vLLM, SGLang, and TensorRT-LLM

  2. GPU-Aware Optimization - Per-GPU efficiency metrics enabling fair comparison across different parallelism strategies

These foundational changes enable portable, efficiency-aware autotuning where users specify high-level intent and the system automatically maps to runtime-specific implementations while optimizing for per-GPU efficiency.

Part 1: Runtime-Agnostic Configuration System

1.1 Quantization Configuration Abstraction ✅

Problem Solved: Different inference runtimes use incompatible CLI syntax for quantization. Users had to learn runtime-specific arguments and rewrite configurations when switching engines.

Solution: Three-Layer Abstraction Architecture

Four-Field Normalized Schema:

{
  "gemm_dtype": "fp8",           # Weight/activation quantization
  "kvcache_dtype": "fp8_e5m2",   # KV cache compression
  "attention_dtype": "auto",      # Attention compute precision
  "moe_dtype": "auto"             # MoE expert quantization
}

Modules Created:

  1. quantization_mapper.py (450 lines)

    • Runtime-specific CLI argument mapping

    • 5 production presets: default, kv-cache-fp8, dynamic-fp8, bf16-stable, aggressive-moe

    • Validation with dtype compatibility checking

    • Automatic detection of offline quantization (AWQ, GPTQ, GGUF)

  2. quantization_integration.py (350 lines)

    • Orchestrator integration layer

    • Experiment parameter preparation

    • Conflict resolution between user params and quant config

Runtime Mapping Example:

User Config                     vLLM Args                    SGLang Args
────────────────────────────────────────────────────────────────────────────
gemm_dtype: "fp8"        →      --quantization fp8           --quantization fp8
kvcache_dtype: "fp8_e5m2" →     --kv-cache-dtype fp8_e5m2   --kv-cache-dtype fp8_e5m2
attention_dtype: "fp8"    →     (inferred from gemm)         --attention-backend fp8

Grid Expansion with __quant__ Prefix:

{
  "quant_config": {
    "gemm_dtype": ["auto", "fp8"],
    "kvcache_dtype": ["auto", "fp8_e5m2"]
  }
}

Expands to 4 experiments (2×2):

  • __quant__gemm_dtype=auto, __quant__kvcache_dtype=auto

  • __quant__gemm_dtype=auto, __quant__kvcache_dtype=fp8_e5m2

  • __quant__gemm_dtype=fp8, __quant__kvcache_dtype=auto

  • __quant__gemm_dtype=fp8, __quant__kvcache_dtype=fp8_e5m2
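
A minimal sketch of this expansion, assuming an itertools-style product over the value lists (the helper name is hypothetical; the real logic lives in quantization_integration.py):

from itertools import product

def expand_quant_grid(quant_config: dict) -> list[dict]:
	"""Expand value lists into per-experiment dicts keyed with __quant__."""
	keys = list(quant_config)
	value_lists = [v if isinstance(v, list) else [v] for v in quant_config.values()]
	return [
		{f"__quant__{k}": v for k, v in zip(keys, combo)}
		for combo in product(*value_lists)
	]

# expand_quant_grid({"gemm_dtype": ["auto", "fp8"], "kvcache_dtype": ["auto", "fp8_e5m2"]})
# yields the 4 (2×2) parameter dicts listed above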

Frontend Integration:

  • QuantizationConfigForm.tsx (612 lines)

  • Preset mode vs. Custom mode toggle

  • Real-time preview of generated parameters

  • Combination count calculation

  • Validation feedback

1.2 Parallel Configuration Abstraction ✅

Normalized Parameter Schema:

{
  "tp": 4,              # Tensor parallelism
  "pp": 1,              # Pipeline parallelism
  "dp": 2,              # Data parallelism
  "dcp": 1,             # Decode context parallelism (vLLM)
  "cp": 1,              # Context parallelism (TensorRT-LLM)
  "ep": 1,              # Expert parallelism (MoE)
  "moe_tp": 1,          # MoE tensor parallelism
  "moe_ep": 1           # MoE expert parallelism
}

Modules Created:

  1. parallel_mapper.py (520 lines)

    • 18 runtime-specific presets (7 vLLM, 6 SGLang, 5 TensorRT-LLM)

    • Constraint validation (e.g., SGLang: tp % dp == 0, TensorRT-LLM: no DP support)

    • world_size calculation: world_size = tp × pp × dp

  2. parallel_integration.py (280 lines)

    • Parameter grid expansion

    • Orchestrator integration

    • GPU allocation coordination
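
A minimal sketch of the constraint checks and world_size calculation described above (function name and error messages are illustrative, not the parallel_mapper.py source):

def validate_parallel_config(runtime: str, cfg: dict) -> int:
	"""Validate runtime constraints and return the required world size."""
	tp, pp, dp = cfg.get("tp", 1), cfg.get("pp", 1), cfg.get("dp", 1)
	if runtime == "sglang" and tp % dp != 0:
		raise ValueError(f"SGLang requires tp % dp == 0 (got tp={tp}, dp={dp})")
	if runtime == "tensorrt-llm" and dp != 1:
		raise ValueError("TensorRT-LLM has no data parallelism support (dp must be 1)")
	return tp * pp * dp  # world_size = tp × pp × dp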

Presets Per Engine:

vLLM (7 presets):
  - single-gpu, high-throughput, large-model-tp, large-model-tp-pp
  - moe-optimized, long-context (with dcp), balanced

SGLang (6 presets):
  - single-gpu, high-throughput, large-model-tp, large-model-tp-pp
  - moe-optimized (with moe_dense_tp), balanced
  - Constraint: tp % dp == 0

TensorRT-LLM (5 presets):
  - single-gpu, large-model-tp, large-model-tp-pp
  - moe-optimized (with moe_tp, moe_ep), long-context (with cp)
  - Constraint: No data parallelism support (dp must be 1)

Runtime Mapping Example:

User Config                 vLLM Args                           SGLang Args
─────────────────────────────────────────────────────────────────────────────────
tp: 4                 →     --tensor-parallel-size 4            --tp-size 4
pp: 1                 →     --pipeline-parallel-size 1          (not supported)
dp: 2                 →     --distributed-executor-backend ray  --dp-size 2
                            --num-gpu-blocks-override

Grid Expansion with __parallel__ Prefix:

{
  "parallel_config": {
    "tp": [2, 4],
    "pp": 1,
    "dp": [1, 2]
  }
}

Expands to 4 experiments (2×2).

Frontend Integration:

  • ParallelConfigForm.tsx (similar to QuantizationConfigForm)

  • Preset mode with 18 runtime-specific presets

  • Custom mode with constraint validation

  • GPU requirement calculation

  • Real-time parameter preview

Part 2: GPU-Aware Optimization

2.1 Per-GPU Efficiency Metrics ✅

Problem Solved: Traditional throughput metrics favor higher parallelism blindly. A configuration using 8 GPUs with 100 tokens/s looks better than 2 GPUs with 60 tokens/s, but the latter is 2.4× more efficient per GPU.

Solution: Per-GPU Throughput Calculation

Formula:

per_gpu_throughput = total_throughput / gpu_count

Example Comparison:

Config A: TP=2, throughput=661.36 tokens/s → 330.68 tokens/s/GPU
Config B: TP=4, throughput=628.22 tokens/s → 157.06 tokens/s/GPU
Winner: Config A (2.1× more efficient)

Implementation:

  • GPU info recorded in database: gpu_info JSON field

  • Contains: model, count, device_ids, world_size

  • Automatic calculation during scoring

  • Frontend displays both total and per-GPU metrics

2.2 GPU Information Tracking ✅

Database Schema:

gpu_info = {
  "model": "NVIDIA A100",
  "count": 2,
  "device_ids": [0, 1],
  "world_size": 2
}

Recording Logic:

  • Captured during experiment setup

  • Stored in experiments.gpu_info column (JSON)

  • Used for per-GPU metric calculation

  • Displayed in results table

2.3 Enhanced Result Visualization ✅

Frontend Enhancements:

  • Added “GPUs” column to experiment table

  • Display: 2 (A100) or 4 (H100)

  • Tooltip shows device IDs and world size

  • Per-GPU throughput column

  • Color coding for efficiency comparison

Charts:

  • Per-GPU efficiency scatter plot

  • GPU count vs throughput line chart

  • Pareto frontier with GPU cost consideration

Technical Achievements

Code Additions:

  • Quantization System: 800 lines (mapper + integration)

  • Parallel System: 800 lines (mapper + integration)

  • GPU Tracking: 200 lines (backend + frontend)

  • Frontend Forms: 1,200 lines (Quant + Parallel components)

  • Documentation: 3 new docs (QUANTIZATION, PARALLEL, GPU_TRACKING)

Total: ~3,000 lines of new production code

Functionality:

  • ✅ Support for 3 inference runtimes (vLLM, SGLang, TensorRT-LLM)

  • ✅ 5 quantization presets + custom mode

  • ✅ 18 parallelism presets across the three runtimes

  • ✅ Automatic runtime-specific CLI mapping

  • ✅ Constraint validation and conflict resolution

  • ✅ Per-GPU efficiency metrics

  • ✅ GPU information persistence

Documentation:

  • docs/QUANTIZATION_CONFIGURATION.md

  • docs/PARALLEL_CONFIGURATION.md

  • docs/GPU_TRACKING.md



🎉 Milestone 4: UI/UX Polish, SLO Filtering & Documentation

Date: 2025/12/01 (tag: milestone-4)
Status: ✅ COMPLETED
Timeline: 2025/11/15 → 2025/12/01
Objective: Transform from functional prototype to production-ready platform with professional UI, SLO filtering, and comprehensive documentation

Key Accomplishments

4.1 Frontend UI/UX Enhancements ✅

  • Real-time WebSocket updates (<100ms latency; a sketch follows this list)

  • YAML import/export for task configurations

  • Auto-update notification system (GitHub releases)

  • Enhanced result visualization with SLO reference lines

  • Custom logo and branding (SVG icon + favicon)

  • Protected completed tasks (hidden edit/cleanup buttons)

  • Clickable task names for details view

  • UI refinements (width-limited controls, placeholder cleanup)
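
A minimal sketch of the real-time update path, assuming a FastAPI WebSocket endpoint; the route path and message shape are assumptions, not the project's exact API:

from fastapi import FastAPI, WebSocket

app = FastAPI()
clients: set[WebSocket] = set()

@app.websocket("/ws/tasks")
async def task_updates(ws: WebSocket):
	"""Hold a connection open and track it for broadcasts."""
	await ws.accept()
	clients.add(ws)
	try:
		while True:
			await ws.receive_text()  # keeps the socket alive until disconnect
	finally:
		clients.discard(ws)

async def broadcast(event: dict) -> None:
	"""Push a task status change to every connected client."""
	for ws in list(clients):
		await ws.send_json(event)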

YAML Import/Export System:

// Import: Full-page drag-and-drop zone
<TaskYAMLImport onImport={(config) => populateForm(config)} />

// Export: Single-click download
<button onClick={() => exportTaskAsYAML(task)}>Export YAML</button>

Auto-Update Notifications:

  • Automatic version checking against GitHub releases

  • Notification banner when updates available

  • Build timestamp tracking: v1.0.0+20251203T195130Z

4.2 SLO-Aware Benchmarking ✅

  • Per-batch SLO filtering (filter non-compliant batches before aggregation)

  • Graceful OOM handling (partial success support)

  • Visual SLO indicators (reference lines on performance charts)

  • Detailed compliance logging per batch

Per-Batch Filtering Example:

[Benchmark] Filtering 4 batches by SLO compliance...
[Benchmark] ✗ Batch concurrency=8 violated SLO: {'p90': {'threshold': 5.0, 'actual': 6.2}}
[Benchmark] ✓ 3/4 batches passed SLO
[Benchmark] Max throughput: 145.2 req/s (from 3 SLO-compliant batches)
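
A minimal sketch of the filtering step, assuming each batch carries its aggregated metrics; field names are illustrative, and the real logic lives in check_batch_slo_compliance():

def filter_batches_by_slo(batches: list[dict], slo: dict) -> list[dict]:
	"""Drop batches whose metrics exceed any SLO threshold."""
	compliant = []
	for batch in batches:
		violations = {
			name: {"threshold": limit, "actual": batch["metrics"][name]}
			for name, limit in slo.items()
			if batch["metrics"].get(name, 0.0) > limit
		}
		if violations:
			print(f"[Benchmark] ✗ Batch concurrency={batch['concurrency']} violated SLO: {violations}")
		else:
			compliant.append(batch)
	return compliant  # aggregation then runs over compliant batches only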

Graceful Degradation:

  • Experiments succeed if at least one batch completes

  • Partial results better than no results

  • OOM at high concurrency doesn’t invalidate low-concurrency data

4.3 Documentation Refinement ✅

  • Aggressive cleanup (66 → 15 files, 77% reduction)

  • Content merges (GENAI_BENCH_LOGS → TROUBLESHOOTING, etc.)

  • Reference fixes (zero broken links across all docs)

  • Focus on long-term maintainability

15 Essential Files Kept:

  • User Guides (4): QUICKSTART, DOCKER_MODE, OME_INSTALLATION, TROUBLESHOOTING

  • Architecture (3): DEPLOYMENT_ARCHITECTURE, GPU_TRACKING, ROADMAP

  • Features (4): BAYESIAN_OPTIMIZATION, SLO_SCORING, PARALLEL_EXECUTION, WEBSOCKET_IMPLEMENTATION

  • Configuration (4): UNIFIED_QUANTIZATION_PARAMETERS, PARALLEL_PARAMETERS, PRESET_QUICK_REFERENCE, PVC_STORAGE

4.4 Bug Fixes & Infrastructure ✅

  • Template parameter fix (OME InferenceService: params=parameters instead of **parameters)

  • API proxy configuration (fixed hardcoded URLs in service files)

  • Pydantic settings fix (added extra='ignore' for VITE_* variables)

4.5 Documentation Website ✅

  • Sphinx documentation with Furo theme

  • GitHub Actions workflow for automated deployment

  • MyST Parser for Markdown support

  • Auto-generated API documentation (autodoc)

  • Organized directory structure (getting-started, user-guide, features, api)

  • Dark mode support with custom branding

Technical Achievements

Code Statistics:

  • Frontend: ~800 lines (YAML I/O, auto-update, SLO visualization)

  • Backend: ~400 lines (SLO filtering, OOM handling, fixes)

  • Total New Code: ~1,200 lines

  • Documentation: Sphinx site with 15+ pages, GitHub Actions CI/CD

Components Created:

  • TaskYAMLImport.tsx (180 lines) - Drag-and-drop import with validation

  • TaskYAMLExport.tsx (80 lines) - Single-click YAML export

  • UpdateNotification.tsx (110 lines) - Auto-update banner with GitHub integration

  • versionService.ts (60 lines) - Version checking service

  • check_batch_slo_compliance() (133 lines) - Per-batch SLO validation

  • docs/conf.py - Sphinx configuration with Furo theme

  • .github/workflows/docs.yml - GitHub Pages deployment workflow

Files Modified:

  • Frontend: Tasks.tsx, TaskResults.tsx, NewTask.tsx, Logo.tsx (10+ files)

  • Backend: optimizer.py, direct_benchmark_controller.py, config.py (5 files)

  • Documentation: README.md, CLAUDE.md, ROADMAP.md (reference fixes)

Performance Impact

Metric              Before M4         After M4            Improvement
──────────────────  ────────────────  ──────────────────  ─────────────
UI Response Time    2-5s polling      <100ms WebSocket    20-50x faster
Config Reusability  Manual JSON edit  YAML import/export  Instant
Update Awareness    Manual check      Auto-notification   Automatic
SLO Visibility      Numbers only      Visual ref lines    Intuitive
OOM Resilience      Experiment fails  Partial success     Graceful
Doc Files           66 files          15 files            77% reduction

Impact Summary

For Users:

  • ✅ Faster feedback: WebSocket real-time updates

  • ✅ Better visualization: SLO reference lines, enhanced charts

  • ✅ Config management: YAML import/export workflow

  • ✅ Stay updated: Automatic version checking

  • ✅ Fewer failures: Graceful OOM handling

  • ✅ Cleaner UI: Protected actions, clickable names

  • ✅ Professional branding: Custom logo and favicon

For Operators:

  • ✅ Easier troubleshooting: Per-batch SLO logging

  • ✅ Better resource utilization: Partial success support

  • ✅ Clearer documentation: 15 essential files vs 66

  • ✅ No broken links: All references verified

For Developers:

  • ✅ Maintainable docs: Focused, merged content

  • ✅ Working examples: Templates verified

  • ✅ Clear architecture: Essential docs only

  • ✅ Build tracking: Timestamp in version display


🎉 Milestone 5: Agent System & Local Deployment Mode

Date: 2025/12/22 (tag: milestone-5)
Status: ✅ COMPLETED
Timeline: 2025/12/01 → 2025/12/22
Objective: Introduce an LLM-powered Agent System for conversational task management and add a Local Deployment Mode for faster development iteration

Key Accomplishments

5.1 Agent Chat Interface ✅

  • Full-featured chat UI with streaming markdown responses (/agent)

  • Session management with persistent conversation history

  • Editable session titles with auto-generation from first message

  • IndexedDB-based message storage with backend sync

  • Server-Sent Events (SSE) for real-time streaming responses

Architecture:

┌─────────────────────┐     SSE Stream      ┌──────────────────┐
│  AgentChat.tsx      │◄───────────────────►│  /api/agent/chat │
│  (React Frontend)   │                     │  (FastAPI)       │
└─────────────────────┘                     └────────┬─────────┘
         │                                           │
         │ Markdown                                  │ OpenAI API
         ▼                                           ▼
┌─────────────────────┐                     ┌──────────────────┐
│ StreamingMarkdown   │                     │  LLM Backend     │
│ (react-markdown)    │                     │  (Configurable)  │
└─────────────────────┘                     └──────────────────┘

5.2 Tool Execution Framework ✅

  • Authorization system with AuthorizationScope enum (none, privileged, dangerous)

  • Privileged tools require user approval before execution

  • Auto-execute pending tools after authorization granted

  • Clear visual indicators for tool status in chat UI
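
A minimal sketch of the scope enum and approval check, under the three tiers named above (member values and the helper function are assumptions):

from enum import Enum

class AuthorizationScope(Enum):
	NONE = "none"              # executes immediately
	PRIVILEGED = "privileged"  # requires user approval before execution
	DANGEROUS = "dangerous"    # requires explicit per-call approval

def requires_approval(scope: AuthorizationScope) -> bool:
	"""Tools outside the NONE scope are queued until the user approves."""
	return scope is not AuthorizationScope.NONE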

Tool Categories Implemented:

Category             Tools                                                             Authorization
───────────────────  ────────────────────────────────────────────────────────────────  ─────────────
Task Management      create_task, start_task, get_task_status, get_task_logs          None
Worker Control       restart_arq_worker                                                Privileged
System Utilities     sleep, get_current_time                                           None
GitHub Integration   search_github_issues, create_github_issue, comment_github_issue  None
HuggingFace CLI      hf_cache_scan, hf_download, hf_repo_info                          None
Experiment Analysis  get_experiment_logs                                               None

5.3 Agent Backend Architecture ✅

  • LangChain framework for flexible model support

  • Support for Claude (Anthropic) and open-source models

  • Max iterations increased from 10 to 100 for complex tasks

  • Automatic tool result handling for multi-step operations

API Endpoints:

  • POST /api/agent/chat - SSE streaming chat endpoint

  • GET /api/agent/sessions - List all sessions

  • POST /api/agent/sessions - Create new session

  • PUT /api/agent/sessions/{id} - Update session title

  • DELETE /api/agent/sessions/{id} - Delete session

  • GET /api/agent/sessions/{id}/messages - Get session messages
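
A minimal sketch of the SSE chat endpoint, with run_agent() as a hypothetical stand-in for the LangChain loop in agent_service.py:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def run_agent(message: str):
	"""Hypothetical stand-in: yields response chunks as they are generated."""
	for token in ("Working", " on", " it..."):
		yield token

@app.post("/api/agent/chat")
async def agent_chat(payload: dict):
	async def event_stream():
		async for chunk in run_agent(payload.get("message", "")):
			yield f"data: {chunk}\n\n"  # one SSE frame per chunk
	return StreamingResponse(event_stream(), media_type="text/event-stream")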

5.4 Streaming Markdown Component ✅

  • Paragraph-aware streaming that preserves atomic elements

  • GitHub Flavored Markdown support (tables, code blocks, task lists)

  • Copy buttons for code blocks and tables (copies source, not rendered HTML)

  • Tailwind typography styling for consistent appearance

5.5 Local Deployment Mode ✅

  • New LocalController for subprocess-based model execution

  • Direct vLLM/SGLang server launch via python -m commands

  • Automatic port allocation (30000-30100 range)

  • Process lifecycle management with graceful shutdown

  • Log capture and streaming to task log files
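
A minimal sketch of the port allocation and subprocess launch, assuming vLLM's OpenAI-compatible entrypoint; the real LocalController also handles virtualenv activation and graceful shutdown:

import socket
import subprocess

def find_free_port(start: int = 30000, end: int = 30100) -> int:
	"""Return the first port in the configured range with no listener."""
	for port in range(start, end + 1):
		with socket.socket() as s:
			if s.connect_ex(("127.0.0.1", port)) != 0:
				return port
	raise RuntimeError("No free port in 30000-30100")

def launch_vllm(model: str, log_path: str) -> subprocess.Popen:
	"""Start a vLLM server subprocess and stream its output to a log file."""
	port = find_free_port()
	log_file = open(log_path, "a")
	return subprocess.Popen(
		["python", "-m", "vllm.entrypoints.openai.api_server",
		 "--model", model, "--port", str(port)],
		stdout=log_file, stderr=subprocess.STDOUT,
	)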

Runtime Support:

  • vLLM local environment: .venv-vllm/ with CUDA 12

  • SGLang local environment: .venv-sglang/ (SM86 limitation)

  • Automatic environment detection and activation

5.6 Dataset URL Support ✅

  • Remote URL dataset loading (CSV, JSONL, compressed archives)

  • Automatic format detection and conversion

  • URL-hash based caching in ~/.local/share/autotuner/datasets/

  • Deduplication option for prompt datasets

  • genai-bench submodule updated to fork with URL support
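
A minimal sketch of the URL-hash caching scheme described above (the fetch helper is hypothetical; the real controller also detects and converts formats):

import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".local/share/autotuner/datasets"

def fetch_dataset(url: str) -> Path:
	"""Download a remote dataset once; later runs reuse the cached copy."""
	CACHE_DIR.mkdir(parents=True, exist_ok=True)
	cached = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
	if not cached.exists():
		urllib.request.urlretrieve(url, cached)
	return cached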

5.7 Additional Improvements ✅

  • HuggingFace offline mode: Fixed cache path handling for air-gapped environments

  • GitHub Issue #3 fix: Resolved analyze_slo_violations() AttributeError

  • Foldable experiments list: Collapsible UI for large experiment sets

  • Comprehensive parameter presets: Runtime-specific vLLM and SGLang presets

  • Project rebranding: Logo redesign inspired by Novita.ai style

Technical Achievements

Code Statistics:

  • Agent Frontend: ~800 lines (AgentChat.tsx, StreamingMarkdown.tsx, AgentMessage.tsx)

  • Agent Backend: ~600 lines (routes/agent.py, schemas/agent.py, services/agent_service.py)

  • Tools: ~500 lines (task_tools.py, worker_tools.py, github_tools.py, hf_tools.py)

  • Local Mode: ~400 lines (local_controller.py, autotuner_worker.py updates)

  • Dataset: ~200 lines (dataset_controller.py)

  • Total New Code: ~2,500 lines

Key Components Created:

  • AgentChat.tsx (~400 lines) - Main chat interface with message history

  • StreamingMarkdown.tsx (~350 lines) - Paragraph-aware markdown renderer

  • LocalController (~300 lines) - Subprocess-based deployment controller

  • DatasetController (~200 lines) - Remote dataset fetching and caching

  • 12 agent tools across 6 categories

Performance Impact

Metric            Before M5            After M5                 Improvement
────────────────  ───────────────────  ───────────────────────  ─────────────────
Task Creation     Form-only            Conversational + Form    Natural language
Deployment Setup  Docker/K8s required  Local subprocess option  Faster iteration
Dataset Loading   Local files only     Remote URL support       More flexible
Agent Iterations  N/A                  Up to 100 steps          Complex workflows

Impact Summary

For Users:

  • ✅ Natural language task creation and management via Agent

  • ✅ Faster local development without Docker/Kubernetes overhead

  • ✅ Remote dataset support for production workload testing

  • ✅ GitHub integration for issue tracking and collaboration

  • ✅ HuggingFace integration for model management

For Developers:

  • ✅ Local deployment mode for rapid iteration

  • ✅ Extensible tool framework for custom integrations

  • ✅ Comprehensive logging and debugging support

  • ✅ Session-based conversation history

Current Status: Production-Ready v0.2.0 ✅

What Works Today

Core Functionality:

  • ✅ Grid search, random search, Bayesian optimization (Optuna TPE)

  • ✅ Docker mode deployment (recommended)

  • ✅ Kubernetes/OME mode deployment

  • ✅ Local mode deployment (subprocess-based, no containers)

  • ✅ Runtime-agnostic quantization configuration (vLLM, SGLang, TensorRT-LLM)

  • ✅ Runtime-agnostic parallelism configuration (18 presets)

  • ✅ SLO-aware scoring with exponential penalties

  • ✅ GPU intelligent scheduling with per-GPU efficiency metrics

  • ✅ Checkpoint mechanism for fault tolerance

  • ✅ Multi-objective Pareto optimization

  • ✅ Model caching optimization

  • ✅ Full-stack web UI with real-time monitoring

  • ✅ Agent System with LLM-powered conversational interface

  • ✅ Remote dataset support via URL fetching

Performance:

  • ✅ 28 tasks executed successfully

  • ✅ 408 total experiments run

  • ✅ 312 successful experiments (76.5% success rate)

  • ✅ Average experiment duration: 303.6 seconds

  • ✅ Bayesian optimization: 80-87% reduction vs grid search

Infrastructure:

  • ✅ FastAPI backend with async support

  • ✅ React 18 frontend with TypeScript

  • ✅ WebSocket real-time communication (backend + frontend)

  • ✅ SQLite database with WAL mode (XDG-compliant location)

  • ✅ Redis task queue with ARQ worker

  • ✅ Docker container management

  • ✅ Kubernetes resource management

  • ✅ LangChain-based agent framework with 12 tools


Future Roadmap

🔵 Phase 6: Distributed Architecture & Parallel Execution (Planned)

Priority: High
Effort: 3-4 weeks
Value: ⭐⭐⭐⭐⭐

6.1 Distributed Worker Architecture

  • Central Web Manager: Single control plane for multiple workers

  • Worker Registration: Auto-discovery and registration via Redis (sketched after the diagram below)

  • Heartbeat Monitoring: Worker health checks and failure detection

  • Work Stealing: Dynamic task redistribution across workers

  • Worker Pools: Group workers by capabilities (GPU type, region, etc.)

Architecture Design:

                    ┌─────────────────────┐
                    │  Central Web Manager│
                    │  (FastAPI + Redis)  │
                    └──────────┬──────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
    ┌──────▼──────┐    ┌──────▼──────┐    ┌──────▼──────┐
    │  Worker 1   │    │  Worker 2   │    │  Worker 3   │
    │  8×A100 GPUs│    │  8×H100 GPUs│    │  4×L40S GPUs│
    │  Node: gpu-1│    │  Node: gpu-2│    │  Node: gpu-3│
    └─────────────┘    └─────────────┘    └─────────────┘
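
A design sketch of registration and heartbeat for this planned architecture, not existing code; the Redis key layout and TTL-based failure detection are assumptions:

import json
import socket
import time

import redis

r = redis.Redis()

def register_worker(gpu_model: str, gpu_count: int, ttl: int = 30) -> None:
	"""Advertise capabilities and keep the registration alive via heartbeat."""
	worker_id = socket.gethostname()
	capabilities = {"gpu_model": gpu_model, "gpu_count": gpu_count}
	while True:
		r.set(f"worker:{worker_id}", json.dumps(capabilities), ex=ttl)
		time.sleep(ttl // 2)  # a crashed worker stops refreshing; its key expires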

Components:

  • Manager:

    • Task queue management

    • Worker registry with capabilities

    • Experiment distribution algorithm

    • Result aggregation service

    • Centralized logging

  • Worker:

    • Capability advertisement (GPU count, model, memory)

    • Experiment execution engine

    • Result reporting via REST API

    • Local checkpoint storage

    • Worker-level parallelism (max_parallel per worker)

Benefits:

  • Horizontal Scaling: Add workers to increase throughput

  • Resource Isolation: Different workers for different GPU types

  • Fault Tolerance: Worker failures don’t affect others

  • Geographic Distribution: Workers in different data centers

  • Cost Optimization: Use spot instances for workers

Implementation Plan:

  1. Week 1: Worker registration and discovery

  2. Week 2: Task distribution and scheduling

  3. Week 3: Result aggregation and monitoring

  4. Week 4: Frontend dashboard and testing

6.2 Advanced Parallel Execution

  • User-configurable max_parallel setting (currently hardcoded at 5)

  • Per-worker parallelism configuration

  • Dynamic parallelism based on GPU availability

  • Experiment dependency graph

  • Priority-based scheduling (high/normal/low priority tasks)

  • Resource reservation system (reserve GPUs for specific tasks)

Benefits:

  • Faster task completion (5-10x speedup with multiple workers)

  • Better GPU utilization across cluster

  • Configurable resource allocation per task

  • Fair scheduling with priority queues

6.3 Task Sharding & Load Balancing

  • Automatic task splitting across workers

  • Load-aware scheduling (balance by GPU count)

  • Locality-aware scheduling (prefer same-node experiments)

  • Cross-worker result aggregation

  • Consistent hashing for worker selection


🔵 Phase 7: Advanced Optimization & Runtime Features (Planned)

Priority: Medium
Effort: 2-4 weeks
Value: ⭐⭐⭐⭐

7.0 Agent Charting Tool

  • Add chart generation tool for Agent to visualize experiment results

  • Candidates: Matplotlib (static images), Plotly (interactive HTML)

  • Chart types: bar charts, line plots, scatter plots, heatmaps

  • Use cases:

    • Compare throughput/latency across experiments

    • Visualize parameter sensitivity

    • Generate Pareto frontier plots

    • Create SLO compliance charts

  • Output: Save charts to files or display inline in chat

Implementation Options:

Library     Pros                                    Cons
──────────  ──────────────────────────────────────  ─────────────────────
Matplotlib  Simple, widely used, static images      Not interactive
Plotly      Interactive, HTML export, beautiful     Larger dependency
Seaborn     Statistical plots, built on Matplotlib  Limited interactivity

7.1 Runtime-Specific Optimizations

SGLang Radix Cache Management:

  • Reset radix cache at experiment start: Clear cache before each experiment

  • Benchmark purity: Ensure fair comparison without cache pollution

  • Cache warming option: Optional pre-fill for production scenarios

  • Cache statistics tracking: Monitor hit rate and memory usage

Implementation:

import logging

import requests

logger = logging.getLogger(__name__)

# Before each experiment: clear the cache so every run starts cold
def reset_sglang_radix_cache(port: int):
	"""Reset SGLang radix cache via HTTP API"""
	response = requests.post(
		f"http://localhost:{port}/reset_cache",
		json={"cache_type": "radix"},
	)
	logger.info(f"Radix cache reset: {response.json()}")

Benefits:

  • Fair experiment comparisons (no cached KV states)

  • Reproducible benchmark results

  • Accurate TTFT measurements

  • Option to test both cold-start and warm-cache scenarios

Additional Runtime Features:

  • vLLM prefix caching control

  • TensorRT-LLM engine rebuild triggers

  • Runtime-specific profiling hooks

  • Memory defragmentation between experiments

7.2 Multi-Fidelity Optimization

  • Progressive benchmark complexity

  • Early stopping for poor configurations

  • Hyperband algorithm integration

  • Adaptive resource allocation

  • Quick validation runs (low concurrency, short duration)

  • Full benchmark only for promising configs

7.3 Transfer Learning

  • Model similarity detection (architecture, size, quantization)

  • Cross-model parameter transfer

  • Historical performance database (SQLite → PostgreSQL)

  • Meta-learning for initialization

  • Warmstart Bayesian optimization with historical data

7.4 Enhanced Multi-Objective Optimization

  • NSGA-II algorithm for Pareto frontier

  • 3+ objective support (latency, throughput, cost, energy, memory)

  • Interactive trade-off exploration

  • User preference learning

  • Weighted objective combination

  • Pareto frontier approximation with surrogate models


7.5 Enhanced Export & Data Portability

  • Export experiment results to CSV

  • Export results to JSON for analysis

  • Export results to Excel (.xlsx) format

  • Batch import multiple task configs

  • Template library (export/import task templates)

  • Share configurations via file or URL

  • YAML parser with schema validation

  • Automatic conversion between JSON ↔ YAML

  • YAML syntax highlighting in frontend
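
A minimal sketch of the planned JSON ↔ YAML conversion (PyYAML assumed; the planned schema validation would run before writing):

import json

import yaml  # PyYAML

def json_to_yaml(json_path: str, yaml_path: str) -> None:
	"""Convert a task config from JSON to YAML, preserving key order."""
	with open(json_path) as f:
		config = json.load(f)
	with open(yaml_path, "w") as f:
		yaml.safe_dump(config, f, sort_keys=False)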

Benefits:

  • Data portability for external analysis tools (Excel, Python, R)

  • Batch operations for managing multiple tasks

  • Configuration templates for common use cases

  • Team collaboration via shared configs

  • Integration with data science workflows

Export Formats:

  • Experiment Results: .csv, .json, .xlsx

  • Task Configs: .yaml, .json

  • Templates: Zip archive with metadata

7.6 Custom Dataset Support for GenAI-Bench

  • Fetch datasets from user-specified URLs

  • Support CSV format parsing

  • Support JSONL (JSON Lines) format parsing

  • Conversion script to genai-bench format

  • Dataset validation and preprocessing

  • Automatic schema detection

  • Support for custom prompt templates

  • Integration with task configuration

Supported Input Formats:

# CSV format
prompt,max_tokens,temperature
"Explain quantum computing",100,0.7
"Write a story about AI",200,0.9
# JSONL format
{"prompt": "Explain quantum computing", "max_tokens": 100, "temperature": 0.7}
{"prompt": "Write a story about AI", "max_tokens": 200, "temperature": 0.9}

Conversion Pipeline:

# Download and convert custom dataset
python scripts/prepare_custom_dataset.py \
  --url https://example.com/dataset.csv \
  --format csv \
  --output ./data/custom_benchmark.json

# Use in task configuration
{
  "benchmark": {
    "custom_dataset": "./data/custom_benchmark.json",
    "task": "text-to-text"
  }
}

Features:

  • HTTP/HTTPS URL fetching with authentication support

  • Automatic format detection (CSV/JSONL)

  • Field mapping configuration (map CSV columns to genai-bench schema)

  • Data validation (check required fields, token limits)

  • Sampling strategies (random, stratified, sequential)

  • Dataset caching to avoid re-downloading

Benefits:

  • Use real production workloads for benchmarking

  • Test with domain-specific prompts

  • Reproducible benchmarks with versioned datasets

  • Support for custom evaluation scenarios

  • Integration with existing data pipelines

GenAI-Bench Schema Mapping:

# Required fields for genai-bench
{
  "prompt": str,           # Input text
  "output_len": int,       # Expected output length
  "input_len": int,        # Input length (auto-calculated if not provided)
  "temperature": float,    # Optional: sampling temperature
  "top_p": float,          # Optional: nucleus sampling
  "max_tokens": int        # Optional: max output tokens
}

Implementation Components:

  1. Dataset Fetcher (src/utils/dataset_fetcher.py)

    • URL download with retries

    • Authentication headers support

    • Local file caching

  2. Format Converters (src/utils/dataset_converters/)

    • csv_converter.py: CSV → genai-bench JSON

    • jsonl_converter.py: JSONL → genai-bench JSON

    • Base converter interface for extensibility

  3. Validation Module (src/utils/dataset_validator.py)

    • Schema validation

    • Token limit checking

    • Duplicate detection

  4. CLI Tool (scripts/prepare_custom_dataset.py)

    • Standalone conversion utility

    • Preview mode (show first N records)

    • Statistics reporting
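
A minimal sketch of the planned csv_converter.py, using the column names from the CSV example above (the defaults are assumptions):

import csv

def csv_to_genai_bench(csv_path: str) -> list[dict]:
	"""Map CSV rows onto the genai-bench record schema shown earlier."""
	records = []
	with open(csv_path, newline="") as f:
		for row in csv.DictReader(f):
			records.append({
				"prompt": row["prompt"],
				"max_tokens": int(row.get("max_tokens", 128)),
				"temperature": float(row.get("temperature", 1.0)),
			})
	return records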

🔵 Phase 8: Enterprise Features (Planned)

Priority: Low-Medium
Effort: 3-5 weeks
Value: ⭐⭐⭐

8.1 Multi-User Support

  • User authentication (OAuth2)

  • Role-based access control (RBAC)

  • Task ownership and sharing

  • Team workspaces

8.2 Advanced Monitoring

  • Prometheus metrics exporter

  • Grafana dashboard templates

  • Alert rules for failures

  • Performance analytics

8.3 CI/CD Integration

  • GitHub Actions workflow

  • Automated benchmarking on PR

  • Performance regression detection

  • Automated deployment

8.4 Cloud Deployment

  • AWS deployment guide (EKS)

  • GCP deployment guide (GKE)

  • Azure deployment guide (AKS)

  • Terraform modules

  • Helm charts


🟢 Phase 9: Research & Innovation (Future)

Priority: Low
Effort: Variable
Value: ⭐⭐⭐

9.1 Auto-Scaling Integration

  • Horizontal Pod Autoscaler (HPA) optimization

  • Vertical Pod Autoscaler (VPA) tuning

  • Knative Serving integration

  • Cost-aware scaling

9.2 Advanced Benchmarking

  • Custom benchmark scenario editor

  • Real-world traffic replay

  • Synthetic load generation

  • Multi-modal benchmarking

9.3 Model-Specific Optimization

  • Architecture-aware parameter tuning

  • Quantization-aware optimization

  • Attention mechanism tuning

  • Memory layout optimization


Maintenance & Technical Debt

Recently Fixed (2025/11/25) ✅

Database Schema Mismatch:

  • ❌ Missing columns: clusterbasemodel_config, clusterservingruntime_config, created_clusterbasemodel, created_clusterservingruntime

  • ✅ Fixed: Added ALTER TABLE statements

  • ✅ Verified: All endpoints working, HTTP 500 errors resolved

Known Issues

  1. Worker Restart Required

    • ⚠️ ARQ worker doesn’t hot-reload code changes

    • Manual restart needed after editing orchestrator.py, controllers/

    • Future: Add file watcher for auto-restart

  2. Polling-Based UI Updates

    • ⚠️ Frontend polls every 2-5 seconds

    • Inefficient for idle states

    • Future: WebSocket migration (Phase 4)

Technical Improvements

  1. Testing Coverage

    • Current: Manual testing only

    • Future: Unit tests, integration tests, E2E tests

    • Target: 80% code coverage

  2. Error Handling

    • Current: Basic try-catch blocks

    • Future: Comprehensive error taxonomy, retry logic, graceful degradation

  3. Database Migration

    • Current: Manual SQL commands

    • Future: Alembic migrations

    • Version-controlled schema changes


Success Metrics

Current Performance (Milestone 5)

Metric                   Value                      Target
───────────────────────  ─────────────────────────  ──────
Total Tasks              28                         -
Total Experiments        408                        -
Success Rate             76.5%                      >80%
Avg Experiment Duration  303.6s                     <300s
Bayesian Efficiency      80-87% reduction           >70%
UI Response Time         <200ms                     <100ms
API Latency (P95)        <500ms                     <200ms
Supported Runtimes       3 (vLLM, SGLang, TRT-LLM)  -
Deployment Modes         3 (Docker, OME, Local)     -
Agent Tools              12 (across 6 categories)   -

Future Targets (v2.0)

  • Experiment Success Rate: >90%

  • Avg Experiment Duration: <240s (20% improvement)

  • UI Response Time: <100ms (WebSocket)

  • Concurrent Experiments: >10 parallel

  • Cost Reduction: 50% fewer experiments vs grid search

  • Multi-Runtime Support: Add Triton and other runtimes


End of Roadmap | Last Updated: 2025/12/22 | Version: 0.2.0 (Milestone 5 Complete)