LLM Autotuner - Product Roadmap

Last Updated: 2025/12/22
Project Status: Production-Ready with Active Development
Current Version: v0.2.0 (Milestone 5 Complete)


Executive Summary

The LLM Autotuner is a comprehensive system for automatically optimizing Large Language Model inference parameters. The project has successfully completed five major milestones, delivering tri-mode deployment (Kubernetes/OME, Docker, and Local), a full-stack web application, runtime-agnostic configuration, and an AI-powered Agent System.

Key Achievements:

  • ✅ 28 tasks executed, 408 experiments run, 312 successful results

  • ✅ Bayesian optimization achieving 80-87% reduction in experiments vs grid search

  • ✅ Full-stack web application with React frontend and FastAPI backend

  • ✅ Runtime-agnostic quantization and parallelism configuration

  • ✅ GPU-aware optimization with per-GPU efficiency metrics

  • ✅ YAML import/export for configuration management

  • ✅ Real-time WebSocket updates and auto-update notifications

  • ✅ Per-batch SLO filtering with graceful OOM handling

  • ✅ Documentation refinement (66→15 files, 77% reduction)

  • ✅ SLO-aware scoring with exponential penalty functions

  • ✅ Agent System with LLM-powered task management, tool execution, and conversational interface

  • ✅ Local Deployment Mode for subprocess-based model execution without containers

  • ✅ Remote Dataset Support via URL fetching with caching


Milestone Overview

Milestone             Date        Status      Features
────────────────────  ──────────  ──────────  ────────────────────────────────────────────────────────────────────────────────────────────
M1: Core Foundation   2025/10/24  ✅ Done     Grid/Random Search, Docker Mode, OME/K8s Mode, Benchmark Parsing, Scoring Algorithms, CLI Interface
M2: Web Interface     2025/10/30  ✅ Done     REST API, React Frontend, Task Queue (ARQ), Log Streaming, Container Monitor, Preset System
M3: Runtime-Agnostic  2025/11/14  ✅ Done     Bayesian Optimization, Quantization Config, Parallel Config, GPU-Aware Scoring, SLO-Aware Scoring, Per-GPU Metrics
M4: UI/UX Polish      2025/12/01  ✅ Done     WebSocket Updates, YAML Import/Export, Auto-Update Notif, Multi-Exp Comparison, Sphinx Docs Site, SLO Filtering
M5: Agent System      2025/12/22  ✅ Done     Agent Chat UI, Tool Framework, Local Deploy Mode, Dataset URL Support, Streaming Markdown, GitHub/HF Tools
M6+: Future           Planned     🔵 Planned  Distributed Workers, Multi-User Auth, Cloud Deployment, CI/CD Integration, Advanced Analytics


Milestone Timeline

2025/10/24 ────► Milestone 1: Core Autotuner Foundation
2025/10/30 ────► Milestone 2: Complete Web Interface & Parameter Preset System
2025/11/14 ────► Milestone 3: Runtime-Agnostic Configuration & GPU-Aware Optimization
2025/12/01 ────► Milestone 4: UI/UX Polish, SLO Filtering & Documentation
2025/12/22 ────► Milestone 5: Agent System & Local Deployment Mode

🎉 Milestone 1: Core Autotuner Foundation

Date: 2025/10/24 (tag: milestone-1)
Status: ✅ COMPLETED
Objective: Establish a solid foundation for LLM inference parameter autotuning with complete functionality, proper documentation, and code standards

Key Accomplishments

1.1 Architecture & Implementation ✅

  • Multi-tier architecture with clear separation of concerns

  • OME controller for Kubernetes InferenceService lifecycle

  • Docker controller for standalone deployment

  • Benchmark controller (OME BenchmarkJob + Direct CLI modes)

  • Parameter grid generator and optimizer utilities

  • Main orchestrator with JSON input

Technical Specs:

  • Controllers: ome_controller.py, docker_controller.py, benchmark_controller.py, direct_benchmark_controller.py

  • Utilities: optimizer.py (grid search, scoring algorithms)

  • Templates: Jinja2 for Kubernetes resources

1.2 Benchmark Results Parsing & Scoring ✅

  • Fixed critical bug in genai-bench result file parsing

  • Enhanced DirectBenchmarkController._parse_results()

  • Reads correct result files (D*.json pattern)

  • Handles multiple concurrency levels

  • Aggregates metrics across all runs

  • Extracts 15+ performance metrics

Completed calculate_objective_score() with 4 objectives:

  • minimize_latency - E2E latency optimization

  • maximize_throughput - Token throughput optimization

  • minimize_ttft - Time to First Token optimization

  • minimize_tpot - Time Per Output Token optimization
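
A minimal sketch of how such a dispatch might look; the metric keys and the inversion scheme (so that higher is always better) are illustrative assumptions, not the exact optimizer.py implementation:

def calculate_objective_score(metrics: dict, objective: str) -> float:
	"""Return a score where higher is better, for any objective."""
	if objective == "maximize_throughput":
		return metrics["output_throughput_tokens_per_s"]
	if objective == "minimize_latency":
		return 1.0 / max(metrics["mean_e2e_latency_s"], 1e-9)
	if objective == "minimize_ttft":
		return 1.0 / max(metrics["mean_ttft_s"], 1e-9)
	if objective == "minimize_tpot":
		return 1.0 / max(metrics["mean_tpot_s"], 1e-9)
	raise ValueError(f"Unknown objective: {objective}")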

Comprehensive Metrics:

  • Latency: mean/min/max/p50/p90/p99 E2E latency

  • Throughput: output and total token throughput

  • Request statistics: success rate, error tracking

1.3 Code Quality & Standards ✅

  • Integrated black-with-tabs formatter

  • Formatted entire codebase (7 Python files, 1957+ lines)

  • Configuration: 120-char lines, tab indentation

  • PEP 8 compliance with 2 blank lines between top-level definitions

  • IDE integration guides (VS Code, PyCharm)

1.4 CLI Usability Improvements ✅

  • Made --direct flag automatic when using --mode docker

  • Simplified command-line interface

  • Updated help text and usage examples

  • Better default behaviors for common use cases

1.5 Documentation Structure ✅

  • Separated 420+ line Troubleshooting into docs/TROUBLESHOOTING.md

  • Created docs/DEVELOPMENT.md comprehensive guide

  • Established documentation conventions

  • Improved README readability

Documentation Files Created:

  • README.md - User guide with installation and usage

  • CLAUDE.md - Project overview and development guidelines

  • docs/TROUBLESHOOTING.md - 13 common issues and solutions

  • docs/DEVELOPMENT.md - Code formatting and contribution guide

  • docs/DOCKER_MODE.md - Docker deployment guide

  • docs/OME_INSTALLATION.md - Kubernetes/OME setup

1.6 Web Integration Readiness ✅

  • Comprehensive codebase analysis: Zero blockers found

  • Created detailed readiness assessment

  • Verified all controllers fully implemented (no placeholder functions)

  • Confirmed orchestrator is programmatically importable

  • Documented data structures (input/output formats)

  • Technology stack recommendations (FastAPI, React/Vue)

  • API endpoint specifications

  • Implementation roadmap with effort estimates

Technical Achievements

Code Quality:

  • 1,957 lines of production Python code

  • 100% method implementation (no placeholders in critical paths)

  • Comprehensive error handling and logging

  • Clean separation of concerns (controllers, orchestrator, utilities)

Functionality:

  • ✅ Full Docker mode support (standalone, no K8s required)

  • ✅ OME/Kubernetes mode support

  • ✅ Grid search parameter optimization

  • ✅ Multi-concurrency benchmark execution

  • ✅ Comprehensive result aggregation and scoring

  • ✅ Automatic resource cleanup

Test Results:

  • Successfully parsed real benchmark data

  • Concurrency levels: [1, 4]

  • Mean E2E Latency: 0.1892s

  • Mean Throughput: 2,304.82 tokens/s


🎉 Milestone 2: Complete Web Interface & Parameter Preset System

Date: 2025/10/30 (tag: milestone-2)
Status: ✅ COMPLETED
Objective: Build a full-stack web application for task management and visualization, and introduce a parameter preset system

Key Accomplishments

2.1 Backend API Infrastructure ✅

  • FastAPI application with async support

  • SQLAlchemy ORM with SQLite backend (moved to ~/.local/share/)

  • Database models (Task, Experiment)

  • REST API endpoints (10+ routes)

  • ARQ background task queue (Redis integration)

  • Pydantic schemas for validation

  • Streaming log API endpoints

  • Health check improvements

API Endpoints:

POST   /api/tasks/          - Create task
POST   /api/tasks/{id}/start - Start task execution
GET    /api/tasks/          - List tasks
GET    /api/tasks/{id}      - Get task details
GET    /api/tasks/{id}/logs - Stream logs (SSE)
GET    /api/experiments/task/{id} - Get experiments
GET    /api/docker/containers - List containers
GET    /api/system/health   - Health check

Database Migration:

  • Moved from local autotuner.db to XDG-compliant ~/.local/share/autotuner/

  • SQLite WAL mode for concurrent writes

  • Proper session management with async context

2.2 React Frontend Application ✅

  • React 18 with TypeScript

  • Vite build tooling with hot module replacement

  • React Router for navigation

  • TanStack Query (React Query) for API state

  • Tailwind CSS styling

  • Recharts for metrics visualization

  • React Hot Toast for notifications

Pages Implemented:

  • Dashboard - System overview and statistics

  • Tasks - Task list with create/list/monitor/restart

  • NewTask - Task creation wizard with form validation

  • Experiments - Results visualization with charts

  • Containers - Docker container monitoring (Docker mode)

Key Components:

  • TaskResults.tsx - Results visualization with Recharts

  • LogViewer.tsx - Real-time log streaming viewer

  • Layout.tsx - Main layout with navigation

  • Form components with validation

UI Features:

  • Task creation wizard with parameter presets

  • Real-time status monitoring (polling-based)

  • Experiment results table with sorting/filtering

  • Performance graphs (throughput, latency, TPOT, TTFT)

  • Container stats (CPU, memory, GPU)

  • Log streaming with follow mode

  • URL-based navigation with hash routing

  • Error notifications with toast messages

2.3 ARQ Worker Integration ✅

  • Background task processing with Redis queue

  • Worker configuration (max_jobs=5, timeout=2h)

  • Log redirection to task-specific files

  • Graceful shutdown handling

  • Worker management scripts

Log Management:

  • Task logs: ~/.local/share/autotuner/logs/task_<id>.log

  • Worker logs: logs/worker.log

  • Python logging library integration

  • StreamToLogger for real-time capture

2.4 Task Management Features ✅

  • Task creation UI with form builder

  • Task restart functionality

  • Task edit capability

  • Task status tracking

  • Real-time log viewing

  • Environment variable configuration for Docker

2.5 Parameter Preset System (Backend) ✅

  • Parameter preset API (CRUD operations)

  • Preset merge functionality

  • Import/export capabilities

  • System preset seeding

Note: Frontend integration for the preset system was completed in later sprints.

Bug Fixes & Improvements

Critical Fixes:

  • Fixed best experiment selection bug

  • Fixed model name field linking

  • Fixed health check 503 errors

  • Fixed data display in task view

  • Refined task restart logic

  • Enhanced container log viewing

Code Organization:

  • Reorganized web backend code structure

  • Separated orchestrator from web modules

  • Formatted code with Prettier

  • Improved error handling and validation

Technical Stack

Component         Technology
────────────────  ────────────────────────────
Frontend          React 18, TypeScript, Vite 5
State Management  TanStack Query 5
Styling           Tailwind CSS 3
Charts            Recharts 2
Backend           FastAPI, Python 3.10+
Database          SQLite 3 with SQLAlchemy 2
Task Queue        ARQ 0.26 + Redis 7
API Docs          Swagger UI (OpenAPI)

Statistics

  • Commits since Milestone 1: 40+

  • Frontend Components: 20+ React components

  • API Endpoints: 15+ routes

  • Database Tables: 2 (tasks, experiments)

  • Lines of Code: ~12,000 total (5,000 backend + 7,000 frontend)


🎉 Milestone 3: Runtime-Agnostic Configuration Architecture & GPU-Aware Optimization

Date: 2025/11/14 (tag: milestone-3)
Status: ✅ COMPLETED
Timeline: 2025/11/10 → 2025/11/14
Objective: Unified configuration abstraction for quantization and parallelism across multiple runtimes, plus GPU-aware optimization

Overview

Milestone 3 achieved two major architectural breakthroughs:

  1. Runtime-Agnostic Configuration System - Unified abstraction for quantization and parallel execution across vLLM, SGLang, and TensorRT-LLM

  2. GPU-Aware Optimization - Per-GPU efficiency metrics enabling fair comparison across different parallelism strategies

These foundational changes enable portable, efficiency-aware autotuning where users specify high-level intent and the system automatically maps to runtime-specific implementations while optimizing for per-GPU efficiency.

Part 1: Runtime-Agnostic Configuration System

1.1 Quantization Configuration Abstraction ✅

Problem Solved: Different inference runtimes use incompatible CLI syntax for quantization. Users had to learn runtime-specific arguments and rewrite configurations when switching engines.

Solution: Three-Layer Abstraction Architecture

Four-Field Normalized Schema:

{
  "gemm_dtype": "fp8",           # Weight/activation quantization
  "kvcache_dtype": "fp8_e5m2",   # KV cache compression
  "attention_dtype": "auto",      # Attention compute precision
  "moe_dtype": "auto"             # MoE expert quantization
}

Modules Created:

  1. quantization_mapper.py (450 lines)

    • Runtime-specific CLI argument mapping

    • 5 production presets: default, kv-cache-fp8, dynamic-fp8, bf16-stable, aggressive-moe

    • Validation with dtype compatibility checking

    • Automatic detection of offline quantization (AWQ, GPTQ, GGUF)

  2. quantization_integration.py (350 lines)

    • Orchestrator integration layer

    • Experiment parameter preparation

    • Conflict resolution between user params and quant config

Runtime Mapping Example:

User Config                     vLLM Args                    SGLang Args
────────────────────────────────────────────────────────────────────────────
gemm_dtype: "fp8"        →      --quantization fp8           --quantization fp8
kvcache_dtype: "fp8_e5m2" →     --kv-cache-dtype fp8_e5m2   --kv-cache-dtype fp8_e5m2
attention_dtype: "fp8"    →     (inferred from gemm)         --attention-backend fp8

Grid Expansion with __quant__ Prefix:

{
  "quant_config": {
    "gemm_dtype": ["auto", "fp8"],
    "kvcache_dtype": ["auto", "fp8_e5m2"]
  }
}

Expands to 4 experiments (2×2):

  • __quant__gemm_dtype=auto, __quant__kvcache_dtype=auto

  • __quant__gemm_dtype=auto, __quant__kvcache_dtype=fp8_e5m2

  • __quant__gemm_dtype=fp8, __quant__kvcache_dtype=auto

  • __quant__gemm_dtype=fp8, __quant__kvcache_dtype=fp8_e5m2
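
A minimal sketch of this expansion, assuming an itertools-style product over the value lists (the helper name is hypothetical; the real logic lives in quantization_integration.py):

from itertools import product

def expand_quant_grid(quant_config: dict) -> list[dict]:
	"""Expand value lists into per-experiment dicts keyed with __quant__."""
	keys = list(quant_config)
	value_lists = [v if isinstance(v, list) else [v] for v in quant_config.values()]
	return [
		{f"__quant__{k}": v for k, v in zip(keys, combo)}
		for combo in product(*value_lists)
	]

# expand_quant_grid({"gemm_dtype": ["auto", "fp8"], "kvcache_dtype": ["auto", "fp8_e5m2"]})
# yields the 4 (2×2) parameter dicts listed above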

Frontend Integration:

  • QuantizationConfigForm.tsx (612 lines)

  • Preset mode vs. Custom mode toggle

  • Real-time preview of generated parameters

  • Combination count calculation

  • Validation feedback

1.2 Parallel Configuration Abstraction ✅

Normalized Parameter Schema:

{
  "tp": 4,              # Tensor parallelism
  "pp": 1,              # Pipeline parallelism
  "dp": 2,              # Data parallelism
  "dcp": 1,             # Decode context parallelism (vLLM)
  "cp": 1,              # Context parallelism (TensorRT-LLM)
  "ep": 1,              # Expert parallelism (MoE)
  "moe_tp": 1,          # MoE tensor parallelism
  "moe_ep": 1           # MoE expert parallelism
}

Modules Created:

  1. parallel_mapper.py (520 lines)

    • 18 runtime-specific presets (7 vLLM, 6 SGLang, 5 TensorRT-LLM)

    • Constraint validation (e.g., SGLang: tp % dp == 0, TensorRT-LLM: no DP support)

    • world_size calculation: world_size = tp × pp × dp

  2. parallel_integration.py (280 lines)

    • Parameter grid expansion

    • Orchestrator integration

    • GPU allocation coordination
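
A minimal sketch of the constraint checks and world_size calculation described above (function name and error messages are illustrative, not the parallel_mapper.py source):

def validate_parallel_config(runtime: str, cfg: dict) -> int:
	"""Validate runtime constraints and return the required world size."""
	tp, pp, dp = cfg.get("tp", 1), cfg.get("pp", 1), cfg.get("dp", 1)
	if runtime == "sglang" and tp % dp != 0:
		raise ValueError(f"SGLang requires tp % dp == 0 (got tp={tp}, dp={dp})")
	if runtime == "tensorrt-llm" and dp != 1:
		raise ValueError("TensorRT-LLM has no data parallelism support (dp must be 1)")
	return tp * pp * dp  # world_size = tp × pp × dp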

Presets Per Engine:

vLLM (7 presets):
  - single-gpu, high-throughput, large-model-tp, large-model-tp-pp
  - moe-optimized, long-context (with dcp), balanced

SGLang (6 presets):
  - single-gpu, high-throughput, large-model-tp, large-model-tp-pp
  - moe-optimized (with moe_dense_tp), balanced
  - Constraint: tp % dp == 0

TensorRT-LLM (5 presets):
  - single-gpu, large-model-tp, large-model-tp-pp
  - moe-optimized (with moe_tp, moe_ep), long-context (with cp)
  - Constraint: No data parallelism support (dp must be 1)

Runtime Mapping Example:

User Config                 vLLM Args                           SGLang Args
─────────────────────────────────────────────────────────────────────────────────
tp: 4                 →     --tensor-parallel-size 4            --tp-size 4
pp: 1                 →     --pipeline-parallel-size 1          (not supported)
dp: 2                 →     --distributed-executor-backend ray  --dp-size 2
                            --num-gpu-blocks-override

Grid Expansion with __parallel__ Prefix:

{
  "parallel_config": {
    "tp": [2, 4],
    "pp": 1,
    "dp": [1, 2]
  }
}

Expands to 4 experiments (2×2).

Frontend Integration:

  • ParallelConfigForm.tsx (similar to QuantizationConfigForm)

  • Preset mode with 18 runtime-specific presets

  • Custom mode with constraint validation

  • GPU requirement calculation

  • Real-time parameter preview

Part 2: GPU-Aware Optimization

2.1 Per-GPU Efficiency Metrics ✅

Problem Solved: Traditional throughput metrics favor higher parallelism blindly. A configuration using 8 GPUs with 100 tokens/s looks better than 2 GPUs with 60 tokens/s, but the latter is 2.4× more efficient per GPU.

Solution: Per-GPU Throughput Calculation

Formula:

per_gpu_throughput = total_throughput / gpu_count

Example Comparison:

Config A: TP=2, throughput=661.36 tokens/s → 330.68 tokens/s/GPU
Config B: TP=4, throughput=628.22 tokens/s → 157.06 tokens/s/GPU
Winner: Config A (2.1× more efficient)

Implementation:

  • GPU info recorded in database: gpu_info JSON field

  • Contains: model, count, device_ids, world_size

  • Automatic calculation during scoring

  • Frontend displays both total and per-GPU metrics

2.2 GPU Information Tracking ✅

Database Schema:

gpu_info = {
  "model": "NVIDIA A100",
  "count": 2,
  "device_ids": [0, 1],
  "world_size": 2
}

Recording Logic:

  • Captured during experiment setup

  • Stored in experiments.gpu_info column (JSON)

  • Used for per-GPU metric calculation

  • Displayed in results table

2.3 Enhanced Result Visualization ✅

Frontend Enhancements:

  • Added “GPUs” column to experiment table

  • Display: 2 (A100) or 4 (H100)

  • Tooltip shows device IDs and world size

  • Per-GPU throughput column

  • Color coding for efficiency comparison

Charts:

  • Per-GPU efficiency scatter plot

  • GPU count vs throughput line chart

  • Pareto frontier with GPU cost consideration

Technical Achievements

Code Additions:

  • Quantization System: 800 lines (mapper + integration)

  • Parallel System: 800 lines (mapper + integration)

  • GPU Tracking: 200 lines (backend + frontend)

  • Frontend Forms: 1,200 lines (Quant + Parallel components)

  • Documentation: 3 new docs (QUANTIZATION, PARALLEL, GPU_TRACKING)

Total: ~3,000 lines of new production code

Functionality:

  • ✅ Support for 3 inference runtimes (vLLM, SGLang, TensorRT-LLM)

  • ✅ 5 quantization presets + custom mode

  • ✅ 18 parallelism presets across the three runtimes

  • ✅ Automatic runtime-specific CLI mapping

  • ✅ Constraint validation and conflict resolution

  • ✅ Per-GPU efficiency metrics

  • ✅ GPU information persistence

Documentation:

  • docs/QUANTIZATION_CONFIGURATION.md

  • docs/PARALLEL_CONFIGURATION.md

  • docs/GPU_TRACKING.md



🎉 Milestone 4: UI/UX Polish, SLO Filtering & Documentation

Date: 2025/12/01 (tag: milestone-4)
Status: ✅ COMPLETED
Timeline: 2025/11/15 → 2025/12/01
Objective: Transform from functional prototype to production-ready platform with professional UI, SLO filtering, and comprehensive documentation

Key Accomplishments

4.1 Frontend UI/UX Enhancements ✅

  • Real-time WebSocket updates (<100ms latency; a sketch follows this list)

  • YAML import/export for task configurations

  • Auto-update notification system (GitHub releases)

  • Enhanced result visualization with SLO reference lines

  • Custom logo and branding (SVG icon + favicon)

  • Protected completed tasks (hidden edit/cleanup buttons)

  • Clickable task names for details view

  • UI refinements (width-limited controls, placeholder cleanup)
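
A minimal sketch of the real-time update path, assuming a FastAPI WebSocket endpoint; the route path and message shape are assumptions, not the project's exact API:

from fastapi import FastAPI, WebSocket

app = FastAPI()
clients: set[WebSocket] = set()

@app.websocket("/ws/tasks")
async def task_updates(ws: WebSocket):
	"""Hold a connection open and track it for broadcasts."""
	await ws.accept()
	clients.add(ws)
	try:
		while True:
			await ws.receive_text()  # keeps the socket alive until disconnect
	finally:
		clients.discard(ws)

async def broadcast(event: dict) -> None:
	"""Push a task status change to every connected client."""
	for ws in list(clients):
		await ws.send_json(event)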

YAML Import/Export System:

// Import: Full-page drag-and-drop zone
<TaskYAMLImport onImport={(config) => populateForm(config)} />

// Export: Single-click download
<button onClick={() => exportTaskAsYAML(task)}>Export YAML</button>

Auto-Update Notifications:

  • Automatic version checking against GitHub releases

  • Notification banner when updates available

  • Build timestamp tracking: v1.0.0+20251203T195130Z

4.2 SLO-Aware Benchmarking ✅

  • Per-batch SLO filtering (filter non-compliant batches before aggregation)

  • Graceful OOM handling (partial success support)

  • Visual SLO indicators (reference lines on performance charts)

  • Detailed compliance logging per batch

Per-Batch Filtering Example:

[Benchmark] Filtering 4 batches by SLO compliance...
[Benchmark] ✗ Batch concurrency=8 violated SLO: {'p90': {'threshold': 5.0, 'actual': 6.2}}
[Benchmark] ✓ 3/4 batches passed SLO
[Benchmark] Max throughput: 145.2 req/s (from 3 SLO-compliant batches)
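
A minimal sketch of the filtering step, assuming each batch carries its aggregated metrics; field names are illustrative, and the real logic lives in check_batch_slo_compliance():

def filter_batches_by_slo(batches: list[dict], slo: dict) -> list[dict]:
	"""Drop batches whose metrics exceed any SLO threshold."""
	compliant = []
	for batch in batches:
		violations = {
			name: {"threshold": limit, "actual": batch["metrics"][name]}
			for name, limit in slo.items()
			if batch["metrics"].get(name, 0.0) > limit
		}
		if violations:
			print(f"[Benchmark] ✗ Batch concurrency={batch['concurrency']} violated SLO: {violations}")
		else:
			compliant.append(batch)
	return compliant  # aggregation then runs over compliant batches only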

Graceful Degradation:

  • Experiments succeed if at least one batch completes

  • Partial results better than no results

  • OOM at high concurrency doesn’t invalidate low-concurrency data

4.3 Documentation Refinement ✅

  • Aggressive cleanup (66 → 15 files, 77% reduction)

  • Content merges (GENAI_BENCH_LOGS → TROUBLESHOOTING, etc.)

  • Reference fixes (zero broken links across all docs)

  • Focus on long-term maintainability

15 Essential Files Kept:

  • User Guides (4): QUICKSTART, DOCKER_MODE, OME_INSTALLATION, TROUBLESHOOTING

  • Architecture (3): DEPLOYMENT_ARCHITECTURE, GPU_TRACKING, ROADMAP

  • Features (4): BAYESIAN_OPTIMIZATION, SLO_SCORING, PARALLEL_EXECUTION, WEBSOCKET_IMPLEMENTATION

  • Configuration (4): UNIFIED_QUANTIZATION_PARAMETERS, PARALLEL_PARAMETERS, PRESET_QUICK_REFERENCE, PVC_STORAGE

4.4 Bug Fixes & Infrastructure ✅

  • Template parameter fix (OME InferenceService: params=parameters instead of **parameters)

  • API proxy configuration (fixed hardcoded URLs in service files)

  • Pydantic settings fix (added extra='ignore' for VITE_* variables)

4.5 Documentation Website ✅

  • Sphinx documentation with Furo theme

  • GitHub Actions workflow for automated deployment

  • MyST Parser for Markdown support

  • Auto-generated API documentation (autodoc)

  • Organized directory structure (getting-started, user-guide, features, api)

  • Dark mode support with custom branding

Technical Achievements

Code Statistics:

  • Frontend: ~800 lines (YAML I/O, auto-update, SLO visualization)

  • Backend: ~400 lines (SLO filtering, OOM handling, fixes)

  • Total New Code: ~1,200 lines

  • Documentation: Sphinx site with 15+ pages, GitHub Actions CI/CD

Components Created:

  • TaskYAMLImport.tsx (180 lines) - Drag-and-drop import with validation

  • TaskYAMLExport.tsx (80 lines) - Single-click YAML export

  • UpdateNotification.tsx (110 lines) - Auto-update banner with GitHub integration

  • versionService.ts (60 lines) - Version checking service

  • check_batch_slo_compliance() (133 lines) - Per-batch SLO validation

  • docs/conf.py - Sphinx configuration with Furo theme

  • .github/workflows/docs.yml - GitHub Pages deployment workflow

Files Modified:

  • Frontend: Tasks.tsx, TaskResults.tsx, NewTask.tsx, Logo.tsx (10+ files)

  • Backend: optimizer.py, direct_benchmark_controller.py, config.py (5 files)

  • Documentation: README.md, CLAUDE.md, ROADMAP.md (reference fixes)

Performance Impact

Metric              Before M4         After M4            Improvement
──────────────────  ────────────────  ──────────────────  ─────────────
UI Response Time    2-5s polling      <100ms WebSocket    20-50x faster
Config Reusability  Manual JSON edit  YAML import/export  Instant
Update Awareness    Manual check      Auto-notification   Automatic
SLO Visibility      Numbers only      Visual ref lines    Intuitive
OOM Resilience      Experiment fails  Partial success     Graceful
Doc Files           66 files          15 files            77% reduction

Impact Summary

For Users:

  • ✅ Faster feedback: WebSocket real-time updates

  • ✅ Better visualization: SLO reference lines, enhanced charts

  • ✅ Config management: YAML import/export workflow

  • ✅ Stay updated: Automatic version checking

  • ✅ Fewer failures: Graceful OOM handling

  • ✅ Cleaner UI: Protected actions, clickable names

  • ✅ Professional branding: Custom logo and favicon

For Operators:

  • ✅ Easier troubleshooting: Per-batch SLO logging

  • ✅ Better resource utilization: Partial success support

  • ✅ Clearer documentation: 15 essential files vs 66

  • ✅ No broken links: All references verified

For Developers:

  • ✅ Maintainable docs: Focused, merged content

  • ✅ Working examples: Templates verified

  • ✅ Clear architecture: Essential docs only

  • ✅ Build tracking: Timestamp in version display


🎉 Milestone 5: Agent System & Local Deployment Mode

Date: 2025/12/22 (tag: milestone-5)
Status: ✅ COMPLETED
Timeline: 2025/12/01 → 2025/12/22
Objective: Introduce an LLM-powered Agent System for conversational task management and add a Local Deployment Mode for faster development iteration

Key Accomplishments

5.1 Agent Chat Interface ✅

  • Full-featured chat UI with streaming markdown responses (/agent)

  • Session management with persistent conversation history

  • Editable session titles with auto-generation from first message

  • IndexedDB-based message storage with backend sync

  • Server-Sent Events (SSE) for real-time streaming responses

Architecture:

┌─────────────────────┐     SSE Stream      ┌──────────────────┐
│  AgentChat.tsx      │◄───────────────────►│  /api/agent/chat │
│  (React Frontend)   │                     │  (FastAPI)       │
└─────────────────────┘                     └────────┬─────────┘
         │                                           │
         │ Markdown                                  │ OpenAI API
         ▼                                           ▼
┌─────────────────────┐                     ┌──────────────────┐
│ StreamingMarkdown   │                     │  LLM Backend     │
│ (react-markdown)    │                     │  (Configurable)  │
└─────────────────────┘                     └──────────────────┘

5.2 Tool Execution Framework ✅

  • Authorization system with AuthorizationScope enum (none, privileged, dangerous)

  • Privileged tools require user approval before execution

  • Auto-execute pending tools after authorization granted

  • Clear visual indicators for tool status in chat UI
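
A minimal sketch of the scope enum and approval check, under the three tiers named above (member values and the helper function are assumptions):

from enum import Enum

class AuthorizationScope(Enum):
	NONE = "none"              # executes immediately
	PRIVILEGED = "privileged"  # requires user approval before execution
	DANGEROUS = "dangerous"    # requires explicit per-call approval

def requires_approval(scope: AuthorizationScope) -> bool:
	"""Tools outside the NONE scope are queued until the user approves."""
	return scope is not AuthorizationScope.NONE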

Tool Categories Implemented:

Category             Tools                                                             Authorization
───────────────────  ────────────────────────────────────────────────────────────────  ─────────────
Task Management      create_task, start_task, get_task_status, get_task_logs          None
Worker Control       restart_arq_worker                                                Privileged
System Utilities     sleep, get_current_time                                           None
GitHub Integration   search_github_issues, create_github_issue, comment_github_issue  None
HuggingFace CLI      hf_cache_scan, hf_download, hf_repo_info                          None
Experiment Analysis  get_experiment_logs                                               None

5.3 Agent Backend Architecture ✅

  • LangChain framework for flexible model support

  • Support for Claude (Anthropic) and open-source models

  • Max iterations increased from 10 to 100 for complex tasks

  • Automatic tool result handling for multi-step operations

API Endpoints:

  • POST /api/agent/chat - SSE streaming chat endpoint

  • GET /api/agent/sessions - List all sessions

  • POST /api/agent/sessions - Create new session

  • PUT /api/agent/sessions/{id} - Update session title

  • DELETE /api/agent/sessions/{id} - Delete session

  • GET /api/agent/sessions/{id}/messages - Get session messages
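
A minimal sketch of the SSE chat endpoint, with run_agent() as a hypothetical stand-in for the LangChain loop in agent_service.py:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def run_agent(message: str):
	"""Hypothetical stand-in: yields response chunks as they are generated."""
	for token in ("Working", " on", " it..."):
		yield token

@app.post("/api/agent/chat")
async def agent_chat(payload: dict):
	async def event_stream():
		async for chunk in run_agent(payload.get("message", "")):
			yield f"data: {chunk}\n\n"  # one SSE frame per chunk
	return StreamingResponse(event_stream(), media_type="text/event-stream")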

5.4 Streaming Markdown Component ✅

  • Paragraph-aware streaming that preserves atomic elements

  • GitHub Flavored Markdown support (tables, code blocks, task lists)

  • Copy buttons for code blocks and tables (copies source, not rendered HTML)

  • Tailwind typography styling for consistent appearance

5.5 Local Deployment Mode ✅

  • New LocalController for subprocess-based model execution

  • Direct vLLM/SGLang server launch via python -m commands

  • Automatic port allocation (30000-30100 range)

  • Process lifecycle management with graceful shutdown

  • Log capture and streaming to task log files
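
A minimal sketch of the port allocation and subprocess launch, assuming vLLM's OpenAI-compatible entrypoint; the real LocalController also handles virtualenv activation and graceful shutdown:

import socket
import subprocess

def find_free_port(start: int = 30000, end: int = 30100) -> int:
	"""Return the first port in the configured range with no listener."""
	for port in range(start, end + 1):
		with socket.socket() as s:
			if s.connect_ex(("127.0.0.1", port)) != 0:
				return port
	raise RuntimeError("No free port in 30000-30100")

def launch_vllm(model: str, log_path: str) -> subprocess.Popen:
	"""Start a vLLM server subprocess and stream its output to a log file."""
	port = find_free_port()
	log_file = open(log_path, "a")
	return subprocess.Popen(
		["python", "-m", "vllm.entrypoints.openai.api_server",
		 "--model", model, "--port", str(port)],
		stdout=log_file, stderr=subprocess.STDOUT,
	)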

Runtime Support:

  • vLLM local environment: .venv-vllm/ with CUDA 12

  • SGLang local environment: .venv-sglang/ (SM86 limitation)

  • Automatic environment detection and activation

5.6 Dataset URL Support ✅

  • Remote URL dataset loading (CSV, JSONL, compressed archives)

  • Automatic format detection and conversion

  • URL-hash based caching in ~/.local/share/autotuner/datasets/

  • Deduplication option for prompt datasets

  • genai-bench submodule updated to fork with URL support
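
A minimal sketch of the URL-hash caching scheme described above (the fetch helper is hypothetical; the real controller also detects and converts formats):

import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".local/share/autotuner/datasets"

def fetch_dataset(url: str) -> Path:
	"""Download a remote dataset once; later runs reuse the cached copy."""
	CACHE_DIR.mkdir(parents=True, exist_ok=True)
	cached = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
	if not cached.exists():
		urllib.request.urlretrieve(url, cached)
	return cached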

5.7 Additional Improvements ✅

  • HuggingFace offline mode: Fixed cache path handling for air-gapped environments

  • GitHub Issue #3 fix: Resolved analyze_slo_violations() AttributeError

  • Foldable experiments list: Collapsible UI for large experiment sets

  • Comprehensive parameter presets: Runtime-specific vLLM and SGLang presets

  • Project rebranding: Logo redesign inspired by Novita.ai style

Technical Achievements

Code Statistics:

  • Agent Frontend: ~800 lines (AgentChat.tsx, StreamingMarkdown.tsx, AgentMessage.tsx)

  • Agent Backend: ~600 lines (routes/agent.py, schemas/agent.py, services/agent_service.py)

  • Tools: ~500 lines (task_tools.py, worker_tools.py, github_tools.py, hf_tools.py)

  • Local Mode: ~400 lines (local_controller.py, autotuner_worker.py updates)

  • Dataset: ~200 lines (dataset_controller.py)

  • Total New Code: ~2,500 lines

Key Components Created:

  • AgentChat.tsx (~400 lines) - Main chat interface with message history

  • StreamingMarkdown.tsx (~350 lines) - Paragraph-aware markdown renderer

  • LocalController (~300 lines) - Subprocess-based deployment controller

  • DatasetController (~200 lines) - Remote dataset fetching and caching

  • 12 agent tools across 6 categories

Performance Impact

Metric            Before M5            After M5                 Improvement
────────────────  ───────────────────  ───────────────────────  ─────────────────
Task Creation     Form-only            Conversational + Form    Natural language
Deployment Setup  Docker/K8s required  Local subprocess option  Faster iteration
Dataset Loading   Local files only     Remote URL support       More flexible
Agent Iterations  N/A                  Up to 100 steps          Complex workflows

Impact Summary

For Users:

  • ✅ Natural language task creation and management via Agent

  • ✅ Faster local development without Docker/Kubernetes overhead

  • ✅ Remote dataset support for production workload testing

  • ✅ GitHub integration for issue tracking and collaboration

  • ✅ HuggingFace integration for model management

For Developers:

  • ✅ Local deployment mode for rapid iteration

  • ✅ Extensible tool framework for custom integrations

  • ✅ Comprehensive logging and debugging support

  • ✅ Session-based conversation history

Current Status: Production-Ready v0.2.0 ✅

What Works Today

Core Functionality:

  • ✅ Grid search, random search, Bayesian optimization (Optuna TPE)

  • ✅ Docker mode deployment (recommended)

  • ✅ Kubernetes/OME mode deployment

  • ✅ Local mode deployment (subprocess-based, no containers)

  • ✅ Runtime-agnostic quantization configuration (vLLM, SGLang, TensorRT-LLM)

  • ✅ Runtime-agnostic parallelism configuration (18 presets)

  • ✅ SLO-aware scoring with exponential penalties

  • ✅ GPU intelligent scheduling with per-GPU efficiency metrics

  • ✅ Checkpoint mechanism for fault tolerance

  • ✅ Multi-objective Pareto optimization

  • ✅ Model caching optimization

  • ✅ Full-stack web UI with real-time monitoring

  • ✅ Agent System with LLM-powered conversational interface

  • ✅ Remote dataset support via URL fetching

Performance:

  • ✅ 28 tasks executed successfully

  • ✅ 408 total experiments run

  • ✅ 312 successful experiments (76.5% success rate)

  • ✅ Average experiment duration: 303.6 seconds

  • ✅ Bayesian optimization: 80-87% reduction vs grid search

Infrastructure:

  • ✅ FastAPI backend with async support

  • ✅ React 18 frontend with TypeScript

  • ✅ WebSocket real-time communication (backend + frontend)

  • ✅ SQLite database with WAL mode (XDG-compliant location)

  • ✅ Redis task queue with ARQ worker

  • ✅ Docker container management

  • ✅ Kubernetes resource management

  • ✅ LangChain-based agent framework with 12 tools


Future Roadmap

🔵 Phase 6: Distributed Architecture & Parallel Execution (Planned)

Priority: High
Effort: 3-4 weeks
Value: ⭐⭐⭐⭐⭐

6.1 Distributed Worker Architecture

  • Central Web Manager: Single control plane for multiple workers

  • Worker Registration: Auto-discovery and registration via Redis (sketched after the diagram below)

  • Heartbeat Monitoring: Worker health checks and failure detection

  • Work Stealing: Dynamic task redistribution across workers

  • Worker Pools: Group workers by capabilities (GPU type, region, etc.)

Architecture Design:

                    ┌─────────────────────┐
                    │  Central Web Manager│
                    │  (FastAPI + Redis)  │
                    └──────────┬──────────┘
                               │
           ┌───────────────────┼───────────────────┐
           │                   │                   │
    ┌──────▼──────┐    ┌──────▼──────┐    ┌──────▼──────┐
    │  Worker 1   │    │  Worker 2   │    │  Worker 3   │
    │  8×A100 GPUs│    │  8×H100 GPUs│    │  4×L40S GPUs│
    │  Node: gpu-1│    │  Node: gpu-2│    │  Node: gpu-3│
    └─────────────┘    └─────────────┘    └─────────────┘
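
A design sketch of registration and heartbeat for this planned architecture, not existing code; the Redis key layout and TTL-based failure detection are assumptions:

import json
import socket
import time

import redis

r = redis.Redis()

def register_worker(gpu_model: str, gpu_count: int, ttl: int = 30) -> None:
	"""Advertise capabilities and keep the registration alive via heartbeat."""
	worker_id = socket.gethostname()
	capabilities = {"gpu_model": gpu_model, "gpu_count": gpu_count}
	while True:
		r.set(f"worker:{worker_id}", json.dumps(capabilities), ex=ttl)
		time.sleep(ttl // 2)  # a crashed worker stops refreshing; its key expires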

Components:

  • Manager:

    • Task queue management

    • Worker registry with capabilities

    • Experiment distribution algorithm

    • Result aggregation service

    • Centralized logging

  • Worker:

    • Capability advertisement (GPU count, model, memory)

    • Experiment execution engine

    • Result reporting via REST API

    • Local checkpoint storage

    • Worker-level parallelism (max_parallel per worker)

Benefits:

  • Horizontal Scaling: Add workers to increase throughput

  • Resource Isolation: Different workers for different GPU types

  • Fault Tolerance: Worker failures don’t affect others

  • Geographic Distribution: Workers in different data centers

  • Cost Optimization: Use spot instances for workers

Implementation Plan:

  1. Week 1: Worker registration and discovery

  2. Week 2: Task distribution and scheduling

  3. Week 3: Result aggregation and monitoring

  4. Week 4: Frontend dashboard and testing

6.2 Advanced Parallel Execution

  • User-configurable max_parallel setting (currently hardcoded at 5)

  • Per-worker parallelism configuration

  • Dynamic parallelism based on GPU availability

  • Experiment dependency graph

  • Priority-based scheduling (high/normal/low priority tasks)

  • Resource reservation system (reserve GPUs for specific tasks)

Benefits:

  • Faster task completion (5-10x speedup with multiple workers)

  • Better GPU utilization across cluster

  • Configurable resource allocation per task

  • Fair scheduling with priority queues

6.3 Task Sharding & Load Balancing

  • Automatic task splitting across workers

  • Load-aware scheduling (balance by GPU count)

  • Locality-aware scheduling (prefer same-node experiments)

  • Cross-worker result aggregation

  • Consistent hashing for worker selection


🔵 Phase 7: Advanced Optimization & Runtime Features (Planned)

Priority: Medium
Effort: 2-4 weeks
Value: ⭐⭐⭐⭐

7.0 Agent Charting Tool

  • Add chart generation tool for Agent to visualize experiment results

  • Candidates: Matplotlib (static images), Plotly (interactive HTML)

  • Chart types: bar charts, line plots, scatter plots, heatmaps

  • Use cases:

    • Compare throughput/latency across experiments

    • Visualize parameter sensitivity

    • Generate Pareto frontier plots

    • Create SLO compliance charts

  • Output: Save charts to files or display inline in chat

Implementation Options:

Library     Pros                                    Cons
──────────  ──────────────────────────────────────  ─────────────────────
Matplotlib  Simple, widely used, static images      Not interactive
Plotly      Interactive, HTML export, beautiful     Larger dependency
Seaborn     Statistical plots, built on Matplotlib  Limited interactivity

7.1 Runtime-Specific Optimizations

SGLang Radix Cache Management:

  • Reset radix cache at experiment start: Clear cache before each experiment

  • Benchmark purity: Ensure fair comparison without cache pollution

  • Cache warming option: Optional pre-fill for production scenarios

  • Cache statistics tracking: Monitor hit rate and memory usage

Implementation:

import logging

import requests

logger = logging.getLogger(__name__)

# Before each experiment: clear the cache so every run starts cold
def reset_sglang_radix_cache(port: int):
	"""Reset SGLang radix cache via HTTP API"""
	response = requests.post(
		f"http://localhost:{port}/reset_cache",
		json={"cache_type": "radix"},
	)
	logger.info(f"Radix cache reset: {response.json()}")

Benefits:

  • Fair experiment comparisons (no cached KV states)

  • Reproducible benchmark results

  • Accurate TTFT measurements

  • Option to test both cold-start and warm-cache scenarios

Additional Runtime Features:

  • vLLM prefix caching control

  • TensorRT-LLM engine rebuild triggers

  • Runtime-specific profiling hooks

  • Memory defragmentation between experiments

7.2 Multi-Fidelity Optimization

  • Progressive benchmark complexity

  • Early stopping for poor configurations

  • Hyperband algorithm integration

  • Adaptive resource allocation

  • Quick validation runs (low concurrency, short duration)

  • Full benchmark only for promising configs

7.3 Transfer Learning

  • Model similarity detection (architecture, size, quantization)

  • Cross-model parameter transfer

  • Historical performance database (SQLite → PostgreSQL)

  • Meta-learning for initialization

  • Warmstart Bayesian optimization with historical data

7.4 Enhanced Multi-Objective Optimization

  • NSGA-II algorithm for Pareto frontier

  • 3+ objective support (latency, throughput, cost, energy, memory)

  • Interactive trade-off exploration

  • User preference learning

  • Weighted objective combination

  • Pareto frontier approximation with surrogate models


7.5 Enhanced Export & Data Portability

  • Export experiment results to CSV

  • Export results to JSON for analysis

  • Export results to Excel (.xlsx) format

  • Batch import multiple task configs

  • Template library (export/import task templates)

  • Share configurations via file or URL

  • YAML parser with schema validation

  • Automatic conversion between JSON ↔ YAML

  • YAML syntax highlighting in frontend
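
A minimal sketch of the planned JSON ↔ YAML conversion (PyYAML assumed; the planned schema validation would run before writing):

import json

import yaml  # PyYAML

def json_to_yaml(json_path: str, yaml_path: str) -> None:
	"""Convert a task config from JSON to YAML, preserving key order."""
	with open(json_path) as f:
		config = json.load(f)
	with open(yaml_path, "w") as f:
		yaml.safe_dump(config, f, sort_keys=False)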

Benefits:

  • Data portability for external analysis tools (Excel, Python, R)

  • Batch operations for managing multiple tasks

  • Configuration templates for common use cases

  • Team collaboration via shared configs

  • Integration with data science workflows

Export Formats:

  • Experiment Results: .csv, .json, .xlsx

  • Task Configs: .yaml, .json

  • Templates: Zip archive with metadata

7.6 Custom Dataset Support for GenAI-Bench

  • Fetch datasets from user-specified URLs

  • Support CSV format parsing

  • Support JSONL (JSON Lines) format parsing

  • Conversion script to genai-bench format

  • Dataset validation and preprocessing

  • Automatic schema detection

  • Support for custom prompt templates

  • Integration with task configuration

Supported Input Formats:

# CSV format
prompt,max_tokens,temperature
"Explain quantum computing",100,0.7
"Write a story about AI",200,0.9
# JSONL format
{"prompt": "Explain quantum computing", "max_tokens": 100, "temperature": 0.7}
{"prompt": "Write a story about AI", "max_tokens": 200, "temperature": 0.9}

Conversion Pipeline:

# Download and convert custom dataset
python scripts/prepare_custom_dataset.py \
  --url https://example.com/dataset.csv \
  --format csv \
  --output ./data/custom_benchmark.json

# Use in task configuration
{
  "benchmark": {
    "custom_dataset": "./data/custom_benchmark.json",
    "task": "text-to-text"
  }
}

Features:

  • HTTP/HTTPS URL fetching with authentication support

  • Automatic format detection (CSV/JSONL)

  • Field mapping configuration (map CSV columns to genai-bench schema)

  • Data validation (check required fields, token limits)

  • Sampling strategies (random, stratified, sequential)

  • Dataset caching to avoid re-downloading

Benefits:

  • Use real production workloads for benchmarking

  • Test with domain-specific prompts

  • Reproducible benchmarks with versioned datasets

  • Support for custom evaluation scenarios

  • Integration with existing data pipelines

GenAI-Bench Schema Mapping:

# Required fields for genai-bench
{
  "prompt": str,           # Input text
  "output_len": int,       # Expected output length
  "input_len": int,        # Input length (auto-calculated if not provided)
  "temperature": float,    # Optional: sampling temperature
  "top_p": float,          # Optional: nucleus sampling
  "max_tokens": int        # Optional: max output tokens
}

Implementation Components:

  1. Dataset Fetcher (src/utils/dataset_fetcher.py)

    • URL download with retries

    • Authentication headers support

    • Local file caching

  2. Format Converters (src/utils/dataset_converters/)

    • csv_converter.py: CSV → genai-bench JSON

    • jsonl_converter.py: JSONL → genai-bench JSON

    • Base converter interface for extensibility

  3. Validation Module (src/utils/dataset_validator.py)

    • Schema validation

    • Token limit checking

    • Duplicate detection

  4. CLI Tool (scripts/prepare_custom_dataset.py)

    • Standalone conversion utility

    • Preview mode (show first N records)

    • Statistics reporting
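
A minimal sketch of the planned csv_converter.py, using the column names from the CSV example above (the defaults are assumptions):

import csv

def csv_to_genai_bench(csv_path: str) -> list[dict]:
	"""Map CSV rows onto the genai-bench record schema shown earlier."""
	records = []
	with open(csv_path, newline="") as f:
		for row in csv.DictReader(f):
			records.append({
				"prompt": row["prompt"],
				"max_tokens": int(row.get("max_tokens", 128)),
				"temperature": float(row.get("temperature", 1.0)),
			})
	return records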

🔵 Phase 8: Enterprise Features (Planned)

Priority: Low-Medium
Effort: 3-5 weeks
Value: ⭐⭐⭐

8.1 Multi-User Support

  • User authentication (OAuth2)

  • Role-based access control (RBAC)

  • Task ownership and sharing

  • Team workspaces

8.2 Advanced Monitoring

  • Prometheus metrics exporter

  • Grafana dashboard templates

  • Alert rules for failures

  • Performance analytics

8.3 CI/CD Integration

  • GitHub Actions workflow

  • Automated benchmarking on PR

  • Performance regression detection

  • Automated deployment

8.4 Cloud Deployment

  • AWS deployment guide (EKS)

  • GCP deployment guide (GKE)

  • Azure deployment guide (AKS)

  • Terraform modules

  • Helm charts


🟢 Phase 9: Research & Innovation (Future)

Priority: Low
Effort: Variable
Value: ⭐⭐⭐

9.1 Auto-Scaling Integration

  • Horizontal Pod Autoscaler (HPA) optimization

  • Vertical Pod Autoscaler (VPA) tuning

  • Knative Serving integration

  • Cost-aware scaling

9.2 Advanced Benchmarking

  • Custom benchmark scenario editor

  • Real-world traffic replay

  • Synthetic load generation

  • Multi-modal benchmarking

9.3 Model-Specific Optimization

  • Architecture-aware parameter tuning

  • Quantization-aware optimization

  • Attention mechanism tuning

  • Memory layout optimization


Maintenance & Technical Debt

Recently Fixed (2025/11/25) ✅

Database Schema Mismatch:

  • ❌ Missing columns: clusterbasemodel_config, clusterservingruntime_config, created_clusterbasemodel, created_clusterservingruntime

  • ✅ Fixed: Added ALTER TABLE statements

  • ✅ Verified: All endpoints working, HTTP 500 errors resolved

Known Issues

  1. Worker Restart Required

    • ⚠️ ARQ worker doesn’t hot-reload code changes

    • Manual restart needed after editing orchestrator.py, controllers/

    • Future: Add file watcher for auto-restart

  2. Polling-Based UI Updates

    • ⚠️ Frontend polls every 2-5 seconds

    • Inefficient for idle states

    • Future: WebSocket migration (Phase 4)

Technical Improvements

  1. Testing Coverage

    • Current: Manual testing only

    • Future: Unit tests, integration tests, E2E tests

    • Target: 80% code coverage

  2. Error Handling

    • Current: Basic try-catch blocks

    • Future: Comprehensive error taxonomy, retry logic, graceful degradation

  3. Database Migration

    • Current: Manual SQL commands

    • Future: Alembic migrations

    • Version-controlled schema changes


Success Metrics

Current Performance (Milestone 5)

Metric                   Value                      Target
───────────────────────  ─────────────────────────  ──────
Total Tasks              28                         -
Total Experiments        408                        -
Success Rate             76.5%                      >80%
Avg Experiment Duration  303.6s                     <300s
Bayesian Efficiency      80-87% reduction           >70%
UI Response Time         <200ms                     <100ms
API Latency (P95)        <500ms                     <200ms
Supported Runtimes       3 (vLLM, SGLang, TRT-LLM)  -
Deployment Modes         3 (Docker, OME, Local)     -
Agent Tools              12 (across 6 categories)   -

Future Targets (v2.0)

  • Experiment Success Rate: >90%

  • Avg Experiment Duration: <240s (20% improvement)

  • UI Response Time: <100ms (WebSocket)

  • Concurrent Experiments: >10 parallel

  • Cost Reduction: 50% fewer experiments vs grid search

  • Multi-Runtime Support: Add Triton and other runtimes


End of Roadmap | Last Updated: 2025/12/22 | Version: 0.2.0 (Milestone 5 Complete)