# SLO-Aware Objective Scoring The autotuner now supports sophisticated Service Level Objective (SLO) aware scoring with **exponential penalties** for violations and **tiered enforcement** (soft penalties vs hard failures). ## Overview The SLO-aware scoring algorithm enhances experiment evaluation by: 1. **Exponential Penalty Curves**: Creates steep score increases near SLO boundaries 2. **Tiered Enforcement**: Distinguishes between minor violations (penalty) and severe violations (hard fail) 3. **Multi-Metric Support**: Monitors P50/P90/P99 latency, TTFT (Time to First Token), and TPOT (Time Per Output Token) 4. **Configurable Per-Task**: Each task defines its own SLO thresholds and weights ## Mathematical Formula ### Base Scoring Formula ``` final_score = base_objective_score × (1 + total_penalty) ``` Where `total_penalty` is the sum of all per-metric penalties. ### Per-Metric Penalty Calculation For each SLO metric that exceeds its threshold: ```python violation_ratio = (actual_value - threshold) / threshold # Normalized percentage penalty = weight × exp(violation_ratio / steepness) ``` **Key Parameters:** - `weight`: Penalty multiplier (higher weights = more important metrics) - `steepness`: Controls curve slope (lower = steeper penalties, default: 0.1) ### Tiered Enforcement - **Minor Violations** (< fail_ratio): Exponential penalty applied to score - **Severe Violations** (≥ fail_ratio): Experiment marked as `FAILED` with score = ∞ ## Task Configuration Add an optional `slo` section to your task JSON. **All fields within the SLO configuration are optional** - you can specify only the metrics you care about. ### Full Example (All Options) ```json { "task_name": "my-slo-aware-task", "optimization": { "strategy": "grid_search", "objective": "minimize_latency" }, "slo": { "latency": { "p50": { "threshold": 2.0, "weight": 1.0, "hard_fail": false }, "p90": { "threshold": 5.0, "weight": 2.0, "hard_fail": true, "fail_ratio": 0.2 }, "p99": { "threshold": 10.0, "weight": 3.0, "hard_fail": true, "fail_ratio": 0.5 } }, "ttft": { "threshold": 1.0, "weight": 2.0, "hard_fail": false }, "tpot": { "threshold": 0.05, "weight": 2.0, "hard_fail": false }, "steepness": 0.1 } } ``` ### Minimal Example (Only Required Fields) You can specify just the metrics you want to enforce. Here's a minimal configuration with only P99 latency: ```json { "task_name": "my-minimal-slo-task", "optimization": { "strategy": "grid_search", "objective": "minimize_latency" }, "slo": { "latency": { "p99": { "threshold": 10.0 } } } } ``` ### Configuration Parameters **Important:** All SLO configuration fields are optional. You can: - Omit entire metric sections (e.g., no P50 if you only care about P99) - Omit individual metrics (e.g., only configure P90 and P99) - Omit optional parameters within metrics (weight, hard_fail, fail_ratio) #### Per-Metric SLO - **`threshold`** (required if metric specified): Maximum allowed value (in seconds) - **`weight`** (optional, default: 1.0): Penalty weight for this metric - **`hard_fail`** (optional, default: false): Enable hard failure enforcement - **`fail_ratio`** (optional, default: 0.5): Violation threshold for hard fail (e.g., 0.2 = 20% over) #### Global SLO - **`steepness`** (optional, default: 0.1): Exponential curve steepness parameter ## Example Scenarios ### Scenario 1: No SLO Violations **Metrics:** P90 = 4.0s (threshold: 5.0s) **Result:** - Penalty multiplier: 1.0 - Final score: base_score × 1.0 (no penalty) ### Scenario 2: Minor Violation (10% over) **Metrics:** P90 = 5.5s (threshold: 5.0s, weight: 2.0, steepness: 0.1) **Calculation:** ``` violation_ratio = (5.5 - 5.0) / 5.0 = 0.10 (10%) penalty = 2.0 × exp(0.10 / 0.1) = 2.0 × exp(1.0) ≈ 5.44 penalty_multiplier = 1 + 5.44 = 6.44 ``` **Result:** - Base score: 3.0s - Final score: 3.0 × 6.44 = **19.3s** (worse score) - Status: `SUCCESS` but penalized ### Scenario 3: Severe Violation (Hard Fail) **Metrics:** P90 = 6.5s (threshold: 5.0s, fail_ratio: 0.2) **Calculation:** ``` violation_ratio = (6.5 - 5.0) / 5.0 = 0.30 (30%) 30% > 20% fail_ratio → HARD FAILURE ``` **Result:** - Final score: **∞** (infinity) - Status: `FAILED` - Reason: "Hard SLO violation" ### Scenario 4: Multiple Violations (Cumulative Penalties) **Metrics:** - P50 = 2.3s (threshold: 2.0s, weight: 1.0) → +4.48 penalty - P90 = 5.5s (threshold: 5.0s, weight: 2.0) → +5.44 penalty - P99 = 11.0s (threshold: 10.0s, weight: 3.0) → +8.15 penalty - TTFT = 1.2s (threshold: 1.0s, weight: 2.0) → +14.78 penalty **Total Penalty:** 32.85 **Result:** - Base score: 2.5s - Final score: 2.5 × 33.85 = **84.6s** - Score increase: **3285%** 🔥 ## Steepness Parameter Impact The `steepness` parameter controls how aggressively penalties grow: | Steepness | 20% Violation Penalty | Behavior | |-----------|----------------------|----------| | 0.05 | 110.2x | Very steep (aggressive) | | **0.1** | **15.8x** | **Recommended default** | | 0.2 | 6.4x | Gentler curve | **Lower steepness = Steeper penalties near boundaries** ⚠️ ## Frontend Features ### Task Creation UI Navigate to **Create New Task** → Enable **SLO Configuration** toggle: - Configure P50/P90/P99 latency thresholds - Configure TTFT (Time to First Token) thresholds - Configure TPOT (Time Per Output Token) thresholds - Set penalty weights per metric - Enable hard fail enforcement with fail_ratio - Adjust steepness parameter ### Experiments View Experiments violating hard SLO constraints display: - Red "SLO" badge next to status - `slo_violation: true` flag in experiment data - Status automatically marked as `FAILED` ## Backend Implementation ### Optimizer Module (`src/utils/optimizer.py`) **New Functions:** 1. **`calculate_slo_penalty(metrics, slo_config)`** - Returns: `(penalty_multiplier, is_hard_failure, violation_details)` - Implements exponential penalty formula - Checks hard failure conditions 2. **`calculate_objective_score(results, objective, slo_config)`** - Enhanced to accept optional `slo_config` - Applies SLO penalties to base score - Returns `inf` for hard failures ### Orchestrator (`src/orchestrator.py`) - Passes `task.get("slo")` to scoring function - Marks experiments as `FAILED` when `score == inf` - Adds `slo_violation: true` flag to experiment results ## Testing Run the test suite to verify algorithm behavior: ```bash python test_slo_algorithm.py ``` **Test Coverage:** - ✓ No violations (baseline) - ✓ Minor violations (soft penalties) - ✓ Severe violations (exponential growth) - ✓ Hard failure boundary conditions - ✓ Multiple cumulative violations - ✓ Steepness parameter effects - ✓ TPOT SLO enforcement (test_tpot_slo.py) - ✓ Optional field handling (test_slo_optional_fields.py) ## Example Task See `examples/docker_task_with_slo.json` for a complete example with SLO configuration. ## Use Cases ### 1. **Production-Like Constraints** Ensure tuned configurations meet real-world SLOs: ```json "slo": { "latency": { "p99": {"threshold": 10.0, "hard_fail": true, "fail_ratio": 0.2} } } ``` ### 2. **Multi-Objective Optimization** Balance latency, TTFT, and TPOT: ```json "slo": { "latency": { "p90": {"threshold": 5.0, "weight": 1.0} }, "ttft": {"threshold": 1.0, "weight": 3.0}, // Higher weight = more important "tpot": {"threshold": 0.05, "weight": 2.0} } ``` ### 3. **Soft Boundaries for Exploration** Penalize but don't reject near-boundary configurations: ```json "slo": { "latency": { "p90": {"threshold": 5.0, "weight": 2.0, "hard_fail": false} }, "steepness": 0.15 // Gentler curve for exploration } ``` ## Design Rationale ### Why Exponential Penalties? Linear penalties don't adequately penalize configurations near SLO boundaries: | Violation | Linear (2x weight) | Exponential (weight=2, s=0.1) | |-----------|-------------------|-------------------------------| | 5% over | 1.10x | 2.30x | | 10% over | 1.20x | 3.72x | | 20% over | 1.40x | 15.78x | | 50% over | 2.00x | 297.4x | Exponential curves create **steep gradients** that guide optimization away from SLO boundaries. ### Why Tiered Enforcement? - **Soft Penalties**: Allow exploration of configurations slightly over SLO - **Hard Failures**: Reject configurations that egregiously violate critical SLOs This mirrors real-world SLO design where some violations are tolerable (warn) and others are not (page). ## Backward Compatibility Tasks without `slo` configuration continue to work unchanged. SLO scoring is fully optional and backward compatible. ## Future Enhancements - Support for throughput SLOs (minimum thresholds) - Custom penalty functions (polynomial, piecewise) - SLO violation budgets (allow N% of experiments to violate) - SLO-aware Bayesian optimization (constrained BO) ## References - **Exponential Penalty Functions**: Common in constrained optimization - **SLO Design**: Google SRE Book - Chapter 4 (Service Level Objectives) - **Tiered Enforcement**: Inspired by alerting thresholds (warn/critical) --- ## Graded Failure Penalties for Bayesian Optimization ### Problem When all experiments fail with infinite scores (`-inf` or `+inf`), Bayesian optimization cannot distinguish between parameter configurations and degrades to random search. ### Solution: Time-Based Failure Penalties Failed experiments receive **graded penalties** based on failure timing - earlier failures get worse penalties. ### Penalty Calculation Located in `src/web/workers/autotuner_worker.py`: ```python def calculate_failure_penalty(started_at, failed_at, timeout_seconds, experiment_status, error_message, objective_name): elapsed = (failed_at - started_at).total_seconds() completion_pct = min(elapsed / timeout_seconds, 1.0) # Base penalty by completion percentage if completion_pct < 0.20: base_penalty = -1000 # Very early (deployment, immediate crash) elif completion_pct < 0.60: base_penalty = -500 # Mid-stage (benchmark started but failed) elif completion_pct < 0.95: base_penalty = -200 # Late-stage (benchmark mostly done) else: base_penalty = -100 # Timeout (full duration) # Modifiers based on error type if "oom" or "memory" in error: base_penalty *= 1.5 # Resource failures if "deploy" in error: base_penalty *= 1.2 # Deployment failures if "connection" in error: base_penalty *= 0.8 # Transient issues # Invert for minimize objectives return -base_penalty if "minimize" in objective_name else base_penalty ``` ### Benefits 1. **Provides gradient**: Bayesian optimizer can distinguish parameter quality even when all fail 2. **Prioritizes stability**: Configs that run longer are preferred 3. **Contextual penalties**: Error types affect severity 4. **Enables learning**: Optimizer learns to avoid problematic parameter regions ### Example Scenarios | Failure Timing | Completion % | Base Penalty | Scenario | |---------------|-------------|--------------|----------| | 10 seconds | 2% | -1000 | Deployment failure, config clearly broken | | 200 seconds | 50% | -500 | Benchmark started but OOM | | 450 seconds | 90% | -200 | Almost complete, near-miss | | 500 seconds | 100% | -100 | Timeout, config might work with more time | This allows the optimizer to progressively learn which parameters cause early vs late failures.