SLO-Aware Objective Scoring¶
The autotuner now supports sophisticated Service Level Objective (SLO) aware scoring with exponential penalties for violations and tiered enforcement (soft penalties vs hard failures).
Overview¶
The SLO-aware scoring algorithm enhances experiment evaluation by:
Exponential Penalty Curves: Creates steep score increases near SLO boundaries
Tiered Enforcement: Distinguishes between minor violations (penalty) and severe violations (hard fail)
Multi-Metric Support: Monitors P50/P90/P99 latency, TTFT (Time to First Token), and TPOT (Time Per Output Token)
Configurable Per-Task: Each task defines its own SLO thresholds and weights
Mathematical Formula¶
Base Scoring Formula¶
final_score = base_objective_score × (1 + total_penalty)
Where total_penalty is the sum of all per-metric penalties.
Per-Metric Penalty Calculation¶
For each SLO metric that exceeds its threshold:
violation_ratio = (actual_value - threshold) / threshold # Normalized percentage
penalty = weight × exp(violation_ratio / steepness)
Key Parameters:
weight: Penalty multiplier (higher weights = more important metrics)
steepness: Controls curve slope (lower = steeper penalties, default: 0.1)
Tiered Enforcement¶
Minor Violations (< fail_ratio): Exponential penalty applied to score
Severe Violations (≥ fail_ratio): Experiment marked as FAILED with score = ∞
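The per-metric formula and the tiered enforcement rule can be sketched together as follows. This is a minimal illustration, not the actual src/utils/optimizer.py implementation; the function name and argument shapes here are assumptions for demonstration only.

```python
import math

def metric_penalty(actual, threshold, weight=1.0, steepness=0.1,
                   hard_fail=False, fail_ratio=0.5):
    """Return (penalty, is_hard_failure) for a single SLO metric."""
    if actual <= threshold:
        return 0.0, False                      # no violation, no penalty
    violation_ratio = (actual - threshold) / threshold
    if hard_fail and violation_ratio >= fail_ratio:
        return float("inf"), True              # severe violation: hard fail
    return weight * math.exp(violation_ratio / steepness), False

# Minor violation: P90 = 5.5s against a 5.0s threshold, weight 2.0
penalty, failed = metric_penalty(5.5, 5.0, weight=2.0)
# final_score = base_objective_score × (1 + total_penalty)
final_score = 3.0 * (1 + penalty)
```

A 10% violation with the default steepness of 0.1 yields a penalty of roughly 5.44, while a 30% violation against a 20% fail_ratio returns infinity and the hard-failure flag.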
Task Configuration¶
Add an optional slo section to your task JSON. All fields within the SLO configuration are optional - you can specify only the metrics you care about.
Full Example (All Options)¶
{
"task_name": "my-slo-aware-task",
"optimization": {
"strategy": "grid_search",
"objective": "minimize_latency"
},
"slo": {
"latency": {
"p50": {
"threshold": 2.0,
"weight": 1.0,
"hard_fail": false
},
"p90": {
"threshold": 5.0,
"weight": 2.0,
"hard_fail": true,
"fail_ratio": 0.2
},
"p99": {
"threshold": 10.0,
"weight": 3.0,
"hard_fail": true,
"fail_ratio": 0.5
}
},
"ttft": {
"threshold": 1.0,
"weight": 2.0,
"hard_fail": false
},
"tpot": {
"threshold": 0.05,
"weight": 2.0,
"hard_fail": false
},
"steepness": 0.1
}
}
Minimal Example (Only Required Fields)¶
You can specify just the metrics you want to enforce. Here’s a minimal configuration with only P99 latency:
{
"task_name": "my-minimal-slo-task",
"optimization": {
"strategy": "grid_search",
"objective": "minimize_latency"
},
"slo": {
"latency": {
"p99": {
"threshold": 10.0
}
}
}
}
Configuration Parameters¶
Important: All SLO configuration fields are optional. You can:
Omit entire metric sections (e.g., no P50 if you only care about P99)
Omit individual metrics (e.g., only configure P90 and P99)
Omit optional parameters within metrics (weight, hard_fail, fail_ratio)
Per-Metric SLO¶
threshold (required if metric specified): Maximum allowed value (in seconds)
weight (optional, default: 1.0): Penalty weight for this metric
hard_fail (optional, default: false): Enable hard failure enforcement
fail_ratio (optional, default: 0.5): Violation threshold for hard fail (e.g., 0.2 = 20% over)
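One way the optional fields might be resolved against their defaults is sketched below. The field names follow this document; the helper name and dict-based approach are hypothetical, not the project's actual code.

```python
# Defaults for optional per-metric SLO fields, per the table above
SLO_METRIC_DEFAULTS = {"weight": 1.0, "hard_fail": False, "fail_ratio": 0.5}

def resolve_metric_config(cfg):
    """Fill in default weight/hard_fail/fail_ratio; threshold is required."""
    if "threshold" not in cfg:
        raise ValueError("SLO metric requires a 'threshold'")
    return {**SLO_METRIC_DEFAULTS, **cfg}

# The minimal example above (threshold only) resolves to a full config
resolved = resolve_metric_config({"threshold": 10.0})
```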
Global SLO¶
steepness (optional, default: 0.1): Exponential curve steepness parameter
Example Scenarios¶
Scenario 1: No SLO Violations¶
Metrics: P90 = 4.0s (threshold: 5.0s)
Result:
Penalty multiplier: 1.0
Final score: base_score × 1.0 (no penalty)
Scenario 2: Minor Violation (10% over)¶
Metrics: P90 = 5.5s (threshold: 5.0s, weight: 2.0, steepness: 0.1)
Calculation:
violation_ratio = (5.5 - 5.0) / 5.0 = 0.10 (10%)
penalty = 2.0 × exp(0.10 / 0.1) = 2.0 × exp(1.0) ≈ 5.44
penalty_multiplier = 1 + 5.44 = 6.44
Result:
Base score: 3.0s
Final score: 3.0 × 6.44 = 19.3s (worse score)
Status: SUCCESS but penalized
Scenario 3: Severe Violation (Hard Fail)¶
Metrics: P90 = 6.5s (threshold: 5.0s, fail_ratio: 0.2)
Calculation:
violation_ratio = (6.5 - 5.0) / 5.0 = 0.30 (30%)
30% > 20% fail_ratio → HARD FAILURE
Result:
Final score: ∞ (infinity)
Status: FAILED
Reason: "Hard SLO violation"
Scenario 4: Multiple Violations (Cumulative Penalties)¶
Metrics:
P50 = 2.3s (threshold: 2.0s, weight: 1.0) → +4.48 penalty
P90 = 5.5s (threshold: 5.0s, weight: 2.0) → +5.44 penalty
P99 = 11.0s (threshold: 10.0s, weight: 3.0) → +8.15 penalty
TTFT = 1.2s (threshold: 1.0s, weight: 2.0) → +14.78 penalty
Total Penalty: 32.85 (penalty multiplier: 1 + 32.85 = 33.85)
Result:
Base score: 2.5s
Final score: 2.5 × 33.85 = 84.6s
Score increase: 3285% 🔥
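Scenario 4's arithmetic can be reproduced directly from the penalty formula. This is a quick numeric check using the thresholds and weights listed above; the helper function is for illustration only.

```python
import math

def penalty(actual, threshold, weight, steepness=0.1):
    """Per-metric penalty: weight × exp(violation_ratio / steepness)."""
    ratio = (actual - threshold) / threshold
    return weight * math.exp(ratio / steepness)

total = (penalty(2.3, 2.0, 1.0)       # P50:  15% over → ≈ 4.48
         + penalty(5.5, 5.0, 2.0)     # P90:  10% over → ≈ 5.44
         + penalty(11.0, 10.0, 3.0)   # P99:  10% over → ≈ 8.15
         + penalty(1.2, 1.0, 2.0))    # TTFT: 20% over → ≈ 14.78

final_score = 2.5 * (1 + total)       # ≈ 84.6
```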
Steepness Parameter Impact¶
The steepness parameter controls how aggressively penalties grow:
| Steepness | Penalty Multiplier at 20% Violation (weight = 2) | Behavior |
|---|---|---|
| 0.05 | 110.2x | Very steep (aggressive) |
| 0.1 | 15.8x | Recommended default |
| 0.2 | 6.4x | Gentler curve |
Lower steepness = Steeper penalties near boundaries ⚠️
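The multipliers in the table follow from the same formula, varying only the steepness at a fixed 20% violation with weight 2.0 (a quick verification sketch; the helper name is illustrative):

```python
import math

def multiplier_at(violation_ratio, weight=2.0, steepness=0.1):
    """Penalty multiplier: 1 + weight × exp(ratio / steepness)."""
    return 1 + weight * math.exp(violation_ratio / steepness)

# 20% violation under three steepness settings
multipliers = {s: round(multiplier_at(0.20, steepness=s), 1)
               for s in (0.05, 0.1, 0.2)}
```

Dividing the violation ratio by a smaller steepness drives the exponent up faster, which is why lower steepness produces the more aggressive curve.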
Frontend Features¶
Task Creation UI¶
Navigate to Create New Task → Enable SLO Configuration toggle:
Configure P50/P90/P99 latency thresholds
Configure TTFT (Time to First Token) thresholds
Configure TPOT (Time Per Output Token) thresholds
Set penalty weights per metric
Enable hard fail enforcement with fail_ratio
Adjust steepness parameter
Experiments View¶
Experiments violating hard SLO constraints display:
Red “SLO” badge next to status
slo_violation: true flag in experiment data
Status automatically marked as FAILED
Backend Implementation¶
Optimizer Module (src/utils/optimizer.py)¶
New Functions:
calculate_slo_penalty(metrics, slo_config)
Returns (penalty_multiplier, is_hard_failure, violation_details)
Implements the exponential penalty formula
Checks hard failure conditions
calculate_objective_score(results, objective, slo_config)
Enhanced to accept an optional slo_config
Applies SLO penalties to the base score
Returns inf for hard failures
Orchestrator (src/orchestrator.py)¶
Passes task.get("slo") to the scoring function
Marks experiments as FAILED when score == inf
Adds a slo_violation: true flag to experiment results
Testing¶
Run the test suite to verify algorithm behavior:
python test_slo_algorithm.py
Test Coverage:
✓ No violations (baseline)
✓ Minor violations (soft penalties)
✓ Severe violations (exponential growth)
✓ Hard failure boundary conditions
✓ Multiple cumulative violations
✓ Steepness parameter effects
✓ TPOT SLO enforcement (test_tpot_slo.py)
✓ Optional field handling (test_slo_optional_fields.py)
Example Task¶
See examples/docker_task_with_slo.json for a complete example with SLO configuration.
Use Cases¶
1. Production-Like Constraints¶
Ensure tuned configurations meet real-world SLOs:
"slo": {
"latency": {
"p99": {"threshold": 10.0, "hard_fail": true, "fail_ratio": 0.2}
}
}
2. Multi-Objective Optimization¶
Balance latency, TTFT, and TPOT:
"slo": {
"latency": {
"p90": {"threshold": 5.0, "weight": 1.0}
},
"ttft": {"threshold": 1.0, "weight": 3.0}, // Higher weight = more important
"tpot": {"threshold": 0.05, "weight": 2.0}
}
3. Soft Boundaries for Exploration¶
Penalize but don’t reject near-boundary configurations:
"slo": {
"latency": {
"p90": {"threshold": 5.0, "weight": 2.0, "hard_fail": false}
},
"steepness": 0.15 // Gentler curve for exploration
}
Design Rationale¶
Why Exponential Penalties?¶
Linear penalties don’t adequately penalize configurations near SLO boundaries:
| Violation | Linear (2× weight) | Exponential (weight=2, s=0.1) |
|---|---|---|
| 5% over | 1.10x | 4.30x |
| 10% over | 1.20x | 6.44x |
| 20% over | 1.40x | 15.78x |
| 50% over | 2.00x | 297.83x |
Exponential curves create steep gradients that guide optimization away from SLO boundaries.
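The contrast can be computed directly from the two penalty shapes, a linear multiplier 1 + weight × ratio against the exponential 1 + weight × exp(ratio / steepness) used in this document (the helper names below are illustrative):

```python
import math

def linear_multiplier(ratio, weight=2.0):
    """Linear penalty: grows proportionally with the violation."""
    return 1 + weight * ratio

def exp_multiplier(ratio, weight=2.0, steepness=0.1):
    """Exponential penalty: grows steeply near the boundary."""
    return 1 + weight * math.exp(ratio / steepness)

comparison = {f"{int(r * 100)}% over": (round(linear_multiplier(r), 2),
                                        round(exp_multiplier(r), 2))
              for r in (0.05, 0.10, 0.20, 0.50)}
```

At 50% over, the linear multiplier has only doubled the score while the exponential one has multiplied it by nearly 300, which is the gradient that pushes the optimizer away from the boundary.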
Why Tiered Enforcement?¶
Soft Penalties: Allow exploration of configurations slightly over SLO
Hard Failures: Reject configurations that egregiously violate critical SLOs
This mirrors real-world SLO design where some violations are tolerable (warn) and others are not (page).
Backward Compatibility¶
Tasks without slo configuration continue to work unchanged. SLO scoring is fully optional and backward compatible.
Future Enhancements¶
Support for throughput SLOs (minimum thresholds)
Custom penalty functions (polynomial, piecewise)
SLO violation budgets (allow N% of experiments to violate)
SLO-aware Bayesian optimization (constrained BO)
References¶
Exponential Penalty Functions: Common in constrained optimization
SLO Design: Google SRE Book - Chapter 4 (Service Level Objectives)
Tiered Enforcement: Inspired by alerting thresholds (warn/critical)
Graded Failure Penalties for Bayesian Optimization¶
Problem¶
When all experiments fail with infinite scores (-inf or +inf), Bayesian optimization cannot distinguish between parameter configurations and degrades to random search.
Solution: Time-Based Failure Penalties¶
Failed experiments receive graded penalties based on failure timing - earlier failures get worse penalties.
Penalty Calculation¶
Located in src/web/workers/autotuner_worker.py:
def calculate_failure_penalty(started_at, failed_at, timeout_seconds,
                              experiment_status, error_message, objective_name):
    elapsed = (failed_at - started_at).total_seconds()
    completion_pct = min(elapsed / timeout_seconds, 1.0)

    # Base penalty by completion percentage
    if completion_pct < 0.20:
        base_penalty = -1000  # Very early (deployment, immediate crash)
    elif completion_pct < 0.60:
        base_penalty = -500   # Mid-stage (benchmark started but failed)
    elif completion_pct < 0.95:
        base_penalty = -200   # Late-stage (benchmark mostly done)
    else:
        base_penalty = -100   # Timeout (full duration)

    # Modifiers based on error type
    error = (error_message or "").lower()
    if "oom" in error or "memory" in error:
        base_penalty *= 1.5   # Resource failures
    if "deploy" in error:
        base_penalty *= 1.2   # Deployment failures
    if "connection" in error:
        base_penalty *= 0.8   # Transient issues

    # Invert for minimize objectives
    return -base_penalty if "minimize" in objective_name else base_penalty
Benefits¶
Provides gradient: Bayesian optimizer can distinguish parameter quality even when all fail
Prioritizes stability: Configs that run longer are preferred
Contextual penalties: Error types affect severity
Enables learning: Optimizer learns to avoid problematic parameter regions
Example Scenarios¶
| Failure Timing | Completion % | Base Penalty | Scenario |
|---|---|---|---|
| 10 seconds | 2% | -1000 | Deployment failure, config clearly broken |
| 200 seconds | 50% | -500 | Benchmark started but OOM |
| 450 seconds | 90% | -200 | Almost complete, near-miss |
| 500 seconds | 100% | -100 | Timeout, config might work with more time |
This allows the optimizer to progressively learn which parameters cause early vs late failures.
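The tier boundaries in the table can be sketched as a standalone function. This is a simplified version covering only the base-penalty step, without the error-type modifiers or the minimize/maximize inversion:

```python
def base_failure_penalty(elapsed_seconds, timeout_seconds):
    """Map failure timing to a graded base penalty (earlier = worse)."""
    completion_pct = min(elapsed_seconds / timeout_seconds, 1.0)
    if completion_pct < 0.20:
        return -1000   # very early failure
    elif completion_pct < 0.60:
        return -500    # mid-stage failure
    elif completion_pct < 0.95:
        return -200    # late-stage failure
    return -100        # ran to (or past) the timeout

# The four table rows, assuming a 500-second timeout
penalties = [base_failure_penalty(t, 500) for t in (10, 200, 450, 500)]
```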