SLO-Aware Objective Scoring

The autotuner now supports Service Level Objective (SLO)-aware scoring, with exponential penalties for violations and tiered enforcement (soft penalties vs. hard failures).

Overview

The SLO-aware scoring algorithm enhances experiment evaluation by:

  1. Exponential Penalty Curves: Creates steep score increases near SLO boundaries

  2. Tiered Enforcement: Distinguishes between minor violations (penalty) and severe violations (hard fail)

  3. Multi-Metric Support: Monitors P50/P90/P99 latency, TTFT (Time to First Token), and TPOT (Time Per Output Token)

  4. Configurable Per-Task: Each task defines its own SLO thresholds and weights

Mathematical Formula

Base Scoring Formula

final_score = base_objective_score × (1 + total_penalty)

Where total_penalty is the sum of all per-metric penalties.

Per-Metric Penalty Calculation

For each SLO metric that exceeds its threshold:

violation_ratio = (actual_value - threshold) / threshold  # Fractional overshoot (0.10 = 10% over)
penalty = weight × exp(violation_ratio / steepness)

Key Parameters:

  • weight: Penalty multiplier (higher weights = more important metrics)

  • steepness: Controls curve slope (lower = steeper penalties, default: 0.1)

Tiered Enforcement

  • Minor Violations (< fail_ratio): Exponential penalty applied to score

  • Severe Violations (≥ fail_ratio): Experiment marked as FAILED with score = ∞
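The penalty formula and tiered enforcement above can be sketched in Python. The function name and return shape follow the optimizer module described later in this document, but the metric key names (e.g. latency_p90) and the config traversal are illustrative assumptions:

```python
import math

def _iter_slo_metrics(slo_config):
    # Yield (metric_name, metric_config) pairs for latency percentiles, TTFT, and TPOT.
    for pct, cfg in slo_config.get("latency", {}).items():
        yield f"latency_{pct}", cfg
    for name in ("ttft", "tpot"):
        if name in slo_config:
            yield name, slo_config[name]

def calculate_slo_penalty(metrics, slo_config):
    """Return (penalty_multiplier, is_hard_failure, violation_details)."""
    steepness = slo_config.get("steepness", 0.1)
    total_penalty, hard_fail, details = 0.0, False, []
    for name, cfg in _iter_slo_metrics(slo_config):
        actual, threshold = metrics.get(name), cfg["threshold"]
        if actual is None or actual <= threshold:
            continue  # metric within SLO: no penalty
        ratio = (actual - threshold) / threshold
        if cfg.get("hard_fail", False) and ratio >= cfg.get("fail_ratio", 0.5):
            hard_fail = True  # severe violation: caller fails the experiment
        penalty = cfg.get("weight", 1.0) * math.exp(ratio / steepness)
        total_penalty += penalty
        details.append({"metric": name, "violation_ratio": ratio, "penalty": penalty})
    return 1.0 + total_penalty, hard_fail, details
```

Note how the optional fields fall back to the documented defaults (weight 1.0, hard_fail false, fail_ratio 0.5, steepness 0.1), so a minimal config with only thresholds works unchanged.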

Task Configuration

Add an optional slo section to your task JSON. All fields within the SLO configuration are optional; you can specify only the metrics you care about.

Full Example (All Options)

{
  "task_name": "my-slo-aware-task",
  "optimization": {
    "strategy": "grid_search",
    "objective": "minimize_latency"
  },
  "slo": {
    "latency": {
      "p50": {
        "threshold": 2.0,
        "weight": 1.0,
        "hard_fail": false
      },
      "p90": {
        "threshold": 5.0,
        "weight": 2.0,
        "hard_fail": true,
        "fail_ratio": 0.2
      },
      "p99": {
        "threshold": 10.0,
        "weight": 3.0,
        "hard_fail": true,
        "fail_ratio": 0.5
      }
    },
    "ttft": {
      "threshold": 1.0,
      "weight": 2.0,
      "hard_fail": false
    },
    "tpot": {
      "threshold": 0.05,
      "weight": 2.0,
      "hard_fail": false
    },
    "steepness": 0.1
  }
}

Minimal Example (Only Required Fields)

You can specify just the metrics you want to enforce. Here’s a minimal configuration with only P99 latency:

{
  "task_name": "my-minimal-slo-task",
  "optimization": {
    "strategy": "grid_search",
    "objective": "minimize_latency"
  },
  "slo": {
    "latency": {
      "p99": {
        "threshold": 10.0
      }
    }
  }
}

Configuration Parameters

Important: All SLO configuration fields are optional. You can:

  • Omit entire metric sections (e.g., no P50 if you only care about P99)

  • Omit individual metrics (e.g., only configure P90 and P99)

  • Omit optional parameters within metrics (weight, hard_fail, fail_ratio)

Per-Metric SLO

  • threshold (required if metric specified): Maximum allowed value (in seconds)

  • weight (optional, default: 1.0): Penalty weight for this metric

  • hard_fail (optional, default: false): Enable hard failure enforcement

  • fail_ratio (optional, default: 0.5): Violation threshold for hard fail (e.g., 0.2 = 20% over)

Global SLO

  • steepness (optional, default: 0.1): Exponential curve steepness parameter

Example Scenarios

Scenario 1: No SLO Violations

Metrics: P90 = 4.0s (threshold: 5.0s)

Result:

  • Penalty multiplier: 1.0

  • Final score: base_score × 1.0 (no penalty)

Scenario 2: Minor Violation (10% over)

Metrics: P90 = 5.5s (threshold: 5.0s, weight: 2.0, steepness: 0.1)

Calculation:

violation_ratio = (5.5 - 5.0) / 5.0 = 0.10 (10%)
penalty = 2.0 × exp(0.10 / 0.1) = 2.0 × exp(1.0) ≈ 5.44
penalty_multiplier = 1 + 5.44 = 6.44

Result:

  • Base score: 3.0s

  • Final score: 3.0 × 6.44 = 19.3s (worse score)

  • Status: SUCCESS but penalized
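The Scenario 2 arithmetic can be checked directly from the formulas above:

```python
import math

# Scenario 2 numbers: P90 = 5.5 s against a 5.0 s threshold,
# weight 2.0, steepness 0.1, base score 3.0 s.
violation_ratio = (5.5 - 5.0) / 5.0              # 0.10
penalty = 2.0 * math.exp(violation_ratio / 0.1)  # ≈ 5.44
final_score = 3.0 * (1 + penalty)                # ≈ 19.3
print(round(penalty, 2), round(final_score, 1))  # 5.44 19.3
```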

Scenario 3: Severe Violation (Hard Fail)

Metrics: P90 = 6.5s (threshold: 5.0s, fail_ratio: 0.2)

Calculation:

violation_ratio = (6.5 - 5.0) / 5.0 = 0.30 (30%)
30% > 20% fail_ratio → HARD FAILURE

Result:

  • Final score: (infinity)

  • Status: FAILED

  • Reason: “Hard SLO violation”

Scenario 4: Multiple Violations (Cumulative Penalties)

Metrics:

  • P50 = 2.3s (threshold: 2.0s, weight: 1.0) → +4.48 penalty

  • P90 = 5.5s (threshold: 5.0s, weight: 2.0) → +5.44 penalty

  • P99 = 11.0s (threshold: 10.0s, weight: 3.0) → +8.15 penalty

  • TTFT = 1.2s (threshold: 1.0s, weight: 2.0) → +14.78 penalty

Total Penalty: 32.85

Result:

  • Base score: 2.5s

  • Final score: 2.5 × 33.85 = 84.6s

  • Score increase: 3285% 🔥
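The cumulative total in Scenario 4 follows from summing the four per-metric penalties and applying the multiplier (1 + total) to the base score:

```python
import math

# Scenario 4 check: four violated metrics, steepness 0.1, base score 2.5 s.
violations = [  # (actual, threshold, weight)
    (2.3, 2.0, 1.0),    # P50
    (5.5, 5.0, 2.0),    # P90
    (11.0, 10.0, 3.0),  # P99
    (1.2, 1.0, 2.0),    # TTFT
]
steepness = 0.1
total = sum(w * math.exp(((a - t) / t) / steepness) for a, t, w in violations)
final_score = 2.5 * (1 + total)
print(round(total, 2), round(final_score, 1))  # 32.85 84.6
```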

Steepness Parameter Impact

The steepness parameter controls how aggressively penalties grow:

Steepness | Penalty Multiplier at 20% Violation (weight 2.0) | Behavior
--------- | ------------------------------------------------ | --------
0.05 | 110.2x | Very steep (aggressive)
0.1 | 15.8x | Recommended default
0.2 | 6.4x | Gentler curve

Lower steepness = Steeper penalties near boundaries ⚠️
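These multipliers follow directly from the penalty formula with weight 2.0:

```python
import math

# Penalty multiplier at a 20% violation for each steepness value:
# multiplier = 1 + weight * exp(violation_ratio / steepness)
multipliers = {s: 1 + 2.0 * math.exp(0.20 / s) for s in (0.05, 0.1, 0.2)}
for s, m in multipliers.items():
    print(f"steepness={s}: {m:.1f}x")
```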

Frontend Features

Task Creation UI

Navigate to Create New Task → Enable SLO Configuration toggle:

  • Configure P50/P90/P99 latency thresholds

  • Configure TTFT (Time to First Token) thresholds

  • Configure TPOT (Time Per Output Token) thresholds

  • Set penalty weights per metric

  • Enable hard fail enforcement with fail_ratio

  • Adjust steepness parameter

Experiments View

Experiments violating hard SLO constraints display:

  • Red “SLO” badge next to status

  • slo_violation: true flag in experiment data

  • Status automatically marked as FAILED

Backend Implementation

Optimizer Module (src/utils/optimizer.py)

New Functions:

  1. calculate_slo_penalty(metrics, slo_config)

    • Returns: (penalty_multiplier, is_hard_failure, violation_details)

    • Implements exponential penalty formula

    • Checks hard failure conditions

  2. calculate_objective_score(results, objective, slo_config)

    • Enhanced to accept optional slo_config

    • Applies SLO penalties to base score

    • Returns inf for hard failures

Orchestrator (src/orchestrator.py)

  • Passes task.get("slo") to scoring function

  • Marks experiments as FAILED when score == inf

  • Adds slo_violation: true flag to experiment results
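A minimal sketch of the orchestrator-side handling described above. The FAILED status, the slo_violation flag, and the "Hard SLO violation" reason come from this document; the exact field names on the experiment record are illustrative assumptions:

```python
import math

def finalize_experiment(experiment, score):
    # An infinite score signals a hard SLO violation, so the experiment
    # is marked FAILED and tagged for the experiments view.
    experiment["score"] = score
    if math.isinf(score):
        experiment["status"] = "FAILED"
        experiment["slo_violation"] = True
        experiment["failure_reason"] = "Hard SLO violation"
    else:
        experiment["status"] = "SUCCESS"
    return experiment
```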

Testing

Run the test suite to verify algorithm behavior:

python test_slo_algorithm.py

Test Coverage:

  • ✓ No violations (baseline)

  • ✓ Minor violations (soft penalties)

  • ✓ Severe violations (exponential growth)

  • ✓ Hard failure boundary conditions

  • ✓ Multiple cumulative violations

  • ✓ Steepness parameter effects

  • ✓ TPOT SLO enforcement (test_tpot_slo.py)

  • ✓ Optional field handling (test_slo_optional_fields.py)

Example Task

See examples/docker_task_with_slo.json for a complete example with SLO configuration.

Use Cases

1. Production-Like Constraints

Ensure tuned configurations meet real-world SLOs:

"slo": {
  "latency": {
    "p99": {"threshold": 10.0, "hard_fail": true, "fail_ratio": 0.2}
  }
}

2. Multi-Objective Optimization

Balance latency, TTFT, and TPOT:

"slo": {
  "latency": {
    "p90": {"threshold": 5.0, "weight": 1.0}
  },
  "ttft": {"threshold": 1.0, "weight": 3.0},  // Higher weight = more important
  "tpot": {"threshold": 0.05, "weight": 2.0}
}

3. Soft Boundaries for Exploration

Penalize but don’t reject near-boundary configurations:

"slo": {
  "latency": {
    "p90": {"threshold": 5.0, "weight": 2.0, "hard_fail": false}
  },
  "steepness": 0.15  // Gentler curve for exploration
}

Design Rationale

Why Exponential Penalties?

Linear penalties don’t adequately penalize configurations near SLO boundaries:

Violation | Linear (2x weight) | Exponential (weight=2, s=0.1)
--------- | ------------------ | -----------------------------
5% over | 1.10x | 4.30x
10% over | 1.20x | 6.44x
20% over | 1.40x | 15.78x
50% over | 2.00x | 297.83x

Exponential curves create steep gradients that guide optimization away from SLO boundaries.

Why Tiered Enforcement?

  • Soft Penalties: Allow exploration of configurations slightly over SLO

  • Hard Failures: Reject configurations that egregiously violate critical SLOs

This mirrors real-world SLO design where some violations are tolerable (warn) and others are not (page).

Backward Compatibility

Tasks without slo configuration continue to work unchanged. SLO scoring is fully optional and backward compatible.

Future Enhancements

  • Support for throughput SLOs (minimum thresholds)

  • Custom penalty functions (polynomial, piecewise)

  • SLO violation budgets (allow N% of experiments to violate)

  • SLO-aware Bayesian optimization (constrained BO)

References

  • Exponential Penalty Functions: Common in constrained optimization

  • SLO Design: Google SRE Book - Chapter 4 (Service Level Objectives)

  • Tiered Enforcement: Inspired by alerting thresholds (warn/critical)


Graded Failure Penalties for Bayesian Optimization

Problem

When all experiments fail with infinite scores (-inf or +inf), Bayesian optimization cannot distinguish between parameter configurations and degrades to random search.

Solution: Time-Based Failure Penalties

Failed experiments receive graded penalties based on when they fail: the earlier the failure, the harsher the penalty.

Penalty Calculation

Located in src/web/workers/autotuner_worker.py:

def calculate_failure_penalty(started_at, failed_at, timeout_seconds,
                              experiment_status, error_message, objective_name):
    elapsed = (failed_at - started_at).total_seconds()
    completion_pct = min(elapsed / timeout_seconds, 1.0)

    # Base penalty by completion percentage
    if completion_pct < 0.20:
        base_penalty = -1000  # Very early (deployment, immediate crash)
    elif completion_pct < 0.60:
        base_penalty = -500   # Mid-stage (benchmark started but failed)
    elif completion_pct < 0.95:
        base_penalty = -200   # Late-stage (benchmark mostly done)
    else:
        base_penalty = -100   # Timeout (full duration)

    # Modifiers based on error type
    error = (error_message or "").lower()
    if "oom" in error or "memory" in error:
        base_penalty *= 1.5   # Resource failures: harsher
    if "deploy" in error:
        base_penalty *= 1.2   # Deployment failures: harsher
    if "connection" in error:
        base_penalty *= 0.8   # Transient issues: milder

    # Invert sign for minimize objectives (a large positive score is bad)
    return -base_penalty if "minimize" in objective_name else base_penalty
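The completion-percentage bands can be exercised on their own. This standalone sketch re-implements just the base-penalty tiers; the 500-second timeout is an assumption chosen to match the scenario table below:

```python
def base_failure_penalty(elapsed_seconds, timeout_seconds=500):
    # Base penalty by completion percentage, mirroring the bands above.
    pct = min(elapsed_seconds / timeout_seconds, 1.0)
    if pct < 0.20:
        return -1000  # very early failure
    elif pct < 0.60:
        return -500   # mid-stage failure
    elif pct < 0.95:
        return -200   # late-stage failure
    return -100       # ran to timeout

for secs in (10, 200, 450, 500):
    print(secs, base_failure_penalty(secs))
```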

Benefits

  1. Provides gradient: Bayesian optimizer can distinguish parameter quality even when all fail

  2. Prioritizes stability: Configs that run longer are preferred

  3. Contextual penalties: Error types affect severity

  4. Enables learning: Optimizer learns to avoid problematic parameter regions

Example Scenarios

Failure Timing | Completion % | Base Penalty | Scenario
-------------- | ------------ | ------------ | --------
10 seconds | 2% | -1000 | Deployment failure, config clearly broken
200 seconds | 40% | -500 | Benchmark started but OOM
450 seconds | 90% | -200 | Almost complete, near-miss
500 seconds | 100% | -100 | Timeout, config might work with more time

(Completion percentages assume a 500-second timeout.)

This allows the optimizer to progressively learn which parameters cause early vs late failures.