SLO-Aware Objective Scoring

The autotuner now supports Service Level Objective (SLO)-aware scoring, with exponential penalties for violations and tiered enforcement (soft penalties vs. hard failures).

Overview

The SLO-aware scoring algorithm enhances experiment evaluation by:

  1. Exponential Penalty Curves: Creates steep score increases near SLO boundaries

  2. Tiered Enforcement: Distinguishes between minor violations (penalty) and severe violations (hard fail)

  3. Multi-Metric Support: Monitors P50/P90/P99 latency, TTFT (Time to First Token), and TPOT (Time Per Output Token)

  4. Configurable Per-Task: Each task defines its own SLO thresholds and weights

Mathematical Formula

Base Scoring Formula

final_score = base_objective_score × (1 + total_penalty)

Where total_penalty is the sum of all per-metric penalties.

Per-Metric Penalty Calculation

For each SLO metric that exceeds its threshold:

violation_ratio = (actual_value - threshold) / threshold  # Fractional overshoot (0.10 = 10% over)
penalty = weight × exp(violation_ratio / steepness)

Key Parameters:

  • weight: Penalty multiplier (higher weights = more important metrics)

  • steepness: Controls curve slope (lower = steeper penalties, default: 0.1)

Tiered Enforcement

  • Minor Violations (< fail_ratio): Exponential penalty applied to score

  • Severe Violations (≥ fail_ratio): Experiment marked as FAILED with score = ∞
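The penalty formula and tiered enforcement above can be sketched in Python. The function name and return shape follow the optimizer module described later in this document, but the metric key names (e.g. latency_p90) and the config traversal are illustrative assumptions:

```python
import math

def _iter_slo_metrics(slo_config):
    # Yield (metric_name, metric_config) pairs for latency percentiles, TTFT, and TPOT.
    for pct, cfg in slo_config.get("latency", {}).items():
        yield f"latency_{pct}", cfg
    for name in ("ttft", "tpot"):
        if name in slo_config:
            yield name, slo_config[name]

def calculate_slo_penalty(metrics, slo_config):
    """Return (penalty_multiplier, is_hard_failure, violation_details)."""
    steepness = slo_config.get("steepness", 0.1)
    total_penalty, hard_fail, details = 0.0, False, []
    for name, cfg in _iter_slo_metrics(slo_config):
        actual, threshold = metrics.get(name), cfg["threshold"]
        if actual is None or actual <= threshold:
            continue  # metric within SLO: no penalty
        ratio = (actual - threshold) / threshold
        if cfg.get("hard_fail", False) and ratio >= cfg.get("fail_ratio", 0.5):
            hard_fail = True  # severe violation: caller fails the experiment
        penalty = cfg.get("weight", 1.0) * math.exp(ratio / steepness)
        total_penalty += penalty
        details.append({"metric": name, "violation_ratio": ratio, "penalty": penalty})
    return 1.0 + total_penalty, hard_fail, details
```

Note how the optional fields fall back to the documented defaults (weight 1.0, hard_fail false, fail_ratio 0.5, steepness 0.1), so a minimal config with only thresholds works unchanged.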

Task Configuration

Add an optional slo section to your task JSON. All fields within the SLO configuration are optional; you can specify only the metrics you care about.

Full Example (All Options)

{
  "task_name": "my-slo-aware-task",
  "optimization": {
    "strategy": "grid_search",
    "objective": "minimize_latency"
  },
  "slo": {
    "latency": {
      "p50": {
        "threshold": 2.0,
        "weight": 1.0,
        "hard_fail": false
      },
      "p90": {
        "threshold": 5.0,
        "weight": 2.0,
        "hard_fail": true,
        "fail_ratio": 0.2
      },
      "p99": {
        "threshold": 10.0,
        "weight": 3.0,
        "hard_fail": true,
        "fail_ratio": 0.5
      }
    },
    "ttft": {
      "threshold": 1.0,
      "weight": 2.0,
      "hard_fail": false
    },
    "tpot": {
      "threshold": 0.05,
      "weight": 2.0,
      "hard_fail": false
    },
    "steepness": 0.1
  }
}

Minimal Example (Only Required Fields)

You can specify just the metrics you want to enforce. Here’s a minimal configuration with only P99 latency:

{
  "task_name": "my-minimal-slo-task",
  "optimization": {
    "strategy": "grid_search",
    "objective": "minimize_latency"
  },
  "slo": {
    "latency": {
      "p99": {
        "threshold": 10.0
      }
    }
  }
}

Configuration Parameters

Important: All SLO configuration fields are optional. You can:

  • Omit entire metric sections (e.g., no P50 if you only care about P99)

  • Omit individual metrics (e.g., only configure P90 and P99)

  • Omit optional parameters within metrics (weight, hard_fail, fail_ratio)

Per-Metric SLO

  • threshold (required if metric specified): Maximum allowed value (in seconds)

  • weight (optional, default: 1.0): Penalty weight for this metric

  • hard_fail (optional, default: false): Enable hard failure enforcement

  • fail_ratio (optional, default: 0.5): Violation threshold for hard fail (e.g., 0.2 = 20% over)

Global SLO

  • steepness (optional, default: 0.1): Exponential curve steepness parameter

Example Scenarios

Scenario 1: No SLO Violations

Metrics: P90 = 4.0s (threshold: 5.0s)

Result:

  • Penalty multiplier: 1.0

  • Final score: base_score × 1.0 (no penalty)

Scenario 2: Minor Violation (10% over)

Metrics: P90 = 5.5s (threshold: 5.0s, weight: 2.0, steepness: 0.1)

Calculation:

violation_ratio = (5.5 - 5.0) / 5.0 = 0.10 (10%)
penalty = 2.0 × exp(0.10 / 0.1) = 2.0 × exp(1.0) ≈ 5.44
penalty_multiplier = 1 + 5.44 = 6.44

Result:

  • Base score: 3.0s

  • Final score: 3.0 × 6.44 = 19.3s (worse score)

  • Status: SUCCESS but penalized
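The Scenario 2 arithmetic can be checked directly from the formulas above:

```python
import math

# Scenario 2 numbers: P90 = 5.5 s against a 5.0 s threshold,
# weight 2.0, steepness 0.1, base score 3.0 s.
violation_ratio = (5.5 - 5.0) / 5.0              # 0.10
penalty = 2.0 * math.exp(violation_ratio / 0.1)  # ≈ 5.44
final_score = 3.0 * (1 + penalty)                # ≈ 19.3
print(round(penalty, 2), round(final_score, 1))  # 5.44 19.3
```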

Scenario 3: Severe Violation (Hard Fail)

Metrics: P90 = 6.5s (threshold: 5.0s, fail_ratio: 0.2)

Calculation:

violation_ratio = (6.5 - 5.0) / 5.0 = 0.30 (30%)
30% > 20% fail_ratio → HARD FAILURE

Result:

  • Final score: (infinity)

  • Status: FAILED

  • Reason: “Hard SLO violation”

Scenario 4: Multiple Violations (Cumulative Penalties)

Metrics:

  • P50 = 2.3s (threshold: 2.0s, weight: 1.0) → +4.48 penalty

  • P90 = 5.5s (threshold: 5.0s, weight: 2.0) → +5.44 penalty

  • P99 = 11.0s (threshold: 10.0s, weight: 3.0) → +8.15 penalty

  • TTFT = 1.2s (threshold: 1.0s, weight: 2.0) → +14.78 penalty

Total Penalty: 32.85

Result:

  • Base score: 2.5s

  • Final score: 2.5 × 33.85 = 84.6s

  • Score increase: 3285% 🔥
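The cumulative total in Scenario 4 follows from summing the four per-metric penalties and applying the multiplier (1 + total) to the base score:

```python
import math

# Scenario 4 check: four violated metrics, steepness 0.1, base score 2.5 s.
violations = [  # (actual, threshold, weight)
    (2.3, 2.0, 1.0),    # P50
    (5.5, 5.0, 2.0),    # P90
    (11.0, 10.0, 3.0),  # P99
    (1.2, 1.0, 2.0),    # TTFT
]
steepness = 0.1
total = sum(w * math.exp(((a - t) / t) / steepness) for a, t, w in violations)
final_score = 2.5 * (1 + total)
print(round(total, 2), round(final_score, 1))  # 32.85 84.6
```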

Steepness Parameter Impact

The steepness parameter controls how aggressively penalties grow:

Steepness | Penalty Multiplier at 20% Violation (weight 2.0) | Behavior
--------- | ------------------------------------------------ | --------
0.05 | 110.2x | Very steep (aggressive)
0.1 | 15.8x | Recommended default
0.2 | 6.4x | Gentler curve

Lower steepness = Steeper penalties near boundaries ⚠️
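These multipliers follow directly from the penalty formula with weight 2.0:

```python
import math

# Penalty multiplier at a 20% violation for each steepness value:
# multiplier = 1 + weight * exp(violation_ratio / steepness)
multipliers = {s: 1 + 2.0 * math.exp(0.20 / s) for s in (0.05, 0.1, 0.2)}
for s, m in multipliers.items():
    print(f"steepness={s}: {m:.1f}x")
```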

Frontend Features

Task Creation UI

Navigate to Create New Task → Enable SLO Configuration toggle:

  • Configure P50/P90/P99 latency thresholds

  • Configure TTFT (Time to First Token) thresholds

  • Configure TPOT (Time Per Output Token) thresholds

  • Set penalty weights per metric

  • Enable hard fail enforcement with fail_ratio

  • Adjust steepness parameter

Experiments View

Experiments violating hard SLO constraints display:

  • Red “SLO” badge next to status

  • slo_violation: true flag in experiment data

  • Status automatically marked as FAILED

Backend Implementation

Optimizer Module (src/utils/optimizer.py)

New Functions:

  1. calculate_slo_penalty(metrics, slo_config)

    • Returns: (penalty_multiplier, is_hard_failure, violation_details)

    • Implements exponential penalty formula

    • Checks hard failure conditions

  2. calculate_objective_score(results, objective, slo_config)

    • Enhanced to accept optional slo_config

    • Applies SLO penalties to base score

    • Returns inf for hard failures

Orchestrator (src/orchestrator.py)

  • Passes task.get("slo") to scoring function

  • Marks experiments as FAILED when score == inf

  • Adds slo_violation: true flag to experiment results
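A minimal sketch of the orchestrator-side handling described above. The FAILED status, the slo_violation flag, and the "Hard SLO violation" reason come from this document; the exact field names on the experiment record are illustrative assumptions:

```python
import math

def finalize_experiment(experiment, score):
    # An infinite score signals a hard SLO violation, so the experiment
    # is marked FAILED and tagged for the experiments view.
    experiment["score"] = score
    if math.isinf(score):
        experiment["status"] = "FAILED"
        experiment["slo_violation"] = True
        experiment["failure_reason"] = "Hard SLO violation"
    else:
        experiment["status"] = "SUCCESS"
    return experiment
```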

Testing

Run the test suite to verify algorithm behavior:

python test_slo_algorithm.py

Test Coverage:

  • ✓ No violations (baseline)

  • ✓ Minor violations (soft penalties)

  • ✓ Severe violations (exponential growth)

  • ✓ Hard failure boundary conditions

  • ✓ Multiple cumulative violations

  • ✓ Steepness parameter effects

  • ✓ TPOT SLO enforcement (test_tpot_slo.py)

  • ✓ Optional field handling (test_slo_optional_fields.py)

Example Task

See examples/docker_task_with_slo.json for a complete example with SLO configuration.

Use Cases

1. Production-Like Constraints

Ensure tuned configurations meet real-world SLOs:

"slo": {
  "latency": {
    "p99": {"threshold": 10.0, "hard_fail": true, "fail_ratio": 0.2}
  }
}

2. Multi-Objective Optimization

Balance latency, TTFT, and TPOT:

"slo": {
  "latency": {
    "p90": {"threshold": 5.0, "weight": 1.0}
  },
  "ttft": {"threshold": 1.0, "weight": 3.0},  // Higher weight = more important
  "tpot": {"threshold": 0.05, "weight": 2.0}
}

3. Soft Boundaries for Exploration

Penalize but don’t reject near-boundary configurations:

"slo": {
  "latency": {
    "p90": {"threshold": 5.0, "weight": 2.0, "hard_fail": false}
  },
  "steepness": 0.15  // Gentler curve for exploration
}

Design Rationale

Why Exponential Penalties?

Linear penalties don’t adequately penalize configurations near SLO boundaries:

Violation | Linear (2x weight) | Exponential (weight=2, s=0.1)
--------- | ------------------ | -----------------------------
5% over | 1.10x | 4.30x
10% over | 1.20x | 6.44x
20% over | 1.40x | 15.78x
50% over | 2.00x | 297.83x

Exponential curves create steep gradients that guide optimization away from SLO boundaries.

Why Tiered Enforcement?

  • Soft Penalties: Allow exploration of configurations slightly over SLO

  • Hard Failures: Reject configurations that egregiously violate critical SLOs

This mirrors real-world SLO design where some violations are tolerable (warn) and others are not (page).

Backward Compatibility

Tasks without slo configuration continue to work unchanged. SLO scoring is fully optional and backward compatible.

Future Enhancements

  • Support for throughput SLOs (minimum thresholds)

  • Custom penalty functions (polynomial, piecewise)

  • SLO violation budgets (allow N% of experiments to violate)

  • SLO-aware Bayesian optimization (constrained BO)

References

  • Exponential Penalty Functions: Common in constrained optimization

  • SLO Design: Google SRE Book - Chapter 4 (Service Level Objectives)

  • Tiered Enforcement: Inspired by alerting thresholds (warn/critical)


Graded Failure Penalties for Bayesian Optimization

Problem

When all experiments fail with infinite scores (-inf or +inf), Bayesian optimization cannot distinguish between parameter configurations and degrades to random search.

Solution: Time-Based Failure Penalties

Failed experiments receive graded penalties based on when they fail: the earlier the failure, the harsher the penalty.

Penalty Calculation

Located in src/web/workers/autotuner_worker.py:

def calculate_failure_penalty(started_at, failed_at, timeout_seconds,
                              experiment_status, error_message, objective_name):
    elapsed = (failed_at - started_at).total_seconds()
    completion_pct = min(elapsed / timeout_seconds, 1.0)

    # Base penalty by completion percentage
    if completion_pct < 0.20:
        base_penalty = -1000  # Very early (deployment, immediate crash)
    elif completion_pct < 0.60:
        base_penalty = -500   # Mid-stage (benchmark started but failed)
    elif completion_pct < 0.95:
        base_penalty = -200   # Late-stage (benchmark mostly done)
    else:
        base_penalty = -100   # Timeout (full duration)

    # Modifiers based on error type
    error = (error_message or "").lower()
    if "oom" in error or "memory" in error:
        base_penalty *= 1.5   # Resource failures: harsher
    if "deploy" in error:
        base_penalty *= 1.2   # Deployment failures: harsher
    if "connection" in error:
        base_penalty *= 0.8   # Transient issues: milder

    # Invert sign for minimize objectives (a large positive score is bad)
    return -base_penalty if "minimize" in objective_name else base_penalty
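The completion-percentage bands can be exercised on their own. This standalone sketch re-implements just the base-penalty tiers; the 500-second timeout is an assumption chosen to match the scenario table below:

```python
def base_failure_penalty(elapsed_seconds, timeout_seconds=500):
    # Base penalty by completion percentage, mirroring the bands above.
    pct = min(elapsed_seconds / timeout_seconds, 1.0)
    if pct < 0.20:
        return -1000  # very early failure
    elif pct < 0.60:
        return -500   # mid-stage failure
    elif pct < 0.95:
        return -200   # late-stage failure
    return -100       # ran to timeout

for secs in (10, 200, 450, 500):
    print(secs, base_failure_penalty(secs))
```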

Benefits

  1. Provides gradient: Bayesian optimizer can distinguish parameter quality even when all fail

  2. Prioritizes stability: Configs that run longer are preferred

  3. Contextual penalties: Error types affect severity

  4. Enables learning: Optimizer learns to avoid problematic parameter regions

Example Scenarios

Failure Timing | Completion % | Base Penalty | Scenario
-------------- | ------------ | ------------ | --------
10 seconds | 2% | -1000 | Deployment failure, config clearly broken
200 seconds | 40% | -500 | Benchmark started but OOM
450 seconds | 90% | -200 | Almost complete, near-miss
500 seconds | 100% | -100 | Timeout, config might work with more time

(Completion percentages assume a 500-second timeout.)

This allows the optimizer to progressively learn which parameters cause early vs late failures.