# agent-almanac: define-slo-sli-sla

Clone the repository:

```bash
git clone https://github.com/pjt222/agent-almanac
```

Or install just this skill:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/define-slo-sli-sla" ~/.claude/skills/pjt222-agent-almanac-define-slo-sli-sla-fb5851 && rm -rf "$T"
```

Source: `skills/define-slo-sli-sla/SKILL.md`

# Define SLO/SLI/SLA
Establish measurable reliability targets with Service Level Objectives, track them with indicators, and manage error budgets.
## When to Use
- Defining reliability targets for customer-facing services or APIs
- Establishing clear expectations between service providers and consumers
- Balancing feature velocity with system reliability through error budgets
- Creating objective criteria for incident severity and response
- Migrating from arbitrary uptime goals to data-driven reliability metrics
- Implementing Site Reliability Engineering (SRE) practices
- Measuring and improving service quality over time
## Inputs
- Required: Service description and critical user journeys
- Required: Historical metrics data (request rates, latencies, error rates)
- Optional: Existing SLA commitments to customers
- Optional: Business requirements for service availability and performance
- Optional: Incident history and customer impact data
## Procedure
See Extended Examples for complete configuration files and templates.
### Step 1: Understand the SLI, SLO, and SLA Hierarchy
Learn the relationship and differences between these three concepts.
Definitions:
**SLI (Service Level Indicator)**
- **What**: A quantitative measure of service behavior
- **Example**: Request success rate, request latency, system throughput
- **Measurement**: `successful_requests / total_requests * 100`

**SLO (Service Level Objective)**
- **What**: Target value or range for an SLI over a time window
- **Example**: 99.9% of requests succeed in a 30-day window
- **Purpose**: Internal reliability target to guide operations

**SLA (Service Level Agreement)**
- **What**: Contractual commitment with consequences for missing the SLO
- **Example**: 99.9% uptime SLA with refunds if breached
- **Purpose**: External promise to customers with penalties
Hierarchy:
```
SLA (99.9% uptime, customer refunds)
├─ SLO (99.95% success rate, internal target)
│  └─ SLI (actual measured: 99.97% success rate)
└─ Error Budget (0.05% failures allowed per month)
```
Key principle: SLO should be stricter than SLA to provide buffer before customer impact.
Example:
- SLA: 99.9% availability (customer promise)
- SLO: 99.95% availability (internal target)
- Buffer: 0.05% cushion before SLA breach
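The same numbers as a quick sanity check, in a minimal Python sketch (assuming a 30-day window):

```python
# A minimal sketch of the SLA/SLO buffer arithmetic above.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

sla = 0.999    # 99.9% customer promise
slo = 0.9995   # 99.95% internal target

sla_downtime = (1 - sla) * WINDOW_MINUTES  # minutes allowed before SLA breach
slo_downtime = (1 - slo) * WINDOW_MINUTES  # minutes allowed before SLO breach
buffer = sla_downtime - slo_downtime       # cushion the stricter SLO buys you

print(f"SLA allows {sla_downtime:.1f} min/month of downtime")
print(f"SLO allows {slo_downtime:.1f} min/month of downtime")
print(f"Buffer before customer impact: {buffer:.1f} min/month")
```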
Expected: Team understands differences, agreement on which metrics become SLIs, alignment on SLO targets.
On failure:
- Review Google SRE book chapters on SLI/SLO/SLA
- Conduct workshop with stakeholders to align on definitions
- Start with simple success-rate SLI before complex latency SLOs
### Step 2: Select Appropriate SLIs
Choose SLIs that reflect user experience and business impact.
The Four Golden Signals (Google SRE):
- **Latency**: Time to serve a request

  ```promql
  # P95 latency
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
  )
  ```

- **Traffic**: Demand on the system

  ```promql
  # Requests per second
  sum(rate(http_requests_total[5m]))
  ```

- **Errors**: Rate of failed requests

  ```promql
  # Error rate percentage
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
  ```

- **Saturation**: How "full" the system is

  ```promql
  # CPU saturation
  avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  ```
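These queries run anywhere PromQL does; as a hedged illustration, the error-rate signal can also be evaluated programmatically through the Prometheus HTTP API. The endpoint and metric names below are assumptions carried over from the examples above (a local Prometheus at `localhost:9090` scraping `http_requests_total`):

```python
# Hedged sketch: evaluate the error-rate SLI via the Prometheus HTTP API.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed local Prometheus
QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m])) '
    "/ sum(rate(http_requests_total[5m])) * 100"
)

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

if result:
    # Instant vector: each entry is {"metric": {...}, "value": [ts, "val"]}
    error_rate = float(result[0]["value"][1])
    print(f"Current error rate: {error_rate:.3f}%")
else:
    print("Query returned no data (no traffic in the window?)")
```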
Common SLI patterns:
```yaml
# Availability SLI
availability:
  description: "Percentage of successful requests"
  query: |
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
  good_threshold: 0.999  # 99.9%

# Latency SLI
latency:
  description: "P99 request latency under 500ms"
  query: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    ) < 0.5
  good_threshold: 0.95  # 95% of windows meet target

# Throughput SLI
throughput:
  description: "Requests processed per second"
  query: |
    sum(rate(http_requests_total[5m]))
  good_threshold: 1000  # Minimum 1000 req/s

# Data freshness SLI (for batch jobs)
freshness:
  description: "Data updated within last hour"
  query: |
    (time() - max(data_last_updated_timestamp)) < 3600
  good_threshold: 1  # Always fresh
```
SLI selection criteria:
- User-visible: Reflects actual user experience
- Measurable: Can be quantified from existing metrics
- Actionable: Team can improve it through engineering work
- Meaningful: Correlates with customer satisfaction
- Simple: Easy to understand and explain
Avoid:
- Internal system metrics not visible to users (CPU, memory)
- Vanity metrics that don't predict customer impact
- Overly complex composite scores
Expected: 2-4 SLIs selected per service, covering availability and latency at minimum, team agreement on measurement queries.
On failure:
- Map user journey to identify critical failure points
- Analyze incident history: which metrics predicted customer impact?
- Validate SLI with A/B test: degrade metric, measure customer complaints
- Start with simple availability SLI, add complexity iteratively
### Step 3: Set SLO Targets and Time Windows
Define realistic and achievable reliability targets.
SLO specification format:
```yaml
service: user-api
slos:
  - name: availability
    objective: 99.9
    description: |
      99.9% of requests return non-5xx status codes
# ... (see EXAMPLES.md for complete configuration)
```
Time window selection:
Common windows:
- 30 days (monthly): Typical for external SLAs
- 7 days (weekly): Faster feedback for engineering teams
- 1 day (daily): High-frequency services requiring rapid response
Example 30-day window error budget:
```
SLO: 99.9% availability over 30 days
Allowed failures: 0.1%
Total requests per month: 100M
Error budget: 100,000 failed requests
Daily budget: ~3,333 failed requests
```
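The arithmetic in this example, as a minimal sketch:

```python
# A minimal sketch of the error-budget arithmetic above.
slo = 0.999                     # 99.9% availability SLO
monthly_requests = 100_000_000  # 100M requests per 30-day window

error_budget = (1 - slo) * monthly_requests  # failures allowed per window
daily_budget = error_budget / 30             # failures allowed per day

print(f"Error budget: {error_budget:,.0f} failed requests/month")
print(f"Daily budget: ~{daily_budget:,.0f} failed requests/day")
```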
Setting realistic targets:
1. **Baseline current performance:**

   ```promql
   # Check actual availability over past 90 days
   avg_over_time(
     (sum(rate(http_requests_total{status!~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])))[90d:5m]
   )
   # Result: 99.95% → set SLO at 99.9% (safer than current)
   ```

2. **Calculate the cost of nines** (reproduced in the sketch after this list):

   ```
   99%     → 7.2 hours downtime/month (low reliability)
   99.9%   → 43 minutes downtime/month (good)
   99.95%  → 22 minutes downtime/month (very good)
   99.99%  → 4.3 minutes downtime/month (expensive)
   99.999% → 26 seconds downtime/month (very expensive)
   ```

3. **Balance user happiness against engineering cost:**
   - Too strict: expensive, slows feature development
   - Too loose: poor user experience, customer churn
   - Sweet spot: slightly better than user expectations
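The cost-of-nines table in item 2 falls out of a single formula, downtime = (1 − availability) × window; a minimal sketch:

```python
# A minimal sketch reproducing the cost-of-nines table from item 2.
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window

for nines in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    downtime_min = (1 - nines) * WINDOW_MINUTES
    if downtime_min >= 60:
        print(f"{nines:.3%} -> {downtime_min / 60:.1f} hours downtime/month")
    elif downtime_min >= 1:
        print(f"{nines:.3%} -> {downtime_min:.1f} minutes downtime/month")
    else:
        print(f"{nines:.3%} -> {downtime_min * 60:.0f} seconds downtime/month")
```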
Expected: SLO targets set with business stakeholder buy-in, documented with rationale, error budget calculated.
On failure:
- Start with achievable target (e.g., 99% if current is 98.5%)
- Iterate SLO targets quarterly based on actual performance
- Get executive sponsorship for realistic targets vs "five nines" demands
- Document cost-benefit analysis for each additional nine
### Step 4: Implement SLO Monitoring with Sloth
Use Sloth to generate Prometheus recording rules and alerts from SLO specs.
Install Sloth:
```bash
# Binary installation
wget https://github.com/slok/sloth/releases/download/v0.11.0/sloth-linux-amd64
chmod +x sloth-linux-amd64
sudo mv sloth-linux-amd64 /usr/local/bin/sloth

# Or Docker
docker pull ghcr.io/slok/sloth:latest
```
Create the Sloth SLO specification (`slos/user-api.yml`):
version: "prometheus/v1" service: "user-api" labels: team: "platform" tier: "1" slos: # ... (see EXAMPLES.md for complete configuration)
Generate Prometheus rules:
```bash
# Generate recording and alerting rules
sloth generate -i slos/user-api.yml -o prometheus/rules/user-api-slo.yml

# Validate generated rules
promtool check rules prometheus/rules/user-api-slo.yml
```
Generated recording rules (excerpt):
```yaml
groups:
  - name: sloth-slo-sli-recordings-user-api-requests-availability
    interval: 30s
    rules:
      # SLI: Ratio of good events
      - record: slo:sli_error:ratio_rate5m
      # ... (see EXAMPLES.md for complete configuration)
```
Generated alerts:
```yaml
groups:
  - name: sloth-slo-alerts-user-api-requests-availability
    rules:
      # Fast burn: 2% budget consumed in 1 hour
      - alert: UserAPIHighErrorRate
        expr: |
          # ... (see EXAMPLES.md for complete configuration)
```
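The "fast burn" comment above follows from standard multi-burn-rate math: consuming 2% of a 30-day budget in one hour implies a burn rate of 14.4×. A minimal sketch of that calculation, assuming the 99.9% SLO from earlier:

```python
# Burn-rate math behind the fast-burn alert: the alert threshold is
# error_rate > burn_rate * (1 - SLO).
slo = 0.999
window_hours = 30 * 24   # 720 hours in the 30-day SLO window
budget_consumed = 0.02   # 2% of the budget
alert_window_hours = 1

burn_rate = budget_consumed * window_hours / alert_window_hours  # 14.4
error_rate_threshold = burn_rate * (1 - slo)                     # 1.44%

print(f"Burn rate threshold: {burn_rate:.1f}x")
print(f"Alert when error rate > {error_rate_threshold:.2%}")
```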
Load rules into Prometheus:
```yaml
# prometheus.yml
rule_files:
  - "rules/user-api-slo.yml"
```
Reload Prometheus:
```bash
curl -X POST http://localhost:9090/-/reload
```
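To confirm the reload actually picked up the generated groups, a hedged sketch against the Prometheus rules API (assuming Prometheus at `localhost:9090` and the `sloth-slo-*` group names shown above):

```python
# Hedged sketch: list Sloth-generated rule groups via the rules API.
import requests

resp = requests.get("http://localhost:9090/api/v1/rules", timeout=10)
resp.raise_for_status()
groups = resp.json()["data"]["groups"]

slo_groups = [g["name"] for g in groups if g["name"].startswith("sloth-slo")]
if slo_groups:
    print("Loaded SLO rule groups:")
    for name in slo_groups:
        print(f"  {name}")
else:
    print("No sloth-slo-* rule groups found; check rule_files and reload")
```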
Expected: Sloth generates multi-window multi-burn-rate alerts, recording rules evaluate successfully, alerts fire appropriately during incidents.
On failure:
- Validate YAML syntax with `yamllint slos/user-api.yml`
- Check Sloth version compatibility (v0.11+ recommended)
- Verify Prometheus recording rule evaluation: `curl http://localhost:9090/api/v1/rules`
- Test with synthetic error injection to trigger alerts (see the sketch after this list)
- Check the Sloth documentation for the SLI event query format
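For the synthetic error injection item, a deliberately rough sketch: generate sustained failing traffic so the fast-burn alert has something to fire on. The `/inject-error` endpoint is hypothetical; substitute a route or fault-injection mechanism your service actually supports:

```python
# Hedged sketch: sustained synthetic errors to exercise burn-rate alerts.
import time
import requests

TARGET = "http://localhost:8080/inject-error"  # hypothetical failing endpoint

for _ in range(300):  # ~5 minutes of errors at 1 req/s
    try:
        requests.get(TARGET, timeout=2)
    except requests.RequestException:
        pass  # failures are the point; keep generating traffic
    time.sleep(1)
```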
### Step 5: Build Error Budget Dashboards
Visualize SLO compliance and error budget consumption in Grafana.
Grafana dashboard JSON (excerpt):
{ "dashboard": { "title": "SLO Dashboard - User API", "panels": [ { "type": "stat", # ... (see EXAMPLES.md for complete configuration)
Key metrics to visualize:
- SLO target vs current SLI
- Error budget remaining (percentage and absolute)
- Burn rate (how fast budget is depleting)
- Historical SLI trends (30-day rolling window)
- Time to exhaustion (if current burn rate continues)
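The last item, time to exhaustion, is simple arithmetic on the remaining budget and the current burn rate; a minimal sketch (numbers are illustrative):

```python
# Time-to-exhaustion: how long until the budget runs out at the
# current burn rate. At burn rate 1.0 the full budget lasts one window.
window_days = 30
budget_remaining = 0.78  # fraction of error budget left (e.g., 78%)
burn_rate = 2.0          # consuming budget at 2x the sustainable rate

days_to_exhaustion = budget_remaining * window_days / burn_rate
print(f"Budget exhausted in ~{days_to_exhaustion:.1f} days at current rate")
```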
Error budget policy dashboard (markdown panel):
```markdown
## Error Budget Policy

**Current Status**: 78% budget remaining

### If Error Budget > 50%
- ✅ Full speed ahead on new features
# ... (see EXAMPLES.md for complete configuration)
```
Expected: Dashboards show real-time SLO compliance, error budget depletion visible, team can make informed decisions about feature velocity.
On failure:
- Verify recording rules exist: `curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name | contains("slo:"))'`
- Check that the Prometheus datasource in Grafana points at the correct URL
- Validate query results in Explore view before adding to dashboard
- Ensure time range set to appropriate window (e.g., 30d for monthly SLOs)
### Step 6: Establish Error Budget Policy
Define organizational process for managing error budgets.
Error budget policy template:
```yaml
service: user-api
slo:
  availability: 99.9%
  latency_p99: 200ms
window: 30 days
# ... (see EXAMPLES.md for complete configuration)
```
Automate policy enforcement:
```python
# Example: Deployment gate script
import requests
import sys

def check_error_budget(service):
    # Query Prometheus for error budget
    # ... (see EXAMPLES.md for complete configuration)
```
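A fuller, hedged sketch of what such a gate might look like (not the EXAMPLES.md version; the `slo:error_budget:remaining` recording rule name is hypothetical and should be replaced with whatever rule tracks your remaining budget):

```python
# Hedged sketch of a deployment gate: exit non-zero when the error
# budget is too depleted for a risky deploy.
import sys
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed local Prometheus
BUDGET_THRESHOLD = 0.10  # block deploys below 10% budget remaining

def check_error_budget(service: str) -> float:
    # Hypothetical recording rule name; substitute your own.
    query = f'slo:error_budget:remaining{{service="{service}"}}'
    resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        sys.exit(f"No error budget metric found for {service}")
    return float(result[0]["value"][1])

if __name__ == "__main__":
    remaining = check_error_budget(sys.argv[1])
    if remaining < BUDGET_THRESHOLD:
        sys.exit(f"Deploy blocked: only {remaining:.1%} error budget left")
    print(f"Deploy allowed: {remaining:.1%} error budget remaining")
```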
Integrate into CI/CD pipeline:
```yaml
# .github/workflows/deploy.yml
jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    steps:
      - name: Check SLO Error Budget
        run: |
          python scripts/check_error_budget.py user-api
      - name: Deploy
        if: success()
        run: |
          kubectl apply -f deploy/
```
Expected: Clear policy documented, automated gates prevent risky deployments during budget depletion, team alignment on reliability priorities.
On failure:
- Start with manual policy enforcement (Slack reminders)
- Gradually automate with soft gates (warnings, not blocks)
- Get executive buy-in before hard gates (blocking deployments)
- Review policy effectiveness quarterly, adjust thresholds as needed
## Validation
- SLIs selected reflect user experience and business impact
- SLO targets set with stakeholder agreement and documented rationale
- Prometheus recording rules generate SLI metrics successfully
- Multi-burn-rate alerts configured and tested with synthetic errors
- Grafana dashboards show real-time SLO compliance and error budget
- Error budget policy documented and communicated to team
- Automated gates prevent risky deployments during budget depletion
- Weekly/monthly SLO review meetings scheduled
- Incident retrospectives include SLO impact analysis
- SLO compliance reports shared with stakeholders
## Common Pitfalls
- Overly strict SLOs: Setting "five nines" without cost analysis leads to burnout and slowed feature velocity. Start achievable, iterate up.
- Too many SLIs: Tracking 10+ indicators creates confusion. Focus on 2-4 critical user-facing metrics.
- SLO without SLA buffer: Setting SLO equal to SLA leaves no margin for error before customer impact. Keep 0.05-0.1% buffer.
- Ignoring error budget: Tracking SLOs but not acting on budget depletion defeats the purpose. Enforce error budget policy.
- Vanity metrics as SLIs: Using internal metrics (CPU, memory) instead of user-visible metrics (latency, errors) misaligns priorities.
- No stakeholder buy-in: Engineering-only SLOs without product/business agreement lead to conflicts. Get executive sponsorship.
- Static SLOs: Never reviewing or adjusting targets as system evolves. Revisit quarterly based on actual performance and user feedback.
## Related Skills
- `setup-prometheus-monitoring`: Configure Prometheus to collect metrics for SLI calculation
- `configure-alerting-rules`: Integrate SLO burn rate alerts with Alertmanager for on-call notifications
- `build-grafana-dashboards`: Visualize SLO compliance and error budget consumption
- `write-incident-runbook`: Include SLO impact in runbooks for prioritizing incident response