gsd-skill-creator: sre-patterns

Provides Site Reliability Engineering best practices for SLOs, SLIs, SLAs, error budgets, toil reduction, reliability reviews, and capacity planning. Use when defining service objectives, measuring reliability, reducing toil, planning capacity, or when the user mentions 'SRE', 'SLO', 'SLI', 'SLA', 'error budget', 'toil', 'reliability', 'on-call', or 'capacity planning'.

Install

Source (clone the upstream repo):

git clone https://github.com/Tibsfox/gsd-skill-creator

Claude Code (install into ~/.claude/skills/):

T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/ops/sre-patterns" ~/.claude/skills/tibsfox-gsd-skill-creator-sre-patterns && rm -rf "$T"

Manifest: examples/skills/ops/sre-patterns/SKILL.md

Source content

SRE Patterns

Best practices for building and operating reliable systems using Site Reliability Engineering principles.

SLO / SLI / SLA Definitions

These three concepts form the foundation of SRE. They are distinct but frequently confused.

| Concept | Definition | Owner | Example |
|---------|------------|-------|---------|
| SLI (Service Level Indicator) | A quantitative measurement of a service attribute | Engineering | 99.2% of requests completed in < 300ms |
| SLO (Service Level Objective) | A target value or range for an SLI | Engineering + Product | 99.5% of requests must complete in < 300ms |
| SLA (Service Level Agreement) | A contract with consequences for missing an SLO | Business + Legal | 99.9% uptime or customer receives service credits |

Relationship

SLI (what you measure)
 --> SLO (what you target, always stricter than SLA)
  --> SLA (what you promise externally, with penalties)

Key rule: SLO must be stricter than SLA. If your SLA promises 99.9% uptime, your internal SLO should target 99.95%. The gap is your safety margin.
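
A quick way to see the margin is to convert both targets into allowed downtime, as in this minimal sketch (the 99.95% internal target mirrors the example above; the numbers are illustrative):

# slo_sla_margin.py

def downtime_minutes(availability: float, period_minutes: int = 43_200) -> float:
    """Allowed downtime for an availability target (default period: 30 days)."""
    return period_minutes * (1 - availability)

sla_budget = downtime_minutes(0.999)   # 43.2 min -- what you promise externally
slo_budget = downtime_minutes(0.9995)  # 21.6 min -- what you target internally
margin = sla_budget - slo_budget       # 21.6 min of safety margin before an SLA breach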

SLI Specification

SLIs must be precise, measurable, and tied to user experience. Vague indicators lead to meaningless objectives.

SLI Types by Service Category

| Service Type | SLI Category | Good Event | Valid Event |
|--------------|--------------|------------|-------------|
| Request-driven | Availability | Response status < 500 | All HTTP requests |
| Request-driven | Latency | Response time < 300ms | All HTTP requests |
| Data pipeline | Freshness | Data age < 10 minutes | All data records |
| Data pipeline | Correctness | Records with no processing errors | All processed records |
| Storage system | Durability | Objects retrievable after write | All stored objects |
| Storage system | Throughput | Read operations < 50ms | All read operations |

SLI Specification Example

# sli-specification.yaml
service: payment-api
slis:
  - name: availability
    description: Proportion of successful requests
    specification:
      good_event: "HTTP response status code is not 5xx"
      valid_event: "All HTTP requests to /api/v1/payments/*"
      measurement_source: load_balancer_logs
      measurement_window: rolling_28_days
    implementation:
      numerator: "count(status < 500)"
      denominator: "count(all requests)"
      exclude:
        - health_check_endpoints
        - synthetic_monitoring_requests

  - name: latency
    description: Proportion of requests served within threshold
    specification:
      good_event: "HTTP response completes within 300ms"
      valid_event: "All non-background HTTP requests"
      measurement_source: server_side_metrics
      measurement_window: rolling_28_days
    implementation:
      numerator: "count(duration_ms <= 300)"
      denominator: "count(all requests)"
      percentile_targets:
        p50: 100ms
        p95: 250ms
        p99: 500ms
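
The spec above reduces to a good-events / valid-events ratio. A minimal sketch of evaluating it, with hard-coded counts standing in for the measurement_source (the numbers are illustrative):

# sli_ratio.py -- counts would come from the measurement_source named
# in the spec; the values here are illustrative

def sli_ratio(good_events: int, valid_events: int) -> float:
    """SLI = good events / valid events, per the specification above."""
    if valid_events == 0:
        return 1.0  # no traffic; conventionally treated as meeting the SLI
    return good_events / valid_events

# Availability: responses with status < 500, after excluding health
# checks and synthetic monitoring requests
availability = sli_ratio(good_events=498_700, valid_events=500_000)  # 0.9974
# Latency: requests completing within 300ms over the same window
latency = sli_ratio(good_events=493_000, valid_events=500_000)       # 0.986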

Error Budget Calculation

The error budget is the complement of your SLO target -- the amount of unreliability you can tolerate.

Formula

Error Budget = 1 - SLO target

Example:
  SLO = 99.9% availability
  Error Budget = 1 - 0.999 = 0.1%

  In a 30-day month (43,200 minutes):
  Budget = 43,200 * 0.001 = 43.2 minutes of downtime allowed
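
The same arithmetic generalizes to any target and window. A small helper (a minimal sketch; it reproduces the table in the next section):

# error_budget.py

def allowed_downtime(slo_target: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - slo_target)

for target in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{target * 100:g}%: {allowed_downtime(target):.1f} min / 30 days, "
          f"{allowed_downtime(target, period_days=90):.1f} min / quarter")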

Error Budget by SLO Level

| SLO Target | Error Budget (30 days) | Error Budget (per quarter) | Practical Meaning |
|------------|------------------------|----------------------------|-------------------|
| 99.0% | 7 hours 12 min | 21 hours 36 min | Tolerates significant outages |
| 99.5% | 3 hours 36 min | 10 hours 48 min | Weekly maintenance window feasible |
| 99.9% | 43 minutes 12 sec | 2 hours 9 min | No room for long outages |
| 99.95% | 21 minutes 36 sec | 1 hour 5 min | Requires high automation |
| 99.99% | 4 minutes 19 sec | 12 min 58 sec | Requires redundancy at every layer |

Error Budget Burn Rate

Burn rate measures how fast you are consuming your error budget relative to the sustainable pace for the budget period. A burn rate of 1.0 exhausts the budget exactly at the end of the period; anything higher exhausts it early.

# error_budget_tracking.py

def calculate_burn_rate(
    error_count: int,
    total_requests: int,
    slo_target: float,
    budget_period_hours: float,
    elapsed_hours: float
) -> dict:
    """Calculate error budget consumption and burn rate.

    Assumes error_count and total_requests cover the elapsed window
    and that traffic is roughly uniform across the budget period.
    """
    error_budget = 1.0 - slo_target
    current_error_rate = error_count / total_requests if total_requests > 0 else 0.0

    # Burn rate: 1.0 means consuming budget exactly on pace to last the
    # full period; > 1.0 means burning faster than sustainable
    burn_rate = current_error_rate / error_budget if error_budget > 0 else float('inf')

    # Fraction of the full-period budget consumed so far (uniform traffic)
    budget_consumed = burn_rate * (elapsed_hours / budget_period_hours)

    # Hours until the budget runs out if the current burn rate continues
    hours_remaining = (
        (1.0 - budget_consumed) * budget_period_hours / burn_rate
        if burn_rate > 0 else float('inf')
    )

    return {
        "slo_target": slo_target,
        "error_budget_total": error_budget,
        "current_error_rate": current_error_rate,
        "budget_consumed_pct": budget_consumed * 100,
        "burn_rate": burn_rate,
        "hours_until_exhausted": max(0.0, hours_remaining),
    }

# Example usage:
# 99.9% SLO, 720-hour budget period (30 days), 168 hours elapsed (1 week)
result = calculate_burn_rate(
    error_count=150,
    total_requests=500_000,
    slo_target=0.999,
    budget_period_hours=720,
    elapsed_hours=168
)
# result["burn_rate"] == 0.3 here -- well within budget
# burn_rate > 1.0 --> alerting threshold

Error Budget Policy

An error budget policy defines what happens when the budget is consumed. Without a policy, the budget is just a number.

# error-budget-policy.yaml
service: payment-api
slo: 99.9% availability (rolling 28 days)
policy_owner: payments-team-lead
approved_by: vp-engineering
effective_date: 2025-01-15

budget_thresholds:
  - level: normal
    condition: "budget_consumed < 50%"
    actions:
      - Continue normal feature development
      - Standard deployment cadence (daily)
      - Routine reliability improvements as scheduled

  - level: caution
    condition: "budget_consumed >= 50% AND < 75%"
    actions:
      - Review recent deployments for reliability impact
      - Increase monitoring alert sensitivity
      - Prioritize known reliability-related bugs
      - Reduce deployment frequency to twice per week

  - level: critical
    condition: "budget_consumed >= 75% AND < 100%"
    actions:
      - Halt all non-reliability feature work
      - Require SRE approval for every deployment
      - Deploy only bug fixes and reliability improvements
      - Conduct focused reliability review within 48 hours
      - Notify stakeholders of SLO risk

  - level: exhausted
    condition: "budget_consumed >= 100%"
    actions:
      - Feature freeze until budget replenishes or root cause resolved
      - Emergency reliability review within 24 hours
      - Postmortem required for each new incident
      - All deployments require SRE sign-off and canary phase
      - Weekly status report to VP Engineering

escalation:
  budget_dispute: "Escalate to VP Engineering for arbitration"
  exemptions: "Launch exemptions require VP+ approval with risk acceptance"
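
Enforcement is easier when the thresholds are checked mechanically. A minimal sketch mapping consumed budget to the levels defined above (the thresholds mirror the YAML; the function name is illustrative):

# budget_policy_level.py -- thresholds mirror error-budget-policy.yaml above

def policy_level(budget_consumed_pct: float) -> str:
    """Map budget consumption to the policy level defined above."""
    if budget_consumed_pct >= 100:
        return "exhausted"  # feature freeze, emergency review
    if budget_consumed_pct >= 75:
        return "critical"   # halt non-reliability work, SRE-gated deploys
    if budget_consumed_pct >= 50:
        return "caution"    # slower deploys, prioritize reliability bugs
    return "normal"         # standard cadence

assert policy_level(42.0) == "normal"
assert policy_level(81.5) == "critical"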

Toil Measurement and Reduction

Toil is work that is manual, repetitive, automatable, reactive, of no enduring value, and that scales linearly with service growth. Toil is the enemy of reliability engineering.

Toil Characteristics

| Characteristic | Description | Example |
|----------------|-------------|---------|
| Manual | Requires a human to perform | SSH into server to restart service |
| Repetitive | Done more than once or twice | Weekly cert rotation by hand |
| Automatable | A machine could do it | Copying logs to analysis bucket |
| Reactive | Triggered by an event, not proactive | Responding to disk full alerts |
| No enduring value | Does not improve the service | Re-running failed batch jobs |
| Scales with service | More instances = more work | Manually updating configs per server |

Toil Measurement Framework

# toil-tracking.yaml
team: platform-sre
measurement_period: 2025-Q1
target_toil_budget: 30%  # Max 30% of team time on toil

categories:
  - name: incident_response
    hours_per_week: 8
    toil_percentage: 60%    # 60% of incident response is toil
    toil_hours: 4.8
    examples:
      - Manually restarting crashed services
      - Clearing stuck queue items
      - Responding to known-cause alerts without automation

  - name: deployment_support
    hours_per_week: 6
    toil_percentage: 40%
    toil_hours: 2.4
    examples:
      - Manual pre-deploy checklist verification
      - Running migration scripts by hand
      - Monitoring dashboards during every deploy

  - name: capacity_management
    hours_per_week: 4
    toil_percentage: 75%
    toil_hours: 3.0
    examples:
      - Manually resizing instances
      - Tracking disk usage via spreadsheets
      - Filing tickets to request quota increases

  - name: access_provisioning
    hours_per_week: 3
    toil_percentage: 90%
    toil_hours: 2.7
    examples:
      - Creating accounts across multiple systems
      - Rotating credentials manually
      - Revoking access for departing employees

summary:
  total_team_hours_per_week: 200  # 5 engineers * 40 hours
  total_toil_hours_per_week: 12.9
  toil_percentage: 6.45%
  status: within_budget
  top_reduction_targets:
    - access_provisioning   # 90% toil -- automate with IaC/SCIM
    - capacity_management   # 75% toil -- autoscaling policies
    - incident_response     # 60% toil -- self-healing automation
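
The summary block can be derived from the category data rather than maintained by hand. A minimal sketch (the numbers mirror the YAML above):

# toil_summary.py -- derives the summary block from the categories above

categories = {
    "incident_response":   (8, 0.60),
    "deployment_support":  (6, 0.40),
    "capacity_management": (4, 0.75),
    "access_provisioning": (3, 0.90),
}  # name: (hours_per_week, toil_percentage)

team_hours = 5 * 40  # 5 engineers * 40 hours
toil_hours = sum(h * pct for h, pct in categories.values())    # 12.9
toil_pct = 100 * toil_hours / team_hours                       # 6.45%
status = "within_budget" if toil_pct <= 30 else "over_budget"  # 30% target

# Reduction targets: sort by toil percentage, automate the worst first
targets = sorted(categories, key=lambda k: categories[k][1], reverse=True)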

Capacity Planning

Capacity planning ensures services can handle expected and unexpected load without degradation.

Capacity Planning Model

# capacity_model.py

import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class CapacityPlan:
    service: str
    current_peak_rps: float
    growth_rate_monthly: float       # e.g., 0.05 for 5% per month
    headroom_target: float           # e.g., 0.30 for 30% headroom
    max_rps_per_instance: float
    current_instances: int
    burst_multiplier: float = 2.0    # Expected burst over peak

    def projected_peak(self, months_ahead: int) -> float:
        """Project peak RPS N months from now."""
        return self.current_peak_rps * (1 + self.growth_rate_monthly) ** months_ahead

    def required_capacity(self, months_ahead: int) -> float:
        """Required RPS capacity including headroom and burst tolerance."""
        projected = self.projected_peak(months_ahead)
        with_burst = projected * self.burst_multiplier
        with_headroom = with_burst / (1 - self.headroom_target)
        return with_headroom

    def required_instances(self, months_ahead: int) -> int:
        """Number of instances needed."""
        capacity = self.required_capacity(months_ahead)
        return math.ceil(capacity / self.max_rps_per_instance)

    def months_until_scaling_needed(self) -> Optional[int]:
        """Months until current instance count is insufficient."""
        current_max = self.current_instances * self.max_rps_per_instance
        for month in range(1, 37):
            if self.required_capacity(month) > current_max:
                return month
        return None  # Sufficient for 3+ years

    def report(self, horizon_months: int = 6) -> str:
        lines = [f"Capacity Plan: {self.service}", "=" * 40]
        lines.append(f"Current peak: {self.current_peak_rps:.0f} RPS")
        lines.append(f"Current instances: {self.current_instances}")
        lines.append(f"Growth rate: {self.growth_rate_monthly*100:.1f}%/month")
        lines.append("")

        for m in [1, 3, 6, 12]:
            if m <= horizon_months:
                needed = self.required_instances(m)
                delta = needed - self.current_instances
                flag = " ** SCALE NEEDED **" if delta > 0 else ""
                lines.append(f"  +{m:2d} months: {needed} instances (delta: {delta:+d}){flag}")

        scaling_month = self.months_until_scaling_needed()
        if scaling_month is not None:
            lines.append(f"\nScaling needed in: {scaling_month} month(s)")
        else:
            lines.append("\nCapacity sufficient for 3+ years")
        return "\n".join(lines)


# Example:
plan = CapacityPlan(
    service="payment-api",
    current_peak_rps=1200,
    growth_rate_monthly=0.08,
    headroom_target=0.30,
    max_rps_per_instance=500,
    current_instances=10,
    burst_multiplier=2.0
)
print(plan.report(horizon_months=12))

Reliability Review Process

Reliability reviews are structured evaluations of a service's production readiness and ongoing operational health.

Pre-Launch Review

| Review Area | Key Questions | Pass Criteria |
|-------------|---------------|---------------|
| SLOs defined | Are SLIs and SLOs documented? | At least availability + latency SLOs |
| Monitoring | Are dashboards and alerts configured? | SLO-based alerts with multi-window burn rate |
| Incident response | Is there a runbook? | Documented runbook with escalation paths |
| Capacity | Can it handle 2x current load? | Load test results proving headroom |
| Dependencies | Are failure modes mapped? | Dependency map with fallback behavior |
| Rollback | Can you revert within 5 minutes? | Tested rollback procedure |
| Data integrity | Are backups tested? | Backup restore tested within last 30 days |
| Security | Has threat modeling been done? | Threat model documented, critical items resolved |

Ongoing Review Cadence

Weekly:  Error budget review (automated dashboard)
Monthly: Service health review (SRE + dev team, 30 min)
Quarterly: Full reliability review (cross-functional, 2 hours)
Annually: Architecture review (principal engineers + SRE, half day)

Monthly Service Health Review Template

## Service Health Review: [service-name]
Date: [date]
Attendees: [list]

### SLO Performance
- Availability SLO: [target] | Actual: [value] | Budget remaining: [%]
- Latency SLO: [target] | Actual: [value] | Budget remaining: [%]

### Incidents This Period
| Date | Severity | Duration | Budget Impact | Postmortem |
|------|----------|----------|--------------|------------|

### Toil Report
- Toil hours this period: [X]
- Top toil sources: [list]
- Automation tickets filed: [count]

### Action Items
| Item | Owner | Due Date | Status |
|------|-------|----------|--------|

### Capacity Outlook
- Current utilization: [%]
- Scaling needed by: [date or N/A]

On-Call Best Practices

On-Call Structure

| Practice | Recommendation | Rationale |
|----------|----------------|-----------|
| Rotation length | 1 week | Long enough for context, short enough to avoid burnout |
| Team size | Minimum 6-8 engineers | Ensures no one is on-call more than one week in six |
| Handoff | 30-minute overlap meeting | Transfer context on active issues |
| Escalation | Primary -> Secondary -> Team Lead -> Manager | Clear chain prevents ambiguity |
| Response time | 5 min acknowledge, 15 min start investigation | Documented in on-call agreement |
| Compensation | Time off in lieu or pay differential | On-call without compensation causes attrition |

Alert Quality

Good alert:
  - Actionable (something a human must do NOW)
  - Tied to an SLO (not a system metric)
  - Has a runbook link
  - Fires infrequently (< 2 per shift)

Bad alert:
  - Informational (log it, don't page)
  - Fires often and gets ignored (alert fatigue)
  - No runbook (engineer wastes time figuring out what to do)
  - Not tied to user impact

Multi-Window Burn Rate Alerting

Alert on error budget burn rate rather than raw error counts. Use multiple windows to balance sensitivity with false positive rate.

| Alert Severity | Burn Rate | Long Window | Short Window | Action |
|----------------|-----------|-------------|--------------|--------|
| Page (urgent) | 14.4x | 1 hour | 5 minutes | Immediate investigation |
| Page (less urgent) | 6x | 6 hours | 30 minutes | Investigate within 30 min |
| Ticket | 3x | 3 days | 6 hours | Fix within 1 business day |
| Log | 1x | 28 days | 3 days | Review at next planning |
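
A page should fire only when both windows exceed the threshold: the long window confirms real budget impact, the short window confirms the problem is still happening. A minimal sketch of that check (per-window burn rates would come from your metrics backend; the values below are illustrative):

# multiwindow_alert.py -- per-window burn rates come from your metrics
# backend; thresholds mirror the table above

# (threshold, long_window, short_window, severity)
RULES = [
    (14.4, "1h",  "5m",  "page_urgent"),
    (6.0,  "6h",  "30m", "page"),
    (3.0,  "3d",  "6h",  "ticket"),
    (1.0,  "28d", "3d",  "log"),
]

def evaluate(burn: dict[str, float]) -> list[str]:
    """Fire a rule only if BOTH its windows exceed the threshold."""
    fired = []
    for threshold, long_w, short_w, severity in RULES:
        if burn.get(long_w, 0) >= threshold and burn.get(short_w, 0) >= threshold:
            fired.append(severity)
    return fired

# Example: a fast, ongoing burn pages urgently
print(evaluate({"5m": 20.0, "1h": 16.0, "30m": 18.0, "6h": 8.0}))
# ['page_urgent', 'page']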

Incident Management

Severity Levels

| Severity | Definition | Response | Example |
|----------|------------|----------|---------|
| SEV1 | Service down, all users affected | Immediate, all-hands | Total outage of payment processing |
| SEV2 | Significant degradation, many users affected | Immediate, on-call team | Latency 10x normal, 30% errors |
| SEV3 | Partial degradation, some users affected | Within 1 hour | One region experiencing failures |
| SEV4 | Minor issue, few users affected | Next business day | Cosmetic issue in dashboard |

Postmortem Structure

Every SEV1 and SEV2 incident gets a blameless postmortem within 72 hours.

## Postmortem: [Incident Title]
Date: [incident date]
Duration: [start time] to [end time] ([total duration])
Severity: [SEV level]
Author: [name]
Reviewers: [names]

### Summary
[1-2 sentence description of what happened and impact]

### Impact
- Users affected: [number or percentage]
- Revenue impact: [if applicable]
- Error budget consumed: [percentage]
- SLO status: [still within / breached]

### Timeline
| Time (UTC) | Event |
|-----------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Full recovery confirmed |

### Root Cause
[Technical description of what caused the incident]

### Contributing Factors
- [Factor 1]
- [Factor 2]

### What Went Well
- [Thing 1]
- [Thing 2]

### What Could Be Improved
- [Thing 1]
- [Thing 2]

### Action Items
| Action | Type | Owner | Bug/Ticket | Due |
|--------|------|-------|-----------|-----|
| [action] | prevent | [name] | [link] | [date] |
| [action] | detect | [name] | [link] | [date] |
| [action] | mitigate | [name] | [link] | [date] |

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| SLOs set by management without engineering input | Unrealistic targets that create constant fire-fighting | SLOs must be data-driven and jointly owned by eng + product |
| 100% availability SLO | Impossible to maintain, blocks all deployments | Highest practical target is 99.999%; most services need 99.9% |
| SLIs measured at the server, not at the user | Misses network issues, CDN failures, client-side errors | Measure SLIs at the load balancer or client where possible |
| Error budget without a policy | Budget is tracked but nothing changes when it is consumed | Write and enforce a formal error budget policy document |
| Alerting on raw metrics instead of SLOs | Alert fatigue from non-user-impacting events | Use multi-window burn rate alerting tied to SLOs |
| Treating all toil as unavoidable | Team spends 60%+ time on repetitive manual work | Measure toil, set a budget (< 50%), automate top sources |
| No postmortems or blame-focused postmortems | Same incidents recur; engineers hide mistakes | Blameless postmortems with tracked action items |
| Capacity planning by gut feel | Over-provisioning wastes money; under-provisioning causes outages | Model growth, load test regularly, maintain 30% headroom |
| On-call without runbooks | Engineers waste time investigating known issues | Every alert must link to a runbook with diagnosis steps |
| On-call hero culture | One person handles everything, burns out, leaves with all context | Minimum team size of 6-8, mandatory rotation, no opt-out |
| Monitoring everything, alerting on nothing useful | Thousands of metrics, dashboards no one looks at | Focus on the 3-5 SLIs that represent user experience |
| SLAs more aggressive than SLOs | No safety margin; internal failures immediately become contract breaches | SLOs must be stricter than SLAs by a meaningful margin |
| Skipping reliability reviews before launch | Production issues discovered by users, not engineers | Mandatory pre-launch review with documented pass criteria |

SRE Maturity Checklist

Level 1: Foundations

  • SLIs defined for every user-facing service
  • SLOs documented and published to stakeholders
  • Basic monitoring and dashboards in place
  • On-call rotation established with at least 6 engineers
  • Incident response process documented
  • Postmortem process established (blameless)

Level 2: Operational

  • Error budgets calculated and tracked automatically
  • Error budget policy written and enforced
  • Multi-window burn rate alerting configured
  • Toil measured and tracked quarterly
  • Runbooks exist for every on-call alert
  • Monthly service health reviews conducted
  • Capacity planning done quarterly with growth projections

Level 3: Proactive

  • Toil consistently below 30% of team time
  • Automated remediation for top 5 incident categories
  • Chaos engineering practiced regularly
  • Load testing integrated into release pipeline
  • Dependency failure modes mapped and tested
  • SLO-informed release decisions (feature freeze when budget low)
  • Cross-team reliability standards established

Level 4: Optimized

  • SLO performance consistently within target for 4+ quarters
  • Error budget rarely exhausted (< 1 per quarter)
  • Toil below 15% of team time
  • Automated capacity planning with autoscaling
  • Proactive reliability improvements outpace reactive work
  • Reliability culture embedded across all engineering teams
  • Regular architecture reviews with reliability as primary lens