SRE & Incident Management Platform

Complete Site Reliability Engineering system — from SLO definition through incident response, chaos engineering, and operational excellence. Zero dependencies.

Install the skill for Claude Code or OpenClaw:

```bash
# Claude Code
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/1kalin/afrexai-sre-platform" ~/.claude/skills/openclaw-skills-afrexai-sre-platform && rm -rf "$T"

# OpenClaw
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/1kalin/afrexai-sre-platform" ~/.openclaw/skills/openclaw-skills-afrexai-sre-platform && rm -rf "$T"
```
Phase 1: Reliability Assessment
Before building anything, assess where you are.
Service Catalog Entry
```yaml
service:
  name: ""
  tier: ""                    # critical | important | standard | experimental
  owner_team: ""
  oncall_rotation: ""
  dependencies:
    upstream: []              # services we call
    downstream: []            # services that call us
  data_classification: ""     # public | internal | confidential | restricted
  deployment_frequency: ""    # daily | weekly | biweekly | monthly
  architecture: ""            # monolith | microservice | serverless | hybrid
  language: ""
  infra: ""                   # k8s | ECS | Lambda | VM | bare-metal
  traffic_pattern: ""         # steady | diurnal | spiky | seasonal
  peak_rps: 0
  storage_gb: 0
  monthly_cost_usd: 0
```
Maturity Assessment (Score 1-5 per dimension)
| Dimension | 1 (Ad-hoc) | 3 (Defined) | 5 (Optimized) | Score |
|---|---|---|---|---|
| SLOs | No SLOs defined | SLOs exist, reviewed quarterly | Data-driven SLOs, auto error budgets | |
| Monitoring | Basic health checks | Golden signals + dashboards | Full observability, anomaly detection | |
| Incident Response | No runbooks, hero culture | Documented process, postmortems | Automated detection, structured ICS | |
| Automation | Manual deployments | CI/CD pipeline, some automation | Self-healing, auto-scaling, GitOps | |
| Chaos Engineering | No testing | Basic failure injection | Continuous chaos in production | |
| Capacity Planning | Reactive scaling | Quarterly forecasting | Predictive auto-scaling | |
| Toil Management | >50% toil | Toil tracked, reduction plans | <25% toil, systematic elimination | |
| On-Call Health | Burnout, 24/7 individuals | Rotation exists, escalation paths | Balanced load, <2 pages/shift | |
Score interpretation:
- 8-16: Firefighting mode — start with SLOs + incident process
- 17-24: Foundation built — add chaos engineering + toil reduction
- 25-32: Maturing — optimize error budgets + capacity planning
- 33-40: Advanced — focus on predictive reliability + culture
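A minimal sketch of turning the eight dimension scores into a recommendation; the band boundaries are copied from the interpretation above, and the sample scores are illustrative:

```python
def interpret(scores: list[int]) -> str:
    """Sum the eight 1-5 dimension scores and map the total to the bands above."""
    total = sum(scores)
    if total <= 16:
        return f"{total}: Firefighting mode — start with SLOs + incident process"
    if total <= 24:
        return f"{total}: Foundation built — add chaos engineering + toil reduction"
    if total <= 32:
        return f"{total}: Maturing — optimize error budgets + capacity planning"
    return f"{total}: Advanced — focus on predictive reliability + culture"

print(interpret([2, 3, 2, 3, 1, 2, 2, 3]))   # "18: Foundation built — ..."
```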
Phase 2: SLI/SLO Framework
SLI Selection by Service Type
| Service Type | Primary SLI | Secondary SLIs |
|---|---|---|
| API/Backend | Request success rate | Latency p50/p95/p99, throughput |
| Frontend/Web | Page load (LCP) | FID/INP, CLS, error rate |
| Data Pipeline | Freshness | Correctness, completeness, throughput |
| Storage | Durability | Availability, latency |
| Streaming | Processing latency | Throughput, ordering, data loss rate |
| Batch Job | Success rate | Duration, SLA compliance |
| ML Model | Prediction latency | Accuracy drift, feature freshness |
SLI Specification Template
```yaml
sli:
  name: "request_success_rate"
  description: "Proportion of valid requests served successfully"
  type: "availability"        # availability | latency | quality | freshness
  measurement:
    good_events: "HTTP responses with status < 500"
    total_events: "All HTTP requests excluding health checks"
    source: "load balancer access logs"
    aggregation: "sum(good) / sum(total) over rolling 28-day window"
  exclusions:
    - "Health check endpoints (/healthz, /readyz)"
    - "Synthetic monitoring traffic"
    - "Requests from blocked IPs"
    - "4xx responses (client errors)"
```
SLO Target Selection Guide
| Nines | Uptime % | Downtime/month | Appropriate for |
|---|---|---|---|
| 2 nines | 99% | 7h 18m | Internal tools, dev environments |
| 2.5 nines | 99.5% | 3h 39m | Non-critical services, backoffice |
| 3 nines | 99.9% | 43m 50s | Standard production services |
| 3.5 nines | 99.95% | 21m 55s | Important customer-facing services |
| 4 nines | 99.99% | 4m 23s | Critical services, payments, auth |
| 5 nines | 99.999% | 26s | Life-safety, financial clearing |
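A sketch of the arithmetic behind the downtime column, assuming an average month of 365/12 ≈ 30.44 days:

```python
def allowed_downtime_minutes(target_pct: float, window_days: float = 365 / 12) -> float:
    """Minutes of downtime permitted by an availability target over one average month."""
    return (1 - target_pct / 100) * window_days * 24 * 60

for target in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/month")
# 99.9% -> 43.8 min (≈ 43m 50s), 99.99% -> 4.4 min (≈ 4m 23s), 99.999% -> 0.4 min (≈ 26s)
```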
Rules for setting targets:
- Start lower than you think — you can always tighten
- Set the SLO tighter than the SLA (always keep a buffer — typically a 0.1-0.5% margin)
- Set internal SLOs tighter than external SLOs (catch problems before customers do)
- Each nine costs ~10x more to achieve
- If you can't measure it, you can't SLO it
SLO Document Template
```yaml
slo:
  service: ""
  sli: ""
  target: 99.9                # percentage
  window: "28d"               # rolling window
  error_budget: 0.1           # 100% - target
  error_budget_minutes: 40    # per 28-day window
  burn_rate_alerts:
    - name: "fast_burn"
      burn_rate: 14.4         # exhausts budget in ~2 days
      short_window: "5m"
      long_window: "1h"
      severity: "page"
    - name: "medium_burn"
      burn_rate: 6.0          # exhausts budget in ~5 days
      short_window: "30m"
      long_window: "6h"
      severity: "page"
    - name: "slow_burn"
      burn_rate: 1.0          # exhausts budget in 28 days
      short_window: "6h"
      long_window: "3d"
      severity: "ticket"
  review_cadence: "monthly"
  owner: ""
  stakeholders: []
  escalation_when_budget_exhausted:
    - "Halt non-critical deployments"
    - "Redirect engineering to reliability work"
    - "Escalate to VP Engineering if no improvement in 48h"
```
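A minimal sketch of how a multi-window burn-rate alert can be evaluated. The `error_ratio(window)` callable is a placeholder for a metrics-store query, not a real API; the point is that an alert fires only when both the short and long windows exceed the burn-rate threshold, which filters out spikes that have already ended:

```python
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET            # 0.001

ALERTS = [
    {"name": "fast_burn",   "burn_rate": 14.4, "short": "5m",  "long": "1h", "severity": "page"},
    {"name": "medium_burn", "burn_rate": 6.0,  "short": "30m", "long": "6h", "severity": "page"},
    {"name": "slow_burn",   "burn_rate": 1.0,  "short": "6h",  "long": "3d", "severity": "ticket"},
]

def should_fire(alert: dict, error_ratio) -> bool:
    """error_ratio(window) is a placeholder returning the observed bad/total ratio."""
    threshold = alert["burn_rate"] * ERROR_BUDGET
    # Require BOTH windows above threshold: the long window proves sustained burn,
    # the short window proves it is still happening right now.
    return error_ratio(alert["short"]) > threshold and error_ratio(alert["long"]) > threshold

# Illustrative check with a hard-coded observation: 0.9% errors in every window
observed = lambda window: 0.009
for alert in ALERTS:
    print(alert["name"], "fires" if should_fire(alert, observed) else "quiet")
```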
Phase 3: Error Budget Management
Error Budget Policy
```yaml
error_budget_policy:
  service: ""
  budget_states:
    healthy:
      condition: "remaining_budget > 50%"
      actions:
        - "Normal development velocity"
        - "Feature work prioritized"
        - "Chaos experiments allowed"
    warning:
      condition: "remaining_budget 25-50%"
      actions:
        - "Increase monitoring scrutiny"
        - "Review recent changes for risk"
        - "Limit risky deployments to business hours"
        - "No chaos experiments"
    critical:
      condition: "remaining_budget 0-25%"
      actions:
        - "Feature freeze — reliability work only"
        - "All deployments require SRE approval"
        - "Mandatory rollback plan for every change"
        - "Daily error budget review"
    exhausted:
      condition: "remaining_budget <= 0"
      actions:
        - "Complete deployment freeze"
        - "All engineering redirected to reliability"
        - "VP Engineering notified"
        - "Postmortem required for budget exhaustion"
        - "Freeze maintained until budget recovers to 10%"
  exceptions:
    - "Security patches always allowed"
    - "Regulatory compliance changes always allowed"
    - "Data loss prevention always allowed"
  reset: "Rolling 28-day window (no manual resets)"
```
Burn Rate Calculation
Burn rate = (error rate observed) / (error rate allowed by SLO)

Example:
- SLO: 99.9% (error budget = 0.1%)
- Current error rate: 0.5%
- Burn rate = 0.5% / 0.1% = 5x
- At 5x burn rate → budget exhausted in 28d / 5 = 5.6 days
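The same arithmetic as a small helper, useful on a dashboard or in the weekly review (a sketch; the observed error rate would come from your metrics backend):

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    error_budget = 1.0 - slo_target
    return observed_error_rate / error_budget

def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Days until the budget is gone, assuming the current burn rate holds."""
    return float("inf") if rate <= 0 else window_days / rate

# Example from above: 0.5% errors against a 99.9% SLO
r = burn_rate(0.005, 0.999)
print(round(r, 1), round(days_to_exhaustion(r), 1))   # 5.0x burn, ~5.6 days left
```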
Error Budget Dashboard
Track weekly:
| Metric | Current | Trend (↑↓→) | Status (🟢🟡🔴) |
|---|---|---|---|
| Budget remaining (%) | | | |
| Budget consumed this week | | | |
| Burn rate (1h / 6h / 24h) | | | |
| Incidents consuming budget | | | |
| Top error contributor | | | |
| Projected exhaustion date | | | |
Phase 4: Monitoring & Alerting Architecture
Four Golden Signals
| Signal | What to Measure | Alert When |
|---|---|---|
| Latency | p50, p95, p99 response time | p99 > 2x baseline for 5 min |
| Traffic | Requests/sec, concurrent users | >30% drop (indicates upstream issue) OR >50% spike |
| Errors | 5xx rate, timeout rate, exception rate | Error rate > SLO burn rate threshold |
| Saturation | CPU, memory, disk, connections, queue depth | >80% sustained for 10 min |
USE Method (Infrastructure)
For every resource, track:
- Utilization: % of capacity used (0-100%)
- Saturation: queue depth / wait time (0 = no waiting)
- Errors: error count / error rate
RED Method (Services)
For every service, track:
- Rate: requests per second
- Errors: failed requests per second
- Duration: latency distribution
Alert Design Rules
- Every alert must have a runbook link — no exceptions
- Every alert must be actionable — if you can't act on it, delete it
- Symptoms over causes — alert on "users can't check out" not "database CPU high"
- Multi-window, multi-burn-rate — avoid single-threshold alerts
- Page only for customer impact — everything else is a ticket
- Alert fatigue = death — review alert volume monthly; target <5 pages/week per service
Alert Severity Guide
| Severity | Response Time | Notification | Examples |
|---|---|---|---|
| P0/Page | <5 min | PagerDuty + phone | SLO burn rate critical, data loss, security breach |
| P1/Urgent | <30 min | Slack + PagerDuty | Degraded service, elevated errors, capacity warning |
| P2/Ticket | Next business day | Ticket auto-created | Slow burn, non-critical component down |
| P3/Log | Weekly review | Dashboard only | Informational, trend detection |
Structured Log Standard
{ "timestamp": "2026-02-17T11:24:00.000Z", "level": "error", "service": "payment-api", "trace_id": "abc123", "span_id": "def456", "message": "Payment processing failed", "error_type": "TimeoutException", "error_message": "Gateway timeout after 30s", "http_method": "POST", "http_path": "/api/v1/payments", "http_status": 504, "duration_ms": 30012, "customer_id": "cust_xxx", "payment_id": "pay_yyy", "amount_cents": 4999, "retry_count": 2, "environment": "production", "host": "payment-api-7b4d9-xk2p1", "region": "us-east-1" }
Phase 5: Incident Response Framework
Severity Classification Matrix
| | Impact: 1 User | Impact: <25% Users | Impact: >25% Users | Impact: All Users |
|---|---|---|---|---|
| Core function down | SEV3 | SEV2 | SEV1 | SEV1 |
| Degraded performance | SEV4 | SEV3 | SEV2 | SEV1 |
| Non-core feature down | SEV4 | SEV3 | SEV3 | SEV2 |
| Cosmetic/minor | SEV4 | SEV4 | SEV3 | SEV3 |
Auto-escalation triggers:
- Any data loss → SEV1 minimum
- Security breach with PII → SEV1
- Revenue-impacting → SEV1 or SEV2
- SLA breach imminent → auto-escalate one level
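A sketch of the matrix plus the auto-escalation triggers as a classifier; the impact and scope keys are illustrative labels chosen to mirror the rows and columns above:

```python
# Rows are impact type, columns are user scope ("one", "lt25", "gt25", "all").
MATRIX = {
    "core_down":    {"one": 3, "lt25": 2, "gt25": 1, "all": 1},
    "degraded":     {"one": 4, "lt25": 3, "gt25": 2, "all": 1},
    "noncore_down": {"one": 4, "lt25": 3, "gt25": 3, "all": 2},
    "cosmetic":     {"one": 4, "lt25": 4, "gt25": 3, "all": 3},
}

def classify(impact: str, scope: str, *, data_loss=False, pii_breach=False,
             revenue_impact=False, sla_breach_imminent=False) -> int:
    sev = MATRIX[impact][scope]
    if data_loss or pii_breach:
        sev = 1                      # any data loss or PII breach -> SEV1 minimum
    if revenue_impact:
        sev = min(sev, 2)            # revenue-impacting -> SEV1 or SEV2
    if sla_breach_imminent:
        sev = max(sev - 1, 1)        # auto-escalate one level
    return sev

print(classify("degraded", "gt25"))                       # 2 -> SEV2
print(classify("noncore_down", "lt25", data_loss=True))   # 1 -> SEV1
```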
Incident Command System (ICS)
| Role | Responsibility | Assigned |
|---|---|---|
| Incident Commander (IC) | Owns resolution, makes decisions, manages timeline | |
| Communications Lead | Status updates, stakeholder comms, customer-facing | |
| Operations Lead | Hands-on-keyboard, executing fixes | |
| Subject Matter Expert | Deep knowledge of affected system | |
| Scribe | Documenting timeline, actions, decisions |
IC Rules:
- IC does NOT debug — IC coordinates
- IC makes final decisions when team disagrees
- IC can escalate severity at any time
- IC owns handoff if rotation changes
- IC calls end-of-incident
Incident Response Workflow
```
DETECT → TRIAGE → RESPOND → MITIGATE → RESOLVE → REVIEW

Step 1: DETECT (0-5 min)
├── Alert fires OR user report received
├── On-call acknowledges within SLA
└── Quick assessment: is this real? What severity?

Step 2: TRIAGE (5-15 min)
├── Classify severity using matrix above
├── Assign IC and roles
├── Open incident channel (#inc-YYYY-MM-DD-title)
├── Post initial status update
└── Start timeline document

Step 3: RESPOND (15 min - ongoing)
├── IC briefs team: "Here's what we know, here's what we don't"
├── Operations Lead begins investigation
├── Check: recent deployments? Config changes? Dependency issues?
├── Parallel investigation tracks if needed
└── 15-minute check-ins for SEV1, 30-min for SEV2

Step 4: MITIGATE (ASAP)
├── Priority: STOP THE BLEEDING
├── Options (fastest first):
│   ├── Rollback last deployment
│   ├── Feature flag disable
│   ├── Traffic shift / failover
│   ├── Scale up / circuit breaker
│   └── Manual data fix
├── Mitigated ≠ Resolved — temporary fix is OK
└── Update status: "Impact mitigated, root cause investigation ongoing"

Step 5: RESOLVE
├── Root cause identified and fixed
├── Verification: SLIs back to normal for 30+ minutes
├── All-clear communicated
└── IC declares incident resolved

Step 6: REVIEW (within 5 business days)
├── Blameless postmortem written
├── Action items assigned with owners and deadlines
├── Postmortem review meeting
└── Action items tracked to completion
```
Communication Templates
Initial notification (internal):
```
🔴 INCIDENT: [Title]
Severity: SEV[X]
Impact: [Who/what is affected]
Status: Investigating
IC: [Name]
Channel: #inc-[date]-[slug]
Next update: [time]
```
Customer-facing status:
```
[Service] - Investigating increased error rates

We are currently investigating reports of [symptom]. Some users may
experience [user-visible impact]. Our team is actively working on a
resolution. We will provide an update within [time].
```
Resolution notification:
```
✅ RESOLVED: [Title]
Duration: [X hours Y minutes]
Impact: [Summary]
Root cause: [One sentence]
Postmortem: [Link] (within 5 business days)
```
Phase 6: Postmortem Framework
Blameless Postmortem Template
```yaml
postmortem:
  title: ""
  date: ""
  severity: ""              # SEV1-4
  duration: ""              # total incident duration
  authors: []
  reviewers: []
  status: "draft"           # draft | in-review | final

  summary: |
    One paragraph: what happened, what was the impact, how was it resolved.

  impact:
    users_affected: 0
    duration_minutes: 0
    revenue_impact_usd: 0
    slo_budget_consumed_pct: 0
    data_loss: false
    customer_tickets: 0

  timeline:
    - time: ""
      event: ""
    # Chronological, every significant event
    # Include detection time, escalation, mitigation attempts

  root_cause: |
    Technical explanation of WHY it happened.
    Go deep — surface causes are not root causes.

  contributing_factors:
    - ""                    # What made it worse or delayed resolution?

  detection:
    how_detected: ""        # alert | user report | manual check
    time_to_detect_minutes: 0
    could_have_detected_sooner: ""

  resolution:
    how_resolved: ""
    time_to_mitigate_minutes: 0
    time_to_resolve_minutes: 0

  what_went_well:
    - ""                    # Explicitly call out what worked
  what_went_wrong:
    - ""
  where_we_got_lucky:
    - ""                    # Things that could have made it worse

  action_items:
    - id: "AI-001"
      type: ""              # prevent | detect | mitigate | process
      description: ""
      owner: ""
      priority: ""          # P0 | P1 | P2
      deadline: ""
      status: "open"        # open | in-progress | done
      ticket: ""
```
Root Cause Analysis Methods
Five Whys (simple incidents):
- Why did users see errors? → API returned 500s
- Why did API return 500s? → Database connection pool exhausted
- Why was pool exhausted? → Long-running query held connections
- Why was query long-running? → Missing index on new column
- Why was index missing? → Migration didn't include index; no query performance review in CI
→ Root cause: No automated query performance check in deployment pipeline
→ Action: Add query plan analysis to CI for migration PRs
Fishbone / Ishikawa (complex incidents):
```
Categories to investigate:
├── People: Training? Fatigue? Communication?
├── Process: Runbook? Escalation? Change management?
├── Technology: Bug? Config? Capacity? Dependency?
├── Environment: Network? Cloud provider? Third party?
├── Monitoring: Detection gap? Alert fatigue? Dashboard gap?
└── Testing: Test coverage? Load testing? Chaos testing?
```
Contributing Factor Categories:
| Category | Questions |
|---|---|
| Trigger | What change or event started it? |
| Propagation | Why did it spread? Why wasn't it contained? |
| Detection | Why wasn't it caught earlier? |
| Resolution | What slowed the fix? |
| Process | What process gaps contributed? |
Postmortem Review Meeting (60 min)
1. Timeline walk-through (15 min)
   - Author presents chronology
   - Attendees add context ("I remember seeing X at this point")
2. Root cause deep-dive (15 min)
   - Do we agree on root cause?
   - Are there additional contributing factors?
3. Action item review (20 min)
   - Are these the RIGHT actions?
   - Are they prioritized correctly?
   - Do owners agree on deadlines?
4. Process improvements (10 min)
   - Could we have detected this sooner?
   - Could we have resolved this faster?
   - What would have prevented this entirely?
Phase 7: Chaos Engineering
Chaos Maturity Model
| Level | Name | Activities |
|---|---|---|
| 0 | None | No chaos testing |
| 1 | Exploratory | Manual fault injection in staging |
| 2 | Systematic | Scheduled chaos experiments in staging |
| 3 | Production | Controlled chaos in production (Game Days) |
| 4 | Continuous | Automated chaos in production with safety controls |
Chaos Experiment Template
```yaml
experiment:
  name: ""
  hypothesis: "When [fault], the system will [expected behavior]"
  steady_state:
    metrics:
      - name: ""
        baseline: ""
        acceptable_range: ""
  method:
    fault_type: ""          # network | compute | storage | dependency | data
    target: ""              # which service/component
    blast_radius: ""        # single pod | single AZ | percentage of traffic
    duration: ""
  safety:
    abort_conditions:
      - "SLO burn rate exceeds 10x"
      - "Customer-visible errors detected"
      - "Alert fires that we didn't expect"
    rollback_plan: ""
    required_approvals: []
  results:
    outcome: ""             # confirmed | disproved | inconclusive
    observations: []
    action_items: []
```
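A sketch of the safety loop that would wrap a fault injection: poll the abort conditions while the fault is active and roll back the moment one trips. The `inject_fault`, `remove_fault`, and `check_abort_conditions` callables are placeholders, not a real chaos tool's API:

```python
import time

def run_experiment(inject_fault, remove_fault, check_abort_conditions,
                   duration_s: int, poll_s: int = 10) -> str:
    """Run a chaos experiment with an abort watchdog; returns the outcome."""
    inject_fault()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            tripped = check_abort_conditions()   # e.g. burn rate > 10x, unexpected alert
            if tripped:
                return f"aborted: {tripped}"
            time.sleep(poll_s)
        return "completed"
    finally:
        remove_fault()                           # always roll back, even on abort or exception
```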
Chaos Experiment Library
| Category | Experiment | Validates |
|---|---|---|
| Network | Add 200ms latency to DB calls | Timeout handling, circuit breakers |
| Network | Drop 5% of packets to downstream | Retry logic, error handling |
| Network | DNS resolution failure | Caching, fallback, error messages |
| Compute | Kill random pod every 10 min | Auto-restart, load balancing |
| Compute | CPU stress to 95% on 1 node | Auto-scaling, graceful degradation |
| Compute | Fill disk to 95% | Disk monitoring, log rotation, alerts |
| Storage | Increase DB latency 5x | Connection pool handling, timeouts |
| Storage | Simulate cache failure (Redis down) | Cache-aside pattern, DB fallback |
| Dependency | Block external API (payment provider) | Circuit breaker, queuing, retry |
| Dependency | Return 429s from auth service | Rate limit handling, backoff |
| Data | Clock skew on subset of nodes | Timestamp handling, ordering |
| Scale | 10x traffic spike over 5 minutes | Auto-scaling speed, queue depth |
Game Day Runbook
```
PRE-GAME (1 week before):
□ Experiment designed and reviewed
□ Steady-state metrics identified
□ Abort conditions defined
□ All participants briefed
□ Rollbacks tested in staging
□ Stakeholders notified

GAME DAY:
□ Verify steady state (15 min baseline)
□ Announce in #engineering: "Chaos Game Day starting"
□ Inject fault
□ Observe and document
□ If abort condition hit → rollback immediately
□ Run for planned duration
□ Remove fault
□ Verify recovery to steady state

POST-GAME (same day):
□ Results documented
□ Surprises noted
□ Action items created
□ Share findings in team meeting
```
Phase 8: Toil Management
Toil Identification
Definition: Work that is manual, repetitive, automatable, tactical, without enduring value, and scales linearly with service growth.
Toil Inventory Template
```yaml
toil_item:
  name: ""
  category: ""              # deployment | scaling | config | data | access | monitoring | recovery
  frequency: ""             # daily | weekly | monthly | per-incident
  time_per_occurrence_min: 0
  occurrences_per_month: 0
  total_hours_per_month: 0
  teams_affected: []
  automation_difficulty: "" # low | medium | high
  automation_value: 0       # hours saved per month
  priority_score: 0         # value / difficulty
```
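A sketch of turning that inventory into a ranked backlog: derive hours per month, divide by a numeric difficulty weight, and sort. The difficulty weights and sample items are assumptions for illustration, not part of the template:

```python
DIFFICULTY_WEIGHT = {"low": 1, "medium": 2, "high": 4}   # assumed divisors

def priority_score(item: dict) -> float:
    """Hours of toil per month divided by automation difficulty."""
    hours = item["time_per_occurrence_min"] * item["occurrences_per_month"] / 60
    return hours / DIFFICULTY_WEIGHT[item["automation_difficulty"]]

inventory = [
    {"name": "manual deploys", "time_per_occurrence_min": 30,
     "occurrences_per_month": 40, "automation_difficulty": "medium"},
    {"name": "cert renewals", "time_per_occurrence_min": 45,
     "occurrences_per_month": 4, "automation_difficulty": "low"},
]

for item in sorted(inventory, key=priority_score, reverse=True):
    print(f"{item['name']}: priority {priority_score(item):.1f}")
```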
Toil Reduction Priority Matrix
| | Low Effort | Medium Effort | High Effort |
|---|---|---|---|
| High Value (>10 hrs/mo) | DO FIRST | DO SECOND | PLAN |
| Med Value (2-10 hrs/mo) | DO SECOND | PLAN | EVALUATE |
| Low Value (<2 hrs/mo) | QUICK WIN | SKIP | SKIP |
Common Toil Targets (Ranked by Impact)
- Manual deployments → CI/CD pipeline + GitOps
- Access provisioning → Self-service + auto-approval for low-risk
- Certificate renewals → Auto-renewal (cert-manager, Let's Encrypt)
- Scaling decisions → HPA + predictive auto-scaling
- Log investigation → Structured logging + correlation + dashboards
- Data fixes → Self-service admin tools + validation at ingestion
- Config changes → Config-as-code + automated rollout
- Incident response → Automated runbooks for known issues
- Capacity reporting → Automated dashboards + forecasting
- On-call triage → Noise reduction + auto-remediation for known patterns
Toil Budget Rule
Target: <25% of SRE time spent on toil. Track monthly. If above 25%, prioritize automation over all feature work.
Phase 9: Capacity Planning
Capacity Model Template
```yaml
capacity_model:
  service: ""
  bottleneck_resource: ""   # CPU | memory | storage | connections | bandwidth
  current_state:
    peak_utilization_pct: 0
    headroom_pct: 0
    cost_per_month_usd: 0
  growth_forecast:
    metric: ""              # MAU | requests/sec | storage_gb
    current: 0
    monthly_growth_pct: 0
    projected_6mo: 0
    projected_12mo: 0
  scaling_strategy:
    type: ""                # horizontal | vertical | hybrid
    auto_scaling: true
    min_instances: 0
    max_instances: 0
    scale_up_threshold: 80  # % utilization
    scale_down_threshold: 30
    cooldown_seconds: 300
  cost_projection:
    current_monthly: 0
    projected_6mo_monthly: 0
    projected_12mo_monthly: 0
```
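A sketch of the compounding growth projection behind `projected_6mo` and `projected_12mo`, assuming the monthly growth rate holds; the input numbers are illustrative:

```python
def project(current: float, monthly_growth_pct: float, months: int) -> float:
    """Compound the current value forward by a constant monthly growth rate."""
    return current * (1 + monthly_growth_pct / 100) ** months

peak_rps = 1200          # illustrative current peak
growth = 8               # 8% month-over-month
print(round(project(peak_rps, growth, 6)))    # ~1904 rps in 6 months
print(round(project(peak_rps, growth, 12)))   # ~3022 rps in 12 months
```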
Capacity Planning Cadence
| Frequency | Action |
|---|---|
| Daily | Review auto-scaling events, check for anomalies |
| Weekly | Review utilization trends, spot-check headroom |
| Monthly | Update growth model, review cost projections |
| Quarterly | Full capacity review, budget planning, architecture check |
| Pre-launch | Load test to 2x expected peak, verify scaling |
Load Testing Benchmarks
| Scenario | Method | Duration | Target |
|---|---|---|---|
| Baseline | Steady load at current peak | 30 min | Establish metrics |
| Growth | 2x current peak | 15 min | Verify scaling works |
| Spike | 10x normal in 60 seconds | 5 min | Circuit breakers hold |
| Soak | 1.5x normal load | 4 hours | No memory leaks, degradation |
| Stress | Ramp until failure | Until break | Find actual limits |
Phase 10: On-Call Excellence
On-Call Health Metrics
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Pages per shift | <2 | 2-5 | >5 |
| Off-hours pages | <1/week | 1-3/week | >3/week |
| Time to acknowledge | <5 min | 5-15 min | >15 min |
| Time to mitigate | <30 min | 30-60 min | >60 min |
| False positive rate | <10% | 10-30% | >30% |
| Escalation rate | <20% | 20-40% | >40% |
| On-call satisfaction | >4/5 | 3-4/5 | <3/5 |
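A sketch of grading a rotation against the thresholds above; the threshold values come from the table, the input numbers are illustrative:

```python
def grade(value: float, healthy: float, critical: float, lower_is_better: bool = True) -> str:
    """Map a metric to healthy / warning / critical using the table's thresholds."""
    if lower_is_better:
        if value < healthy:
            return "healthy"
        if value > critical:
            return "critical"
    else:
        if value > healthy:
            return "healthy"
        if value < critical:
            return "critical"
    return "warning"

print(grade(1.5, healthy=2, critical=5))                          # pages per shift -> healthy
print(grade(22, healthy=10, critical=30))                         # false positive % -> warning
print(grade(2.8, healthy=4, critical=3, lower_is_better=False))   # satisfaction -> critical
```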
On-Call Rotation Best Practices
- Minimum rotation size: 5 people (one week on, four weeks off)
- No back-to-back weeks unless team is too small (fix the team size)
- Follow-the-sun for global teams (no one gets paged at 3 AM if avoidable)
- Primary + secondary on-call always
- Handoff document at rotation change — open issues, recent deploys, known risks
- Compensation — on-call pay, time off in lieu, or equivalent
On-Call Handoff Template
```markdown
## On-Call Handoff: [Date]

### Open Issues
- [Issue]: [Status, next steps]

### Recent Changes (last 7 days)
- [Deployment/config change]: [Risk level, rollback plan]

### Known Risks
- [Event/condition]: [What to watch for]

### Scheduled Maintenance
- [When]: [What, duration, rollback plan]

### Runbook Updates
- [Any new/updated runbooks since last rotation]
```
Runbook Template
```yaml
runbook:
  title: ""
  alert_name: ""            # exact alert that triggers this
  last_updated: ""
  owner: ""

  overview: |
    What this alert means in plain English.

  impact: |
    What users/systems are affected and how.

  diagnosis:
    - step: "Check service health"
      command: ""
      expected: ""
      if_unexpected: ""
    - step: "Check recent deployments"
      command: ""
      expected: ""
      if_unexpected: "Rollback: [command]"
    - step: "Check dependencies"
      command: ""
      expected: ""
      if_unexpected: ""

  mitigation:
    - option: "Rollback"
      when: "Recent deployment suspected"
      steps: []
    - option: "Scale up"
      when: "Traffic spike"
      steps: []
    - option: "Failover"
      when: "Single component failure"
      steps: []

  escalation:
    after_minutes: 30
    contact: ""
    context_to_provide: ""
```
Phase 11: Reliability Review & Governance
Weekly SRE Review (30 min)
1. SLO Status (5 min)
   - Budget remaining per service
   - Any burn rate alerts this week?
2. Incident Review (10 min)
   - Incidents this week: count, severity, duration
   - Open postmortem action items: status check
3. On-Call Health (5 min)
   - Pages this week (total, off-hours, false positives)
   - Any on-call feedback?
4. Reliability Work (10 min)
   - Automation shipped this week
   - Toil reduced (hours saved)
   - Chaos experiments run
   - Capacity concerns
Monthly Reliability Report
```yaml
monthly_report:
  period: ""
  slo_summary:
    services_meeting_slo: 0
    services_breaching_slo: 0
    worst_performing: ""
  incidents:
    total: 0
    by_severity: { SEV1: 0, SEV2: 0, SEV3: 0, SEV4: 0 }
    mttr_minutes: 0
    mttd_minutes: 0
    repeat_incidents: 0
  error_budget:
    services_in_healthy: 0
    services_in_warning: 0
    services_in_critical: 0
    services_exhausted: 0
  toil:
    hours_spent: 0
    hours_automated_away: 0
    pct_of_sre_time: 0
  on_call:
    total_pages: 0
    off_hours_pages: 0
    false_positive_pct: 0
    avg_ack_time_min: 0
  action_items:
    open: 0
    completed_this_month: 0
    overdue: 0
  highlights: []
  concerns: []
  next_month_priorities: []
```
Production Readiness Review Checklist
Before any new service goes to production:
| Category | Check | Status |
|---|---|---|
| SLOs | SLIs defined and measured | |
| SLOs | SLO targets set with stakeholder agreement | |
| SLOs | Error budget policy documented | |
| Monitoring | Golden signals dashboarded | |
| Monitoring | Alerting configured with runbooks | |
| Monitoring | Structured logging implemented | |
| Monitoring | Distributed tracing enabled | |
| Incidents | On-call rotation established | |
| Incidents | Escalation paths documented | |
| Incidents | Runbooks for top 5 failure modes | |
| Capacity | Load tested to 2x expected peak | |
| Capacity | Auto-scaling configured and tested | |
| Capacity | Resource limits set (CPU, memory) | |
| Resilience | Graceful degradation implemented | |
| Resilience | Circuit breakers for dependencies | |
| Resilience | Retry with exponential backoff | |
| Resilience | Timeout configured for all external calls | |
| Deploy | Rollback tested and documented | |
| Deploy | Canary/blue-green deployment ready | |
| Deploy | Feature flags for risky features | |
| Security | Authentication and authorization | |
| Security | Secrets in vault (not env vars) | |
| Security | Dependencies scanned | |
| Data | Backup and restore tested | |
| Data | Data retention policy defined | |
| Docs | Architecture diagram current | |
| Docs | API documentation published | |
| Docs | Operational runbook complete |
Phase 12: Advanced Patterns
Self-Healing Automation
```yaml
auto_remediation:
  - trigger: "pod_crash_loop"
    condition: "restart_count > 3 in 10 min"
    action: "Delete pod, let scheduler reschedule"
    escalate_if: "Still crashing after 3 auto-remediations"
  - trigger: "disk_usage_high"
    condition: "disk_usage > 85%"
    action: "Run log cleanup script, archive old data"
    escalate_if: "Still above 85% after cleanup"
  - trigger: "connection_pool_exhausted"
    condition: "available_connections = 0"
    action: "Kill idle connections, increase pool temporarily"
    escalate_if: "Pool exhausted again within 1 hour"
  - trigger: "certificate_expiring"
    condition: "days_until_expiry < 14"
    action: "Trigger cert renewal"
    escalate_if: "Renewal fails"
```
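A sketch of the dispatch loop that would act on such a config: run the mapped remediation when a trigger fires, and escalate to a human once the retry budget is spent. The remediation and escalation callables are placeholders:

```python
from collections import Counter

MAX_ATTEMPTS = 3
attempts = Counter()

def handle(trigger: str, remediations: dict, escalate) -> None:
    """Run the remediation mapped to a trigger; page a human when it keeps recurring."""
    attempts[trigger] += 1
    if attempts[trigger] > MAX_ATTEMPTS:
        escalate(trigger, reason="auto-remediation budget exhausted")
        return
    remediations[trigger]()          # e.g. delete pod, prune logs, rotate cert

# Illustrative wiring
remediations = {"pod_crash_loop": lambda: print("deleting pod, scheduler will reschedule")}
handle("pod_crash_loop", remediations,
       escalate=lambda t, reason: print(f"PAGE: {t} ({reason})"))
```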
Multi-Region Reliability
| Strategy | Complexity | RTO | Cost |
|---|---|---|---|
| Active-passive | Low | Minutes | 1.5x |
| Active-active read | Medium | Seconds | 1.8x |
| Active-active full | High | Near-zero | 2-3x |
| Cell-based | Very high | Per-cell | 2-4x |
Decision guide:
- SLO < 99.9% → Single region with good backups
- SLO 99.9-99.95% → Active-passive with automated failover
- SLO > 99.95% → Active-active (read or full)
- SLO > 99.99% → Cell-based architecture
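The decision guide expressed as a small lookup, with thresholds copied from the list above:

```python
def region_strategy(slo_target: float) -> str:
    """Map an SLO target (as a fraction) to the multi-region strategy suggested above."""
    if slo_target > 0.9999:
        return "cell-based architecture"
    if slo_target > 0.9995:
        return "active-active (read or full)"
    if slo_target >= 0.999:
        return "active-passive with automated failover"
    return "single region with good backups"

print(region_strategy(0.999))    # active-passive with automated failover
print(region_strategy(0.9999))   # active-active (read or full)
```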
Reliability Culture Indicators
Healthy signals:
- Postmortems are blameless and well-attended
- Error budgets are respected (feature freeze actually happens)
- On-call is shared fairly and compensated
- Toil is tracked and reducing quarter-over-quarter
- Chaos experiments happen regularly
- Teams own their reliability (not just SRE)
Warning signs:
- "Hero culture" — same person always saves the day
- Postmortems are blame-focused or skipped
- Error budget exhaustion doesn't change behavior
- On-call is dreaded, same 2 people always paged
- "We'll fix reliability after this feature ships" (always)
- SRE team is just an ops team with a new name
Quality Scoring Rubric (0-100)
| Dimension | Weight | 0-2 | 3-4 | 5 |
|---|---|---|---|---|
| SLO Coverage | 20% | No SLOs | SLOs for critical services | All services with SLOs, error budgets, reviews |
| Monitoring | 15% | Basic health checks | Golden signals + dashboards | Full observability stack + anomaly detection |
| Incident Response | 15% | Ad-hoc, no process | ICS roles, runbooks, postmortems | Structured ICS, blameless culture, action tracking |
| Automation | 15% | Manual everything | CI/CD + some automation | Self-healing, GitOps, <25% toil |
| Chaos Engineering | 10% | None | Staging experiments | Continuous production chaos with safety |
| Capacity Planning | 10% | Reactive | Quarterly forecasting | Predictive, auto-scaling, cost-optimized |
| On-Call Health | 10% | Burnout, hero culture | Fair rotation, <5 pages/shift | Balanced, compensated, <2 pages/shift |
| Documentation | 5% | Nothing written | Runbooks exist | Complete, current, tested runbooks |
Natural Language Commands
- "Assess reliability for [service]" → Run maturity assessment
- "Define SLOs for [service]" → Walk through SLI selection + SLO setting
- "Check error budget for [service]" → Calculate current budget status
- "Start incident for [description]" → Create incident channel, assign IC, begin workflow
- "Write postmortem for [incident]" → Generate structured postmortem
- "Plan chaos experiment for [service]" → Design experiment with hypothesis
- "Audit toil for [team]" → Inventory and prioritize toil
- "Review on-call health" → Analyze page volume, satisfaction, fairness
- "Production readiness review for [service]" → Run full checklist
- "Monthly reliability report" → Generate comprehensive report
- "Design runbook for [alert]" → Create structured runbook
- "Plan capacity for [service] growing at [X%]" → Build capacity model