Marketplace operating-production-services
install
source · Clone the upstream repo
git clone https://github.com/aiskillstore/marketplace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiskillstore/marketplace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/asmayaseen/operating-production-services" ~/.claude/skills/aiskillstore-marketplace-operating-production-services && rm -rf "$T"
manifest: skills/asmayaseen/operating-production-services/SKILL.md
Operating Production Services
Production reliability patterns: measure what matters, learn from failures, improve systematically.
Quick Reference
| Need | Go To |
|---|---|
| Define reliability targets | SLOs & Error Budgets |
| Write incident report | Postmortem Templates |
| Set up SLO alerting | references/slo-alerting.md |
SLOs & Error Budgets
The Hierarchy
SLA (Contract) → SLO (Target) → SLI (Measurement)
Common SLIs
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
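Both SLIs above are ratios of "good" events to total events. As a minimal sketch of the same arithmetic outside Prometheus, assuming you already have raw request counts for the window (the function names and sample numbers here are hypothetical, not part of this skill):

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as fully available
    return (total_requests - server_errors) / total_requests


def latency_sli(fast_requests: int, total_requests: int) -> float:
    """Fraction of requests served at or under the latency threshold."""
    if total_requests == 0:
        return 1.0  # same no-traffic assumption as above
    return fast_requests / total_requests


# Illustrative counts for a 28-day window (made-up numbers).
availability = availability_sli(total_requests=1_000_000, server_errors=700)
latency = latency_sli(fast_requests=985_000, total_requests=1_000_000)
print(f"availability={availability:.4%} latency={latency:.2%}")
```

How to treat an empty window is a policy choice; returning 1.0 keeps quiet periods from counting against the SLO.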
SLO Targets Reality Check
| SLO % | Downtime/Month (30 days) | Downtime/Year (365 days) |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |
Don't aim for 100%. Each additional nine cuts the allowed downtime tenfold, and the cost of achieving it typically rises far faster than the benefit.
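The downtime figures in the table follow from one multiplication: window length times (1 - SLO). A small sketch that reproduces them, assuming a 30-day month and a 365-day year (the helper name `allowed_downtime` is hypothetical):

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime an SLO permits over the given window."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    unavailable_fraction = 1 - slo_percent / 100
    return window * unavailable_fraction


MONTH = timedelta(days=30)   # assumption: 30-day month
YEAR = timedelta(days=365)   # assumption: non-leap year

for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime(slo, MONTH).total_seconds() / 60
    per_year = allowed_downtime(slo, YEAR).total_seconds() / 3600
    print(f"{slo}% -> {per_month:.1f} min/month, {per_year:.2f} h/year")
```

Swapping in a different window, for example `timedelta(days=28)` to match the PromQL range, gives the budget for that window instead.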
Error Budget
Error Budget = 1 - SLO Target
Example: a 99.9% SLO leaves a 0.1% error budget, which is about 43 minutes of downtime in a 30-day month.
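The same budget can be tracked in requests rather than minutes. The sketch below is one possible way to compute how much budget is left, assuming a request-based SLI; the function name and sample counts are illustrative only:

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    slo_target is a fraction, e.g. 0.999 for a 99.9% SLO.
    """
    budget_fraction = 1 - slo_target            # Error Budget = 1 - SLO Target
    allowed_failures = budget_fraction * total  # failures the budget tolerates
    if allowed_failures == 0:
        # No budget at all (100% SLO or no traffic): any failure exhausts it.
        return 0.0 if failed else 1.0
    return 1 - failed / allowed_failures


# 1,000,000 requests at a 99.9% SLO tolerate about 1,000 failures.
remaining = error_budget_remaining(slo_target=0.999, total=1_000_000, failed=400)
print(f"{remaining:.0%} of the error budget remains")
```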
Policy:
| Budget Remaining | Action |
|---|---|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |
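The policy table can be encoded as a simple lookup so that a deploy pipeline or dashboard applies it consistently. This is a sketch only: the function name is hypothetical, and treating exactly 10% and exactly 50% as part of the "10-50%" row is an assumption, since the table leaves those boundaries open.

```python
def budget_policy(remaining_percent: float) -> str:
    """Map remaining error budget (in percent) to the action from the policy table."""
    if remaining_percent <= 0:
        return "Feature freeze, fix reliability"
    if remaining_percent < 10:
        return "Freeze non-critical changes"
    if remaining_percent <= 50:  # assumption: 10 and 50 both fall in the 10-50% row
        return "Postpone risky changes"
    return "Normal velocity"


for pct in (80, 35, 5, 0):
    print(f"{pct}% remaining -> {budget_policy(pct)}")
```

An overspent budget (a negative percentage) is treated the same as 0% here.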
See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
Postmortem Templates
The Blameless Principle
| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
When to Write Postmortems
- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
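If incidents are tracked in a tool, the criteria above can be checked automatically when an incident closes. A minimal sketch, assuming a hypothetical `Incident` record whose fields are invented here for illustration:

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    severity: str                       # e.g. "SEV1", "SEV2", "SEV3"
    customer_outage_minutes: float = 0  # customer-facing outage duration
    data_loss: bool = False
    security_incident: bool = False
    severe_near_miss: bool = False
    novel_failure_mode: bool = False


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return every criterion that calls for a postmortem (empty list if none)."""
    reasons = []
    if incident.severity.upper() in {"SEV1", "SEV2"}:
        reasons.append("SEV1/SEV2 incident")
    if incident.customer_outage_minutes > 15:
        reasons.append("customer-facing outage > 15 minutes")
    if incident.data_loss or incident.security_incident:
        reasons.append("data loss or security incident")
    if incident.severe_near_miss:
        reasons.append("near-miss that could have been severe")
    if incident.novel_failure_mode:
        reasons.append("novel failure mode")
    return reasons


minor = Incident(severity="SEV3", customer_outage_minutes=20)
print(postmortem_reasons(minor))
```

Returning the list of matched reasons, rather than a bare boolean, lets the postmortem record why it was triggered.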
Standard Template
# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX

## Executive Summary

One paragraph: what happened, impact, root cause, resolution.

## Timeline (UTC)

| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

## Root Cause Analysis

### 5 Whys

1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

## Impact

- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X

## Action Items

| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
Quick Template (Minor Incidents)
# Quick Postmortem: [Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3

## What Happened

One sentence description.

## Timeline

- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution

## Root Cause

One sentence.

## Fix

- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
Postmortem Meeting Guide
Structure (60 min)
- Opening (5 min) - Remind: "We're here to learn, not blame"
- Timeline (15 min) - Walk through events chronologically
- Analysis (20 min) - What failed? Why? What allowed it?
- Action Items (15 min) - Prioritize, assign owners, set dates
- Closing (5 min) - Summarize learnings, confirm owners
Facilitation Tips
- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants
Anti-Patterns
| Don't | Do Instead |
|---|---|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
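The "owner + date + ticket" rule in the table is easy to check mechanically before a postmortem is marked complete. A sketch under the assumption that action items are held as simple records (the `ActionItem` class and its fields are hypothetical, modeled on the Action Items table in the standard template):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical action item mirroring the template's table columns."""
    priority: str
    action: str
    owner: Optional[str] = None
    due: Optional[str] = None
    ticket: Optional[str] = None


def missing_fields(item: ActionItem) -> list[str]:
    """Return which of owner, due, and ticket are empty or unset."""
    return [name for name in ("owner", "due", "ticket") if not getattr(item, name)]


items = [
    ActionItem("P0", "Roll back bad config", owner="@name", due="YYYY-MM-DD", ticket="XXX-123"),
    ActionItem("P1", "Add config validation", owner="@name"),
]

for item in items:
    gaps = missing_fields(item)
    status = "ok" if not gaps else "orphaned, missing: " + ", ".join(gaps)
    print(f"[{item.priority}] {item.action}: {status}")
```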
Verification
Run:
python scripts/verify.py
References
- references/slo-alerting.md - Prometheus rules, burn rate alerts, Grafana dashboards