Claude-skill-registry actionable-alerting-runbook-design
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/actionable-alerting-runbook-design" ~/.claude/skills/majiayu000-claude-skill-registry-actionable-alerting-runbook-design && rm -rf "$T"
manifest: skills/data/actionable-alerting-runbook-design/SKILL.md
Actionable Alerting and Runbook Design
This skill provides expertise in designing alerts and runbooks for effective incident response.
Overview
Good alerting enables quick incident detection and resolution. Bad alerting causes alert fatigue, which leads responders to miss real issues.
Alerting Principles
What Makes an Alert Actionable?
- Specific: Clear about what's wrong
- Contextual: Includes relevant information
- Timely: Fires before users notice
- Actionable: Recipient can do something about it
- Linked: Points to runbook or dashboard
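The properties above can be turned into a simple pre-flight check on alert definitions. The following is a minimal sketch, not a prescribed schema: the `Alert` class, its field names, and `missing_properties` are hypothetical and only illustrate the idea. Timeliness depends on thresholds and evaluation intervals, so it is not something a static check like this can verify.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert payload; field names are illustrative assumptions."""
    title: str               # Specific: states what is wrong
    context: dict            # Contextual: service, region, current value, ...
    owner: str               # Actionable: who is expected to act on it
    runbook_url: str = ""    # Linked: remediation steps
    dashboard_url: str = ""  # Linked: supporting graphs


def missing_properties(alert: Alert) -> list:
    """Return the names of the actionable-alert properties the alert lacks."""
    problems = []
    if not alert.title.strip():
        problems.append("specific")
    if not alert.context:
        problems.append("contextual")
    if not alert.owner.strip():
        problems.append("actionable")
    if not (alert.runbook_url or alert.dashboard_url):
        problems.append("linked")
    return problems
```

A check like this could run in code review or CI so that an alert without an owner or a runbook link is caught before it ever pages anyone.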
Alert Anti-Patterns
- Flapping alerts: Constantly firing and resolving
- Too sensitive: Alerts on normal variance
- No runbook: Alert with no remediation guidance
- Wrong audience: Alerting people who can't help
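Flapping is the easiest of these anti-patterns to detect mechanically. As a rough sketch, assuming the alert's firing/resolved state is sampled at regular intervals, one can count state changes over a window. The function names and the default limit of 4 transitions are assumptions chosen for illustration, not recommended values.

```python
def count_transitions(states: list) -> int:
    """Count how often the alert switched between firing (True) and resolved (False)."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def is_flapping(states: list, max_transitions: int = 4) -> bool:
    """Treat the alert as flapping if it changed state more than max_transitions times.

    max_transitions is a hypothetical tuning knob; pick it per alert and window.
    """
    return count_transitions(states) > max_transitions
```

An alert flagged this way is usually a candidate for a longer evaluation period or a wider threshold rather than for silencing.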
Runbook Structure
# Alert: High API Error Rate

## Summary
API error rate exceeds 5% for 5 minutes

## Impact
Users experiencing failed requests

## Diagnosis Steps
1. Check error logs: [link]
2. Check recent deployments: [link]
3. Check database health: [link]

## Remediation Steps
1. If recent deployment, rollback: `kubectl rollout undo...`
2. If database issue, scale: `gcloud sql instances patch...`
3. If unknown, escalate to: @team-leads

## Escalation
- L1: On-call engineer
- L2: Team lead (if not resolved in 15min)
- L3: VP Engineering (if customer impact > 30min)
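To keep runbooks consistent, the section names used in the template above can be checked automatically. This is a minimal sketch that assumes runbooks are stored as Markdown with one `## ` heading per section; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names introduced here.

```python
# Section names taken from the runbook template; adjust to your own template.
REQUIRED_SECTIONS = [
    "Summary",
    "Impact",
    "Diagnosis Steps",
    "Remediation Steps",
    "Escalation",
]


def missing_sections(runbook_markdown: str) -> list:
    """Return the required sections that have no '## ' heading in the runbook."""
    headings = {
        line[3:].strip()
        for line in runbook_markdown.splitlines()
        if line.startswith("## ")
    }
    return [name for name in REQUIRED_SECTIONS if name not in headings]
```

Running this over every runbook in a repository gives a quick list of documents that are missing, for example, an escalation path.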
Best Practices
- Alert on symptoms, not causes
- Use multi-window alerting (require the condition to hold over both a short and a long window) to reduce noise
- Include dashboards and runbook links in alerts
- Review and prune alerts quarterly
- Track the alert-to-incident ratio (alerts fired per real incident) to measure noise
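Two of the practices above lend themselves to a short illustration. The sketch below assumes error and request counts are already available for a short and a long window (for example 5 minutes and 1 hour; both sizes are assumptions) and reuses the 5% threshold from the runbook example. All function names are hypothetical.

```python
from typing import Optional, Tuple


def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def should_alert(
    short_window: Tuple[int, int],
    long_window: Tuple[int, int],
    threshold: float = 0.05,
) -> bool:
    """Fire only if the error rate exceeds the threshold in BOTH windows.

    Each window is an (errors, requests) pair. The short window confirms the
    problem is happening now; the long window filters out brief spikes.
    """
    return (
        error_rate(*short_window) > threshold
        and error_rate(*long_window) > threshold
    )


def alert_to_incident_ratio(alerts_fired: int, incidents: int) -> Optional[float]:
    """Alerts fired per real incident; None when there were no incidents."""
    if incidents == 0:
        return None
    return alerts_fired / incidents
```

For instance, `should_alert((10, 100), (10, 1000))` is `False`: the short window shows 10% errors, but the long window shows only 1%, so the spike is not treated as sustained. A ratio that climbs between quarterly reviews suggests alerts worth pruning.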