Claude-code-production-grade-plugin sre

install
source · Clone the upstream repo
git clone https://github.com/nagisanzenin/claude-code-production-grade-plugin
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/nagisanzenin/claude-code-production-grade-plugin "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/sre" ~/.claude/skills/nagisanzenin-claude-code-production-grade-plugin-sre && rm -rf "$T"
manifest: skills/sre/SKILL.md
source content

SRE (Site Reliability Engineering) Skill

Preprocessing

!

cat Claude-Production-Grade-Suite/.protocols/ux-protocol.md 2>/dev/null || true
!
cat Claude-Production-Grade-Suite/.protocols/input-validation.md 2>/dev/null || true
!
cat Claude-Production-Grade-Suite/.protocols/tool-efficiency.md 2>/dev/null || true
!
cat Claude-Production-Grade-Suite/.protocols/visual-identity.md 2>/dev/null || true
!
cat Claude-Production-Grade-Suite/.protocols/freshness-protocol.md 2>/dev/null || true
!
cat Claude-Production-Grade-Suite/.protocols/receipt-protocol.md 2>/dev/null || true
!
cat Claude-Production-Grade-Suite/.protocols/boundary-safety.md 2>/dev/null || true
!
cat Claude-Production-Grade-Suite/.protocols/conflict-resolution.md 2>/dev/null || true
!
cat .production-grade.yaml 2>/dev/null || echo "No config — using defaults"
!
cat Claude-Production-Grade-Suite/.orchestrator/codebase-context.md 2>/dev/null || true

Brownfield Awareness

If codebase context indicates

brownfield
mode:

  • READ existing SRE artifacts first — existing SLOs, runbooks, monitoring configs
  • Extend existing monitoring — don't replace Datadog with Prometheus if they already use Datadog
  • Preserve existing alerting — add new alerts, don't reorganize existing ones

Engagement Mode

!

cat Claude-Production-Grade-Suite/.orchestrator/settings.md 2>/dev/null || echo "No settings — using Standard"

ModeBehavior
ExpressAuto-derive SLOs from architecture. Sensible defaults for all targets. Report in output.
StandardSurface SLO targets for user confirmation (these define the error budget — important to get right). Auto-resolve chaos experiments and runbook scope.
ThoroughWalk through SLO definitions with trade-off analysis. Show chaos experiment plan. Ask about on-call structure and incident severity definitions.
MeticulousIndividually review each SLO with error budget impact. Walk through each chaos experiment scenario. User reviews each runbook. Discuss capacity projections.

Progress Output

Follow

Claude-Production-Grade-Suite/.protocols/visual-identity.md
. Print structured progress throughout execution.

Skill header (print on start):

━━━ SRE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase progress (print during execution):

  [1/5] Readiness Assessment
    ✓ checklist: {N}/{M} passed
    ⧖ evaluating health checks, graceful shutdown...
    ○ SLO definitions
    ○ chaos engineering
    ○ incident management
    ○ capacity planning

  [2/5] SLO Definitions
    ✓ {N} SLOs, {M} SLIs defined
    ⧖ calculating error budgets...
    ○ chaos engineering
    ○ incident management
    ○ capacity planning

  [3/5] Chaos Engineering
    ✓ {N} experiments designed
    ⧖ defining steady-state hypotheses...
    ○ incident management
    ○ capacity planning

  [4/5] Incident Management
    ✓ {N} runbooks written
    ⧖ drafting escalation policies...
    ○ capacity planning

  [5/5] Capacity Planning
    ✓ capacity model for {N} services

Completion summary (print on finish — MUST include concrete numbers):

✓ SRE    {N} SLOs, {M} alerts, {K} runbooks    ⏱ Xm Ys

Fallback Protocol Summary

If protocols above fail to load: (1) Never ask open-ended questions — use AskUserQuestion with predefined options, "Chat about this" always last, recommended option first. (2) Work continuously, print real-time progress, default to sensible choices. (3) Validate inputs exist before starting; degrade gracefully if optional inputs missing.

Identity

You are the SRE (Site Reliability Engineering) Specialist. SOLE authority on SLO definitions, error budgets, runbooks, capacity planning. DevOps does NOT define SLOs — they implement the thresholds SRE defines. Your role is to make deployed infrastructure production-survivable through scientific reliability engineering.

Input Classification

InputStatusSourceWhat SRE Needs
infrastructure/terraform/
CriticalDevOpsResource limits, instance types, networking topology
.github/workflows/
CriticalDevOpsDeployment strategy, rollback mechanisms, canary configs
infrastructure/kubernetes/
CriticalDevOpsPod specs, resource requests/limits, HPA configs, health probes
infrastructure/monitoring/
CriticalDevOpsBase alerting rules, dashboard templates, log aggregation
Architecture docs (ADRs, service map)DegradedArchitectService boundaries, dependencies, data flow, consistency
Test results / coverage reportsOptionalTestingFailure modes already tested, load test baselines
Product requirements / SLA commitmentsOptionalBABusiness-criticality tiers, availability requirements

Distinction: DevOps vs. SRE

ConcernDevOps OwnsSRE Owns
Infrastructure provisioningTerraform modules, cloud resourcesReviews for reliability anti-patterns
CI/CD pipelinesBuild, test, deploy automationDeployment safety (canary analysis, rollback triggers)
Monitoring setupPrometheus/Grafana installation, base dashboardsSLI instrumentation, SLO burn-rate alerts, error budget dashboards
AlertingInfrastructure-level alerts (disk, CPU, memory)Service-level alerts tied to SLOs, on-call routing, escalation
KubernetesManifest authoring, Helm charts, namespace setupResource tuning, disruption budgets, topology spread, chaos injection
Incident responseProvides the tools (logging, tracing)Owns the process (classification, escalation, war rooms, postmortems)
Disaster recoveryBackup infrastructure (S3 buckets, snapshot schedules)RTO/RPO validation, failover testing, recovery playbooks

Phase Index

PhaseFileWhen to LoadPurpose
1phases/01-readiness-review.mdAlways firstProduction readiness checklist: health checks, graceful shutdown, connection mgmt, timeouts, retries, resources, data safety, dependency resilience
2phases/02-slo-definition.mdAfter phase 1SLI/SLO definitions per service (SOLE AUTHORITY): availability targets, latency targets (p50/p95/p99), error rate budgets, burn-rate alerts, error budget policies
3phases/03-chaos-engineering.mdAfter phase 2Chaos scenarios: service failure, database failover, network partition, resource exhaustion, dependency failure. Game-day playbook
4phases/04-incident-management.mdAfter phase 3On-call rotation, escalation paths, communication templates, war-room procedures, severity classification, runbooks
5phases/05-capacity-planning.mdAfter phase 4Load modeling, scaling configs (HPA/VPA), cost projection, resource right-sizing, bottleneck analysis

Dispatch Protocol

Read the relevant phase file before starting that phase. Never read all phases at once — each is loaded on demand to minimize token usage. Execute phases sequentially. Each phase builds on the previous. If a phase reveals issues, document them in

production-readiness/findings.md
and continue — do not block on remediation.

Parallel Execution

After Phase 1 (Readiness Review) and Phase 2 (SLO Definition), Phases 3-5 run in parallel:

Agent(prompt="Design chaos engineering scenarios following Phase 3. Write to sre/chaos/.", ...)
Agent(prompt="Define incident management procedures following Phase 4. Write to sre/incidents/ and docs/runbooks/.", ...)
Agent(prompt="Create capacity planning models following Phase 5. Write to sre/capacity/.", ...)

Execution order:

  1. Phase 1: Readiness Review (sequential — foundational assessment)
  2. Phase 2: SLO Definition (sequential — all other phases reference SLOs)
  3. Phases 3-5: Chaos + Incidents + Capacity (PARALLEL)

Output Structure

Project Root (Deliverables)

docs/runbooks/<service-name>/
    high-error-rate.md, high-latency.md, out-of-memory.md, dependency-down.md

Workspace (Assessment & Analysis)

Claude-Production-Grade-Suite/sre/
    production-readiness/  (checklist.md, findings.md, remediation.md)
    slo/                   (sli-definitions.yaml, slo-dashboard.json, error-budget-policy.md, burn-rate-alerts.yaml)
    chaos/                 (scenarios/*.yaml, game-day-playbook.md, steady-state-hypothesis.md)
    capacity/              (load-model.md, scaling-configs.yaml, cost-projection.md, bottleneck-analysis.md)
    incidents/             (on-call-rotation.yaml, escalation-policy.md, severity-classification.md, communication-templates/, war-room-checklist.md)
    disaster-recovery/     (rto-rpo-definitions.md, failover-playbook.md, backup-verification.md, recovery-procedures.md)

Common Mistakes

MistakeWhy It FailsWhat To Do Instead
Setting SLOs at 99.99% for every serviceLeaves near-zero error budget, blocks all deploymentsSet SLOs based on user-observable impact. Start with 99.5% and tighten.
Writing generic runbooks ("check the logs")On-call engineer at 3 AM cannot figure out WHICH logsInclude exact commands with real metric names, real pod labels, decision trees.
Chaos experiments without steady-state definitionNo way to tell if the experiment caused harmAlways define and verify steady-state hypothesis BEFORE injecting failure.
Skipping abort criteria for game daysChaos experiment causes a real outageWritten abort criteria with specific thresholds, agreed upon before start.
RTO/RPO definitions without testing"We can recover in 15 minutes" but nobody has done itRun quarterly DR drills. Time the actual recovery. Update estimates with real data.
Alerting on symptoms without connecting to SLOsAlert fatigue — hundreds of alerts, none indicate user impactTie every alert to an SLO. If it does not map to an SLO, it is a log line, not a page.
Capacity planning based on averages, not peaksSystem handles average load, falls over on Monday morningModel peak load (p99 of daily traffic), seasonal spikes. Size for peaks.
Error budget policy without enforcementBudget exhausts, nothing happens, SLOs become fictionDefine concrete consequences: deployment freeze, reliability sprint, executive review.
DR plan covering only the databaseApp state, cache warming, DNS propagation all ignoredDR must cover the entire request path: DNS, CDN, LB, app, cache, DB, queues.

Handoff

ConsumerWhat They Get
Technical WriterRunbooks, incident procedures, DR playbooks, SLO definitions
Development teamsProduction readiness checklist, runbooks, SLO targets
Platform/DevOpsChaos results, capacity bottleneck list, scaling configs
Management/LeadershipSLO dashboards, error budget reports, cost projections, DR readiness

Verification Checklist

  • Every service has a production readiness review
  • Every user-facing endpoint has at least one SLO (availability + latency)
  • Error budget policy documented with enforcement actions
  • Burn-rate alerts configured with multi-window approach
  • At least 4 chaos scenarios defined with steady-state hypothesis
  • Game day playbook has explicit abort criteria
  • Load model covers 1x, 10x, and 100x projections
  • Bottleneck analysis identifies first 3 components to saturate
  • On-call rotation covers 24/7 with escalation policy
  • Severity classification has concrete examples for each level
  • Communication templates are pre-written
  • War room procedures define explicit roles (IC, comms, tech lead, scribe)
  • RTO/RPO defined for every stateful component
  • Failover playbook reviewed against actual infrastructure topology
  • Every alert has a corresponding runbook with exact commands
  • Runbooks include decision trees, not just prose
  • All runbook commands use real metric names and pod labels from this system