# claude-ops ops-fires

Production incidents dashboard. Reads ECS health, Sentry errors, and CI failures, then offers to dispatch fix agents for active fires.

## Install

Source: clone the upstream repo.

```sh
git clone https://github.com/Lifecycle-Innovations-Limited/claude-ops
```

Claude Code: install into `~/.claude/skills/`.

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/Lifecycle-Innovations-Limited/claude-ops "$T" && mkdir -p ~/.claude/skills && cp -r "$T/claude-ops/skills/ops-fires" ~/.claude/skills/lifecycle-innovations-limited-claude-ops-ops-fires && rm -rf "$T"
```
Manifest: `claude-ops/skills/ops-fires/SKILL.md`

Source content:

## OPS ► FIRES

### Runtime Context

Before executing, load available context:

1. Daemon health: read `${CLAUDE_PLUGIN_DATA_DIR:-$HOME/.claude/plugins/data/ops-ops-marketplace}/daemon-health.json`.
   - Check the `infra-monitor` service status; if it is not running, pre-gathered infra data may be stale.
   - If `action_needed` is not null → surface it immediately as a potential fire.
2. Secrets: AWS credentials are required for ECS/CloudWatch queries. Resolution order:
   - First: check the `$AWS_ACCESS_KEY_ID` / `$AWS_PROFILE` env vars.
   - Then: `doppler secrets get AWS_ACCESS_KEY_ID --plain` (if `doppler` is configured in prefs).
   - Then: use `password_manager_config.query_cmd` from preferences.
   - Sentry token: `$SENTRY_AUTH_TOKEN` → Doppler `SENTRY_AUTH_TOKEN` → vault.
3. Preferences: read `${CLAUDE_PLUGIN_DATA_DIR}/preferences.json` for the `secrets_manager` config to know which vault to query.
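
A minimal shell sketch of the credential fallback chain above. The `resolve_aws_credentials` helper and the `jq` path into `preferences.json` are illustrative assumptions; only the resolution order comes from the steps listed here.

```sh
# Illustrative sketch only: the helper name and preferences.json shape are assumptions.
resolve_aws_credentials() {
  # 1. Environment variables win.
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] || [ -n "${AWS_PROFILE:-}" ]; then
    echo "using env credentials" >&2
    return 0
  fi
  # 2. Doppler, if the CLI is available (the prefs check is omitted here).
  if command -v doppler >/dev/null 2>&1; then
    AWS_ACCESS_KEY_ID=$(doppler secrets get AWS_ACCESS_KEY_ID --plain 2>/dev/null)
    [ -n "$AWS_ACCESS_KEY_ID" ] && export AWS_ACCESS_KEY_ID && return 0
  fi
  # 3. Fall back to the password manager's query command from preferences.
  prefs="${CLAUDE_PLUGIN_DATA_DIR}/preferences.json"
  query_cmd=$(jq -r '.password_manager_config.query_cmd // empty' "$prefs" 2>/dev/null)
  if [ -n "$query_cmd" ]; then
    AWS_ACCESS_KEY_ID=$(eval "$query_cmd AWS_ACCESS_KEY_ID") && export AWS_ACCESS_KEY_ID
  fi
}
```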

### CLI/API Reference

#### aws CLI

| Command | Usage | Output |
| --- | --- | --- |
| `aws ecs list-services --cluster <name> --query 'serviceArns'` | ECS services | ARN list |
| `aws ecs describe-services --cluster <name> --services <arn> --query 'services[0].{status:status,running:runningCount,desired:desiredCount}'` | Service health | JSON |
| `aws logs tail /ecs/<service> --since 1h --format short` | ECS logs | Log lines (use with Monitor for live) |
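
A sketch combining the two ECS commands above to flag services running below their desired count; the cluster name `prod` is a placeholder assumption.

```sh
# Sketch (cluster name "prod" assumed): list services whose running count
# is below desired, using the same list/describe queries as the table above.
aws ecs list-services --cluster prod --query 'serviceArns' --output text \
  | tr '\t' '\n' \
  | while read -r arn; do
      [ -n "$arn" ] || continue
      aws ecs describe-services --cluster prod --services "$arn" \
        --query 'services[0].{name:serviceName,status:status,running:runningCount,desired:desiredCount}' \
        --output json
    done \
  | jq -s 'map(select(.running < .desired))'
```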

#### gh CLI (GitHub)

| Command | Usage | Output |
| --- | --- | --- |
| `gh run list --limit 20 --json status,conclusion,name,headBranch,createdAt` | Recent CI runs | JSON array |
| `gh run view <id> --repo <repo> --log-failed` | Failed CI logs | Log output |
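
For example, a sketch filtering that JSON down to failed runs on `main` (the branch name is an assumption):

```sh
# Sketch: keep only failed runs on main (branch name assumed).
gh run list --limit 20 --json status,conclusion,name,headBranch,createdAt \
  | jq '[.[] | select(.conclusion == "failure" and .headBranch == "main")]'
```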

#### sentry-cli / Sentry API

| Command | Usage | Output |
| --- | --- | --- |
| `sentry-cli issues list --project <slug> --status unresolved` | Unresolved issues | Issue list |
| `curl -H "Authorization: Bearer $SENTRY_AUTH_TOKEN" "https://sentry.io/api/0/projects/<org>/<proj>/issues/?query=is:unresolved"` | API fallback when MCP unavailable | JSON array |
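
A sketch of parsing the API fallback into the dashboard's "top errors" fields; `$SENTRY_ORG` and `$SENTRY_PROJECT` are assumed env vars standing in for `<org>`/`<proj>`.

```sh
# Sketch: top five unresolved issues as title / count / first-seen rows.
# SENTRY_ORG and SENTRY_PROJECT are assumed to be set in the environment.
curl -s -H "Authorization: Bearer $SENTRY_AUTH_TOKEN" \
  "https://sentry.io/api/0/projects/$SENTRY_ORG/$SENTRY_PROJECT/issues/?query=is:unresolved" \
  | jq -r '.[:5][] | [.title, .count, .firstSeen] | @tsv'
```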

### Agent Teams support

If `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1` is set, use Agent Teams when dispatching multiple fix agents simultaneously. This enables:

- Shared findings between fix agents (e.g., the API agent discovers the DB is the root cause → the infra agent pivots to a DB fix)
- Prioritization: "CRITICAL ECS issue first, then CI failures"
- Real-time progress: agents report as they find root causes, so fixes can be merged in optimal order

Team setup (only when the flag is enabled, during the dispatch phase):

```
TeamCreate("fire-fixers")
Agent(team_name="fire-fixers", name="fix-[service]", ...)
```

If the flag is NOT set, use standard parallel subagents.

### Pre-gathered infrastructure data

```sh
${CLAUDE_PLUGIN_ROOT}/bin/ops-infra 2>/dev/null || echo '{"clusters":[],"error":"infra check failed"}'
```
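
A sketch of pulling unhealthy services out of this snapshot. The `clusters[].services[]` shape is an assumption about `ops-infra` output; only the top-level `clusters` key is shown in the fallback above.

```sh
# Sketch: the services[] shape under clusters[] is assumed, not documented.
${CLAUDE_PLUGIN_ROOT}/bin/ops-infra 2>/dev/null \
  | jq '[.clusters[]? | .services[]? | select((.running // 0) < (.desired // 0))]'
```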

### CI failures (last 24h)

```sh
${CLAUDE_PLUGIN_ROOT}/bin/ops-ci 2>/dev/null || echo '[]'
```

### External projects health

```sh
${CLAUDE_PLUGIN_ROOT}/bin/ops-external 2>/dev/null || echo '[]'
```

### Your task

Analyze the pre-gathered data, including external projects. Then run parallel checks:

1. ECS health: parse the infra data for unhealthy services, stopped tasks, and failed deployments.
2. Sentry: if the Sentry MCP is connected, query recent unresolved errors; otherwise note that it is unavailable.
3. CI: parse the CI data for failing pipelines and broken main/dev branches.
4. GitHub Actions: `gh run list --limit 20 --json status,conclusion,name,headBranch,createdAt 2>/dev/null`
5. External projects: parse the ops-external data. Flag `auth_expired` as HIGH (credential rotation needed), `unreachable`/`degraded` as MEDIUM, and `not_configured` as LOW (a mapping sketch follows this list).
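
A sketch of that status-to-severity mapping in `jq`; the `status` field name is an assumption about `ops-external` output.

```sh
# Sketch: the .status field name is assumed; severities follow the rules above.
${CLAUDE_PLUGIN_ROOT}/bin/ops-external 2>/dev/null \
  | jq 'map(. + {severity:
      (if .status == "auth_expired" then "HIGH"
       elif .status == "unreachable" or .status == "degraded" then "MEDIUM"
       elif .status == "not_configured" then "LOW"
       else "OK" end)})'
```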

Classify each issue by severity:

| Severity | Criteria |
| --- | --- |
| CRITICAL | Service down, DB unreachable, auth broken |
| HIGH | Elevated error rate, deploy stuck, CI main broken |
| MEDIUM | Non-critical service degraded, flaky tests |
| LOW | Warning-level, non-urgent |

### Output format

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 OPS ► FIRES DASHBOARD — [timestamp]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

CRITICAL
[service] — [issue] — [since]

HIGH
[service] — [issue] — [since]

MEDIUM
[service] — [issue] — [since]

ECS HEALTH
[cluster] [service] [desired/running] [status]

CI STATUS
[repo] [branch] [workflow] [status] [last run]

SENTRY (top errors, 24h)
[error] [count] [first seen] [project]

EXTERNAL PROJECTS
[alias] [source] [status] [details — e.g. auth_expired, unreachable]

──────────────────────────────────────────────────────
```

Use batched AskUserQuestion calls (max 4 options each). Only show relevant actions (e.g., skip dispatch options if no issues were found).

AskUserQuestion call 1:

- [Dispatch fix agent for [top critical issue]]
- [Dispatch fix agent for [second issue]]
- [View logs for [service]]
- [More...]

AskUserQuestion call 2 (only if "More..." was chosen):

- [Open Sentry dashboard]
- [Open GitHub Actions]
- [All clear — nothing to do]

If no fires: show "ALL SYSTEMS OPERATIONAL" with last-checked timestamps.


### Dispatch fix agent

When the user selects an issue to fix, use `AskUserQuestion` to confirm the scope before dispatching:

```
Dispatch fix agent for: [issue title]
  Severity: [CRITICAL/HIGH/MEDIUM]
  Repo: [repo]
  Error: [brief description]

  The agent will:
  - Investigate root cause in [repo]
  - Create feature branch with fix
  - Open PR for review

  [Dispatch agent]  [Show me the logs first]  [Skip — I'll fix manually]
```

On confirmation, spawn an Agent with:

- The error details and logs
- Access to the relevant repo
- An instruction to create a feature branch, apply the fix, and open a PR
- An instruction to report back when done or blocked

Use the `agents/infra-monitor.md` agent definition for infra issues.

If `$ARGUMENTS` contains a project alias, filter to that project's services only.


### Native tool usage

#### Monitor: live service health

Use `Monitor` to stream ECS task logs or GitHub Actions runs when investigating fires:

```
Monitor(command: "aws logs tail /ecs/<service> --follow --since 5m")
```

#### Tasks: incident tracking

Use `TaskCreate` for each active fire, and update it with `TaskUpdate` as fires are investigated/fixed/escalated.

#### WebFetch: status pages

When diagnosing fires, use `WebFetch` to check the AWS status page (https://health.aws.amazon.com/health/status), Vercel status, or third-party API status pages.

#### WebSearch: known outage patterns

Use `WebSearch` to check whether the error pattern matches a known AWS/infrastructure issue (e.g., "ECS task stopped CannotPullContainerError" → known ECR throttling).