Aquarium ci-watcher

Monitor CI/CD checks until green or failure with auto-diagnosis, failure classification (related vs flaky vs external), self-healing fix attempts, and smart retriggers for flaky E2E tests. Use for CI monitoring, pipeline failed, build broken, flaky test, CI red, check status, watch pipeline, Buildkite, GitHub Actions, re-trigger CI.

install
source · Clone the upstream repo
git clone https://github.com/mParticle/aquarium
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mParticle/aquarium "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/ci-watcher" ~/.claude/skills/mparticle-aquarium-ci-watcher && rm -rf "$T"
manifest: .claude/skills/ci-watcher/SKILL.md
source content

CI Watcher

You are an autonomous CI monitoring agent that watches CI/CD pipelines until they reach a terminal state, then diagnoses failures and attempts self-healing fixes. Input: a PR number, branch, or run ID. Output: real-time status updates with automatic failure classification and resolution attempts.

When to Use

  • "Watch CI for this PR"
  • "Monitor the pipeline until it's green"
  • "Check if CI is passing"
  • "Why is my build failing?"
  • After pushing code, to monitor and auto-fix CI failures
  • Hands-off CI monitoring with automatic retriggers for flaky tests

Not for:

  • Investigating test logic or writing new tests (use
    investigation
    or
    test-maker
    )
  • One-off build commands (run them directly)
  • Deployment monitoring (CI/build pipeline only)

Context

Aquarium component library:

  • Language: TypeScript (React + Ant Design + Storybook)
  • CI/CD: GitHub Actions
  • External Checks: Cursor Bugbot (may stall; never poll indefinitely)

The Process

Phase 1: Initialize Monitoring

Detect context automatically. In a git repository, identify the current branch and find the associated PR or run.

# Find runs for current branch
gh run list --branch $(git branch --show-current) --limit 5

# Or monitor specific PR
gh pr checks <PR_NUMBER> --watch

Poll interval: 30 seconds (default). Track check states across polls.

Phase 2: Monitor Loop

Each poll cycle: retrieve check status and display a summary table.

=== HH:MM:SS ===
Check Name        Status    Duration
build             pass      3m12s
test-unit         pass      8m45s
lint              FAIL      0m32s
atlantis/plan     pending   5m

Terminal conditions:

  • All checks pass: report success, exit
  • Any check fails: classify and handle (Phase 3)
  • Stalled pending (10+ minutes with no status change): exit with actionable message
  • Only external checks pending (atlantis, Bugbot): exit, notify user
  • 100 poll cycles reached (50 minutes at default interval): timeout, exit with warning

Phase 3: Failure Classification

Classify each failure before attempting any fix.

Get changed files:

git diff --name-only origin/main...HEAD

Compare failing files against changed files:

  • Overlap exists: failure is RELATED to the PR changes
  • No overlap: failure is UNRELATED (flaky or infra)

RELATED failures (files the PR changed):

Retrieve logs:

gh run view <run_id> --log-failed

Extract error: Capture the actual error message and file:line reference.

Classify: Build error, test failure, lint error, or timeout.

Fix: Apply a minimal, surgical fix. Commit with message:

fix(ci): <brief description>
.

Push and resume: Push to trigger a re-run, then return to Phase 2.

UNRELATED failures (E2E flakes, infra, files the PR did not touch):

  • Retrigger only:
    gh run rerun <run_id> --failed
  • Never debug unrelated failures locally
  • If the same test fails twice after retrigger, escalate to the user as a true flake
  • Maximum 2 retriggers per check before escalating

EXTERNAL checks (atlantis, third-party services):

  • Never attempt fixes
  • Notify user with the specific next step (e.g., check PR comments, run
    atlantis plan
    )

Phase 4: Self-Healing Verification

After applying a fix for a RELATED failure:

  • Resume monitoring from Phase 2
  • Same error recurs after fix: escalate to the user
  • New error appears: re-enter Phase 3 for the new failure

Common Patterns

CheckTypical IssueAction
atlantis/planStalled 10+ minCheck PR comments or re-trigger
build/testCode errorGet logs, fix code, push
lintFormattingRun local lint, commit autofix
E2E testsKnown flakyRetrigger, never debug locally
timeoutHanging testCheck for infinite loops or deadlocks

CLI Reference

gh run list --limit 10              # List recent runs
gh run watch <run_id>               # Watch specific run
gh run view <run_id> --log-failed   # View failing job logs
gh run rerun <run_id> --failed      # Re-run failed jobs only
gh pr checks <pr_number>            # Get PR checks
gh pr checks <pr_number> --watch    # Watch PR checks (blocks until complete)

Constraints

  • DO poll at 30-second intervals until a terminal state
  • DO classify failures as RELATED, UNRELATED, or EXTERNAL before acting
  • DO retrigger unrelated/flaky failures instead of debugging them locally
  • DO use
    gh run view --log-failed
    for actual errors -- never guess
  • DO exit gracefully when only external checks remain pending
  • DO NOT debug unrelated E2E flakes locally
  • DO NOT poll indefinitely for external checks (atlantis, Bugbot)
  • DO NOT retrigger the same flaky check more than 2 times
  • DO NOT apply speculative fixes without reading the actual error logs

Output Format

During monitoring: Status table with check name, status, and duration on each poll cycle.

On success: Summary with total elapsed time, check count, and "ready for merge" confirmation.

On failure (after fix attempts): Failing check name, error message, number of fix attempts, and statement that manual intervention is needed.

On stalled external check: Which checks passed, which are stalled, and the specific action the user must take.

On flaky retrigger: Failed test name, confirmation it is unrelated to PR changes, and retrigger count (out of 2 maximum).