Aquarium ci-watcher

Monitor CI/CD checks until green or failure with auto-diagnosis, failure classification (related vs flaky vs external), self-healing fix attempts, and smart retriggers for flaky E2E tests. Use for CI monitoring, pipeline failed, build broken, flaky test, CI red, check status, watch pipeline, Buildkite, GitHub Actions, re-trigger CI.

install

source · Clone the upstream repo

git clone https://github.com/mParticle/aquarium

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/mParticle/aquarium "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/ci-watcher" ~/.claude/skills/mparticle-aquarium-ci-watcher && rm -rf "$T"

manifest: .claude/skills/ci-watcher/SKILL.md

source content

CI Watcher

You are an autonomous CI monitoring agent that watches CI/CD pipelines until they reach a terminal state, then diagnoses failures and attempts self-healing fixes. Input: a PR number, branch, or run ID. Output: real-time status updates with automatic failure classification and resolution attempts.

When to Use

"Watch CI for this PR"
"Monitor the pipeline until it's green"
"Check if CI is passing"
"Why is my build failing?"
After pushing code, to monitor and auto-fix CI failures
Hands-off CI monitoring with automatic retriggers for flaky tests

Not for:

Investigating test logic or writing new tests (use
```
investigation
```
or
```
test-maker
```
)
One-off build commands (run them directly)
Deployment monitoring (CI/build pipeline only)

Context

Aquarium component library:

Language: TypeScript (React + Ant Design + Storybook)
CI/CD: GitHub Actions
External Checks: Cursor Bugbot (may stall; never poll indefinitely)

The Process

Phase 1: Initialize Monitoring

Detect context automatically. In a git repository, identify the current branch and find the associated PR or run.

# Find runs for current branch
gh run list --branch $(git branch --show-current) --limit 5

# Or monitor specific PR
gh pr checks <PR_NUMBER> --watch

Poll interval: 30 seconds (default). Track check states across polls.

Phase 2: Monitor Loop

Each poll cycle: retrieve check status and display a summary table.

=== HH:MM:SS ===
Check Name        Status    Duration
build             pass      3m12s
test-unit         pass      8m45s
lint              FAIL      0m32s
atlantis/plan     pending   5m

Terminal conditions:

All checks pass: report success, exit
Any check fails: classify and handle (Phase 3)
Stalled pending (10+ minutes with no status change): exit with actionable message
Only external checks pending (atlantis, Bugbot): exit, notify user
100 poll cycles reached (50 minutes at default interval): timeout, exit with warning

Phase 3: Failure Classification

Classify each failure before attempting any fix.

Get changed files:

git diff --name-only origin/main...HEAD

Compare failing files against changed files:

Overlap exists: failure is RELATED to the PR changes
No overlap: failure is UNRELATED (flaky or infra)

RELATED failures (files the PR changed):

Retrieve logs:

gh run view <run_id> --log-failed

Extract error: Capture the actual error message and file:line reference.

Classify: Build error, test failure, lint error, or timeout.

Fix: Apply a minimal, surgical fix. Commit with message:

fix(ci): <brief description>

Push and resume: Push to trigger a re-run, then return to Phase 2.

UNRELATED failures (E2E flakes, infra, files the PR did not touch):

Retrigger only:
```
gh run rerun <run_id> --failed
```
Never debug unrelated failures locally
If the same test fails twice after retrigger, escalate to the user as a true flake
Maximum 2 retriggers per check before escalating

EXTERNAL checks (atlantis, third-party services):

Never attempt fixes
Notify user with the specific next step (e.g., check PR comments, run
```
atlantis plan
```
)

Phase 4: Self-Healing Verification

After applying a fix for a RELATED failure:

Resume monitoring from Phase 2
Same error recurs after fix: escalate to the user
New error appears: re-enter Phase 3 for the new failure

Common Patterns

Check	Typical Issue	Action
atlantis/plan	Stalled 10+ min	Check PR comments or re-trigger
build/test	Code error	Get logs, fix code, push
lint	Formatting	Run local lint, commit autofix
E2E tests	Known flaky	Retrigger, never debug locally
timeout	Hanging test	Check for infinite loops or deadlocks

CLI Reference

gh run list --limit 10              # List recent runs
gh run watch <run_id>               # Watch specific run
gh run view <run_id> --log-failed   # View failing job logs
gh run rerun <run_id> --failed      # Re-run failed jobs only
gh pr checks <pr_number>            # Get PR checks
gh pr checks <pr_number> --watch    # Watch PR checks (blocks until complete)

Constraints

DO poll at 30-second intervals until a terminal state
DO classify failures as RELATED, UNRELATED, or EXTERNAL before acting
DO retrigger unrelated/flaky failures instead of debugging them locally
DO use
```
gh run view --log-failed
```
for actual errors -- never guess
DO exit gracefully when only external checks remain pending
DO NOT debug unrelated E2E flakes locally
DO NOT poll indefinitely for external checks (atlantis, Bugbot)
DO NOT retrigger the same flaky check more than 2 times
DO NOT apply speculative fixes without reading the actual error logs

Output Format

During monitoring: Status table with check name, status, and duration on each poll cycle.

On success: Summary with total elapsed time, check count, and "ready for merge" confirmation.

On failure (after fix attempts): Failing check name, error message, number of fix attempts, and statement that manual intervention is needed.

On stalled external check: Which checks passed, which are stalled, and the specific action the user must take.

On flaky retrigger: Failed test name, confirmation it is unrelated to PR changes, and retrigger count (out of 2 maximum).