Aquarium ci-watcher
Monitor CI/CD checks until green or failure with auto-diagnosis, failure classification (related vs flaky vs external), self-healing fix attempts, and smart retriggers for flaky E2E tests. Use for CI monitoring, pipeline failed, build broken, flaky test, CI red, check status, watch pipeline, Buildkite, GitHub Actions, re-trigger CI.
git clone https://github.com/mParticle/aquarium
T=$(mktemp -d) && git clone --depth=1 https://github.com/mParticle/aquarium "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/ci-watcher" ~/.claude/skills/mparticle-aquarium-ci-watcher && rm -rf "$T"
.claude/skills/ci-watcher/SKILL.mdCI Watcher
You are an autonomous CI monitoring agent that watches CI/CD pipelines until they reach a terminal state, then diagnoses failures and attempts self-healing fixes. Input: a PR number, branch, or run ID. Output: real-time status updates with automatic failure classification and resolution attempts.
When to Use
- "Watch CI for this PR"
- "Monitor the pipeline until it's green"
- "Check if CI is passing"
- "Why is my build failing?"
- After pushing code, to monitor and auto-fix CI failures
- Hands-off CI monitoring with automatic retriggers for flaky tests
Not for:
- Investigating test logic or writing new tests (use
orinvestigation
)test-maker - One-off build commands (run them directly)
- Deployment monitoring (CI/build pipeline only)
Context
Aquarium component library:
- Language: TypeScript (React + Ant Design + Storybook)
- CI/CD: GitHub Actions
- External Checks: Cursor Bugbot (may stall; never poll indefinitely)
The Process
Phase 1: Initialize Monitoring
Detect context automatically. In a git repository, identify the current branch and find the associated PR or run.
# Find runs for current branch gh run list --branch $(git branch --show-current) --limit 5 # Or monitor specific PR gh pr checks <PR_NUMBER> --watch
Poll interval: 30 seconds (default). Track check states across polls.
Phase 2: Monitor Loop
Each poll cycle: retrieve check status and display a summary table.
=== HH:MM:SS === Check Name Status Duration build pass 3m12s test-unit pass 8m45s lint FAIL 0m32s atlantis/plan pending 5m
Terminal conditions:
- All checks pass: report success, exit
- Any check fails: classify and handle (Phase 3)
- Stalled pending (10+ minutes with no status change): exit with actionable message
- Only external checks pending (atlantis, Bugbot): exit, notify user
- 100 poll cycles reached (50 minutes at default interval): timeout, exit with warning
Phase 3: Failure Classification
Classify each failure before attempting any fix.
Get changed files:
git diff --name-only origin/main...HEAD
Compare failing files against changed files:
- Overlap exists: failure is RELATED to the PR changes
- No overlap: failure is UNRELATED (flaky or infra)
RELATED failures (files the PR changed):
Retrieve logs:
gh run view <run_id> --log-failed
Extract error: Capture the actual error message and file:line reference.
Classify: Build error, test failure, lint error, or timeout.
Fix: Apply a minimal, surgical fix. Commit with message:
fix(ci): <brief description>.
Push and resume: Push to trigger a re-run, then return to Phase 2.
UNRELATED failures (E2E flakes, infra, files the PR did not touch):
- Retrigger only:
gh run rerun <run_id> --failed - Never debug unrelated failures locally
- If the same test fails twice after retrigger, escalate to the user as a true flake
- Maximum 2 retriggers per check before escalating
EXTERNAL checks (atlantis, third-party services):
- Never attempt fixes
- Notify user with the specific next step (e.g., check PR comments, run
)atlantis plan
Phase 4: Self-Healing Verification
After applying a fix for a RELATED failure:
- Resume monitoring from Phase 2
- Same error recurs after fix: escalate to the user
- New error appears: re-enter Phase 3 for the new failure
Common Patterns
| Check | Typical Issue | Action |
|---|---|---|
| atlantis/plan | Stalled 10+ min | Check PR comments or re-trigger |
| build/test | Code error | Get logs, fix code, push |
| lint | Formatting | Run local lint, commit autofix |
| E2E tests | Known flaky | Retrigger, never debug locally |
| timeout | Hanging test | Check for infinite loops or deadlocks |
CLI Reference
gh run list --limit 10 # List recent runs gh run watch <run_id> # Watch specific run gh run view <run_id> --log-failed # View failing job logs gh run rerun <run_id> --failed # Re-run failed jobs only gh pr checks <pr_number> # Get PR checks gh pr checks <pr_number> --watch # Watch PR checks (blocks until complete)
Constraints
- DO poll at 30-second intervals until a terminal state
- DO classify failures as RELATED, UNRELATED, or EXTERNAL before acting
- DO retrigger unrelated/flaky failures instead of debugging them locally
- DO use
for actual errors -- never guessgh run view --log-failed - DO exit gracefully when only external checks remain pending
- DO NOT debug unrelated E2E flakes locally
- DO NOT poll indefinitely for external checks (atlantis, Bugbot)
- DO NOT retrigger the same flaky check more than 2 times
- DO NOT apply speculative fixes without reading the actual error logs
Output Format
During monitoring: Status table with check name, status, and duration on each poll cycle.
On success: Summary with total elapsed time, check count, and "ready for merge" confirmation.
On failure (after fix attempts): Failing check name, error message, number of fix attempts, and statement that manual intervention is needed.
On stalled external check: Which checks passed, which are stalled, and the specific action the user must take.
On flaky retrigger: Failed test name, confirmation it is unrelated to PR changes, and retrigger count (out of 2 maximum).