GAAI-framework ci-watch-and-fix
Watch GitHub Actions CI after PR creation, detect failures, extract logs, apply minimal fixes, and re-push — keeping the delivery session alive until CI resolves or escalating after 3 cycles. Activate immediately after gh pr create and before marking the story done.
git clone https://github.com/Fr-e-d/GAAI-framework
T=$(mktemp -d) && git clone --depth=1 https://github.com/Fr-e-d/GAAI-framework "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.gaai/core/skills/delivery/ci-watch-and-fix" ~/.claude/skills/fr-e-d-gaai-framework-ci-watch-and-fix && rm -rf "$T"
.gaai/core/skills/delivery/ci-watch-and-fix/SKILL.md
CI Watch and Fix
Purpose / When to Activate
Owner: Delivery Orchestrator.
Activate immediately after gh pr create and before marking the story done.
This skill keeps the delivery session alive through GitHub Actions CI execution. It detects failures, fetches logs, applies minimal fixes, and re-pushes, for up to 3 remediation cycles. If CI does not converge within 3 cycles, it escalates without marking the story done.
Do NOT use gh pr checks --watch. Active polling is mandatory to ensure the log file receives periodic output and the daemon heartbeat monitor does not falsely kill the session. See AC7.
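As an illustration of active polling, the result of one poll can be collapsed into a single overall state. This is a minimal sketch, not the skill's own implementation. It assumes the check list was obtained as JSON (recent gh releases offer gh pr checks --json, but verify this against the installed version) and that each entry has a bucket field with values such as pass, fail, pending, skipping, or cancel. The function name and the returned labels are hypothetical.

```python
from typing import Iterable, Mapping


def summarize_checks(checks: Iterable[Mapping[str, str]]) -> str:
    """Collapse one poll's check list into a single overall state."""
    # Assumption: each entry has a "bucket" key; a missing key counts as pending.
    buckets = [c.get("bucket", "pending") for c in checks]
    if not buckets:
        return "no_checks"  # Step 1b: nothing registered on the PR
    if any(b in ("fail", "cancel") for b in buckets):
        return "fail"
    if all(b in ("pass", "skipping") for b in buckets):
        return "pass"
    return "pending"  # something is still running or has an unknown state
```

Because each poll returns immediately, the caller can write a log line between polls, which a blocking watch would not allow.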
External Dependencies
gh CLI authenticated with repo + actions:read scopes (already present in the project environment; no additional setup required).
Process
Initialization
    cycle = 0
    flaky_retry_used = false
    previous_failure_signatures = {}  # map: signature (hash of check_name + first 100 chars of failure log) → true
Step 0 — Branch Protection Check (once, before loop)
    # Determine if CI is a hard gate or advisory
    # gh api returns 403 on repos without branch protection (free/private)
    bp_status = gh api repos/<repo>/branches/staging/protection --jq '.required_status_checks' 2>&1
    if bp_status contains "403" OR bp_status contains "404" OR bp_status is empty:
        ci_is_advisory = true
        echo "[ci-watch-and-fix] No branch protection on staging — CI is advisory" >> $LOG_DIR/<story-id>.log
    else:
        ci_is_advisory = false
        echo "[ci-watch-and-fix] Branch protection active — CI is a hard gate" >> $LOG_DIR/<story-id>.log
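The Step 0 decision can be expressed as a small pure function over the captured command output. This is a sketch under the same substring rules as the pseudocode above; the function name is hypothetical, and the sample strings in the usage lines are made up for illustration rather than taken from real gh output.

```python
def is_ci_advisory(bp_status: str) -> bool:
    """Return True when the protection lookup indicates CI is not a hard gate.

    Mirrors Step 0: empty output, or output mentioning 403 or 404,
    means no enforced status checks were found.
    """
    text = bp_status.strip()
    if not text:
        return True
    return "403" in text or "404" in text


# Hypothetical sample outputs, for illustration only.
print(is_ci_advisory(""))                             # True
print(is_ci_advisory('{"contexts": ["build"]}'))      # False
```

Substring matching is coarse: a protection payload that happens to contain the digits 403 or 404 would be misread as advisory, so a stricter implementation might inspect the command's exit status instead.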
Main Loop (max 3 cycles)
    while cycle < 3:
        cycle += 1

        # Heartbeat — always write a log line at the start of each cycle
        echo "[ci-watch-and-fix] cycle ${cycle}/3 — polling PR #<pr-number> checks" >> $LOG_DIR/<story-id>.log

        # Step 1 — Poll PR checks
        run: gh pr checks <pr-number> --repo <repo>

        # Step 1b — No checks registered?
        # If no CI checks are registered on the PR (no workflows triggered),
        # treat as advisory pass — nothing to wait for.
        if no checks exist:
            echo "[ci-watch-and-fix] No CI checks registered — CI PASS (no checks)" >> $LOG_DIR/<story-id>.log
            exit loop → return CI PASS

        # Step 2 — All passing?
        if all checks pass:
            echo "[ci-watch-and-fix] CI PASS — all checks green" >> $LOG_DIR/<story-id>.log
            exit loop → return CI PASS

        # Step 3 — Identify failed checks and their run IDs
        for each failed check:
            get run_id from check

        # Step 4 — Fetch failure logs (truncated to last 3000 chars per job)
        raw_log = gh run view <run-id> --repo <repo> --log-failed
        failure_log = last 3000 chars of raw_log

        # Step 4b — Pre-existing infra failure detection (fast-path)
        # Detect infrastructure-level failures that code changes cannot fix.
        # These are pre-existing conditions unrelated to the story's changes.
        INFRA_PATTERNS = [
            "recent account payments have failed",
            "spending limit needs to be increased",
            "Actions minutes",
            "Actions quota",
            "not started because",  # job queuing failure (billing gate)
            "out of Actions minutes",
        ]
        if any(pattern matches failure_log) for any failed job:
            if ci_is_advisory:
                echo "[ci-watch-and-fix] Infra failure detected but CI is advisory (no branch protection) — CI PASS (advisory skip)" >> $LOG_DIR/<story-id>.log
                exit loop → return CI PASS (advisory)
            else:
                echo "[ci-watch-and-fix] Infra failure detected AND branch protection active — ESCALATE (cannot merge)" >> $LOG_DIR/<story-id>.log
                convergence_failure_reason = "Pre-existing infrastructure failure: GitHub Actions billing/quota limit. Branch protection prevents merge without CI PASS."
                goto ESCALATE

        # Step 5 — Flaky test detection
        signature = hash(check_name + first_100_chars_of_failure_log)
        if signature in previous_failure_signatures:
            # Same failure seen in a previous cycle → suspected flaky
            if flaky_retry_used:
                # Already used the one flaky retry → escalate
                goto ESCALATE
            else:
                flaky_retry_used = true
                echo "[ci-watch-and-fix] suspected flaky test in <check_name> — pushing empty commit retry" >> $LOG_DIR/<story-id>.log
                git commit --allow-empty -m "ci: retry (suspected flaky)"  (in worktree)
                git push origin <story_branch>  (in worktree)
                sleep 60
                continue  # next cycle without applying code changes
        else:
            previous_failure_signatures[signature] = true

        # Step 6 — Analyze and fix (non-flaky failures)
        analyze failure_log to identify root cause
        apply minimal corrective code changes (in worktree — do not expand scope)
        git add → git commit -m "fix(ci/<story-id>): <description>"  (in worktree)

        # Push all fixes
        git push origin <story_branch>  (in worktree)

        # Step 7 — Wait then re-poll
        echo "[ci-watch-and-fix] fixes pushed — waiting 60s before re-poll" >> $LOG_DIR/<story-id>.log
        sleep 60

    # Exhausted 3 cycles without CI PASS
    goto ESCALATE
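Steps 4 and 4b (log truncation followed by the infra fast-path) can be sketched as two small helpers. The pattern list is copied from the loop above. The helper names are hypothetical, and case-insensitive substring matching is an assumption, since the pseudocode only says that a pattern matches.

```python
# Patterns copied from Step 4b of the skill.
INFRA_PATTERNS = [
    "recent account payments have failed",
    "spending limit needs to be increased",
    "Actions minutes",
    "Actions quota",
    "not started because",
    "out of Actions minutes",
]


def tail(log: str, limit: int = 3000) -> str:
    """Keep only the last `limit` characters of a job log (Step 4)."""
    return log[-limit:]


def is_infra_failure(raw_log: str) -> bool:
    """Step 4b: True if the truncated log contains any infra pattern.

    Assumption: matching is a case-insensitive substring test.
    """
    failure_log = tail(raw_log).lower()
    return any(p.lower() in failure_log for p in INFRA_PATTERNS)
```

Because truncation runs first, a billing message that appears more than 3000 characters before the end of the log is not seen by the fast-path. This follows the order of Step 4 and Step 4b as written.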
Heartbeat Rule
The daemon heartbeat monitor kills sessions silent for >30 minutes. This skill MUST emit at least one line to $LOG_DIR/<story-id>.log every 5 minutes during CI wait time. During the 60-second sleep between cycles, this is not an issue. If a single CI run takes >5 minutes to complete, emit periodic heartbeat lines:
    # During long CI waits, poll every 60s and emit a heartbeat line each time
    while ci_running:
        sleep 60
        echo "[ci-watch-and-fix] waiting for CI — elapsed: <N>s" >> $LOG_DIR/<story-id>.log
        check if checks are still in_progress
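A minimal sketch of the heartbeat wait, assuming the caller supplies a function that reports whether checks are still running and a function that appends a line to the story log. All names here are hypothetical and the exact wording of the heartbeat line is illustrative. The sleep function is injectable so the loop can be exercised without real delays.

```python
import time
from typing import Callable


def wait_for_ci(is_running: Callable[[], bool],
                emit: Callable[[str], None],
                poll_seconds: int = 60,
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until CI stops running; emit one heartbeat line per poll.

    Returns the total number of seconds spent waiting.
    """
    elapsed = 0
    while is_running():
        sleep(poll_seconds)
        elapsed += poll_seconds
        emit(f"[ci-watch-and-fix] waiting for CI (elapsed: {elapsed}s)")
    return elapsed
```

With a 60-second poll interval, a line is written every minute, which stays well inside both the 5-minute rule and the 30-minute kill threshold.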
Escalation Path (AC3)
Trigger when ci_is_advisory == false AND one of the following holds: all 3 cycles are exhausted without CI passing, the flaky retry is already used and the same failure recurs, or Step 4b detects an infrastructure failure.
When ci_is_advisory == true, infra failures and exhausted retries produce CI PASS (advisory), never CI FAIL. The merge proceeds. The escalation path below only applies when branch protection is active.
    ESCALATE:
        # 1. Produce ci_remediation_report
        report_path = docs/ci-failures/<story-id>-<timestamp>.md
        write report containing:
            - story_id
            - pr_number
            - total_cycles_attempted
            - flaky_retry_used
            - per-cycle summary:
                - cycle number
                - checks that failed
                - failure log excerpt (last 500 chars)
                - fix attempted (or "flaky retry" / "none")
            - convergence_failure_reason: why CI did not converge

        # 2. Commit the report to the PR branch (in worktree)
        git add <report_path>
        git commit -m "ci(<story-id>): CI remediation report — convergence failed"
        git push origin <story_branch>

        # 3. Return CI FAIL — do NOT mark story done
        # The delivery wrapper's on_exit trap will mark the story failed (non-zero exit)
        return CI FAIL
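The report contents listed in step 1 could be rendered along these lines. This is only a sketch: the field names come from the list above, while the Markdown layout, the CycleSummary type, and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CycleSummary:
    number: int
    failed_checks: List[str]
    failure_log: str
    fix_attempted: str  # a description, "flaky retry", or "none"


def render_report(story_id: str, pr_number: int, flaky_retry_used: bool,
                  cycles: List[CycleSummary], reason: str) -> str:
    """Build the remediation report body as Markdown text."""
    lines = [
        f"# CI remediation report: {story_id}",
        "",
        f"- story_id: {story_id}",
        f"- pr_number: {pr_number}",
        f"- total_cycles_attempted: {len(cycles)}",
        f"- flaky_retry_used: {str(flaky_retry_used).lower()}",
        f"- convergence_failure_reason: {reason}",
    ]
    for c in cycles:
        lines += [
            "",
            f"## Cycle {c.number}",
            f"- checks that failed: {', '.join(c.failed_checks)}",
            f"- fix attempted: {c.fix_attempted}",
            "- failure log excerpt:",
            "",
            # Only the last 500 characters of the log go into the report.
            *("    " + ln for ln in c.failure_log[-500:].splitlines()),
        ]
    return "\n".join(lines) + "\n"
```

The returned text would then be written to the report path and committed to the PR branch before CI FAIL is returned.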
NEVER mark the story done when returning CI FAIL.
Flaky Test Detection Heuristic (AC4)
A CI failure is classified as likely flaky if:
- The same CI check fails in the current cycle AND
- A previous cycle saw a failure in that same check with an identical error message (matched via the first 100 characters of the failure log for that check)
When a likely-flaky failure is detected:
- Do NOT apply code changes
- Push an empty commit to re-trigger CI: git commit --allow-empty -m "ci: retry (suspected flaky)"
- Count this as consuming the flaky retry slot (max 1 flaky retry total per story)
- If the flaky retry slot is already used and the same failure recurs → escalate
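The heuristic above can be sketched as a signature function plus a three-way decision. The skill only says that a hash is used, so SHA-256 is an assumption here, and the function names and returned labels are hypothetical.

```python
import hashlib
from typing import Set


def failure_signature(check_name: str, failure_log: str) -> str:
    """Hash the check name with the first 100 characters of its failure log."""
    payload = check_name + failure_log[:100]
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify_failure(signature: str, seen: Set[str], flaky_retry_used: bool) -> str:
    """Decide what to do with a failed check in the current cycle.

    Returns "fix" for a first-time failure, "flaky_retry" when the same
    failure was seen before and the retry slot is free, "escalate" otherwise.
    """
    if signature not in seen:
        seen.add(signature)  # remember it for later cycles
        return "fix"
    return "escalate" if flaky_retry_used else "flaky_retry"
```

Only the first 100 characters contribute to the signature, so two logs that diverge later still count as the same failure. The caller is responsible for setting its own flaky_retry_used flag after a "flaky_retry" result.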
Fix Principles
When applying corrective code changes for non-flaky failures:
- Minimal change only — fix what CI identifies, nothing more
- No scope expansion — do not refactor, add features, or change behavior beyond the CI failure
- Commit message convention: fix(ci/<story-id>): <description> (distinguishable from feature commits)
- Truncate logs: analyze only the last 3000 chars of each failed job log to stay within context limits
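As a small illustration of the commit message convention, the helpers below build a message and recognize one afterwards. The helper names and the regular expression are assumptions, and the story id in the usage line is made up.

```python
import re

# Matches messages of the form: fix(ci/<story-id>): <description>
CI_FIX_RE = re.compile(r"^fix\(ci/(?P<story>[^)]+)\): .+")


def fix_commit_message(story_id: str, description: str) -> str:
    """Format a CI fix commit message following the skill's convention."""
    return f"fix(ci/{story_id}): {description.strip()}"


def is_ci_fix_commit(message: str) -> bool:
    """True when a commit message follows the CI fix convention."""
    return CI_FIX_RE.match(message) is not None


# "S-12" is a hypothetical story id.
print(fix_commit_message("S-12", "pin node version"))
```

A recognizer like this is one way to separate CI fix commits from feature commits when reviewing branch history.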
Outputs
CI PASS:
    status: CI PASS
    cycles_used: <n>
    flaky_retry_used: <true|false>
CI PASS (advisory):
    status: CI PASS
    advisory: true
    reason: <"no_checks" | "infra_failure_advisory">
    note: "CI failed but branch protection is not active — merge permitted"
The Delivery Orchestrator treats CI PASS (advisory) identically to CI PASS: it proceeds to merge. The advisory flag is logged for traceability but does not block the delivery.
CI FAIL:
    status: CI FAIL
    cycles_used: 3
    flaky_retry_used: <true|false>
    escalation_reason: <why convergence failed>
    remediation_report: docs/ci-failures/<story-id>-<timestamp>.md
CI FAIL is only returned when branch protection is active AND CI cannot pass. When branch protection is absent, infra failures produce CI PASS (advisory) instead.
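From the consumer's side, the three output shapes reduce to a single decision. This sketch assumes the result arrives as a dictionary with the keys shown in the output blocks above; the function name and the action labels are hypothetical.

```python
def orchestrator_action(result: dict) -> str:
    """Map a skill result to the next delivery step (labels are illustrative)."""
    if result.get("status") == "CI PASS":
        # The advisory flag is kept for traceability but does not block the merge.
        return "merge"
    # CI FAIL: the story must not be marked done.
    return "stop_without_marking_done"
```

Keying the decision on status alone is what makes an advisory pass behave the same as a regular pass.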
Non-Goals
This skill must NOT:
- Modify acceptance criteria or product scope
- Apply fixes to pre-existing CI failures unrelated to this story's changes
- Attempt to fix infrastructure failures (missing secrets, missing bindings, quota limits, billing limits). These are detected via the Step 4b fast-path (do NOT burn retry cycles). When ci_is_advisory is true, they produce CI PASS (advisory). When branch protection is active, they produce ESCALATE.
- Merge the PR (that is the Orchestrator's responsibility after CI PASS)
- Use gh pr checks --watch (heartbeat requirement; see AC7)
Quality Checks
- Every cycle emits at least one heartbeat line to $LOG_DIR/<story-id>.log
- Flaky detection compares against previous cycle signatures, not just the current cycle
- Escalation report is committed before returning CI FAIL
- Story is never marked done on CI FAIL
- Log truncation is applied before analysis (max 3000 chars per job)