Awesome-omni-skill ci-fix-pipeline
Self-healing CI pipeline -- 3-attempt retry budget with strategy rotation, inbox-wait for results, autonomous fix loop with escalation
git clone https://github.com/diegosouzapw/awesome-omni-skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/ci-fix-pipeline" ~/.claude/skills/diegosouzapw-awesome-omni-skill-ci-fix-pipeline && rm -rf "$T"
skills/devops/ci-fix-pipeline/SKILL.md
CI Fix Pipeline
Overview
Autonomous pipeline that fetches GitHub Actions CI failures and fixes them -- ALL failures by default. No selective mode. Failures whose scope exceeds max_fix_files trigger sub-ticket creation, and the pipeline continues with the remaining fixable failures.
v2.0 Self-Healing Mode (OMN-2829): When
--self-heal is enabled, the pipeline enters a
multi-attempt repair loop with strategy rotation. Each attempt uses a different fix strategy
(targeted -> broad lint -> regenerate). Between attempts, the pipeline uses inbox-wait (not
polling) to detect CI re-run results. The node_ci_repair_effect ONEX node orchestrates this
loop.
Workflow (standard): Fetch CI failures -> Slack start -> Classify + Sub-ticket large-scope -> Fix ALL fixable -> Commit -> Slack complete -> ModelSkillResult
Workflow (self-healing):
_bin/ci-status.sh -> detect failures -> attempt 1 (targeted fix) -> push -> inbox-wait -> if still failing -> attempt 2 (broad lint fix) -> push -> inbox-wait -> if still failing -> attempt 3 (regenerate) -> push -> inbox-wait -> inbox notification on success/exhaustion
Announce at start: "I'm using the ci-fix-pipeline skill to fix CI failures."
Policy Defaults

    policy:
      fix_all: true                     # always fix all -- no selective mode
      max_fix_files: 10                 # files in scope trigger sub-ticket (not skip)
      fix_preexisting_in_touched: true  # fix pre-existing issues in touched files
      slack_on_start: true              # notify Slack before fixing
      slack_on_complete: true           # notify Slack with fix summary
      max_attempts: 3                   # self-heal: retry budget (1-3)
      self_heal: false                  # self-heal: enable multi-attempt loop

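The defaults above can be modelled as a small typed structure. This is a minimal sketch rather than the skill's actual implementation: the class name PipelinePolicy and the validation behaviour are assumptions, and only the field names, default values, and the 1-3 range for max_attempts come from the policy block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelinePolicy:
    """Hypothetical container mirroring the documented policy defaults."""

    fix_all: bool = True
    max_fix_files: int = 10
    fix_preexisting_in_touched: bool = True
    slack_on_start: bool = True
    slack_on_complete: bool = True
    max_attempts: int = 3
    self_heal: bool = False

    def __post_init__(self) -> None:
        # The policy block documents a retry budget of 1-3 attempts.
        if not 1 <= self.max_attempts <= 3:
            raise ValueError("max_attempts must be between 1 and 3")
```

Overriding a single field, for example PipelinePolicy(max_fix_files=20), corresponds to passing --max-fix-files 20 on the command line.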
Quick Start

    /ci-fix-pipeline                                 # Fix all CI failures on current branch
    /ci-fix-pipeline --pr 42                         # Fix failures for PR #42
    /ci-fix-pipeline --ticket-id OMN-1234            # Include ticket context in Slack messages
    /ci-fix-pipeline --no-slack                      # Suppress Slack notifications
    /ci-fix-pipeline --skip-patterns "test_*"        # Skip jobs/steps matching pattern
    /ci-fix-pipeline --max-fix-files 20              # Raise the sub-ticket threshold
    /ci-fix-pipeline --self-heal --pr 42             # Self-healing mode with retry loop
    /ci-fix-pipeline --self-heal --max-attempts 2    # Limit to 2 attempts

Arguments
| Argument | Default | Description |
|---|---|---|
| --pr | none | PR number for CI failure fetch |
| --branch | current | Branch name for CI failure fetch |
| --skip-patterns | none | Comma-separated job/step name patterns to skip |
| --max-fix-files | 10 | Files-in-scope threshold; above this -> sub-ticket |
| --no-slack | false | Disable Slack notifications |
| --ticket-id | none | Context ticket ID for Slack messages |
| --self-heal | false | Enable self-healing retry loop with strategy rotation |
| --max-attempts | 3 | Maximum repair attempts (only with --self-heal) |
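For readers who want to see the flags and defaults in executable form, the sketch below expresses them with Python's argparse. The skill itself is invoked as a slash command, so this parser is purely illustrative; build_parser and split_patterns are hypothetical helper names, and restricting --max-attempts to 1-3 follows the retry budget stated under Policy Defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the documented flags (not the skill's own code)."""
    parser = argparse.ArgumentParser(prog="ci-fix-pipeline")
    parser.add_argument("--pr", type=int, default=None,
                        help="PR number for CI failure fetch")
    parser.add_argument("--branch", default=None,
                        help="Branch name for CI failure fetch (default: current)")
    parser.add_argument("--skip-patterns", default=None,
                        help="Comma-separated job/step name patterns to skip")
    parser.add_argument("--max-fix-files", type=int, default=10,
                        help="Files-in-scope threshold; above this -> sub-ticket")
    parser.add_argument("--no-slack", action="store_true",
                        help="Disable Slack notifications")
    parser.add_argument("--ticket-id", default=None,
                        help="Context ticket ID for Slack messages")
    parser.add_argument("--self-heal", action="store_true",
                        help="Enable self-healing retry loop")
    parser.add_argument("--max-attempts", type=int, default=3, choices=(1, 2, 3),
                        help="Maximum repair attempts (only with --self-heal)")
    return parser


def split_patterns(raw):
    """Turn the comma-separated --skip-patterns value into a clean list."""
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]
```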
Execution Phases
Phase 1: Fetch CI Failures
Dispatch to polymorphic agent:

    Task(
      subagent_type="onex:polymorphic-agent",
      description="Fetch CI failures for ci-fix-pipeline",
      prompt="Fetch CI failures using the ci-failures skill.
        Run: ${CLAUDE_PLUGIN_ROOT}/skills/ci-failures/ci-quick-review {N | branch_name}
        Return the raw JSON from ci-quick-review (pass through unchanged).
        The response has structure:
        {\"repository\": str, \"pr_number\": int,
         \"summary\": {\"total\": N, \"critical\": N, \"major\": N, \"minor\": N},
         \"failures\": [{\"workflow\": str, \"job\": str, \"job_id\": str, \"step\": str,
                        \"severity\": str, \"workflow_id\": str, \"job_url\": str}],
         \"fetched_at\": str}"
    )

Branch resolution: After Phase 1, the orchestrator must have a branch name for use in later phases. Resolve as follows:
- If --branch was provided: use that value directly.
- If --pr was provided (no --branch): run gh pr view {N} --json headRefName --jq '.headRefName' to get the branch name.
- If neither was provided: use the current branch (already known from the ci-quick-review invocation).
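The three resolution rules can be sketched as a single function. This is an illustration under stated assumptions: resolve_branch and _run_gh are hypothetical names, the current branch is passed in by the caller, and the gh invocation is the one quoted in the rules above. The runner is injectable so the logic can be exercised without the gh CLI installed.

```python
import subprocess
from typing import Callable, List, Optional


def _run_gh(args: List[str]) -> str:
    """Run the gh CLI and return its trimmed stdout (raises on non-zero exit)."""
    result = subprocess.run(["gh", *args], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def resolve_branch(
    branch: Optional[str],
    pr: Optional[int],
    current_branch: str,
    run_gh: Callable[[List[str]], str] = _run_gh,
) -> str:
    """Apply the documented precedence: --branch, then --pr, then current branch."""
    if branch:
        return branch
    if pr is not None:
        return run_gh(["pr", "view", str(pr), "--json", "headRefName",
                       "--jq", ".headRefName"])
    return current_branch
```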
Phase 2: Slack Start Notification
If slack_on_start: true and Slack is available, notify:

    ci-fix-pipeline starting
    PR/Branch: {context}
    Failures found: {N} ({critical} critical, {major} major, {minor} minor)
    Ticket: {ticket_id if provided}

Skip silently if Slack unavailable (non-blocking).
Phase 3: Classify and Route Failures
For each failure:
- Skip check: Does the failure job or step name match any --skip-patterns pattern?
  - Yes → mark as skipped, record reason
  - No → continue
- Scope check: Does the failure job touch more than max_fix_files files? Determine scope by inspecting the job logs (via gh api repos/{repo}/actions/jobs/{job_id}/logs) to count affected files. If log inspection is unavailable, treat scope as within threshold.
  - Scope > max_fix_files → mark as capped, create Linear sub-ticket inline (see Sub-Ticket Creation below), continue to next failure
  - Scope ≤ max_fix_files → add to fix queue

Result: failures split into skipped, capped, and to_fix buckets.
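The routing above can be sketched as a pure function. Several details here are assumptions, not taken from the skill: patterns are matched as shell-style globs via fnmatchcase (the source only shows the example "test_*"), the file count is supplied by a hypothetical count_files callback that returns None when log inspection is unavailable, and sub-ticket creation is left out.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Optional

Failure = Dict[str, str]


def classify_failures(
    failures: List[Failure],
    skip_patterns: List[str],
    max_fix_files: int,
    count_files: Callable[[Failure], Optional[int]],
) -> Dict[str, List[Failure]]:
    """Split failures into skipped, capped, and to_fix buckets."""
    buckets: Dict[str, List[Failure]] = {"skipped": [], "capped": [], "to_fix": []}
    for failure in failures:
        names = (failure["job"], failure["step"])
        # Skip check: job or step name matches any skip pattern.
        if any(fnmatchcase(name, pat) for name in names for pat in skip_patterns):
            buckets["skipped"].append(failure)
            continue
        # Scope check: None means logs were unavailable, treated as within threshold.
        scope = count_files(failure)
        if scope is not None and scope > max_fix_files:
            buckets["capped"].append(failure)
        else:
            buckets["to_fix"].append(failure)
    return buckets
```

Keeping the function free of I/O means the threshold behaviour can be checked with plain dictionaries before wiring in log inspection or Linear calls.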
Phase 4: Fix Failures
Dispatch one polymorphic agent per severity group (critical first, then major, then minor) for all to_fix failures:

    Task(
      subagent_type="onex:polymorphic-agent",
      description="Fix {severity} CI failures",
      prompt="**AGENT REQUIREMENT**: You MUST be a polymorphic-agent.

        Fix the following {severity} CI failures:
        {failures_list}

        Instructions:
        1. Read each affected file
        2. Apply the fix
        3. If fix_preexisting_in_touched is true: also fix any pre-existing lint/mypy issues
           in those files (only files already in scope -- not a full repo scan)
        4. Do NOT commit

        Return classification for each failure:
        {\"fixed\": [failure_ids], \"architectural\": [failure_ids], \"unfixable\": [failure_ids]}"
    )

Post-fix architectural check: For each failure returned as architectural by the fix agent:
- Send a Slack message (via HandlerSlackWebhook in omnibase_infra) describing the architectural change and asking for human approval. Include the failure description, the proposed fix, and the files affected. Wait for a reply (poll or webhook callback).
- Approved (human replies "approve" or "yes") → apply fix; Declined (any other reply or timeout after 10 min) → mark escalated
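The approval rule reduces to a small decision function. The sketch below assumes that replies are compared case-insensitively after trimming whitespace and that a timeout is represented as None; neither detail is specified by the skill, and resolve_architectural is a hypothetical name.

```python
from typing import Optional

# The two replies the skill documents as approval.
APPROVAL_REPLIES = {"approve", "yes"}


def resolve_architectural(reply: Optional[str]) -> str:
    """Return "apply" for an approval reply, otherwise "escalated".

    reply is None when no answer arrived before the 10-minute timeout.
    """
    if reply is not None and reply.strip().lower() in APPROVAL_REPLIES:
        return "apply"
    return "escalated"
```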
Phase 5: Commit Fixes
Skip this phase if the fixed count is 0 (no code was changed -- all failures were skipped, capped, or unfixable). Proceed directly to Phase 6 with commit: null in ModelSkillResult.
Otherwise, orchestrator stages and commits inline (no dispatch needed):

    git add <changed_files>
    git commit -m "fix(ci): resolve {N} {severity} failures [{ticket_id}]"

Commit message format:
fix(ci): resolve {N} {severity} failures [{ticket_id}]
where {severity} is the highest severity fixed (e.g., critical, major, minor) and
{ticket_id} is the value from --ticket-id (or omitted if not provided).
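A short sketch of how such a message could be assembled follows. It assumes that {N} counts every fixed failure regardless of severity and that the bracketed suffix is dropped entirely when no ticket ID is given; the source does not spell out either point, and build_commit_message is a hypothetical helper.

```python
from typing import List, Optional

# Highest severity first, matching the order used when dispatching fixes.
SEVERITY_ORDER = ("critical", "major", "minor")


def build_commit_message(fixed_severities: List[str],
                         ticket_id: Optional[str] = None) -> str:
    """Build the commit subject from the severities of the fixed failures."""
    if not fixed_severities:
        raise ValueError("nothing was fixed; Phase 5 should be skipped")
    highest = min(fixed_severities, key=SEVERITY_ORDER.index)
    message = f"fix(ci): resolve {len(fixed_severities)} {highest} failures"
    return f"{message} [{ticket_id}]" if ticket_id else message
```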
Sub-Ticket Creation
For each capped failure (scope > max_fix_files), created inline during Phase 3:

    # current_team: resolved from --ticket-id parent team (via mcp__linear-server__get_issue),
    # or from the first team returned by mcp__linear-server__list_teams if no ticket is provided.
    mcp__linear-server__create_issue(
        title=f"CI: {failure.job} -- {failure.step} (large scope)",
        team=current_team,
        description=f"""
    ## CI Failure Requiring Human Review

    **Job**: {failure.job}
    **Step**: {failure.step}
    **Severity**: {failure.severity}
    **Scope**: Exceeds max_fix_files={max_fix_files} threshold
    **Triggered by**: ci-fix-pipeline run for {ticket_id or branch}
    **Job URL**: {failure.job_url}

    ## Definition of Done
    - [ ] All affected files reviewed and fixed
    - [ ] CI passing on {branch}
    """,
        parentId=ticket_id if ticket_id else None,
        labels=["ci-failure", "needs-human"]
    )

Phase 6: Slack Complete Notification
If slack_on_complete: true, notify with diff summary:

    ci-fix-pipeline complete
    Fixed: {N} failures
    Skipped: {M} failures (patterns: {patterns})
    Sub-tickets created: {K} (large-scope failures)
    Escalated: {L} (architectural -- declined or timed out)
    Ticket: {ticket_id if provided}
    Branch: {branch}

ModelSkillResult Output
Emits to ~/.claude/skill-results/{context_id}/ci-fix-pipeline.json, where {context_id} is the Claude session ID (from the $CLAUDE_SESSION_ID env var) or default if the session ID is unavailable:

    {
      "status": "completed|capped|escalated|failed",
      "fixed_count": 5,
      "skipped_count": 1,
      "capped_count": 2,
      "escalated_count": 0,
      "unfixable_count": 0,
      "sub_tickets": ["OMN-XXXX", "OMN-XYYY"],
      "commit": "abc1234",
      "branch": "feature/my-branch",
      "ticket_id": "OMN-1234"
    }

Status values:
- completed -- All fixable failures resolved
- capped -- Some failures deferred to sub-tickets; fixed what was in scope
- escalated -- One or more architectural failures declined or timed out; human review required
- failed -- CI fetch failed or commit failed; pipeline halted
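The output location described above can be resolved and written with a few lines of standard-library code. This is a sketch under assumptions: result_path and write_result are hypothetical helpers, an empty CLAUDE_SESSION_ID is treated the same as a missing one, and the home directory and environment are injectable only so the example can run against a temporary directory.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def result_path(home: Optional[str] = None,
                env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve ~/.claude/skill-results/{context_id}/ci-fix-pipeline.json."""
    env = os.environ if env is None else env
    base = Path.home() if home is None else Path(home)
    # Fall back to the literal directory name "default" without a session ID.
    context_id = env.get("CLAUDE_SESSION_ID") or "default"
    return base / ".claude" / "skill-results" / context_id / "ci-fix-pipeline.json"


def write_result(result: dict, home: Optional[str] = None,
                 env: Optional[Mapping[str, str]] = None) -> Path:
    """Write the result as JSON, creating parent directories as needed."""
    path = result_path(home, env)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, indent=2))
    return path
```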
Failure Handling
| Error | Behavior |
|---|---|
| CI fetch fails | Hard exit with status failed, reason in output |
| Fix agent fails | Log failure, mark as unfixable, continue with others |
| Sub-ticket creation fails | Log warning, continue (non-blocking) |
| Slack unavailable | Skip notification, continue (non-blocking) |
| Commit fails | Exit with status failed, leave changes staged |
Sub-Ticket Threshold Policy
The max_fix_files threshold is a routing decision, not a skip:
- Failures within threshold: fixed autonomously
- Failures above threshold: sub-ticket created, pipeline continues with remaining
This ensures large-scope failures are never silently dropped — they are tracked in Linear.
Self-Healing Mode (OMN-2829)
When --self-heal is enabled, the pipeline wraps the standard fix flow in a multi-attempt retry loop orchestrated by node_ci_repair_effect.
Architecture

    CI fails
      -> _bin/ci-status.sh --pr N --repo ORG/REPO
      -> parse failure JSON
      -> node_ci_repair_effect.execute_effect(event)
      -> for attempt in 1..max_attempts:
           strategy = RepairStrategy.for_attempt(attempt)
           -> dispatch fix agent with strategy-specific prompt
           -> git add + commit + push
           -> inbox-wait for CI re-run result (not polling)
           -> if CI passing: inbox notification "repaired" -> exit
           -> if CI failing: rotate strategy, continue loop
      -> if all attempts exhausted: inbox notification "exhausted" -> exit

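The control flow in the diagram can be sketched as a loop over injected callables. This is not the code of node_ci_repair_effect: run_self_heal and its three callbacks are hypothetical, the strategy names are the ones listed under Verification, and any wait result other than "passing" is treated as a failed attempt, which is an assumption about timeout handling.

```python
from typing import Callable, Dict, Union

# Strategy used for attempts 1, 2, and 3 respectively.
STRATEGIES = ("targeted_fix", "broad_lint_fix", "regenerate_and_fix")


def run_self_heal(
    apply_fix: Callable[[str, int], None],   # fix, commit, and push for one attempt
    wait_for_ci: Callable[[], str],          # returns the state of the new CI run
    notify: Callable[[str], None],           # inbox notification
    max_attempts: int = 3,
) -> Dict[str, Union[str, int]]:
    """Rotate strategies until CI passes or the retry budget is spent."""
    if not 1 <= max_attempts <= len(STRATEGIES):
        raise ValueError("max_attempts must be between 1 and 3")
    for attempt in range(1, max_attempts + 1):
        strategy = STRATEGIES[attempt - 1]
        apply_fix(strategy, attempt)
        if wait_for_ci() == "passing":
            notify("repaired")
            return {"status": "repaired", "attempts_used": attempt,
                    "strategy_used": strategy}
    notify("exhausted")
    return {"status": "exhausted", "attempts_used": max_attempts,
            "strategy_used": STRATEGIES[max_attempts - 1]}
```

Passing the side effects in as callables keeps the retry budget and rotation logic testable without touching git or CI.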
Strategy Rotation
Each attempt uses a progressively broader fix strategy:
| Attempt | Strategy | Description |
|---|---|---|
| 1 | targeted_fix | Parse error logs, fix only the specific failing lines/files |
| 2 | broad_lint_fix | Run ruff/mypy auto-fix on all files touched by the PR |
| 3 | regenerate_and_fix | Rewrite failing sections, apply all auto-fixers |
CI Status Extraction
Self-healing mode uses _bin/ci-status.sh instead of the heavier ci-failures/ci-quick-review for fast, structured CI status checks between attempts:

    # Fetch CI status as structured JSON
    ${CLAUDE_PLUGIN_ROOT}/_bin/ci-status.sh --pr 42 --repo OmniNode-ai/omniclaude

Output:

    {
      "status": "failing",
      "pr_number": 42,
      "repo": "OmniNode-ai/omniclaude",
      "branch": "jonah/omn-2829-self-healing-ci",
      "run_id": "12345678",
      "failed_jobs": [
        {
          "job_id": "56174634733",
          "job_name": "lint / ruff",
          "step": "Run ruff check",
          "conclusion": "failure",
          "log_excerpt": "..."
        }
      ],
      "failure_summary": "1 job(s) failed: lint / ruff",
      "fetched_at": "2026-02-26T13:00:00Z"
    }

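A consumer of this output only needs a few of its fields. The sketch below parses the JSON shape shown above into a typed structure; FailedJob and parse_ci_status are hypothetical names, and the field names are taken from the sample output.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FailedJob:
    """Subset of one entry in failed_jobs."""

    job_id: str
    job_name: str
    step: str


def parse_ci_status(raw: str) -> Tuple[str, str, List[FailedJob]]:
    """Return (status, run_id, failed jobs) from ci-status.sh JSON output."""
    data = json.loads(raw)
    jobs = [
        FailedJob(job["job_id"], job["job_name"], job["step"])
        for job in data.get("failed_jobs", [])
    ]
    return data["status"], data["run_id"], jobs
```

The run_id is returned alongside the status because the wait step compares it against the run observed before the push.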
Inbox-Wait Pattern
After each fix attempt pushes code, the pipeline waits for CI results using inbox-wait rather than fixed-interval polling:
- Push fix commit to branch
- Call node_ci_repair_effect.wait_for_ci_rerun() which:
  - Polls _bin/ci-status.sh at 30s intervals
  - Detects when a new run_id appears (different from pre-push run)
  - Waits for terminal state (passing or failing)
  - Times out after 5 minutes (configurable)
- If passing: record success, send inbox notification, exit
- If failing: rotate strategy, begin next attempt
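The wait step can be sketched as follows. The function is deliberately named wait_for_new_run to make clear it is not the real wait_for_ci_rerun: its signature, the injected fetch_status callback, and the "timeout" return value are all assumptions. Only the 30-second interval, the 5-minute default timeout, the run_id comparison, and the two terminal states come from the description above.

```python
import time
from typing import Callable, Dict

TERMINAL_STATES = {"passing", "failing"}


def wait_for_new_run(
    fetch_status: Callable[[], Dict[str, str]],  # e.g. a wrapper around ci-status.sh
    prev_run_id: str,
    interval: float = 30.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Wait until a run other than prev_run_id reaches a terminal state."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        is_new_run = status["run_id"] != prev_run_id
        if is_new_run and status["status"] in TERMINAL_STATES:
            return status["status"]
        if clock() >= deadline:
            return "timeout"  # hypothetical sentinel; the skill does not name one
        sleep(interval)
```

Injecting sleep and clock lets the timeout path be exercised instantly in tests instead of waiting five real minutes.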
Self-Healing Fix Dispatch

    Task(
      subagent_type="onex:polymorphic-agent",
      description="ci-fix-pipeline: self-heal attempt {N}/{max} for PR #{pr_number}",
      prompt="**SELF-HEALING CI REPAIR -- Attempt {N}/{max}**

        Strategy: {strategy_name}
        {strategy_prompt}

        CI Failure Details:
        {failure_json}

        Branch: {branch}
        Repo: {repo}

        Instructions:
        1. Read the failure logs carefully
        2. Apply the fix strategy described above
        3. Stage and commit with message:
           fix(ci): self-heal attempt {N} -- {strategy_name} [{ticket_id}]
        4. Push to the branch

        Report: files changed, fix description, confidence level."
    )

ONEX Node: node_ci_repair_effect
Tier: EVENT_BUS+
Type: Effect node (external I/O)
Location: plugins/onex/hooks/lib/node_ci_repair_effect.py
Provides:
- execute_effect(event) -- Initialize repair run state
- record_attempt_result(run_state, attempt, ...) -- Record attempt outcome
- finalize_with_error(run_state, error) -- Record error and notify
- fetch_ci_status(pr, repo, branch) -- Wrapper around _bin/ci-status.sh
- wait_for_ci_rerun(pr, repo, branch, prev_run_id) -- Inbox-wait for new CI run
- send_inbox_notification(run_state, message) -- Write to ~/.claude/inbox/
- save_repair_state(run_state) / load_repair_state(run_id) -- State persistence
Self-Healing ModelSkillResult
When --self-heal is active, the ModelSkillResult includes additional fields:

    {
      "status": "repaired|exhausted|failed",
      "repair_run_id": "ci-repair-42-1740000000",
      "attempts_used": 2,
      "max_attempts": 3,
      "strategy_used": "broad_lint_fix",
      "fixed_count": 5,
      "skipped_count": 0,
      "commit": "abc1234",
      "branch": "jonah/omn-2829-self-healing-ci",
      "ticket_id": "OMN-2829",
      "inbox_notification_sent": true
    }

Status values (self-healing):
- repaired -- CI fixed within the retry budget
- exhausted -- All attempts used, CI still failing
- failed -- Error during repair (fetch failed, commit failed, etc.)
Verification
To verify self-healing works end-to-end:
- Push a deliberately failing commit (e.g., syntax error, failing test)
- Run: /ci-fix-pipeline --self-heal --pr <N> --ticket-id OMN-2829
- Confirm:
  - Attempt 1 uses targeted_fix strategy
  - If still failing, attempt 2 uses broad_lint_fix
  - If still failing, attempt 3 uses regenerate_and_fix
  - Inbox notification sent on success or exhaustion
  - State persisted to ~/.claude/state/ci-repair/
See Also
- ci-failures skill -- fetch and analyze CI failures (read-only)
- ci-watch skill -- poll CI status and auto-fix (OMN-2523)
- local-review skill -- review and fix local code changes
- ticket-pipeline skill -- end-to-end ticket pipeline (Phase 4 invokes ci-watch)
- _bin/ci-status.sh -- lightweight CI status extraction script
- node_ci_repair_effect -- ONEX effect node for self-healing orchestration
- HandlerSlackWebhook in omnibase_infra -- Slack delivery infrastructure