skilllibrary · long-run-watchdog
Install
Source · Clone the upstream repo:
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/05-agentic-orchestration-and-autonomy/long-run-watchdog" ~/.claude/skills/merceralex397-collab-skilllibrary-long-run-watchdog && rm -rf "$T"
Manifest: 05-agentic-orchestration-and-autonomy/long-run-watchdog/SKILL.md
Purpose
Detect when a long-running agent task has stalled, entered an infinite loop, exhausted resources, or stopped making meaningful progress — then intervene with a graduated recovery policy before the task wastes further compute or produces corrupt output.
When to use
- An autonomous agent run has exceeded its expected wall-clock time.
- Output logs show repeated identical actions (loop signature).
- A heartbeat or progress metric has flatlined for N intervals.
- Resource consumption (tokens, API calls, disk writes) is spiking without corresponding output.
- The orchestrator needs pre-defined escalation rules for unattended runs.
Do NOT use when
- The task is a single prompt/response with no iteration.
- The agent completed normally but the user dislikes the result (use review skills instead).
- You are actively debugging a failure that already terminated (use multi-agent-debugging).
- The run is interactive with a human in the loop providing guidance each step.
Operating procedure
- Collect run metadata. Read the agent's task description, expected duration, and any configured timeout values. List them in a summary table: | Field | Value |.
- Define heartbeat contract. Set a heartbeat interval (default: 60 s for short tasks, 300 s for long tasks). Record the metric that constitutes "progress" — e.g., new file written, test passed, commit created, lines of output produced.
- Scan recent output for loop signatures. Run grep or string-match on the last 200 lines of agent output. Flag if any 5-line block repeats 3+ times consecutively.
- Check progress metric delta. Compare the progress metric at T-now vs T-minus-one-interval. If delta is zero for 2 consecutive intervals, mark the run as STALLED.
- Check resource burn rate. Count total tokens or API calls consumed in the last interval. If burn rate exceeds 3× the average of prior intervals with zero output delta, mark as RUNAWAY.
- Apply recovery action based on severity:
- STALLED → Send a nudge prompt: restate the goal and last successful checkpoint.
- RUNAWAY → Pause the agent, snapshot current state, and escalate to the orchestrator.
- LOOP_DETECTED → Inject a "break the pattern" prompt with an alternative approach suggestion.
- TIMEOUT → Terminate the run, preserve artifacts, and write a summary of progress so far.
- Log the intervention. Write a structured record: { timestamp, run_id, condition, action_taken, outcome }.
- Verify recovery. After a nudge or pattern-break, wait one full interval and re-run steps 3-5. If the condition persists after 2 recovery attempts, escalate to TIMEOUT.
- Produce a watchdog report. Summarize: total run time, intervals monitored, conditions detected, actions taken, final status (recovered / terminated / escalated).
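The detection steps above (loop-signature scan, progress-delta check, burn-rate check) can be sketched in Python. This is a minimal illustration under the defaults stated in the procedure; `detect_loop` and `classify` are hypothetical helper names, not part of the skill's interface:

```python
def detect_loop(output_lines, block=5, min_repeats=3, tail=200):
    """Flag a loop signature: any `block`-line window that repeats
    back-to-back `min_repeats`+ times in the last `tail` lines."""
    lines = output_lines[-tail:]
    for start in range(len(lines) - block * min_repeats + 1):
        window = lines[start:start + block]
        # Same block repeated consecutively min_repeats times?
        if all(lines[start + k * block:start + (k + 1) * block] == window
               for k in range(1, min_repeats)):
            return window
    return None

def classify(progress, burn):
    """progress: cumulative progress metric per interval, most recent last.
    burn: tokens/API calls consumed per interval, most recent last."""
    # STALLED: progress delta is zero for 2 consecutive intervals.
    if len(progress) >= 3 and progress[-1] == progress[-2] == progress[-3]:
        prior = burn[:-1]
        # RUNAWAY: burn rate > 3x the average of prior intervals
        # while the output delta is zero.
        if prior and burn[-1] > 3 * (sum(prior) / len(prior)):
            return "RUNAWAY"
        return "STALLED"
    return "HEALTHY"
```

Note that `classify` compares the burn rate only against the run's own prior intervals, matching the decision rule that thresholds are relative, not absolute.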
Decision rules
- Never terminate a run on the first stall detection; always attempt at least one nudge first.
- If the agent has produced partial valuable output, snapshot it before any destructive action.
- A loop is only confirmed when the same action sequence repeats 3+ times — not 2.
- Resource burn rate thresholds are relative to the run's own history, not absolute numbers.
- Escalation to a human or parent orchestrator is mandatory after 2 failed recovery attempts.
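The rules above can be condensed into a small policy function. The action names and the `next_action` helper below are illustrative assumptions, sketched only to make the ordering concrete (nudge before terminate, snapshot before destructive action, mandatory escalation after 2 failed attempts):

```python
# Severity -> first-line recovery action, per the operating procedure.
RECOVERY = {
    "STALLED": "nudge",                   # restate goal + last checkpoint
    "LOOP_DETECTED": "pattern_break",     # inject an alternative approach
    "RUNAWAY": "pause_and_escalate",      # pause, snapshot, escalate
    "TIMEOUT": "terminate",               # preserve artifacts, summarize
}

def next_action(condition, failed_attempts, has_partial_output):
    """Return the ordered list of actions the watchdog should take next."""
    if failed_attempts >= 2:
        # Escalation is mandatory after 2 failed recovery attempts.
        return ["snapshot", "escalate"] if has_partial_output else ["escalate"]
    action = RECOVERY[condition]
    # Snapshot partial valuable output before any destructive action.
    if action in ("pause_and_escalate", "terminate") and has_partial_output:
        return ["snapshot", action]
    return [action]
```

A first-time stall therefore always yields a nudge, never a termination, satisfying the first decision rule by construction.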
Output requirements
- Watchdog Status — one of: HEALTHY, STALLED, RUNAWAY, LOOP_DETECTED, TIMEOUT, RECOVERED.
- Heartbeat Log — table of intervals with progress metric values.
- Intervention Record — each action taken with timestamp and outcome.
- Recovery Recommendation — if terminated, a concrete suggestion for what to change on retry.
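One possible shape for these outputs, assuming a JSON report; the field names and values below are illustrative, not mandated by the skill:

```python
import json

# Hypothetical watchdog report covering all four required outputs.
report = {
    "watchdog_status": "RECOVERED",
    "heartbeat_log": [
        {"interval": 1, "progress_metric": 3},
        {"interval": 2, "progress_metric": 3},  # flatlined -> STALLED
        {"interval": 3, "progress_metric": 7},  # resumed after nudge
    ],
    "interventions": [
        {"timestamp": "2025-01-01T00:05:00Z", "run_id": "run-42",
         "condition": "STALLED", "action_taken": "nudge",
         "outcome": "resumed"},
    ],
    # Only populated when the run was terminated.
    "recovery_recommendation": None,
}
print(json.dumps(report, indent=2))
```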
References
references/checkpoint-rules.md — how to snapshot agent state.
references/failure-escalation.md — escalation chain definitions.
references/delegate-contracts.md — expected heartbeat clauses in delegation.
Related skills
multi-agent-debugging — for post-mortem analysis after a watchdog termination.
verification-before-advance — for quality gates that prevent runaway progress.
human-interrupt-handling — for escalations that reach a human operator.
autonomous-run-control — for broader run lifecycle management.
Failure handling
- If agent output is inaccessible (no logs), treat as STALLED immediately and escalate.
- If the heartbeat metric is undefined, fall back to wall-clock time with a 2× expected-duration threshold.
- If recovery actions themselves fail (e.g., prompt injection rejected), terminate and preserve all state.
- If multiple agents share a run, isolate the stalled agent before intervening to avoid cascade disruption.
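The wall-clock fallback in the second rule might look like this minimal sketch (`fallback_check` is a hypothetical helper, not defined by the skill):

```python
import time

def fallback_check(start_time, expected_duration_s, now=None):
    """With no heartbeat metric defined, fall back to wall-clock time:
    mark TIMEOUT once the run exceeds 2x its expected duration."""
    if now is None:
        now = time.time()
    elapsed = now - start_time
    return "TIMEOUT" if elapsed > 2 * expected_duration_s else "HEALTHY"
```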