Skilllibrary long-run-watchdog

install

source · Clone the upstream repo

git clone https://github.com/merceralex397-collab/skilllibrary

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/05-agentic-orchestration-and-autonomy/long-run-watchdog" ~/.claude/skills/merceralex397-collab-skilllibrary-long-run-watchdog && rm -rf "$T"

manifest: 05-agentic-orchestration-and-autonomy/long-run-watchdog/SKILL.md

source content

Purpose

Detect when a long-running agent task has stalled, entered an infinite loop, exhausted resources, or stopped making meaningful progress — then intervene with a graduated recovery policy before the task wastes further compute or produces corrupt output.

When to use

An autonomous agent run has exceeded its expected wall-clock time.
Output logs show repeated identical actions (loop signature).
A heartbeat or progress metric has flatlined for N intervals.
Resource consumption (tokens, API calls, disk writes) is spiking without corresponding output.
The orchestrator needs pre-defined escalation rules for unattended runs.

Do NOT use when

The task is a single prompt/response with no iteration.
The agent completed normally but the user dislikes the result (use review skills instead).
You are actively debugging a failure that already terminated (use
```
multi-agent-debugging
```
).
The run is interactive with a human in the loop providing guidance each step.

Operating procedure

Collect run metadata. Read the agent's task description, expected duration, and any configured timeout values. List them in a summary table:
```
| Field | Value |
```
.
Define heartbeat contract. Set a heartbeat interval (default: 60 s for short tasks, 300 s for long tasks). Record the metric that constitutes "progress" — e.g., new file written, test passed, commit created, lines of output produced.
Scan recent output for loop signatures. Run
```
grep
```
or string-match on the last 200 lines of agent output. Flag if any 5-line block repeats 3+ times consecutively.
Check progress metric delta. Compare the progress metric at T-now vs T-minus-one-interval. If delta is zero for 2 consecutive intervals, mark the run as STALLED.
Check resource burn rate. Count total tokens or API calls consumed in the last interval. If burn rate exceeds 3× the average of prior intervals with zero output delta, mark as RUNAWAY.
Apply recovery action based on severity:
- STALLED → Send a nudge prompt: restate the goal and last successful checkpoint.
- RUNAWAY → Pause the agent, snapshot current state, and escalate to the orchestrator.
- LOOP_DETECTED → Inject a "break the pattern" prompt with an alternative approach suggestion.
- TIMEOUT → Terminate the run, preserve artifacts, and write a summary of progress so far.

Log the intervention. Write a structured record:

{ timestamp, run_id, condition, action_taken, outcome }

Verify recovery. After a nudge or pattern-break, wait one full interval and re-run steps 3-5. If the condition persists after 2 recovery attempts, escalate to TIMEOUT.
Produce a watchdog report. Summarize: total run time, intervals monitored, conditions detected, actions taken, final status (recovered / terminated / escalated).

Decision rules

Never terminate a run on the first stall detection; always attempt at least one nudge first.
If the agent has produced partial valuable output, snapshot it before any destructive action.
A loop is only confirmed when the same action sequence repeats 3+ times — not 2.
Resource burn rate thresholds are relative to the run's own history, not absolute numbers.
Escalation to a human or parent orchestrator is mandatory after 2 failed recovery attempts.

Output requirements

Watchdog Status — one of: HEALTHY, STALLED, RUNAWAY, LOOP_DETECTED, TIMEOUT, RECOVERED.
Heartbeat Log — table of intervals with progress metric values.
Intervention Record — each action taken with timestamp and outcome.
Recovery Recommendation — if terminated, a concrete suggestion for what to change on retry.

References

```
references/checkpoint-rules.md
```
— how to snapshot agent state.
```
references/failure-escalation.md
```
— escalation chain definitions.
```
references/delegate-contracts.md
```
— expected heartbeat clauses in delegation.

Related skills

```
multi-agent-debugging
```
— for post-mortem analysis after a watchdog termination.
```
verification-before-advance
```
— for quality gates that prevent runaway progress.
```
human-interrupt-handling
```
— for escalations that reach a human operator.
```
autonomous-run-control
```
— for broader run lifecycle management.

Failure handling

If agent output is inaccessible (no logs), treat as STALLED immediately and escalate.
If the heartbeat metric is undefined, fall back to wall-clock time with a 2× expected-duration threshold.
If recovery actions themselves fail (e.g., prompt injection rejected), terminate and preserve all state.
If multiple agents share a run, isolate the stalled agent before intervening to avoid cascade disruption.