Gsd-skill-creator witness-observer
PMU observation pattern for agent health monitoring. Runs patrol loops, detects stalled agents, sends nudges, and escalates persistent failures to the mayor. Never modifies agent work or resolves conflicts.
git clone https://github.com/Tibsfox/gsd-skill-creator
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/orchestration/witness-observer" ~/.claude/skills/tibsfox-gsd-skill-creator-witness-observer && rm -rf "$T"
examples/skills/orchestration/witness-observer/SKILL.mdWitness Observer
Per-rig observer that monitors polecat health and reports anomalies. The witness is the PMU (Performance Monitoring Unit) of the Gastown chipset -- it watches execution units for stalls, detects degraded performance, and raises alerts without interfering with computation. The witness is strictly read-only with respect to agent work. It observes and reports; it never modifies.
Activation Triggers
This skill activates when:
- The agent is assigned to monitor a rig's worker agents
- Multiple polecats are running and health monitoring is needed
- Stall detection is required for long-running work items
- The mayor needs a supervisory agent to watch active polecats
Core Capabilities
Patrol Loop
The witness runs a periodic patrol that checks all active agents in its rig for health indicators.
Patrol cycle:
SCAN EVALUATE ACT WAIT | | | | v v v v list agents -> check each -> nudge/escalate -> sleep interval (active ones) for stalls if needed (default 5 min)
Implementation:
const state = new StateManager({ stateDir: '.chipset/state/' }); const patrolInterval = 5 * 60 * 1000; // 5 minutes (configurable) const stallThreshold = 30 * 60 * 1000; // 30 minutes (configurable) async function patrol(): Promise<void> { // Get all agents that should be working const agents = await state.listAgents({ role: 'polecat' }); const active = agents.filter(a => a.status === 'active'); for (const agent of active) { const hook = await state.getHook(agent.id); if (!hook || hook.status !== 'active') continue; // Check last activity timestamp const lastActivity = new Date(hook.lastActivity).getTime(); const elapsed = Date.now() - lastActivity; if (elapsed > stallThreshold) { await handleStall(agent, hook, elapsed); } } }
Stall Detection
A stall is detected when an agent has hooked work but has not updated its activity timestamp within the threshold period (default 30 minutes).
Stall indicators:
| Indicator | What It Means |
|---|---|
| Hook active, no activity for 30+ min | Agent may be stuck, crashed, or idle |
| Agent status is 'active' but hook timestamp stale | Session may have ended without cleanup |
| Multiple consecutive patrol cycles with no change | Persistent stall, needs escalation |
Stall classification:
type StallSeverity = 'warning' | 'alert' | 'critical'; function classifyStall(elapsed: number, nudgesSent: number): StallSeverity { if (nudgesSent >= 2) return 'critical'; // Nudged twice, still stalled if (elapsed > 60 * 60 * 1000) return 'alert'; // Over 1 hour return 'warning'; // First detection }
Nudge Protocol
When a stall is detected, the witness follows a graduated escalation protocol.
Step 1 -- Send nudge to stalled agent:
async function handleStall( agent: AgentIdentity, hook: HookState, elapsed: number ): Promise<void> { const severity = classifyStall(elapsed, getNudgeCount(agent.id)); if (severity === 'warning') { // First nudge: ask agent if it's still working const nudge: AgentMessage = { from: witnessId, to: agent.id, channel: 'nudge', payload: `HEALTH_CHECK: no activity for ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`, timestamp: new Date().toISOString(), durable: false, }; // Write nudge file recordNudge(agent.id); return; } if (severity === 'alert' || severity === 'critical') { // Escalate to mayor await escalateToMayor(agent, hook, severity, elapsed); } }
Step 2 -- Wait for response (next patrol cycle):
If the agent responds to the nudge (updates its hook activity timestamp or sends mail), the stall is resolved. No further action needed.
Step 3 -- Escalate if unresolved:
async function escalateToMayor( agent: AgentIdentity, hook: HookState, severity: StallSeverity, elapsed: number ): Promise<void> { const escalation: AgentMessage = { from: witnessId, to: 'mayor', channel: 'mail', payload: `STALL_${severity.toUpperCase()}: ${agent.id} idle ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`, timestamp: new Date().toISOString(), durable: true, }; // Write escalation to .chipset/state/mail/mayor/{timestamp}-{witnessId}.json }
Health Reporting
The witness provides aggregate health summaries when queried by the mayor.
interface RigHealthReport { rigName: string; timestamp: string; totalAgents: number; activeAgents: number; stalledAgents: number; idleAgents: number; terminatedAgents: number; stalledDetails: Array<{ agentId: string; beadId: string; stalledMinutes: number; nudgesSent: number; }>; }
Communication Protocol
Messages the Witness SENDS
| Channel | Target | Purpose | Durability |
|---|---|---|---|
| Stalled polecats | "Are you still working?" health check | Non-durable |
| Mayor | Stall alerts (warning, alert, critical) | Durable |
| Mayor | Health report summaries | Durable |
Messages the Witness RECEIVES
| Channel | Source | Content |
|---|---|---|
| Mayor | Instructions (adjust thresholds, focus on specific agent) |
| Polecats | Status responses to nudges |
Error Handling
False Positive Stalls
If an agent is working but updates are slow (large commits, long test runs), the witness may detect a false positive. The nudge protocol handles this: the agent responds to the nudge, and the witness records the response as activity.
Witness Restart
If the witness itself restarts, it resumes patrol from scratch. It reads current agent and hook state from the filesystem -- there is no witness-specific state that needs recovery. The patrol loop is stateless between cycles.
Unresponsive Agent
If an agent does not respond to two nudges across two patrol cycles, the witness sends a
critical escalation to the mayor. The mayor decides whether to terminate and replace the agent.
Boundary: What the Witness Does NOT Do
The witness NEVER:
- Modifies agent work -- does not edit files, change branches, or alter code
- Resolves conflicts -- conflict resolution is outside the observer's scope
- Terminates agents -- only the mayor can terminate; the witness recommends
- Reassigns work -- hook management belongs to the mayor
- Changes agent status -- the witness reads status but does not write it (except its own)
- Runs tests or builds -- the witness observes; it does not validate output quality
The witness is a sensor. It detects anomalies and reports them. It does not act on them.
Integration with Other Gastown Skills
| Skill | Relationship |
|---|---|
| Witness reports stalls and health TO mayor |
| Witness monitors polecat health, sends nudges |
| Witness can observe refinery queue depth and merge failures |
| Witness reads state via StateManager (read-only) |
References
-- How this pattern derives from Gastown's witness.go patrolreferences/gastown-origin.md
-- Read-only constraints and observation-only scopereferences/boundaries.md