gsd-skill-creator / witness-observer

PMU observation pattern for agent health monitoring. Runs patrol loops, detects stalled agents, sends nudges, and escalates persistent failures to the mayor. Never modifies agent work or resolves conflicts.

Install

Source · Clone the upstream repo:

git clone https://github.com/Tibsfox/gsd-skill-creator

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/orchestration/witness-observer" ~/.claude/skills/tibsfox-gsd-skill-creator-witness-observer && rm -rf "$T"

Manifest: examples/skills/orchestration/witness-observer/SKILL.md

Source Content

Witness Observer

Per-rig observer that monitors polecat health and reports anomalies. The witness is the PMU (Performance Monitoring Unit) of the Gastown chipset -- it watches execution units for stalls, detects degraded performance, and raises alerts without interfering with computation. The witness is strictly read-only with respect to agent work. It observes and reports; it never modifies.

Activation Triggers

This skill activates when:

  • The agent is assigned to monitor a rig's worker agents
  • Multiple polecats are running and health monitoring is needed
  • Stall detection is required for long-running work items
  • The mayor needs a supervisory agent to watch active polecats

Core Capabilities

Patrol Loop

The witness runs a periodic patrol that checks all active agents in its rig for health indicators.

Patrol cycle:

SCAN            EVALUATE         ACT              WAIT
  |                |               |                |
  v                v               v                v
list agents -> check each  ->  nudge/escalate -> sleep interval
(active ones)   for stalls     if needed         (default 5 min)

Implementation:

// StateManager and the agent/hook types come from the beads-state layer
// (the witness holds read-only access to .chipset/state/).
const state = new StateManager({ stateDir: '.chipset/state/' });
const patrolInterval = 5 * 60 * 1000; // 5 minutes (configurable)
const stallThreshold = 30 * 60 * 1000; // 30 minutes (configurable)

async function patrol(): Promise<void> {
  // Get all agents that should be working
  const agents = await state.listAgents({ role: 'polecat' });
  const active = agents.filter(a => a.status === 'active');

  for (const agent of active) {
    const hook = await state.getHook(agent.id);
    if (!hook || hook.status !== 'active') continue;

    // Check last activity timestamp
    const lastActivity = new Date(hook.lastActivity).getTime();
    const elapsed = Date.now() - lastActivity;

    if (elapsed > stallThreshold) {
      await handleStall(agent, hook, elapsed);
    }
  }
}
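
The skill shows only a single cycle; a minimal sketch of the outer loop follows, assuming sequential cycles are wanted so a slow patrol never overlaps the next (runWitness is a hypothetical entry point, not part of the skill):

// Run patrols forever, sleeping patrolInterval between cycles. A sequential
// loop (rather than setInterval) guarantees cycles never overlap even when
// one patrol runs long.
async function runWitness(): Promise<void> {
  for (;;) {
    try {
      await patrol();
    } catch (err) {
      console.error('patrol cycle failed:', err); // observe, log, keep going
    }
    await new Promise(resolve => setTimeout(resolve, patrolInterval));
  }
}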

Stall Detection

A stall is detected when an agent has hooked work but has not updated its activity timestamp within the threshold period (default 30 minutes).

Stall indicators:

Indicator                                         | What It Means
Hook active, no activity for 30+ min              | Agent may be stuck, crashed, or idle
Agent status is 'active' but hook timestamp stale | Session may have ended without cleanup
Multiple consecutive patrol cycles with no change | Persistent stall, needs escalation

Stall classification:

type StallSeverity = 'warning' | 'alert' | 'critical';

function classifyStall(elapsed: number, nudgesSent: number): StallSeverity {
  if (nudgesSent >= 2) return 'critical';       // Nudged twice, still stalled
  if (elapsed > 60 * 60 * 1000) return 'alert'; // Over 1 hour
  return 'warning';                             // First detection
}

Nudge Protocol

When a stall is detected, the witness follows a graduated escalation protocol.

Step 1 -- Send nudge to stalled agent:

async function handleStall(
  agent: AgentIdentity,
  hook: HookState,
  elapsed: number
): Promise<void> {
  const severity = classifyStall(elapsed, getNudgeCount(agent.id));

  if (severity === 'warning') {
    // First nudge: ask agent if it's still working
    const nudge: AgentMessage = {
      from: witnessId,
      to: agent.id,
      channel: 'nudge',
      payload: `HEALTH_CHECK: no activity for ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`,
      timestamp: new Date().toISOString(),
      durable: false,
    };
    // Write nudge file
    recordNudge(agent.id);
    return;
  }

  if (severity === 'alert' || severity === 'critical') {
    // Escalate to mayor
    await escalateToMayor(agent, hook, severity, elapsed);
  }
}
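
The getNudgeCount and recordNudge helpers used above are not defined by the skill; a minimal in-memory sketch (losing the counters on a witness restart just means one extra warning-level nudge, consistent with the stateless-restart behavior described under Error Handling):

// Hypothetical per-agent nudge counters, held in memory only.
const nudgeCounts = new Map<string, number>();

function getNudgeCount(agentId: string): number {
  return nudgeCounts.get(agentId) ?? 0;
}

function recordNudge(agentId: string): void {
  nudgeCounts.set(agentId, getNudgeCount(agentId) + 1);
}

function clearNudges(agentId: string): void {
  nudgeCounts.delete(agentId); // reset once the agent shows fresh activity
}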

Step 2 -- Wait for response (next patrol cycle):

If the agent responds to the nudge (updates its hook activity timestamp or sends mail), the stall is resolved. No further action needed.

Step 3 -- Escalate if unresolved:

async function escalateToMayor(
  agent: AgentIdentity,
  hook: HookState,
  severity: StallSeverity,
  elapsed: number
): Promise<void> {
  const escalation: AgentMessage = {
    from: witnessId,
    to: 'mayor',
    channel: 'mail',
    payload: `STALL_${severity.toUpperCase()}: ${agent.id} idle ${Math.floor(elapsed / 60000)}m on ${hook.workItem?.beadId}`,
    timestamp: new Date().toISOString(),
    durable: true,
  };
  // Write escalation to .chipset/state/mail/mayor/{timestamp}-{witnessId}.json
}
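
The file write itself is elided above; a minimal sketch under the on-disk layout named in that comment, with ISO-timestamp characters sanitized so the filename is portable (the AgentMessage shape is sketched under Communication Protocol below):

import { promises as fs } from 'fs';
import * as path from 'path';

// Durable mail is a JSON file dropped into the recipient's mailbox
// directory: .chipset/state/mail/{recipient}/{timestamp}-{sender}.json.
async function writeMail(msg: AgentMessage): Promise<void> {
  const dir = path.join('.chipset', 'state', 'mail', msg.to);
  await fs.mkdir(dir, { recursive: true });
  const stamp = msg.timestamp.replace(/[:.]/g, '-'); // filesystem-safe name
  await fs.writeFile(
    path.join(dir, `${stamp}-${msg.from}.json`),
    JSON.stringify(msg, null, 2)
  );
}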

Health Reporting

The witness provides aggregate health summaries when queried by the mayor.

interface RigHealthReport {
  rigName: string;
  timestamp: string;
  totalAgents: number;
  activeAgents: number;
  stalledAgents: number;
  idleAgents: number;
  terminatedAgents: number;
  stalledDetails: Array<{
    agentId: string;
    beadId: string;
    stalledMinutes: number;
    nudgesSent: number;
  }>;
}
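
A sketch of how such a report might be assembled from the same StateManager reads the patrol uses, assuming agent status values 'active', 'idle', and 'terminated' to match the report fields, and falling back to 'unknown' when a hook carries no bead id:

// Aggregate current agent and hook state into a RigHealthReport. An agent
// counts as stalled when its hook is active but its last activity is older
// than stallThreshold.
async function buildHealthReport(rigName: string): Promise<RigHealthReport> {
  const agents = await state.listAgents({ role: 'polecat' });
  const stalledDetails: RigHealthReport['stalledDetails'] = [];

  for (const agent of agents.filter(a => a.status === 'active')) {
    const hook = await state.getHook(agent.id);
    if (!hook || hook.status !== 'active') continue;
    const elapsed = Date.now() - new Date(hook.lastActivity).getTime();
    if (elapsed > stallThreshold) {
      stalledDetails.push({
        agentId: agent.id,
        beadId: hook.workItem?.beadId ?? 'unknown',
        stalledMinutes: Math.floor(elapsed / 60000),
        nudgesSent: getNudgeCount(agent.id),
      });
    }
  }

  return {
    rigName,
    timestamp: new Date().toISOString(),
    totalAgents: agents.length,
    activeAgents: agents.filter(a => a.status === 'active').length,
    stalledAgents: stalledDetails.length,
    idleAgents: agents.filter(a => a.status === 'idle').length,
    terminatedAgents: agents.filter(a => a.status === 'terminated').length,
    stalledDetails,
  };
}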

Communication Protocol

Messages the Witness SENDS

Channel | Target           | Purpose                                  | Durability
nudge   | Stalled polecats | "Are you still working?" health check    | Non-durable
mail    | Mayor            | Stall alerts (warning, alert, critical)  | Durable
mail    | Mayor            | Health report summaries                  | Durable

Messages the Witness RECEIVES

Channel | Source   | Content
mail    | Mayor    | Instructions (adjust thresholds, focus on specific agent)
mail    | Polecats | Status responses to nudges
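
The AgentMessage envelope used throughout this skill is never declared here; a sketch inferred from the fields the nudge and escalation examples actually set:

// Inferred message shape: every field appears in the nudge or escalation
// construction above. 'durable' distinguishes mail (persisted) from nudges.
interface AgentMessage {
  from: string;              // sender agent id (e.g. the witness)
  to: string;                // recipient agent id, or 'mayor'
  channel: 'nudge' | 'mail';
  payload: string;
  timestamp: string;         // ISO 8601
  durable: boolean;
}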

Error Handling

False Positive Stalls

If an agent is working but updates are slow (large commits, long test runs), the witness may detect a false positive. The nudge protocol handles this: the agent responds to the nudge, and the witness records the response as activity.

Witness Restart

If the witness itself restarts, it resumes patrol from scratch. It reads current agent and hook state from the filesystem -- there is no witness-specific state that needs recovery. The patrol loop is stateless between cycles.

Unresponsive Agent

If an agent does not respond to two nudges across two patrol cycles, the witness sends a critical escalation to the mayor. The mayor decides whether to terminate and replace the agent.

Boundary: What the Witness Does NOT Do

The witness NEVER:

  • Modifies agent work -- does not edit files, change branches, or alter code
  • Resolves conflicts -- conflict resolution is outside the observer's scope
  • Terminates agents -- only the mayor can terminate; the witness recommends
  • Reassigns work -- hook management belongs to the mayor
  • Changes agent status -- the witness reads status but does not write it (except its own)
  • Runs tests or builds -- the witness observes; it does not validate output quality

The witness is a sensor. It detects anomalies and reports them. It does not act on them.

Integration with Other Gastown Skills

Skill             | Relationship
mayor-coordinator | Witness reports stalls and health TO mayor
polecat-worker    | Witness monitors polecat health, sends nudges
refinery-merge    | Witness can observe refinery queue depth and merge failures
beads-state       | Witness reads state via StateManager (read-only)

References

  • references/gastown-origin.md
    -- How this pattern derives from Gastown's witness.go patrol
  • references/boundaries.md
    -- Read-only constraints and observation-only scope