Software_development_department agent-health

Reads production/traces/agent-metrics.jsonl and displays a per-agent performance summary table for the current or a specified session. Highlights agents with high error rates or OPEN circuit breaker state.

install
source · Clone the upstream repo
git clone https://github.com/tranhieutt/software_development_department
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/tranhieutt/software_development_department "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/agent-health" ~/.claude/skills/tranhieutt-software-development-department-agent-health && rm -rf "$T"
manifest: .claude/skills/agent-health/SKILL.md
source content

Agent Health

Display a performance summary table from production/traces/agent-metrics.jsonl, cross-referenced with production/session-state/circuit-state.json for live circuit breaker states.

Steps

1. Parse arguments

Flag                Default         Description
──────────────────────────────────────────────────────────────────────────────
--session <branch>  current branch  Filter entries by session field
--agent <name>      all             Show only this agent
--since <date>      no limit        Only entries with date >= YYYY-MM-DD
--log               false           If set, append a fresh metrics snapshot to agent-metrics.jsonl

Get the current branch:

git branch --show-current
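The flags above map onto a small argument parser. What follows is a minimal Python sketch under stated assumptions, not the skill's own implementation (the skill is a set of instructions carried out by the agent). The function names parse_args and current_branch are hypothetical. A --session value of None is taken to mean "resolve to the current branch later".

```python
import argparse
import subprocess
from datetime import date


def current_branch() -> str:
    """Return the checked-out branch name via `git branch --show-current`."""
    result = subprocess.run(
        ["git", "branch", "--show-current"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the four documented flags; defaults mirror the table above."""
    parser = argparse.ArgumentParser(prog="agent-health")
    # None here means "use current_branch()" once the caller needs it.
    parser.add_argument("--session", metavar="<branch>", default=None)
    # None means "all agents".
    parser.add_argument("--agent", metavar="<name>", default=None)
    # Parsed as an ISO date such as 2026-04-09; None means "no limit".
    parser.add_argument("--since", metavar="<date>",
                        type=date.fromisoformat, default=None)
    parser.add_argument("--log", action="store_true")
    return parser.parse_args(argv)
```

A caller would typically write `args = parse_args(sys.argv[1:])` and then `session = args.session or current_branch()`, so that git is invoked only when no branch was given.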

2. Read data sources

Read both files in parallel:

  • production/traces/agent-metrics.jsonl: historical metrics per agent per session
  • production/session-state/circuit-state.json: live circuit breaker states

If agent-metrics.jsonl contains only the schema header line (no actual entries):

📭 No agent metrics recorded yet for this session.
   Metrics are written when agents use /agent-health --log
   or at the end of a session via /save-state.

Circuit breaker states (live):
[show table from circuit-state.json only]
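Loading the two sources and detecting the "header only" case could look roughly like the Python sketch below. The text does not show the format of the schema header line, so this sketch assumes a metric entry is any line that parses to a JSON object with an "agent" key, and that everything else (header, blank lines, malformed lines) can be skipped. The layout of circuit-state.json is not shown either, so it is returned as parsed, without interpretation. The function names are hypothetical, and the files are read one after the other here rather than in parallel.

```python
import json
from pathlib import Path

METRICS_PATH = Path("production/traces/agent-metrics.jsonl")
CIRCUIT_PATH = Path("production/session-state/circuit-state.json")


def parse_metrics(text: str) -> list[dict]:
    """Parse JSONL text, keeping only lines that look like metric entries.

    Assumption: the schema header is NOT a JSON object with an "agent" key.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the report
        if isinstance(record, dict) and "agent" in record:
            entries.append(record)
    return entries


def load_sources(metrics_path: Path = METRICS_PATH,
                 circuit_path: Path = CIRCUIT_PATH):
    """Return (entries, circuit_data); a missing file yields an empty result."""
    entries = []
    if metrics_path.exists():
        entries = parse_metrics(metrics_path.read_text(encoding="utf-8"))
    circuit = {}
    if circuit_path.exists():
        circuit = json.loads(circuit_path.read_text(encoding="utf-8"))
    return entries, circuit
```

An empty `entries` list then corresponds to the "no agent metrics recorded yet" message, with only the circuit breaker table shown.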

3. Aggregate metrics

For each agent, compute across the filtered entries:

  • total_tasks = tasks_completed + tasks_failed + tasks_blocked
  • success_rate = tasks_completed / total_tasks * 100 (0 if no tasks)
  • error_rate = latest error_rate field value
  • circuit_state = from circuit-state.json (live, not from log)
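The four definitions above can be sketched in Python as follows. This is one reading of the text, not a confirmed implementation. It assumes that the counters are summed across an agent's filtered entries, that entries arrive in file order so the last error_rate seen is the latest, and that circuit states have already been extracted into a plain {agent: state} mapping. The function name aggregate and the "UNKNOWN" placeholder for agents missing from that mapping are hypothetical.

```python
from collections import defaultdict

COUNT_FIELDS = ("tasks_completed", "tasks_failed", "tasks_blocked")


def aggregate(entries: list[dict], circuit_states: dict[str, str]) -> dict[str, dict]:
    """Summarize filtered log entries per agent."""
    totals = defaultdict(lambda: {"tasks_completed": 0, "tasks_failed": 0,
                                  "tasks_blocked": 0, "error_rate": None})
    for entry in entries:
        row = totals[entry["agent"]]
        for field in COUNT_FIELDS:
            row[field] += entry.get(field, 0)
        if "error_rate" in entry:
            row["error_rate"] = entry["error_rate"]  # later entries overwrite

    summary = {}
    for agent, row in totals.items():
        total = sum(row[field] for field in COUNT_FIELDS)
        # Guard against division by zero: success_rate is 0 if no tasks.
        success = row["tasks_completed"] / total * 100 if total else 0.0
        summary[agent] = {
            **row,
            "total_tasks": total,
            "success_rate": round(success, 1),
            # Live state, not the circuit_state stored in the log entry.
            "circuit_state": circuit_states.get(agent, "UNKNOWN"),
        }
    return summary
```

With the qa-tester figures from the sample report (4 completed, 2 failed, 0 blocked), this gives total_tasks = 6 and success_rate = 66.7.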

4. Render health table

🏥 Agent Health Report — session: <branch> · <date range>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Agent                  Tasks  ✅ Done  ❌ Failed  ⛔ Blocked  Success%  Circuit
──────────────────────────────────────────────────────────────────────────────
backend-developer          8       7          1          0      87.5%   🟢 CLOSED
frontend-developer         5       5          0          0     100.0%   🟢 CLOSED
qa-tester                  6       4          2          0      66.7%   🟡 HALF-OPEN
data-engineer              2       2          0          0     100.0%   🟢 CLOSED
investigator               1       0          1          0       0.0%   🔴 OPEN
──────────────────────────────────────────────────────────────────────────────
TOTAL                     22      18          4          0      81.8%

⚠️  Agents needing attention:
  🔴 investigator     — Circuit OPEN · fallback: solver
  🟡 qa-tester        — Circuit HALF-OPEN · 2 failures this session

Circuit state icons:

  • 🟢 CLOSED: healthy
  • 🟡 HALF-OPEN: recovering, monitor closely
  • 🔴 OPEN: bypassed, routed to fallback
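Formatting one row of the report with its state icon might be done as in the Python sketch below. The column widths only approximate the sample output and are not taken from any specification. The function name format_row is hypothetical, and the row is assumed to be a dict carrying the per-agent values described in step 3.

```python
CIRCUIT_ICONS = {"CLOSED": "🟢", "HALF-OPEN": "🟡", "OPEN": "🔴"}


def format_row(agent: str, row: dict) -> str:
    """Format one table line: name, four counters, success %, circuit state."""
    # Unknown states get no icon rather than raising.
    icon = CIRCUIT_ICONS.get(row["circuit_state"], "")
    return (
        f"{agent:<22}{row['total_tasks']:>6}{row['tasks_completed']:>8}"
        f"{row['tasks_failed']:>11}{row['tasks_blocked']:>11}"
        f"{row['success_rate']:>9.1f}%   {icon} {row['circuit_state']}"
    )
```

Emoji are usually rendered two cells wide in terminals, so exact alignment of any column placed after an icon would need a width-aware library; the icon is kept in the last column here to sidestep that.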

Flag an agent as needing attention if any of the following holds:

  • circuit_state is OPEN or HALF-OPEN
  • success_rate < 70%
  • tasks_failed >= 2
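The three conditions can be expressed as a single check that also reports why an agent was flagged. This Python sketch assumes a per-agent row dict with circuit_state, success_rate (in percent), and tasks_failed keys. The function name attention_reasons and the wording of the reason strings are hypothetical.

```python
def attention_reasons(row: dict) -> list[str]:
    """Return the reasons an agent row is flagged; an empty list means healthy."""
    reasons = []
    if row["circuit_state"] in ("OPEN", "HALF-OPEN"):
        reasons.append(f"Circuit {row['circuit_state']}")
    if row["success_rate"] < 70:
        reasons.append(f"success rate {row['success_rate']:.1f}%")
    if row["tasks_failed"] >= 2:
        reasons.append(f"{row['tasks_failed']} failures this session")
    return reasons
```

Read literally, an agent with no tasks has a success_rate of 0 and is therefore flagged by the second condition. The text does not say whether that is intended, so a real implementation may want to skip agents with zero tasks.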

5. Log snapshot (if --log)

If the --log flag was passed, append one entry per active agent to production/traces/agent-metrics.jsonl:

{"date":"<YYYY-MM-DD>","session":"<branch>","agent":"<agent>","tasks_completed":<N>,"tasks_failed":<N>,"tasks_blocked":<N>,"avg_tokens_est":<N>,"error_rate":<0.0-1.0>,"circuit_state":"CLOSED|OPEN|HALF-OPEN","notes":"<optional>"}

Get circuit_state from circuit-state.json. If no exact token data is available, estimate avg_tokens_est as the decision ledger entry count × 800 tokens (a rough per-entry figure). Because this value is an estimate, keep the _est suffix on the field name.

Print after logging:

✅ Metrics snapshot logged → production/traces/agent-metrics.jsonl
   [N] agents recorded · <date>
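Building and appending a snapshot line could be sketched in Python as follows. The field order follows the entry template above, and the 800-token fallback follows the text. The text does not define how error_rate is computed, so it is taken as an input here rather than derived. The function names build_snapshot and append_snapshot and their parameters are hypothetical.

```python
import json
from datetime import date
from typing import Iterable, Optional

TOKENS_PER_LEDGER_ENTRY = 800  # rough per-entry estimate given in the text


def build_snapshot(agent: str, session: str, counts: dict, error_rate: float,
                   circuit_state: str, ledger_entries: int,
                   exact_avg_tokens: Optional[int] = None,
                   day: Optional[date] = None, notes: str = "") -> str:
    """Return one JSONL line for `agent`, without a trailing newline."""
    # Fall back to the ledger-based estimate only when no exact figure exists.
    if exact_avg_tokens is not None:
        avg_tokens = exact_avg_tokens
    else:
        avg_tokens = ledger_entries * TOKENS_PER_LEDGER_ENTRY
    entry = {
        "date": (day or date.today()).isoformat(),
        "session": session,
        "agent": agent,
        "tasks_completed": counts["tasks_completed"],
        "tasks_failed": counts["tasks_failed"],
        "tasks_blocked": counts["tasks_blocked"],
        "avg_tokens_est": avg_tokens,
        "error_rate": error_rate,
        "circuit_state": circuit_state,  # taken from circuit-state.json
        "notes": notes,
    }
    return json.dumps(entry, separators=(",", ":"))


def append_snapshot(path: str, lines: Iterable[str]) -> int:
    """Append lines to the metrics file; return how many were written."""
    written = 0
    with open(path, "a", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
            written += 1
    return written
```

The count returned by append_snapshot can fill the [N] placeholder in the confirmation message. Opening the file in append mode leaves existing history and the schema header untouched.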

6. Suggest actions

After the table, if any agents need attention:

💡 Suggested actions:
  • /resume-from <task_id>        — recover failed task checkpoint
  • /trace-history --risk High    — audit high-risk decisions
  • Check circuit-state.json      — update OPEN agents once issue resolved

How metrics get into the file

Agents append entries in two ways:

  1. Manual: Run /agent-health --log at end of session
  2. Via /save-state: When saving state with a task_id, metrics for the active agent are appended automatically

The file grows by one JSON line per agent per session. Use --since to filter to recent sessions and avoid reading stale data from weeks ago.
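Applying --since together with the other filters could be sketched in Python as below. It assumes each entry's date field is an ISO YYYY-MM-DD string, as in the entry template, and that a filter left at None is not applied. The function name filter_entries is hypothetical.

```python
from datetime import date
from typing import Optional


def filter_entries(entries: list[dict], session: Optional[str] = None,
                   agent: Optional[str] = None,
                   since: Optional[date] = None) -> list[dict]:
    """Keep entries matching the optional session, agent, and since filters."""
    kept = []
    for entry in entries:
        if session is not None and entry.get("session") != session:
            continue
        if agent is not None and entry.get("agent") != agent:
            continue
        # Inclusive lower bound: keep entries with date >= since.
        if since is not None and date.fromisoformat(entry["date"]) < since:
            continue
        kept.append(entry)
    return kept
```

A rolling window such as "the last 7 days" can be obtained with `since=date.today() - timedelta(days=7)` (after importing timedelta from datetime) instead of a fixed date.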


Quick examples

# Summary for current session
/agent-health

# Check one agent across all time
/agent-health --agent qa-tester

# Log a fresh snapshot and view it
/agent-health --log

# Review entries since a given date (for example, the last 7 days)
/agent-health --since 2026-04-09