Software_development_department agent-health
Reads production/traces/agent-metrics.jsonl and displays a per-agent performance summary table for the current or a specified session. Highlights agents with high error rates or OPEN circuit breaker state.
install
source · Clone the upstream repo
git clone https://github.com/tranhieutt/software_development_department
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/tranhieutt/software_development_department "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/agent-health" ~/.claude/skills/tranhieutt-software-development-department-agent-health && rm -rf "$T"
manifest: .claude/skills/agent-health/SKILL.md
Agent Health
Display a performance summary table from
production/traces/agent-metrics.jsonl,
cross-referenced with production/session-state/circuit-state.json for live
circuit breaker states.
Steps
1. Parse arguments
| Flag | Default | Description |
|---|---|---|
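| | current branch | Filter entries by session field |
| --agent | all | Show only this agent |
| --since | no limit | Only entries with date on or after the given date |
| --log | false | If set, append a fresh metrics snapshot to production/traces/agent-metrics.jsonl |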
Get the current branch with git branch --show-current.
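The skill is driven by a prompt rather than a script, so the following Python is only an illustrative sketch of the argument handling. It covers just the three flags this document names explicitly (--agent, --since, --log); the function name build_parser and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the flags named in this document."""
    parser = argparse.ArgumentParser(prog="agent-health")
    # Default None stands for "all agents".
    parser.add_argument("--agent", default=None, help="show only this agent")
    # Default None stands for "no limit".
    parser.add_argument("--since", default=None, help="earliest date, YYYY-MM-DD")
    # Boolean switch, false unless passed.
    parser.add_argument("--log", action="store_true", help="append a metrics snapshot")
    return parser


args = build_parser().parse_args(["--agent", "qa-tester", "--log"])
print(args.agent, args.since, args.log)
```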
2. Read data sources
Read both files in parallel:
- production/traces/agent-metrics.jsonl: historical metrics per agent per session
- production/session-state/circuit-state.json: live circuit breaker states
If agent-metrics.jsonl contains only the schema header line (no actual entries):

    📭 No agent metrics recorded yet for this session.
    Metrics are written when agents use /agent-health --log
    or at the end of a session via /save-state.

    Circuit breaker states (live):
    [show table from circuit-state.json only]
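A minimal sketch of the metrics read and the header-only check, in Python. The document does not show what the schema header line looks like, so this assumes it is a JSON object without an agent key; the _schema key in the sample and the name load_metrics are hypothetical.

```python
import json


def load_metrics(jsonl_text: str) -> list[dict]:
    """Parse JSONL text and keep only real metric entries."""
    entries = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: the schema header is the only line without an "agent" key.
        if "agent" in record:
            entries.append(record)
    return entries


sample = "\n".join([
    '{"_schema": "hypothetical header line"}',
    '{"date": "2026-04-10", "session": "main", "agent": "qa-tester", '
    '"tasks_completed": 4, "tasks_failed": 2, "tasks_blocked": 0}',
])
entries = load_metrics(sample)
header_only = load_metrics('{"_schema": "hypothetical header line"}')
print(len(entries), len(header_only))
```

An empty result corresponds to the "no agent metrics recorded yet" case, where only the live circuit breaker table is shown.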
3. Aggregate metrics
For each agent, compute across the filtered entries:
- total_tasks = tasks_completed + tasks_failed + tasks_blocked
- success_rate = tasks_completed / total_tasks * 100 (0 if no tasks)
- error_rate = latest error_rate field value
- circuit_state = from circuit-state.json (live, not from log)
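The aggregation rules above can be sketched in Python as follows. This assumes that task counts are summed across an agent's entries, that entries arrive in chronological order, and that the circuit state data is a plain mapping from agent name to state string; none of these is confirmed by the document. The UNKNOWN fallback is also an assumption.

```python
from collections import defaultdict

COUNT_FIELDS = ("tasks_completed", "tasks_failed", "tasks_blocked")


def aggregate(entries, circuit_states):
    """Summarise filtered log entries per agent (illustrative only)."""
    per_agent = defaultdict(lambda: {field: 0 for field in COUNT_FIELDS})
    latest_error_rate = {}
    for entry in entries:
        agent = entry["agent"]
        for field in COUNT_FIELDS:
            per_agent[agent][field] += entry.get(field, 0)
        if "error_rate" in entry:
            latest_error_rate[agent] = entry["error_rate"]  # last entry wins

    summary = {}
    for agent, counts in per_agent.items():
        total = sum(counts.values())
        # Guard against division by zero: 0 if the agent has no tasks.
        success = counts["tasks_completed"] / total * 100 if total else 0.0
        summary[agent] = {
            **counts,
            "total_tasks": total,
            "success_rate": round(success, 1),
            "error_rate": latest_error_rate.get(agent),
            # The live state is used, never the state stored in the log.
            "circuit_state": circuit_states.get(agent, "UNKNOWN"),
        }
    return summary


log_entries = [
    {"agent": "qa-tester", "tasks_completed": 3, "tasks_failed": 1,
     "tasks_blocked": 0, "error_rate": 0.25},
    {"agent": "qa-tester", "tasks_completed": 1, "tasks_failed": 1,
     "tasks_blocked": 0, "error_rate": 0.5},
]
report = aggregate(log_entries, {"qa-tester": "HALF-OPEN"})
print(report["qa-tester"])
```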
4. Render health table
    🏥 Agent Health Report — session: <branch> · <date range>
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Agent                 Tasks  ✅ Done  ❌ Failed  ⛔ Blocked  Success%  Circuit
    ──────────────────────────────────────────────────────────────────────────────
    backend-developer         8        7          1           0     87.5%  🟢 CLOSED
    frontend-developer        5        5          0           0    100.0%  🟢 CLOSED
    qa-tester                 6        4          2           0     66.7%  🟡 HALF-OPEN
    data-engineer             2        2          0           0    100.0%  🟢 CLOSED
    investigator              1        0          1           0      0.0%  🔴 OPEN
    ──────────────────────────────────────────────────────────────────────────────
    TOTAL                    22       18          4           0     81.8%

    ⚠️ Agents needing attention:
    🔴 investigator — Circuit OPEN · fallback: solver
    🟡 qa-tester — Circuit HALF-OPEN · 2 failures this session
Circuit state icons:
- 🟢 CLOSED: healthy
- 🟡 HALF-OPEN: recovering, monitor closely
- 🔴 OPEN: bypassed, routed to fallback
Flag agents as needing attention if:
- circuit_state is OPEN or HALF-OPEN
- success_rate < 70%
- tasks_failed >= 2
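As a rough Python illustration, the three conditions can be checked independently so that the report can say why an agent was flagged. The function name attention_reasons and the wording of the reason strings are assumptions; the sample rows reuse values from the example table above.

```python
def attention_reasons(row):
    """Return the list of reasons an agent needs attention (empty if healthy)."""
    reasons = []
    if row["circuit_state"] in ("OPEN", "HALF-OPEN"):
        reasons.append(f"circuit {row['circuit_state']}")
    if row["success_rate"] < 70:
        reasons.append(f"success rate {row['success_rate']}%")
    if row["tasks_failed"] >= 2:
        reasons.append(f"{row['tasks_failed']} failures")
    return reasons


rows = {
    "backend-developer": {"circuit_state": "CLOSED", "success_rate": 87.5, "tasks_failed": 1},
    "qa-tester": {"circuit_state": "HALF-OPEN", "success_rate": 66.7, "tasks_failed": 2},
    "investigator": {"circuit_state": "OPEN", "success_rate": 0.0, "tasks_failed": 1},
}
# Keep only agents with at least one reason, preserving table order.
flagged = {name: attention_reasons(row) for name, row in rows.items() if attention_reasons(row)}
for name, reasons in flagged.items():
    print(name, "->", ", ".join(reasons))
```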
5. Log snapshot (if --log)
If the --log flag was passed, append one entry per active agent to production/traces/agent-metrics.jsonl:
{"date":"<YYYY-MM-DD>","session":"<branch>","agent":"<agent>","tasks_completed":<N>,"tasks_failed":<N>,"tasks_blocked":<N>,"avg_tokens_est":<N>,"error_rate":<0.0-1.0>,"circuit_state":"CLOSED|OPEN|HALF-OPEN","notes":"<optional>"}
Get circuit_state from circuit-state.json. If no exact token data is available, estimate avg_tokens_est as the decision ledger entry count × 800 tokens (a rough per-entry figure). Treat the value as an estimate and keep the _est suffix on the field name.
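A hedged Python sketch of building one snapshot line in the format shown above. The helper name build_log_line and its parameters are hypothetical, and the token estimate follows the text literally (ledger entry count × 800); how error_rate is derived is not specified here, so it is passed in as-is.

```python
import json
from datetime import date

TOKENS_PER_LEDGER_ENTRY = 800  # rough per-entry figure quoted in the text


def build_log_line(agent, session, counts, error_rate, circuit_state,
                   ledger_entries, today=None, notes=""):
    """Serialise one agent's snapshot as a single JSON line."""
    entry = {
        "date": (today or date.today()).isoformat(),
        "session": session,
        "agent": agent,
        "tasks_completed": counts["tasks_completed"],
        "tasks_failed": counts["tasks_failed"],
        "tasks_blocked": counts["tasks_blocked"],
        # Estimate only: no exact token data is assumed to be available.
        "avg_tokens_est": ledger_entries * TOKENS_PER_LEDGER_ENTRY,
        "error_rate": error_rate,
        "circuit_state": circuit_state,  # taken from the live state file
        "notes": notes,
    }
    return json.dumps(entry)


line = build_log_line(
    "qa-tester", "main",
    {"tasks_completed": 4, "tasks_failed": 2, "tasks_blocked": 0},
    error_rate=0.33, circuit_state="HALF-OPEN",
    ledger_entries=3, today=date(2026, 4, 16),
)
print(line)
```

Appending then amounts to writing each returned line plus a newline to the metrics file opened in append mode.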
Print after logging:
    ✅ Metrics snapshot logged → production/traces/agent-metrics.jsonl
    [N] agents recorded · <date>
6. Suggest actions
After the table, if any agents need attention:
    💡 Suggested actions:
    • /resume-from <task_id> — recover failed task checkpoint
    • /trace-history --risk High — audit high-risk decisions
    • Check circuit-state.json — update OPEN agents once issue resolved
How metrics get into the file
Agents append entries in two ways:
- Manual: Run /agent-health --log at end of session
- Via /save-state: When saving state with a task_id, metrics for the active agent are appended automatically
The file grows by one JSON line per agent per session. Use --since to restrict the report to recent sessions and avoid reading stale data from weeks ago.
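A small Python sketch of the date filter, assuming each entry carries a date field in YYYY-MM-DD form (as in the log format above) and that the cutoff is inclusive. The function name filter_since is hypothetical.

```python
from datetime import date


def filter_since(entries, since):
    """Keep entries dated on or after the cutoff (inclusive, by assumption)."""
    cutoff = date.fromisoformat(since)
    return [e for e in entries if date.fromisoformat(e["date"]) >= cutoff]


history = [
    {"agent": "qa-tester", "date": "2026-03-01"},
    {"agent": "qa-tester", "date": "2026-04-09"},
    {"agent": "backend-developer", "date": "2026-04-12"},
]
recent = filter_since(history, "2026-04-09")
print([e["date"] for e in recent])
```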
Quick examples
    # Summary for current session
    /agent-health

    # Check one agent across all time
    /agent-health --agent qa-tester

    # Log a fresh snapshot and view it
    /agent-health --log

    # Review last 7 days
    /agent-health --since 2026-04-09