Software_development_department agent-health

Reads production/traces/agent-metrics.jsonl and displays a per-agent performance summary table for the current or a specified session. Highlights agents with high error rates or OPEN circuit breaker state.

install

source · Clone the upstream repo

git clone https://github.com/tranhieutt/software_development_department

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/tranhieutt/software_development_department "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/agent-health" ~/.claude/skills/tranhieutt-software-development-department-agent-health && rm -rf "$T"

manifest: .claude/skills/agent-health/SKILL.md

source content

Agent Health

Display a performance summary table from `production/traces/agent-metrics.jsonl`, cross-referenced with `production/session-state/circuit-state.json` for live circuit breaker states.

Steps

1. Parse arguments

| Flag | Default | Description |
|---|---|---|
| `--session <branch>` | current branch | Filter entries by `session` field |
| `--agent <name>` | all | Show only this agent |
| `--since <date>` | no limit | Only entries with `date >= YYYY-MM-DD` |
| `--log` | false | If set, append a fresh metrics snapshot to `agent-metrics.jsonl` |

Get current branch:

git branch --show-current
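
As an illustration of this step, here is a minimal Python sketch of the flag handling. The function names `parse_args` and `current_branch` and the `default_session` parameter are hypothetical and are not part of the skill; only the four flags and the `git branch --show-current` command come from the text above.

```python
import argparse
import subprocess


def current_branch() -> str:
    """Return the checked-out branch name via `git branch --show-current`."""
    result = subprocess.run(
        ["git", "branch", "--show-current"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def parse_args(argv, default_session=None):
    """Parse the four documented flags.

    `default_session` is a hypothetical hook that lets callers supply the
    branch name directly instead of shelling out to git.
    """
    parser = argparse.ArgumentParser(prog="agent-health")
    parser.add_argument("--session", default=None)  # falls back to current branch
    parser.add_argument("--agent", default=None)    # None means all agents
    parser.add_argument("--since", default=None)    # YYYY-MM-DD, None means no limit
    parser.add_argument("--log", action="store_true")
    args = parser.parse_args(argv)
    if args.session is None:
        args.session = default_session if default_session is not None else current_branch()
    return args
```

Calling `parse_args(["--agent", "qa-tester"], default_session="main")` yields a namespace with `session="main"`, `agent="qa-tester"`, `since=None`, and `log=False`.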

2. Read data sources

Read both files in parallel:

- `production/traces/agent-metrics.jsonl`: historical metrics per agent per session
- `production/session-state/circuit-state.json`: live circuit breaker states

If `agent-metrics.jsonl` contains only the schema header line (no actual entries), print:

📭 No agent metrics recorded yet for this session.
   Metrics are written when agents use /agent-health --log
   or at the end of a session via /save-state.

Circuit breaker states (live):
[show table from circuit-state.json only]
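
A rough Python sketch of the read step, including the empty-file case, might look like the following. The source does not show the exact shape of the schema header line or of `circuit-state.json`, so this sketch makes two assumptions: a real metrics entry is a JSON object with an `agent` field (anything else counts as header or noise), and the circuit file is a flat `{agent: state}` object. The names `read_metrics` and `read_circuit_states` are hypothetical.

```python
import json
from pathlib import Path


def read_metrics(path):
    """Return metric entries from a JSONL file, skipping blank and header lines."""
    p = Path(path)
    if not p.exists():
        return []
    entries = []
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: a non-JSON line is a header or comment, not an entry
        # Assumption: only objects with an "agent" field are real entries.
        if isinstance(record, dict) and "agent" in record:
            entries.append(record)
    return entries


def read_circuit_states(path):
    """Return {agent: state}; assumes the file is a flat JSON object of that shape."""
    p = Path(path)
    if not p.exists():
        return {}
    data = json.loads(p.read_text(encoding="utf-8"))
    return data if isinstance(data, dict) else {}
```

An empty list from `read_metrics` corresponds to the "no agent metrics recorded yet" message, with the circuit table still rendered from `read_circuit_states`.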

3. Aggregate metrics

For each agent, compute across the filtered entries:

- `total_tasks` = `tasks_completed` + `tasks_failed` + `tasks_blocked`
- `success_rate` = `tasks_completed / total_tasks * 100` (0 if no tasks)
- `error_rate` = latest `error_rate` field value
- `circuit_state` = from `circuit-state.json` (live, not from log)
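
The aggregation can be sketched in Python as below. This is an illustration under stated assumptions, not the skill's actual implementation: it treats `total_tasks` as the sum of the three task counters (consistent with the sample report in step 4), picks the "latest" `error_rate` by the entry's `date` field, and reports `UNKNOWN` for an agent missing from the circuit file. The function name `aggregate` and the `UNKNOWN` placeholder are hypothetical.

```python
from collections import defaultdict


def aggregate(entries, circuit_states):
    """Sum task counts per agent; attach success rate, latest error rate, live circuit state."""
    sums = defaultdict(lambda: {
        "tasks_completed": 0, "tasks_failed": 0, "tasks_blocked": 0,
        "error_rate": 0.0, "_latest_date": "",
    })
    for e in entries:
        row = sums[e["agent"]]
        for key in ("tasks_completed", "tasks_failed", "tasks_blocked"):
            row[key] += e.get(key, 0)
        # ISO dates (YYYY-MM-DD) sort lexicographically; >= lets a later line win ties.
        if e.get("date", "") >= row["_latest_date"]:
            row["_latest_date"] = e.get("date", "")
            row["error_rate"] = e.get("error_rate", 0.0)

    result = {}
    for agent, row in sums.items():
        total = row["tasks_completed"] + row["tasks_failed"] + row["tasks_blocked"]
        result[agent] = {
            "total_tasks": total,
            "tasks_completed": row["tasks_completed"],
            "tasks_failed": row["tasks_failed"],
            "tasks_blocked": row["tasks_blocked"],
            "success_rate": row["tasks_completed"] / total * 100 if total else 0.0,
            "error_rate": row["error_rate"],
            # Live state comes from circuit-state.json, never from the log entries.
            "circuit_state": circuit_states.get(agent, "UNKNOWN"),
        }
    return result
```

Taking the circuit state from a separate argument rather than from the log entries keeps the "live, not from log" rule explicit in the function signature.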

4. Render health table

🏥 Agent Health Report — session: <branch> · <date range>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Agent                  Tasks  ✅ Done  ❌ Failed  ⛔ Blocked  Success%  Circuit
──────────────────────────────────────────────────────────────────────────────
backend-developer          8       7          1          0      87.5%   🟢 CLOSED
frontend-developer         5       5          0          0     100.0%   🟢 CLOSED
qa-tester                  6       4          2          0      66.7%   🟡 HALF-OPEN
data-engineer              2       2          0          0     100.0%   🟢 CLOSED
investigator               1       0          1          0       0.0%   🔴 OPEN
──────────────────────────────────────────────────────────────────────────────
TOTAL                     22      18          4          0      81.8%

⚠️  Agents needing attention:
  🔴 investigator     — Circuit OPEN · fallback: solver
  🟡 qa-tester        — Circuit HALF-OPEN · 2 failures this session

Circuit state icons:

- `🟢 CLOSED`: healthy
- `🟡 HALF-OPEN`: recovering, monitor closely
- `🔴 OPEN`: bypassed, routed to fallback

Flag agents as needing attention if:

- `circuit_state` is `OPEN` or `HALF-OPEN`
- `success_rate` < 70%
- `tasks_failed` >= 2
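
The attention rules can be expressed as a small Python predicate. This sketch assumes the three conditions are alternatives (any one is enough to flag an agent) and that each agent is represented by a dict with `circuit_state`, `success_rate`, and `tasks_failed` keys. The function name `attention_reasons` and the reason strings are hypothetical.

```python
def attention_reasons(row):
    """Return the reasons an agent row needs attention; an empty list means healthy."""
    reasons = []
    if row["circuit_state"] in ("OPEN", "HALF-OPEN"):
        reasons.append(f"circuit {row['circuit_state']}")
    # Applied literally: an agent with no tasks has success_rate 0 and is flagged too.
    if row["success_rate"] < 70:
        reasons.append("success rate below 70%")
    if row["tasks_failed"] >= 2:
        reasons.append("2 or more failed tasks")
    return reasons
```

Returning the reasons rather than a bare boolean makes it straightforward to print a line per flagged agent, as in the sample report.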

5. Log snapshot (if --log)

If the `--log` flag was passed, append one entry per active agent to `production/traces/agent-metrics.jsonl`:

{"date":"<YYYY-MM-DD>","session":"<branch>","agent":"<agent>","tasks_completed":<N>,"tasks_failed":<N>,"tasks_blocked":<N>,"avg_tokens_est":<N>,"error_rate":<0.0-1.0>,"circuit_state":"CLOSED|OPEN|HALF-OPEN","notes":"<optional>"}

Get `circuit_state` from `circuit-state.json`. If no exact token data is available, estimate `avg_tokens_est` as the decision ledger entry count × 800 tokens (a rough per-entry estimate). Note that this value is an estimate and mark it with the `_est` suffix.

Print after logging:

✅ Metrics snapshot logged → production/traces/agent-metrics.jsonl
   [N] agents recorded · <date>
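
As a sketch of the logging step, the following Python builds one JSONL line per agent using the field names from the entry template above and the 800-tokens-per-ledger-entry estimate. The function names `snapshot_line` and `append_snapshot`, the `row` dict shape, and the `ledger_entries` parameter are hypothetical; how the decision ledger is actually counted is not specified in the source.

```python
import json

TOKENS_PER_LEDGER_ENTRY = 800  # rough per-entry estimate stated in the text


def snapshot_line(date, session, agent, row, ledger_entries, notes=""):
    """Build one compact JSON line with the documented fields, in documented order."""
    record = {
        "date": date,
        "session": session,
        "agent": agent,
        "tasks_completed": row["tasks_completed"],
        "tasks_failed": row["tasks_failed"],
        "tasks_blocked": row["tasks_blocked"],
        # Estimate only: ledger entry count times 800, used when exact data is missing.
        "avg_tokens_est": ledger_entries * TOKENS_PER_LEDGER_ENTRY,
        "error_rate": row["error_rate"],
        "circuit_state": row["circuit_state"],
        "notes": notes,
    }
    return json.dumps(record, separators=(",", ":"))


def append_snapshot(path, lines):
    """Append lines to the metrics file; append mode leaves earlier entries untouched."""
    with open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```

Opening the file in append mode matches the description of the file growing by one line per agent per session.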

6. Suggest actions

After the table, if any agents need attention:

💡 Suggested actions:
  • /resume-from <task_id>        — recover failed task checkpoint
  • /trace-history --risk High    — audit high-risk decisions
  • Check circuit-state.json      — update OPEN agents once issue resolved

How metrics get into the file

Agents append entries in two ways:

- Manual: Run `/agent-health --log` at end of session
- Via `/save-state`: When saving state with a `task_id`, metrics for the active agent are appended automatically

The file grows by one JSON line per agent per session. Use `--since` to filter to recent sessions and avoid reading stale data from weeks ago.
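
To show how the `--session`, `--agent`, and `--since` filters might combine, here is a minimal Python sketch. It assumes each entry is a dict with `session`, `agent`, and `date` (YYYY-MM-DD) fields, as in the logged entry format; the function name `filter_entries` is hypothetical.

```python
from datetime import date


def filter_entries(entries, session=None, agent=None, since=None):
    """Keep entries matching session and agent, dated on or after `since` (YYYY-MM-DD)."""
    cutoff = date.fromisoformat(since) if since else None
    kept = []
    for e in entries:
        if session is not None and e.get("session") != session:
            continue
        if agent is not None and e.get("agent") != agent:
            continue
        # `date >= since` is inclusive, so an entry dated exactly on the cutoff is kept.
        if cutoff is not None and date.fromisoformat(e["date"]) < cutoff:
            continue
        kept.append(e)
    return kept
```

Parsing with `date.fromisoformat` rejects malformed dates early instead of silently comparing arbitrary strings.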

Quick examples

# Summary for current session
/agent-health

# Check one agent across all time
/agent-health --agent qa-tester

# Log a fresh snapshot and view it
/agent-health --log

# Review last 7 days
/agent-health --since 2026-04-09