Software_development_department diagnose
Multi-agent diagnostic pipeline for complex/intermittent bugs. Orchestrates Investigator → Verifier → Solver → Lead Programmer with enforced handoff contracts. Use ONLY for non-obvious failures (root cause unclear, reproduction unstable, fixes reverted). NOT for trivial bugs with known cause — fix them directly.
install
source · Clone the upstream repo

```shell
git clone https://github.com/tranhieutt/software_development_department
```

Claude Code · Install into ~/.claude/skills/

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/tranhieutt/software_development_department "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/diagnose" ~/.claude/skills/tranhieutt-software-development-department-diagnose && rm -rf "$T"
```
manifest:
.claude/skills/diagnose/SKILL.md
Skill: /diagnose — Complex Bug Diagnostic Pipeline
When to invoke (and when NOT to)
✅ Use /diagnose when:
- Bug reproduces but root cause is unclear after one read-pass of the failing code
- Previous fix attempts have been reverted ≥ 2 times (symptoms return)
- Failure is intermittent (flaky test, race condition, timing-dependent)
- Failure occurs in unfamiliar code (agent has no prior context)
- User has explicitly requested `/diagnose` or "deep investigation"
- Circuit Breaker (Rule 14) tripped on the specialist agent that normally handles this domain
❌ Do NOT use /diagnose when:
- Cause is obvious (null ref, typo, missing import, incorrect import path)
- Fix is < 10 LOC and has a clear success check
- Bug is in code you just wrote this session (read-pass + local reasoning is faster)
- User wants a quick patch and has accepted the tradeoff
Pipeline overview
```
┌─────────────────┐     ┌─────────────────┐     ┌──────────────┐     ┌──────────────────┐
│  Investigator   │ ──► │    Verifier     │ ──► │    Solver    │ ──► │ Lead Programmer  │
│  (hypothesis)   │     │ (devil's adv.)  │     │ (tradeoffs)  │     │ (assign + exec)  │
└─────────────────┘     └─────────────────┘     └──────────────┘     └──────────────────┘
         │                       │                     │                      │
         ▼                       ▼                     ▼                      ▼
 investigation.json     verification.json        solution.json         implementation
 (root_cause,           (status: confirmed |     (3 options:           (delegates to
  evidence[],            refuted |                Quick/Strategic/      backend-developer,
  confidence)            inconclusive,            Future-Proof)         qa-tester, etc.)
                         reproduction_steps)
```
Each stage produces a required artifact saved to `.investigations/<task_id>/` and a handoff contract (per Rule 16) to the next agent.
Stage 1 — Investigation
Agent: `investigator`
Goal: Produce a falsifiable root-cause hypothesis backed by empirical evidence.
Inputs
- Symptom description (from user or TODO.md bug ID)
- Reproduction steps (or "cannot reproduce" + environment)
- Relevant log lines, stack traces, error IDs
Required output — investigation.json
```json
{
  "task_id": "BUG-417",
  "symptom": "POST /api/orders returns 500 when cart has ≥10 items",
  "reproduction": {
    "steps": ["...", "..."],
    "frequency": "100% | intermittent (~30%) | once",
    "environment": "staging-eu-west-1"
  },
  "hypothesis": {
    "root_cause": "OrderService.calculateTotal() N+1 query exhausts pool when cart.items.length > 9",
    "confidence": "high | medium | low",
    "falsifiable_by": "Run with pool_size=50; if error disappears, cause confirmed"
  },
  "evidence": [
    {"type": "log", "ref": ".investigations/BUG-417/pg-pool-exhausted.log", "summary": "..."},
    {"type": "code", "ref": "src/services/order.service.ts:142", "summary": "Unbounded .map+await"}
  ],
  "unknowns": ["Why only eu-west-1?", "When did this start?"],
  "next_agent": "verifier"
}
```
Quality gate (Lead Programmer rejects if):
- `hypothesis.falsifiable_by` is vague ("check if it works")
- `evidence` has fewer than 2 items (unverifiable)
- `unknowns` is empty but `confidence: low` (contradictory)
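This gate is mechanical enough to sketch as a validator. The following is an illustrative Python sketch, not part of the skill itself — the function name and the list of vague phrases are assumptions; the field names follow the investigation.json schema above:

```python
# Hypothetical validator for the Stage 1 quality gate (illustrative only).
VAGUE_MARKERS = ("check if it works", "see if", "try and see")  # assumed phrases

def gate_investigation(report: dict) -> list[str]:
    """Return rejection reasons for an investigation.json dict; empty list = pass."""
    reasons = []
    hyp = report.get("hypothesis", {})
    falsifiable = hyp.get("falsifiable_by", "")
    if not falsifiable or any(m in falsifiable.lower() for m in VAGUE_MARKERS):
        reasons.append("hypothesis.falsifiable_by is vague or missing")
    if len(report.get("evidence", [])) < 2:
        reasons.append("evidence has fewer than 2 items (unverifiable)")
    if not report.get("unknowns") and hyp.get("confidence") == "low":
        reasons.append("unknowns empty but confidence low (contradictory)")
    return reasons
```

A passing report returns an empty list; anything else goes back to the investigator.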
Stage 2 — Verification
Agent: `verifier`
Goal: Attempt to refute the hypothesis. Only confirmed if refutation fails.
Inputs
- `investigation.json` (from Stage 1)
- Access to staging/test environment
Required output — verification.json
```json
{
  "task_id": "BUG-417",
  "status": "confirmed | refuted | inconclusive",
  "triangulation": [
    {"method": "reproduce_with_fix_applied", "result": "Error gone with pool_size=50"},
    {"method": "reproduce_without_fix", "result": "Error returns at 10 items"},
    {"method": "adjacent_test_case", "result": "9 items = OK, 10 items = fail → threshold confirmed"}
  ],
  "counter_hypotheses_ruled_out": [
    "DB slowness (ruled out: p99 < 50ms)",
    "Network flaps (ruled out: no packet loss in window)"
  ],
  "confidence": "high",
  "recommendation": "Proceed to solver — cause confirmed necessary AND sufficient"
}
```
Decision flow
| `status` | Next action |
|---|---|
| `confirmed` | Hand off to `solver` |
| `refuted` | Return to `investigator` with counter-evidence. Max 2 round-trips. |
| `inconclusive` | STOP. Surface to user with all evidence. Do NOT proceed to solver. |
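The decision flow can be sketched as a dispatch function. This is illustrative Python; `next_action` and the returned action strings are hypothetical names, while the status values are the ones verification.json defines. The behavior after the second refuted round-trip is not spelled out above, so this sketch assumes it stops and surfaces to the user:

```python
def next_action(status: str, round_trips: int = 0) -> str:
    """Map a verification status to the next pipeline step (illustrative)."""
    if status == "confirmed":
        return "handoff:solver"
    if status == "refuted":
        # Counter-evidence goes back to the investigator, at most twice (assumption:
        # after the cap, stop and surface to the user like the inconclusive case).
        return "return:investigator" if round_trips < 2 else "stop:surface_to_user"
    if status == "inconclusive":
        # Never proceed to solver without a confirmed cause.
        return "stop:surface_to_user"
    raise ValueError(f"unknown verification status: {status!r}")
```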
Stage 3 — Solution
Agent: `solver`
Goal: Generate 3 solution options with explicit tradeoffs; never pick silently.
Required output — solution.json
```json
{
  "task_id": "BUG-417",
  "options": [
    {
      "name": "Quick",
      "description": "Increase pool_size from 20 → 50 in db.ts",
      "scope_loc": 1,
      "risk_tier": "Low",
      "tradeoff": "Masks root cause; higher RAM; future growth hits same wall"
    },
    {
      "name": "Strategic",
      "description": "Rewrite calculateTotal() to batch via IN-clause",
      "scope_loc": 40,
      "risk_tier": "Medium",
      "tradeoff": "Fixes N+1 permanently; requires regression test on discount logic"
    },
    {
      "name": "Future-Proof",
      "description": "Introduce DataLoader pattern across service layer",
      "scope_loc": 300,
      "risk_tier": "High",
      "tradeoff": "Eliminates entire class of N+1 bugs; 2-3 day refactor; needs ADR"
    }
  ],
  "recommendation": "Strategic — best risk/value ratio. Quick only if release is < 24h."
}
```
Quality gate
- All 3 options must have distinct scope (not three flavors of the same fix)
- `tradeoff` must state what is sacrificed, not just "takes longer"
- `recommendation` must cite a criterion (time budget, risk tier, blast radius)
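A sketch of how this gate might be checked (illustrative Python; the 2×-between-tiers distinctness heuristic is an assumption, not something the skill specifies):

```python
def gate_solution(options: list[dict]) -> list[str]:
    """Return rejection reasons for a solution.json option set; empty list = pass."""
    reasons = []
    if len(options) != 3:
        reasons.append("exactly 3 options required")
    scopes = sorted(o.get("scope_loc", 0) for o in options)
    # Heuristic (assumption, not from the skill spec): if scope does not at least
    # double between tiers, the options are likely three flavors of the same fix.
    if len(scopes) == 3 and not (scopes[0] * 2 <= scopes[1] and scopes[1] * 2 <= scopes[2]):
        reasons.append("option scopes are not meaningfully distinct")
    for o in options:
        if o.get("tradeoff", "").strip().lower() in ("", "takes longer"):
            reasons.append(f"option {o.get('name')}: tradeoff must state what is sacrificed")
    return reasons
```

The BUG-417 set (1 / 40 / 300 LOC, each with a substantive tradeoff) passes this check.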
Stage 4 — Finalization
Agent: `lead-programmer`
Goal: Select option, assign specialist, track execution.
Actions
- Review `solution.json` with user (if `risk_tier: High` or `scope_loc > 100`)
- Select option → write selection to `.investigations/<task_id>/decision.md`
- Create A2A handoff contract (Rule 16) via `/handoff`: `lead-programmer → backend-developer` (or `frontend-developer`, `data-engineer`)
- Acceptance criteria derived from `investigation.hypothesis.falsifiable_by`
- Append ledger entry (Rule 15) to `production/traces/decision_ledger.jsonl`:
```json
{"ts":"2026-04-17T14:22:00Z","session":"main","agent_id":"lead-programmer","task_id":"BUG-417","request":"/diagnose BUG-417","reasoning":"Verified N+1 as necessary+sufficient; selected Strategic per solver recommendation","choice":"Strategic refactor of calculateTotal()","outcome":"pass","risk_tier":"Medium","duration_s":1840}
```
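Writing such an entry is a small append. A minimal sketch, assuming the Rule 15 ledger path and one JSON object per line (the helper name is hypothetical):

```python
import json
import pathlib
from datetime import datetime, timezone

def append_ledger(entry: dict, path: str = "production/traces/decision_ledger.jsonl") -> None:
    """Append one decision record to the JSONL ledger, stamping ts if absent."""
    entry.setdefault("ts", datetime.now(timezone.utc).isoformat())
    p = pathlib.Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)  # ledger dir may not exist yet
    with p.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Append-only JSONL keeps the ledger greppable and safe for concurrent readers.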
Artifact storage
All intermediate reports MUST be saved to
.investigations/<task_id>/:
```
.investigations/
└── BUG-417/
    ├── investigation.json   # Stage 1 output
    ├── verification.json    # Stage 2 output
    ├── solution.json        # Stage 3 output
    ├── decision.md          # Stage 4 — human-readable rationale
    ├── evidence/            # logs, screenshots, traces referenced in reports
    └── handoffs/            # A2A contracts (copied from .tasks/handoffs/)
```
Retention: Keep until bug is closed + 30 days, then archive to `.investigations/archive/`.
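A minimal archival sweep might look like this (illustrative Python; it uses directory mtime as a stand-in for "closed + 30 days", which is an assumption — the skill does not say how closure is tracked):

```python
import pathlib
import shutil
import time

def archive_stale(root: str = ".investigations", days: int = 30) -> list[str]:
    """Move investigation folders untouched for `days` into <root>/archive/."""
    base = pathlib.Path(root)
    archive = base / "archive"
    archive.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - days * 86400
    moved = []
    for d in sorted(base.iterdir()):  # snapshot order; skip the archive itself
        if d.is_dir() and d.name != "archive" and d.stat().st_mtime < cutoff:
            shutil.move(str(d), str(archive / d.name))
            moved.append(d.name)
    return moved
```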
Escalation paths
| Trigger | Escalate to |
|---|---|
| `investigator` fails 3× (Rule 14 OPEN) | Fall back to `lead-programmer` with raw symptom |
| `verifier` returns `inconclusive` twice | Surface to user; request manual reproduction |
| `solver` cannot produce 3 distinct options | Escalate to `lead-programmer` — scope unclear |
| User rejects all 3 options | Return to `investigator`; hypothesis likely wrong |
| Bug reoccurs after fix merges | Restart with new `task_id`; link prior investigation in `decision.md` |
Integration with coordination rules
- Rule 6 (Layered Recovery): If any stage fails once, retry with fresh context before escalating
- Rule 14 (Circuit Breaker): Read `production/session-state/circuit-state.json` before invoking each agent
- Rule 15 (Decision Ledger): Every stage transition logs to `decision_ledger.jsonl`
- Rule 16 (A2A Handoff): Stage 1→2, 2→3, 3→4 each require a handoff contract in `.tasks/handoffs/`
Concrete example — "Flaky checkout test"
Symptom:
checkout.e2e.test.ts fails ~20% of CI runs; local always passes.
```
/diagnose flaky-checkout-e2e
  ↓ Stage 1 → investigator
      hypothesis: "Test clicks #submit before React hydration completes on slow CI runners"
      evidence: [CI traces showing hydration marker missing,
                 local has DevTools overhead masking timing]
      confidence: medium (cannot reproduce locally)
  ↓ Stage 2 → verifier
      triangulation:
        - Inject 500ms delay before click → test passes 50/50 runs ✓
        - Remove delay → fails 9/50 ✗
        - Check for hydration marker instead of fixed delay → passes 50/50 ✓
      status: confirmed
  ↓ Stage 3 → solver
      Quick:        add sleep(500ms)                                   [masks issue]
      Strategic:    waitFor hydration marker                           [addresses root cause]
      Future-Proof: custom test util that always waits for RSC boundary [reusable]
      recommendation: Strategic
  ↓ Stage 4 → lead-programmer
      selects Strategic; assigns to qa-tester
      handoff contract: "qa-tester updates checkout.e2e.test.ts to use waitFor(hydrationMarker)"
      acceptance_criteria: ["10 CI runs in a row pass", "no sleep() in test"]
      ledger entry written
```
Common pitfalls
| Pitfall | Fix |
|---|---|
| Skipping Verification ("cause is obvious") | Verifier exists specifically to catch "obvious but wrong" hypotheses |
| Investigator produces only 1 hypothesis | Reject — require `counter_hypotheses_ruled_out` list in Stage 2 |
| Solver picks Quick fix without naming tradeoff | Reject — all 3 options required for explicit tradeoff comparison |
| No artifact written to `.investigations/<task_id>/` | Reject — verbal diagnosis is not auditable |
| Running `/diagnose` in parallel on same bug | Only one active investigation per `task_id`; concurrent runs create a race |
Output to user
After Stage 4 completes, summarize in ≤ 5 lines:
```
/diagnose BUG-417 complete.
Root cause: Unbounded .map+await in OrderService.calculateTotal() exhausts pg pool.
Selected: Strategic (batch via IN-clause, ~40 LOC, Medium risk).
Assigned: @backend-developer; acceptance = load test with 50 items passes.
Artifacts: .investigations/BUG-417/
```