# great_cto / skeptical-triage
Reusable 3-round self-challenge + arbiter pattern for filtering false positives from findings/verdicts. Use when the cost of a false-positive gate block exceeds the cost of ~4 extra LLM turns.
```bash
# Clone the whole repo
git clone https://github.com/avelikiy/great_cto

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/avelikiy/great_cto "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/skeptical-triage" ~/.claude/skills/avelikiy-great-cto-skeptical-triage \
  && rm -rf "$T"
```
`skills/skeptical-triage/SKILL.md`

## Skeptical Triage
Filter false positives from multi-angle review, security audit, QA regression flags, or any high-stakes judgment before it turns into a blocker.
Three rounds of skeptical self-review + an impartial arbiter, with a confidence score from the vote.
## When to invoke
| Caller | Finding type | Apply triage? |
|---|---|---|
| review | Angle 2/4/7/9 P0/P1 (security, SQL, privacy, concurrency) | Yes |
| review | Any angle P0/P1 | Yes |
| security-officer | CSO audit P0/P1 | Yes |
| security-officer | Secret in source/git, confirmed CVE | No — hard finding |
| qa-engineer | Flaky-test verdict (is this a regression or a flake?) | Yes |
| tech-lead | ADR trade-off dispute (option A vs. B when both look reasonable) | Yes |
| any | P2/advisory | No |
## The 4-step pattern
Run these sequentially. Each round sees prior reasoning. Arbiter sees all rounds.
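A minimal orchestration sketch in Python, assuming a generic `call_llm` helper (hypothetical; swap in whatever LLM client the caller uses). Only the data flow is the point: each round sees all prior reasoning, and the arbiter sees everything.

```python
import json

def call_llm(prompt: str) -> dict:
    """Hypothetical LLM client returning the round's JSON verdict as a dict."""
    raise NotImplementedError  # wire up your actual API here

ROUND_QUESTIONS = {
    1: "Is the premise true? Trace reachability / reproduce from clean state.",
    2: "Are claimed defenses real and sufficient? Cite the enforcing line.",
    3: "What did Rounds 1-2 miss? Add NEW evidence or concede.",
}

def triage(finding: str) -> dict:
    rounds: list[dict] = []
    for n in (1, 2, 3):
        # Each round gets the original finding plus all prior reasoning.
        rounds.append(call_llm(
            f"Round {n}: {ROUND_QUESTIONS[n]}\n\n"
            f"Finding:\n{finding}\n\n"
            f"Prior rounds:\n{json.dumps(rounds, indent=2)}"
        ))
    # The arbiter sees all three rounds and must pick VALID or INVALID.
    arbiter = call_llm(
        "Arbiter: which side has the stronger evidence? "
        "No UNCERTAIN; make the call.\n\n"
        f"Finding:\n{finding}\n\nRounds:\n{json.dumps(rounds, indent=2)}"
    )
    return {"rounds": rounds, "arbiter": arbiter}
```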
### Round 1 — Reachability / Premise
Question: is the premise true?
- For security/reliability: can an external attacker reach this code path with untrusted input? Trace input flow backward from the bug site to its origin. If only trusted internal callers → lean INVALID.
- For regressions: does the failing behavior reproduce from a clean state on the target branch?
- For ADR trade-offs: is the constraint that forces the choice actually binding? (e.g. "we need <10ms p99" — is that real or aspirational?)
Output:

```json
{"round": 1, "verdict": "VALID|INVALID|UNCERTAIN", "reasoning": "...", "crux": "single key fact"}
```
### Round 2 — Verify cited defenses / counter-evidence
Question: are claimed defenses real and sufficient?
- Every cited defense → use `Grep` to find its actual implementation line.
- Resolve constant names to numeric values. `MAX_BUF_SIZE` is not a verified bound — `#define MAX_BUF_SIZE 64` is.
- For regressions: is the cited "test covers this" actually asserting the right invariant?
- For ADR: is the cited benchmark/precedent real (grep for it, read it), or rumored?
If you cannot point to the line that enforces the defense, it does not exist.
Output: same JSON shape, with `grep_used: true/false`.
### Round 3 — Missed angles
Question: what did Rounds 1-2 not consider?
- Error paths, integer overflow, race windows, different callers, platform differences
- Do NOT rehash prior rounds — add new evidence or concede
- For QA: retry logic masking the failure? Test pollution from another test?
- For ADR: option C that neither reviewer raised?
Output: same JSON shape.
### Arbiter
Input: all 3 rounds + original finding/question + source code.
Question: final call — which side has the stronger evidence?
- Deliver a single `verdict: VALID|INVALID` (no UNCERTAIN — make the call).
- Deliver a one-sentence `crux` — the key fact the verdict turns on.
- If 3 prior rounds all said the same thing, only override with overwhelming new evidence and explain why.
Output:

```json
{
  "verdict": "VALID",
  "crux": "memcpy at auth.c:142 copies network-controlled len bytes into 64-byte stack buffer with no bound check",
  "reasoning": "Rounds 1 and 3 verified attacker reach; Round 2 found no size check in 50 LOC radius; arbiter confirms no caller clamps len."
}
```
## Hard rules
Burn these into every round's prompt:
- Absence of defense → VALID, not UNCERTAIN. If you searched for a defense and did not find one, that is the answer. "Other code probably handles this" is not a valid defense.
- A constant name is not a verified bound — only its resolved value is. Grep for the `#define`/`const` declaration.
- Name the line or it does not exist. Vague references to "assumptions in this codebase" do not count.
- Do not contradict your own conclusion in the same response. If you verified a defense is insufficient, that is the verdict. Stop searching for reasons to flip.
- Code quality issue ≠ security vulnerability. Data race on diagnostic state, NULL check on internal-only API, UB only in debug builds → INVALID.
- Trust your own reasoning. If you see the crux on first read, don't manufacture a counter-argument.
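One way to guarantee these rules reach every round is to keep them in a single constant appended to each prompt. A sketch in the same vein as above (the `HARD_RULES` name and wording are my own condensation):

```python
# Hypothetical constant: appending it to every round prompt means the rules
# cannot silently drop out when individual round prompts are edited.
HARD_RULES = """\
- Absence of defense -> VALID, not UNCERTAIN.
- A constant name is not a verified bound; resolve it to its numeric value.
- Name the line or it does not exist.
- Do not contradict your own conclusion in the same response.
- Code quality issue != security vulnerability.
- Trust your own reasoning; do not manufacture counter-arguments.
"""

def round_prompt(base: str) -> str:
    return f"{base}\n\nHard rules (non-negotiable):\n{HARD_RULES}"
```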
## Confidence scoring
```
confidence = valid_rounds_before_arbiter / 3
```
- `100%` (VVV) — 3/3 rounds VALID. Arbiter rubber-stamps unless it finds something brand-new.
- `67%` (VVI, VIV, or IVV) — majority VALID. Arbiter resolves the split, overriding the majority only with new evidence.
- `33%` (IIV, IVI, or VII) — majority INVALID. Arbiter usually confirms INVALID.
- `0%` (III) — 3/3 INVALID. Arbiter rarely overrides.
The arbiter's call is the final verdict; confidence reflects the round vote for transparency. Record both in the output so humans can see where the arbiter diverged.
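The scoring rule is a one-liner if round verdicts are collected as dicts, as in the sketch above:

```python
def confidence(rounds: list[dict]) -> float:
    """Fraction of pre-arbiter rounds that said VALID (VVV -> 1.0, III -> 0.0)."""
    valid = sum(1 for r in rounds if r["verdict"] == "VALID")
    return valid / len(rounds)  # always 3 rounds in this pattern
```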
## Applying triage results to severity
Once the arbiter returns:
| Arbiter verdict | Confidence | Severity action |
|---|---|---|
| VALID | ≥ 50% | Keep original severity |
| VALID | < 50% | Demote: P0→P1, P1→P2 |
| INVALID | any | Remove from gate tally, record as `INVALID` in report for audit |
| UNCERTAIN (only if arbiter could not decide) | n/a | Keep original severity, flag for manual CTO review |
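The table translates into a small decision function. A sketch; the severity labels and demotion steps come from the table, the function shape is an assumption:

```python
DEMOTE = {"P0": "P1", "P1": "P2"}

def apply_triage(verdict: str, conf: float, severity: str) -> tuple[str | None, str]:
    """Map (arbiter verdict, confidence) to a severity action.
    Returns (final_severity, note); None means removed from the gate tally."""
    if verdict == "INVALID":
        return None, "removed from gate tally; recorded in report for audit"
    if verdict == "UNCERTAIN":  # only if the arbiter could not decide
        return severity, "kept; flagged for manual CTO review"
    if conf >= 0.5:
        return severity, "kept original severity"
    return DEMOTE.get(severity, severity), "demoted one level"
```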
## Output schema
Every caller logs triage results to `.great_cto/triage-log.jsonl` (append-only, one JSON object per line):
{ "timestamp": "2026-04-19T12:34:56Z", "caller": "review|security-officer|qa-engineer|tech-lead", "finding_id": "SEC-042", "file": "src/auth.c:142", "original_severity": "P0", "rounds": [ {"round": 1, "verdict": "VALID", "crux": "..."}, {"round": 2, "verdict": "VALID", "crux": "...", "grep_used": true}, {"round": 3, "verdict": "INVALID", "crux": "..."} ], "arbiter": {"verdict": "VALID", "crux": "..."}, "confidence": 0.67, "final_severity": "P0" }
This log is how we measure whether triage earns its keep. Review it weekly:
```bash
# False-positive rate: how many findings the arbiter flipped to INVALID
# (-c keeps one object per line so wc -l counts findings, not JSON lines)
jq -c 'select(.arbiter.verdict=="INVALID")' .great_cto/triage-log.jsonl | wc -l

# Rounds-to-consensus: unique verdicts per finding (1 = all three rounds agreed)
jq '[.rounds[].verdict] | unique | length' .great_cto/triage-log.jsonl
```
If the FP rate is < 10% after 50 triages, triage is filtering noise that wasn't there: lower the threshold or skip triage for that angle. If the FP rate is > 40%, the original review prompt is too trigger-happy; tighten the angle rules.
## Token budget
Per triaged finding: ~4 LLM turns (3 rounds + arbiter). At typical review sizes (~5-10 triaged findings per PR), total budget: 20-40 extra turns per `/review`. Batch when possible — one arbiter can handle multiple findings in a single call if their cruxes are independent.
For cost-sensitive runs (`approval-level: auto` on a huge PR), consider triaging only P0 and leaving P1 untriaged. Re-tune based on `.great_cto/triage-log.jsonl` data.
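A hedged sketch of the batching idea, reusing the hypothetical `call_llm` from the orchestration sketch above; the prompt wording is illustrative only:

```python
import json

def batch_arbiter(triaged: list[dict]) -> dict:
    """One arbiter turn for several findings whose cruxes are independent.
    Each item: {"finding_id", "finding", "rounds"} from the per-finding loop."""
    return call_llm(
        "Arbiter: for EACH finding below, return "
        '{"finding_id": ..., "verdict": "VALID|INVALID", "crux": "..."}. '
        "No UNCERTAIN; make each call independently.\n\n"
        + json.dumps(triaged, indent=2)
    )
```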
## Anti-patterns
- Don't triage P2/advisory findings. The whole point is gate decisions. P2 is advisory — let the author see it and move on.
- Don't let rounds rehash each other. Round 3 prompt must say "add NEW evidence or concede." If 3 rounds produce identical reasoning, you wasted 2 turns.
- Don't skip the arbiter on UNCERTAIN. If all 3 rounds say UNCERTAIN, the arbiter's job is to decide — not to join the fog.
- Don't hide arbiter overrides. When the arbiter flips the majority vote, record both `confidence` (the vote) and `final_verdict` (the arbiter). Humans deserve to see the disagreement.