```sh
git clone https://github.com/sambeau/kanbanzai
```

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/sambeau/kanbanzai "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.kbz/skills/review-code" ~/.claude/skills/sambeau-kanbanzai-review-code && rm -rf "$T"
```

`.kbz/skills/review-code/SKILL.md`

Vocabulary
- finding — a single observation about the code under review, classified as blocking or non-blocking. Never called an "issue," "problem," or "comment" within this skill.
- finding classification — the severity bucket for a finding: blocking (violates a spec requirement, must be fixed before approval) or non-blocking (improvement suggestion, style note, or observation that does not prevent approval).
- evidence-backed verdict — a verdict supported by specific code locations and spec citations, not by qualitative impressions.
- acceptance criteria traceability — the practice of linking each blocking finding (and each "pass" claim) to a numbered acceptance criterion in the specification.
- per-dimension outcome — the result for a single evaluation dimension: `pass` (fully meets requirements), `pass_with_notes` (meets requirements with non-blocking observations), `concern` (potential gap that needs human judgement), or `fail` (violates a spec requirement).
- review unit — a cohesive group of files evaluated together in a single review pass. Defined by the orchestrator, not by this skill.
- dimension — an independent evaluation axis (e.g., spec_conformance, implementation_quality, test_adequacy, security). Each dimension is assessed in isolation. Which dimensions apply depends on the reviewer role.
- structured review output — the machine-parseable format this skill produces. Every field has exactly one interpretation.
- remediation recommendation — a concrete suggestion attached to a non-blocking finding. Blocking findings cite the violated requirement; the fix is implied.
- spec conformance gap — a mismatch between what the spec requires and what the code implements. Always blocking.
- review profile — the set of dimensions required for this review, determined by the reviewer role assigned to this agent.
- aggregate verdict — the overall review outcome, derived from per-dimension outcomes. Set by the orchestrator after collating all reviewers, not by this skill.
- clearance — an approved verdict backed by per-dimension evidence. Distinguished from a rubber-stamp by the presence of substantive evidence.
- spec anchor — a reference to a specific numbered acceptance criterion or spec section that grounds a finding or a pass claim.
- validation loop — the iterative check at the end of the procedure where findings are verified against classification criteria before output is finalised.
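To make this vocabulary concrete, the sketch below models the core terms as Go types (Go matching the code under review in the examples). Every name here (`Finding`, `DimensionResult`, `Review`, and so on) is an illustrative assumption, not part of the skill's contract.

```go
package review

// Classification is the severity bucket for a finding.
type Classification string

const (
	Blocking    Classification = "blocking"     // violates a spec requirement
	NonBlocking Classification = "non-blocking" // suggestion, style note, or observation
)

// Outcome is a per-dimension outcome.
type Outcome string

const (
	Pass          Outcome = "pass"
	PassWithNotes Outcome = "pass_with_notes"
	Concern       Outcome = "concern"
	Fail          Outcome = "fail"
	NotApplicable Outcome = "not_applicable" // used by edge cases such as Missing Spec
)

// Finding is a single observation about the code under review.
type Finding struct {
	Dimension      string // e.g. "spec_conformance"
	Classification Classification
	Description    string
	Location       string // file:lines
	SpecAnchor     string // required when Classification == Blocking
	Recommendation string // attached to non-blocking findings
}

// DimensionResult is the isolated result for one evaluation axis.
type DimensionResult struct {
	Name     string
	Outcome  Outcome
	Evidence []string // spec anchor, what was checked, code location
	Findings []Finding
}

// Review is the structured review output for one review unit.
type Review struct {
	Unit       string
	Verdict    string // approved | approved_with_followups | needs_remediation | rejected | blocked
	Dimensions []DimensionResult
}
```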
Anti-Patterns
Rubber-Stamp Review (MAST FM-3.1)
- Detect: Verdict is "approved" with zero findings AND no per-dimension evidence citations.
- BECAUSE: LLM sycophancy makes approval the path of least resistance. FM-3.1 is the #1 quality failure mode in multi-agent systems (MAST, 2024). A review that approves without evidence is indistinguishable from no review at all.
- Resolve: Require at least one finding OR substantive per-dimension evidence for every clearance. If the code genuinely has no findings, demonstrate that by citing specific code locations and spec criteria checked.
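The detect rule is mechanisable. A sketch, reusing the illustrative types from the Vocabulary section:

```go
// looksLikeRubberStamp flags an approved verdict that carries neither
// findings nor per-dimension evidence: the FM-3.1 signature.
func looksLikeRubberStamp(r Review) bool {
	if r.Verdict != "approved" {
		return false
	}
	for _, d := range r.Dimensions {
		if len(d.Findings) > 0 || len(d.Evidence) > 0 {
			return false // substantive evidence present: a legitimate clearance
		}
	}
	return true
}
```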
Severity Inflation
- Detect: More than 40% of findings classified as blocking.
- BECAUSE: Over-classifying non-blocking findings as blocking dilutes the signal from genuine spec violations and creates unnecessary remediation cycles. A blocking finding requires a specific violated acceptance criterion — if you cannot cite one, the finding is non-blocking.
- Resolve: Re-check each blocking finding against the classification criteria. Blocking requires a specific violated requirement. Reclassify any finding that lacks a spec anchor to non-blocking.
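The 40% threshold is a straightforward ratio check. A sketch over the same illustrative types:

```go
// blockingRatio returns the fraction of findings classified as blocking.
// A result above 0.40 should trigger a re-check of every blocking
// classification against the criteria.
func blockingRatio(findings []Finding) float64 {
	if len(findings) == 0 {
		return 0
	}
	blocking := 0
	for _, f := range findings {
		if f.Classification == Blocking {
			blocking++
		}
	}
	return float64(blocking) / float64(len(findings))
}
```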
Dimension Bleed
- Detect: A finding in one dimension influences the verdict of another dimension (e.g., poor test coverage lowering the implementation_quality score).
- BECAUSE: Dimensions are independent evaluation axes. A poor test_adequacy result does not make the implementation incorrect — it means the tests are insufficient. Bleeding dimensions produces unreliable verdicts because a single weakness gets counted multiple times.
- Resolve: Evaluate each dimension in complete isolation. Cross-reference findings only in the aggregate verdict (which the orchestrator produces, not this skill).
Prose Commentary
- Detect: Output contains qualitative prose ("well-structured," "clean code," "looks good") instead of structured findings with locations and evidence.
- BECAUSE: Prose is ambiguous and cannot be machine-parsed for remediation routing. "Well-structured" means different things to different readers. Structured output has exactly one interpretation — dimension, location, evidence, classification.
- Resolve: Replace every qualitative statement with a finding entry containing dimension, location, and evidence. If a positive observation is worth recording, record it as per-dimension evidence with a spec anchor, not as prose.
Missing Spec Anchor
- Detect: A blocking finding does not cite a specific spec requirement or acceptance criterion.
- BECAUSE: Without a spec anchor, the finding is an opinion, not a conformance gap. Opinions cannot be objectively verified or remediated — the implementer has no way to know what "correct" looks like. Spec-anchored findings have a clear resolution path.
- Resolve: Link every blocking finding to a numbered acceptance criterion or spec section. If no spec requirement covers the observation, reclassify as non-blocking with a remediation recommendation.
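A sketch of the reclassification arm of this rule (adding a missing anchor is a judgement call left to the reviewer; the recommendation text below is a placeholder assumption):

```go
// demoteUnanchored reclassifies any blocking finding that lacks a spec
// anchor to non-blocking and attaches a remediation recommendation,
// per the Resolve rule above.
func demoteUnanchored(findings []Finding) []Finding {
	for i, f := range findings {
		if f.Classification == Blocking && f.SpecAnchor == "" {
			findings[i].Classification = NonBlocking
			findings[i].Recommendation = "no spec requirement covers this observation; treat as an improvement suggestion"
		}
	}
	return findings
}
```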
Checklist
Copy this checklist and track your progress:
- Read spec section(s) fully — understand what was required
- Read all files in the file list — understand what was implemented
- Confirm review profile and identify required dimensions
- Evaluate each required dimension independently
- Classify all findings as blocking or non-blocking
- Verify every blocking finding has a spec anchor
- Check for dimension bleed — no cross-contamination between verdicts
- Produce structured output in the required format
Procedure
Step 1: Orient
- Read the spec section(s) fully. Identify numbered acceptance criteria.
- Read all files in the file list. Understand what was implemented.
- Note the review profile — this determines which dimensions you evaluate.
- IF any input is missing (no spec, no file list, no review profile) → STOP. Report a Missing Context finding. Do not proceed without inputs.
- IF the spec is ambiguous or incomplete for any dimension you must evaluate → STOP. Report the ambiguity as a finding. Do not infer intent. The cost of asking is low; the cost of guessing wrong is high.
Step 2: Evaluate each dimension independently
For each dimension in your review profile:
- Work through the dimension's evaluation questions (provided by your role).
- For each acceptance criterion relevant to this dimension, check whether the implementation satisfies it. Record the specific code location as evidence.
- Record a per-dimension outcome: pass, pass_with_notes, concern, or fail.
- Record any findings with their classification (blocking or non-blocking).
- IF you notice a finding that belongs to a different dimension — record it under that dimension, not the current one. Do not let it influence the current dimension's verdict.
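The routing rule in the last bullet can be sketched as follows; the results map keyed by dimension name is a bookkeeping assumption, not a mandated structure:

```go
// routeFinding files a finding under the dimension it belongs to, even
// when it was noticed while evaluating a different dimension, so it
// cannot influence the current dimension's verdict.
func routeFinding(results map[string]*DimensionResult, f Finding) {
	r, ok := results[f.Dimension]
	if !ok {
		r = &DimensionResult{Name: f.Dimension}
		results[f.Dimension] = r
	}
	r.Findings = append(r.Findings, f)
}
```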
Step 3: Validate and iterate
Before producing final output, validate your findings:
- Does every blocking finding cite a specific spec requirement? IF NOT → either add the spec anchor or reclassify as non-blocking.
- Is any dimension's verdict influenced by another dimension's result? IF YES → re-evaluate the affected dimension in isolation.
- Are more than 40% of findings classified as blocking? IF YES → re-check each blocking classification against the criteria.
- Does every "pass" verdict have at least one evidence citation? IF NOT → add evidence or reconsider the verdict.
IF any validation check fails → fix the problem → re-validate from the first check. Repeat until all checks pass. Only then produce the structured output.
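A sketch of the loop structure. The individual check functions are hypothetical, and the dimension-isolation check in particular needs judgement rather than a pure predicate:

```go
// validateLoop re-runs every check from the top after any fix and only
// returns once all checks pass, mirroring the validation loop above.
func validateLoop(r *Review, checks []func(*Review) (ok bool, fix func())) {
	for {
		failed := false
		for _, check := range checks {
			if ok, fix := check(r); !ok {
				fix() // resolve the failed check
				failed = true
				break // then re-validate from the first check
			}
		}
		if !failed {
			return // all checks pass: finalise the structured output
		}
	}
}
```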
Edge Cases
Missing Spec
The review unit has no linked spec or the spec reference resolves to nothing.
- Set the `spec_conformance` per-dimension outcome to `not_applicable`.
- Record a non-blocking finding under `spec_conformance` stating that conformance cannot be assessed because no spec is linked.
- Continue evaluating all other dimensions normally — implementation quality, test adequacy, documentation currency, and workflow integrity do not require a spec to assess.
- In the overall verdict rationale, note that spec conformance was skipped due to missing spec. The verdict is derived from the remaining dimensions only.
Partial Implementation
The implementation contains stubs, TODOs, placeholder returns, or missing code paths that indicate incomplete work.
- Record a blocking finding under `implementation_quality` describing exactly what is missing — cite the stub location and what the code path should do.
- Set `spec_conformance` to `concern` for any acceptance criteria that depend on the incomplete code, and evaluate remaining criteria normally.
- If the incomplete code lacks corresponding tests, record a blocking finding under `test_adequacy` for each untested stub or placeholder path.
- Set the overall verdict to `needs_remediation`.
- Do not infer what the finished implementation would look like. Evaluate only what exists. Speculation about intent is not evidence.
Ambiguous Conformance
The implementation differs from the spec, but the difference may be intentional (e.g., an improved algorithm, a reordered sequence that preserves semantics).
- Record the deviation as a finding under `spec_conformance` with per-dimension outcome `concern` — not `fail`. Reserve `fail` for clear contradictions.
- Describe precisely what the spec says (cite the spec anchor) versus what the implementation does (cite the code location). Do not editorialize.
- Classify the finding:
  - Non-blocking if the implementation appears intentionally better or equivalent and does not omit or contradict any stated requirement.
  - Blocking if the deviation omits a requirement or contradicts a spec constraint, even if the alternative seems reasonable.
- If the deviation improves on the spec without omitting requirements, attach a note recommending that the spec be updated to reflect the implementation.
- Name the ambiguity explicitly (e.g., "AC-3 requires X; implementation does Y instead") so the orchestrator or human reviewer can resolve it with full context.
Missing Context
Required inputs are absent — files are inaccessible, the context packet is incomplete, or the review profile references artifacts that cannot be found.
- Record specifically what is missing: which files, which spec sections, or which context packet fields are absent.
- Set the overall verdict to `blocked`.
- For each affected dimension, record a blocking finding listing the missing input that prevents evaluation.
- Set each affected dimension's per-dimension outcome to `not_applicable` with a note identifying the missing dependency.
- Do not produce a partial review presented as complete. A blocked verdict with an explicit list of missing inputs is more useful than a review that silently omits dimensions.
Output Format
```
Review Unit: <unit-id>
Files: <file list>
Spec: <spec reference and section(s)>
Reviewer Role: <role name>

Overall: approved | approved_with_followups | needs_remediation | rejected | blocked

Dimensions:
  <dimension_name>: <pass | pass_with_notes | concern | fail>
    Evidence:
    - <spec anchor>: <what was checked, code location>
    - <spec anchor>: <what was checked, code location>
    Findings:
    - [blocking] <description> (spec: <anchor>, location: <file:lines>)
    - [non-blocking] <description> (location: <file:lines>)
      Recommendation: <remediation suggestion>

  <dimension_name>: <pass | pass_with_notes | concern | fail>
    Evidence:
    - ...
    Findings:
    - ...

Finding Summary:
  Blocking: <count>
  Non-blocking: <count>
  Total: <count>
```
Overall verdict rules:
- `approved` — all dimensions pass, zero findings.
- `approved_with_followups` — all dimensions pass or pass_with_notes, zero blocking findings, one or more non-blocking findings.
- `needs_remediation` — one or more blocking findings exist.
- `rejected` — a dimension has outcome `fail` with multiple blocking findings indicating fundamental spec misalignment.
- `blocked` — required inputs are missing; set per the Missing Context edge case, not derived from findings.
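These rules can be read as a derivation function. The sketch below encodes them with two assumptions: "multiple" is read as two or more, and dimensions with outcome `concern` are left to human judgement rather than derived. A `blocked` verdict is set directly from the Missing Context edge case, so the sketch omits it, and in the skill itself the aggregate verdict is set by the orchestrator, not the reviewer.

```go
// deriveOverall encodes the verdict table. It ignores concern and
// not_applicable outcomes, which the rules above leave to judgement.
func deriveOverall(dims []DimensionResult) string {
	blocking, nonBlocking, failedDims := 0, 0, 0
	for _, d := range dims {
		if d.Outcome == Fail {
			failedDims++
		}
		for _, f := range d.Findings {
			if f.Classification == Blocking {
				blocking++
			} else {
				nonBlocking++
			}
		}
	}
	switch {
	case failedDims > 0 && blocking >= 2: // "multiple" read as >= 2, an assumption
		return "rejected"
	case blocking > 0:
		return "needs_remediation"
	case nonBlocking > 0:
		return "approved_with_followups"
	default:
		return "approved"
	}
}
```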
Examples
BAD: Rubber-stamp with prose
```
Review Unit: service-layer
Overall: approved
Notes: Code is well-structured and follows Go conventions. Good use of
error handling. Tests look comprehensive.
```
WHY BAD: No findings. No evidence citations. No per-dimension verdicts. Qualitative prose ("well-structured," "comprehensive") with no structured data. A human or machine cannot determine what was actually checked. This is FM-3.1 — indistinguishable from not reviewing at all.
GOOD: Evidence-backed structured review
```
Review Unit: service-layer
Files: internal/service/feature.go, internal/service/feature_test.go
Spec: work/spec/feature-lifecycle.md §3 (AC-1 through AC-5)
Reviewer Role: reviewer-conformance

Overall: approved_with_followups

Dimensions:
  spec_conformance: pass
    Evidence:
    - AC-1: entity creation verified (feature.go L34-52, NewFeature constructor)
    - AC-2: input validation verified (feature.go L55-71, Validate method)
    - AC-3: error response format verified (feature.go L73-89, error wrapping)

  implementation_quality: pass_with_notes
    Evidence:
    - Error handling present on all exported functions
    - Interface accepted at consumer (feature.go L8), struct returned
    Findings:
    - [non-blocking] Error wrapping in CreateFeature (feature.go L48) uses fmt.Errorf without %w — loses error chain for callers using errors.Is
      Recommendation: Use fmt.Errorf("create feature: %w", err) to preserve the error chain

  test_adequacy: pass
    Evidence:
    - 14 test cases covering happy path, validation failures, and duplicate detection (feature_test.go L12-189)
    - Table-driven pattern used throughout

Finding Summary:
  Blocking: 0
  Non-blocking: 1
  Total: 1
```
WHY GOOD: Per-dimension verdicts with specific evidence. The single finding has a location, explanation, and remediation recommendation. Spec requirements cited by number. A machine can parse this; a human can verify each claim.
GOOD: Evidence-backed clearance with zero findings
```
Review Unit: storage-layer
Files: internal/store/yaml.go, internal/store/yaml_test.go
Spec: work/spec/entity-storage.md §2 (AC-4 through AC-6)
Reviewer Role: reviewer-conformance

Overall: approved

Dimensions:
  spec_conformance: pass
    Evidence:
    - AC-4: YAML serialisation verified (yaml.go L12-34, Marshal method)
    - AC-5: canonical field order verified (yaml.go L36-58, fieldOrder slice)
    - AC-6: round-trip determinism verified (yaml_test.go L102, TestStore_RoundTrip confirms identical output)

  implementation_quality: pass
    Evidence:
    - Error wrapping with %w throughout (yaml.go L22, L41, L67)
    - No exported functions without doc comments
    - Interface accepted at consumer (yaml.go L8), struct returned

  test_adequacy: pass
    Evidence:
    - 22 test cases including round-trip serialisation (yaml_test.go L15-198)
    - Error path coverage via TestStore_CreateConflict (yaml_test.go L145)
    - Table-driven pattern for serialisation variants

Finding Summary:
  Blocking: 0
  Non-blocking: 0
  Total: 0
```
WHY GOOD: Zero findings but substantive evidence for every dimension. The reviewer demonstrably examined the code — each pass verdict is backed by specific locations and spec anchors. This is a legitimate clearance, not a rubber stamp. Best example placed last because it demonstrates the hardest case: saying "approved" credibly.
Evaluation Criteria
These criteria are for evaluating the review output, not for self-evaluation during the review. They are phrased as gradable questions to support automated LLM-as-judge evaluation.
- Does every dimension have an explicit outcome (pass / pass_with_notes / concern / fail)? Weight: required.
- Does every blocking finding cite a specific spec requirement? Weight: required.
- Does the output distinguish blocking from non-blocking findings? Weight: required.
- Can a machine extract all findings from the output without ambiguity? Weight: high. (See the extractor sketch after this list.)
- Are dimensions evaluated independently — no bleed between verdicts? Weight: high.
- Is every "approved" or "pass" verdict backed by at least one evidence citation? Weight: high.
- Does the output avoid qualitative prose in place of structured data? Weight: medium.
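For the machine-extractability criterion, a minimal extractor sketch, assuming the finding markers from the output format are preserved verbatim:

```go
package review

import (
	"regexp"
	"strings"
)

// findingLine matches the fixed "[blocking]" / "[non-blocking]" prefix
// defined by the output format.
var findingLine = regexp.MustCompile(`^\s*- \[(blocking|non-blocking)\] (.+)$`)

// extractFindings pulls every finding out of a structured review body
// without ambiguity, one way to satisfy the criterion above.
func extractFindings(output string) (blocking, nonBlocking []string) {
	for _, line := range strings.Split(output, "\n") {
		if m := findingLine.FindStringSubmatch(line); m != nil {
			if m[1] == "blocking" {
				blocking = append(blocking, m[2])
			} else {
				nonBlocking = append(nonBlocking, m[2])
			}
		}
	}
	return blocking, nonBlocking
}
```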
Questions This Skill Answers
- How do I review code changes against a specification?
- What dimensions should I evaluate during code review?
- How do I classify a finding as blocking vs non-blocking?
- What format should my review output use?
- When should I stop and report missing context during review?
- What does a well-evidenced "approved" verdict look like?
- How do I handle ambiguous or incomplete spec sections during review?
- What is the difference between a concern and a fail?
- How do I avoid rubber-stamping a review?