git clone https://github.com/Intense-Visions/harness-engineering
T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/claude-code/harness-code-review" ~/.claude/skills/intense-visions-harness-engineering-harness-code-review-6c00f4 && rm -rf "$T"
Harness Code Review
Multi-phase code review pipeline — mechanical checks, graph-scoped context, parallel review agents, cross-agent deduplication, and structured output with technical rigor over social performance.
When to Use
- When performing a code review (manual invocation or triggered by `on_pr` / `on_review`)
- When requesting a review of completed work (see Role A at the end of this document)
- When responding to review feedback (see Role C at the end of this document)
- NOT for in-progress work (complete the feature first)
- NOT for rubber-stamping (if you cannot find issues, look harder or state confidence level)
- NOT for style-only feedback (leave that to linters and mechanical checks)
Argument Resolution
When invoked by autopilot (or with explicit arguments), resolve paths before starting:
- Session slug: If `session-slug` argument provided, set `{sessionDir} = .harness/sessions/<session-slug>/`. Pass to `gather_context({ session: "<session-slug>" })`. All handoff writes go to `{sessionDir}/handoff.json`.
- Commit range: If `commit-range` argument provided (e.g., `abc123..HEAD`), use as diff scope in Phase 2 MECHANICAL and Phase 7 OUTPUT. Otherwise, auto-detect from git environment.

When no arguments are provided (standalone invocation), session slug is unknown — omit it from `gather_context` and fall back to global `.harness/` paths. Diff scope is auto-detected.
Process
Iron Law
Review identifies issues. Review never fixes them.
A reviewer who applies fixes is no longer reviewing — they are editing with reviewer authority, and no review has taken place. Suggest the fix in the finding. Do not apply it. If you catch yourself writing production code during review, STOP. You have crossed the boundary.
The review runs as a 7-phase pipeline. Each phase has a clear input, output, and exit condition.
```
Phase 1: GATE --> Phase 2: MECHANICAL --> Phase 3: CONTEXT --> Phase 4: FAN-OUT
                                                                      |
Phase 7: OUTPUT <-- Phase 6: DEDUP+MERGE <-- Phase 5: VALIDATE <------+
```
| Phase | Tier | Purpose | Exit Condition |
|---|---|---|---|
| 1. GATE | fast | Skip ineligible PRs (CI only) | PR eligible, or exit with reason |
| 2. MECHANICAL | none | Lint, typecheck, test, sec scan | All pass -> continue; any fail -> report and stop |
| 3. CONTEXT | fast | Scope context per review domain | Context bundles assembled for each subagent |
| 4. FAN-OUT | mixed | Parallel review subagents | All subagents return ReviewFinding[] |
| 5. VALIDATE | none | Exclude mechanical dupes, verify | Unvalidated findings discarded |
| 6. DEDUP+MERGE | none | Group, merge, assign severity | Deduplicated finding list with merged evidence |
| 7. OUTPUT | none | Text output or GitHub comments | Review delivered, exit code set |
Finding Schema
```ts
interface ReviewFinding {
  id: string;                    // unique, for dedup
  file: string;                  // file path
  lineRange: [number, number];   // start, end
  domain: 'compliance' | 'bug' | 'security' | 'architecture';
  severity: 'critical' | 'important' | 'suggestion';
  title: string;                 // one-line summary
  rationale: string;             // why this is an issue
  suggestion?: string;           // fix, if available
  evidence: string[];            // supporting context from agent
  validatedBy: 'mechanical' | 'graph' | 'heuristic';
}
```
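For reference, a finding conforming to this schema might look like the following (illustrative values only; the file, lines, and ID are hypothetical, not from a real review):

```typescript
// Hypothetical ReviewFinding instance illustrating the schema fields.
const exampleFinding = {
  id: "arch-routes-users-12",
  file: "api/routes/users.ts",
  lineRange: [12, 15],
  domain: "architecture",
  severity: "critical",
  title: "Route handler imports db/queries.ts directly",
  rationale: "Bypasses the service layer; routes must depend on services, not db.",
  suggestion: "Route the call through services/user-service.ts.",
  evidence: ["api/routes/users.ts:12-15 -- direct import of db/queries.ts"],
  validatedBy: "heuristic",
};
```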
Flags
| Flag | Effect |
|---|---|
| `--comment` | Post inline comments to GitHub PR via CLI or GitHub MCP |
| `--deep` | Pass to `harness-security-review` for threat modeling |
| `--no-mechanical` | Skip mechanical checks (useful if already run in CI) |
| `--ci` | Enable eligibility gate, non-interactive output |
| `--fast` | Skip learnings, fast-tier agents for all fan-out slots |
| `--thorough` | Always load learnings, full roster + meta-judge, learnings in output |
Rigor Levels
Set via `--fast` or `--thorough` flags (or passed by autopilot). Default is standard.
| Phase | fast | standard (default) | thorough |
|---|---|---|---|
| 3. CONTEXT | Skip learnings entirely | Load learnings if file exists; score via filterByRelevance | Always load learnings; fail loudly if missing |
| 4. FAN-OUT | All agents at fast tier. Reduced focus areas. | Default tier assignments | Full roster + meta-judge confirms findings cited by multiple agents, flags contradictions, and surfaces cross-cutting concerns |
| 7. OUTPUT | Standard format | Standard format | Include "Learnings Applied" section with relevance scores |
Model Tiers
Tiers are abstract labels resolved at runtime from project config. If no config exists, all phases use the current model.
| Tier | Default | Used By |
|---|---|---|
| fast | haiku-class | GATE, CONTEXT |
| standard | sonnet-class | Compliance agent, Architecture agent |
| strong | opus-class | Bug Detection agent, Security agent |
Review Learnings Calibration
Before starting the pipeline, check for a project-specific calibration file. Behavior by rigor:
- `fast`: Skip entirely. Do not read or score learnings.
- `standard`: Read if file exists, score and filter. If missing, use default focus areas.
- `thorough`: Always read. If `.harness/review-learnings.md` is missing, log a warning.

If `.harness/review-learnings.md` exists (and rigor is not fast):
- Useful Findings section — prioritize these categories (historically caught real issues).
- Noise / False Positives section — de-prioritize or skip (wastes author time).
- Calibration Notes section — apply as project-specific overrides.
Learnings Relevance Scoring
When learnings are loaded (standard or thorough), score against diff context before applying:
- Build diff context string. Concatenate changed file paths + diff summary.
- Score each learning via `filterByRelevance(learnings, diffContext, 0.7, 1000)` from `packages/core/src/state/learnings-relevance.ts`. Only learnings scoring >= 0.7 are retained, sorted by score, truncated to the 1000-token budget.
- Apply filtered learnings: boost priority for passing Useful Findings, suppress passing Noise entries, apply passing Calibration Notes as overrides.
- If none pass 0.7 threshold, use default focus areas. No fallback to unscored inclusion.
After review, consider suggesting creation of `.harness/review-learnings.md` if you notice patterns that would benefit from calibration.
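The exact implementation of `filterByRelevance` lives in `packages/core/src/state/learnings-relevance.ts` and is not reproduced here. As a rough sketch of the described behavior (score each learning against the diff context, keep scores >= 0.7, sort best-first, truncate to a token budget), assuming a simple Jaccard word-overlap score and a 4-characters-per-token estimate — both assumptions, not the real scoring function:

```typescript
interface Learning { text: string; }

// Assumed scoring: Jaccard word overlap between a learning and the diff context.
function jaccard(a: string, b: string): number {
  const A = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const B = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const inter = [...A].filter((w) => B.has(w)).length;
  const union = new Set([...A, ...B]).size;
  return union === 0 ? 0 : inter / union;
}

// Keep learnings scoring >= threshold, best first, within a token budget.
function filterByRelevance(
  learnings: Learning[],
  diffContext: string,
  threshold = 0.7,
  tokenBudget = 1000,
): Learning[] {
  const scored = learnings
    .map((l) => ({ l, score: jaccard(l.text, diffContext) }))
    .filter((s) => s.score >= threshold)
    .sort((a, b) => b.score - a.score);
  const kept: Learning[] = [];
  let tokens = 0;
  for (const { l } of scored) {
    const cost = Math.ceil(l.text.length / 4); // rough token estimate (assumed)
    if (tokens + cost > tokenBudget) break;
    kept.push(l);
    tokens += cost;
  }
  return kept; // empty when nothing passes the threshold -- no unscored fallback
}
```

Note the last line: when zero learnings pass the threshold, the function returns an empty list rather than falling back to unscored inclusion, matching the rule above.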
Pipeline Phases
Phase 1: GATE
Tier: fast | Mode: CI only (`--ci`). Skip when invoked manually.
Checks whether the PR should be reviewed at all, preventing wasted compute in CI.
Checks:
- PR state: Closed or merged? -> Skip: "PR is closed."
- Draft status: Draft? -> Skip: "PR is draft."
- Trivial change: All changed files `.md`? -> Skip: "Documentation-only change."
- Already reviewed: Same commit range reviewed before? -> Skip: "Already reviewed at {sha}."

```sh
gh pr view --json state,isDraft,files
gh pr diff --name-only | grep -v '\.md$' | wc -l   # 0 means docs-only
```
Exit: If any check triggers skip, output reason and exit 0. Otherwise continue to Phase 2.
Phase 2: MECHANICAL
Tier: none (no LLM) | Mode: Skipped with `--no-mechanical`.
Run mechanical checks to establish an exclusion boundary. Issues caught here are excluded from AI review (Phase 4).
Checks:
- Harness validation: `assess_project({ path, checks: ["validate", "deps", "docs"], mode: "detailed" })` — runs checks in parallel.
- Security scan: `run_security_scan` on changed files. Record findings with rule ID, file, line.
- Type checking: Run `tsc --noEmit`. Record errors.
- Linting: Run linter. Record violations.
- Tests: Run test suite. Record failures.
Output: Set of mechanical findings (file, line, tool, message) forming the exclusion list for Phase 5.
Evidence Gate (session-aware)
When `sessionSlug` is available, load evidence entries from session state and cross-reference with findings:
- Load: `readSessionSection(projectRoot, sessionSlug, 'evidence')`
- Check if evidence entries reference the same file:line as each finding
- Tag unmatched findings `[UNVERIFIED]`
- Append coverage report: evidence entries count, findings with evidence, coverage %
When no session is available, skip silently. Evidence checking enhances but does not gate reviews.
Exit: If harness validate, typecheck, or tests fail, report in Strengths/Issues/Assessment format and stop. Lint warnings and security findings do not stop the pipeline -- recorded for exclusion only.
Phase 3: CONTEXT
Tier: fast | Purpose: Assemble scoped context bundles per review domain. Each subagent receives only domain-relevant context.
Change-Type Detection
Determine change type to shape review focus:
- Commit prefix: `feat:` -> feature, `fix:` -> bugfix, `refactor:` -> refactor, `docs:` -> docs
- Diff heuristic: New files + tests -> feature; small changes + test -> bugfix; renames/moves -> refactor; only `.md` -> docs
- Default: If ambiguous, treat as feature (most thorough review).
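The detection rules above can be sketched as follows (a simplified sketch: it covers the prefix table and the docs-only file heuristic, and omits the new-files/renames diff heuristics; the function name is illustrative):

```typescript
type ChangeType = "feature" | "bugfix" | "refactor" | "docs";

// Map a commit subject prefix to a change type; fall back to the docs-only
// file heuristic, then default to "feature" (most thorough review).
function detectChangeType(commitSubject: string, changedFiles: string[]): ChangeType {
  const prefixes: Array<[string, ChangeType]> = [
    ["feat:", "feature"],
    ["fix:", "bugfix"],
    ["refactor:", "refactor"],
    ["docs:", "docs"],
  ];
  for (const [prefix, type] of prefixes) {
    if (commitSubject.startsWith(prefix)) return type;
  }
  // Diff heuristic: only .md files changed -> docs.
  if (changedFiles.length > 0 && changedFiles.every((f) => f.endsWith(".md"))) {
    return "docs";
  }
  return "feature"; // ambiguous: treat as feature
}
```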
Context Scoping
| Domain | With Graph | Without Graph (Fallback) |
|---|---|---|
| Compliance | Convention files + changed files | Same (no graph needed) |
| Bug Detection | Changed files + dependencies via graph queries | Changed files + imported files (grep for imports) |
| Security | Security paths + data flow via graph queries | Changed files + files with auth/crypto/SQL/shell patterns |
| Architecture | Layer boundaries + imports via graph queries | Changed files + `harness check-deps` output |
1:1 Context Ratio Rule
- Small diffs (<20 lines): 3:1 context-to-diff ratio.
- Medium diffs (20-200 lines): 1:1 ratio. Full files + immediate dependencies.
- Large diffs (>200 lines): 1:1 floor. Prioritize ruthlessly. Flag as review concern.
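The ratio rule above can be expressed as a small selector (a sketch assuming the breakpoints are exactly 20 and 200 changed lines; the function name and flag string are illustrative):

```typescript
// Context-to-diff ratio by diff size, per the 1:1 Context Ratio Rule.
function contextRatio(diffLines: number): { ratio: number; flag?: string } {
  if (diffLines < 20) return { ratio: 3 };      // small diff: 3:1 context-to-diff
  if (diffLines <= 200) return { ratio: 1 };    // medium diff: 1:1
  return { ratio: 1, flag: "large diff -- prioritize ruthlessly" }; // 1:1 floor
}
```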
Context Gathering Priority
- Files directly imported/referenced by changed files
- Corresponding test files (note if missing)
- Spec/design docs mentioning changed components
- Type definitions consumed or produced by changed code
- Recent commits touching the same files
Graph-Enhanced Context
When `.harness/graph/` exists, use `gather_context` for efficient assembly:

```
gather_context({
  path: "<project-root>",
  intent: "Code review of <change description>",
  skill: "harness-code-review",
  session: "<session-slug-if-provided>",
  tokenBudget: 8000,
  include: ["graph", "learnings", "validation"]
})
```
Replaces manual `query_graph` + `get_impact` + `find_context_for` calls. Falls back gracefully when no graph exists. Supplement with targeted `query_graph` calls for domain-specific scoping.
Context Assembly Commands
```sh
git diff --stat HEAD~1                         # measure diff size
git diff HEAD~1 -- <file>                      # per-file diff
grep -n "import\|require\|from " <file>        # find imports
find . -name "*<module>*test*" -o -name "*<module>*spec*"
grep -rl "<component>" docs/changes/ docs/design-docs/ docs/plans/
grep -rn "interface\|type\|schema" <changed-file> | head -20
```
Commit History Context
```sh
git log --oneline -5 -- <affected-file>
```
Use to determine: Hotspot? (3+ changes in last 5 commits) Recently refactored? Multiple authors? Last change was bugfix? (yellow flag)
Exit: Context bundles assembled for all four domains. Continue to Phase 4.
Phase 4: FAN-OUT
Tier: mixed | Purpose: Run four parallel review subagents with domain-scoped context. Each produces `ReviewFinding[]`.
Rigor overrides:
- `fast`: All agents at fast tier (haiku-class), reduced reasoning depth.
- `standard`: Default tiers per agent.
- `thorough`: Default tiers + a meta-judge pass (strong tier) cross-validating findings, flagging contradictions, surfacing cross-cutting concerns.
Compliance Agent (standard tier)
Reviews adherence to project conventions, standards, and documentation.
Input: Compliance context bundle (convention files + changed files + change type)
Focus by change type:
Feature:
- Spec alignment: Implementation matches spec? All specified behaviors present?
- API surface: New interfaces minimal and well-named? Could exports be kept internal?
- Backward compatibility: Breaks existing callers? Migration path documented?
Bugfix:
- Root cause identified: Addresses root cause, not symptom?
- Original issue referenced: Commit/PR references bug report?
- No collateral changes: Only necessary changes?
Refactor:
- Behavioral equivalence: Existing tests pass without modification?
- No functionality changes: No new behavior introduced?
Docs:
- Accuracy: Documented behaviors match code?
- Completeness: All public interfaces documented?
- Consistency: Follows existing style and terminology?
- Links valid: All internal links resolve?
Output: `ReviewFinding[]` with `domain: 'compliance'`
Bug Detection Agent (strong tier)
Reviews for logic errors, edge cases, and correctness issues.
Input: Bug detection context bundle (changed files + dependencies)
Focus areas:
- Edge cases: Boundary conditions (empty, max, null, concurrent)
- Error handling: Appropriate level, helpful messages, no silent swallowing
- Logic errors: Off-by-one, incorrect boolean logic, missing early returns
- Race conditions: Concurrent shared state access, missing locks/atomics
- Resource leaks: Unclosed handles, missing cleanup in error paths
- Type safety: Mismatches, unsafe casts, missing null checks
- Test coverage: Happy path, error paths, edge cases. Meaningful, not just present.
- Regression tests: For bugfixes -- test that would have caught the bug
Output: `ReviewFinding[]` with `domain: 'bug'`
Security Agent (strong tier) -- via harness-security-review
Invokes `harness-security-review` in changed-files mode as the security fan-out slot.
Input: Security context bundle (security-relevant paths + data flows)
Invocation: Pipeline invokes `harness-security-review` with scope `changed-files`:
- Skips Phase 1 (SCAN) -- reads mechanical findings from PipelineContext
- Runs Phase 2 (REVIEW) -- OWASP baseline + stack-adaptive on changed files
- Skips Phase 3 (THREAT-MODEL) unless `--deep` was passed
- Returns `ReviewFinding[]` with security fields (`cweId`, `owaspCategory`, `confidence`, `remediation`, `references`)
Focus areas:
- Semantic security review (beyond mechanical scanners):
- Input flowing through multiple functions to dangerous sinks (SQL, shell, HTML)
- Missing authorization on new/modified endpoints
- Sensitive data in logs, errors, or API responses
- Authentication bypass paths
- Insecure defaults in new config
- Stack-adaptive: Node.js (prototype pollution, ReDoS, path traversal), React (XSS, dangerouslySetInnerHTML), Go (race conditions, integer overflow, unsafe pointer), Python (pickle, SSTI, command injection)
- CWE/OWASP references: All findings include `cweId`, `owaspCategory`, `remediation`.

Confirmed vulnerabilities are always `severity: 'critical'`.
Dedup with mechanical scan: Phase 5 uses the exclusion set from Phase 2 to discard overlapping findings.
Output: `ReviewFinding[]` with `domain: 'security'`
Architecture Agent (standard tier)
Reviews architectural violations, dependency direction, and design pattern compliance.
Input: Architecture context bundle (layer boundaries + import graph)
Focus areas:
- Layer compliance: Imports flowing in correct direction?
- Dependency direction: Depends on abstractions, not concretions?
- Single Responsibility: Each module has one reason to change?
- Open/Closed: Behavior extensible without modifying existing code?
- Pattern consistency: Follows established patterns? New patterns justified?
- Separation of concerns: Business logic separated from infrastructure?
- DRY violations: Duplicated logic to extract -- but NOT intentional duplication that will diverge.
- Performance preserved: Could restructuring introduce regressions?
Output: `ReviewFinding[]` with `domain: 'architecture'`
Exit: All four agents returned findings. Continue to Phase 5.
Phase 5: VALIDATE
Tier: none (mechanical) | Purpose: Remove false positives via cross-referencing.
Steps:
- Mechanical exclusion: Discard AI findings where same file + line range already flagged by Phase 2.
- Graph reachability (if available): Verify claimed dependency paths via `query_graph`. Discard findings with invalid reachability claims.
- Import-chain heuristic (fallback): Follow imports 2 levels deep. If claimed impact is unreachable within 2 hops, downgrade to `suggestion`.
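The mechanical-exclusion step can be sketched as follows (an assumption-laden sketch: it treats findings and mechanical results as sharing a file-plus-line-range shape, and excludes a finding whenever a mechanical finding's range intersects its range; the helper name is illustrative):

```typescript
interface Loc {
  file: string;
  lineRange: [number, number]; // inclusive start, end
}

// Drop AI findings already covered by a Phase 2 mechanical finding at the
// same file and an overlapping line range.
function excludeMechanical<T extends Loc>(findings: T[], mechanical: Loc[]): T[] {
  return findings.filter(
    (f) =>
      !mechanical.some(
        (m) =>
          m.file === f.file &&
          m.lineRange[0] <= f.lineRange[1] && // ranges intersect
          m.lineRange[1] >= f.lineRange[0],
      ),
  );
}
```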
Exit: Validated finding set. Continue to Phase 6.
Phase 6: DEDUP + MERGE
Tier: none (mechanical) | Purpose: Eliminate redundant findings across agents.
Steps:
- Group by location: Group by `file` + overlapping `lineRange` (intersecting or within 3 lines).
- Merge overlapping findings: Keep highest severity, combine evidence arrays, preserve strongest rationale, merge domain tags, generate merged ID.
- Assign final severity:
- Critical — Must fix. Bugs, security vulns, failing harness checks, boundary violations.
- Important — Should fix. Missing error handling, missing critical-path tests, unclear naming.
- Suggestion — Consider. Style, minor optimizations, alternatives. Does not block merge.
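The grouping and merging steps above can be sketched as a greedy fold (a sketch under stated assumptions: "within 3 lines" is implemented as a 3-line tolerance on range intersection, rationale/domain-tag merging is omitted for brevity, and the function names are illustrative):

```typescript
interface F {
  file: string;
  lineRange: [number, number];
  severity: "critical" | "important" | "suggestion";
  evidence: string[];
}

const rank = { critical: 0, important: 1, suggestion: 2 };

// Two findings overlap when they share a file and their line ranges intersect
// or sit within 3 lines of each other.
function overlaps(a: F, b: F): boolean {
  return (
    a.file === b.file &&
    a.lineRange[0] <= b.lineRange[1] + 3 &&
    b.lineRange[0] <= a.lineRange[1] + 3
  );
}

// Greedy merge: fold each finding into the first overlapping group, keeping
// the highest severity, combining evidence, and widening the line range.
function dedupMerge(findings: F[]): F[] {
  const merged: F[] = [];
  for (const f of findings) {
    const hit = merged.find((m) => overlaps(m, f));
    if (!hit) {
      merged.push({ ...f, evidence: [...f.evidence] });
      continue;
    }
    if (rank[f.severity] < rank[hit.severity]) hit.severity = f.severity;
    hit.evidence.push(...f.evidence);
    hit.lineRange = [
      Math.min(hit.lineRange[0], f.lineRange[0]),
      Math.max(hit.lineRange[1], f.lineRange[1]),
    ];
  }
  return merged;
}
```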
Exit: Deduplicated, severity-assigned list. Continue to Phase 7.
Phase 7: OUTPUT
Tier: none | Purpose: Deliver review in requested format.
Text Output (default)
```
**[STRENGTH]** Clean separation between route handler and service logic
**[CRITICAL]** api/routes/users.ts:12-15 -- Direct import from db/queries.ts bypasses service layer
**[IMPORTANT]** services/user-service.ts:45 -- createUser does not handle duplicate email
**[SUGGESTION]** Consider extracting validation into a shared utility
```
Strengths: What is done well. Be specific -- "Clean separation between X and Y" not "Looks good".
Issues: Grouped by severity (Critical / Important / Suggestion). Each includes: location, problem, rationale, suggested fix.
Assessment: Approve (no critical/important), Request Changes (critical/important present), or Comment (observations only).
Learnings Applied (thorough only): List learnings with Jaccard scores and how they influenced review. Omitted in fast/standard.
Exit code: 0 for Approve/Comment, 1 for Request Changes.
Inline GitHub Comments (`--comment`)
- Small fixes (<10 lines): Committable suggestion block.
- Large fixes (>=10 lines or no concrete suggestion): Description + rationale comment.
- Summary: Top-level PR review comment with Strengths/Issues/Assessment.
```sh
gh pr review --event APPROVE|REQUEST_CHANGES|COMMENT --body "<summary>"
gh api repos/{owner}/{repo}/pulls/{pr}/comments \
  --field body="<rationale>\n\`\`\`suggestion\n<fix>\n\`\`\`" \
  --field path="<file>" --field line=<line>
```
Review Acceptance
```
emit_interaction({
  path: "<project-root>",
  type: "confirmation",
  confirmation: {
    text: "Review complete: <Assessment>. Accept review?",
    context: "<N critical, N important, N suggestion findings>",
    impact: "Accepting finalizes findings. Approve = ready for merge. Request-changes = fixes needed.",
    risk: "<low if approve, high if critical>"
  }
})
```
Handoff and Transition
Write handoff to the session-scoped path when session slug is known, otherwise fall back to global:
- Session-scoped (preferred): `.harness/sessions/<session-slug>/handoff.json`
- Global (fallback, deprecated): `.harness/handoff.json`

[DEPRECATED] Writing to `.harness/handoff.json` is deprecated. In autopilot sessions, always write to `.harness/sessions/<slug>/handoff.json`.
```json
{
  "fromSkill": "harness-code-review",
  "phase": "OUTPUT",
  "summary": "<assessment summary>",
  "assessment": "approve | request-changes | comment",
  "findingCount": { "critical": 0, "important": 0, "suggestion": 0 },
  "artifacts": ["<reviewed files>"]
}
```
Write session summary (if session known):
```
writeSessionSummary(projectPath, sessionSlug, {
  session: "<session-slug>",
  lastActive: "<ISO timestamp>",
  skill: "harness-code-review",
  spec: "<spec path if known>",
  status: "Review complete. Assessment: <type>. <N> findings.",
  keyContext: "<1-2 sentences: review outcome, key findings>",
  nextStep: "<Address findings / Ready to merge / Observations delivered>"
})
```
If "approve": Emit transition:
```json
{
  "type": "transition",
  "transition": {
    "completedPhase": "review",
    "suggestedNext": "merge",
    "reason": "Review approved with no blocking issues",
    "artifacts": ["<reviewed files>"],
    "requiresConfirmation": true,
    "summary": "Review approved. <N> suggestions. Ready for PR/merge.",
    "qualityGate": {
      "checks": [
        { "name": "mechanical-checks", "passed": true },
        { "name": "no-critical-findings", "passed": true },
        { "name": "no-important-findings", "passed": true },
        { "name": "harness-validate", "passed": true }
      ],
      "allPassed": true
    }
  }
}
```
If user confirms: proceed to create PR or merge. If user declines: stop.
If "request-changes": Do NOT emit transition. Surface critical/important findings for resolution. Re-run after fixes.
If "comment": Do NOT emit transition. Observations delivered, no further action implied.
Role A: Requesting a Review
Not part of the pipeline. Documents the process for requesting reviews.
- Prepare context: Commit range, description (WHAT and WHY), plan reference, test evidence, harness check results.
- Dispatch: Identify reviewer, provide context, state what feedback you want.
- Wait. Do not modify code under review.
Role C: Responding to Review Feedback
Not part of the pipeline. Documents the process for responding to feedback.
- Read all feedback first. Understand full picture before responding.
- Verify before implementing: Understand it? Is it correct? Is it actionable?
- Technical rigor over social performance:
- Do NOT agree just to be agreeable. Push back with evidence if wrong.
- Do NOT implement every suggestion. Apply YAGNI.
- Do NOT make changes you do not understand. Ask.
- DO acknowledge correct feedback.
- DO push back when feedback contradicts approved plan/spec.
- Implement fixes: Change, test, run `harness validate` and `harness check-deps`, commit referencing the feedback.
- Re-request review with change summary, addressed vs. pushed-back items, fresh harness results.
Evidence Requirements
Every `ReviewFinding.evidence` array MUST include citations using one of:
- File reference: `file:line` format (e.g., `src/api/routes/users.ts:12-15` -- "bypasses service layer")
- Diff evidence: Before/after code with file path and line numbers
- Dependency chain: Import path showing violation (e.g., `routes/users.ts:3 imports db/queries.ts`)
- Test evidence: Test command and output for test-related findings
- Convention reference: Specific convention file and rule (e.g., `AGENTS.md:45`)
- Session evidence: Write to `evidence` session section via `manage_state`
When to cite: Phase 4 (each subagent populates evidence), Phase 5 (evidence verifies reachability), Phase 7 (every issue backed by evidence).
Uncited claims: Findings without evidence are discarded in Phase 5. Observations without file:line references are prefixed `[UNVERIFIED]` and downgraded to suggestion.
Rubric Compression
Review rubrics passed to subagents in Phase 4 MUST use compressed single-line format to minimize token consumption. Each rubric entry is one line with pipe-delimited fields:
```
domain|check-name|severity|one-sentence-criterion
```
Example (Compliance Agent rubric):
```
compliance|spec-alignment|critical|Implementation matches all behaviors specified in the approved spec
compliance|api-surface|important|New exports are minimal and well-named; internal symbols stay unexported
compliance|backward-compat|critical|No breaking changes to existing callers without documented migration path
compliance|naming|suggestion|Names follow project conventions (check AGENTS.md or .eslintrc)
```
Why: Verbose rubric prose inflates context by 2-5x without improving review accuracy. Dense single-line rubrics give the agent the same signal in fewer tokens, leaving more budget for actual code analysis.
Rules:
- Maximum 80 characters per criterion text
- Domain must match the subagent's scope (compliance, bug, security, architecture)
- Severity must be one of: critical, important, suggestion
- Rubric entries are guidance, not exhaustive — agents may surface findings outside the rubric
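A consumer of this format might parse and validate entries roughly like this (a sketch: the parser enforces the four-field shape, the severity vocabulary, and the 80-character criterion limit from the rules above; the function name is illustrative):

```typescript
interface RubricEntry {
  domain: string;
  check: string;
  severity: "critical" | "important" | "suggestion";
  criterion: string;
}

// Parse one pipe-delimited rubric line; return null for malformed entries.
function parseRubricLine(line: string): RubricEntry | null {
  const parts = line.split("|");
  if (parts.length !== 4) return null;
  const [domain, check, severity, criterion] = parts.map((p) => p.trim());
  if (!["critical", "important", "suggestion"].includes(severity)) return null;
  if (criterion.length > 80) return null; // max 80 characters per criterion
  return { domain, check, severity: severity as RubricEntry["severity"], criterion };
}
```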
Harness Integration
- `assess_project` — Phase 2: run validate/deps/docs in parallel. Failures are Critical and stop the pipeline.
- `gather_context` — Phase 3: parallel context assembly. Session parameter scopes to the session directory.
- `harness cleanup` — Optional Phase 2 check for entropy in changed files.
- Graph queries — Phase 3 (dependency context) and Phase 5 (reachability verification). Graceful fallback.
- `emit_interaction` — Post-review: suggest merge transition on APPROVE. Confirmed transition.
- Rigor levels — `--fast` / `--thorough` control learnings, agent tiers, output. See the Rigor Levels table.
- `filterByRelevance` — Phase 3 learnings scoring. Threshold 0.7, budget 1000 tokens.
- Session directory — `.harness/sessions/<slug>/` contains `handoff.json`, `state.json`, `artifacts.json` (spec/plan paths, reviewed file list). Write handoff to session scope when the slug is known. Global `.harness/handoff.json` is deprecated for session-aware invocations.
Uncertainty Surfacing
When a review subagent encounters ambiguity during analysis, classify it immediately:
- Blocking: Cannot determine severity without more context (e.g., unclear whether a pattern is intentional or accidental). Surface as a finding with severity `suggestion` and rationale explaining the ambiguity. Do not guess.
- Assumption: Can classify if assumption is stated (e.g., "assuming this endpoint is internal-only, the missing auth check is acceptable"). State the assumption in the finding. If wrong, the finding severity changes.
- Deferrable: Ambiguity does not affect the review (e.g., whether a naming choice will cause confusion in the future). Omit from findings — it is noise.
Do not suppress ambiguous findings. An ambiguous finding surfaced as a question is more valuable than a confident finding built on a wrong assumption.
Success Criteria
- Pipeline runs all 7 phases in order (skipping GATE without `--ci`)
- Mechanical failures in Phase 2 stop the pipeline before AI review
- Each Phase 4 subagent receives only domain-scoped context
- All findings use ReviewFinding schema
- Mechanical findings excluded from Phase 4 output in Phase 5
- Cross-agent duplicates merged in Phase 6
- Text output uses Strengths/Issues/Assessment with Critical/Important/Suggestion
- `--comment` posts inline GitHub comments with committable suggestions
- `--deep` adds threat modeling to the Security agent
- No code merges with Critical issues or failing harness checks
- Role C feedback verified before implementation; pushback is evidence-based
- `fast`: learnings skipped, all agents at fast tier
- `standard`: learnings included if file exists, scored at 0.7 threshold
- `thorough`: learnings always loaded/scored, meta-judge validates, "Learnings Applied" in output
- Zero learnings below threshold means zero included (no unscored fallback)
Examples
Example: Pipeline Review of a New API Endpoint
Phase 1 (GATE): Skipped -- manual invocation.
Phase 2 (MECHANICAL): `harness validate` passes. `harness check-deps` passes. Security scan clean. `tsc --noEmit` passes. Lint passes.
Phase 3 (CONTEXT): Change type: `feature` (prefix `feat:`). Bundles:
- Compliance: `CLAUDE.md` + changed files
- Bug detection: `api/routes/users.ts`, `services/user-service.ts`, `db/queries.ts`
- Security: `api/routes/users.ts` (endpoint), `services/user-service.ts` (data flow)
- Architecture: import graph showing `routes -> services -> db` layers
Phase 4 (FAN-OUT): Four agents in parallel:
- Compliance: 0 findings (spec alignment confirmed)
- Bug detection: 1 finding (missing duplicate email handling in createUser)
- Security: 0 findings
- Architecture: 1 finding (routes/users.ts imports directly from db/queries.ts)
Phase 5 (VALIDATE): No mechanical exclusions. Architecture finding validated by `check-deps` showing layer violation.
Phase 6 (DEDUP+MERGE): No overlaps -- 2 distinct findings in different files.
Phase 7 (OUTPUT):
Strengths:
- Clean separation between route handler and service logic
- Input validation using Zod schemas with clear error messages
- Comprehensive test coverage including error paths
Issues:
Critical:
- `api/routes/users.ts:12-15` -- Direct import from `db/queries.ts` bypasses service layer. Must route through `services/user-service.ts`. (domain: architecture, validatedBy: heuristic)

Important:
- `services/user-service.ts:45` -- `createUser` does not handle duplicate email. Database throws constraint violation surfacing as 500. Should catch and return 409. (domain: bug, validatedBy: heuristic)
Suggestion: (none)
Assessment: Request Changes -- one critical layer violation and one important missing error handler.
Gates
- Never skip mechanical checks without `--no-mechanical`. If not run in CI or locally, they must run in Phase 2 before AI review.
- Never merge with failing harness checks. `harness validate` and `harness check-deps` must pass. Failures are always Critical.
- Never implement feedback without verification. Verify correctness before changing code. Do not blindly comply.
- Never agree performatively. "Sure, I'll change that" without understanding is forbidden.
- Never skip the YAGNI check. Every suggestion must serve a current, concrete need. Speculative improvements rejected.
- Never apply fixes during review. Review output is findings, not code changes. Suggest fixes in finding text; never edit production code. Iron Law violation.
Red Flags
Universal
- "I believe the codebase does X" -- Stop. Read code, cite file:line. Belief is not evidence.
- "Let me recommend [pattern]" without checking existing patterns -- Stop. Search codebase first.
- "While we're here, we should also..." -- Stop. Flag the idea but do not expand scope.
Domain-Specific
- "The change looks reasonable, approving" -- Stop. Read every changed file. Approval without full review is rubber-stamping.
- "Let me fix this issue I found" -- Stop. Review identifies; it does not fix. Suggest the fix.
- "This is a minor style issue" -- Stop. Style or readability/maintainability? Classify accurately.
- "The author probably meant to..." -- Stop. Do not infer intent. Flag ambiguity as a question.
- Comment replacing code -- If a diff removes functional code and adds a comment (e.g., `// removed`, `// TODO: re-add`, `// no longer needed`), flag as Critical. Comments are not fixes. The code was either needed (removal is a bug) or not (remove silently). A comment replacing code is technical debt disguised as a change.
Rationalizations to Reject
Universal
These reasoning patterns sound plausible but lead to bad outcomes. Reject them.
- "It's probably fine" — "Probably" is not evidence. Verify before asserting.
- "This is best practice" — Best practice in what context? Cite the source and confirm it applies to this codebase.
- "We can fix it later" — If it is worth flagging, it is worth documenting now with a concrete follow-up plan.
Domain-Specific
| Rationalization | Reality |
|---|---|
| "The tests pass, so the logic must be correct" | Tests can be incomplete. Review the logic independently of test results. |
| "This is how it was done elsewhere in the codebase" | Existing patterns can be wrong. Evaluate the pattern on its merits, not just its precedent. |
| "It's just a refactor, low risk" | Refactors change behavior surfaces. Review them with the same rigor as feature changes. |
| "The fix is trivial, I'll just apply it inline" | Trivial fixes still skip review when applied by the reviewer. Suggest the fix; let the author apply and re-review. Iron Law. |
| "The diff is small so I can approve without reading every file" | Small diffs can contain critical bugs. Read every changed file completely — size does not correlate with risk. A one-line auth bypass is a small diff. |
| "The author is experienced, so I can be less thorough" | Review rigor is based on the code, not the author. Experienced authors make mistakes too. Apply the same checklist regardless of who wrote it. |
Escalation
- When reviewers disagree: Escalate to the human or tech lead.
- When review feedback changes the plan: Pause review. Plan must be updated first.
- When you cannot reproduce a reported issue: Ask reviewer for exact reproduction steps.
- When review takes more than 2 rounds: Something is fundamentally misaligned. Stop and discuss synchronously.
- When harness checks fail and you believe the check is wrong: Do not override or skip. File an issue against harness config.
- When pipeline produces a false positive after validation: Add the pattern to the `.harness/review-learnings.md` Noise / False Positives section.