
Harness Code Review

Install

Source · Clone the upstream repo:

git clone https://github.com/Intense-Visions/harness-engineering

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/claude-code/harness-code-review" ~/.claude/skills/intense-visions-harness-engineering-harness-code-review-6c00f4 && rm -rf "$T"

Manifest: agents/skills/claude-code/harness-code-review/SKILL.md

Source Content

Harness Code Review

Multi-phase code review pipeline — mechanical checks, graph-scoped context, parallel review agents, cross-agent deduplication, and structured output with technical rigor over social performance.

When to Use

  • When performing a code review (manual invocation or triggered by on_pr / on_review)
  • When requesting a review of completed work (see Role A at the end of this document)
  • When responding to review feedback (see Role C at the end of this document)
  • NOT for in-progress work (complete the feature first)
  • NOT for rubber-stamping (if you cannot find issues, look harder or state confidence level)
  • NOT for style-only feedback (leave that to linters and mechanical checks)

Argument Resolution

When invoked by autopilot (or with explicit arguments), resolve paths before starting:

  1. Session slug: If a session-slug argument is provided, set {sessionDir} = .harness/sessions/<session-slug>/. Pass it to gather_context({ session: "<session-slug>" }). All handoff writes go to {sessionDir}/handoff.json.
  2. Commit range: If a commit-range argument is provided (e.g., abc123..HEAD), use it as the diff scope in Phase 2 MECHANICAL and Phase 7 OUTPUT. Otherwise, auto-detect from the git environment.

When no arguments are provided (standalone invocation), the session slug is unknown: omit it from gather_context and fall back to global .harness/ paths. Diff scope is auto-detected.


Process

Iron Law

Review identifies issues. Review never fixes them.

A reviewer who applies fixes is no longer reviewing — they are editing with reviewer authority and no review. Suggest the fix in the finding. Do not apply it. If you catch yourself writing production code during review, STOP. You have crossed the boundary.

The review runs as a 7-phase pipeline. Each phase has a clear input, output, and exit condition.

Phase 1: GATE --> Phase 2: MECHANICAL --> Phase 3: CONTEXT --> Phase 4: FAN-OUT
                                                                      |
Phase 7: OUTPUT <-- Phase 6: DEDUP+MERGE <-- Phase 5: VALIDATE <------+

| Phase | Tier | Purpose | Exit Condition |
|---|---|---|---|
| 1. GATE | fast | Skip ineligible PRs (CI only) | PR eligible, or exit with reason |
| 2. MECHANICAL | none | Lint, typecheck, test, sec scan | All pass -> continue; any fail -> report and stop |
| 3. CONTEXT | fast | Scope context per review domain | Context bundles assembled for each subagent |
| 4. FAN-OUT | mixed | Parallel review subagents | All subagents return ReviewFinding[] |
| 5. VALIDATE | none | Exclude mechanical dupes, verify | Unvalidated findings discarded |
| 6. DEDUP+MERGE | none | Group, merge, assign severity | Deduplicated finding list with merged evidence |
| 7. OUTPUT | none | Text output or GitHub comments | Review delivered, exit code set |

Finding Schema

interface ReviewFinding {
  id: string; // unique, for dedup
  file: string; // file path
  lineRange: [number, number]; // start, end
  domain: 'compliance' | 'bug' | 'security' | 'architecture';
  severity: 'critical' | 'important' | 'suggestion';
  title: string; // one-line summary
  rationale: string; // why this is an issue
  suggestion?: string; // fix, if available
  evidence: string[]; // supporting context from agent
  validatedBy: 'mechanical' | 'graph' | 'heuristic';
}

Flags

| Flag | Effect |
|---|---|
| --comment | Post inline comments to GitHub PR via gh CLI or GitHub MCP |
| --deep | Pass --deep to harness-security-review for threat modeling |
| --no-mechanical | Skip mechanical checks (useful if already run in CI) |
| --ci | Enable eligibility gate, non-interactive output |
| --fast | Skip learnings, fast-tier agents for all fan-out slots |
| --thorough | Always load learnings, full roster + meta-judge, learnings in output |

Rigor Levels

Set via --fast or --thorough flags (or passed by autopilot). Default is standard.

| Phase | fast | standard (default) | thorough |
|---|---|---|---|
| 3. CONTEXT | Skip learnings entirely | Load learnings if file exists; score via filterByRelevance | Always load learnings; fail loudly if missing |
| 4. FAN-OUT | All agents at fast tier. Reduced focus areas. | Default tier assignments | Full roster + meta-judge confirms findings cited by multiple agents, flags contradictions, and surfaces cross-cutting concerns |
| 7. OUTPUT | Standard format | Standard format | Include "Learnings Applied" section with relevance scores |

Model Tiers

Tiers are abstract labels resolved at runtime from project config. If no config exists, all phases use the current model.

| Tier | Default | Used By |
|---|---|---|
| fast | haiku-class | GATE, CONTEXT |
| standard | sonnet-class | Compliance agent, Architecture agent |
| strong | opus-class | Bug Detection agent, Security agent |
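
The sketch below illustrates one way tier resolution could work, assuming a hypothetical modelTiers block in project config; the config key names and shape are placeholders, not part of the documented harness config.

// Sketch only: resolve an abstract tier label to a concrete model id.
type Tier = 'fast' | 'standard' | 'strong';

// Hypothetical config shape -- the real project config may differ.
interface ProjectConfig { modelTiers?: Partial<Record<Tier, string>>; }

function resolveTier(tier: Tier, config: ProjectConfig | undefined, currentModel: string): string {
  // If no config exists, every phase uses the current model.
  return config?.modelTiers?.[tier] ?? currentModel;
}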

Review Learnings Calibration

Before starting the pipeline, check for a project-specific calibration file. Behavior by rigor:

  • fast: Skip entirely. Do not read or score learnings.
  • standard: Read if file exists, score and filter. If missing, use default focus areas.
  • thorough: Always read. If .harness/review-learnings.md is missing, log a warning.

If .harness/review-learnings.md exists (and rigor is not fast):

  1. Useful Findings section — prioritize these categories (historically caught real issues).
  2. Noise / False Positives section — de-prioritize or skip (wastes author time).
  3. Calibration Notes section — apply as project-specific overrides.

Learnings Relevance Scoring

When learnings are loaded (standard or thorough), score against diff context before applying:

  1. Build diff context string. Concatenate changed file paths + diff summary.
  2. Score each learning via filterByRelevance(learnings, diffContext, 0.7, 1000) from packages/core/src/state/learnings-relevance.ts. Only learnings scoring >= 0.7 are retained, sorted by score, truncated to the 1000-token budget.
  3. Apply filtered learnings: boost priority for passing Useful Findings, suppress passing Noise entries, apply passing Calibration Notes as overrides.
  4. If none pass 0.7 threshold, use default focus areas. No fallback to unscored inclusion.
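
A minimal sketch of this flow, assuming filterByRelevance takes the arguments shown in step 2 and returns the retained learnings; the Learning shape and the changedFiles/diffSummary inputs are illustrative placeholders.

// Sketch only: filterByRelevance's signature is taken from step 2 above;
// everything else here is an illustrative placeholder.
import { filterByRelevance } from './learnings-relevance'; // packages/core/src/state/learnings-relevance.ts

interface Learning { section: 'useful' | 'noise' | 'calibration'; text: string; }

function scoreLearnings(learnings: Learning[], changedFiles: string[], diffSummary: string) {
  // 1. Build the diff context string: changed file paths + diff summary.
  const diffContext = [...changedFiles, diffSummary].join('\n');

  // 2. Keep learnings scoring >= 0.7, sorted by score, within the 1000-token budget.
  const relevant = filterByRelevance(learnings, diffContext, 0.7, 1000);

  // 4. Nothing passes the threshold -> default focus areas, no unscored fallback.
  if (relevant.length === 0) return { focus: 'default' as const, learnings: [] };

  // 3. Boost useful-findings categories, suppress noise entries, apply calibration overrides.
  return { focus: 'calibrated' as const, learnings: relevant };
}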

After review, consider suggesting .harness/review-learnings.md creation if you notice patterns that would benefit from calibration.

Pipeline Phases

Phase 1: GATE

Tier: fast | Mode: CI only (--ci). Skip when invoked manually.

Checks whether the PR should be reviewed at all, preventing wasted compute in CI.

Checks:

  1. PR state: Closed or merged? -> Skip: "PR is closed."
  2. Draft status: Draft? -> Skip: "PR is draft."
  3. Trivial change: All changed files .md? -> Skip: "Documentation-only change."
  4. Already reviewed: Same commit range reviewed before? -> Skip: "Already reviewed at {sha}."
gh pr view --json state,isDraft,files
gh pr diff --name-only | grep -v '\.md$' | wc -l  # 0 means docs-only

Exit: If any check triggers skip, output reason and exit 0. Otherwise continue to Phase 2.


Phase 2: MECHANICAL

Tier: none (no LLM) | Mode: Skipped with --no-mechanical.

Run mechanical checks to establish an exclusion boundary. Issues caught here are excluded from AI review (Phase 4).

Checks:

  1. Harness validation: assess_project({ path, checks: ["validate", "deps", "docs"], mode: "detailed" }), which runs checks in parallel.
  2. Security scan: run_security_scan on changed files. Record findings with rule ID, file, line.
  3. Type checking: Run tsc --noEmit. Record errors.
  4. Linting: Run linter. Record violations.
  5. Tests: Run test suite. Record failures.

Output: Set of mechanical findings (file, line, tool, message) forming the exclusion list for Phase 5.

Evidence Gate (session-aware)

When sessionSlug is available, load evidence entries from session state and cross-reference with findings:

  1. Load: readSessionSection(projectRoot, sessionSlug, 'evidence')
  2. Check if evidence entries reference the same file:line as each finding
  3. Unmatched findings tagged [UNVERIFIED]
  4. Append coverage report: evidence entries count, findings with evidence, coverage %

When no session is available, skip silently. Evidence checking enhances but does not gate reviews.
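
A minimal sketch of the cross-reference step, using the ReviewFinding schema above; the EvidenceEntry shape and the line-containment rule are assumptions about what readSessionSection returns, not documented harness behavior.

// Sketch only: EvidenceEntry is an assumed shape for the 'evidence' session section.
interface EvidenceEntry { file: string; line: number; note?: string; }

function applyEvidenceGate(findings: ReviewFinding[], evidence: EvidenceEntry[]) {
  // Tag findings that no evidence entry covers.
  const tagged = findings.map(f => {
    const matched = evidence.some(
      e => e.file === f.file && e.line >= f.lineRange[0] && e.line <= f.lineRange[1]
    );
    return matched ? f : { ...f, title: `[UNVERIFIED] ${f.title}` };
  });

  // Coverage report: evidence entry count, findings with evidence, coverage %.
  const withEvidence = tagged.filter(f => !f.title.startsWith('[UNVERIFIED]')).length;
  return {
    findings: tagged,
    coverage: {
      evidenceEntries: evidence.length,
      findingsWithEvidence: withEvidence,
      coveragePct: findings.length ? Math.round((withEvidence / findings.length) * 100) : 100,
    },
  };
}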

Exit: If harness validate, typecheck, or tests fail, report in Strengths/Issues/Assessment format and stop. Lint warnings and security findings do not stop the pipeline -- recorded for exclusion only.


Phase 3: CONTEXT

Tier: fast | Purpose: Assemble scoped context bundles per review domain. Each subagent receives only domain-relevant context.

Change-Type Detection

Determine change type to shape review focus:

  1. Commit prefix: feat: -> feature, fix: -> bugfix, refactor: -> refactor, docs: -> docs
  2. Diff heuristic: New files + tests -> feature; small changes + test -> bugfix; renames/moves -> refactor; only .md -> docs
  3. Default: If ambiguous, treat as feature (most thorough review).
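
A sketch of the detection order, assuming the commit message and changed-file list are already in hand; the prefix map follows the rules above, and the diff heuristics are simplified to the docs-only case.

type ChangeType = 'feature' | 'bugfix' | 'refactor' | 'docs';

// Sketch only: real diff heuristics (new files + tests, renames, etc.) would
// need richer git data than a bare file list.
function detectChangeType(commitMessage: string, changedFiles: string[]): ChangeType {
  // 1. Commit prefix
  const prefix = commitMessage.match(/^(feat|fix|refactor|docs):/)?.[1];
  if (prefix === 'feat') return 'feature';
  if (prefix === 'fix') return 'bugfix';
  if (prefix === 'refactor') return 'refactor';
  if (prefix === 'docs') return 'docs';

  // 2. Diff heuristic (docs-only case shown here)
  if (changedFiles.length > 0 && changedFiles.every(f => f.endsWith('.md'))) return 'docs';

  // 3. Default: ambiguous changes get the most thorough review.
  return 'feature';
}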

Context Scoping

| Domain | With Graph | Without Graph (Fallback) |
|---|---|---|
| Compliance | Convention files + changed files | Same (no graph needed) |
| Bug Detection | Changed files + dependencies via query_graph | Changed files + imported files (grep import) |
| Security | Security paths + data flow via query_graph | Changed files + files with auth/crypto/SQL/shell patterns |
| Architecture | Layer boundaries + imports via query_graph / get_impact | Changed files + harness check-deps output |

1:1 Context Ratio Rule

  • Small diffs (<20 lines): 3:1 context-to-diff ratio.
  • Medium diffs (20-200 lines): 1:1 ratio. Full files + immediate dependencies.
  • Large diffs (>200 lines): 1:1 floor. Prioritize ruthlessly. Flag as review concern.
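
A compact expression of the ratio rule, with the thresholds listed above; the function name and return shape are illustrative.

// Sketch: map diff size to the context-to-diff ratio and a review flag.
function contextRatio(diffLines: number): { ratio: number; flag?: 'large-diff' } {
  if (diffLines < 20) return { ratio: 3 };     // small: 3:1 context-to-diff
  if (diffLines <= 200) return { ratio: 1 };   // medium: 1:1, full files + immediate deps
  return { ratio: 1, flag: 'large-diff' };     // large: 1:1 floor, prioritize ruthlessly
}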

Context Gathering Priority

  1. Files directly imported/referenced by changed files
  2. Corresponding test files (note if missing)
  3. Spec/design docs mentioning changed components
  4. Type definitions consumed or produced by changed code
  5. Recent commits touching the same files

Graph-Enhanced Context

When .harness/graph/ exists, use gather_context for efficient assembly:

gather_context({
  path: "<project-root>",
  intent: "Code review of <change description>",
  skill: "harness-code-review",
  session: "<session-slug-if-provided>",
  tokenBudget: 8000,
  include: ["graph", "learnings", "validation"]
})

Replaces manual query_graph + get_impact + find_context_for calls. Falls back gracefully when no graph exists. Supplement with targeted query_graph calls for domain-specific scoping.

Context Assembly Commands

git diff --stat HEAD~1                    # measure diff size
git diff HEAD~1 -- <file>                 # per-file diff
grep -n "import\|require\|from " <file>   # find imports
find . -name "*<module>*test*" -o -name "*<module>*spec*"
grep -rl "<component>" docs/changes/ docs/design-docs/ docs/plans/
grep -rn "interface\|type\|schema" <changed-file> | head -20

Commit History Context

git log --oneline -5 -- <affected-file>

Use to determine: Hotspot? (3+ changes in last 5 commits) Recently refactored? Multiple authors? Last change was bugfix? (yellow flag)

Exit: Context bundles assembled for all four domains. Continue to Phase 4.


Phase 4: FAN-OUT

Tier: mixed | Purpose: Run four parallel review subagents with domain-scoped context. Each produces ReviewFinding[].

Rigor overrides:

  • fast: All agents at fast tier (haiku-class), reduced reasoning depth.
  • standard: Default tiers per agent.
  • thorough: Default tiers + meta-judge pass (strong tier) cross-validating findings, flagging contradictions, surfacing cross-cutting concerns.

Compliance Agent (standard tier)

Reviews adherence to project conventions, standards, and documentation.

Input: Compliance context bundle (convention files + changed files + change type)

Focus by change type:

Feature:

  • Spec alignment: Implementation matches spec? All specified behaviors present?
  • API surface: New interfaces minimal and well-named? Could exports be kept internal?
  • Backward compatibility: Breaks existing callers? Migration path documented?

Bugfix:

  • Root cause identified: Addresses root cause, not symptom?
  • Original issue referenced: Commit/PR references bug report?
  • No collateral changes: Only necessary changes?

Refactor:

  • Behavioral equivalence: Existing tests pass without modification?
  • No functionality changes: No new behavior introduced?

Docs:

  • Accuracy: Documented behaviors match code?
  • Completeness: All public interfaces documented?
  • Consistency: Follows existing style and terminology?
  • Links valid: All internal links resolve?

Output: ReviewFinding[] with domain: 'compliance'


Bug Detection Agent (strong tier)

Reviews for logic errors, edge cases, and correctness issues.

Input: Bug detection context bundle (changed files + dependencies)

Focus areas:

  • Edge cases: Boundary conditions (empty, max, null, concurrent)
  • Error handling: Appropriate level, helpful messages, no silent swallowing
  • Logic errors: Off-by-one, incorrect boolean logic, missing early returns
  • Race conditions: Concurrent shared state access, missing locks/atomics
  • Resource leaks: Unclosed handles, missing cleanup in error paths
  • Type safety: Mismatches, unsafe casts, missing null checks
  • Test coverage: Happy path, error paths, edge cases. Meaningful, not just present.
  • Regression tests: For bugfixes -- test that would have caught the bug

Output: ReviewFinding[] with domain: 'bug'


Security Agent (strong tier) -- via harness-security-review

Invokes harness-security-review in changed-files mode as the security fan-out slot.

Input: Security context bundle (security-relevant paths + data flows)

Invocation: The pipeline invokes harness-security-review with scope changed-files:

  • Skips Phase 1 (SCAN) -- reads mechanical findings from PipelineContext
  • Runs Phase 2 (REVIEW) -- OWASP baseline + stack-adaptive on changed files
  • Skips Phase 3 (THREAT-MODEL) unless --deep was passed
  • Returns ReviewFinding[] with security fields (cweId, owaspCategory, confidence, remediation, references)

Focus areas:

  1. Semantic security review (beyond mechanical scanners):

    • Input flowing through multiple functions to dangerous sinks (SQL, shell, HTML)
    • Missing authorization on new/modified endpoints
    • Sensitive data in logs, errors, or API responses
    • Authentication bypass paths
    • Insecure defaults in new config
  2. Stack-adaptive: Node.js (prototype pollution, ReDoS, path traversal), React (XSS, dangerouslySetInnerHTML), Go (race conditions, integer overflow, unsafe pointer), Python (pickle, SSTI, command injection)

  3. CWE/OWASP references: All findings include cweId, owaspCategory, remediation.

Confirmed vulnerabilities are always severity: 'critical'.

Dedup with mechanical scan: Phase 5 uses the exclusion set from Phase 2 to discard overlapping findings.

Output: ReviewFinding[] with domain: 'security'


Architecture Agent (standard tier)

Reviews architectural violations, dependency direction, and design pattern compliance.

Input: Architecture context bundle (layer boundaries + import graph)

Focus areas:

  • Layer compliance: Imports flowing in correct direction?
  • Dependency direction: Depends on abstractions, not concretions?
  • Single Responsibility: Each module has one reason to change?
  • Open/Closed: Behavior extensible without modifying existing code?
  • Pattern consistency: Follows established patterns? New patterns justified?
  • Separation of concerns: Business logic separated from infrastructure?
  • DRY violations: Duplicated logic to extract -- but NOT intentional duplication that will diverge.
  • Performance preserved: Could restructuring introduce regressions?

Output: ReviewFinding[] with domain: 'architecture'

Exit: All four agents returned findings. Continue to Phase 5.


Phase 5: VALIDATE

Tier: none (mechanical) | Purpose: Remove false positives via cross-referencing.

Steps:

  1. Mechanical exclusion: Discard AI findings where same file + line range already flagged by Phase 2.
  2. Graph reachability (if available): Verify claimed dependency paths via query_graph. Discard findings with invalid reachability claims.
  3. Import-chain heuristic (fallback): Follow imports 2 levels deep. If claimed impact is unreachable within 2 hops, downgrade to suggestion.
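
A minimal sketch of step 1, the mechanical exclusion, using the ReviewFinding schema above; the MechanicalFinding shape is an assumed stand-in for the Phase 2 output (file, line, tool, message).

// Sketch only: discard AI findings whose file + line range is already
// covered by a Phase 2 mechanical finding.
interface MechanicalFinding { file: string; line: number; tool: string; message: string; }

function excludeMechanical(findings: ReviewFinding[], mechanical: MechanicalFinding[]): ReviewFinding[] {
  return findings.filter(f =>
    !mechanical.some(m =>
      m.file === f.file && m.line >= f.lineRange[0] && m.line <= f.lineRange[1]
    )
  );
}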

Exit: Validated finding set. Continue to Phase 6.


Phase 6: DEDUP + MERGE

Tier: none (mechanical) | Purpose: Eliminate redundant findings across agents.

Steps:

  1. Group by location: Group by file + overlapping lineRange (intersecting or within 3 lines).
  2. Merge overlapping findings: Keep highest severity, combine evidence arrays, preserve strongest rationale, merge domain tags, generate merged ID.
  3. Assign final severity:
    • Critical — Must fix. Bugs, security vulns, failing harness checks, boundary violations.
    • Important — Should fix. Missing error handling, missing critical-path tests, unclear naming.
    • Suggestion — Consider. Style, minor optimizations, alternatives. Does not block merge.
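
A sketch of the overlap test and merge from steps 1-2; the 3-line proximity rule and severity ordering follow the steps above, while the merged-ID scheme and keeping the higher-severity finding's domain are illustrative choices.

// Sketch only: group/merge helpers for Phase 6, using the ReviewFinding schema above.
const SEVERITY_RANK = { critical: 0, important: 1, suggestion: 2 } as const;

function overlaps(a: ReviewFinding, b: ReviewFinding): boolean {
  if (a.file !== b.file) return false;
  const [aStart, aEnd] = a.lineRange;
  const [bStart, bEnd] = b.lineRange;
  return aStart <= bEnd + 3 && bStart <= aEnd + 3; // intersecting or within 3 lines
}

function mergeFindings(a: ReviewFinding, b: ReviewFinding): ReviewFinding {
  // Keep the higher-severity finding, combine evidence arrays, generate a merged ID.
  const [keep, other] = SEVERITY_RANK[a.severity] <= SEVERITY_RANK[b.severity] ? [a, b] : [b, a];
  return { ...keep, id: `${keep.id}+${other.id}`, evidence: [...keep.evidence, ...other.evidence] };
}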

Exit: Deduplicated, severity-assigned list. Continue to Phase 7.


Phase 7: OUTPUT

Tier: none | Purpose: Deliver review in requested format.

Text Output (default)

**[STRENGTH]** Clean separation between route handler and service logic
**[CRITICAL]** api/routes/users.ts:12-15 -- Direct import from db/queries.ts bypasses service layer
**[IMPORTANT]** services/user-service.ts:45 -- createUser does not handle duplicate email
**[SUGGESTION]** Consider extracting validation into a shared utility

Strengths: What is done well. Be specific -- "Clean separation between X and Y" not "Looks good".

Issues: Grouped by severity (Critical / Important / Suggestion). Each includes: location, problem, rationale, suggested fix.

Assessment: Approve (no critical/important), Request Changes (critical/important present), or Comment (observations only).

Learnings Applied (thorough only): List learnings with Jaccard scores and how they influenced review. Omitted in fast/standard.

Exit code: 0 for Approve/Comment, 1 for Request Changes.

Inline GitHub Comments (--comment)

  • Small fixes (<10 lines): Committable suggestion block.
  • Large fixes (>=10 lines or no concrete suggestion): Description + rationale comment.
  • Summary: Top-level PR review comment with Strengths/Issues/Assessment.
gh pr review --event APPROVE|REQUEST_CHANGES|COMMENT --body "<summary>"
gh api repos/{owner}/{repo}/pulls/{pr}/comments \
  --field body="<rationale>\n\`\`\`suggestion\n<fix>\n\`\`\`" \
  --field path="<file>" --field line=<line>

Review Acceptance

emit_interaction({
  path: "<project-root>",
  type: "confirmation",
  confirmation: {
    text: "Review complete: <Assessment>. Accept review?",
    context: "<N critical, N important, N suggestion findings>",
    impact: "Accepting finalizes findings. Approve = ready for merge. Request-changes = fixes needed.",
    risk: "<low if approve, high if critical>"
  }
})

Handoff and Transition

Write handoff to the session-scoped path when session slug is known, otherwise fall back to global:

  • Session-scoped (preferred): .harness/sessions/<session-slug>/handoff.json
  • Global (fallback, deprecated): .harness/handoff.json

[DEPRECATED] Writing to .harness/handoff.json is deprecated. In autopilot sessions, always write to .harness/sessions/<slug>/handoff.json.

{
  "fromSkill": "harness-code-review",
  "phase": "OUTPUT",
  "summary": "<assessment summary>",
  "assessment": "approve | request-changes | comment",
  "findingCount": { "critical": 0, "important": 0, "suggestion": 0 },
  "artifacts": ["<reviewed files>"]
}

Write session summary (if session known):

writeSessionSummary(projectPath, sessionSlug, {
  session: "<session-slug>",
  lastActive: "<ISO timestamp>",
  skill: "harness-code-review",
  spec: "<spec path if known>",
  status: "Review complete. Assessment: <type>. <N> findings.",
  keyContext: "<1-2 sentences: review outcome, key findings>",
  nextStep: "<Address findings / Ready to merge / Observations delivered>"
})

If "approve": Emit transition:

{
  "type": "transition",
  "transition": {
    "completedPhase": "review",
    "suggestedNext": "merge",
    "reason": "Review approved with no blocking issues",
    "artifacts": ["<reviewed files>"],
    "requiresConfirmation": true,
    "summary": "Review approved. <N> suggestions. Ready for PR/merge.",
    "qualityGate": {
      "checks": [
        { "name": "mechanical-checks", "passed": true },
        { "name": "no-critical-findings", "passed": true },
        { "name": "no-important-findings", "passed": true },
        { "name": "harness-validate", "passed": true }
      ],
      "allPassed": true
    }
  }
}

If user confirms: proceed to create PR or merge. If user declines: stop.

If "request-changes": Do NOT emit transition. Surface critical/important findings for resolution. Re-run after fixes.

If "comment": Do NOT emit transition. Observations delivered, no further action implied.


Role A: Requesting a Review

Not part of the pipeline. Documents the process for requesting reviews.

  1. Prepare context: Commit range, description (WHAT and WHY), plan reference, test evidence, harness check results.
  2. Dispatch: Identify reviewer, provide context, state what feedback you want.
  3. Wait. Do not modify code under review.

Role C: Responding to Review Feedback

Not part of the pipeline. Documents the process for responding to feedback.

  1. Read all feedback first. Understand full picture before responding.
  2. Verify before implementing: Understand it? Is it correct? Is it actionable?
  3. Technical rigor over social performance:
    • Do NOT agree just to be agreeable. Push back with evidence if wrong.
    • Do NOT implement every suggestion. Apply YAGNI.
    • Do NOT make changes you do not understand. Ask.
    • DO acknowledge correct feedback.
    • DO push back when feedback contradicts approved plan/spec.
  4. Implement fixes: Change, test, run harness validate and harness check-deps, commit referencing feedback.
  5. Re-request review with change summary, addressed vs. pushed-back items, fresh harness results.

Evidence Requirements

Every ReviewFinding.evidence array MUST include citations using one of:

  1. File reference: file:line format (e.g., src/api/routes/users.ts:12-15 -- "bypasses service layer")
  2. Diff evidence: Before/after code with file path and line numbers
  3. Dependency chain: Import path showing violation (e.g., routes/users.ts:3 imports db/queries.ts)
  4. Test evidence: Test command and output for test-related findings
  5. Convention reference: Specific convention file and rule (e.g., AGENTS.md:45)
  6. Session evidence: Write to the evidence session section via manage_state
When to cite: Phase 4 (each subagent populates evidence), Phase 5 (evidence verifies reachability), Phase 7 (every issue backed by evidence).

Uncited claims: Findings without evidence are discarded in Phase 5. Observations without file:line references are prefixed [UNVERIFIED] and downgraded to suggestion.
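
A sketch of how the uncited-claims rule could be enforced mechanically; the file:line regex is an illustrative heuristic covering citation formats 1-3 above, and the other formats would need their own matchers.

// Sketch only: discard findings with no evidence, downgrade findings whose
// evidence lacks any file:line-style citation.
const CITATION = /\S+:\d+(-\d+)?/; // e.g. src/api/routes/users.ts:12-15

function enforceCitations(findings: ReviewFinding[]): ReviewFinding[] {
  return findings
    .filter(f => f.evidence.length > 0) // no evidence at all: discard
    .map(f =>
      f.evidence.some(e => CITATION.test(e))
        ? f
        : { ...f, severity: 'suggestion' as const, title: `[UNVERIFIED] ${f.title}` }
    );
}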

Rubric Compression

Review rubrics passed to subagents in Phase 4 MUST use compressed single-line format to minimize token consumption. Each rubric entry is one line with pipe-delimited fields:

domain|check-name|severity|one-sentence-criterion

Example (Compliance Agent rubric):

compliance|spec-alignment|critical|Implementation matches all behaviors specified in the approved spec
compliance|api-surface|important|New exports are minimal and well-named; internal symbols stay unexported
compliance|backward-compat|critical|No breaking changes to existing callers without documented migration path
compliance|naming|suggestion|Names follow project conventions (check AGENTS.md or .eslintrc)

Why: Verbose rubric prose inflates context by 2-5x without improving review accuracy. Dense single-line rubrics give the agent the same signal in fewer tokens, leaving more budget for actual code analysis.

Rules:

  • Maximum 80 characters per criterion text
  • Domain must match the subagent's scope (compliance, bug, security, architecture)
  • Severity must be one of: critical, important, suggestion
  • Rubric entries are guidance, not exhaustive — agents may surface findings outside the rubric
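
A small parsing sketch for the compressed format; the RubricEntry field names mirror the pipe-delimited layout, and the length check enforces the 80-character rule above. The function itself is illustrative, not part of the harness API.

// Sketch only: parse one rubric line per entry, enforcing the rules above.
interface RubricEntry {
  domain: 'compliance' | 'bug' | 'security' | 'architecture';
  check: string;
  severity: 'critical' | 'important' | 'suggestion';
  criterion: string;
}

function parseRubric(lines: string[]): RubricEntry[] {
  return lines.map(line => {
    const [domain, check, severity, criterion] = line.split('|');
    if (!criterion || criterion.length > 80) {
      throw new Error(`Invalid rubric entry: ${line}`);
    }
    return { domain, check, severity, criterion } as RubricEntry;
  });
}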

Harness Integration

  • assess_project — Phase 2: run validate/deps/docs in parallel. Failures are Critical and stop pipeline.
  • gather_context — Phase 3: parallel context assembly. Session parameter scopes to session directory.
  • harness cleanup — Optional Phase 2 check for entropy in changed files.
  • Graph queries — Phase 3 (dependency context) and Phase 5 (reachability verification). Graceful fallback.
  • emit_interaction — Post-review: suggest merge transition on APPROVE. Confirmed transition.
  • Rigor levels: --fast / --thorough control learnings, agent tiers, output. See Rigor Levels table.
  • filterByRelevance — Phase 3 learnings scoring. Threshold 0.7, budget 1000 tokens.
  • Session directory: .harness/sessions/<slug>/ contains handoff.json, state.json, artifacts.json (spec/plan paths, reviewed file list). Write handoff to session scope when slug is known. Global .harness/handoff.json is deprecated for session-aware invocations.

Uncertainty Surfacing

When a review subagent encounters ambiguity during analysis, classify it immediately:

  • Blocking: Cannot determine severity without more context (e.g., unclear whether a pattern is intentional or accidental). Surface as a finding with severity suggestion and a rationale explaining the ambiguity. Do not guess.
  • Assumption: Can classify if the assumption is stated (e.g., "assuming this endpoint is internal-only, the missing auth check is acceptable"). State the assumption in the finding. If wrong, the finding severity changes.
  • Deferrable: Ambiguity does not affect the review (e.g., whether a naming choice will cause confusion in the future). Omit from findings — it is noise.

Do not suppress ambiguous findings. An ambiguous finding surfaced as a question is more valuable than a confident finding built on a wrong assumption.

Success Criteria

  • Pipeline runs all 7 phases in order (skipping GATE without --ci)
  • Mechanical failures in Phase 2 stop pipeline before AI review
  • Each Phase 4 subagent receives only domain-scoped context
  • All findings use ReviewFinding schema
  • Mechanical findings excluded from Phase 4 output in Phase 5
  • Cross-agent duplicates merged in Phase 6
  • Text output uses Strengths/Issues/Assessment with Critical/Important/Suggestion
  • --comment posts inline GitHub comments with committable suggestions
  • --deep adds threat modeling to Security agent
  • No code merges with Critical issues or failing harness checks
  • Role C feedback verified before implementation; pushback is evidence-based
  • fast: learnings skipped, all agents at fast tier
  • thorough: learnings always loaded/scored, meta-judge validates, "Learnings Applied" in output
  • standard: learnings included if file exists, scored at 0.7 threshold
  • If zero learnings pass the threshold, zero are included (no unscored fallback)

Examples

Example: Pipeline Review of a New API Endpoint

Phase 1 (GATE): Skipped -- manual invocation.

Phase 2 (MECHANICAL): harness validate passes. harness check-deps passes. Security scan clean. tsc --noEmit passes. Lint passes.

Phase 3 (CONTEXT): Change type: feature (prefix feat:). Bundles:

  • Compliance: CLAUDE.md + changed files
  • Bug detection: api/routes/users.ts, services/user-service.ts, db/queries.ts
  • Security: api/routes/users.ts (endpoint), services/user-service.ts (data flow)
  • Architecture: import graph showing routes -> services -> db layers

Phase 4 (FAN-OUT): Four agents in parallel:

  • Compliance: 0 findings (spec alignment confirmed)
  • Bug detection: 1 finding (missing duplicate email handling in createUser)
  • Security: 0 findings
  • Architecture: 1 finding (routes/users.ts imports directly from db/queries.ts)

Phase 5 (VALIDATE): No mechanical exclusions. Architecture finding validated by check-deps showing layer violation.

Phase 6 (DEDUP+MERGE): No overlaps -- 2 distinct findings in different files.

Phase 7 (OUTPUT):

Strengths:

  • Clean separation between route handler and service logic
  • Input validation using Zod schemas with clear error messages
  • Comprehensive test coverage including error paths

Issues:

Critical:

  • api/routes/users.ts:12-15 -- Direct import from db/queries.ts bypasses service layer. Must route through services/user-service.ts. (domain: architecture, validatedBy: heuristic)

Important:

  • services/user-service.ts:45 -- createUser does not handle duplicate email. Database throws constraint violation surfacing as 500. Should catch and return 409. (domain: bug, validatedBy: heuristic)

Suggestion: (none)

Assessment: Request Changes -- one critical layer violation and one important missing error handler.

Gates

  • Never skip mechanical checks without --no-mechanical. If not run in CI or locally, they must run in Phase 2 before AI review.
  • Never merge with failing harness checks. harness validate and harness check-deps must pass. Always Critical.
  • Never implement feedback without verification. Verify correctness before changing code. Do not blindly comply.
  • Never agree performatively. "Sure, I'll change that" without understanding is forbidden.
  • Never skip the YAGNI check. Every suggestion must serve a current, concrete need. Speculative improvements rejected.
  • Never apply fixes during review. Review output is findings, not code changes. Suggest fixes in finding text; never edit production code. Iron Law violation.

Red Flags

Universal

  • "I believe the codebase does X" -- Stop. Read code, cite file:line. Belief is not evidence.
  • "Let me recommend [pattern]" without checking existing patterns -- Stop. Search codebase first.
  • "While we're here, we should also..." -- Stop. Flag the idea but do not expand scope.

Domain-Specific

  • "The change looks reasonable, approving" -- Stop. Read every changed file. Approval without full review is rubber-stamping.
  • "Let me fix this issue I found" -- Stop. Review identifies; it does not fix. Suggest the fix.
  • "This is a minor style issue" -- Stop. Style or readability/maintainability? Classify accurately.
  • "The author probably meant to..." -- Stop. Do not infer intent. Flag ambiguity as a question.
  • Comment replacing code -- If a diff removes functional code and adds a comment (e.g., // removed, // TODO: re-add, // no longer needed), flag as Critical. Comments are not fixes. The code was either needed (removal is a bug) or not (remove silently). A comment replacing code is technical debt disguised as a change.

Rationalizations to Reject

Universal

These reasoning patterns sound plausible but lead to bad outcomes. Reject them.

  • "It's probably fine" — "Probably" is not evidence. Verify before asserting.
  • "This is best practice" — Best practice in what context? Cite the source and confirm it applies to this codebase.
  • "We can fix it later" — If it is worth flagging, it is worth documenting now with a concrete follow-up plan.

Domain-Specific

| Rationalization | Reality |
|---|---|
| "The tests pass, so the logic must be correct" | Tests can be incomplete. Review the logic independently of test results. |
| "This is how it was done elsewhere in the codebase" | Existing patterns can be wrong. Evaluate the pattern on its merits, not just its precedent. |
| "It's just a refactor, low risk" | Refactors change behavior surfaces. Review them with the same rigor as feature changes. |
| "The fix is trivial, I'll just apply it inline" | Trivial fixes still skip review when applied by the reviewer. Suggest the fix; let the author apply and re-review. Iron Law. |
| "The diff is small so I can approve without reading every file" | Small diffs can contain critical bugs. Read every changed file completely; size does not correlate with risk. A one-line auth bypass is a small diff. |
| "The author is experienced, so I can be less thorough" | Review rigor is based on the code, not the author. Experienced authors make mistakes too. Apply the same checklist regardless of who wrote it. |

Escalation

  • When reviewers disagree: Escalate to the human or tech lead.
  • When review feedback changes the plan: Pause review. Plan must be updated first.
  • When you cannot reproduce a reported issue: Ask reviewer for exact reproduction steps.
  • When review takes more than 2 rounds: Something is fundamentally misaligned. Stop and discuss synchronously.
  • When harness checks fail and you believe the check is wrong: Do not override or skip. File an issue against harness config.
  • When pipeline produces a false positive after validation: Add the pattern to the Noise / False Positives section of .harness/review-learnings.md.