```shell
git clone https://github.com/Intense-Visions/harness-engineering
```

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/claude-code/harness-verification" ~/.claude/skills/intense-visions-harness-engineering-harness-verification-8fa78a && rm -rf "$T"
```
`agents/skills/claude-code/harness-verification/SKILL.md`

# Harness Verification
3-level evidence-based verification. No completion claims without fresh evidence. "Should work" is not evidence.
## When to Use
- After completing any implementation task (before claiming "done")
- After executing a plan or spec (verify all deliverables)
- When validating work done by another agent or in a previous session
- When resuming after a context reset (re-verify before continuing)
- When `on_commit` or `on_pr` triggers fire and verification is needed
- NOT as a replacement for tests (verification checks that tests exist and pass, not that the logic is correct)
- NOT for in-progress work (verify at completion boundaries, not mid-stream)
## Verification Tiers
| Tier | Skill | When | What |
|---|---|---|---|
| Quick gate | harness-execution (built-in) | After every task | test + lint + typecheck + build + harness validate |
| Deep audit | harness-verification (this) | Milestones, PRs, on-demand | EXISTS -> SUBSTANTIVE -> WIRED |
Use deep audit for milestone boundaries, before PRs, or when the quick gate passes but something feels wrong. Do NOT invoke after every individual task.
## Process

### Iron Law
No completion claim may be made without fresh verification evidence collected in THIS session.
Cached results, remembered outcomes, and "it worked last time" are not evidence. Run the checks. Read the output. Report what you observed.
The words "should", "probably", "seems to", and "I believe" are forbidden in verification reports. Replace with "verified: [evidence]" or "not verified: [what is missing]."
### Argument Resolution
When invoked by autopilot (or with explicit arguments), resolve paths before starting:
- Session slug: If a `session-slug` argument is provided, set `{sessionDir} = .harness/sessions/<session-slug>/`. Pass the slug to `gather_context({ session: "<session-slug>" })`. All handoff writes go to `{sessionDir}/handoff.json`.
- No arguments (standalone invocation): the session slug is unknown -- omit it from `gather_context` and fall back to the global `.harness/` paths.
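The resolution rule above can be sketched in shell. This is an illustrative sketch only: the real skill receives the slug as an argument, and the `resolve_session_dir` helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper mirroring the argument-resolution rule:
# a known slug scopes paths to .harness/sessions/<slug>/,
# an unknown slug falls back to the global .harness/ directory.
resolve_session_dir() {
  slug="$1"
  if [ -n "$slug" ]; then
    echo ".harness/sessions/$slug"
  else
    echo ".harness"   # standalone invocation: global fallback
  fi
}

resolve_session_dir "2024-06-01-auth-fix"   # -> .harness/sessions/2024-06-01-auth-fix
resolve_session_dir ""                      # -> .harness
```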
### Uncertainty Surfacing
When you encounter an unknown during verification, classify it immediately:
- Blocking: Cannot determine pass/fail without resolving this (e.g., spec does not define expected behavior for a scenario, cannot run tests due to missing dependency). STOP and escalate.
- Assumption: Can verify if assumption is stated (e.g., "this module is internal-only so WIRED check against external consumers is not applicable"). Document the assumption in the report. If wrong, verification must be re-run.
- Deferrable: Does not affect current verification (e.g., whether additional test coverage would be beneficial). Note in report as an observation.
Do not mark PASS with unstated assumptions. An assumption-laden PASS is a false positive.
Review-never-fixes: Verification identifies gaps. Verification never fills them. If you find a stub, missing test, or unwired artifact, record it as a FAIL with evidence. Do not implement the fix — that is the executor's job. A verifier who fixes is no longer verifying independently.
### Context Loading
Before running verification levels, load session context:
```
gather_context({
  path: "<project-root>",
  intent: "Verify phase deliverables",
  skill: "harness-verification",
  session: "<session-slug-if-provided>",
  include: ["state", "learnings", "validation"]
})
```
If a session slug is known, include the `session` parameter to scope reads to `.harness/sessions/<slug>/`. Otherwise omit it to fall back to `.harness/`. Use the returned learnings to check for known failures relevant to the artifacts being verified.
### Level 1: EXISTS -- The Artifact Is Present
For every artifact that was supposed to be created or modified:
- Check the file exists on disk. Use `ls`, `stat`, or read it. Do not assume it exists because you wrote it -- writes can fail silently.
- Check it has content. An empty file is not an artifact.
- Check it is in the right location. Compare actual path against spec/plan.
- Record the result:
```
[EXISTS: PASS] path/to/file.ts (247 lines)
[EXISTS: FAIL] path/to/missing-file.ts -- file not found
```
Do not proceed to Level 2 until all Level 1 checks pass. Missing files must be created first.
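A Level 1 check can be sketched in shell. The `check_exists` helper and the file paths are illustrative, not part of the harness API.

```shell
#!/bin/sh
# Sketch of an EXISTS check: present on disk, non-empty, recorded in
# the [EXISTS: ...] format defined above.
check_exists() {
  f="$1"
  if [ ! -f "$f" ]; then
    echo "[EXISTS: FAIL] $f -- file not found"
  elif [ ! -s "$f" ]; then
    echo "[EXISTS: FAIL] $f -- file is empty (an empty file is not an artifact)"
  else
    echo "[EXISTS: PASS] $f ($(wc -l < "$f" | tr -d ' ') lines)"
  fi
}

printf 'real content\n' > /tmp/present.ts
: > /tmp/empty.ts
check_exists /tmp/present.ts
check_exists /tmp/empty.ts
check_exists /tmp/missing.ts
```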
### Level 2: SUBSTANTIVE -- Not a Stub
For every artifact that passed Level 1:
- Read the file content thoroughly -- do not skim.
- Scan for stub anti-patterns: `TODO`/`FIXME` comments, `throw new Error('not implemented')`, `() => {}`/`{}`, `return null`/`undefined` as sole logic, `pass` (Python), `mock`/`placeholder`/`stub` in non-test code, functions with only a descriptive comment, interfaces defined but never implemented.
- Verify real implementation exists. A function returning only a hardcoded value is a stub unless that is correct behavior.
- Check completeness against spec. If spec says "handles errors X, Y, Z," verify all three, not just X.
- Record the result:
```
[SUBSTANTIVE: PASS] path/to/file.ts -- real implementation, no stubs
[SUBSTANTIVE: FAIL] path/to/file.ts -- TODO on line 34, empty handler on line 67
```
Do not proceed to Level 3 until all Level 2 checks pass. Stubs must be replaced with real implementations first.
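A minimal marker scan for a single file can be sketched with `grep`. This is a first pass only -- it catches textual stub markers, not stubs that read as real code, so it supplements (never replaces) the thorough read required above. The pattern list is illustrative; tune it per language.

```shell
#!/bin/sh
# Sketch of a per-file stub-marker scan recorded in the
# [SUBSTANTIVE: ...] format. grep exits 0 when it finds matches.
scan_stubs() {
  f="$1"
  if grep -nE 'TODO|FIXME|not implemented|NotImplementedError' "$f"; then
    echo "[SUBSTANTIVE: FAIL] $f -- stub markers found above"
  else
    echo "[SUBSTANTIVE: PASS] $f -- no stub markers"
  fi
}

cat > /tmp/stubbed.ts <<'EOF'
export function save() {
  // TODO: implement persistence
}
EOF
scan_stubs /tmp/stubbed.ts
```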
### Level 3: WIRED -- Connected to the System
For every artifact that passed Level 2:
- Verify imported/required by at least one other file (unless it is an entry point).
- Verify called/used. An unused import is dead code. Trace usage: functions called, components rendered, types used in signatures, config loaded, tests executed.
- Verify tested. At least one test exercises the artifact's behavior: the test file exists, imports the artifact, makes behavior assertions, and actually runs (not `.skip`/`xit`).
- Run the tests. Execute the suite now -- do not trust "they passed earlier."
- Run harness checks. Execute `harness validate` and verify integration with project constraints.
- Record the result:
```
[WIRED: PASS] path/to/file.ts -- imported by 3 files, tested in file.test.ts (4 tests, all pass)
[WIRED: FAIL] path/to/file.ts -- exported but not imported by any other file
```
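The import-chain part of a WIRED check can be sketched with a recursive grep. The `check_wired` helper, the directory layout, and the TypeScript import pattern are all assumptions for illustration; real projects should use their module resolver or bundler instead.

```shell
#!/bin/sh
# Sketch: a module is "wired" only if some OTHER file imports it.
check_wired() {
  module="$1"; src="$2"
  # Find files importing the module, excluding the module's own file.
  importers=$(grep -rl "from ['\"].*$module['\"]" "$src" | grep -v "$module" || true)
  if [ -n "$importers" ]; then
    echo "[WIRED: PASS] $module -- imported by: $importers"
  else
    echo "[WIRED: FAIL] $module -- exported but not imported by any other file"
  fi
}

mkdir -p /tmp/wired-demo
printf 'export const createUser = () => ({ id: 1 });\n' > /tmp/wired-demo/user-service.ts
printf 'import { createUser } from "./user-service";\n' > /tmp/wired-demo/routes.ts
check_wired user-service /tmp/wired-demo
```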
### Anti-Pattern Scan
Run across all changed files as a final check:
```
Markers: TODO, FIXME, XXX, HACK, PLACEHOLDER, NOT_IMPLEMENTED
Code: () => {}, return null (sole body), pass, raise NotImplementedError
Tests: .skip, xit, xdescribe, @pytest.mark.skip, pending
```
Any match is a verification failure. Fix it or document why it is acceptable (e.g., "TODO tracked in issue #123, out of scope").
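One way to run this scan is a recursive grep over the changed files. A sketch, assuming plain text patterns; extend the marker list and file globs per project, and note that any exit status 0 (matches found) here means verification failure.

```shell
#!/bin/sh
# Sketch of the final anti-pattern scan: markers, stub code, and
# skipped tests in one grep over a directory of changed files.
scan_antipatterns() {
  grep -rnE \
    -e 'TODO|FIXME|XXX|HACK|PLACEHOLDER|NOT_IMPLEMENTED' \
    -e '\.skip\(|xit\(|xdescribe\(|@pytest\.mark\.skip|NotImplementedError' \
    "$1"
}

mkdir -p /tmp/scan-demo
printf 'it.skip("cover later", () => {});\n' > /tmp/scan-demo/a.test.ts
if scan_antipatterns /tmp/scan-demo; then
  echo "verification FAIL: anti-patterns found (fix or document why acceptable)"
else
  echo "scan clean"
fi
```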
### Gap Identification
After all three levels, produce a structured gap report:
```
## Verification Report

### Level 1: EXISTS
- [PASS] path/to/file-a.ts (120 lines)
- [FAIL] path/to/file-c.ts -- not found

### Level 2: SUBSTANTIVE
- [PASS] path/to/file-a.ts -- real implementation
- [FAIL] path/to/file-b.ts -- TODO on line 22

### Level 3: WIRED
- [PASS] path/to/file-a.ts -- imported, tested, harness passes
- [NOT CHECKED] path/to/file-b.ts -- blocked by Level 2 failure

### Anti-Pattern Scan
- path/to/file-b.ts:22 -- TODO: implement validation

### Gaps
1. path/to/file-c.ts must be created
2. path/to/file-b.ts:22 must be implemented (not stub)

### Verdict: INCOMPLETE -- 2 gaps must be resolved
```
Use severity markers in the report: **[CRITICAL]** for blocking issues, **[IMPORTANT]** for non-blocking concerns.
### Verification Sign-Off
After producing the report, request acceptance:
```
emit_interaction({
  path: "<project-root>",
  type: "confirmation",
  confirmation: {
    text: "Verification report: <VERDICT>. Accept and proceed?",
    context: "<N artifacts checked, N gaps found>",
    impact: "Accepting proceeds to code review. Declining requires gap resolution.",
    risk: "<low if PASS, high if gaps remain>"
  }
})
```
### Handoff and Transition
Write handoff to the session-scoped path when session slug is known, otherwise fall back to global:
- Session-scoped (preferred): `.harness/sessions/<session-slug>/handoff.json`
- Global (fallback, deprecated): `.harness/handoff.json`

[DEPRECATED] Writing to `.harness/handoff.json` is deprecated. In autopilot sessions, always write to `.harness/sessions/<slug>/handoff.json` to prevent cross-session contamination.
```json
{
  "fromSkill": "harness-verification",
  "phase": "COMPLETE",
  "summary": "<verdict summary>",
  "artifacts": ["<verified file paths>"],
  "verdict": "pass | fail",
  "gaps": ["<gap descriptions if any>"]
}
```
Session summary (if session known): Update via `writeSessionSummary` with `skill`, `status` (`Verification <PASS|FAIL|INCOMPLETE>. <N> artifacts, <N> gaps.`), `keyContext`, and `nextStep`.
If verdict is PASS: Call `emit_interaction`:
```json
{
  "type": "transition",
  "transition": {
    "completedPhase": "verification",
    "suggestedNext": "review",
    "reason": "Verification passed at all 3 levels",
    "artifacts": ["<verified file paths>"],
    "requiresConfirmation": false,
    "summary": "Verification passed: <N> artifacts. EXISTS, SUBSTANTIVE, WIRED all passed.",
    "qualityGate": {
      "checks": [
        { "name": "level1-exists", "passed": true },
        { "name": "level2-substantive", "passed": true },
        { "name": "level3-wired", "passed": true },
        { "name": "anti-pattern-scan", "passed": true },
        { "name": "harness-validate", "passed": true }
      ],
      "allPassed": true
    }
  }
}
```
Immediately invoke harness-code-review without waiting for user input.
If verdict is FAIL or INCOMPLETE: Do NOT emit a transition. Surface gaps for resolution. Handoff records gaps for future reference.
### Regression Test Verification
When verifying a bug fix:
1. Write the regression test reproducing the bug
2. Run the test -- it must PASS (proving the fix works)
3. Revert the fix temporarily (`git stash` or comment it out)
4. Run the test -- it must FAIL (proving the test catches the bug)
5. Restore the fix (`git stash pop` or uncomment)
6. Run the test -- it must PASS again (proving the fix is the reason)
If step 4 passes (test does not fail without fix), the test is invalid. Rewrite it.
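The full cycle can be sketched end-to-end in a throwaway repo. Here `run_test` is a stand-in for the project's real test runner (just a grep for the fixed line), and the file contents are invented for the demo.

```shell
#!/bin/sh
# Sketch of the revert cycle: commit the buggy version, apply the fix
# uncommitted so `git stash` can revert it, then run pass/fail/pass.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email verifier@example.com
git config user.name verifier
printf 'def add(a, b):\n    return a - b  # bug\n' > calc.py
git add calc.py
git commit -qm "buggy version"

printf 'def add(a, b):\n    return a + b\n' > calc.py   # the fix (uncommitted)
run_test() { grep -q 'return a + b' calc.py; }          # stand-in test runner

run_test && echo "step 2: PASS with fix applied"
git stash -q                                            # revert the fix
if run_test; then
  echo "INVALID TEST: passes without the fix -- rewrite it"
else
  echo "step 4: FAIL without fix"
fi
git stash pop -q                                        # restore the fix
run_test && echo "step 6: PASS with fix restored"
```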
## Evidence Requirements
Every pass/fail assertion in the verification report MUST include concrete evidence. "Should", "probably", "seems to" are forbidden by the Iron Law -- this defines HOW to cite.
Every verification claim MUST use one of:
- File reference: `file:line` with observed content (e.g., `src/services/user-service.ts:42` -- "create method validates email")
- Test output: Actual command and complete output
- Harness output: Full `harness validate` and `harness check-deps` output
- Anti-pattern scan output: Actual search command and results
- Import chain evidence: Actual import statements found for the WIRED level
- Session evidence: Write to the `evidence` session section per level via `manage_state`
When to cite: At every level. L1 cites file reads. L2 cites specific line content. L3 cites imports, test output, and harness output. Each `[PASS]`/`[FAIL]` marker must be accompanied by its evidence.
Uncited claims: Any verification assertion without direct evidence is itself a verification failure. This skill does not use `[UNVERIFIED]` -- if evidence cannot be produced, the verdict is FAIL or INCOMPLETE.
## Rubric Compression
Verification checklists passed to subagents or used internally MUST use a compressed single-line format. Each check is one line with pipe-delimited fields:

```
level|check-name|pass-criterion
```
Example (Level 2 SUBSTANTIVE rubric):
```
L2|no-stubs|No TODO/FIXME/throw-not-implemented in production code
L2|no-empty-bodies|No empty function bodies, () => {}, or return null as sole logic
L2|spec-complete|All behaviors specified in spec have corresponding implementation
L2|real-logic|Functions contain meaningful logic, not just hardcoded returns
```
Why: Verbose checklist prose inflates verification context without improving accuracy. Dense single-line rubrics give the same signal in fewer tokens, leaving more budget for reading and analyzing actual file content.
Rules:
- Level prefix must be L1 (EXISTS), L2 (SUBSTANTIVE), or L3 (WIRED)
- Maximum 80 characters per criterion text
- Rubric entries are guidance — the verification levels define the authoritative checks
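The rules above lend themselves to a mechanical lint. A sketch with `awk` -- the `validate_rubric` helper and file paths are invented for illustration:

```shell
#!/bin/sh
# Sketch of a rubric-format lint: three pipe-delimited fields, level
# prefix L1/L2/L3, criterion text at most 80 characters.
validate_rubric() {
  awk -F'|' '
    $1 !~ /^L[123]$/  { print "bad level: "  $0; bad = 1 }
    NF != 3           { print "bad fields: " $0; bad = 1 }
    length($3) > 80   { print "too long: "   $0; bad = 1 }
    END { exit bad }
  ' "$1"
}

cat > /tmp/rubric.txt <<'EOF'
L2|no-stubs|No TODO/FIXME/throw-not-implemented in production code
L9|bogus|Level prefix must be L1, L2, or L3
EOF
validate_rubric /tmp/rubric.txt || echo "rubric invalid"
```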
## Red Flags
| Flag | Corrective Action |
|---|---|
| "Tests passed earlier, so I just need to check the files exist" | STOP. Iron Law: fresh evidence in THIS session. "Earlier" is cached — run the checks now. |
| "The implementation looks substantive at a glance" | STOP. Level 2 requires thorough reading, not glancing. Stubs designed to look real (e.g., functions with only a log statement) are the whole reason L2 exists. |
| "The artifact is exported so it must be wired" | STOP. Export without import is dead code. Trace the actual usage chain: import, call, test, pass. "Must be" is not evidence. |
| `TODO` or `FIXME` found in production code | STOP. These are Level 2 failures. Do not proceed to Level 3. Do not fix them yourself -- record as FAIL and report. |
## Non-Determinism Tolerance
Mechanical checks (tests, lint, types) are binary pass/fail. No tolerance.
For behavioral verification (convention adherence, style guides), accept threshold-based results: 4/5 runs = pass, 2/5 runs = fail. If a convention fails >40% of the time, the convention needs rewriting -- blame the instruction, not the executor.
## Rationalizations to Reject
| Rationalization | Reality |
|---|---|
| "Tests passed earlier, no need to re-run" | Iron Law forbids cached results. All evidence must be fresh in THIS session. |
| "File exists and has code, skip thorough read for Level 2" | Level 2 requires thorough reading. Scanning for TODO, throw Error, empty functions catches stubs that look real. |
| "Artifact is imported by a test file, so passes Level 3" | Import is necessary but not sufficient. Test must assert on behavior and not be skipped. |
| "Verification report probably looks fine from memory" | "Should", "probably", "seems to", "I believe" are forbidden. Replace with "verified: [evidence]" or "not verified: [missing]." |
| "I found a stub so I'll quickly implement it to make verification pass" | Verification identifies gaps — verification never fills them. Record the stub as a FAIL. The executor fixes it. A verifier who implements is no longer independent. |
| "The spec only mentions 3 behaviors but I'll verify 5 to be thorough" | Verify what the spec requires, not what you think it should require. Extra verification against unstated requirements conflates verification with spec review. |
## Examples

### Example: Verifying a New Service Module
Task: "Create UserService with CRUD operations."
```
## Verification Report

### Level 1: EXISTS
- [PASS] src/services/user-service.ts (189 lines)
- [PASS] src/services/user-service.test.ts (245 lines)
- [PASS] src/services/index.ts (updated -- exports UserService)

### Level 2: SUBSTANTIVE
- [PASS] user-service.ts -- 4 CRUD methods with validation, error handling, DB calls
- [PASS] user-service.test.ts -- 12 tests: happy paths, errors, edge cases (none skipped)

### Level 3: WIRED
- [PASS] user-service.ts -- imported by src/api/routes/users.ts, 12 tests pass
- [PASS] harness validate -- passes
- [PASS] harness check-deps -- no boundary violations

### Anti-Pattern Scan
- No matches

### Verdict: COMPLETE -- all artifacts verified at all levels
```
## Gates
- No completion without evidence. Do not say "done" or "complete" without a verification report showing PASS at all 3 levels for all deliverables.
- No stale evidence. Evidence must be from the current session. "I checked earlier" is not evidence.
- No forbidden language. "Should work," "probably fine," "seems correct" are not verification statements. Use observed evidence or state "not verified."
- No skipping levels. Level 1 before Level 2. Level 2 before Level 3. Each depends on the previous.
- No satisfaction before evidence. Feeling done is not being done. Evidence is being done.
## Success Criteria
- All deliverables verified at 3 levels (EXISTS, SUBSTANTIVE, WIRED)
- No anti-patterns remain in production code
- Report uses structured format with severity markers
- All evidence is fresh (gathered this session, not assumed)
- No forbidden hedging language in findings
- Gaps explicitly listed with severity
- Regression revert check confirms clean rollback
## Escalation
- Artifact cannot pass WIRED because the connected system does not exist yet: Document the gap explicitly. State what integration is missing. Do not mark PASS.
- Anti-pattern scan finds intentional TODOs: Each must have a tracked issue number. "TODO: implement" without issue ref is unacceptable. "TODO(#123): add rate limiting" is acceptable.
- Tests pass but may not test real behavior: If tests only check "does not throw" or assert mock return values without real behavior checks, flag as SUBSTANTIVE failure.
- Verification reveals spec is incomplete: Do not fill gaps yourself. Escalate: "Spec does not define behavior for [scenario]. How should this be handled?"
- Cannot run harness checks: If `harness validate` or `harness check-deps` cannot run, this is blocking. Do not skip -- fix the tooling or escalate.
## Harness Integration
- `gather_context` -- Load session-scoped state, learnings, and validation before Level 1. The `session` parameter scopes reads to the session directory.
- `harness validate` -- Run during Level 3 (WIRED) to verify artifact integration.
- `harness check-deps` -- Run during Level 3 (WIRED) to verify dependency boundaries.
- `harness check-docs` -- Verify documentation is updated for new artifacts. Missing docs for public APIs is a gap.
- Test runner -- Must be run fresh (not cached) during Level 3. Read actual output, check exit codes.
- `emit_interaction` -- Auto-transition to harness-code-review on a PASS verdict only.
- Session directory -- `.harness/sessions/<slug>/` contains `handoff.json`, `state.json`, and `artifacts.json` (spec path, plan path, file lists from execution). Do not write to the global `.harness/handoff.json` when the session slug is known.
All commands must be run fresh. Do not rely on results from previous sessions or runs.
After verification, append a tagged learning:

```
YYYY-MM-DD [skill:harness-verification] [outcome:pass/fail]: Verified [feature]. [Brief note.]
```
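Appending a learning in that format can be sketched as a one-line helper. The `append_learning` name and the `/tmp/learnings.md` path are assumptions; use whatever learnings file the harness designates.

```shell
#!/bin/sh
# Sketch: append a dated, tagged learning line in the format above.
append_learning() {
  outcome="$1"; note="$2"
  echo "$(date +%F) [skill:harness-verification] [outcome:$outcome]: $note" >> /tmp/learnings.md
}

append_learning pass "Verified UserService CRUD. All 3 levels passed, 0 gaps."
tail -n 1 /tmp/learnings.md
```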