Harness-engineering harness-verification
<!-- Generated by harness generate-slash-commands. Do not edit. -->
git clone https://github.com/Intense-Visions/harness-engineering
T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/commands/codex/harness/harness-verification" ~/.claude/skills/intense-visions-harness-engineering-harness-verification && rm -rf "$T"
agents/commands/codex/harness/harness-verification/SKILL.mdHarness Verification
3-level evidence-based verification. No completion claims without fresh evidence. "Should work" is not evidence.
When to Use
- After completing any implementation task (before claiming "done")
- After executing a plan or spec (verify all deliverables)
- When validating work done by another agent or in a previous session
- When resuming work after a context reset (re-verify before continuing)
- When
oron_commit
triggers fire and verification is neededon_pr - NOT as a replacement for tests (verification checks that tests exist and pass, not that logic is correct)
- NOT for in-progress work (verify at completion boundaries, not mid-stream)
Verification Tiers
Harness uses a two-tier verification model:
| Tier | Skill | When | What |
|---|---|---|---|
| Quick gate | harness-execution (built-in) | After every task | test + lint + typecheck + build + harness validate |
| Deep audit | harness-verification (this skill) | Milestones, PRs, on-demand | EXISTS → SUBSTANTIVE → WIRED |
Use this skill (deep audit) for milestone boundaries, before creating PRs, or when the quick gate passes but something feels wrong. Do NOT invoke this skill after every individual task — that is what the quick gate handles.
Process
Iron Law
No completion claim may be made without fresh verification evidence collected in THIS session.
Cached results, remembered outcomes, and "it worked last time" are not evidence. Run the checks. Read the output. Report what you observed.
The words "should", "probably", "seems to", and "I believe" are forbidden in verification reports. Replace with "verified: [evidence]" or "not verified: [what is missing]."
Context Loading
Before running verification levels, load session context if a session slug was provided (e.g., by autopilot dispatch):
gather_context({ path: "<project-root>", intent: "Verify phase deliverables", skill: "harness-verification", session: "<session-slug-if-provided>", include: ["state", "learnings", "validation"] })
Session resolution: If a session slug is known (passed via autopilot dispatch or available from a previous handoff), include the
session parameter. This scopes all state reads to .harness/sessions/<slug>/. If no session is known, omit it — gather_context falls back to global files at .harness/.
Use the returned learnings to check for known failures and dead ends relevant to the artifacts being verified.
Level 1: EXISTS — The Artifact Is Present
For every artifact that was supposed to be created or modified:
-
Check that the file exists on disk. Use
,ls
, or read the file. Do not assume it exists because you wrote it — file writes can fail silently.stat -
Check that the file has content. An empty file is not an artifact. Read the file and confirm it has non-trivial content.
-
Check the file is in the right location. Compare the actual path against the spec or plan. A file in the wrong directory is not "present."
-
Record the result. For each expected artifact:
[EXISTS: PASS] path/to/file.ts (247 lines) [EXISTS: FAIL] path/to/missing-file.ts — file not found
Do not proceed to Level 2 until all Level 1 checks pass. Missing files must be created first.
Level 2: SUBSTANTIVE — Not a Stub
For every artifact that passed Level 1:
-
Read the file content. Do not skim — read it thoroughly.
-
Scan for anti-patterns that indicate stub or placeholder implementations:
orTODO
comments (especiallyFIXME
)TODO: implementthrow new Error('not implemented')
(empty arrow functions)() => {}
,return null
,return undefined
as the only logicreturn {}
(Python placeholder)pass
,placeholder
,stub
in non-test codemock- Functions with only a comment describing what they should do
- Interfaces or types defined but never implemented
-
Verify real implementation exists. The file must contain actual logic that performs the described behavior. A function that only returns a hardcoded value is a stub unless that is the correct behavior.
-
Check for completeness against the spec. If the spec says "handles errors X, Y, Z," verify all three are handled, not just X.
-
Record the result. For each artifact:
[SUBSTANTIVE: PASS] path/to/file.ts — real implementation, no stubs [SUBSTANTIVE: FAIL] path/to/file.ts — contains TODO on line 34, empty handler on line 67
Do not proceed to Level 3 until all Level 2 checks pass. Stubs must be replaced with real implementations first.
Level 3: WIRED — Connected to the System
For every artifact that passed Level 2:
-
Verify the artifact is imported/required by at least one other file in the system (unless it is an entry point).
-
Verify the artifact is called/used. An import that is never called is dead code. Trace the usage:
- Functions: called from at least one other function or test
- Components: rendered in at least one parent or route
- Types: used in at least one function signature or variable declaration
- Configuration: loaded and applied by the system
- Tests: executed by the test runner
-
Verify the artifact is tested. There must be at least one test that exercises the artifact's behavior. Check:
- Test file exists
- Test imports or references the artifact
- Test makes assertions about the artifact's behavior
- Test actually runs (not skipped with
or.skip
)xit
-
Run the tests. Execute the test suite and verify tests pass. Do not trust "they passed earlier" — run them now.
-
Run harness checks. Execute
and verify the artifact integrates correctly with the project's constraints.harness validate -
Record the result. For each artifact:
[WIRED: PASS] path/to/file.ts — imported by 3 files, tested in file.test.ts (4 tests, all pass) [WIRED: FAIL] path/to/file.ts — exported but not imported by any other file
Anti-Pattern Scan
Run this scan across all changed files as a final check:
Scan targets: TODO, FIXME, XXX, HACK, PLACEHOLDER, NOT_IMPLEMENTED Code patterns: () => {}, return null (as sole body), pass, raise NotImplementedError Test patterns: .skip, xit, xdescribe, @pytest.mark.skip, pending
Any match is a verification failure. Either fix it or explicitly document why it is acceptable (e.g., "TODO is tracked in issue #123 and out of scope for this task").
Gap Identification
After running all three levels, produce a structured gap report:
## Verification Report ### Level 1: EXISTS - [PASS] path/to/file-a.ts (120 lines) - [PASS] path/to/file-b.ts (85 lines) - [FAIL] path/to/file-c.ts — not found ### Level 2: SUBSTANTIVE - [PASS] path/to/file-a.ts — real implementation - [FAIL] path/to/file-b.ts — TODO on line 22 ### Level 3: WIRED - [PASS] path/to/file-a.ts — imported, tested, harness passes - [NOT CHECKED] path/to/file-b.ts — blocked by Level 2 failure ### Anti-Pattern Scan - path/to/file-b.ts:22 — TODO: implement validation ### Gaps 1. path/to/file-c.ts must be created 2. path/to/file-b.ts:22 must be implemented (not stub) ### Verdict: INCOMPLETE — 2 gaps must be resolved
The verification report uses conventional markdown patterns for structured output:
**[CRITICAL]** path/to/file.ts:22 — TODO: implement validation (anti-pattern) **[IMPORTANT]** path/to/file.ts — exported but not imported by any other file
Verification Sign-Off
After producing the verification report, request acceptance:
emit_interaction({ path: "<project-root>", type: "confirmation", confirmation: { text: "Verification report: <VERDICT>. Accept and proceed?", context: "<summary: N artifacts checked, N gaps found>", impact: "Accepting proceeds to code review. Declining requires gap resolution first.", risk: "<low if PASS, high if gaps remain>" } })
Handoff and Transition
After producing the verification report, write the handoff and conditionally transition:
Write
.harness/handoff.json:
{ "fromSkill": "harness-verification", "phase": "COMPLETE", "summary": "<verdict summary>", "artifacts": ["<verified file paths>"], "verdict": "pass | fail", "gaps": ["<gap descriptions if any>"] }
Write session summary (if session is known). If running within a session context, update the session summary:
writeSessionSummary(projectPath, sessionSlug, { session: "<session-slug>", lastActive: "<ISO timestamp>", skill: "harness-verification", status: "Verification <PASS|FAIL|INCOMPLETE>. <N> artifacts checked, <N> gaps.", spec: "<spec path if known>", keyContext: "<1-2 sentences: verification outcome, any gaps found>", nextStep: "<e.g., Proceed to code review / Resolve gaps>" })
If no session slug is known, skip this step.
If verdict is PASS (all levels passed, no gaps):
Call
emit_interaction:
{ "type": "transition", "transition": { "completedPhase": "verification", "suggestedNext": "review", "reason": "Verification passed at all 3 levels", "artifacts": ["<verified file paths>"], "requiresConfirmation": false, "summary": "Verification passed: <N> artifacts checked. EXISTS, SUBSTANTIVE, WIRED all passed.", "qualityGate": { "checks": [ { "name": "level1-exists", "passed": true, "detail": "<N> artifacts present" }, { "name": "level2-substantive", "passed": true, "detail": "No stubs or placeholders" }, { "name": "level3-wired", "passed": true, "detail": "All artifacts imported, tested, integrated" }, { "name": "anti-pattern-scan", "passed": true, "detail": "No matches" }, { "name": "harness-validate", "passed": true } ], "allPassed": true } } }
The response will include
nextAction: "Invoke harness-code-review skill now".
Immediately invoke harness-code-review without waiting for user input.
If verdict is FAIL or INCOMPLETE:
Do NOT emit a transition. Surface gaps to the user for resolution. The handoff is written with the gaps recorded for future reference.
Regression Test Verification
When verifying a bug fix, apply this extended protocol:
- Write the regression test that reproduces the bug
- Run the test — it must PASS (proving the fix works)
- Revert the fix (temporarily):
or comment out the fixgit stash - Run the test — it must FAIL (proving the test actually catches the bug)
- Restore the fix:
or uncommentgit stash pop - Run the test — it must PASS again (proving the fix is the reason)
If step 4 passes (test does not fail without the fix), the test is not a valid regression test. It does not catch the bug. Rewrite it.
Evidence Requirements
This skill is the primary evidence producer in the workflow. Every pass/fail assertion in the verification report MUST include concrete evidence. The words "should", "probably", and "seems to" are already forbidden by the Iron Law -- this section defines HOW to cite evidence.
Every verification claim MUST use one of:
- File reference:
format with observed content (e.g.,file:line
-- "create method validates email format before insert")src/services/user-service.ts:42 - Test output: Include the actual test command and its complete output:
$ npx vitest run src/services/user-service.test.ts PASS src/services/user-service.test.ts UserService create (4 tests) list (3 tests) expiry (2 tests) Tests: 9 passed, 9 total - Harness output: Include full
andharness validate
outputharness check-deps - Anti-pattern scan output: Include the actual grep/search command and results (or absence of results)
- Import chain evidence: Include the actual import statements found when verifying WIRED level
- Session evidence: Write to the
session section for each verification level:evidencemanage_state({ action: "append_entry", session: "<current-session>", section: "evidence", authorSkill: "harness-verification", content: "[EXISTS:PASS] src/services/user-service.ts (189 lines) -- verified via direct file read" })
When to cite: At every verification level. Level 1 (EXISTS) cites file reads. Level 2 (SUBSTANTIVE) cites specific line content. Level 3 (WIRED) cites import statements, test execution output, and harness check output. The verification report format already requires
[PASS]/[FAIL] markers -- each marker must be accompanied by the evidence that produced it.
Uncited claims: ANY verification assertion without direct evidence is a verification failure, not merely an uncited claim. This skill does not use
[UNVERIFIED] -- if evidence cannot be produced, the verdict is FAIL or INCOMPLETE.
Harness Integration
— Run in Level 3 WIRED check. Verifies project-wide health and constraint compliance.harness validate
— Run in Level 3 to verify new artifacts respect dependency boundaries.harness check-deps
— Run to verify documentation is updated for new artifacts. Missing docs for new public APIs is a gap.harness check-docs- Test runner — Must be run fresh (not cached) during Level 3. Read actual output, check exit codes.
All commands must be run fresh in the current session. Do not rely on results from a previous session or a previous run in the same session if code has changed since.
-- Call after verification passes to auto-transition to harness-code-review. Only emitted on PASS verdict. Uses auto-transition (proceeds immediately).emit_interaction
Success Criteria
- Every claimed deliverable has been verified at all 3 levels
- No anti-patterns remain in delivered code
- Verification report uses the structured format with PASS/FAIL per artifact per level
- All verification evidence was collected fresh in the current session
- No forbidden language ("should", "probably", "seems to") appears in the report
- All gaps are explicitly identified with specific remediation steps
- Regression tests (for bug fixes) pass the 5-step revert check
Non-Determinism Tolerance
For mechanical checks (tests pass, lint clean, types check), results are binary — pass or fail. No tolerance.
For behavioral verification (did the agent follow a convention, did the output match a style guide), accept threshold-based results:
- Run the check multiple times if needed
- "Agent followed the constraint in 4/5 runs" = pass
- "Agent followed the constraint in 2/5 runs" = fail — the convention is poorly written, not the agent
If a behavioral convention fails more than 40% of the time, the convention needs rewriting. Blame the instruction, not the executor.
Examples
Example: Verifying a New Service Module
Task: "Create UserService with create, read, update, delete operations."
## Verification Report ### Level 1: EXISTS - [PASS] src/services/user-service.ts (189 lines) - [PASS] src/services/user-service.test.ts (245 lines) - [PASS] src/services/index.ts (updated — exports UserService) ### Level 2: SUBSTANTIVE - [PASS] src/services/user-service.ts — all 4 CRUD methods implemented with validation, error handling, and database calls - [PASS] src/services/user-service.test.ts — 12 tests covering happy paths, error cases, and edge cases (no skipped tests) ### Level 3: WIRED - [PASS] src/services/user-service.ts — imported by src/api/routes/users.ts, tested in user-service.test.ts (12 tests, all pass) - [PASS] harness validate — passes - [PASS] harness check-deps — no boundary violations ### Anti-Pattern Scan - No matches found ### Gaps (none) ### Verdict: COMPLETE — all artifacts verified at all levels
Rationalizations to Reject
| Rationalization | Reality |
|---|---|
| "The tests passed when I ran them earlier, so I do not need to re-run them now" | The Iron Law forbids cached results. All verification evidence must be collected fresh in THIS session. |
| "The file exists and has code in it, so I can skip reading it thoroughly for Level 2" | Level 2 (SUBSTANTIVE) requires reading the file content thoroughly. Scanning for TODO, throw new Error, empty functions, and hardcoded return values catches stubs that look like real implementations. |
| "This artifact is tested by a file that imports it, so it passes Level 3 WIRED" | Being imported is necessary but not sufficient for WIRED. The test must actually make assertions about the artifact's behavior and not be skipped. |
| "The verification report probably looks fine based on what I remember" | The words "should", "probably", "seems to", and "I believe" are forbidden in verification reports. Replace with "verified: [evidence]" or "not verified: [what is missing]." |
Gates
- No completion without evidence. You may not say "done," "complete," "finished," or "implemented" without a verification report showing PASS at all 3 levels for all deliverables.
- No stale evidence. Evidence must be from the current session. "I checked earlier" is not evidence. Run it again.
- No forbidden language. "Should work," "probably fine," "seems correct," and "I believe it works" are not verification statements. Replace with observed evidence or state "not verified."
- No skipping levels. Level 1 before Level 2. Level 2 before Level 3. Each level depends on the previous.
- No satisfaction before evidence. The natural inclination after writing code is to feel done. Resist it. Feeling done is not being done. Evidence is being done.
Escalation
- When an artifact cannot pass Level 3 (WIRED) because the system it connects to does not exist yet: Document the gap explicitly. State what integration is missing and what must be built. Do not mark it as PASS.
- When anti-pattern scan finds TODOs that are intentional: Each must be justified with a tracked issue number. "TODO: implement" with no issue reference is not acceptable. "TODO(#123): add rate limiting after infrastructure is ready" is acceptable.
- When tests pass but you suspect they are not testing real behavior: Read the test assertions carefully. If tests only check "does not throw" or assert on mock return values without verifying real behavior, flag them as SUBSTANTIVE failures.
- When verification reveals the spec itself is incomplete: Do not fill in the gaps yourself. Escalate to the human: "Verification found that the spec does not define behavior for [scenario]. How should this be handled?"
- When you cannot run harness checks: If
orharness validate
cannot be run (missing configuration, broken tooling), this is a blocking issue. Do not skip verification — fix the tooling or escalate.harness check-deps
Harness Integration
— Used in Context Loading phase (before Level 1) to load session-scoped state, learnings, and validation in a single call. Thegather_context
parameter scopes reads to the session directory when provided by autopilot dispatch.session
— Run during Level 3 (WIRED) to verify artifact integration.harness validate
— Run during Level 3 (WIRED) to verify dependency boundaries.harness check-deps
After verification completes, append a tagged learning:
- YYYY-MM-DD [skill:harness-verification] [outcome:pass/fail]: Verified [feature]. [Brief note on what was found or confirmed.]