NWave nw-tdd-review-enforcement

Test design mandate enforcement, test budget validation, 5-phase TDD validation, and external validity checks for the software crafter reviewer

install
source · Clone the upstream repo
git clone https://github.com/nWave-ai/nWave
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/nWave-ai/nWave "$T" && mkdir -p ~/.claude/skills && cp -r "$T/nWave/skills/nw-tdd-review-enforcement" ~/.claude/skills/nwave-ai-nwave-nw-tdd-review-enforcement && rm -rf "$T"
manifest: nWave/skills/nw-tdd-review-enforcement/SKILL.md
source content

TDD Review Enforcement

Domain knowledge for reviewing TDD implementations against 5 test design mandates, test budget, 5-phase validation, and external validity.


5 Test Design Mandates

Mandate 1: Observable Behavioral Outcomes

All assertions validate observable outcomes, never internal structure.

Observable: return values from driving ports | state changes via queries | side effects at driven port boundaries | exceptions from public API | business invariants

Violations: asserting private fields | verifying internal method call order | inspecting intermediate calculations | checking internal class instantiation

Severity: Blocker. Rewrite to assert observable outcomes only.
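As a sketch of the distinction, assuming a hypothetical OrderService driving port (all names here are illustrative, not from the skill):

```python
# Hypothetical driving port; all names are invented for illustration.
class OrderService:
    def __init__(self):
        self._orders = {}  # internal structure: never assert on this

    def place_order(self, order_id: str, amount: int) -> dict:
        if amount <= 0:
            raise ValueError("amount must be positive")
        self._orders[order_id] = amount
        return {"order_id": order_id, "status": "CONFIRMED"}

    def get_status(self, order_id: str) -> str:  # query exposing a state change
        return "CONFIRMED" if order_id in self._orders else "UNKNOWN"


def test_placing_order_confirms_it():
    # Compliant: asserts only on the return value and a queryable state change.
    service = OrderService()
    result = service.place_order("o-1", 100)
    assert result["status"] == "CONFIRMED"
    assert service.get_status("o-1") == "CONFIRMED"

# Violation, for contrast (would be flagged as a Blocker):
#   assert service._orders["o-1"] == 100   # inspects private structure
```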

Mandate 2: No Domain Layer Unit Tests

Zero unit tests of domain entities/value objects/services directly. Test indirectly through application service (driving port) tests.

Violations: imports domain entity (Order, Customer) | instantiates value object (Money, Email) | invokes domain service method

Exception: complex standalone algorithm with stable public interface (rare). Severity: Blocker. Delete domain tests, add application service test.
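One way to picture the rewrite, assuming a hypothetical Money value object behind a PaymentService driving port (names invented for illustration):

```python
# Hypothetical layering: Money is a domain value object, PaymentService the
# application service (driving port). All names are invented for illustration.
class Money:
    def __init__(self, amount: int):
        if amount < 0:
            raise ValueError("negative amount")
        self.amount = amount


class PaymentService:  # driving port
    def charge(self, amount: int) -> dict:
        money = Money(amount)  # domain invariant enforced here
        return {"charged": money.amount, "status": "OK"}


# Violation (deleted): a unit test instantiating Money(-1) directly.
# Compliant: the same invariant is reached through the driving port.
def test_charge_rejects_negative_amount():
    try:
        PaymentService().charge(-1)
        assert False, "expected ValueError"
    except ValueError:
        pass
```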

Mandate 3: Test Through Driving Ports

All unit tests enter through driving ports (application services, controllers, CLI handlers, event handlers). Never internal classes.

Detection: grep for internal imports (from domain.entity, from internal.validator). Severity: Blocker. Rewrite through driving port.
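The grep detection can be sketched as a small Python scan; the forbidden-prefix list here is an assumption to be adapted to the project's actual package layout:

```python
import re

# Import prefixes indicating a test bypasses the driving port. This list is
# an assumption; adapt it to the project's package layout.
FORBIDDEN_PREFIXES = ("domain.", "internal.")

IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([\w.]+)", re.MULTILINE)


def find_internal_imports(test_source: str) -> list:
    """Return modules imported by a test file that match a forbidden prefix."""
    return [m for m in IMPORT_RE.findall(test_source)
            if m.startswith(FORBIDDEN_PREFIXES)]
```

Any non-empty result is a Blocker for the file in question.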

Mandate 4: Integration Tests for Adapters

Adapters have integration tests with real infrastructure (testcontainers, test servers). No mocked unit tests.

Violations: mocking IDbConnection | mocking SMTP client | stubs instead of real infrastructure
Acceptable: in-memory implementations if behavior-complete. Severity: Blocker. Convert to integration test.

Mandate 5: Parametrized Input Variations

Input variations of same behavior use parametrized tests, not duplicates.

Violations: test_valid_email_1/test_valid_email_2 | copy-pasted tests with only inputs changed. Severity: High. Consolidate into @pytest.mark.parametrize.
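A consolidation sketch, assuming a hypothetical is_valid_email check behind the driving port (the validator and its rule are invented for illustration):

```python
import pytest  # assumed available, as in the examples above


# Hypothetical validator standing in for whatever the driving port exposes.
def is_valid_email(value: str) -> bool:
    return "@" in value and "." in value.split("@")[-1]


# Instead of test_valid_email_1, test_valid_email_2, ... one parametrized test:
@pytest.mark.parametrize("email", [
    "a@example.com",
    "first.last@example.co.uk",
    "user+tag@example.org",
])
def test_valid_email_variations(email):
    assert is_valid_email(email)


@pytest.mark.parametrize("email", ["no-at-sign", "user@nodot"])
def test_invalid_email_variations(email):
    assert not is_valid_email(email)
```

Against the test budget below, each parametrized method counts as one test, however many cases it expands to.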


Test Budget Validation

Formula:

max_unit_tests = 2 x number_of_distinct_behaviors

A behavior = ONE observable outcome from a driving port action. Edge cases of same behavior = ONE (use parametrized).

Counting Rules

One behavior: happy path for one operation | error handling for one error type | validation for one rule
Not a behavior: testing internal class | same behavior different inputs | testing getters/setters | testing framework code

Enforcement Steps

  1. Count distinct behaviors in AC
  2. Calculate budget = 2 x count
  3. Count actual test methods (parametrized cases don't add)
  4. Pass: actual <= budget. Fail: actual > budget (Blocker)
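The enforcement steps above reduce to a small check, sketched here:

```python
def check_test_budget(behavior_count: int, actual_test_methods: int) -> dict:
    """Apply the budget formula: max_unit_tests = 2 x distinct behaviors.

    Callers pass the number of test *methods* -- parametrized cases
    do not add to the count.
    """
    budget = 2 * behavior_count
    return {
        "budget": budget,
        "actual": actual_test_methods,
        "verdict": "PASS" if actual_test_methods <= budget else "FAIL (Blocker)",
    }
```

For the example finding below (3 behaviors, 14 tests) this yields a budget of 6 and a FAIL verdict.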

Example Finding

TEST BUDGET VALIDATION: FAILED

Acceptance Criteria Analysis:
- "User can register with valid email" = 1 behavior
- "Invalid email format rejected" = 1 behavior
- "Duplicate email rejected" = 1 behavior

Budget: 3 behaviors x 2 = 6 unit tests maximum
Actual: 14 unit tests

Violations:
1. Budget exceeded: 14 > 6 (Blocker)
2. Internal class testing: test_user_validator.py tests UserValidator directly (Blocker)
3. Parameterization missing: 5 separate tests for valid email variations (High)

Required: delete internal tests, consolidate via parametrize, re-submit

5-Phase TDD Validation

Verify all 5 phases in execution-log.json: PREPARE, RED_ACCEPTANCE, RED_UNIT, GREEN, COMMIT.

Phase Checks

  • Completeness: all 5 present (Blocker if missing) | Outcomes: all PASS (Blocker if FAIL)
  • Sequential execution: correct order by timestamps | Test discipline: 100% green after GREEN and at COMMIT
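A sketch of the completeness, outcome, and ordering checks; the entry schema assumed here ({"phase": ..., "outcome": ...}) is illustrative, and the real execution-log.json may differ:

```python
REQUIRED_PHASES = ["PREPARE", "RED_ACCEPTANCE", "RED_UNIT", "GREEN", "COMMIT"]


def validate_phases(log_entries: list) -> list:
    """Check completeness, outcomes, and ordering of an execution log."""
    findings = []
    phases = [e["phase"] for e in log_entries]
    for p in REQUIRED_PHASES:
        if p not in phases:
            findings.append(f"Blocker: missing phase {p}")
    for e in log_entries:
        outcome = str(e.get("outcome", ""))
        # Walking skeleton override: NOT_APPLICABLE phases are allowed.
        if outcome != "PASS" and not outcome.startswith("NOT_APPLICABLE"):
            findings.append(f"Blocker: phase {e['phase']} outcome {outcome}")
    # Ordering: phases must appear in the canonical sequence.
    known = [p for p in phases if p in REQUIRED_PHASES]
    if known != sorted(known, key=REQUIRED_PHASES.index):
        findings.append("Blocker: phases out of order")
    return findings
```

An empty findings list means the phase checks pass; any entry is a rejection.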

Quality Gates

Gate | Description | Phase
G1 | Exactly one acceptance test active | PREPARE
G2 | Acceptance test fails for valid reason | RED_ACCEPTANCE
G3 | Unit test fails on assertion | RED_UNIT
G4 | No mocks inside hexagon | RED_UNIT
G5 | Business language in tests | GREEN
G6 | All tests green | GREEN
G7 | 100% passing before commit | COMMIT
G8 | Test count within budget | RED_UNIT
G9 | No test modifications to accommodate implementation | GREEN

Gates G2, G4, G7, G8, G9 are Blockers if not verified.

Note: Review/refactoring quality verified at deliver-level Phase 4 (Adversarial Review).

Walking Skeleton Override

When is_walking_skeleton: true: don't flag missing unit tests | verify exactly one E2E test | thinnest slice OK (hardcoded values) | RED_UNIT and GREEN may be SKIPPED with "NOT_APPLICABLE: walking skeleton"


External Validity Check

Verify features are invocable through entry points, not just existing in code.

Question: "If I follow these steps, will the feature WORK or just EXIST?"

Criteria

  1. Acceptance tests import entry point modules, not internals (Blocker)
  2. At least one test invokes through user-facing entry point (High)
  3. Component wired into system entry point (Blocker)
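Criterion 1 can be approximated by mapping each acceptance test to the modules it imports and requiring at least one hit on the entry point; the mapping shape is assumed for illustration:

```python
def check_external_validity(test_imports: dict, entry_point: str) -> list:
    """test_imports maps acceptance-test file -> list of modules it imports.

    Flags the suite if no test imports the entry-point module.
    """
    findings = []
    if not any(entry_point in mods for mods in test_imports.values()):
        findings.append(
            f"Blocker: no acceptance test imports entry point {entry_point}")
    return findings
```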

Example Finding

EXTERNAL VALIDITY CHECK: FAILED

Issue: All 6 acceptance tests import des.validator.TemplateValidator directly.
No test imports des.orchestrator.DESOrchestrator (the entry point).

Consequence: Tests pass, coverage is 100%, but TemplateValidator is never
called in production because DESOrchestrator doesn't use it.

Required: update acceptance test to invoke through entry point, wire component.

Test Modification Detection (ALWAYS BLOCKER)

The single worst TDD violation: modifying a test to make it pass instead of fixing the implementation. This inverts the TDD feedback loop -- the test no longer protects behavior. Instant rejection, no exceptions, no conditional approval.

Detection Signals

Signal | How to Detect | Severity
Test + implementation changed in same commit | Git diff shows test file edits alongside production code edits during GREEN phase | BLOCKER
Assertion weakened | assertEquals(expected, actual) changed to assertNotNull(actual) or assertTrue(result) | BLOCKER
Expectations reduced | Test previously checked 5 fields, now checks 1-2 | BLOCKER
Test deleted or skipped | @skip, @pytest.mark.skip, @Disabled, xfail, or entire test method removed | BLOCKER
Deferred fix comments | # TODO: fix later, # temporarily relaxed, # workaround, # adjusted for now in test files | BLOCKER
Assertion count decreased | Previous commit had N assertions, current has fewer for same test | BLOCKER

Review Procedure

  1. Compare test files at start of RED phase vs end of GREEN phase
  2. If any test file was modified during GREEN: flag for detailed inspection
  3. Check each modification against the signals table above
  4. If modification is purely additive (new assertions, new test methods): PASS
  5. If modification weakens, removes, or relaxes any existing assertion: BLOCKER -- reject immediately

Legitimate Test Changes (Not Violations)

  • Renaming test methods for clarity (no assertion changes)
  • Adding new assertions to existing tests (strengthening, not weakening)
  • Fixing a genuine test bug identified and approved by product owner (requires ESCALATION_NEEDED marker in execution log)
  • Parametrization refactoring that preserves all original assertions

Example Finding

TEST MODIFICATION DETECTION: BLOCKER

File: tests/unit/test_order_service.py
Commit: abc123 (GREEN phase)

Before (RED phase):
  assert result.total == Decimal("150.00")
  assert result.tax == Decimal("15.00")
  assert result.items == 3
  assert result.status == OrderStatus.CONFIRMED

After (GREEN phase):
  assert result is not None  # <-- weakened from 4 specific assertions to existence check

Verdict: REJECTED. Implementation could not satisfy the original assertions.
The crafter modified the test instead of fixing the implementation.
Required: revert test to RED-phase version, fix implementation to satisfy original assertions.

Fixture Theater Detection (ALWAYS BLOCKER)

Acceptance tests pass because test fixtures implement the expected behavior directly, rather than exercising production code through the driving port. The tests verify the correct outcome from the wrong source. This is a form of Testing Theater where the entire GREEN phase is fraudulent -- no production code was changed.

Detection Signals

Signal | How to Detect | Severity
No production files in git diff | git diff --name-only after GREEN shows only test files, none of the files_to_modify entries | BLOCKER
Given steps create end-state | Test Given/Arrange steps construct the expected output directly instead of setting up preconditions for production code | BLOCKER
Fixture implements behavior | Test helper/fixture contains domain logic that should live in production code | BLOCKER
RED-to-GREEN without production changes | Acceptance test flips from failing to passing but git diff --stat shows zero production file changes | BLOCKER

Review Procedure

  1. After GREEN phase, run git diff --name-only and compare against files_to_modify from the roadmap step
  2. Every file listed in files_to_modify MUST appear in the diff (excluding test files). If any production file is missing: BLOCKER
  3. Apply the deletion test: "If I revert ALL changes to test files and fixtures, does the acceptance test still pass with ONLY the production code changes?" If yes: production code is doing the work (PASS). If the test cannot pass without fixture changes: BLOCKER
  4. Inspect test Given/Arrange sections for domain logic that belongs in production code
Legitimate Exceptions (Not Violations)

  • Step explicitly tagged as test-only (e.g., files_to_modify lists only test files)
  • Documentation-only steps where no production code is expected
  • Hash update steps where the production change is a constant update in a test file

Example Finding

FIXTURE THEATER DETECTION: BLOCKER

Step: 02-01 (implement gitignore support in DES plugin)
Files to modify: [src/des/adapters/.../des_plugin.py]

git diff --name-only after GREEN:
  tests/des/acceptance/test_plugin_gitignore.py  (fixture modified)
  tests/conftest.py                               (helper added)

Missing from diff: src/des/adapters/.../des_plugin.py

Verdict: REJECTED. Agent modified test fixtures to produce expected state
instead of implementing production code in des_plugin.py.
The acceptance test passes because the fixture creates the expected output,
not because the driving port (DESPlugin.install()) produces it.
Required: revert fixture changes, implement production code in des_plugin.py.

Escalation Verification

When a crafter gets stuck, the correct action is to escalate -- not to silently weaken tests. The reviewer verifies proper escalation protocol was followed.

What to Check

  1. ESCALATION_NEEDED markers: execution-log.json should contain escalation_needed: true with a reason if the crafter hit a wall
  2. Three-attempt rule: evidence of at least 3 distinct implementation attempts before any test change (check GREEN phase attempts in execution log)
  3. Product owner approval: any requirement-driven test change must reference explicit PO approval (e.g., po_approved: true or requirement_change: {ticket} in execution log)

Escalation Failures

Failure | Detection | Severity
Silent test modification | No escalation marker + test weakened | BLOCKER
Insufficient attempts | Fewer than 3 GREEN attempts before test change | BLOCKER
Missing PO approval | Test changed for "requirement change" without PO reference | BLOCKER
Proper escalation | ESCALATION_NEEDED marker present, 3+ attempts logged | PASS (reviewer verifies test change validity)
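The three checks can be sketched as a single verdict function; escalation_needed, po_approved, and requirement_change follow the log fields named above, while green_attempts is an assumed name for the GREEN-phase attempt count:

```python
def verify_escalation(log: dict) -> str:
    """Map the escalation-failure table to a verdict.

    green_attempts is an assumed field name; the other fields follow the
    execution-log markers described above.
    """
    if not log.get("escalation_needed"):
        return "BLOCKER: silent test modification (no escalation marker)"
    if log.get("green_attempts", 0) < 3:
        return "BLOCKER: fewer than 3 implementation attempts before test change"
    if not (log.get("po_approved") or log.get("requirement_change")):
        return "BLOCKER: no product owner approval referenced"
    return "PASS: reviewer verifies test change validity"
```

A PASS verdict is not approval of the test change itself, only confirmation that protocol was followed.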

Approval Decision Logic

Approved

All 5 phases present, all PASS, all gates satisfied (G1-G9), zero defects, budget met, no internal class tests, no test modifications, no testing theater.

Rejected

Missing phases | any FAIL | any defect | budget exceeded | internal class tested | test modified to accommodate implementation (G9) | testing theater detected | silent test modification without escalation. Zero tolerance.

Escalation

Two failed review iterations | persistent gate failures | unresolved architectural violations | crafter properly escalated (ESCALATION_NEEDED marker present with 3+ attempts). Escalate to tech lead.