Skillshub eval-harness
Formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles
```bash
# Clone the full repo
git clone https://github.com/ComeOnOliver/skillshub

# Or install only this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/affaan-m/everything-claude-code/eval-harness" ~/.claude/skills/comeonoliver-skillshub-eval-harness && rm -rf "$T"
```
`skills/affaan-m/everything-claude-code/eval-harness/SKILL.md`

Eval Harness Skill
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
When to Activate
- Setting up eval-driven development (EDD) for AI-assisted workflows
- Defining pass/fail criteria for Claude Code task completion
- Measuring agent reliability with pass@k metrics
- Creating regression test suites for prompt or agent changes
- Benchmarking agent performance across model versions
Philosophy
Eval-Driven Development treats evals as the "unit tests of AI development":
- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement
Eval Types
Capability Evals
Test whether Claude can do something it couldn't do before:
```
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
```
Regression Evals
Ensure changes don't break existing functionality:
```
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```
Grader Types
1. Code-Based Grader
Deterministic checks using code:
```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
2. Model-Based Grader
Use Claude to evaluate open-ended outputs:
```
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```
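For automation, a model grader can be invoked non-interactively through the Claude Code CLI's print mode. A minimal sketch, assuming the rubric above is saved at `graders/model-grader.txt` (a hypothetical path):

```bash
#!/usr/bin/env bash
# Model-grader sketch: ask Claude to score the most recent change.
# Assumes the Claude Code CLI is on PATH and graders/model-grader.txt
# (hypothetical path) contains the [MODEL GRADER PROMPT] rubric above.
set -euo pipefail

rubric=$(cat graders/model-grader.txt)
change=$(git diff HEAD~1)   # the code change under evaluation

# -p runs a single non-interactive prompt and prints the response.
claude -p "${rubric}

Code change to evaluate:
${change}"
```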
3. Human Grader
Flag for manual review:
```
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```
Metrics
pass@k
"At least one success in k attempts"
- pass@1: First attempt success rate
- pass@3: Success within 3 attempts
- Typical target: pass@3 > 90%
pass^k
"All k trials succeed"
- Higher bar for reliability
- pass^3: 3 consecutive successes
- Use for critical paths
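Both metrics can be measured with a simple retry loop. A minimal sketch, assuming a `./run_eval.sh` script (hypothetical) that runs one attempt and exits 0 on PASS:

```bash
#!/usr/bin/env bash
# pass@k / pass^k measurement sketch for a single eval.
# Assumes ./run_eval.sh (hypothetical) runs one attempt; exit 0 = PASS.
k=3
passes=0

for attempt in $(seq 1 "$k"); do
  if ./run_eval.sh; then
    passes=$((passes + 1))
  fi
done

# pass@k: at least one of the k attempts succeeded
[ "$passes" -ge 1 ] && echo "pass@$k: PASS" || echo "pass@$k: FAIL"

# pass^k: all k attempts succeeded
[ "$passes" -eq "$k" ] && echo "pass^$k: PASS" || echo "pass^$k: FAIL"
```

Across a suite, the pass@3 > 90% target means at least 90% of eval tasks pass within 3 attempts, not that a single task passes 90% of the time.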
Eval Workflow
1. Define (Before Coding)
```
## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```
2. Implement
Write code to pass the defined evals.
3. Evaluate
```
# Run capability evals
[Run each capability eval, record PASS/FAIL]

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
```
4. Report
```
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user: PASS (pass@1)
  validate-email: PASS (pass@2)
  hash-password: PASS (pass@1)
  Overall: 3/3 passed

Regression Evals:
  login-flow: PASS
  session-mgmt: PASS
  logout-flow: PASS
  Overall: 3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
```
Integration Patterns
Pre-Implementation
`/eval define feature-name`
Creates an eval definition file at `.claude/evals/feature-name.md`.
During Implementation
`/eval check feature-name`
Runs the current evals and reports their status.
Post-Implementation
`/eval report feature-name`
Generates the full eval report.
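These slash commands are not built into Claude Code; one way to back them (an assumption of this doc, not a documented part of the skill) is with project-level custom commands, which Claude Code reads from `.claude/commands/*.md`:

```bash
# Sketch: register an eval-check custom command for Claude Code.
# .claude/commands/*.md files become slash commands; $ARGUMENTS is replaced
# with whatever follows the command name. The derived command name may
# differ from the shorthand above (e.g. /eval-check rather than /eval check).
mkdir -p .claude/commands
cat > .claude/commands/eval-check.md <<'EOF'
Run the evals defined in .claude/evals/$ARGUMENTS.md.
Execute each capability and regression check, record PASS/FAIL,
and append the results to .claude/evals/$ARGUMENTS.log.
EOF
```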
Eval Storage
Store evals in project:
```
.claude/
  evals/
    feature-xyz.md    # Eval definition
    feature-xyz.log   # Eval run history
    baseline.json     # Regression baselines
```
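A sketch of maintaining the run history and checking it against the baseline, assuming one `timestamp PASS|FAIL` line per run and a flat `{"feature-xyz": "PASS"}` baseline schema (both assumptions, not defined by the skill):

```bash
# Append one run record to the eval's history log.
feature="feature-xyz"
result="PASS"   # hypothetical: the outcome of the run that just finished
mkdir -p .claude/evals
printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$result" \
  >> ".claude/evals/${feature}.log"

# Flag a regression against the stored baseline.
baseline=$(jq -r --arg f "$feature" '.[$f]' .claude/evals/baseline.json)
if [ "$baseline" = "PASS" ] && [ "$result" != "PASS" ]; then
  echo "REGRESSION: ${feature} passed at baseline but fails now"
fi
```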
Best Practices
- Define evals BEFORE coding - Forces clear thinking about success criteria
- Run evals frequently - Catch regressions early
- Track pass@k over time - Monitor reliability trends
- Use code graders when possible - Deterministic > probabilistic
- Human review for security - Never fully automate security checks
- Keep evals fast - Slow evals don't get run
- Version evals with code - Evals are first-class artifacts
Example: Adding Authentication
```
## EVAL: add-authentication

### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session

Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible

### Phase 2: Implement (varies)
[Write code]

### Phase 3: Evaluate
Run: /eval check add-authentication

### Phase 4: Report
EVAL REPORT: add-authentication
===============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```
Product Evals (v1.8)
Use product evals when behavior quality cannot be captured by unit tests alone.
Grader Types
- Code grader (deterministic assertions)
- Rule grader (regex/schema constraints)
- Model grader (LLM-as-judge rubric)
- Human grader (manual adjudication for ambiguous outputs)
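Rule graders sit between code and model graders: deterministic constraints on the shape of an output rather than its full behavior. A sketch, assuming the agent's response was captured to `output.json` (hypothetical):

```bash
# Rule-grader sketch: regex and schema-style constraints on captured output.
# Assumes output.json (hypothetical) holds the agent's structured response.

# Regex constraint: the response must reference a ticket ID like ABC-123.
grep -Eq '[A-Z]+-[0-9]+' output.json && echo "regex: PASS" || echo "regex: FAIL"

# Schema constraint: required keys exist and "status" is a string.
jq -e 'has("status") and has("body") and (.status | type == "string")' \
  output.json >/dev/null && echo "schema: PASS" || echo "schema: FAIL"
```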
pass@k Guidance
- pass@1: direct reliability
- pass@3: practical reliability under controlled retries
- pass^3: stability test (all 3 runs must pass)
Recommended thresholds:
- Capability evals: pass@3 >= 0.90
- Regression evals: pass^3 = 1.00 for release-critical paths
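These thresholds can be enforced as a release gate over the recorded run history; a sketch, reusing the hypothetical `timestamp PASS|FAIL` log format from the storage section:

```bash
# Release-gate sketch: enforce a 0.90 pass rate over recorded runs.
# Assumes each log line is "timestamp PASS|FAIL" (hypothetical format).
# A true pass@3 estimate would group the log into 3-attempt batches per task.
log=".claude/evals/feature-xyz.log"

total=$(wc -l < "$log")
passed=$(grep -c ' PASS$' "$log")

rate=$(awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.2f", p / t }')
echo "pass rate: $rate ($passed/$total)"

awk -v r="$rate" 'BEGIN { exit (r >= 0.90 ? 0 : 1) }' \
  && echo "GATE: PASS" \
  || { echo "GATE: FAIL"; exit 1; }
```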
Eval Anti-Patterns
- Overfitting prompts to known eval examples
- Measuring only happy-path outputs
- Ignoring cost and latency drift while chasing pass rates
- Allowing flaky graders in release gates
Minimal Eval Artifact Layout
- `.claude/evals/<feature>.md`: definition
- `.claude/evals/<feature>.log`: run history
- `docs/releases/<version>/eval-summary.md`: release snapshot