Vibeguard eval-harness

Evaluation-driven development: quantify code generation quality with pass@k / pass^k metrics, scored automatically by Graders.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/vibeguard

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/vibeguard "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/eval-harness" ~/.claude/skills/majiayu000-vibeguard-eval-harness && rm -rf "$T"

manifest: skills/eval-harness/SKILL.md

source content

Eval Harness

Overview

Evaluation-driven development: go beyond asking "does the code run" and quantify "how good is the code".

Core metrics

pass@k (single-task success rate)

Generate k candidate solutions; pass@k is the probability that at least one of them passes
Used to evaluate completion quality on a single task
Target: pass@1 > 80%
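
The text above does not say how pass@k is computed from sampled candidates. A minimal sketch, assuming the commonly used unbiased estimator: generate n ≥ k candidates, count the c that pass, and compute 1 − C(n−c, k) / C(n, k). The function name `pass_at_k` is illustrative and not part of VibeGuard.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled candidates, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset contains no passing candidate. math.comb returns 0
    # when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 candidates, 8 passing: pass@1 is the plain pass rate.
print(pass_at_k(10, 8, 1))  # 0.8
```

With k = 1 the estimator reduces to c / n, which is the quantity the pass@1 > 80% target refers to.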

pass^k (consecutive success rate)

The probability that k consecutive tasks all pass in one go
Used to evaluate overall workflow reliability
Target: pass^5 > 50% (all 5 consecutive tasks pass in one go)
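
To see how the two targets relate, suppose each task passes independently with the same first-attempt probability p. That is an assumption for illustration only, since real tasks are often correlated. Then pass^k = p^k. The sketch below also measures pass^k empirically as the fraction of windows of k consecutive tasks in which every task passed. Both function names are illustrative.

```python
def pass_hat_k_independent(p: float, k: int) -> float:
    """pass^k assuming independent tasks with equal pass rate p."""
    return p ** k

def pass_hat_k_empirical(outcomes: list[bool], k: int) -> float:
    """Fraction of length-k windows of consecutive tasks that all passed."""
    if k < 1 or len(outcomes) < k:
        raise ValueError("need k >= 1 and at least k recorded outcomes")
    windows = [outcomes[i:i + k] for i in range(len(outcomes) - k + 1)]
    return sum(all(w) for w in windows) / len(windows)

# A pass rate of exactly 80% misses the pass^5 target under independence.
print(round(pass_hat_k_independent(0.80, 5), 3))  # 0.328
print(round(pass_hat_k_independent(0.90, 5), 3))  # 0.59
```

Under this independence assumption, pass^5 > 50% needs a per-task pass rate above roughly 87%, so the pass^5 target is the stricter of the two.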

Grader types

Code-based Grader (deterministic)

Grader	What it checks	Pass condition
Compilation check	Whether the code compiles and type checks pass	Zero errors
Test check	Whether all tests pass	All green
Lint check	Whether the code style conforms to the specification	Zero warnings (or only allowed warnings)
Coverage check	Whether test coverage meets the threshold	≥ 80%
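
A deterministic grader of this kind can be sketched as a wrapper around a command. The sketch assumes that exit code 0 means pass. The `GradeResult` type, the `run_command_grader` function, and the demo commands are all hypothetical placeholders, not VibeGuard APIs. A coverage check would additionally need to parse a percentage from the tool's output, which is not shown here.

```python
import subprocess
import sys
from dataclasses import dataclass

@dataclass
class GradeResult:
    name: str
    passed: bool
    output: str

def run_command_grader(name: str, command: list[str],
                       timeout: float = 300.0) -> GradeResult:
    """Run one check command; treat exit code 0 as a pass (assumption)."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing executable or a hung check counts as a failure.
        return GradeResult(name, False, str(exc))
    return GradeResult(name, proc.returncode == 0, proc.stdout + proc.stderr)

# Placeholder commands: substitute the project's real compiler,
# test runner, and linter invocations.
checks = {
    "compile": [sys.executable, "-c", "compile('x = 1', '<demo>', 'exec')"],
    "tests": [sys.executable, "-c", "assert 1 + 1 == 2"],
}
results = [run_command_grader(name, cmd) for name, cmd in checks.items()]
print(all(r.passed for r in results))  # True
```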

Model-based Grader (probabilistic)

Grader	What it checks	Scoring
Code review	Code quality, readability, security	0-10 points
Requirements matching	Whether the implementation meets the requirements	Match score from 0 to 1
Architecture evaluation	Whether the design is sound	0-10 points
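
Because model-based scores are probabilistic, one way to stabilize them is to query the judge several times and average. This is a design assumption, not something the text above prescribes. The sketch normalizes any of the scales in the table to the range 0 to 1. The `fake_code_review` function is a stand-in for a real model call, and every name here is hypothetical.

```python
import random
from statistics import mean
from typing import Callable

def averaged_score(judge: Callable[[str], float], code: str,
                   max_score: float, samples: int = 5) -> float:
    """Call a probabilistic judge several times; return the mean in [0, 1]."""
    scores = [judge(code) for _ in range(samples)]
    for s in scores:
        if not 0 <= s <= max_score:
            raise ValueError(f"score {s} outside 0..{max_score}")
    return mean(scores) / max_score

# Stand-in for a model call (hypothetical): a noisy 0-10 review score
# that always falls between 6.0 and 8.0.
rng = random.Random(0)

def fake_code_review(code: str) -> float:
    return min(10.0, max(0.0, 7.0 + rng.uniform(-1.0, 1.0)))

score = averaged_score(fake_code_review, "def add(a, b): return a + b",
                       max_score=10)
print(0.6 <= score <= 0.8)  # True
```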

Usage workflow

1. Define evaluation criteria
   - Extract verifiable pass conditions from the requirements
   - Choose the right combination of Graders
2. Run the evaluation
   - Run the Code-based Grader first (fast, deterministic)
   - Supplement with the Model-based Grader (deeper, probabilistic)
3. Analyze the results
   - pass@1 < 80% → unclear requirements or a flawed implementation strategy
   - pass^5 < 50% → a systemic problem in the workflow
4. Improve
   - Adjust strategies based on failure patterns
   - Update Grader rules
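
The run and analysis steps above can be sketched as follows. The thresholds come from the text. Skipping the model-based graders when a deterministic check fails is an assumption, since the text only says the deterministic graders run first. The function names and the callable-based grader interface are hypothetical.

```python
from typing import Callable

PASS_AT_1_TARGET = 0.80   # thresholds taken from the workflow above
PASS_HAT_5_TARGET = 0.50

def evaluate(code_graders: list[Callable[[], bool]],
             model_graders: list[Callable[[], float]]) -> dict:
    """Run deterministic graders first, then model graders if all passed."""
    # all() short-circuits, so graders after the first failure do not run.
    if not all(g() for g in code_graders):
        return {"passed": False, "model_scores": []}
    return {"passed": True, "model_scores": [g() for g in model_graders]}

def diagnose(pass_at_1: float, pass_hat_5: float) -> list[str]:
    """Map metric shortfalls to the failure causes named in the workflow."""
    findings = []
    if pass_at_1 < PASS_AT_1_TARGET:
        findings.append("pass@1 below 80%: unclear requirements or flawed strategy")
    if pass_hat_5 < PASS_HAT_5_TARGET:
        findings.append("pass^5 below 50%: systemic workflow problem")
    return findings

print(evaluate([lambda: True, lambda: False], [lambda: 0.9]))
# {'passed': False, 'model_scores': []}
print(diagnose(0.85, 0.40))
# ['pass^5 below 50%: systemic workflow problem']
```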

VibeGuard Integration

- The Code-based Grader can reuse guard script output (such as `guards/<lang>/check_*.sh`)
- Security Grader reference: `vibeguard/rules/security.md`
- Quality Grader reference: `vibeguard/rules/universal.md`
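
Reusing the guard scripts from a code-based grader might look like the sketch below. It assumes each `check_*.sh` script can be run with `bash` and signals failure through a non-zero exit code. Neither point is confirmed by the text above, so check the scripts' actual output format before relying on this. The function name `run_guards` is hypothetical.

```python
import subprocess
from pathlib import Path

def run_guards(repo_root: str, lang: str) -> dict[str, bool]:
    """Run each guards/<lang>/check_*.sh script; True means it exited with 0."""
    results = {}
    for script in sorted(Path(repo_root, "guards", lang).glob("check_*.sh")):
        # Assumption: exit code 0 means the guard found no violations.
        proc = subprocess.run(["bash", str(script)],
                              capture_output=True, text=True)
        results[script.name] = proc.returncode == 0
    return results
```

For example, `run_guards("/path/to/vibeguard", "python")` would return a mapping from script file name to pass/fail, which can feed directly into the pass@k bookkeeping.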