Vibeguard eval-harness
Evaluation-driven development: quantify code generation quality with pass@k / pass^k metrics, scored automatically by graders.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/vibeguard
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/vibeguard "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/eval-harness" ~/.claude/skills/majiayu000-vibeguard-eval-harness && rm -rf "$T"
manifest:
skills/eval-harness/SKILL.md
Eval Harness
Overview
Evaluation-driven development: go beyond "does the code run" and quantify "how good is the code".
Core metrics
pass@k (success within k attempts)
- The probability that at least one of k generated candidate solutions passes
- Used to evaluate completion quality on a single task
- Target: pass@1 > 80%
pass^k (consecutive success rate)
- The probability that all k consecutive tasks pass in one go
- Used to evaluate overall workflow reliability
- Target: pass^5 > 50% (all 5 consecutive tasks pass in one go)
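The two metrics can be sketched in a few lines of Python. The skill text does not say how it estimates them, so this is only one plausible reading: `pass_at_k` uses the common combinatorial estimator over n sampled candidates of which c passed, and `pass_pow_k` assumes the k tasks succeed independently with the same per-task rate p. Both function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled candidates, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random set of k
    # candidates contains no passing one; pass@k is the complement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(p: float, k: int) -> float:
    """Probability that k tasks in a row all pass.

    Assumption: tasks are independent and share the same success rate p.
    """
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("require 0 <= p <= 1 and k >= 1")
    return p ** k


if __name__ == "__main__":
    print(pass_at_k(n=10, c=8, k=1))  # 8 of 10 candidates passed
    print(pass_pow_k(p=0.9, k=5))     # five tasks in a row at a 90% rate
```

Under the independence assumption, a per-task rate of exactly 80% gives pass^5 of only about 33%, so the two targets are separate bars: reaching pass^5 > 50% needs a per-task rate of roughly 87%.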
Grader types
Code-based graders (deterministic)
| Grader | What it checks | Pass condition |
|---|---|---|
| Compilation check | Whether the code compiles and type-checks | Zero errors |
| Test check | Whether the test suite passes | All tests green |
| Lint check | Whether the code style conforms to the specification | Zero warnings (or only allowed warnings) |
| Coverage check | Whether test coverage reaches the threshold | ≥ 80% |
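The pass conditions in the table are simple enough to express as a pure function. The sketch below assumes the build results have already been collected into a small record; `BuildReport`, its field names, and `grade_code` are hypothetical and not part of VibeGuard.

```python
from dataclasses import dataclass


@dataclass
class BuildReport:
    """Hypothetical summary of one candidate's build results."""
    compile_errors: int
    tests_failed: int
    lint_warnings: int        # warnings that are NOT on the allow-list
    coverage_percent: float


def grade_code(report: BuildReport, min_coverage: float = 80.0) -> dict[str, bool]:
    """Apply the four deterministic pass conditions from the table."""
    return {
        "compilation": report.compile_errors == 0,
        "tests": report.tests_failed == 0,
        "lint": report.lint_warnings == 0,
        "coverage": report.coverage_percent >= min_coverage,
    }


if __name__ == "__main__":
    verdicts = grade_code(BuildReport(0, 0, 0, 85.0))
    print(verdicts, "overall pass:", all(verdicts.values()))
```

Keeping each verdict separate, rather than returning a single boolean, makes it easier to see which check failed when analyzing failure patterns later.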
Model-based graders (probabilistic)
| Grader | What it checks | How it is scored |
|---|---|---|
| Code review | Code quality, readability, security | Score from 0 to 10 |
| Requirements matching | Whether the implementation meets the requirements | Match score from 0 to 1 |
| Architecture evaluation | Whether the design is reasonable | Score from 0 to 10 |
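Because the three model-based graders use different scales, it can help to validate the scores and put them on a common 0 to 1 scale before comparing runs. The skill text does not describe how the scores are combined, so this sketch only normalizes them; the function and key names are hypothetical.

```python
def normalize_model_scores(
    code_review: float, requirements_match: float, architecture: float
) -> dict[str, float]:
    """Validate each score against its scale and rescale to 0..1."""
    scales = {
        "code_review": (code_review, 10.0),                # scored 0-10
        "requirements_match": (requirements_match, 1.0),  # scored 0-1
        "architecture": (architecture, 10.0),              # scored 0-10
    }
    normalized = {}
    for name, (value, upper) in scales.items():
        if not 0.0 <= value <= upper:
            raise ValueError(f"{name} must be between 0 and {upper}, got {value}")
        normalized[name] = value / upper
    return normalized


if __name__ == "__main__":
    print(normalize_model_scores(code_review=8, requirements_match=0.9, architecture=7))
```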
Usage process
1. Define evaluation criteria
   - Extract verifiable pass conditions from the requirements
   - Choose the right combination of graders
2. Run the evaluation
   - Run code-based graders first (fast, deterministic)
   - Supplement with model-based graders (deeper, probabilistic)
3. Analyze the results
   - pass@1 < 80% → requirements are unclear or the implementation strategy is flawed
   - pass^5 < 50% → the workflow has a systemic problem
4. Improve
   - Adjust the strategy based on failure patterns
   - Update the grader rules
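The analysis step amounts to comparing the two metrics against their targets. A minimal sketch, assuming both metrics are given as fractions between 0 and 1; the function name `diagnose` and the message wording are illustrative only.

```python
def diagnose(pass_at_1: float, pass_pow_5: float) -> list[str]:
    """Map metric values to the interpretations listed in the analysis step."""
    findings = []
    if pass_at_1 < 0.80:
        findings.append(
            "pass@1 below 80%: requirements may be unclear or the "
            "implementation strategy may be flawed"
        )
    if pass_pow_5 < 0.50:
        findings.append(
            "pass^5 below 50%: the workflow may have a systemic problem"
        )
    return findings


if __name__ == "__main__":
    for finding in diagnose(pass_at_1=0.72, pass_pow_5=0.41):
        print(finding)
```

An empty list means both targets were met; the findings are starting points for the improvement step, not root causes.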
VibeGuard Integration
- Code-based graders can reuse guard script output (such as guards/<lang>/check_*.sh)
- Security grader reference: vibeguard/rules/security.md
- Quality grader reference: vibeguard/rules/universal.md
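One way a code-based grader might reuse the guard scripts is to run every check_*.sh in a guard directory and record a verdict per script. The sketch below makes two assumptions the skill text does not confirm: that the scripts run under bash with no arguments, and that exit code 0 means the check passed. `run_guards` is a hypothetical helper, not a VibeGuard API.

```python
import subprocess
from pathlib import Path


def run_guards(guard_dir: str) -> dict[str, bool]:
    """Run each check_*.sh in guard_dir and report pass/fail per script."""
    results = {}
    for script in sorted(Path(guard_dir).glob("check_*.sh")):
        completed = subprocess.run(
            ["bash", str(script)],
            capture_output=True,  # keep guard output out of the grader's stdout
            text=True,
        )
        # Assumption: exit code 0 means the guard found no problems.
        results[script.name] = completed.returncode == 0
    return results


if __name__ == "__main__":
    # The path is an example; substitute the guard directory for your language.
    print(run_guards("guards/python"))
```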