Claude-skill-registry eval-harness-kit
Build and run deterministic evaluation suites for agent workflows (single-turn or agentic). Use when you need reproducible eval runs with manifests, graders, metrics, and JSONL logs for capability or regression tracking.
Install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/eval-harness-kit" ~/.claude/skills/majiayu000-claude-skill-registry-eval-harness-kit && rm -rf "$T"
manifest: `skills/data/eval-harness-kit/SKILL.md`
Eval Harness Kit
Overview
Create eval manifests, run tasks through an agent or command harness, and grade outputs with deterministic checks and optional LLM rubrics. The harness writes trajectories, metrics, and summaries to disk for repeatable analysis.
Quick start
- Copy `templates/eval.manifest.json` and edit tasks.
- Run: `python <CODEX_HOME>/skills/eval-harness-kit/scripts/run_eval.py --manifest <path> --run-id <id>`
- Inspect outputs in `eval_runs/<run-id>/` and the summary JSON.

Replace `<CODEX_HOME>` with your installed skill root (for example, `~/.codex` or `C:\Users\you\.codex`).
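Before the first run, it can help to see what a manifest might look like. The sketch below is illustrative only: the field names (`suite`, `tasks`, `run_cmd`, `graders`) are assumptions for demonstration, not the kit's actual schema; `templates/eval.manifest.json` is the authoritative starting point.

```python
import json

# Illustrative manifest; all field names here are assumptions made for this
# sketch. Copy templates/eval.manifest.json for the real structure.
manifest = {
    "suite": "smoke",
    "tasks": [
        {
            "id": "greet-001",
            "run_cmd": "echo hello",  # command the harness would execute
            "graders": [{"type": "exact", "expected": "hello"}],
        }
    ],
}

# Write it where run_eval.py's --manifest flag can pick it up.
with open("eval.manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Editing tasks then reduces to adding entries to the `tasks` list and rerunning with a fresh `--run-id`.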
Single-turn vs agentic
- Single-turn: `run_cmd` writes a response file; graders check the output.
- Agentic: `run_cmd` invokes your agent harness; graders check the output plus optional transcript files.
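The contrast between the two modes can be sketched as two task entries. Field names, placeholder commands (`my-model`, `my-agent`), and grader shapes below are hypothetical, chosen only to show where the transcript check fits in the agentic case.

```python
# Hypothetical task entries; names and fields are illustrative, not the
# kit's schema. Single-turn grades one response file.
single_turn = {
    "id": "qa-001",
    "mode": "single_turn",
    "run_cmd": "my-model --prompt prompt.txt --out response.txt",
    "graders": [{"type": "regex", "pattern": r"\b42\b"}],
}

# Agentic grades the final output plus an optional transcript file.
agentic = {
    "id": "browse-001",
    "mode": "agentic",
    "run_cmd": "my-agent --task task.json --workdir run_dir",
    "graders": [
        {"type": "exact", "file": "result.txt", "expected": "done"},
        {"type": "regex", "file": "transcript.txt", "pattern": "tool_call"},
    ],
}
```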
LLM rubric graders (optional)
- Use `type: "llm_rubric"` to call an external judge.
- Provide `llm_judge_cmd` in the manifest or `judge_cmd` per task.
- The judge must print JSON: `{"passed": true|false, "score": 0.0-1.0, "details": "..."}`
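The output contract above is the only part the kit specifies; how the judge receives the candidate answer is up to your `judge_cmd`. A minimal judge sketch, with a trivial stand-in rubric:

```python
import json

def judge(answer: str) -> str:
    """Toy rubric: pass if the answer contains a greeting.

    A real judge would call an LLM here; only the printed JSON shape
    ({"passed": ..., "score": ..., "details": ...}) is fixed by the kit.
    """
    passed = "hello" in answer.lower()
    return json.dumps({
        "passed": passed,
        "score": 1.0 if passed else 0.0,
        "details": "contains greeting" if passed else "missing greeting",
    })

print(judge("Hello, world"))
```

Keeping the rubric deterministic where possible (string checks before LLM calls) makes rerun results comparable.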
Core Guidance
- Decide capability vs regression up front; keep regression suites near 100% pass rate.
- Prefer deterministic graders (exact/regex/json) and add LLM rubrics only when needed.
- Keep each trial isolated; write outputs and transcripts to the run directory.
- Log metrics for every trial: latency, exit code, stdout/stderr sizes, output size.
- Use files as the memory boundary; do not paste large outputs into chat.
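The per-trial metrics bullet can be made concrete with a small logging sketch; the record's field names are illustrative, not what `run_eval.py` actually emits.

```python
import json
import subprocess
import sys
import time

# Time one trial command and capture its streams. Field names in the
# record are illustrative assumptions, not the harness's real schema.
start = time.perf_counter()
proc = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    capture_output=True, text=True,
)
record = {
    "trial": "greet-001/0",
    "latency_s": round(time.perf_counter() - start, 3),
    "exit_code": proc.returncode,
    "stdout_bytes": len(proc.stdout.encode()),
    "stderr_bytes": len(proc.stderr.encode()),
}

# One JSON object per line (JSONL) so runs can be streamed and diffed.
with open("metrics.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

Appending rather than overwriting keeps every trial's record in one file per run, which is what makes later comparison cheap.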
Trust / Permissions
- Always: Read local files; write run artifacts under `eval_runs/`.
- Ask: Any networked grader (LLM rubric), running commands that mutate state, or running tools outside the repo.
- Never: Exfiltrating credentials or running destructive commands without explicit user request.
Resources
- `scripts/run_eval.py`: Execute evals from a manifest; writes JSONL results and summaries.
- `scripts/grade_response.py`: Grade a single output against expected data.
- `scripts/compare_runs.py`: Compare two results files and flag regressions.
- `templates/eval.manifest.json`: Example manifest with single-turn and agentic tasks.
- `references/eval-roadmap.md`: Guidance for building and maintaining eval suites.
Validation
- Run the example manifest; confirm `eval_runs/<run-id>/summary.json` exists.
- Use `compare_runs.py` to compare two runs and verify regression detection.
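The regression check that `compare_runs.py` performs can be sketched in a few lines: a task regresses when it passed in the baseline run but fails in the candidate run. The dict-of-booleans shape below is an assumption for illustration, not the script's real results format.

```python
# Toy pass/fail maps standing in for two runs' results files
# (the real files are JSONL; this shape is illustrative only).
baseline = {"greet-001": True, "math-002": True}
candidate = {"greet-001": True, "math-002": False}

# A regression is a baseline pass that the candidate no longer passes.
regressions = [
    task for task, ok in baseline.items()
    if ok and not candidate.get(task, False)
]
print(regressions)
```

Because regression suites should sit near 100% pass, any non-empty list here is a signal worth investigating before shipping.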