Claude-skill-registry eval-recipes-runner
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/eval-recipes-runner" ~/.claude/skills/majiayu000-claude-skill-registry-eval-recipes-runner && rm -rf "$T"
manifest: skills/data/eval-recipes-runner/SKILL.md
eval-recipes Runner Skill
Purpose
Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.
When to Use
- User asks to "test with eval-recipes"
- User says "run the evals" or "benchmark this change"
- User wants to validate improvements against codex/claude_code
- Testing a PR branch to prove it improves scores
Capabilities
I can run eval-recipes benchmarks to:
- Test specific amplihack branches
- Compare against baseline agents (codex, claude_code)
- Run specific tasks (linkedin_drafting, email_drafting, etc.)
- Compare before/after scores for PRs
- Generate reports with score improvements
How It Works
Setup (One-Time)
```bash
# Clone eval-recipes from Microsoft
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes

# Copy our agent configs (run from the amplihack repo root)
cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

# Install dependencies
cd ~/eval-recipes
uv sync
```
Running Benchmarks
Test a specific branch:
```bash
# Update install.dockerfile to use the specific branch, then run the benchmark
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
```
Compare before/after:
```bash
# Test baseline (main)
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting

# Test PR branch (edit install.dockerfile to check out the PR branch)
uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting

# Compare scores
```
Available Tasks
Common tasks from eval-recipes:
- `linkedin_drafting`: Create tool for LinkedIn posts (scored 6.5/100 before PR #1443)
- `email_drafting`: Create CLI tool for emails (scored 26/100 before)
- `arxiv_paper_summarizer`: Research tool
- `github_docs_extractor`: Documentation tool
- Many more in `~/eval-recipes/data/tasks/` (see the listing command below)
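To see the full set of task names, a minimal sketch (this assumes each task lives in its own directory under `data/tasks/` and that the directory names match the `--task` values, which is an assumption rather than documented behavior):

```bash
# List the bundled tasks; directory names are assumed to match --task values
ls ~/eval-recipes/data/tasks/
```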
Typical Workflow
When user says "test this change with eval-recipes":
- Identify the branch/PR to test
- Update agent config to use that branch:
```dockerfile
# In .claude/agents/eval-recipes/amplihack/install.dockerfile
RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
    cd /tmp/amplihack && \
    git checkout BRANCH_NAME && \
    pip install -e .
```
- Copy to eval-recipes:
```bash
cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
```
- Run the benchmark:
```bash
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3
```
- Report scores and compare with the baseline
Expected Scores
Baseline (main branch):
- Overall: 40.6/100
- LinkedIn: 6.5/100
- Email: 26/100
With PR #1443 (task classification):
- Expected: 55-60/100 (+15-20 points)
- LinkedIn: 30-40/100 (creates actual tool)
- Email: 45/100 (consistent execution)
Example Usage
User says: "Test PR #1443 with eval-recipes on the LinkedIn task"
I do:
- Update install.dockerfile to check out `feat/issue-1435-task-classification`
- Copy to eval-recipes:
```bash
cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
```
- Run:
```bash
cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
```
- Report results: "Score: 35.2/100 (up from 6.5 baseline)"
Prerequisites
- eval-recipes cloned to `~/eval-recipes`
- API key in environment: `export ANTHROPIC_API_KEY=sk-ant-...`
- Docker installed (for containerized runs)
- uv installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`
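Before kicking off a run, a quick preflight sketch can confirm these are in place (a hypothetical helper, not part of eval-recipes):

```bash
# Preflight check for the prerequisites above (hypothetical helper script)
[ -d ~/eval-recipes ]             || echo "missing: clone eval-recipes to ~/eval-recipes"
[ -n "$ANTHROPIC_API_KEY" ]       || echo "missing: export ANTHROPIC_API_KEY"
command -v docker >/dev/null 2>&1 || echo "missing: docker"
command -v uv >/dev/null 2>&1     || echo "missing: uv"
```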
Notes
- Benchmarks take 2-15 minutes per task depending on complexity
- Multiple trials (3-5) give more reliable averages
- Docker builds can be cached for speed
- Results saved to `.benchmark_results/` in the eval-recipes repo
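Since multiple trials give more reliable numbers, a minimal sketch for averaging them (this assumes each `score.txt` under `.benchmark_results/` starts with a single numeric score, which may not match the actual output layout):

```bash
# Average per-trial scores for one agent (score.txt layout is an assumption)
cat .benchmark_results/*/amplihack/*/score.txt \
  | awk '{ sum += $1; n++ } END { if (n) printf "mean over %d trials: %.1f\n", n, sum / n }'
```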
Automation
For fully autonomous testing:
```bash
# Test suite for a PR
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
for task in $tasks; do
  uv run eval_recipes/main.py --agent amplihack --task $task --trials 3
done

# Compare results
cat .benchmark_results/*/amplihack/*/score.txt
```
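To turn those raw scores into the before/after report described above, one possible sketch (the agent names `amplihack` and `amplihack_pr1443` follow the earlier comparison example, and the `.benchmark_results/` path layout is an assumption):

```bash
# Side-by-side baseline vs PR scores per task (paths and agent names are assumptions)
for task in linkedin_drafting email_drafting arxiv_paper_summarizer; do
  base=$(cat .benchmark_results/*/amplihack/"$task"/score.txt 2>/dev/null)
  pr=$(cat .benchmark_results/*/amplihack_pr1443/"$task"/score.txt 2>/dev/null)
  echo "$task: baseline=${base:-n/a} pr=${pr:-n/a}"
done
```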