Claude-skill-registry swe-bench-lite
Quick-start command to run a SWE-bench Lite evaluation with sensible defaults.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/benchmark-swe-lite" ~/.claude/skills/majiayu000-claude-skill-registry-swe-bench-lite && rm -rf "$T"
manifest:
skills/data/benchmark-swe-lite/SKILL.md
Instructions
This skill provides a streamlined way to run the SWE-bench Lite benchmark with pre-configured defaults.
What This Skill Does
This skill runs a quick SWE-bench Lite evaluation with:
- 5 sample tasks (configurable)
- Verbose output for visibility
- Results saved to results.json
- Report saved to report.md
Prerequisites Check
Before running, verify:
- Docker is running: docker ps
- API key is set: echo $ANTHROPIC_API_KEY
- Config file exists:
  - Check for mcpbr.yaml in the current directory
  - If missing, run mcpbr init to generate it
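A quick pre-flight sketch of these checks, assuming a POSIX shell and that mcpbr.yaml is the config name used in the default command below:

# Pre-flight checks: Docker daemon, API key, and config file
docker ps > /dev/null 2>&1 || echo "Docker is not running"
[ -n "$ANTHROPIC_API_KEY" ] || echo "ANTHROPIC_API_KEY is not set"
[ -f mcpbr.yaml ] || echo "mcpbr.yaml not found - run: mcpbr init"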
Default Command
The default command for SWE-bench Lite:
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results.json -r report.md
Customization Options
Users can customize the run by modifying:
- Sample size: Change -n 5 to any number (or remove it for the full dataset)
- Config file: Change -c mcpbr.yaml to point to a different config
- Verbosity: Use -vv for very verbose output
- Output files: Change results.json and report.md to different paths
Example Variations
Minimal quick test (1 task)
mcpbr run -c mcpbr.yaml -n 1 -v
Full evaluation (all ~300 tasks)
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -v -o results.json
MCP-only (skip baseline)
mcpbr run -c mcpbr.yaml -n 5 -M -v -o results.json
Specific tasks
mcpbr run -c mcpbr.yaml -t astropy__astropy-12907 -t django__django-11099 -v
Expected Runtime & Cost
For 5 tasks with default settings:
- Runtime: 15-30 minutes (depends on task complexity)
- Cost: $2-5 (depends on task complexity and model used)
What to Do If It Fails
- Docker not running: Start Docker Desktop
- API key missing: Set with export ANTHROPIC_API_KEY="sk-ant-..."
- Config missing: Run mcpbr init to generate a default config
- Config invalid: Check that the {workdir} placeholder is in the args array
- MCP server fails: Test the server command independently
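For the first three failure modes, a hedged example of the fixes (the key value is a placeholder; the first line is macOS-specific, so use your platform's equivalent):

open -a Docker                          # start Docker Desktop on macOS
export ANTHROPIC_API_KEY="sk-ant-..."   # replace with your real key
mcpbr init                              # generates a default mcpbr.yaml in the current directory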
After the Run
Once complete, you'll have:
- results.json: Full evaluation data with metrics, token usage, and per-task results
- report.md: Human-readable summary with resolution rates and comparisons
- Console output: Real-time progress and summary table
Review the results to see how your MCP server performed compared to the baseline!
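A quick way to skim both outputs from the terminal; this sketch assumes jq is installed and makes no assumption about the exact key names inside results.json:

jq 'keys' results.json    # top-level fields of the evaluation data
less report.md            # human-readable summary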
Pro Tips
- Start with -n 1 to verify everything works before running larger evaluations
- Use --log-dir logs/ to save detailed per-task logs for debugging
- Compare multiple runs by changing the MCP server config between runs
- Use --baseline-results baseline.json to detect regressions between versions
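Putting several of these tips together in one command (only flags already shown above; baseline.json is assumed to be the results file saved from an earlier run):

mcpbr run -c mcpbr.yaml -n 1 -v --log-dir logs/ --baseline-results baseline.json -o results.json -r report.md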