Claude-skill-registry swe-bench-lite

Quick-start command to run a SWE-bench Lite evaluation with sensible defaults.

Install

Source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/benchmark-swe-lite" ~/.claude/skills/majiayu000-claude-skill-registry-swe-bench-lite && rm -rf "$T"

Manifest: skills/data/benchmark-swe-lite/SKILL.md
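
To confirm the install, list the copied skill directory (the path comes from the install command above):

ls ~/.claude/skills/majiayu000-claude-skill-registry-swe-bench-lite/SKILL.md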
Source content

Instructions

This skill provides a streamlined way to run the SWE-bench Lite benchmark with pre-configured defaults.

What This Skill Does

This skill runs a quick SWE-bench Lite evaluation with:

  • 5 sample tasks (configurable)
  • Verbose output for visibility
  • Results saved to results.json
  • Report saved to report.md

Prerequisites Check

Before running, verify:

  1. Docker is running:

    docker ps
    
  2. API key is set:

    echo $ANTHROPIC_API_KEY
    
  3. Config file exists:

    • Check for mcpbr.yaml in the current directory
    • If missing, run mcpbr init to generate it
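
The three checks above can be combined into a small preflight snippet. This is a minimal sketch that uses only the commands already listed; adjust the config path if yours lives elsewhere:

docker ps > /dev/null 2>&1 || echo "Docker is not running"
[ -n "$ANTHROPIC_API_KEY" ] || echo "ANTHROPIC_API_KEY is not set"
[ -f mcpbr.yaml ] || echo "mcpbr.yaml not found - run 'mcpbr init'"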

Default Command

The default command for SWE-bench Lite:

mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results.json -r report.md

Customization Options

Users can customize the run by modifying:

  • Sample size: Change -n 5 to any number (or remove it for the full dataset)
  • Config file: Change -c mcpbr.yaml to point to a different config
  • Verbosity: Use -vv for very verbose output
  • Output files: Change results.json and report.md to different paths
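
For example, a larger, very verbose run against a different config with custom output paths could look like the following (the config path and file names here are purely illustrative):

mcpbr run -c configs/my-server.yaml --dataset SWE-bench/SWE-bench_Lite -n 25 -vv -o runs/lite-25.json -r runs/lite-25.md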

Example Variations

Minimal quick test (1 task)

mcpbr run -c mcpbr.yaml -n 1 -v

Full evaluation (all ~300 tasks)

mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -v -o results.json

MCP-only (skip baseline)

mcpbr run -c mcpbr.yaml -n 5 -M -v -o results.json

Specific tasks

mcpbr run -c mcpbr.yaml -t astropy__astropy-12907 -t django__django-11099 -v

Expected Runtime & Cost

For 5 tasks with default settings:

  • Runtime: 15-30 minutes (depends on task complexity)
  • Cost: $2-5 (depends on task complexity and model used)

What to Do If It Fails

  1. Docker not running: Start Docker Desktop
  2. API key missing: Set it with export ANTHROPIC_API_KEY="sk-ant-..."
  3. Config missing: Run mcpbr init to generate the default config
  4. Config invalid: Check that the {workdir} placeholder is in the args array
  5. MCP server fails: Test the server command independently
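
For failure mode 4, a quick (if rough) check is to confirm the placeholder appears in the config at all; grep cannot verify that it sits inside the args array, so treat this as a hint rather than a full validation:

grep -n '{workdir}' mcpbr.yaml || echo "no {workdir} placeholder found in mcpbr.yaml"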

After the Run

Once complete, you'll have:

  • results.json: Full evaluation data with metrics, token usage, and per-task results
  • report.md: Human-readable summary with resolution rates and comparisons
  • Console output: Real-time progress and summary table

Review the results to see how your MCP server performed compared to the baseline!
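
A quick way to skim both files from the command line (jq is optional, and the exact JSON layout is whatever mcpbr emits, so inspect it rather than assuming field names):

jq 'keys' results.json
head -n 40 report.md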

Pro Tips

  • Start with -n 1 to verify everything works before running larger evaluations
  • Use --log-dir logs/ to save detailed per-task logs for debugging
  • Compare multiple runs by changing the MCP server config between runs
  • Use --baseline-results baseline.json to detect regressions between versions
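
Putting the first and last tips together, a cautious workflow might look like this (the file names are illustrative, and it assumes --baseline-results accepts a previous run's results file):

# 1. Smoke test a single task first
mcpbr run -c mcpbr.yaml -n 1 -v

# 2. Run the evaluation with per-task logs, saving results as a baseline
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results.json -r report.md --log-dir logs/

# 3. After changing the MCP server config, re-run and compare against the baseline
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results-new.json --baseline-results results.json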