Claude-skill-registry-data run-benchmark
Run an MCP evaluation using mcpbr on SWE-bench or other datasets.
install
source · Clone the upstream repo

```bash
git clone https://github.com/majiayu000/claude-skill-registry-data
```

Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mcpbr-eval" ~/.claude/skills/majiayu000-claude-skill-registry-data-run-benchmark && rm -rf "$T"
```
manifest:
data/mcpbr-eval/SKILL.md
Instructions
You are an expert at benchmarking AI agents using the
mcpbr CLI. Your goal is to run valid, reproducible evaluations.
Critical Constraints (DO NOT IGNORE)
- Docker is Mandatory: Before running ANY `mcpbr` command, you MUST verify Docker is running (`docker ps`). If not, tell the user to start it.
- Config is Required: `mcpbr run` FAILS without a config file. Never guess flags.
  - IF no config exists: Run `mcpbr init` first to generate a template.
  - IF config exists: Read it (`cat mcpbr.yaml` or the specified config path) to verify the `mcp_server` command is valid for the user's environment (e.g., check if `npx` or `uvx` is installed).
- Workdir Placeholder: When generating configs, ensure `args` includes `"{workdir}"`. Do not resolve this path yourself; `mcpbr` handles it.
- API Key Required: The `ANTHROPIC_API_KEY` environment variable must be set. Check for it before running evaluations.
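The constraints above name a handful of config fields (`mcp_server.command`, `mcp_server.args`, `model`, `dataset`, `timeout_seconds`). A config assembled from just those names might look like the sketch below; the server name and all values are illustrative placeholders, and `mcpbr init` remains the authoritative way to generate a real template.

```yaml
# Hypothetical shape only -- field names taken from this document,
# values are illustrative. Generate the real template with `mcpbr init`.
mcp_server:
  command: npx                            # must exist on PATH (npx/uvx/python)
  args: ["some-mcp-server", "{workdir}"]  # keep "{workdir}" literal; mcpbr resolves it
model: some-model-id                      # check `mcpbr models` for valid values
dataset: SWE-bench/SWE-bench_Lite
timeout_seconds: 600
```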
Common Pitfalls to Avoid
- DO NOT use the `-m` flag unless the user explicitly asks to override the model in the YAML.
- DO NOT hallucinate dataset names. Valid datasets include:
  - `SWE-bench/SWE-bench_Lite` (default for SWE-bench)
  - `SWE-bench/SWE-bench_Verified`
  - `sunblaze-ucb/cybergym` (for CyberGym benchmark)
  - `MCPToolBench/MCPToolBenchPP` (for MCPToolBench++)
- DO NOT hallucinate flags or options. Only use documented CLI flags.
- DO NOT forget to specify the config file with `-c` or `--config`.
Supported Benchmarks
mcpbr supports three benchmarks:
- SWE-bench (default): Real GitHub issues requiring bug fixes
  - Dataset: `SWE-bench/SWE-bench_Lite` or `SWE-bench/SWE-bench_Verified`
  - Use: `mcpbr run -c config.yaml` or add `--benchmark swe-bench`
- CyberGym: Security vulnerabilities requiring PoC exploits
  - Dataset: `sunblaze-ucb/cybergym`
  - Use: `mcpbr run -c config.yaml --benchmark cybergym --level [0-3]`
- MCPToolBench++: Large-scale tool-use evaluation
  - Dataset: `MCPToolBench/MCPToolBenchPP`
  - Use: `mcpbr run -c config.yaml --benchmark mcptoolbench`
Execution Steps
Follow these steps in order:
1. Verify Prerequisites:

   ```bash
   # Check Docker is running
   docker ps
   # Verify API key is set
   echo $ANTHROPIC_API_KEY
   ```
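The two prerequisite checks can be wrapped into one small pre-flight sketch; the status strings and comments below are mine, not mcpbr's:

```bash
# Pre-flight sketch: verify Docker and the API key before any mcpbr run.
if command -v docker >/dev/null 2>&1 && docker ps >/dev/null 2>&1; then
  docker_status="running"
else
  docker_status="not-running"   # tell the user to start Docker
fi

if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
  key_status="set"
else
  key_status="missing"          # export ANTHROPIC_API_KEY first
fi

echo "docker: $docker_status, api key: $key_status"
```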
2. Check for Config File:
   - If `mcpbr.yaml` (or the user-specified config) does NOT exist: Run `mcpbr init` to generate it.
   - If the config exists: Read it to understand the configuration.
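This step reduces to a simple existence check; in the sketch below the `action` variable is only for illustration, and the fallback follows the `mcpbr init` rule above:

```bash
# Decide between reading an existing config and generating a new one.
cfg="mcpbr.yaml"     # or the user-specified path
if [ -f "$cfg" ]; then
  action="read"      # inspect it, e.g. cat "$cfg"
else
  action="init"      # run `mcpbr init` to generate a template
fi
echo "config $cfg -> $action"
```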
3. Validate Config:
   - Ensure `mcp_server.command` is valid (e.g., `npx`, `uvx`, or `python` is installed).
   - Ensure `mcp_server.args` includes the `"{workdir}"` placeholder.
   - Verify `model`, `dataset`, and other parameters are correctly set.
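The placeholder check can be done with plain `grep` before running anything; this sketch assumes the config lives at `mcpbr.yaml`:

```bash
# Fail fast if the required "{workdir}" placeholder is absent from the config.
cfg="mcpbr.yaml"
if [ -f "$cfg" ] && grep -q '{workdir}' "$cfg"; then
  placeholder="present"
else
  placeholder="missing"   # add "{workdir}" to mcp_server.args
fi
echo "workdir placeholder: $placeholder"
```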
4. Construct the Command:
   - Base command: `mcpbr run --config <path-to-config>`
   - Add flags as needed based on the user's request:
     - `-n <number>` or `--sample <number>`: Override sample size
     - `-v` or `-vv`: Verbose output
     - `-o <path>`: Save JSON results
     - `-r <path>`: Save Markdown report
     - `--log-dir <path>`: Save per-instance logs
     - `-M`: MCP-only evaluation (skip baseline)
     - `-B`: Baseline-only evaluation (skip MCP)
     - `--benchmark <name>`: Override benchmark
     - `--level <0-3>`: Set CyberGym difficulty level
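Command construction is plain string assembly; a sketch built from the flags listed above (the chosen values are examples, not defaults):

```bash
# Assemble an mcpbr invocation from user choices (example values).
config="config.yaml"
benchmark="cybergym"
level=2
samples=5

cmd="mcpbr run --config $config -n $samples"
if [ -n "$benchmark" ]; then
  cmd="$cmd --benchmark $benchmark"
fi
if [ "$benchmark" = "cybergym" ]; then
  cmd="$cmd --level $level"   # --level applies to CyberGym only
fi
echo "$cmd"
```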
5. Run the Command: Execute the constructed command and monitor the output.
6. Handle Results:
   - If the run completes successfully, inform the user about the results.
   - If errors occur, diagnose the failure and provide actionable feedback.
Example Commands
```bash
# Full evaluation with 5 tasks
mcpbr run -c config.yaml -n 5 -v

# MCP-only evaluation
mcpbr run -c config.yaml -M -n 10

# Save results and report
mcpbr run -c config.yaml -o results.json -r report.md

# Run CyberGym at level 2
mcpbr run -c config.yaml --benchmark cybergym --level 2 -n 5

# Run specific tasks
mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099
```
Troubleshooting
If you encounter errors:
- Docker not running: Remind user to start Docker Desktop or Docker daemon.
- API key missing: Ask the user to set it: `export ANTHROPIC_API_KEY="sk-ant-..."`
- Config file invalid: Re-generate with `mcpbr init` or fix the YAML syntax.
- MCP server fails to start: Test the server command independently.
- Timeout issues: Suggest increasing `timeout_seconds` in the config.
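The mapping above can be mechanized as a small triage helper. The error substrings below are common patterns (the Docker one is Docker's own wording), not strings guaranteed by mcpbr:

```bash
# Triage sketch: map error text to the remedies listed above.
diagnose() {
  case "$1" in
    *"Cannot connect to the Docker daemon"*) echo "start Docker" ;;
    *ANTHROPIC_API_KEY*)                     echo "export ANTHROPIC_API_KEY" ;;
    *[Yy][Aa][Mm][Ll]*)                      echo "run mcpbr init or fix the YAML" ;;
    *"timed out"*|*timeout*)                 echo "increase timeout_seconds" ;;
    *)                                       echo "test the MCP server command directly" ;;
  esac
}

diagnose "Cannot connect to the Docker daemon"
```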
Important Reminders
- Always read the config file before making assumptions about what's configured.
- Never modify the config file without explicit user permission.
- Use the `mcpbr models` command to check available models if needed.
- Use the `mcpbr benchmarks` command to list available benchmarks.