Claude-skill-registry-data run-benchmark

Run an MCP evaluation using mcpbr on SWE-bench or other datasets.

install

source · Clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry-data

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mcpbr-eval" ~/.claude/skills/majiayu000-claude-skill-registry-data-run-benchmark && rm -rf "$T"
manifest: data/mcpbr-eval/SKILL.md
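
To confirm the install landed where Claude Code looks for skills (the path follows from the copy command above):

ls ~/.claude/skills/majiayu000-claude-skill-registry-data-run-benchmark/SKILL.md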
source content

Instructions

You are an expert at benchmarking AI agents using the mcpbr CLI. Your goal is to run valid, reproducible evaluations.

Critical Constraints (DO NOT IGNORE)

  1. Docker is Mandatory: Before running ANY mcpbr command, you MUST verify Docker is running (docker ps). If not, tell the user to start it.

  2. Config is Required: mcpbr run FAILS without a config file. Never guess flags.

    • IF no config exists: Run mcpbr init first to generate a template.
    • IF config exists: Read it (cat mcpbr.yaml, or the specified config path) to verify the mcp_server command is valid for the user's environment (e.g., check whether npx or uvx is installed).
  3. Workdir Placeholder: When generating configs, ensure args includes "{workdir}". Do not resolve this path yourself; mcpbr handles it. (See the config sketch after this list.)

  4. API Key Required: The ANTHROPIC_API_KEY environment variable must be set. Check for it before running evaluations.
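
For orientation, a minimal sketch of what such a config might look like, built only from the fields this skill names (mcp_server.command, mcp_server.args, model, dataset, timeout_seconds). The layout and values are assumptions for illustration; the template written by mcpbr init is authoritative:

# mcpbr.yaml (hypothetical sketch, not a real template)
mcp_server:
  command: npx                                  # must be installed on this machine
  args: ["-y", "your-mcp-server", "{workdir}"]  # "your-mcp-server" is a placeholder; keep "{workdir}" literal
model: <model-id>                               # pick one from: mcpbr models
dataset: SWE-bench/SWE-bench_Lite
timeout_seconds: 600                            # raise this if tasks time out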

Common Pitfalls to Avoid

  • DO NOT use the -m flag unless the user explicitly asks to override the model in the YAML.
  • DO NOT hallucinate dataset names. Valid datasets include:
    • SWE-bench/SWE-bench_Lite (default for SWE-bench)
    • SWE-bench/SWE-bench_Verified
    • sunblaze-ucb/cybergym (for the CyberGym benchmark)
    • MCPToolBench/MCPToolBenchPP (for MCPToolBench++)
  • DO NOT hallucinate flags or options. Only use documented CLI flags.
  • DO NOT forget to specify the config file with -c or --config.

Supported Benchmarks

mcpbr supports three benchmarks (a quick smoke test for each follows the list):

  1. SWE-bench (default): Real GitHub issues requiring bug fixes
    • Dataset: SWE-bench/SWE-bench_Lite or SWE-bench/SWE-bench_Verified
    • Use: mcpbr run -c config.yaml (add --benchmark swe-bench to be explicit)
  2. CyberGym: Security vulnerabilities requiring PoC exploits
    • Dataset: sunblaze-ucb/cybergym
    • Use: mcpbr run -c config.yaml --benchmark cybergym --level [0-3]
  3. MCPToolBench++: Large-scale tool use evaluation
    • Dataset: MCPToolBench/MCPToolBenchPP
    • Use: mcpbr run -c config.yaml --benchmark mcptoolbench
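
Before committing to a full run, a cheap one-task smoke test per benchmark can catch config problems early. The -n 1 idea is a suggestion, not part of the skill; every flag used here is documented below:

# SWE-bench (default benchmark)
mcpbr run -c config.yaml -n 1

# CyberGym at its easiest level
mcpbr run -c config.yaml --benchmark cybergym --level 0 -n 1

# MCPToolBench++
mcpbr run -c config.yaml --benchmark mcptoolbench -n 1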

Execution Steps

Follow these steps in order:

  1. Verify Prerequisites:

    # Check Docker is running
    docker ps
    
    # Verify the API key is set (without printing the secret)
    test -n "$ANTHROPIC_API_KEY" && echo "ANTHROPIC_API_KEY is set" || echo "ANTHROPIC_API_KEY is MISSING"
    
  2. Check for Config File:

    • If mcpbr.yaml (or the user-specified config) does NOT exist: Run mcpbr init to generate it.
    • If the config exists: Read it to understand the configuration.
  3. Validate Config:

    • Ensure mcp_server.command is valid (e.g., npx, uvx, or python is installed).
    • Ensure mcp_server.args includes the "{workdir}" placeholder.
    • Verify model, dataset, and other parameters are correctly set.
  4. Construct the Command:

    • Base command: mcpbr run --config <path-to-config>
    • Add flags as needed based on the user's request:
      • -n <number> or --sample <number>: Override sample size
      • -v or -vv: Verbose output
      • -o <path>: Save JSON results
      • -r <path>: Save a Markdown report
      • --log-dir <path>: Save per-instance logs
      • -M: MCP-only evaluation (skip baseline)
      • -B: Baseline-only evaluation (skip MCP)
      • --benchmark <name>: Override the benchmark
      • --level <0-3>: Set CyberGym difficulty level
  5. Run the Command: Execute the constructed command and monitor the output. (A composed preflight-and-run sketch follows these steps.)

  6. Handle Results:

    • If the run completes successfully, inform the user about the results.
    • If errors occur, diagnose and provide actionable feedback.
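
Putting steps 1-4 together, a preflight-and-run script sketch. The checks mirror the constraints above; that mcpbr init writes mcpbr.yaml into the current directory is implied by step 2, and the final flag combination is just one illustrative choice:

# 1. Prerequisites: Docker up, API key present
docker ps >/dev/null || { echo "Start Docker first"; exit 1; }
test -n "$ANTHROPIC_API_KEY" || { echo "Set ANTHROPIC_API_KEY first"; exit 1; }

# 2. Config: generate a template if none exists, then inspect it
[ -f mcpbr.yaml ] || mcpbr init
cat mcpbr.yaml

# 3. Validate: the args list must keep the literal {workdir} placeholder
grep -qF '{workdir}' mcpbr.yaml || echo 'WARNING: {workdir} placeholder missing'

# 4. Run with documented flags only: 5 samples, verbose, JSON results, Markdown report, per-instance logs
mcpbr run -c mcpbr.yaml -n 5 -v -o results.json -r report.md --log-dir logs/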

Example Commands

# Full evaluation with 5 tasks
mcpbr run -c config.yaml -n 5 -v

# MCP-only evaluation
mcpbr run -c config.yaml -M -n 10

# Save results and report
mcpbr run -c config.yaml -o results.json -r report.md

# Run CyberGym at level 2
mcpbr run -c config.yaml --benchmark cybergym --level 2 -n 5

# Run specific tasks
mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099

Troubleshooting

If you encounter errors:

  1. Docker not running: Remind the user to start Docker Desktop or the Docker daemon.
  2. API key missing: Ask the user to set export ANTHROPIC_API_KEY="sk-ant-..."
  3. Config file invalid: Re-generate with mcpbr init or fix the YAML syntax.
  4. MCP server fails to start: Test the server command independently (see the sketch below).
  5. Timeout issues: Suggest increasing timeout_seconds in the config.
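
For item 4, one way to exercise the server command by hand before involving mcpbr. The package name below is a hypothetical stand-in; substitute the exact command and args from your config, with a real directory in place of {workdir}:

# Hypothetical: launch the configured MCP server directly and watch for startup errors
npx -y your-mcp-server "$(pwd)"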

Important Reminders

  • Always read the config file before making assumptions about what's configured.
  • Never modify the config file without explicit user permission.
  • Use the mcpbr models command to check available models if needed.
  • Use the mcpbr benchmarks command to list available benchmarks.