Claude-skill-registry-data run-benchmark

Run an MCP evaluation using mcpbr on SWE-bench or other datasets.

install

source · Clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry-data

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mcpbr-eval" ~/.claude/skills/majiayu000-claude-skill-registry-data-run-benchmark && rm -rf "$T"
manifest: data/mcpbr-eval/SKILL.md
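
To confirm the install landed where Claude Code looks for skills (the path follows from the copy command above):

ls ~/.claude/skills/majiayu000-claude-skill-registry-data-run-benchmark/SKILL.md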
source content

Instructions

You are an expert at benchmarking AI agents using the mcpbr CLI. Your goal is to run valid, reproducible evaluations.

Critical Constraints (DO NOT IGNORE)

  1. Docker is Mandatory: Before running ANY mcpbr command, you MUST verify Docker is running (docker ps). If not, tell the user to start it.

  2. Config is Required: mcpbr run FAILS without a config file. Never guess flags.

    • IF no config exists: Run mcpbr init first to generate a template.
    • IF config exists: Read it (cat mcpbr.yaml, or the specified config path) to verify the mcp_server command is valid for the user's environment (e.g., check whether npx or uvx is installed).
  3. Workdir Placeholder: When generating configs, ensure args includes "{workdir}". Do not resolve this path yourself; mcpbr handles it. (See the config sketch after this list.)

  4. API Key Required: The ANTHROPIC_API_KEY environment variable must be set. Check for it before running evaluations.
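
For orientation, a minimal sketch of what such a config might look like, built only from the fields this skill names (mcp_server.command, mcp_server.args, model, dataset, timeout_seconds). The layout and values are assumptions for illustration; the template written by mcpbr init is authoritative:

# mcpbr.yaml (hypothetical sketch, not a real template)
mcp_server:
  command: npx                                  # must be installed on this machine
  args: ["-y", "your-mcp-server", "{workdir}"]  # "your-mcp-server" is a placeholder; keep "{workdir}" literal
model: <model-id>                               # pick one from: mcpbr models
dataset: SWE-bench/SWE-bench_Lite
timeout_seconds: 600                            # raise this if tasks time out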

Common Pitfalls to Avoid

  • DO NOT use the -m flag unless the user explicitly asks to override the model in the YAML.
  • DO NOT hallucinate dataset names. Valid datasets include:
    • SWE-bench/SWE-bench_Lite (default for SWE-bench)
    • SWE-bench/SWE-bench_Verified
    • sunblaze-ucb/cybergym (for the CyberGym benchmark)
    • MCPToolBench/MCPToolBenchPP (for MCPToolBench++)
  • DO NOT hallucinate flags or options. Only use documented CLI flags.
  • DO NOT forget to specify the config file with -c or --config.

Supported Benchmarks

mcpbr supports three benchmarks (a quick smoke test for each follows the list):

  1. SWE-bench (default): Real GitHub issues requiring bug fixes
    • Dataset: SWE-bench/SWE-bench_Lite or SWE-bench/SWE-bench_Verified
    • Use: mcpbr run -c config.yaml (add --benchmark swe-bench to be explicit)
  2. CyberGym: Security vulnerabilities requiring PoC exploits
    • Dataset: sunblaze-ucb/cybergym
    • Use: mcpbr run -c config.yaml --benchmark cybergym --level [0-3]
  3. MCPToolBench++: Large-scale tool use evaluation
    • Dataset: MCPToolBench/MCPToolBenchPP
    • Use: mcpbr run -c config.yaml --benchmark mcptoolbench
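
Before committing to a full run, a cheap one-task smoke test per benchmark can catch config problems early. The -n 1 idea is a suggestion, not part of the skill; every flag used here is documented below:

# SWE-bench (default benchmark)
mcpbr run -c config.yaml -n 1

# CyberGym at its easiest level
mcpbr run -c config.yaml --benchmark cybergym --level 0 -n 1

# MCPToolBench++
mcpbr run -c config.yaml --benchmark mcptoolbench -n 1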

Execution Steps

Follow these steps in order:

  1. Verify Prerequisites:

    # Check Docker is running
    docker ps
    
    # Verify the API key is set (without printing the secret)
    test -n "$ANTHROPIC_API_KEY" && echo "ANTHROPIC_API_KEY is set" || echo "ANTHROPIC_API_KEY is MISSING"
    
  2. Check for Config File:

    • If mcpbr.yaml (or the user-specified config) does NOT exist: Run mcpbr init to generate it.
    • If the config exists: Read it to understand the configuration.
  3. Validate Config:

    • Ensure mcp_server.command is valid (e.g., npx, uvx, or python is installed).
    • Ensure mcp_server.args includes the "{workdir}" placeholder.
    • Verify model, dataset, and other parameters are correctly set.
  4. Construct the Command:

    • Base command: mcpbr run --config <path-to-config>
    • Add flags as needed based on the user's request:
      • -n <number> or --sample <number>: Override sample size
      • -v or -vv: Verbose output
      • -o <path>: Save JSON results
      • -r <path>: Save a Markdown report
      • --log-dir <path>: Save per-instance logs
      • -M: MCP-only evaluation (skip baseline)
      • -B: Baseline-only evaluation (skip MCP)
      • --benchmark <name>: Override the benchmark
      • --level <0-3>: Set CyberGym difficulty level
  5. Run the Command: Execute the constructed command and monitor the output. (A composed preflight-and-run sketch follows these steps.)

  6. Handle Results:

    • If the run completes successfully, inform the user about the results.
    • If errors occur, diagnose and provide actionable feedback.
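
Putting steps 1-4 together, a preflight-and-run script sketch. The checks mirror the constraints above; that mcpbr init writes mcpbr.yaml into the current directory is implied by step 2, and the final flag combination is just one illustrative choice:

# 1. Prerequisites: Docker up, API key present
docker ps >/dev/null || { echo "Start Docker first"; exit 1; }
test -n "$ANTHROPIC_API_KEY" || { echo "Set ANTHROPIC_API_KEY first"; exit 1; }

# 2. Config: generate a template if none exists, then inspect it
[ -f mcpbr.yaml ] || mcpbr init
cat mcpbr.yaml

# 3. Validate: the args list must keep the literal {workdir} placeholder
grep -qF '{workdir}' mcpbr.yaml || echo 'WARNING: {workdir} placeholder missing'

# 4. Run with documented flags only: 5 samples, verbose, JSON results, Markdown report, per-instance logs
mcpbr run -c mcpbr.yaml -n 5 -v -o results.json -r report.md --log-dir logs/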

Example Commands

# Full evaluation with 5 tasks
mcpbr run -c config.yaml -n 5 -v

# MCP-only evaluation
mcpbr run -c config.yaml -M -n 10

# Save results and report
mcpbr run -c config.yaml -o results.json -r report.md

# Run CyberGym at level 2
mcpbr run -c config.yaml --benchmark cybergym --level 2 -n 5

# Run specific tasks
mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099

Troubleshooting

If you encounter errors:

  1. Docker not running: Remind the user to start Docker Desktop or the Docker daemon.
  2. API key missing: Ask the user to set export ANTHROPIC_API_KEY="sk-ant-..."
  3. Config file invalid: Re-generate with mcpbr init or fix the YAML syntax.
  4. MCP server fails to start: Test the server command independently (see the sketch below).
  5. Timeout issues: Suggest increasing timeout_seconds in the config.
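
For item 4, one way to exercise the server command by hand before involving mcpbr. The package name below is a hypothetical stand-in; substitute the exact command and args from your config, with a real directory in place of {workdir}:

# Hypothetical: launch the configured MCP server directly and watch for startup errors
npx -y your-mcp-server "$(pwd)"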

Important Reminders

  • Always read the config file before making assumptions about what's configured.
  • Never modify the config file without explicit user permission.
  • Use the mcpbr models command to check available models if needed.
  • Use the mcpbr benchmarks command to list available benchmarks.