Claude-corps benchmark
Use when you need measured performance evidence by running a repeatable command on the current branch and a baseline ref
git clone https://github.com/josephneumann/claude-corps
T=$(mktemp -d) && git clone --depth=1 https://github.com/josephneumann/claude-corps "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/benchmark" ~/.claude/skills/josephneumann-claude-corps-benchmark && rm -rf "$T"
skills/benchmark/SKILL.md

/benchmark $ARGUMENTS
Run a repeatable benchmark command on the current branch and compare it with a baseline ref. This skill is for measurement, not speculation.
If the user wants architectural performance review without running code, use `/multi-review` with `performance-oracle` instead.
Arguments
/benchmark "<command>" [--baseline <ref>] [--iterations N]
Defaults:
- `--baseline`: `origin/main`
- `--iterations`: 5
- warmup runs: 1
Reject the request if no command is provided.
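For illustration only, two invocations that fit this shape (the commands and the tag are placeholders, not part of the skill):

```
# Benchmark the test suite with the defaults (origin/main, 5 iterations)
/benchmark "npm test"

# Compare a specific benchmark against a tagged release with more samples
/benchmark "cargo bench --bench parse" --baseline v1.4.0 --iterations 10
```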
Safety Rules
- The benchmark command must be repeatable
- Do not use commands that intentionally mutate repo-tracked files
- Build artifacts and caches are acceptable if they are not tracked
- If the command dirties the worktree, stop and report it
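A minimal sketch of that last rule, assuming the run is scripted in shell and the benchmark command is held in a hypothetical `BENCH_CMD` variable:

```
# Snapshot tracked-file state around a run; stop if the command dirtied it
before=$(git status --porcelain)
eval "$BENCH_CMD" > /dev/null 2>&1
after=$(git status --porcelain)
if [ "$before" != "$after" ]; then
  echo "Benchmark command dirtied the worktree; stopping." >&2
  exit 1
fi
```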
Phase 1: Pre-Flight
- Record the current branch and `git status --short`
- Verify the worktree is clean enough to benchmark safely
- Resolve the baseline ref
- Determine whether the benchmark command depends on repo-local dependencies, generated assets, or setup steps
- Create a temporary detached worktree for the baseline
- Establish equivalent runtime environments for current and baseline before timing anything
Use a temporary directory under `/tmp` or equivalent. Never benchmark the baseline by checking out over the user's current branch.
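A sketch of that baseline setup in shell, assuming a `BASELINE` ref variable and a throwaway directory (both names are illustrative):

```
# Resolve the baseline ref up front; stop if it does not exist
BASELINE="${BASELINE:-origin/main}"
git rev-parse --verify --quiet "${BASELINE}^{commit}" > /dev/null || {
  echo "Cannot resolve baseline ref: $BASELINE" >&2
  exit 1
}

# Detached worktree in a throwaway directory, never over the user's checkout
BASE_DIR=$(mktemp -d)
git worktree add --detach "$BASE_DIR" "$BASELINE"
```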
Environment setup rules:
- Read local sources of truth first: `CLAUDE.md`, `README.md`, and project manifests such as `package.json`, `pyproject.toml`, `Cargo.toml`, or `Makefile`
- If the benchmark command depends on installed dependencies or generated artifacts, run the same setup flow in both environments before measuring (see the sketch after this list)
- Prefer the project's existing setup command if one is documented, for example `pnpm install --frozen-lockfile`, `uv sync`, `cargo fetch`, or `make setup`
- Do not count setup time as benchmark time
- If you cannot establish equivalent environments on both refs, stop and report that the benchmark would be untrustworthy
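As one sketch of the "same setup flow in both environments" rule, assuming a Node project whose documented setup is `pnpm install --frozen-lockfile` and reusing the hypothetical `BASE_DIR` worktree from Phase 1 (substitute whatever the project's manifests actually document):

```
# Run the identical, documented setup in both checkouts before any timing
for dir in "$PWD" "$BASE_DIR"; do
  (cd "$dir" && pnpm install --frozen-lockfile)
done
```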
Phase 2: Measurement Protocol
For both current branch and baseline:
- Run 1 warmup execution
- Run `N` measured executions
- Capture wall-clock duration for every measured run
- If the command already prints its own benchmark metrics, include them in the notes, but wall-clock time remains the required common metric
Prefer a consistent timing mechanism for every run. Keep environment conditions as similar as possible across both refs.
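One possible timing loop for this protocol, assuming GNU `date` for sub-second wall-clock timestamps and the hypothetical `BENCH_CMD`, `N`, and `BASE_DIR` names from the earlier sketches:

```
N="${N:-5}"
run_bench() {   # $1 = directory to run in, $2 = file collecting durations
  (cd "$1" && eval "$BENCH_CMD") > /dev/null 2>&1          # one untimed warmup
  for _ in $(seq 1 "$N"); do
    start=$(date +%s.%N)
    (cd "$1" && eval "$BENCH_CMD") > /dev/null 2>&1
    end=$(date +%s.%N)
    echo "$end - $start" | bc >> "$2"                      # wall-clock seconds
  done
}
run_bench "$PWD"      current.times
run_bench "$BASE_DIR" baseline.times
```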
Phase 3: Summarize Results
Compute for current and baseline:
- all measured run times
- median
- min
- max
- percentage delta of median vs baseline
If run-to-run variance is high, say so. Do not overclaim a tiny difference hidden by noise.
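A sketch of computing those statistics from per-run duration files (one duration per line) with `sort` and `awk`; the file names follow the earlier sketches and are not prescribed by the skill:

```
# Median of a one-duration-per-line file
median() {
  sort -n "$1" | awk '{ v[NR] = $1 } END {
    if (NR % 2) print v[(NR + 1) / 2]
    else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
  }'
}

cur=$(median current.times)
base=$(median baseline.times)
awk -v c="$cur" -v b="$base" \
  'BEGIN { printf "median delta vs baseline: %+.1f%%\n", (c - b) / b * 100 }'
sort -n current.times  | sed -n '1p;$p'   # current min and max
sort -n baseline.times | sed -n '1p;$p'   # baseline min and max
```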
Required Output
```
## Benchmark Summary
Command: ...
Baseline: ...
Iterations: ...

### Current Branch
Runs: [...]
Median: ...
Min/Max: ...

### Baseline
Runs: [...]
Median: ...
Min/Max: ...

### Delta
Current vs baseline median: ...%

### Notes
- command-native metrics if any
- variance caveats
- anything that may have skewed results
```
Cleanup
Always remove the temporary baseline worktree before exiting, even on failure.
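One way to make that unconditional in a shell sketch, reusing the hypothetical `BASE_DIR` from Phase 1:

```
# Remove the baseline worktree on every exit path, including failures
cleanup() {
  git worktree remove --force "$BASE_DIR" 2> /dev/null || rm -rf "$BASE_DIR"
  git worktree prune
}
trap cleanup EXIT
```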
Failure Handling
- If the benchmark command fails on either ref, report the failing ref and stop
- If the baseline ref cannot be resolved, stop
- If cleanup fails, report it explicitly
Interpretation Rules
- Measured improvement beats theoretical reasoning
- High variance weakens confidence
- A benchmark proves only the command that was run, not overall product performance