Claude-kit benchmark
Compare Claude Code output with full config vs minimal config using standardized tasks per stack.
Install

Source · Clone the upstream repo:

```
git clone https://github.com/luiseiman/dotforge
```

Claude Code · Install into `~/.claude/skills/`:

```
T=$(mktemp -d) && git clone --depth=1 https://github.com/luiseiman/dotforge "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/benchmark" ~/.claude/skills/luiseiman-claude-kit-benchmark && rm -rf "$T"
```
Manifest: `skills/benchmark/SKILL.md`
Benchmark
Compare the effectiveness of a project's full dotforge configuration against a minimal baseline by executing the same standardized task in two isolated worktrees.
Cost warning: Each benchmark runs Claude Code twice (full + minimal). Use sparingly and only after Phases 0-2 are working.
Prerequisites
- Project must have `.claude/settings.json` and `CLAUDE.md`
- Project must be a git repository with a clean working tree
- Task definitions must exist in `$DOTFORGE_DIR/tests/benchmark-tasks/`
Step 1: Select task
- Detect project stacks from `.claude/.forge-manifest.json`, or infer from project files
- Load the matching task from `$DOTFORGE_DIR/tests/benchmark-tasks/{stack}.yml`
- If multiple stacks match, let the user choose or run the first match
- If no stack matches, use `generic.yml`
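A task definition might look like the following sketch. The field names (`id`, `prompt`, `test_command`, `lint_command`) are illustrative assumptions inferred from the metrics captured in Step 3, not the actual dotforge schema:

```yaml
# Hypothetical generic.yml — field names are assumptions, not the real schema
id: generic-01
title: Add a small utility with a test
description: Create a pure function and a matching test file.
prompt: |
  Add a slugify(text) utility that lowercases and hyphenates a string,
  plus a test that covers at least three inputs.
test_command: "echo no-op"   # stack-specific in real task files
lint_command: "echo no-op"
```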
Display:
```
═══ BENCHMARK SETUP ═══
Project: {{name}}
Stack detected: {{stack}}
Task: {{task title}}
Description: {{task description}}

⚠ This will run Claude Code twice in isolated worktrees.
Proceed? (yes/no)
```
Step 2: Prepare worktrees
Create two git worktrees from the current HEAD:
- Full config — `git worktree add /tmp/bench-full-{{slug}} HEAD`
  - Copy the entire `.claude/` directory as-is
  - Copy `CLAUDE.md` as-is
- Minimal config — `git worktree add /tmp/bench-minimal-{{slug}} HEAD`
  - Create a minimal `CLAUDE.md` with only the project name and a "Build & Test" section
  - Create a minimal `.claude/settings.json` with only `allowedTools` (no hooks, no deny list)
  - No `.claude/rules/`, no hooks, no agents
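The baseline's `.claude/settings.json` might be as small as this sketch. Only the `allowedTools` key is taken from the text above; the exact key layout of Claude Code settings may differ:

```json
{
  "allowedTools": ["Bash", "Read", "Write", "Edit", "Glob", "Grep"]
}
```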
Step 3: Execute task
For each worktree, run the task prompt using Claude Code in non-interactive mode:
```
cd /tmp/bench-full-{{slug}}
claude --print "{{task prompt}}" --allowedTools "Bash,Read,Write,Edit,Glob,Grep"
```
Same for minimal worktree.
Capture for each run:
- `files_created`: count of new files (`git diff --name-only --diff-filter=A`)
- `files_modified`: count of modified files
- `tests_passing`: run the test command from the task definition, count pass/fail
- `lint_issues`: run the lint command from the task definition, count issues
- `errors`: grep stderr for error patterns
- `has_test`: boolean — did Claude create a test file?
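The capture step above can be sketched as follows. This is a self-contained demo in a throwaway repo; in real use you would `cd` into the benchmark worktree instead. Note that untracked files must be staged before `git diff --diff-filter=A` will list them:

```shell
# Minimal sketch of metric capture, run in a throwaway repo so it is runnable
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=bench@local -c user.name=bench commit -q --allow-empty -m base

echo 'assert true' > test_example.py        # simulate a file Claude created

# Stage everything so new (untracked) files show up in the diff
git add -A
files_created=$(git diff --cached --name-only --diff-filter=A | wc -l)

# Heuristic: any new path containing "test" or "spec" counts as a test file
has_test=false
git diff --cached --name-only | grep -qiE '(^|/)(test|spec)' && has_test=true

echo "files_created=$files_created has_test=$has_test"
```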
Step 4: Compare and report
```
═══ BENCHMARK RESULTS — {{project}} ═══
Task: {{task title}}
Stack: {{stack}}
Date: {{YYYY-MM-DD}}

                 Full Config   Minimal Config   Delta
Files created:   {{N}}         {{N}}            {{+/-N}}
Tests created:   {{yes/no}}    {{yes/no}}       —
Tests passing:   {{N/M}}       {{N/M}}          {{+/-N}}
Lint issues:     {{N}}         {{N}}            {{+/-N}}
Errors:          {{N}}         {{N}}            {{+/-N}}

── ANALYSIS ──
{{if full is better across metrics: "Full config prevented {{N}} lint issues and {{N}} errors. ROI: rules + hooks justified for this project."}}
{{if similar: "Minimal difference detected. Consider simplifying configuration or running /forge rule-check to identify inert rules."}}
{{if minimal is better: "⚠ Full config may be adding overhead without benefit. Review rules for contradictions or excessive constraints."}}
```
Step 5: Cleanup
- Remove worktrees: `git worktree remove /tmp/bench-full-{{slug}}` and the minimal counterpart
- Optionally save results to `~/.claude/metrics/{{slug}}/benchmark-{{date}}.json`
Results JSON schema:
```json
{
  "project": "{{slug}}",
  "date": "{{YYYY-MM-DD}}",
  "stack": "{{stack}}",
  "task": "{{task id}}",
  "full":    { "files_created": 0, "tests_passing": 0, "lint_issues": 0, "errors": 0, "has_test": false },
  "minimal": { "files_created": 0, "tests_passing": 0, "lint_issues": 0, "errors": 0, "has_test": false }
}
```
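Writing a results file matching that schema can be sketched as below. The metric values are hardcoded for illustration, and the metrics directory is swapped for a temp path so the snippet is runnable; real use writes under `~/.claude/metrics/{{slug}}/`:

```shell
# Sketch: persist benchmark results (temp dir stands in for ~/.claude/metrics)
metrics_dir=$(mktemp -d)
slug=demo
out="$metrics_dir/benchmark-$(date +%F).json"

cat > "$out" <<EOF
{
  "project": "$slug",
  "date": "$(date +%F)",
  "stack": "generic",
  "task": "generic-01",
  "full":    { "files_created": 3, "tests_passing": 4, "lint_issues": 0, "errors": 0, "has_test": true },
  "minimal": { "files_created": 2, "tests_passing": 1, "lint_issues": 5, "errors": 1, "has_test": false }
}
EOF

echo "saved $out"
```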
Important constraints
- NEVER run benchmark automatically — always require explicit user confirmation
- NEVER push or commit from worktrees — they are disposable
- If Claude Code execution fails in either worktree, report the failure and still show partial results
- Timeout each run at 5 minutes — if Claude hasn't finished, capture partial state
- Clean up worktrees even on failure
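The cleanup-on-failure constraint can be honored with an EXIT trap. The sketch below is a self-contained demo in a throwaway repo; real use would target the `/tmp/bench-*` worktrees:

```shell
# Sketch: guarantee worktree removal even when the benchmark run fails
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=bench@local -c user.name=bench commit -q --allow-empty -m base

wt="$repo/wt-full"
run_bench() {
  git worktree add -q "$wt" HEAD
  # Register cleanup before doing any risky work
  trap 'git worktree remove --force "$wt" 2>/dev/null' EXIT
  false                     # simulate the benchmark run failing
}

( run_bench ) || true       # subshell exit fires the trap despite the failure
if [ -d "$wt" ]; then status=leaked; else status=cleaned; fi
echo "worktree: $status"
```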