claude-code-skills: ln-840-benchmark-compare

Use when benchmarking hex-line MCP against Claude built-in tools with scenario manifests, activation checks, and diff-based correctness.

Install:

```bash
git clone https://github.com/levnikolaevich/claude-code-skills
```

One-line install into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/levnikolaevich/claude-code-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills-catalog/ln-840-benchmark-compare" ~/.claude/skills/levnikolaevich-claude-code-skills-ln-840-benchmark-compare && rm -rf "$T"
```

Source: `skills-catalog/ln-840-benchmark-compare/SKILL.md`

Paths: file paths (`references/`, `shared/`) are relative to the skills repo root. Locate this SKILL.md directory and go up one level for the repo root.
# Benchmark Compare

Type: L3 Worker | Category: 8XX Optimization -> 840 Benchmark
Run a clean A/B benchmark in Claude Code: one session with built-in tools only, one with hex-line. The benchmark is scenario-based, diff-validated, manifest-driven, and runtime-backed. It measures activation, correctness, time, cost, and tokens. The current runner is intentionally scoped to this internal A/B; it does not, by itself, prove best-in-class performance against external alternatives.
## Input / Output
| Direction | Content |
|---|---|
| Input | Repo checkout containing the files listed under Prerequisites, optional `goals.md` path, optional `expectations.json` path |
| Output | Comparison report in `skills-catalog/ln-840-benchmark-compare/results/` plus a machine-readable benchmark summary artifact |
## Prerequisites

- `claude --version` succeeds
- `git` succeeds
- `mcp/hex-line-mcp/server.mjs` exists
- `mcp/hex-line-mcp/hook.mjs` exists
- `skills-catalog/ln-840-benchmark-compare/references/goals.md` exists
- `skills-catalog/ln-840-benchmark-compare/references/expectations.json` exists
- `skills-catalog/ln-840-benchmark-compare/references/mcp-bench.json` exists
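These checks can be scripted as a small gate before any session starts; a minimal sketch, assuming the skills repo root is the working directory:

```shell
# Collect every missing prerequisite instead of stopping at the first one.
missing=0
need_cmd()  { command -v "$1" >/dev/null 2>&1 || { echo "missing command: $1"; missing=1; }; }
need_file() { [ -f "$1" ] || { echo "missing file: $1"; missing=1; }; }

need_cmd claude
need_cmd git
need_file mcp/hex-line-mcp/server.mjs
need_file mcp/hex-line-mcp/hook.mjs
need_file skills-catalog/ln-840-benchmark-compare/references/goals.md
need_file skills-catalog/ln-840-benchmark-compare/references/expectations.json
need_file skills-catalog/ln-840-benchmark-compare/references/mcp-bench.json

if [ "$missing" -eq 0 ]; then echo "prerequisites OK"; else echo "prerequisites INCOMPLETE"; fi
```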
## Quick Run

```bash
bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh \
  [skills-catalog/ln-840-benchmark-compare/references/goals.md] \
  [skills-catalog/ln-840-benchmark-compare/references/expectations.json]
```
Optional extra session profile:

```bash
EXTRA_SESSION_ID=other-mcp \
EXTRA_SESSION_LABEL="Other MCP" \
EXTRA_MCP_CONFIG=/abs/path/to/other-mcp.json \
EXTRA_SETTINGS='{"disableAllHooks":true}' \
bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh
```
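The file passed via `EXTRA_MCP_CONFIG` follows Claude Code's standard MCP config shape. A hypothetical sketch — the `other-mcp` server name and path are placeholders, not part of this repo:

```shell
# Write a placeholder MCP config for a third session profile.
# "other-mcp" and the server path are illustrative values only.
cat > /tmp/other-mcp.json <<'EOF'
{
  "mcpServers": {
    "other-mcp": {
      "command": "node",
      "args": ["/abs/path/to/other-mcp/server.mjs"]
    }
  }
}
EOF
echo "wrote /tmp/other-mcp.json"
```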
## Monitor Integration (Claude Code 2.1.98+)

MANDATORY READ: Load `shared/references/monitor_integration_pattern.md`

Stream benchmark progress:

```
Monitor(command="bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh 2>&1 | grep --line-buffered -E 'scenario|PASS|FAIL|error|session'", timeout_ms=3600000, description="benchmark run")
```

Fallback: `Bash(run_in_background=true)`.
The runner handles:

- syntax preflight
- SessionStart preflight
- scenario extraction from `goals.md`
- isolated worktrees per scenario/session
- per-scenario diffs
- final comparison report
Current scope:

- built-in Claude session
- Claude plus `hex-line`
- optional third Claude-compatible session profile through `EXTRA_SESSION_*` environment variables
External baseline note:

- use the same `goals.md` and `expectations.json`
- do not rewrite scenarios to fit the external tool
- do not make "top tool" claims from the internal A/B alone
- the optional third session profile is only valid when it can emit the same `stream-json` log shape and diff artifacts
## Workflow

### Phase 1: Define The Canonical Suite

Use one canonical pair owned by this skill:

- `skills-catalog/ln-840-benchmark-compare/references/goals.md`
- `skills-catalog/ln-840-benchmark-compare/references/expectations.json`
Rules:

- The suite must be a balanced mix of common engineering scenarios.
- Do not design the suite to favor `hex-line`.
- Every scenario in `goals.md` must have a matching entry in `expectations.json`.
- `expectations.json` is the source of truth for correctness.
- The same pair must be reused unchanged for any future external baseline.
Supported expectation fields per scenario:
| Field | Meaning |
|---|---|
| | Scenario identifier used in result filenames |
| | Files that must change |
| | Files that must not change |
| | Regex patterns required in the saved diff |
| | Regex patterns that must not appear in the diff |
| | Regex patterns required in the final assistant result text |
| | Regex patterns that must match at least one Bash command |
| | If set, no extra changed files are allowed |
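Since the concrete key names live in `expectations.json` itself, the entry below is only an illustrative sketch of the shape implied by the table; every key name here is hypothetical:

```shell
# Hypothetical expectations entry -- key names are illustrative, not the
# canonical schema; consult references/expectations.json for real names.
cat > /tmp/expectations-example.json <<'EOF'
{
  "fix-null-check": {
    "must_change": ["src/parser.js"],
    "must_not_change": ["package.json"],
    "diff_must_match": ["=== null"],
    "diff_must_not_match": ["console\\.log"],
    "result_must_match": ["[Ff]ixed"],
    "commands_must_match": ["npm test"],
    "strict": true
  }
}
EOF
python3 -m json.tool /tmp/expectations-example.json >/dev/null && echo "valid JSON"
```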
### Phase 2: Preflight

The runner must pass:

- `node --check server.mjs`
- `node --check hook.mjs`
- `node --check extract-scenarios.mjs`
- `node --check parse-results.mjs`
- SessionStart smoke check from `hook.mjs`
If preflight fails, the benchmark is invalid and must stop before scenarios run.
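The syntax half of the preflight can be sketched as one loop that refuses to continue past the first failing file. The `PREFLIGHT_CHECK` override below is an illustrative knob for testing, not part of the runner's interface:

```shell
# Syntax-check each runner script; abort the benchmark on first failure.
# Checker defaults to `node --check` but can be swapped for testing.
preflight() {
  local check="${PREFLIGHT_CHECK:-node --check}" f
  for f in "$@"; do
    $check "$f" || { echo "preflight FAILED: $f"; return 1; }
  done
  echo "preflight OK"
}
```

If the function returns non-zero, the runner must stop before any scenario executes.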
### Phase 3: Execute Per Scenario

For each `##` scenario in `goals.md`:

- generate a standalone prompt file
- create two clean worktrees from the same commit
- run built-in Claude session
- run hex-line Claude session
- save `.jsonl` logs and `.diff.txt` artifacts
- remove both worktrees
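The worktree isolation above can be sketched as one helper per scenario/session run; the session invocation itself is elided, and the helper name is illustrative:

```shell
# One throwaway worktree per scenario/session: create at a pinned commit,
# capture the diff artifact, then always remove the worktree.
run_in_worktree() {
  local repo="$1" commit="$2" label="$3" diff_out="$4" wt
  wt=$(mktemp -d) && rmdir "$wt"   # reserve a fresh path for the worktree
  git -C "$repo" worktree add --detach "$wt" "$commit" >/dev/null 2>&1
  # ... run the Claude session inside "$wt" here ...
  git -C "$wt" diff > "$diff_out"              # per-scenario .diff.txt artifact
  git -C "$repo" worktree remove --force "$wt" # clean up even if diff is empty
  echo "saved diff for $label -> $diff_out"
}
```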
Built-in session:

- no MCP
- hooks disabled

Hex-line session:

- resolved MCP config pointing to `server.mjs`
- `outputStyle: "hex-line"`
- `PreToolUse` hook through `hook.mjs`
### Phase 4: Parse Results

`parse-results.mjs` evaluates each scenario for both sessions.
Scenario pass requires:
- valid run
- successful session completion
- changed files match expectations
- diff patterns match expectations
- result text patterns match expectations
- required commands were actually executed
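The diff-pattern portion of this check can be sketched with `grep -E` over the saved diff; pattern lists are space-separated here for brevity (an illustrative helper, not the parser's actual interface):

```shell
# PASS only if every required regex appears in the diff and no forbidden
# regex does. Patterns are space-separated lists in this sketch.
diff_patterns_pass() {
  local diff_file="$1" required="$2" forbidden="$3" p
  for p in $required; do
    grep -Eq "$p" "$diff_file" || { echo "FAIL: missing pattern $p"; return 1; }
  done
  for p in $forbidden; do
    grep -Eq "$p" "$diff_file" && { echo "FAIL: forbidden pattern $p"; return 1; }
  done
  echo "PASS"
}
```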
### Phase 5: Read The Report
The final report has these sections:
- Scenario Outcomes
- Activation
- Time
- Cost
- Tokens
- Tool Totals
- Validity
Interpretation rules:

- `invalid run` means setup/adoption failure, not product performance
- scenario `FAIL` means the correctness contract was not met
- activation is part of product quality for `hex-line`, not external noise
- this report is necessary for internal A/B evaluation, but not sufficient for best-alternative claims
## Report Contract

`skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md` must answer:
- Did each scenario complete correctly?
- Did `hex-line` activate cleanly without discovery drift?
- What changed in wall time, API time, cost, output tokens, and total tool calls?
- Was the run valid?
Do not treat raw time/cost as sufficient without scenario correctness.
## External Baseline Policy

- This skill owns the canonical suite, not a universal leaderboard.
- If maintainers compare `hex-line` against external alternatives, they must reuse the same `goals.md`, `expectations.json`, and diff-based evaluation rules.
- External runs may use different harnesses, but they must preserve the same task text, starting commit, and correctness contract.
- If an external tool cannot satisfy the contract format, record that as a harness limitation instead of rewriting the suite to accommodate it.
- A report that only covers built-in Claude vs `hex-line` must say so explicitly.
## Runtime Contract

MANDATORY READ: Load `shared/references/benchmark_worker_runtime_contract.md` and `shared/references/coordinator_summary_contract.md`

Runtime CLI:

```bash
node shared/scripts/benchmark-worker-runtime/cli.mjs start --skill ln-840-benchmark-compare --identifier suite-default --manifest-file <file>
node shared/scripts/benchmark-worker-runtime/cli.mjs checkpoint --skill ln-840-benchmark-compare --identifier suite-default --phase PHASE_0_CONFIG --payload '{...}'
node shared/scripts/benchmark-worker-runtime/cli.mjs record-summary --skill ln-840-benchmark-compare --identifier suite-default --payload '{...}'
node shared/scripts/benchmark-worker-runtime/cli.mjs complete --skill ln-840-benchmark-compare --identifier suite-default
```
Required state fields:

- `report_ready`
- `summary_recorded`
- `final_result`
- `self_check_passed`

Domain checkpoints:

- `PHASE_0_CONFIG`
- `PHASE_1_PREFLIGHT`
- `PHASE_2_LOAD_SUITE`
- `PHASE_3_RUN_SCENARIOS`
- `PHASE_4_PARSE_RESULTS`
- `PHASE_5_WRITE_REPORT`
- `PHASE_6_WRITE_SUMMARY`
- `PHASE_7_SELF_CHECK`
Guard rules:

- do not advance without checkpointing the current phase
- do not complete before the `benchmark-worker` summary is recorded
- do not complete before the self-check passes
## Runtime Coordination

- Managed runs may pass a deterministic `runId` and an exact `summaryArtifactPath`.
- Standalone runs are supported. If both are omitted, the runtime creates a standalone run and writes the default summary artifact path for the `benchmark-worker` family.
## Runtime Summary Artifact

MANDATORY READ: Load `shared/references/coordinator_summary_contract.md`

Emit a `benchmark-worker` summary envelope after the comparison report is written.

Managed mode:

- write to the exact `summaryArtifactPath`

Standalone mode:

- write `.hex-skills/runtime-artifacts/runs/{run_id}/benchmark-worker/ln-840-benchmark-compare--{identifier}.json`
Recommended payload:

- `scenarios_total`
- `scenarios_passed`
- `scenarios_failed`
- `activation_valid`
- `validity_verdict`
- `report_path`
- `warnings`
- `metrics`
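Putting the standalone path rule and the recommended payload together, a writer might look like this minimal sketch (the helper name and all payload values are illustrative):

```shell
# Write a benchmark-worker summary artifact at the standalone path.
# Payload values below are illustrative placeholders.
write_summary() {
  local run_id="$1" identifier="$2"
  local dir=".hex-skills/runtime-artifacts/runs/$run_id/benchmark-worker"
  mkdir -p "$dir"
  cat > "$dir/ln-840-benchmark-compare--$identifier.json" <<'EOF'
{
  "scenarios_total": 6,
  "scenarios_passed": 5,
  "scenarios_failed": 1,
  "activation_valid": true,
  "validity_verdict": "valid",
  "report_path": "skills-catalog/ln-840-benchmark-compare/results/2026-03-24-comparison.md",
  "warnings": [],
  "metrics": {}
}
EOF
  echo "$dir/ln-840-benchmark-compare--$identifier.json"
}
```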
## Known Pitfalls
| Pitfall | Solution |
|---|---|
| SessionStart not present in hex-line run | Fail preflight and stop |
| Agent drifts into built-in tools before hex-line use | Treat as activation problem and capture in report |
| Worktree already exists from prior crash | Remove it before adding a new one |
| Diff artifacts missing | Treat scenario correctness as failed |
| Simple scenario favors built-ins | Keep it in the suite if it is common; honesty beats cherry-picking |
| External comparison uses edited scenarios or relaxed expectations | Treat the comparison as invalid |
## Definition of Done

- `goals.md` defines the canonical balanced suite
- `expectations.json` fully describes scenario correctness
- Runner passes syntax and SessionStart preflight
- Each scenario runs in two clean worktrees from the same commit
- Parser evaluates activation and scenario correctness from logs plus diffs
- Final report is saved to `skills-catalog/ln-840-benchmark-compare/results/`
- `benchmark-worker` summary artifact is written to the managed or standalone runtime path
- Temporary worktrees are removed
- Report states clearly whether it is internal A/B only or includes additional external baselines
Version: 2.0.0 | Last Updated: 2026-03-24