Claude-skills autoresearch
Run a rigorous autonomous experiment loop for any optimization target using explicit hypotheses, repeated trials, structured experiment logs, and local HTML reports. Use when asked to "run autoresearch", "optimize X in a loop", "start experiments", or "improve this with benchmark-driven iteration".
git clone https://github.com/ckorhonen/claude-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/ckorhonen/claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/autoresearch" ~/.claude/skills/ckorhonen-claude-skills-autoresearch && rm -rf "$T"
skills/autoresearch/SKILL.md
Autoresearch
Autonomous optimization is only useful when it behaves like disciplined research, not benchmark gambling.
This skill runs a strict experiment loop:
- clarify the target before changing code
- write down hypotheses before testing them
- measure with repeated trials, not one-off wins
- record every experiment, including failures
- generate local reports and graphs that are not checked in
Core Principles
- No silent scope inference
If the request is vague, stop and ask a short up-front Q&A before starting. Do not silently invent the target workload, correctness bar, or tradeoffs.
- One experiment = one hypothesis
State the proposed change and why it should help before making it. Avoid bundles of unrelated tweaks.
- All experiments are logged
Every experiment must be enumerated in a machine-readable ledger, including discarded ideas, crashes, and failed checks.
- Repeated measures beat noisy anecdotes
Do not keep a change because of one fast run. Use warmups, repeated measurements, and an explicit decision rule.
- Artifacts stay local
Reports, CSVs, JSONL ledgers, and graphs belong in a local .autoresearch/ directory and should not be committed.
Available scripts
- scripts/init_experiment.py — initialize .autoresearch/session.json, ensure .autoresearch/ stays untracked, and scaffold autoresearch.md when needed.
- scripts/run_experiment.py — run warmups and measured trials, parse METRIC lines, run optional checks, and emit a JSON experiment record.
- scripts/log_experiment.py — append an experiment to .autoresearch/results.jsonl, decide keep vs discard, and refresh CSV and HTML artifacts.
- scripts/render_report.py — regenerate .autoresearch/results.csv and .autoresearch/report.html from the JSONL ledger.
All scripts are non-interactive, expose --help, emit structured JSON on stdout, and keep diagnostics on stderr.
Default workflow
- Initialize the session after the up-front Q&A:
  python3 scripts/init_experiment.py \
    --goal "Reduce latency in hot path" \
    --metric-name latency_ms \
    --unit ms \
    --direction lower \
    --command ./autoresearch.sh \
    --checks-command ./autoresearch.checks.sh \
    --scope src/hot_path.ts
- Record the baseline:
  # Option A: save to file first, then log
  python3 scripts/run_experiment.py \
    --id baseline \
    --hypothesis "Control run" \
    --change-summary "No code changes" \
    --baseline \
    --output .autoresearch/baseline.json
  python3 scripts/log_experiment.py --input .autoresearch/baseline.json
  # Option B: pipe directly
  python3 scripts/run_experiment.py \
    --id baseline \
    --hypothesis "Control run" \
    --change-summary "No code changes" \
    --baseline \
    | python3 scripts/log_experiment.py
- Record each candidate experiment:
  python3 scripts/run_experiment.py \
    --id exp-001 \
    --hypothesis "Inlining removes allocation churn" \
    --change-summary "Inline helper and pre-size buffer" \
    --output .autoresearch/exp-001.json
  python3 scripts/log_experiment.py --input .autoresearch/exp-001.json
- Re-render artifacts on demand:
  python3 scripts/render_report.py
Up-Front Q&A
Before starting, gather or confirm all of the following. If any item is vague or missing, ask focused questions first.
- Objective — what are we optimizing?
- Primary metric — exact name, unit, and whether lower or higher is better
- Minimum meaningful improvement — default to 1% if the user does not specify
- Workload command — the exact command or script that represents reality
- Correctness gates — tests, typecheck, lint, eval set, accuracy floor, or any other non-negotiables
- Scope — which files may be changed, and which are off-limits
- Budget — time budget, compute budget, or stop criteria if the user has them
If the user says "just run with it," write the assumptions explicitly into
autoresearch.md before the first experiment, then encode them via python3 scripts/init_experiment.py ....
Workspace Setup
- Prefer a dedicated worktree on a fresh branch:
  git worktree add ../autoresearch-<goal>-<date> -b autoresearch/<goal>-<date>
  If a worktree is not practical, use a new branch in a clean working tree. Do not run this skill in a dirty tree with unrelated user changes.
- Create:
  - autoresearch.md — checked in, durable session brief
  - autoresearch.sh — checked in, benchmark runner
  - autoresearch.checks.sh — checked in only when correctness gates are required
  - .autoresearch/ — local artifact directory, not checked in
- Ensure local artifacts stay untracked:
  rg -qxF '.autoresearch/' .git/info/exclude || printf '\n.autoresearch/\n' >> .git/info/exclude
- Run python3 scripts/init_experiment.py --help if you need the exact interface, then initialize the session, collect a baseline with python3 scripts/run_experiment.py ..., and log it with python3 scripts/log_experiment.py ... before making any code changes.
Required Files
autoresearch.md
This is the durable contract for the session. A fresh agent should be able to resume from it without guessing.
# Autoresearch: <goal>

## Objective
<What is being optimized and why it matters.>

## Up-Front Answers
- Primary metric:
- Unit:
- Direction:
- Minimum meaningful improvement:
- Workload command:
- Correctness gates:
- Budget / stop criteria:

## Scope
- In scope:
- Off limits:

## Decision Rule
<How many warmups, how many measured trials, how "keep" is decided.>

## Experiment Ledger
`.autoresearch/results.jsonl`

## Report Outputs
- `.autoresearch/report.html`
- `.autoresearch/results.csv`
- `.autoresearch/plots/`

## Current Best Result
<Best known baseline or kept variant.>

## What We've Learned
<Key wins, dead ends, confounders, and structural insights.>
Update
autoresearch.md whenever assumptions change, a new best result is found, or a pattern becomes clear.
autoresearch.sh
Bash script with
set -euo pipefail that:
- performs only very fast pre-checks
- runs the real benchmark workload
- emits parseable METRIC name=value lines
Keep it as small and deterministic as possible. If you change the benchmark protocol, record that in
autoresearch.md and establish a new baseline.
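A minimal sketch of such a runner is shown below. The workload command (python3 bench.py) and the file it pre-checks are illustrative placeholders, and date +%s%N assumes GNU coreutils; adapt both to the real project.
#!/usr/bin/env bash
# Minimal benchmark-runner sketch; python3 bench.py stands in for the real workload.
set -euo pipefail

# Very fast pre-check: fail early if the workload entry point is missing.
test -f bench.py

# Run the real workload and time it in milliseconds (date +%s%N needs GNU coreutils).
start=$(date +%s%N)
python3 bench.py > /dev/null
end=$(date +%s%N)

# Emit the parseable line the measurement scripts look for.
echo "METRIC latency_ms=$(( (end - start) / 1000000 ))"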
autoresearch.checks.sh
Create this when the session has correctness gates. It must:
- validate correctness after a candidate benchmark passes
- not affect the primary metric timing
- keep output short and error-focused
- fail hard when the candidate is not safe to keep
If checks fail, the experiment disposition is
checks_failed, not keep. The default script workflow handles this automatically.
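A minimal sketch under the assumption that the project's correctness gate is a pytest suite (the test command is a placeholder for whatever the user declares as non-negotiable):
#!/usr/bin/env bash
# Minimal correctness-gate sketch; pytest -q is a placeholder for the real suite.
set -euo pipefail

# Keep output short; pipefail makes the gate fail hard whenever the suite fails,
# even though stdout is trimmed to the last few lines.
pytest -q 2>&1 | tail -n 20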
Local Artifact Contract
Everything below lives in
.autoresearch/ and should remain untracked.
.autoresearch/results.jsonl
Append one JSON object per experiment. This is the source of truth for enumerating all experiments.
Required fields:
{ "id": "exp-007", "timestamp": "2026-03-14T13:00:00Z", "hypothesis": "Inlining X removes allocation churn in hot path Y.", "change_summary": "Inline helper and pre-size buffer.", "files_touched": ["src/hot_path.ts"], "baseline_commit": "abc1234", "candidate_ref": "def5678", "metric_name": "latency_ms", "direction": "lower", "warmup_trials": [12.8, 12.7], "measured_trials": [11.9, 12.0, 11.8, 11.9, 12.1], "summary": { "median": 11.9, "mean": 11.94, "min": 11.8, "max": 12.1 }, "secondary_metrics": { "memory_mb": 84.1 }, "checks": "passed", "disposition": "keep", "reason": "Median improved by 5.2%, checks passed, complexity acceptable." }
.autoresearch/results.csv
Flatten the JSONL into a spreadsheet-friendly summary after each experiment or at regular intervals.
.autoresearch/report.html
Generate an HTML report that visualizes the session. At minimum include:
- a table enumerating every experiment and disposition
- primary metric over time
- best-so-far trend
- per-experiment measured-trial distribution
- checks pass/fail/crash counts
- summary of kept improvements vs. discarded ideas (see the ledger one-liners after this list)
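The check and disposition counts can be derived straight from the ledger; assuming jq is available:
jq -r '.disposition' .autoresearch/results.jsonl | sort | uniq -c
jq -r '.checks' .autoresearch/results.jsonl | sort | uniq -c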
.autoresearch/plots/
Store generated charts or exported images here when the report uses external assets.
Experiment Card
Before each experiment, write a short experiment card into
autoresearch.md or a temporary note:
- id
- hypothesis
- planned change
- why this should affect the metric
- files expected to change
- predicted direction and rough magnitude
- rollback plan
If you cannot explain why the change should help, do not run the experiment yet.
Measurement Protocol
Use this default unless the user specifies something stricter:
- Warm up first
  - Run at least 2 warmup trials that are not used for the decision.
- Measure repeatedly
  - Run at least 5 measured trials for the baseline and for each serious candidate.
  - If noise is high or the win is marginal, increase the sample size. Do not accept ambiguous results.
- Summarize robustly
  - Use the median as the primary decision statistic.
  - Also record mean, min, and max so variance is visible.
- Apply a pre-declared threshold
  - Keep a change only when it beats the current best by at least the minimum meaningful improvement threshold and passes all correctness checks. (A worked example follows this list.)
- Reset baseline when needed
  - If the workload, machine conditions, dataset, or benchmark harness changes materially, establish a fresh baseline and note why.
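A worked example of the threshold rule for a lower-is-better metric, with purely illustrative numbers; remember that correctness checks must also pass before the result can be kept:
baseline_median=12.6
candidate_median=11.9
min_improvement_pct=1
awk -v b="$baseline_median" -v c="$candidate_median" -v t="$min_improvement_pct" 'BEGIN {
  gain = (b - c) / b * 100   # relative improvement in percent
  printf "%s (gain %.1f%%)\n", (gain >= t ? "keep" : "discard"), gain
}'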
Decision Rules
- keep
  - candidate beats the current best according to the declared metric direction
  - improvement clears the minimum meaningful improvement threshold
  - correctness checks pass
  - no unacceptable regression appears in a required secondary metric
- discard
  - candidate is worse
  - candidate is effectively equal
  - result is too noisy to justify confidence
  - complexity cost is not worth the gain
- checks_failed
  - benchmark may have improved, but required correctness gates failed
- crash
  - candidate broke the workload or could not be measured reliably
Never upgrade an ambiguous result to
keep because it feels promising.
Loop Behavior
Run autonomously, but do not run forever without learning. Continue until one of these is true:
- time or compute budget is exhausted
- no meaningful improvement has been found after several distinct hypotheses
- all promising ideas are exhausted
- the user interrupts or redirects the work
During the loop:
- prefer one meaningful change at a time
- avoid retrying the same failed idea with superficial variation
- write down dead ends so future agents do not repeat them
- keep the worktree clean between experiments
When discarding a candidate, restore back to the last committed good state in the isolated worktree before starting the next idea.
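One way to do this in the isolated worktree is a hard reset plus clean; .autoresearch/ survives because it is listed in .git/info/exclude and git clean does not touch ignored files unless -x is passed:
git reset --hard HEAD   # drop the rejected candidate's tracked changes
git clean -fd           # remove untracked leftovers; ignored files such as .autoresearch/ are kept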
Common Pitfalls
Autoresearch is disciplined empiricism, but several failure modes emerge in long-running optimization loops. Knowing these patterns helps prevent wasted experiments and broken workflows.
1. Experiment Loop Doesn't Terminate When Optimization Plateaus
Symptom:
- Agent continues running new experiments for many cycles after the last improvement
- No clear stopping point; loop feels directionless after 10-15 experiments
- Time/compute budget consumed without discovering new insights
Root Cause:
- Missing exit conditions beyond budget exhaustion
- Agent assumes "more experiments = more knowledge" without checking for learning curve saturation
- No systematic detection of repeated hypothesis patterns or diminishing returns
Mitigation:
- Set a plateau threshold in autoresearch.md: stop after N consecutive experiments with no improvement, e.g., "Stop if 3 experiments in a row show <0.5% gain"
- Track hypothesis diversity — if new hypotheses are rewording old ideas, stop and document learnings
- Implement a moving-window heuristic: calculate improvement velocity over the last 5 experiments; stop when velocity < 0.1% per experiment (see the sketch after this list)
- Record stopping reason explicitly in autoresearch.md when exiting early
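A rough plateau check under these assumptions (lower-is-better metric, summary.median recorded per experiment, jq available) compares the best median seen so far against the best of the last three experiments; if the recent experiments did not set a new overall best, the loop has stalled:
medians=$(jq -r '.summary.median' .autoresearch/results.jsonl)
best_overall=$(printf '%s\n' "$medians" | sort -n | head -n 1)
best_recent=$(printf '%s\n' "$medians" | tail -n 3 | sort -n | head -n 1)
echo "best overall: $best_overall, best of last 3: $best_recent"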
2. Hypothesis Generation Becomes Repetitive After 5–10 Iterations
Symptom:
- Later experiments are minor variations on earlier ideas (e.g., "buffer size 16 vs. 32 vs. 24")
- Hypotheses lack distinct causal theories; just tweaking constants
- Diminishing variance in results, but no clear winner emerges
Root Cause:
- Agent defaults to local gradient descent without pausing to reflect on root causes
- No explicit hypothesis inventory or cross-check for novelty
- Hypothesis space is not well-explored before drilling into micro-optimizations
Mitigation:
- Before each experiment, write the hypothesis in a form that answers "why should this improve the metric?" If the answer is "because I'm turning a knob," pause and sketch a new direction first
- Document dead ends — when discarding an experiment, explicitly note why it failed so future hypotheses don't retread that ground
- Rotate hypothesis types every 3–4 experiments:
- algorithmic changes (e.g., trade hash table for skiplist)
- data structure changes (e.g., pre-allocation, layout)
- control flow changes (e.g., branch prediction, loop unrolling)
- external factors (e.g., compiler flags, sampling rate)
- Checkpoint & reflect: after 5 experiments, re-read autoresearch.md and explicitly write a "What We've Learned" section with 2–3 key insights. If that section is empty, you're spinning.
3. Log Files Grow Unbounded on Long Runs
Symptom:
- .autoresearch/ directory balloons to 10s or 100s of MB after 50+ experiments
- Runtime of python3 scripts/render_report.py slows down measurably over time
- Disk usage becomes an operational concern on resource-constrained machines
Root Cause:
- Each experiment produces a full JSON record, and records are never pruned
- If measurements include verbose debug output or traces, they accumulate in the JSONL
- HTML reports may embed all raw data inline, leading to multi-MB HTML files
Mitigation:
- Archive periodically: every 20 experiments, archive old results to .autoresearch/archive/YYYYMMDD-run-N.jsonl and restart with a fresh results.jsonl (a sketch of this step follows this list)
- Limit measurement output: ensure autoresearch.sh does not emit verbose logs; capture only METRIC lines
- Store summaries, not raw traces: in results.jsonl, record {"summary": {...}, "median": X} instead of all raw trial values if space is critical
- Document archival in autoresearch.md: note when the ledger was archived and how to access previous runs
- Monitor artifact size: add a check in the loop:
  du -sh .autoresearch/ | grep -qE '[0-9]{3,}M|[0-9.]+G' && echo "WARNING: artifact directory >100MB"
4. HTML Reports Don't Render Properly When Experiments > 100
Symptom:
- Browser lags or hangs when opening .autoresearch/report.html after many experiments
- Charts fail to render or display only partial data
- Page size balloons to 50+ MB, making it impractical to share or version-control
Root Cause:
- render_report.py generates a single monolithic HTML file with all data inlined
- JavaScript charting library chokes on 100+ data points without pagination or lazy loading
- No truncation or summary mode for large result sets
Mitigation:
- Paginate results: split the experiment ledger into chunks of 50 and generate separate report pages (report-01.html, report-02.html, etc.) with navigation links
- Use summary mode for large runs: if len(results) > 100, render a condensed report that includes:
  - table of kept experiments only (discard, crash, checks_failed filtered out)
  - best-so-far trend line (not all individual points)
  - per-phase statistical summary instead of per-experiment distributions
- Offload heavy data to JSON: embed a .autoresearch/data.json with the full result set and reference it from the HTML with client-side rendering, so the HTML itself stays under 1 MB
- Auto-detect and warn: in render_report.py, if result count > 100, log: "WARNING: report contains >100 experiments; consider archiving old runs or using --summary-mode" (a pre-render guard is sketched after this list)
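A pre-render guard along these lines can live in the loop itself; the 100-experiment cutoff mirrors the mitigation above, and --summary-mode is only a suggested flag, not an existing one:
count=$(wc -l < .autoresearch/results.jsonl)
if [ "$count" -gt 100 ]; then
  echo "WARNING: $count experiments in ledger; consider archiving before rendering" >&2
fi
python3 scripts/render_report.py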
Report Generation
Generate or refresh
.autoresearch/report.html periodically and always at the end of the session:
python3 scripts/render_report.py
The report should answer:
- what experiments were tried
- which ones won, failed, or crashed
- how the best metric evolved over time
- what tradeoffs appeared in secondary metrics
- what the final recommended change is and why
If rich charting is cumbersome, generate a simple static HTML file with inline SVG charts or lightweight local assets. The report matters more than polish.
Resume Behavior
When resuming:
- read autoresearch.md
- inspect .autoresearch/results.jsonl
- inspect .autoresearch/report.html if present
- review git history for kept commits
- continue from the best known state, not from stale assumptions
User Messages During Experiments
If the user sends a message while an experiment is running, finish the current measurement cycle if it is short and safe to do so, then respond. If the run is long or risky, stop at the nearest safe checkpoint and reply.