# autoresearch-ai-plugin: autoresearch

Clone the repository:

```bash
git clone https://github.com/proyecto26/autoresearch-ai-plugin
```

Or copy just the skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/proyecto26/autoresearch-ai-plugin "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/autoresearch" ~/.claude/skills/proyecto26-autoresearch-ai-plugin-autoresearch \
  && rm -rf "$T"
```
The skill definition lives at `skills/autoresearch/SKILL.md`:

# Autoresearch: Autonomous Experiment Loop
An autonomous optimization loop in which Claude edits code, runs a benchmark, measures a metric, and keeps each change or reverts it — repeating until stopped.
## Core Concept
The loop is simple: edit → commit → run → measure → keep or discard → repeat.
- Primary metric is king. Lower (or higher, depending on direction) is better. Improved → keep the commit. Equal or worse → `git revert`.
- State survives context resets via `autoresearch.jsonl` (append-only log) and `autoresearch.md` (living session document).
- Domain-agnostic. Works for any measurable target: test speed, bundle size, LLM training loss, Lighthouse scores, build times, etc.
- Do not overfit to the benchmark, and do not cheat on it. Optimize the real workload, not the measurement harness.
## Setup Phase
When the user triggers autoresearch, gather the following (ask if not provided):
- Goal — what to optimize (e.g., "reduce unit test runtime")
- Command — the benchmark to run (e.g., `pnpm test`, `uv run train.py`)
- Primary metric — name, unit, and direction (`lower` or `higher` is better)
- Secondary metrics — optional additional metrics to track for tradeoff monitoring (e.g., memory, compile time)
- Files in scope — which files can be modified
- Constraints — time budget, off-limits files, correctness requirements
Optionally check for `.claude/autoresearch-ai-plugin.local.md` in the project root for persistent configuration:

```markdown
---
enabled: true
max_iterations: 50
working_dir: "/path/to/project"
benchmark_timeout: 600
checks_timeout: 300
---

# Autoresearch Configuration

Additional context or notes for this project's autoresearch setup.
```
- `enabled` — whether autoresearch is active (default: true)
- `max_iterations` — stop after N experiments (default: 0 = unlimited)
- `working_dir` — override directory for experiment files (default: current directory)
- `benchmark_timeout` — benchmark timeout in seconds (default: 600)
- `checks_timeout` — correctness checks timeout in seconds (default: 300)

If the file doesn't exist, use defaults. The file should be added to `.gitignore` (`.claude/*.local.md`).
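Reading a single value out of the file is easily scriptable. A sketch, assuming plain `key: value` frontmatter lines (this is not the skill's actual parser):

```bash
# Hypothetical one-liner: extract max_iterations, falling back to the default 0
cfg=.claude/autoresearch-ai-plugin.local.md
max=$(grep -m1 '^max_iterations:' "$cfg" 2>/dev/null | awk '{print $2}')
echo "max_iterations=${max:-0}"
```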
Then execute these setup steps:
- Create a branch: `git checkout -b autoresearch/<goal>-<date>`
- Ensure session files are gitignored (critical — `git revert` will fail if `autoresearch.jsonl` is tracked):

  ```bash
  echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
  git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
  ```

- Read all files in scope thoroughly to understand the codebase
- Write `autoresearch.md` — the session document (see `examples/autoresearch.md`)
- Write `autoresearch.sh` — the benchmark script (see `examples/autoresearch.sh`)
- Optionally write `autoresearch.checks.sh` — correctness checks (tests, lint, types); a minimal sketch follows this list
- Commit session files
- Run baseline: `bash autoresearch.sh`
- Parse metrics from output (lines matching `METRIC name=value`)
- Record baseline in `autoresearch.jsonl` (`"type":"config"` header first, then baseline result)
- Begin the experiment loop
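A minimal shape for the optional checks script. The `pnpm` script names below are assumptions about the host project, not part of the skill:

```bash
#!/usr/bin/env bash
# autoresearch.checks.sh (sketch) — any non-zero exit marks the experiment checks_failed
set -euo pipefail

pnpm lint       # style and static analysis
pnpm typecheck  # type errors
pnpm test       # correctness suite
```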
## The Experiment Loop
LOOP FOREVER. Never ask "should I continue?" — just keep going.
The user might be asleep, away from the computer, or expects you to work indefinitely. If each experiment takes ~5 minutes, you can run ~12/hour, ~100 overnight. The loop runs until the user interrupts you, period.
Each iteration:
1. Read current git state and `autoresearch.md`
2. Choose an experimental change (informed by past results and ASI notes)
3. Edit files in scope
4. `git add <files> && git commit -m "experiment: <description>"`
5. Run: `bash autoresearch.sh > run.log 2>&1`
6. Parse `METRIC` lines from output
7. If `autoresearch.checks.sh` exists, run it (separate timeout, default 300s)
8. Decide: keep or discard
9. Log result to `autoresearch.jsonl` (include ASI annotations)
10. If discard/crash: `git revert $(git rev-parse HEAD) --no-edit`
11. Update `autoresearch.md` with learnings (every few experiments)
12. Repeat
## Decision Rules

- Metric improved → `keep` (commit stays, branch advances)
- Metric equal or worse → `discard` (run `git revert $(git rev-parse HEAD) --no-edit`; a decision sketch follows this list)
- Crash or checks failed → `discard` (revert, note the failure in ASI)
- Simpler code for equal perf → `keep` (removing complexity is a win)
- Catastrophic secondary metric regression → consider `discard` even if primary improved (e.g., 1% speed gain but 10x memory usage)
- If stuck → think deeper, try a different approach. Consult `autoresearch.ideas.md` if it exists. Re-read source files for new angles. Try combining previous near-misses. Try more radical changes. Read any papers or docs referenced in the code.
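For a lower-is-better metric, the first two rules reduce to a comparison plus a revert. A minimal sketch, where `$new` and `$best` are placeholders already parsed from `METRIC` output (`awk` handles the floating-point comparison):

```bash
# Keep if strictly better than the best so far; otherwise revert the experiment commit.
if awk -v n="$new" -v b="$best" 'BEGIN { exit !(n < b) }'; then
  best="$new"                                   # keep: the commit stays on the branch
else
  git revert "$(git rev-parse HEAD)" --no-edit  # discard: undo the experiment
fi
```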
## Simplicity Criterion
All else being equal, simpler is better. Weigh complexity cost against improvement magnitude:
- A 0.001 improvement that adds 20 lines of hacky code? Probably not worth it.
- A 0.001 improvement from deleting code? Definitely keep.
- Equal performance with much simpler code? Keep.
## Handling User Messages During Experiments
If the user sends a message while the loop is running:
- Finish the current experiment cycle (don't abandon mid-run)
- Address the user's feedback or question
- Resume the loop immediately after — do not wait for permission
## Benchmark Timeout
- Default benchmark timeout: 600 seconds (10 minutes)
- If a run exceeds the timeout, kill it and treat as a crash
- Checks timeout: 300 seconds (5 minutes), separate from benchmark
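Coreutils `timeout` is one way to enforce both limits (a sketch; `timeout` exits with code 124 when it kills the process):

```bash
# Benchmark gets 600 s; checks get a separate 300 s budget.
timeout 600 bash autoresearch.sh > run.log 2>&1 || echo "benchmark crashed or timed out"
[ -f autoresearch.checks.sh ] && timeout 300 bash autoresearch.checks.sh
```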
## Don't Thrash
If 3 consecutive experiments fail or get discarded:
- Stop and think about why
- Re-read the source files for new angles
- Try a fundamentally different approach
- Consult `autoresearch.ideas.md` for untried ideas
## Metric Output Format
Benchmark scripts output metrics as structured lines:
```
METRIC total_time=4.23
METRIC memory_mb=512
METRIC val_bpb=1.042
```
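A benchmark script emits these lines with plain `echo`. A minimal sketch, assuming the goal is `pnpm test` wall time (the command and metric name are illustrative):

```bash
#!/usr/bin/env bash
# autoresearch.sh (sketch) — time the workload, emit one METRIC line
set -euo pipefail

start=$(date +%s%N)          # nanoseconds (GNU date)
pnpm test > /dev/null 2>&1   # the workload being measured
end=$(date +%s%N)

echo "METRIC total_ms=$(( (end - start) / 1000000 ))"
```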
Parse these with the helper script at `${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh`:

```bash
bash autoresearch.sh 2>&1 | bash ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh
```
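The bundled script is the supported path; conceptually, the extraction amounts to something like this illustrative stand-in:

```bash
# Keep only METRIC lines and strip the prefix, yielding name=value pairs
grep '^METRIC ' run.log | sed 's/^METRIC //'
```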
## Secondary Metrics

Beyond the primary metric, output additional `METRIC` lines for tradeoff monitoring:

```
METRIC total_ms=4230        # primary
METRIC compile_ms=1200      # secondary — helps identify bottlenecks
METRIC memory_mb=512        # secondary — monitors resource usage
METRIC cache_hit_rate=0.85  # secondary — instrumentation data
```
Secondary metrics are tracked in the JSONL log and help guide future experiments, but they rarely affect keep/discard decisions (only discard if a catastrophic secondary regression accompanies a marginal primary improvement).
Output instrumentation data — phase timings, error counts, cache rates, domain-specific signals. This data guides the next iteration and helps identify where optimization effort should focus.
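A sketch of phase-level instrumentation inside `autoresearch.sh`, assuming hypothetical build and test phases:

```bash
# Time each phase separately so the log shows where the budget goes (GNU date).
t0=$(date +%s%N)
pnpm build > /dev/null 2>&1
t1=$(date +%s%N)
pnpm test > /dev/null 2>&1
t2=$(date +%s%N)

echo "METRIC compile_ms=$(( (t1 - t0) / 1000000 ))"  # secondary
echo "METRIC test_ms=$(( (t2 - t1) / 1000000 ))"     # secondary
echo "METRIC total_ms=$(( (t2 - t0) / 1000000 ))"    # primary
```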
## Actionable Side Information (ASI)
ASI is structured annotation per experiment that survives reverts. When code changes are discarded, only the description and ASI remain — making them the only structured memory of what happened.
Record ASI for every experiment:
{ "hypothesis": "Reducing loop iterations by breaking early", "result": "Marginal speedup but code readability suffered", "next_action_hint": "Try vectorization instead of loop unrolling", "bottleneck": "Memory bandwidth on L2 cache misses" }
ASI fields are free-form — use whatever keys are useful:
- `hypothesis` — what you expected
- `result` — what actually happened
- `next_action_hint` — guidance for the next experiment
- `bottleneck` — identified performance bottleneck
- `error_details` — crash/failure diagnostics
- Any other domain-specific observations
## Logging to autoresearch.jsonl

### Config Header (written once at setup)

```json
{"type":"config","name":"Optimize unit test runtime","metricName":"total_ms","metricUnit":"ms","bestDirection":"lower"}
```
### Experiment Results (appended after each run)
Each experiment appends one JSON line:
{"run":5,"commit":"abc1234","metric":4230,"metrics":{"compile_ms":1200,"memory_mb":512},"status":"keep","description":"parallelized test suites","timestamp":1700000000,"segment":0,"confidence":2.3,"asi":{"hypothesis":"parallel tests reduce wall time","next_action_hint":"try worker pool size tuning"}}
Fields:
- `run` — experiment number (1-indexed, sequential)
- `commit` — short git commit hash (7 chars)
- `metric` — primary metric value
- `metrics` — secondary metrics dict (optional)
- `status` — one of: `keep`, `discard`, `crash`, `checks_failed`
- `description` — brief description of what was tried
- `timestamp` — Unix timestamp (seconds)
- `segment` — session segment index (0-based, incremented when optimization target changes)
- `confidence` — MAD-based confidence score (null if < 3 experiments)
- `asi` — Actionable Side Information dict (optional, omit if empty)
Use `${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh` to append entries:

```bash
bash ${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh \
  --run 5 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 4230 \
  --status keep \
  --description "parallelized test suites" \
  --metrics '{"compile_ms":1200,"memory_mb":512}' \
  --segment 0 \
  --confidence 2.3 \
  --asi '{"hypothesis":"parallel tests reduce wall time"}'
```

Valid statuses: `keep`, `discard`, `crash`, `checks_failed`.
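Because each line is standalone JSON, the log is easy to query. For example, assuming `jq` is available, the best kept result for a lower-is-better metric:

```bash
# Config lines carry no "status" field, so the select() drops them automatically.
jq -s '[.[] | select(.status == "keep")] | min_by(.metric)' autoresearch.jsonl
```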
## Segments (Multi-Phase Sessions)
When the optimization target changes mid-session (different benchmark, metric, or workload):
- Write a new config header to `autoresearch.jsonl` with the updated target
- Increment the segment counter
- Old results stay in the JSONL but are filtered out as the previous phase
- Establish a new baseline for the new segment
This allows a single session to evolve — e.g., first optimize compilation speed, then switch to runtime performance.
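Starting a new segment is just another config line appended to the log (the values here are illustrative):

```bash
# Segment 1 begins: same session file, new target and metric.
echo '{"type":"config","name":"Optimize runtime performance","metricName":"run_ms","metricUnit":"ms","bestDirection":"lower"}' >> autoresearch.jsonl
```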
## Resuming After Context Reset
If `autoresearch.jsonl` and `autoresearch.md` exist in the working directory:

- Read `autoresearch.md` for full context (goal, metrics, files, constraints, learnings)
- Read `autoresearch.jsonl` to see all past experiments, current best, and ASI annotations (a recovery sketch follows this list)
- Check `autoresearch.ideas.md` if it exists — prune stale entries, experiment with remaining ideas
- Check git log to verify current branch state matches expected state
- Resume the loop from where it left off — no re-setup needed
- Resume immediately — do not ask "should I continue?"
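A sketch of recovering loop position from the log, assuming `jq` and a lower-is-better metric:

```bash
# Result lines carry no "type" field, so this counts experiments only.
runs=$(jq -s '[.[] | select(.type != "config")] | length' autoresearch.jsonl)
best=$(jq -s '[.[] | select(.status == "keep") | .metric] | min' autoresearch.jsonl)
echo "resuming at run $((runs + 1)); best metric so far: $best"
```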
## Confidence Scoring
After 3+ experiments, assess whether improvements are real or noise:
- Compute the Median Absolute Deviation (MAD) of all metric values in the current segment as a noise floor
- Confidence = |best improvement| / MAD
- ≥2.0× → likely real improvement (green)
- 1.0–2.0× → marginal, could be noise (yellow)
- <1.0× → within noise floor (red) — consider re-running to confirm
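A sketch of the noise-floor computation, assuming `jq` and `awk` are available (`references/confidence-scoring.md` is the authoritative description):

```bash
# median of one number per line on stdin
median() { sort -n | awk '{ a[NR] = $1 } END { print (NR % 2 ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2) }'; }

vals=$(jq -r 'select(.type != "config" and .metric != null) | .metric' autoresearch.jsonl)
med=$(echo "$vals" | median)
mad=$(echo "$vals" | awk -v m="$med" '{ d = $1 - m; print (d < 0 ? -d : d) }' | median)
echo "noise floor (MAD): $mad"
```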
Record confidence on each experiment result in the JSONL log. When confidence is low, consider:
- Running the benchmark multiple times inside `autoresearch.sh` and reporting the median
- Pinning CPU frequency or reducing system noise
- Making larger changes that produce clearer signal
See `references/confidence-scoring.md` for detailed methodology.
## Session Files

| File | Purpose | Created by |
|---|---|---|
| `autoresearch.md` | Living session document — goal, metrics, scope, learnings | Setup phase |
| `autoresearch.sh` | Benchmark script — outputs `METRIC` lines | Setup phase |
| `autoresearch.checks.sh` | Optional correctness checks (tests, lint, types) | Setup phase |
| `autoresearch.jsonl` | Append-only experiment log (survives restarts) | First experiment |
| `autoresearch.ideas.md` | Optional backlog of ideas to try | Anytime |
| `.claude/autoresearch-ai-plugin.local.md` | Optional persistent configuration (max_iterations, working_dir, timeouts) | User-provided |
## Cancel and Status

### Cancelling an Autoresearch Session
When the user asks to cancel or stop autoresearch:
- Finish the current experiment cycle if one is running
- Read `autoresearch.jsonl` to count total experiments and results
- Report a summary: goal, total runs, kept improvements, best metric
- Remove `.claude/autoresearch-ai-plugin.local.md` if it exists
- Do NOT delete `autoresearch.jsonl` or `autoresearch.md` — they contain valuable history
- Do NOT revert any kept commits — the improvements are real
- Inform the user they can resume later with `/autoresearch`
### Checking Session Status
When the user asks about autoresearch status or progress:
- Check if `autoresearch.jsonl` exists — if not, report "No active session"
- Read `autoresearch.md` for the goal and primary metric
- Parse `autoresearch.jsonl` to compute: total runs, kept/discarded/crashed counts, baseline vs best, improvement percentage, confidence score (a rollup sketch follows this list)
- Display a formatted summary
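The per-status counts are scriptable too; a sketch assuming `jq`:

```bash
# Count results per status, e.g. {"discard": 12, "keep": 5}
jq -s '[.[] | select(.status != null)] | group_by(.status) | map({ (.[0].status): length }) | add' autoresearch.jsonl
```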
## Additional Resources

### Reference Files

- `references/confidence-scoring.md` — Detailed MAD-based confidence methodology
- `references/best-practices.md` — Tips for writing good benchmarks, choosing experiments, ASI patterns, and avoiding pitfalls

### Example Files

- `examples/autoresearch.md` — Example session document template
- `examples/autoresearch.sh` — Example benchmark script with METRIC output
- `examples/autoresearch.checks.sh` — Example correctness checks script

### Utility Scripts

- `scripts/parse-metrics.sh` — Extract METRIC lines from benchmark output
- `scripts/log-experiment.sh` — Append an experiment result to `autoresearch.jsonl`