Claude-code-minoan autoresearch
git clone https://github.com/tdimino/claude-code-minoan
T=$(mktemp -d) && git clone --depth=1 https://github.com/tdimino/claude-code-minoan "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/core-development/autoresearch" ~/.claude/skills/tdimino-claude-code-minoan-autoresearch && rm -rf "$T"
skills/core-development/autoresearch/SKILL.md

Five Invariants (never violate)
- Single mutable surface — one hypothesis per iteration, one change per experiment
- Fixed eval budget — eval runs in bounded time, no network calls in gates
- One scalar metric — composite score drives keep/discard, not vibes
- Binary keep/discard — improved = keep, else revert
- Git-as-memory — every experiment is a commit, discards are reverts (`git reset --hard HEAD~1`), history is the log
Safety rules
- Never modify `.lab/` contents during hypothesis implementation
- Never skip eval — every commit must be evaluated before keep/discard
- Always revert on crash — `atexit` handler restores git state
- Runner uses subscription auth (`claude -p` with ANTHROPIC_API_KEY stripped)
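The crash-revert rule amounts to registering a cleanup hook before each experiment starts. Below is a minimal sketch of the idea, not the skill's actual runner code; `arm_crash_guard` and its disarm callback are illustrative names.

```python
import atexit
import subprocess

def git(*args: str) -> str:
    """Run a git command and return its stdout."""
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

def arm_crash_guard():
    """Record the current commit; if the process exits before the guard is
    disarmed, hard-reset back to it (the same mechanic as a discard)."""
    baseline = git("rev-parse", "HEAD")
    armed = {"on": True}

    def restore():
        if armed["on"]:
            subprocess.run(["git", "reset", "--hard", baseline], check=False)

    atexit.register(restore)
    return lambda: armed.update(on=False)  # call this once keep/discard resolves
```

A runner built this way would arm the guard before invoking `claude -p` and disarm it only after the experiment has been explicitly kept or reverted.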
Autoresearch
Scaffold and run autonomous code improvement loops in any git repo. The pattern: generate a hypothesis via `claude -p`, implement it, run programmatic eval gates, keep if the composite score improves, discard if it doesn't. Proven across 50+ iterations on two codebases (shadow-engine: 0.69 to 1.0; perplexity-clone: search quality optimization).
Category
Runbooks — mechanical process with clear steps, not cognitive reasoning.
Quick Start

    /autoresearch init      # scaffold .lab/ in your repo
    /autoresearch run       # start the loop (default: 50 iterations)
    /autoresearch status    # check progress
    /autoresearch resume    # recover interrupted run
Command Dispatch
Parse `$ARGUMENTS` and route:
| Argument | Action |
|---|---|
| `init` | Run scaffold workflow (see Init below) |
| `eval-gen` | Regenerate eval gates from repo analysis |
| `run` | Launch the autoresearch loop |
| `status` | Show composite, timeline, convergence signals |
| `resume` | Detect `.lab/`, present state, ask resume or fresh |
| (empty) | Show help text with available commands |
Init Workflow (`/autoresearch init`)
- Verify `.git/` exists in current directory
- Run stack detection: `python3 ~/.claude/skills/autoresearch/scripts/detect_stack.py`
- Review the detected stack info (language, build_cmd, test_cmd, lint_cmd)
- Run the scaffold script: `python3 ~/.claude/skills/autoresearch/scripts/scaffold.py --repo-root . --yes`
- Review `.lab/config.json` — adjust `keep_threshold`, `max_iterations`, `gate_weights` if needed
- Edit `.lab/program.md` — this is the most important file. Add:
  - Specific areas to improve (not vague goals)
  - Concrete hypothesis list (ranked)
  - Constraints the agent must respect
- Run baseline eval to verify gates work: `python3 .lab/eval.py`
- Report the initial composite to the user

If `.lab/` already exists, ask the user: resume existing lab, or archive to `.lab.bak.<timestamp>/` and start fresh?
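Before the first run it can help to confirm the scaffolded config holds sane values for the fields mentioned above. A small sketch that only assumes `.lab/config.json` is JSON and uses the field names documented in this skill (any key not listed here is an assumption about the template):

```python
import json
from pathlib import Path

config = json.loads(Path(".lab/config.json").read_text())

# Field names taken from this document; anything else in the file is repo-specific.
for key in ("repo_name", "build_cmd", "test_cmd", "lint_cmd",
            "keep_threshold", "max_iterations", "gate_weights"):
    print(f"{key:16} = {config.get(key)}")
```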
Eval-Gen Workflow (`/autoresearch eval-gen`)
Regenerate eval gates without re-scaffolding everything:
python3 ~/.claude/skills/autoresearch/scripts/eval_gen.py --repo-root . --output .lab/eval.py
Review the generated gates. The user may want to:
- Add custom gates for domain-specific behavior
- Adjust tier weights in `.lab/config.json`
- Add behavioral gates that test specific CLI invocations or API endpoints
Gates follow a 4-tier architecture:
| Tier | Weight | What it measures | Anti-cheat |
|---|---|---|---|
| T1: Build+Test | 0.20 | Compiles, tests pass, lint clean | Runs real commands, sums pass counts |
| T2: Behavioral | 0.40 | Integration tests, CLI output, API responses | Validates content, not file existence |
| T3: Pipeline | 0.25 | Build artifacts, installs, real I/O | File size >1KB, header validation |
| T4: Documentation | 0.15 | Test count floor, doc coverage | Counts code, never trusts comments |
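As a concrete illustration of a T2-style gate, the sketch below shells out to a real CLI invocation, validates the output content, and emits diagnostics in the three-tier protocol described later. It is standalone: the registration API in `eval_base.py` is not documented here, so no registration call is shown, and `mypackage` plus the `--version` check are placeholders for a real behavioral assertion.

```python
import subprocess
import sys
import time

def gate_cli_smoke() -> float:
    """T2-style behavioral gate: run a real CLI invocation and validate the
    output content, not just the existence of files."""
    start = time.monotonic()
    # "mypackage" and the --version check are placeholders for a real assertion.
    result = subprocess.run(["python3", "-m", "mypackage", "--version"],
                            capture_output=True, text=True, timeout=60)
    ok = result.returncode == 0 and result.stdout.strip() != ""

    # Three-tier diagnostics go to stderr (see the output protocol below).
    print(f"GATE cli_smoke={'PASS' if ok else 'FAIL'}", file=sys.stderr)
    print(f"METRIC cli_stdout_bytes={len(result.stdout)}", file=sys.stderr)
    print(f"TRACE gate_duration_ms={int((time.monotonic() - start) * 1000)}",
          file=sys.stderr)
    return 1.0 if ok else 0.0
```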
Run Workflow (`/autoresearch run`)
python3 .lab/runner.py --max-iterations 50
Or for a dry run (prints hypothesis, creates no files):
python3 .lab/runner.py --dry-run --max-iterations 1
Monitor progress in a separate terminal:
tail -f .lab/results.tsv
The runner:
- Loads config from `.lab/config.json`
- Reads `program.md` for constraints and hypothesis direction
- Creates an `autoresearch/{date}` branch
- Loops: hypothesis via `claude -p` -> implement via `claude -p` -> git commit -> eval -> keep/discard
- Logs every experiment to `.lab/results.tsv` with extended statuses:
| Status | Meaning |
|---|---|
| | Composite improved >= keep_threshold |
| | Primary improved but secondary metric regressed |
| | No improvement, reverted |
| | Negative result that reveals structure, logged to dead-ends |
| | Eval infrastructure failure, reverted |
| | Experiment exceeded timeout, logged as crash |
- Checks 9 convergence signals after each experiment (see `references/convergence-signals.md`)
- Re-validates baseline every 10 real experiments
- Auto-generates `.lab/eval-report.md` with cumulative progress
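The shape of that loop, reduced to its essentials, looks roughly like the sketch below. This is an illustration only, not `.lab/runner.py`: the prompts, the assumption that `eval.py` prints the composite as its last stdout line, and the commit-message handling are invented for the sketch; only `claude -p`, `python3 .lab/eval.py`, and `git reset --hard HEAD~1` come from the skill itself.

```python
import subprocess

def claude_p(prompt: str) -> str:
    """One-shot `claude -p` call (see Gotchas for the API-key handling)."""
    return subprocess.run(["claude", "-p", prompt], check=True,
                          capture_output=True, text=True).stdout

def run_eval() -> float:
    """Assumes eval.py prints the composite as the last line of stdout."""
    out = subprocess.run(["python3", ".lab/eval.py"],
                         capture_output=True, text=True)
    return float(out.stdout.strip().splitlines()[-1])

def iterate(baseline: float, keep_threshold: float) -> float:
    hypothesis = claude_p("Propose ONE improvement hypothesis for this repo.")
    claude_p(f"Implement exactly this change and nothing else:\n{hypothesis}")
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", hypothesis.splitlines()[0]], check=True)

    composite = run_eval()
    if composite - baseline >= keep_threshold:
        return composite                                  # keep: the commit stays
    subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)  # discard
    return baseline
```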
Status Workflow (`/autoresearch status`)
python3 ~/.claude/skills/autoresearch/scripts/report.py --repo-root .
Shows: composite (live), experiment timeline, keeps/discards/crashes, active convergence signals, branch genealogy, dead-ends.
Resume Workflow (`/autoresearch resume`)
- Check if `.lab/` exists
- If yes: read `config.json`, `results.tsv`, tail of `log.md`
- Present summary: objective, metrics, experiment count, current best vs baseline, last status
- Ask: resume (continue from last experiment) or fresh (archive to `.lab.bak.<timestamp>/`)
- If resume: check for stale lock file, clean up if needed, then run
`.lab/` Directory Layout

    .lab/             # gitignored — experiment knowledge store
    config.json       # All parameters (repo_name, build_cmd, keep_threshold, etc.)
    runner.py         # Customized runner (from runner_template.py)
    eval.py           # Generated + user-extended eval gates
    eval_base.py      # Base framework (gate registration, composite scoring)
    program.md        # Human-maintained constraints + priorities
    results.tsv       # Experiment log (experiment_id, branch, parent, commit,
                      #   composite, status, duration_s, description)
    log.md            # Narrative per-experiment entries
    branches.md       # Branch registry
    dead-ends.md      # Falsified approaches + why they failed
    parking-lot.md    # Deferred ideas for later
    eval-report.md    # Auto-generated cumulative report
    runner-*.log      # Runner stdout/stderr logs
    .runner.lock      # PID lock file (prevents concurrent runs)
Why `.lab/` and not `autoresearch/`: Code state (git) and experiment knowledge (`.lab/`) are fully decoupled. `git reset --hard HEAD~1` (the core discard mechanic) never touches `.lab/`. Results survive branch operations.
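Because `results.tsv` is tab-separated with the columns listed above, post-hoc analysis needs only the standard library. A sketch that assumes the file has a header row using exactly those column names:

```python
import csv
from pathlib import Path

rows = list(csv.DictReader(Path(".lab/results.tsv").open(), delimiter="\t"))

best = max((float(r["composite"]) for r in rows), default=0.0)
counts = {}
for r in rows:
    counts[r["status"]] = counts.get(r["status"], 0) + 1

print(f"experiments:    {len(rows)}")
print(f"best composite: {best:.3f}")
print(f"status counts:  {counts}")
```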
Three-Tier Output Protocol
Eval gates emit structured diagnostics to stderr:

    GATE build=PASS                # Binary — blocks iteration on FAIL
    METRIC test_count=475          # Continuous — tracked in results.tsv
    TRACE gate_duration_ms=3200    # Execution data — for debugging only
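On the consuming side the protocol is line-oriented and trivial to parse. A sketch of splitting a gate's stderr into the three tiers (`parse_diagnostics` is illustrative, not a function shipped with the skill; it also strips the trailing `#` comments shown above):

```python
def parse_diagnostics(stderr_text: str) -> dict:
    """Split three-tier output into {'GATE': {...}, 'METRIC': {...}, 'TRACE': {...}}."""
    out = {"GATE": {}, "METRIC": {}, "TRACE": {}}
    for line in stderr_text.splitlines():
        parts = line.split(None, 1)            # tier keyword, then "key=value ..."
        if len(parts) == 2 and parts[0] in out and "=" in parts[1]:
            key, _, value = parts[1].partition("=")
            out[parts[0]][key.strip()] = value.split("#", 1)[0].strip()
    return out

# parse_diagnostics("GATE build=PASS\nMETRIC test_count=475")
# -> {'GATE': {'build': 'PASS'}, 'METRIC': {'test_count': '475'}, 'TRACE': {}}
```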
Scripts Reference
| Script | Purpose | Run from |
|---|---|---|
| `detect_stack.py` | Detect language, build system, test runner | Skill dir |
| `scaffold.py` | Create `.lab/` with all files | Skill dir |
| `eval_gen.py` | Generate adversarial eval gates | Skill dir |
| `report.py` | Render status report | Skill dir |
| `runner_template.py` | Template copied to `.lab/runner.py` | Skill dir |
| `eval_base.py` | Base eval framework copied to `.lab/eval_base.py` | Skill dir |
| | Config template with documented fields | Skill dir |
| | Program.md template | Skill dir |
All scripts run with `python3` (no special dependencies). Use `uv run` if preferred.
Gotchas
- `ANTHROPIC_API_KEY` in environment: The runner strips it so `claude -p` uses subscription auth (not pay-per-use API). If you want API auth, set `use_api_key: true` in config.json.
- Gate stochasticity: If gates produce different scores on the same code, the runner will thrash between keep/discard. All gates must be deterministic.
- Large dt on resume: If the machine suspends during a run, the runner handles it gracefully via atexit + lock file cleanup.
- Eval crashes vs gate crashes: An eval crash (eval.py itself fails) aborts the iteration. A gate crash (one gate throws) is logged in `crashed_gates` and excluded from the composite.
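The first gotcha comes down to which environment the `claude -p` subprocess inherits. A sketch of the stripping behavior (illustrative, not the runner's actual code; the `use_api_key` flag is passed in directly here rather than read from config.json):

```python
import os
import subprocess

def run_claude(prompt: str, use_api_key: bool = False) -> str:
    env = dict(os.environ)
    if not use_api_key:
        # Drop the key so `claude -p` falls back to subscription auth.
        env.pop("ANTHROPIC_API_KEY", None)
    return subprocess.run(["claude", "-p", prompt], env=env, check=True,
                          capture_output=True, text=True).stdout
```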
Post-Run Checklist
After every autoresearch run:
- `tail -f .lab/results.tsv` — review keeps/discards
- Read `.lab/eval-report.md` for cumulative progress and ceiling detection
- Merge the autoresearch branch to main if satisfied
- Update `.lab/program.md` dead ends with falsified approaches
- Run `python3 .lab/eval.py` to confirm final composite
Never
- Never modify `.lab/eval_base.py` or `.lab/runner.py` during a run
- Never run two runners concurrently (lock file prevents this, but don't bypass)
- Never commit `.lab/` to git (it's gitignored for a reason)
- Never trust a composite that includes crashed gates