# Claude-code-minoan autoresearch

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/tdimino/claude-code-minoan
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/tdimino/claude-code-minoan "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/core-development/autoresearch" ~/.claude/skills/tdimino-claude-code-minoan-autoresearch && rm -rf "$T"
```

Manifest: `skills/core-development/autoresearch/SKILL.md`

## Source content
<critical>

### Five Invariants (never violate)

1. Single mutable surface — one hypothesis per iteration, one change per experiment
2. Fixed eval budget — eval runs in bounded time, no network calls in gates
3. One scalar metric — the composite score drives keep/discard, not vibes
4. Binary keep/discard — improved = keep, else revert (`git reset --hard HEAD~1`)
5. Git-as-memory — every experiment is a commit, discards are reverts, history is the log

### Safety rules

- Never modify `.lab/` contents during hypothesis implementation
- Never skip eval — every commit must be evaluated before keep/discard
- Always revert on crash — an `atexit` handler restores git state (see the sketch after this block)
- Runner uses subscription auth (`claude -p` with `ANTHROPIC_API_KEY` stripped)
</critical>
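The crash-revert rule can be pictured as a small `atexit` hook. This is a hedged sketch, not the actual `runner.py` implementation; the helper names and the `state` flag are assumptions for illustration:

```python
import atexit
import subprocess

def snapshot_head() -> str:
    """Record the commit the experiment starts from."""
    return subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def revert_on_crash(baseline_commit: str, state: dict) -> None:
    """atexit hook: if the run dies before an explicit keep/discard, restore git state."""
    if not state.get("resolved"):
        subprocess.run(["git", "reset", "--hard", baseline_commit], check=False)

baseline = snapshot_head()
state = {"resolved": False}
atexit.register(revert_on_crash, baseline, state)
# ... run the experiment, then set state["resolved"] = True after an explicit keep/discard ...
```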

## Autoresearch

Scaffold and run autonomous code improvement loops in any git repo. The pattern: generate a hypothesis via `claude -p`, implement it, run programmatic eval gates, keep if the composite score improves, discard if it doesn't. Proven across 50+ iterations on two codebases (shadow-engine: 0.69 to 1.0; perplexity-clone: search quality optimization).

### Category

Runbooks — a mechanical process with clear steps, not cognitive reasoning.

## Quick Start

```
/autoresearch init          # scaffold .lab/ in your repo
/autoresearch run           # start the loop (default: 50 iterations)
/autoresearch status        # check progress
/autoresearch resume        # recover interrupted run
```

## Command Dispatch

Parse `$ARGUMENTS` and route:

| Argument | Action |
| --- | --- |
| `init` | Run scaffold workflow (see Init below) |
| `eval-gen` | Regenerate eval gates from repo analysis |
| `run [--max-iterations N] [--dry-run]` | Launch the autoresearch loop |
| `status` | Show composite, timeline, convergence signals |
| `resume` | Detect `.lab/`, present state, ask resume or fresh |
| (empty) | Show help text with available commands |

## Init Workflow (`/autoresearch init`)

1. Verify `.git/` exists in the current directory
2. Run stack detection:
   `python3 ~/.claude/skills/autoresearch/scripts/detect_stack.py`
3. Review the detected stack info (language, build_cmd, test_cmd, lint_cmd)
4. Run the scaffold script:
   `python3 ~/.claude/skills/autoresearch/scripts/scaffold.py --repo-root . --yes`
5. Review `.lab/config.json` — adjust `keep_threshold`, `max_iterations`, `gate_weights` if needed (an illustrative config sketch follows this workflow)
6. Edit `.lab/program.md` — this is the most important file. Add:
   - Specific areas to improve (not vague goals)
   - A concrete hypothesis list (ranked)
   - Constraints the agent must respect
7. Run a baseline eval to verify the gates work:
   `python3 .lab/eval.py`
8. Report the initial composite to the user

If `.lab/` already exists, ask the user: resume the existing lab, or archive it to `.lab.bak.<timestamp>/` and start fresh?
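As a reference point, here is a hedged sketch of what `.lab/config.json` might look like. Only fields named elsewhere in this document are shown; the key names inside `gate_weights` and all values are illustrative assumptions, so defer to the file the scaffold actually generates:

```json
{
  "repo_name": "my-project",
  "build_cmd": "npm run build",
  "test_cmd": "npm test",
  "lint_cmd": "npm run lint",
  "keep_threshold": 0.01,
  "max_iterations": 50,
  "gate_weights": { "build_test": 0.20, "behavioral": 0.40, "pipeline": 0.25, "documentation": 0.15 },
  "use_api_key": false
}
```

The weights mirror the 4-tier gate architecture described under Eval-Gen below.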

## Eval-Gen Workflow (`/autoresearch eval-gen`)

Regenerate eval gates without re-scaffolding everything:

```bash
python3 ~/.claude/skills/autoresearch/scripts/eval_gen.py --repo-root . --output .lab/eval.py
```

Review the generated gates. The user may want to:

- Add custom gates for domain-specific behavior
- Adjust tier weights in `.lab/config.json`
- Add behavioral gates that test specific CLI invocations or API endpoints

Gates follow a 4-tier architecture:

| Tier | Weight | What it measures | Anti-cheat |
| --- | --- | --- | --- |
| T1: Build+Test | 0.20 | Compiles, tests pass, lint clean | Runs real commands, sums pass counts |
| T2: Behavioral | 0.40 | Integration tests, CLI output, API responses | Validates content, not file existence |
| T3: Pipeline | 0.25 | Build artifacts, installs, real I/O | File size >1KB, header validation |
| T4: Documentation | 0.15 | Test count floor, doc coverage | Counts code, never trusts comments |
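To make the weighting concrete, the composite that drives keep/discard is a weighted sum of tier scores. A minimal sketch using the weights above; the real scoring lives in `.lab/eval_base.py`, whose API isn't reproduced here, and the tier scores below are made up for illustration:

```python
# Tier scores are assumed to be normalized to [0, 1]; these values are illustrative only.
tier_weights = {"T1": 0.20, "T2": 0.40, "T3": 0.25, "T4": 0.15}
tier_scores = {"T1": 1.0, "T2": 0.8, "T3": 0.9, "T4": 0.75}

composite = sum(tier_weights[t] * tier_scores[t] for t in tier_weights)
print(f"composite={composite:.4f}")  # 0.20*1.0 + 0.40*0.8 + 0.25*0.9 + 0.15*0.75 = 0.8575
```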

## Run Workflow (`/autoresearch run`)

```bash
python3 .lab/runner.py --max-iterations 50
```

Or for a dry run (prints the hypothesis, creates no files):

```bash
python3 .lab/runner.py --dry-run --max-iterations 1
```

Monitor progress in a separate terminal:

```bash
tail -f .lab/results.tsv
```

The runner:

1. Loads config from `.lab/config.json`
2. Reads `program.md` for constraints and hypothesis direction
3. Creates an `autoresearch/{date}` branch
4. Loops: hypothesis via `claude -p` -> implement via `claude -p` -> git commit -> eval -> keep/discard (see the sketch after this list)
5. Logs every experiment to `.lab/results.tsv` with extended statuses:

| Status | Meaning |
| --- | --- |
| KEEP | Composite improved >= keep_threshold |
| KEEP* | Primary improved but a secondary metric regressed |
| DISCARD | No improvement, reverted |
| INTERESTING | Negative result that reveals structure, logged to dead-ends |
| CRASH | Eval infrastructure failure, reverted |
| TIMEOUT | Experiment exceeded timeout, logged as a crash |

6. Checks 9 convergence signals after each experiment (see `references/convergence-signals.md`)
7. Re-validates the baseline every 10 real experiments
8. Auto-generates `.lab/eval-report.md` with cumulative progress
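For orientation, a simplified sketch of the loop's keep/discard core. It assumes `claude` is on PATH and that `.lab/eval.py` prints a line like `composite=0.83` (that output format is an assumption); the real `.lab/runner.py` adds branch management, logging, convergence checks, and crash recovery on top of this:

```python
import os
import re
import subprocess

def run_claude(prompt: str) -> str:
    """Call claude -p with ANTHROPIC_API_KEY stripped so subscription auth is used."""
    env = {k: v for k, v in os.environ.items() if k != "ANTHROPIC_API_KEY"}
    out = subprocess.run(["claude", "-p", prompt], env=env,
                         capture_output=True, text=True, check=True)
    return out.stdout

def run_eval() -> float:
    """Run the eval gates and parse a composite score (output format assumed for this sketch)."""
    out = subprocess.run(["python3", ".lab/eval.py"], capture_output=True, text=True)
    match = re.search(r"composite=([\d.]+)", out.stdout + out.stderr)
    return float(match.group(1)) if match else 0.0

best = run_eval()        # baseline composite
keep_threshold = 0.01    # illustrative; read from .lab/config.json in practice

for i in range(50):
    hypothesis = run_claude("Propose one small, testable improvement. One change only.")
    run_claude(f"Implement exactly this hypothesis and nothing else:\n{hypothesis}")

    subprocess.run(["git", "add", "-A"], check=True)
    committed = subprocess.run(["git", "commit", "-m", f"experiment {i}"], capture_output=True)
    if committed.returncode != 0:
        continue  # the model produced no diff; nothing to evaluate or revert

    score = run_eval()
    if score >= best + keep_threshold:
        best = score  # KEEP: the experiment commit stays on the branch
    else:
        subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)  # DISCARD
```

Note that the discard path is a plain `git reset --hard HEAD~1`, which never touches the gitignored `.lab/` directory, so experiment knowledge survives every revert.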

## Status Workflow (`/autoresearch status`)

```bash
python3 ~/.claude/skills/autoresearch/scripts/report.py --repo-root .
```

Shows: composite (live), experiment timeline, keeps/discards/crashes, active convergence signals, branch genealogy, dead-ends.

## Resume Workflow (`/autoresearch resume`)

1. Check if `.lab/` exists
2. If yes: read `config.json`, `results.tsv`, and the tail of `log.md`
3. Present a summary: objective, metrics, experiment count, current best vs baseline, last status
4. Ask: resume (continue from the last experiment) or fresh (archive to `.lab.bak.<timestamp>/`)
5. If resume: check for a stale lock file, clean it up if needed, then run (see the sketch after this list)
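A minimal sketch of the stale-lock check, assuming `.lab/.runner.lock` holds just the runner's PID (the real lock format and cleanup logic in `runner.py` may differ):

```python
import os
from pathlib import Path

lock = Path(".lab/.runner.lock")
if lock.exists():
    try:
        pid = int(lock.read_text().strip())
        os.kill(pid, 0)  # signal 0: probe whether that process is still alive
        raise SystemExit(f"Runner already active (pid {pid}); refusing to start a second one.")
    except (ValueError, ProcessLookupError):
        lock.unlink()  # stale or malformed lock left by a crashed/suspended run; safe to remove
```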

## .lab/ Directory Layout

```
.lab/                          # gitignored — experiment knowledge store
  config.json                  # All parameters (repo_name, build_cmd, keep_threshold, etc.)
  runner.py                    # Customized runner (from runner_template.py)
  eval.py                      # Generated + user-extended eval gates
  eval_base.py                 # Base framework (gate registration, composite scoring)
  program.md                   # Human-maintained constraints + priorities
  results.tsv                  # Experiment log (experiment_id, branch, parent, commit,
                               #   composite, status, duration_s, description)
  log.md                       # Narrative per-experiment entries
  branches.md                  # Branch registry
  dead-ends.md                 # Falsified approaches + why they failed
  parking-lot.md               # Deferred ideas for later
  eval-report.md               # Auto-generated cumulative report
  runner-*.log                 # Runner stdout/stderr logs
  .runner.lock                 # PID lock file (prevents concurrent runs)
```

Why `.lab/` and not `autoresearch/`: code state (git) and experiment knowledge (`.lab/`) are fully decoupled. `git reset --hard HEAD~1` (the core discard mechanic) never touches `.lab/`. Results survive branch operations.

## Three-Tier Output Protocol

Eval gates emit structured diagnostics to stderr:

```
GATE build=PASS              # Binary — blocks iteration on FAIL
METRIC test_count=475        # Continuous — tracked in results.tsv
TRACE gate_duration_ms=3200  # Execution data — for debugging only
```
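A hedged sketch of a gate that speaks this protocol: it runs a real test command and writes GATE, METRIC, and TRACE lines to stderr. The function name, the default command, and the pass-count heuristic are assumptions for illustration; how gates are registered with `.lab/eval_base.py` is not shown because that API isn't documented here:

```python
import subprocess
import sys
import time

def build_test_gate(test_cmd: str = "npm test") -> bool:
    """Run the real test command and emit GATE / METRIC / TRACE lines to stderr."""
    start = time.monotonic()
    result = subprocess.run(test_cmd, shell=True, capture_output=True, text=True)
    passed = result.returncode == 0
    # Rough test-count metric: count "pass"-style lines in the runner output (adjust per stack).
    test_count = sum(1 for line in result.stdout.splitlines() if "pass" in line.lower())

    print(f"GATE build={'PASS' if passed else 'FAIL'}", file=sys.stderr)
    print(f"METRIC test_count={test_count}", file=sys.stderr)
    print(f"TRACE gate_duration_ms={int((time.monotonic() - start) * 1000)}", file=sys.stderr)
    return passed
```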

## Scripts Reference

| Script | Purpose | Run from |
| --- | --- | --- |
| `scripts/detect_stack.py` | Detect language, build system, test runner | Skill dir |
| `scripts/scaffold.py` | Create `.lab/` with all files | Skill dir |
| `scripts/eval_gen.py` | Generate adversarial eval gates | Skill dir |
| `scripts/report.py` | Render status report | Skill dir |
| `scripts/runner_template.py` | Template copied to `.lab/runner.py` | Skill dir |
| `assets/eval_base.py` | Base eval framework copied to `.lab/` | Skill dir |
| `assets/config.json.tmpl` | Config template with documented fields | Skill dir |
| `assets/program.md.tmpl` | program.md template | Skill dir |

All scripts run with `python3` (no special dependencies). Use `uv run` if preferred.

## Gotchas

- `ANTHROPIC_API_KEY` in environment: the runner strips it so `claude -p` uses subscription auth (not pay-per-use API). If you want API auth, set `use_api_key: true` in `config.json`.
- Gate stochasticity: if gates produce different scores on the same code, the runner will thrash between keep/discard. All gates must be deterministic.
- Large dt on resume: if the machine suspends during a run, the runner handles it gracefully via `atexit` plus lock file cleanup.
- Eval crashes vs gate crashes: an eval crash (`eval.py` itself fails) aborts the iteration. A gate crash (one gate throws) is logged in `crashed_gates` and excluded from the composite.
<critical>

### Post-Run Checklist

After every autoresearch run:

1. `tail -f .lab/results.tsv` — review keeps/discards
2. Read `.lab/eval-report.md` for cumulative progress and ceiling detection
3. Merge the autoresearch branch to main if satisfied
4. Update `.lab/program.md` dead ends with falsified approaches
5. Run `python3 .lab/eval.py` to confirm the final composite

### Never

- Never modify `.lab/eval_base.py` or `.lab/runner.py` during a run
- Never run two runners concurrently (the lock file prevents this, but don't bypass it)
- Never commit `.lab/` to git (it's gitignored for a reason)
- Never trust a composite that includes crashed gates
</critical>