git clone https://github.com/Aradotso/trending-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pi-autoresearch-loop" ~/.claude/skills/aradotso-trending-skills-pi-autoresearch-loop && rm -rf "$T"
skills/pi-autoresearch-loop/SKILL.md
---
name: pi-autoresearch-loop
description: Autonomous experiment loop for pi that continuously tries optimizations, measures results, and keeps what works
triggers:
  - autoresearch
  - autonomous experiment loop
  - optimize automatically
  - run experiment loop
  - continuous optimization
  - benchmark and improve
  - start autoresearch session
  - keep what works discard what doesnt
---
# pi-autoresearch — Autonomous Experiment Loop
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection

Autonomous experiment loop extension for [pi](https://github.com/antiwork/pi). Continuously proposes changes, benchmarks them, commits wins, reverts losses, and repeats — forever. Works for any measurable target: test speed, bundle size, build time, LLM training loss, Lighthouse scores.

---
## Installation
```bash
pi install https://github.com/davebcn87/pi-autoresearch
```
Manual install:
    cp -r extensions/pi-autoresearch ~/.pi/agent/extensions/
    cp -r skills/autoresearch-create ~/.pi/agent/skills/

Then run `/reload` in pi.
Quick Start
/skill:autoresearch-create
The agent will:
- Ask about your goal, command, metric, and files in scope (or infer from context)
- Create a branch
- Write `autoresearch.md` and `autoresearch.sh`
- Run the baseline
- Start looping immediately — no further input needed
Core Concepts
Two-file persistence model
Every session is fully recoverable from two files:
| File | Purpose |
|---|---|
| `autoresearch.jsonl` | Append-only log: one JSON line per run (metric, status, commit, description) |
| `autoresearch.md` | Living document: objective, what's been tried, dead ends, key wins |
A fresh agent with zero memory can read these two files and continue exactly where the previous session left off.
Session files written by the skill
| File | Purpose |
|---|---|
| `autoresearch.md` | Session document: objective, metrics, files in scope, experiment history |
| `autoresearch.sh` | Benchmark script: pre-checks, runs the workload, outputs `METRIC name=number` lines |
| `autoresearch.checks.sh` | (optional) Backpressure checks: tests, types, lint. Failures block `keep` |
Extension Tools
init_experiment
One-time session configuration. Call once at session start.

    await init_experiment({
      name: "vitest-speed",
      metric: "seconds",
      unit: "s",
      direction: "lower", // "lower" | "higher"
    });
run_experiment
Runs any shell command, times wall-clock duration, and captures stdout/stderr.

    const result = await run_experiment({
      command: "pnpm test --run",
      timeout_seconds: 120,        // optional, default 300
      checks_timeout_seconds: 300, // optional, for checks script
    });
    // result: { exit_code, duration_seconds, stdout, stderr }
log_experiment
Records the result, auto-commits on `keep`, and updates the status widget and dashboard.

    await log_experiment({
      metric_value: 42.3,
      status: "keep", // "keep" | "discard" | "crash" | "checks_failed"
      description: "Enable parallel test workers in vitest config",
      commit_message: "perf: parallel vitest workers → 42.3s (-18%)",
    });
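The percentage in a commit message like the one above is the change relative to the baseline. A minimal sketch of that arithmetic follows, assuming the 51.7s baseline used in this document's examples. `percentChange` and `formatCommitMessage` are hypothetical helpers for illustration, not part of the extension.

```typescript
// Hypothetical helpers for illustration; not part of the pi-autoresearch API.

/** Relative change from baseline to value, rounded to a whole percent. */
function percentChange(baseline: number, value: number): number {
  return Math.round(((value - baseline) / baseline) * 100);
}

/** Build a commit message in the style "perf: <what> → 42.3s (-18%)". */
function formatCommitMessage(
  what: string,
  baseline: number,
  value: number,
  unit: string,
): string {
  const pct = percentChange(baseline, value);
  const sign = pct > 0 ? "+" : ""; // negative numbers already carry "-"
  return `perf: ${what} → ${value}${unit} (${sign}${pct}%)`;
}

console.log(formatCommitMessage("parallel vitest workers", 51.7, 42.3, "s"));
```

Whether the percentage is measured against the original baseline or the previous best is a choice; this sketch uses the original baseline.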
The Autonomous Loop
Once started, the agent runs this cycle indefinitely:
    propose change → edit files → run_experiment → measure metric
                                                        ↓
                                                metric improved?
      YES → log_experiment(keep) → auto-commit → update autoresearch.md
      NO  → log_experiment(discard) → git revert → try next idea
                                                        ↓
                                        repeat forever (until interrupted)

Interrupt anytime with `Escape`, then ask for a summary of what was tried.
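The keep/discard branch of the cycle can be sketched as a comparison against the best value so far. This is an illustration only: `isImprovement` and `simulateLoop` are hypothetical names, and treating ties as non-improvements is an assumption rather than documented behavior.

```typescript
// Illustrative sketch only; the extension's actual decision logic may differ.
type Direction = "lower" | "higher";

/** True when candidate strictly beats the best value so far. */
function isImprovement(direction: Direction, best: number, candidate: number): boolean {
  return direction === "lower" ? candidate < best : candidate > best;
}

/** Walk a sequence of measurements, keeping only strict improvements. */
function simulateLoop(direction: Direction, baseline: number, measurements: number[]) {
  let best = baseline;
  const decisions: Array<{ value: number; status: "keep" | "discard" }> = [];
  for (const value of measurements) {
    if (isImprovement(direction, best, value)) {
      best = value; // a kept run becomes the new bar to beat
      decisions.push({ value, status: "keep" });
    } else {
      decisions.push({ value, status: "discard" }); // the change would be reverted
    }
  }
  return { best, decisions };
}

const outcome = simulateLoop("lower", 51.7, [49.2, 53.1, 47.8]);
console.log(outcome.best, outcome.decisions.map((d) => d.status).join(","));
```

Because each kept run raises the bar, later ideas are compared against the current best rather than the original baseline.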
Benchmark Script Format
autoresearch.sh must output at least one METRIC line:
    #!/bin/bash
    set -euo pipefail

    # Pre-checks
    [ -f package.json ] || { echo "No package.json"; exit 1; }

    # Run workload
    pnpm test --run

    # Output metric — required format
    echo "METRIC seconds=$SECONDS"
Multiple metrics are supported:
    echo "METRIC duration_seconds=42.3"
    echo "METRIC test_count=847"
    echo "METRIC memory_mb=512"

The primary metric (set in `init_experiment`) drives keep/discard decisions. Others are recorded for analysis.
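To make the `METRIC name=number` format concrete, here is a minimal parser sketch. The regular expression is an assumption about what the format permits (identifier-style names, plain decimal numbers); the extension's real parser may accept more or less than this.

```typescript
// Sketch of a METRIC-line parser. The regex is an assumption about what
// "METRIC name=number" permits; the extension's real parser may differ.
const METRIC_LINE = /^METRIC ([A-Za-z_][A-Za-z0-9_]*)=(-?\d+(?:\.\d+)?)$/;

/** Collect every well-formed METRIC line from benchmark stdout. */
function parseMetrics(stdout: string): Record<string, number> {
  const metrics: Record<string, number> = {};
  for (const line of stdout.split("\n")) {
    const match = METRIC_LINE.exec(line.trim());
    if (match) {
      metrics[match[1]] = Number(match[2]);
    }
  }
  return metrics;
}

const stdout = [
  "Tests  847 passed",
  "METRIC duration_seconds=42.3",
  "METRIC test_count=847",
  "METRIC memory_mb = 512", // spaces around "=" do not match
].join("\n");

const metrics = parseMetrics(stdout);
const primary = "duration_seconds"; // the metric name given to init_experiment
console.log(metrics[primary], Object.keys(metrics));
```

Only the value looked up under the primary name would drive keep/discard; the remaining entries are simply recorded.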
Backpressure Checks (Optional)
Create `autoresearch.checks.sh` to guard correctness after every passing benchmark:

    #!/bin/bash
    set -euo pipefail

    pnpm test --run   # full test suite
    pnpm typecheck    # TypeScript
    pnpm lint         # ESLint / Biome
Behavior:
- File absent → loop runs exactly as before, no change
- File present → runs automatically after every benchmark that exits 0
- Checks time does not count toward the primary metric
- Checks failure → logged as `checks_failed`, changes reverted (same as crash)
- Dashboard shows `checks_failed` separately from `crash` so you can distinguish correctness failures from benchmark errors
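The rules above amount to a small decision table. The sketch below encodes one reading of them; `RunOutcome` and `classifyRun` are hypothetical names, and the choice to report `checks_failed` even when the metric did not improve is an assumption.

```typescript
// Illustrative sketch of the status rules listed above; names are hypothetical.
type Status = "keep" | "discard" | "crash" | "checks_failed";

interface RunOutcome {
  benchmarkExitCode: number;
  checksExitCode: number | null; // null when autoresearch.checks.sh is absent
  improved: boolean; // primary metric beat the best so far
}

function classifyRun(run: RunOutcome): Status {
  if (run.benchmarkExitCode !== 0) return "crash"; // checks never run
  if (run.checksExitCode !== null && run.checksExitCode !== 0) {
    return "checks_failed"; // reverted like a crash, reported separately
  }
  return run.improved ? "keep" : "discard";
}

console.log(classifyRun({ benchmarkExitCode: 0, checksExitCode: 1, improved: true }));
```

With `checksExitCode: null` the function reduces to the plain keep/discard/crash behavior, matching the "file absent" case.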
UI
Status Widget
Always visible above the editor:
🔬 autoresearch 12 runs 8 kept │ best: 42.3s
Dashboard
Open with `/autoresearch`: a full results table with status, metric values, descriptions, and the best run highlighted.
- `Ctrl+X`: toggle dashboard
- `Escape`: close dashboard / interrupt loop
Example Domains
    // Test speed
    {
      command: "pnpm test --run",
      metric: "seconds",
      direction: "lower",
      scope: ["vitest.config.ts", "src/**/*.test.ts"],
    }

    // Bundle size
    {
      command: "pnpm build && du -sb dist | cut -f1",
      metric: "bytes",
      direction: "lower",
      scope: ["vite.config.ts", "src/index.ts"],
    }

    // LLM training loss
    {
      command: "uv run train.py --epochs 1",
      metric: "val_bpb",
      direction: "lower",
      scope: ["train.py", "model.py", "config.yaml"],
    }

    // Build speed
    {
      command: "pnpm build",
      metric: "seconds",
      direction: "lower",
      scope: ["tsconfig.json", "vite.config.ts"],
    }

    // Lighthouse performance
    {
      command: "lighthouse http://localhost:3000 --output=json | jq '.categories.performance.score'",
      metric: "score",
      direction: "higher",
      scope: ["src/pages/index.tsx", "public/"],
    }
autoresearch.md Structure
The skill writes and maintains this file throughout the session:
    # autoresearch: vitest-speed

    ## Objective
    Reduce test suite wall-clock time. Baseline: 51.7s.

    ## Metric
    - Name: seconds
    - Direction: lower is better
    - Baseline: 51.7s
    - Best so far: 42.3s (run 8)

    ## Files in scope
    - vitest.config.ts
    - src/**/*.test.ts

    ## What's been tried
    - [kept] Run 8: Enable parallel workers → 42.3s (-18%)
    - [discarded] Run 5: Increase pool size to 16 → 53.1s (+3%)
    - [kept] Run 3: Disable coverage in CI → 47.8s (-8%)

    ## Dead ends
    - Increasing pool beyond 8 causes memory pressure, net negative

    ## Next ideas
    - [ ] Try forks pool instead of threads
    - [ ] Investigate slow test files with --reporter=verbose
autoresearch.jsonl Format
One JSON object per line:
    {"run":1,"metric_value":51.7,"status":"keep","description":"baseline","commit":"a1b2c3d","timestamp":"2025-01-15T10:00:00Z"}
    {"run":2,"metric_value":49.2,"status":"keep","description":"disable coverage","commit":"e4f5g6h","timestamp":"2025-01-15T10:03:21Z"}
    {"run":3,"metric_value":53.1,"status":"discard","description":"increase pool to 16","commit":null,"timestamp":"2025-01-15T10:07:45Z"}
    {"run":4,"metric_value":null,"status":"crash","description":"invalid vitest config syntax","commit":null,"timestamp":"2025-01-15T10:09:12Z"}
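For reference, the shape of one log entry can be written down as a type. The field names mirror the example lines above; the types are inferred from those examples and are an assumption, not a published schema. `toJsonLine` is a hypothetical helper showing what "append-only, one object per line" implies for a writer.

```typescript
// Field names mirror the example lines above; the types are inferred from
// those examples and are an assumption, not a published schema.
interface RunEntry {
  run: number;
  metric_value: number | null; // null when the run crashed
  status: "keep" | "discard" | "crash" | "checks_failed";
  description: string;
  commit: string | null; // in the examples, only kept runs have a commit
  timestamp: string; // ISO 8601, UTC
}

/** Serialize one entry as a single JSONL line, terminated by a newline. */
function toJsonLine(entry: RunEntry): string {
  // JSON.stringify escapes newlines inside strings, so the result stays on one line.
  return JSON.stringify(entry) + "\n";
}

const line = toJsonLine({
  run: 5,
  metric_value: 47.8,
  status: "keep",
  description: "disable coverage",
  commit: "a1b2c3d",
  timestamp: new Date(Date.UTC(2025, 0, 15, 10, 12, 0)).toISOString(),
});
console.log(line);
```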
Read the log programmatically:
    import { readFileSync } from "fs";

    // Parse one JSON object per line
    const runs = readFileSync("autoresearch.jsonl", "utf-8")
      .trim()
      .split("\n")
      .map((line) => JSON.parse(line));

    // Best kept run, assuming lower is better
    const kept = runs.filter((r) => r.status === "keep");
    const best = kept.reduce((a, b) =>
      a.metric_value < b.metric_value ? a : b
    );

    console.log(`Best: ${best.metric_value} — ${best.description}`);
Resuming a Session
The agent can resume from the two session files. Recommended resume prompt:
Read autoresearch.jsonl and autoresearch.md, then continue the experiment loop. Don't restart — pick up from run N and keep going.
Or use the skill:
/skill:autoresearch-create resume
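What "pick up from run N" means can be sketched as a pure function over the log text. `resumeState` is a hypothetical helper, not something the skill exposes, and it assumes a lower-is-better metric.

```typescript
// Sketch of how a fresh agent might rebuild its position from the log text.
// `resumeState` is a hypothetical helper; "lower is better" is assumed.
interface LoggedRun {
  run: number;
  metric_value: number | null;
  status: string;
  description: string;
}

function resumeState(jsonl: string) {
  const runs: LoggedRun[] = jsonl
    .split("\n")
    .filter((l) => l.trim() !== "")
    .map((l) => JSON.parse(l) as LoggedRun);

  const kept = runs.filter((r) => r.status === "keep" && r.metric_value !== null);
  const best = kept.reduce<LoggedRun | null>(
    (a, b) => (a === null || (b.metric_value as number) < (a.metric_value as number) ? b : a),
    null,
  );
  const lastRun = runs.reduce((max, r) => Math.max(max, r.run), 0);

  return {
    nextRun: lastRun + 1,
    best,
    alreadyTried: runs.map((r) => r.description), // avoid repeating ideas
  };
}

const log = [
  '{"run":1,"metric_value":51.7,"status":"keep","description":"baseline"}',
  '{"run":2,"metric_value":49.2,"status":"keep","description":"disable coverage"}',
  '{"run":3,"metric_value":53.1,"status":"discard","description":"increase pool to 16"}',
].join("\n");

const state = resumeState(log);
console.log(state.nextRun, state.best?.metric_value);
```

The log alone gives the run counter and the best result; the dead ends and next ideas still come from `autoresearch.md`.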
Architecture
    ┌──────────────────────┐     ┌──────────────────────────┐
    │  Extension (global)  │     │  Skill (per-domain)      │
    │                      │     │                          │
    │  run_experiment      │◄────│  command: pnpm test      │
    │  log_experiment      │     │  metric: seconds (lower) │
    │  widget + dashboard  │     │  scope: vitest configs   │
    │                      │     │  ideas: pool, parallel…  │
    └──────────────────────┘     └──────────────────────────┘
             │
             ▼
       autoresearch.jsonl  ← append-only run log
       autoresearch.md     ← living session document
The extension is domain-agnostic infrastructure. The skill encodes domain knowledge. One extension serves unlimited domains.
Troubleshooting
Loop not starting after skill runs
- Check that `autoresearch.sh` is executable: `chmod +x autoresearch.sh`
- Verify the script outputs a `METRIC name=number` line on success
- Run `bash autoresearch.sh` manually to debug
Widget not showing
- Run `/reload` in pi to reload the extension
- Confirm the extension is in `~/.pi/agent/extensions/pi-autoresearch/`
run_experiment times out
- Increase `timeout_seconds` in your `run_experiment` call
- Default is 300s; long benchmarks (LLM training) may need 3600+
Checks script blocking everything
- Check `autoresearch.checks.sh` exit codes manually: `bash autoresearch.checks.sh`
- Increase `checks_timeout_seconds` if tests are slow
- Remove the file temporarily to isolate whether the benchmark or the checks are failing
Session lost after context reset
- The agent needs only `autoresearch.jsonl` + `autoresearch.md` to resume
- Both files are committed to the branch, so they survive any context reset
- Use the resume prompt above to continue
Metric value not captured
- Ensure the benchmark script exits 0 on success
- The `METRIC` line must be on stdout, not stderr
- Format must be exactly `METRIC name=number` (no spaces around `=`)
License
MIT