# Vibecosystem experiment-loop

Autonomous experiment loop: hypothesize > modify > test > evaluate > keep/discard > repeat. Runs up to N experiments automatically against a measurable metric. Works for performance optimization, A/B testing, prompt engineering, and any other measurable improvement task.

## Install

Clone the upstream repo:

```sh
git clone https://github.com/vibeeval/vibecosystem
```

Or install the skill into Claude Code's `~/.claude/skills/`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/vibeeval/vibecosystem "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/experiment-loop" ~/.claude/skills/vibeeval-vibecosystem-experiment-loop && rm -rf "$T"
```

Manifest: `skills/experiment-loop/SKILL.md`

# Experiment Loop

Autonomous, iterative improvement inspired by Karpathy's autoresearch methodology. Define a metric, set a target, and let the loop run until the target is met or the iteration limit is reached.

## The 5-Step Loop

```text
1. HYPOTHESIZE  -> Form a specific, falsifiable improvement hypothesis
2. MODIFY       -> Apply the minimal code/config/prompt change
3. TEST         -> Run the measurement suite (benchmarks, tests, evals)
4. EVALUATE     -> Compare the result against the baseline and previous best
5. DECIDE       -> KEEP if better, DISCARD (restore the pre-iteration stash) if worse
      |
   Repeat until the target is met OR max_iterations is reached
```

Each iteration is atomic: one hypothesis, one change, one measurement, one decision.
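
As a sketch of the control flow only (the helper signatures here are hypothetical; in practice the agent itself performs each step), the loop reduces to a short decision procedure. It assumes the working tree already has uncommitted changes to stash and that the measurement helper throws on failure or timeout:

```typescript
import { execSync } from "node:child_process";

interface Experiment {
  target: number;
  direction: "minimize" | "maximize";
  maxIterations: number;
}

// `hypotheses` is the ordered queue; `apply` edits files within scope;
// `measure` runs measurement_cmd and returns the metric (throws on failure).
function runLoop(
  exp: Experiment,
  baseline: number,
  hypotheses: string[],
  apply: (hypothesis: string) => void,
  measure: () => number,
): number {
  const better = (a: number, b: number) =>
    exp.direction === "minimize" ? a < b : a > b;
  const met = (v: number) =>
    exp.direction === "minimize" ? v <= exp.target : v >= exp.target;

  let best = baseline;
  for (let i = 1; i <= exp.maxIterations && hypotheses.length > 0; i++) {
    // Savepoint: push records the state, apply keeps cumulative changes in the tree.
    execSync(`git stash push -u -m "experiment-loop: iteration ${i} baseline"`);
    execSync("git stash apply");

    apply(hypotheses.shift()!); // MODIFY

    let result = exp.direction === "minimize" ? Infinity : -Infinity;
    try {
      result = measure(); // TEST
    } catch {
      // A failed or timed-out measurement counts as DISCARD.
    }

    if (better(result, best)) {
      best = result;
      execSync("git stash drop"); // KEEP: savepoint no longer needed
    } else {
      execSync("git reset --hard");      // DISCARD: drop this iteration's edits
      execSync("git clean -fd");         //          and any files it created
      execSync("git stash pop --index"); // restore the exact pre-iteration state
    }
    if (met(best)) break; // target met: exit early
  }
  return best;
}
```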

## Experiment Definition

Define an experiment in your task or in `thoughts/EXPERIMENTS.md`:

```yaml
experiment:
  name: "reduce-api-latency"
  metric: "p95 response time (ms)"
  baseline: 340
  target: 200
  direction: minimize          # minimize | maximize
  max_iterations: 10           # hard cap, never exceed
  measurement_cmd: "npm run bench:api"
  measurement_key: "p95"       # JSON key from bench output
  scope: "src/api/"            # files the loop is allowed to touch
```

### Key Fields

| Field | Description |
| --- | --- |
| `metric` | Human-readable name of what you are measuring |
| `baseline` | Measured value before any changes (run the measurement first) |
| `target` | Success condition -- the loop exits when it is met |
| `direction` | `minimize` for latency/size, `maximize` for coverage/score |
| `max_iterations` | Safety cap; default 10, absolute maximum 10 |
| `measurement_cmd` | Shell command that produces JSON with the metric value |
| `scope` | Directories/files the loop is allowed to modify |

## Safety Protocol

Wrap every experiment iteration in a stash savepoint:

```sh
# Save a restore point. `stash push` also clears the working tree, so
# re-apply immediately to keep cumulative changes from earlier iterations.
git stash push -u -m "experiment-loop: iteration N baseline"
git stash apply

# Run experiment
# ... apply hypothesis change ...
# ... run measurement ...

# Decision
if result is better:
    git stash drop          # KEEP: the restore point is no longer needed
else:
    git reset --hard        # DISCARD: drop this iteration's edits
    git clean -fd           #          and any files it created
    git stash pop --index   # restore exactly: staged + unstaged
```

Never skip the stash. Never accumulate multiple iterations without a decision checkpoint. If the measurement command fails or times out, treat it as DISCARD.
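
The exit-conditions table later in this document treats a stash failure as fatal, unlike a measurement failure, which is merely a DISCARD. A sketch of keeping that distinction explicit (names are illustrative):

```typescript
import { execSync } from "node:child_process";

// Stash machinery failures are fatal: continuing without a restore
// point risks losing kept changes or mixing iterations together.
class StashError extends Error {}

function git(args: string): string {
  try {
    return execSync(`git ${args}`, { encoding: "utf8" });
  } catch {
    throw new StashError(`git ${args} failed -- STOP the loop and report`);
  }
}

// Usage per iteration: savepoint first, experiment, then drop or restore.
function savepoint(iteration: number): void {
  git(`stash push -u -m "experiment-loop: iteration ${iteration} baseline"`);
  git("stash apply"); // keep cumulative changes in the working tree
}
```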

## Agent Integration

The experiment loop coordinates three vibecosystem agents:

| Phase | Agent | Role |
| --- | --- | --- |
| Hypothesize | `profiler` | Identify bottlenecks, suggest what to change |
| Modify | `spark` | Apply the focused code change |
| Test + Evaluate | `verifier` / `tdd-guide` | Run benchmarks, tests, and evals; parse the results |

Spawn `profiler` once at the start to get the initial hypothesis queue, then run `spark` + `verifier` in a tight loop for each iteration.

## Example Experiments

### Bundle Size Reduction

```yaml
experiment:
  name: "optimize-bundle-size"
  metric: "gzipped bundle size (KB)"
  baseline: 420
  target: 300
  direction: minimize
  max_iterations: 10
  measurement_cmd: "npm run build && node scripts/measure-bundle.js"
  measurement_key: "gzipped_kb"
  scope: "src/"
```

Hypothesis queue to try in order:

1. Add tree-shaking for unused lodash imports (use named imports)
2. Replace `moment` with `date-fns` (smaller footprint)
3. Move large dependencies to dynamic `import()` at route boundaries
4. Enable `usedExports: true` in webpack/rollup config
5. Replace `axios` with a native `fetch` wrapper
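
The config above shells out to a `scripts/measure-bundle.js` that is assumed, not shipped with the skill, to print `{"gzipped_kb": ...}`. A minimal version, assuming the build emits flat `.js`/`.css` assets into `dist/`:

```typescript
// scripts/measure-bundle.ts -- print {"gzipped_kb": N} for the build output.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { gzipSync } from "node:zlib";

let bytes = 0;
for (const file of readdirSync("dist")) {
  if (file.endsWith(".js") || file.endsWith(".css")) {
    // Sum the gzipped size of each emitted asset.
    bytes += gzipSync(readFileSync(join("dist", file))).length;
  }
}
console.log(JSON.stringify({ gzipped_kb: +(bytes / 1024).toFixed(1) }));
```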

### API Latency

```yaml
experiment:
  name: "reduce-api-latency"
  metric: "p95 response time (ms)"
  baseline: 340
  target: 200
  direction: minimize
  max_iterations: 8
  measurement_cmd: "npm run bench:api"
  measurement_key: "p95"
  scope: "src/api/"
```

Hypothesis queue:

1. Add a Redis cache for repeated DB reads (TTL 60s)
2. Replace N+1 queries with a single JOIN query
3. Tune connection pool sizing (`max: 20`)
4. Move synchronous validation to async parallel (`Promise.all`)
5. Add response compression (gzip middleware)
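
Hypothesis 1 is the cache-aside pattern that also shows up in the results log below. A sketch using the node-redis v4 client, with a hypothetical `loadUser` standing in for the original DB read:

```typescript
import { createClient } from "redis"; // node-redis v4

interface User { id: string; name: string; }
declare function loadUser(id: string): Promise<User>; // the original DB read

const redis = createClient();
await redis.connect(); // (top-level await requires an ES module context)

// Cache-aside with a 60s TTL: serve repeated reads from Redis,
// fall through to the database on a miss, then populate the cache.
async function getUserCached(id: string): Promise<User> {
  const key = `user:${id}`;
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit);

  const user = await loadUser(id);
  await redis.set(key, JSON.stringify(user), { EX: 60 }); // TTL 60s
  return user;
}
```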

### Test Coverage

```yaml
experiment:
  name: "improve-test-coverage"
  metric: "line coverage (%)"
  baseline: 64
  target: 80
  direction: maximize
  max_iterations: 10
  measurement_cmd: "npm test -- --coverage --json > coverage.json"
  measurement_key: "coverageMap.total.lines.pct"
  scope: "src/"
```

### Prompt Engineering (LLM Eval)

```yaml
experiment:
  name: "improve-extraction-accuracy"
  metric: "extraction F1 score"
  baseline: 0.71
  target: 0.85
  direction: maximize
  max_iterations: 10
  measurement_cmd: "python eval/run_evals.py --output eval/results.json"
  measurement_key: "f1"
  scope: "prompts/"
```

## Results Log Format

Append each iteration result to `thoughts/EXPERIMENTS.md`:

```markdown
## Experiment: reduce-api-latency
Started: 2026-04-07T10:00:00Z
Baseline: 340ms | Target: 200ms | Direction: minimize

### Iteration 1
- Hypothesis: Add Redis cache for repeated DB reads
- Change: `src/api/users.ts` lines 45-67 -- wrap DB call with cache layer
- Result: 280ms (improvement: -60ms, -17.6%)
- Decision: KEEP
- Cumulative best: 280ms

### Iteration 2
- Hypothesis: Replace N+1 queries with JOIN
- Change: `src/api/users.ts` lines 89-102 -- rewrite fetchWithPosts()
- Result: 210ms (improvement: -70ms, -25%)
- Decision: KEEP
- Cumulative best: 210ms

### Iteration 3
- Hypothesis: Add connection pool sizing max:20
- Change: `src/db/pool.ts` line 12 -- max: 10 -> 20
- Result: 215ms (regression: +5ms)
- Decision: DISCARD (restored via git stash pop)
- Cumulative best: 210ms

### Final Result
- Target: 200ms | Achieved: 210ms | Status: NEAR_MISS (within 5%)
- Iterations: 3 of 10 used
- Total improvement: -38% from baseline
```
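
Logging immediately after each measurement (rather than at the end) keeps the record intact even if a later iteration crashes. A sketch that appends entries in the format above, with illustrative field names:

```typescript
import { appendFileSync } from "node:fs";

interface IterationResult {
  n: number;
  hypothesis: string;
  change: string;
  value: number;
  unit: string; // e.g. "ms"
  kept: boolean;
  best: number;
}

// Append one iteration entry to the running experiment log.
function logIteration(r: IterationResult, path = "thoughts/EXPERIMENTS.md"): void {
  appendFileSync(path, [
    `### Iteration ${r.n}`,
    `- Hypothesis: ${r.hypothesis}`,
    `- Change: ${r.change}`,
    `- Result: ${r.value}${r.unit}`,
    `- Decision: ${r.kept ? "KEEP" : "DISCARD"}`,
    `- Cumulative best: ${r.best}${r.unit}`,
    "",
  ].join("\n") + "\n");
}
```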

## Iteration Limits and Exit Conditions

| Condition | Action |
| --- | --- |
| Target met | EXIT -- log SUCCESS, keep all accumulated changes |
| `max_iterations` reached | EXIT -- log PARTIAL, keep the best achieved state |
| 3 consecutive DISCARDs | PAUSE -- re-run `profiler` for a new hypothesis queue |
| Measurement command fails | DISCARD the current iteration, continue the loop |
| Git stash fails | STOP -- do not continue; report the error |

## Running the Loop

Invoke this skill by describing the experiment:

```text
Use experiment-loop to reduce the API p95 latency from 340ms to under 200ms.
Baseline measurement: npm run bench:api
Max iterations: 8
Scope: src/api/
```

The loop will:

1. Read any existing `thoughts/EXPERIMENTS.md` for prior runs on the same metric
2. Ask `profiler` for an ordered hypothesis queue
3. Execute iterations with safety stashing
4. Log each result immediately after measurement
5. Report the final state with all changes that were kept

## Hard Limits

- Maximum 10 experiments per invocation (no exceptions)
- Scope must be specified -- the loop will not touch files outside it
- The measurement command must be deterministic (no unbounded network calls)
- Total wall-clock time cap: 30 minutes (prevents runaway loops)
- Never auto-merge to main -- changes stay on the current branch