vibe-science

Scientific research engine v6.0 NEXUS — adversarial review (Reviewer 2), 32 quality gates, tree search, serendipity tracking, confounder harness, cross-session learning. Use for ANY scientific analysis, hypothesis testing, data validation, literature review, or task where correctness > speed.

INSTALL

Source · Clone the upstream repo:

    git clone https://github.com/th3vib3coder/vibe-science

Claude Code · Install into ~/.claude/skills/:

    T=$(mktemp -d) && git clone --depth=1 https://github.com/th3vib3coder/vibe-science "$T" && mkdir -p ~/.claude/skills && cp -r "$T/archive/vibe-science-v6.0-claude-code" ~/.claude/skills/th3vib3coder-vibe-science-vibe-science-a67ff2 && rm -rf "$T"

Manifest: archive/vibe-science-v6.0-claude-code/SKILL.md
<!-- v6.0.1 | Apache-2.0 | Author: th3vib3coder | Requires: Python 3.8+ for enforcement scripts -->

Vibe Science v6.0 NEXUS — Observe · Recall · Operate

Research engine: agentic tree search over hypotheses, adversarial review by separate sub-agent, 32 quality gates (8 schema-enforced), serendipity detection, hook-based enforcement, cross-session learning, temporal decay calibration. Infinite loops until discovery.

WHY THIS SKILL EXISTS

AI agents in science optimize for completion, not truth. They find strong signals, construct narratives, never search for confounders, and declare "done" prematurely.

Over 21 sprints of real research: the agent would have published a confounded claim (OR=2.30, p < 10^-100 — sign reversed by propensity matching), a physically impossible finding (effect direction contradicted by domain knowledge), a noise signal (Cohen's d = 0.07), and non-generalizable rankings. None were hallucinations — the data was real, the statistics correct. The agent never asked: "What if this is an artifact?"

The solution is not more tools. It is a dispositional change: the system must contain an agent whose ONLY job is to destroy claims.

| | Builder (Researcher) | Destroyer (Reviewer 2) |
| --- | --- | --- |
| Optimizes for | Completion — shipping results | Survival — claims that withstand hostile review |
| Default assumption | "This result looks promising" | "This result is probably an artifact" |
| Reaction to strong signal | Excitement → narrative → paper | Suspicion → search for confounders → demand controls |
| Searches for | Supporting evidence | Prior art, contradictions, known artifacts |
| Declares "done" when | Results look good | ALL counter-verifications pass |

In Claude Code, R2 is a separate sub-agent launched via the Task tool with its own context window. It never sees the researcher's reasoning or excitement — only claims and evidence. This is native Blind-First Pass by architecture.

The Three Principles

  1. SERENDIPITY DETECTS — the unexpected observation that starts the investigation
  2. PERSISTENCE FOLLOWS — 5, 10, 20+ cycles of testing, not one-and-done
  3. REVIEWER 2 VALIDATES — systematic demolition before publication

Full exposition: references/constitution.md


CONSTITUTION (12 Immutable Laws)

LAW 1: DATA-FIRST — No thesis without evidence from data. NO DATA = NO GO.
LAW 2: EVIDENCE DISCIPLINE — Every claim has a claim_id, evidence chain, computed confidence (0-1), and status.
LAW 3: GATES BLOCK — 32 quality gates are hard stops. Fix first, re-gate, then continue.
LAW 4: REVIEWER 2 IS CO-PILOT — R2 can VETO, REDIRECT, FORCE re-investigation. Non-negotiable.
LAW 5: SERENDIPITY IS THE MISSION — Hunt for the unexpected at every cycle. Score >= 10 → QUEUE. >= 15 → INTERRUPT.
LAW 6: ARTIFACTS OVER PROSE — If a step can produce a file, it MUST.
LAW 7: FRESH CONTEXT RESILIENCE — Resumable from STATE.md + TREE-STATE.json + DB snapshots. All context lives in files and DB, never in chat history.
LAW 8: EXPLORE BEFORE EXPLOIT — Min 3 draft nodes before promotion. Exploration ratio >= 20%.
LAW 9: CONFOUNDER HARNESS — Every quantitative claim: raw → conditioned → matched. Sign change = ARTIFACT. Collapse >50% = CONFOUNDED. Survives = ROBUST. NO HARNESS = NO CLAIM.
LAW 10: CRYSTALLIZE OR LOSE — Every result written to file. Context window is a buffer, not memory.
LAW 11: LISTEN TO THE USER — When the user corrects direction, follow immediately. No arguing, no continuing on previous path. Three ignored corrections = session failure.
LAW 12: INSTINCT — Learned patterns from past sessions inform current behavior. Instincts are weighted suggestions (confidence 0.3-0.9) that decay with time (-0.02/week) and can be overridden by contradicting evidence. An instinct below 0.2 confidence is archived.
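LAW 9's three-pass harness reduces to a small decision rule. A minimal sketch, assuming plain signed effect sizes; the function and its inputs are illustrative, not part of the skill's actual scripts:

```python
def harness_verdict(raw_effect: float, matched_effect: float) -> str:
    """Classify a quantitative claim per LAW 9.

    raw_effect:     signed effect size on the raw data
    matched_effect: signed effect size after conditioning + matching
    """
    # Sign reversal after matching: the raw signal was an artifact.
    if raw_effect * matched_effect < 0:
        return "ARTIFACT"
    # More than half of the effect vanished: confounders carried it.
    if abs(matched_effect) < 0.5 * abs(raw_effect):
        return "CONFOUNDED"
    return "ROBUST"
```

For example, `harness_verdict(0.8, -0.1)` returns `"ARTIFACT"` — the sign-reversal failure mode described in the sprint history above.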

Full text + role constraints: references/constitution.md


v6.0 INNOVATIONS (over v5.5)

| Innovation | What | Reference |
| --- | --- | --- |
| Hook-Based Enforcement | 7 hooks (SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, Stop, PreCompact, SubagentStop) enforce laws mechanically | references/hook-system.md |
| Cross-Session Learning | Pattern extraction at session end: gate failure clusters, repeated actions, claim lifecycle patterns | references/pattern-extraction.md |
| Instinct Model | ECC-inspired learned behaviors with confidence decay. Observed patterns auto-promote after 3 confirmations. | references/instinct-model.md |
| Temporal Decay R2 Calibration | R2 weakness tracking with exponential decay (weight = e^(-0.02 * weeks)). Recent reviews weigh more. | references/r2-calibration.md |
| PreCompact Context Resilience | Hook snapshots active research state to DB before context compaction | references/context-resilience.md |
| Agent Handoff Protocol | Formal Context/Findings/Files/Questions/Recommendations documents for agent-to-agent transfers | references/handoff-protocol.md |
| Progressive Context Building | SessionStart injects ~700 tokens: state, alerts, R2 calibration, patterns, pending seeds | references/hook-system.md |
| DB-Backed Research Spine | Dual storage: SPINE.md file + spine_entries DB table. Embedding queue for semantic recall. | references/research-spine.md |
| Claude Code Multi-Agent | Task tool delegation with native BFP. Model tiers: opus/sonnet/haiku per role. | references/multi-agent-config.md |

Retained from v5.5

| Innovation | What | Reference |
| --- | --- | --- |
| Data Quality Gates (DQ1-DQ4) | 4 gates at pipeline phases: post-extraction, post-training, post-calibration, post-finding | references/dq-gates.md |
| R2 INLINE Mode | 7-point checklist per finding at formulation time (does not replace FORCED) | references/reviewer2-ensemble.md |
| Research Spine | Mandatory structured logbook entry every CRYSTALLIZE. Not optional, not retroactive. | references/research-spine.md |
| Single Source of Truth (SSOT) | All numbers originate from structured data files. No manual transcription. | references/ssot.md |
| Silent Observer | Parallel sub-agent scanning for orphans, desync, drift, naming issues | references/silent-observer.md |
| Data Dictionary Gate (DD0) | Document every dataset column before using it. Column names lie. | references/data-dictionary.md |
| Design Compliance Gate (DC0) | Execution must match research design. Deviations documented. | references/design-compliance.md |
| Literature Pre-Check (L-1) | Prior art search BEFORE committing to any direction. | references/literature-precheck.md |
| Enforcement Scripts | Python scripts for deterministic gate checks (non-bypassable) | scripts/ |

MULTI-AGENT ARCHITECTURE

| Role | Model | Reasoning | Purpose | When to Spawn |
| --- | --- | --- | --- | --- |
| Researcher | claude-opus-4-6 | high | Build, explore, execute OTAE cycles | Main agent (always active) |
| R2-DEEP | claude-opus-4-6 | high | FORCED/BATCH/BRAINSTORM reviews. Separate context = native BFP. | Major finding, stage transition, confidence explosion |
| R2-INLINE | claude-sonnet-4-6 | medium | 7-point checklist per finding. Fast, lightweight. | Every finding formulation |
| OBSERVER | claude-haiku-4-5 | low | Read-only scans: orphans, desync, drift, naming | Every 5 cycles or on demand |
| EXPLORER | claude-sonnet-4-6 | medium | Parallel tree branches, literature search | When branching exploration needed |
| R3-JUDGE | claude-opus-4-6 | high | Meta-review of R2's reports (6-dimension rubric) | J0 gate |
| INSTINCT-SCANNER | claude-haiku-4-5 | low | Scan for recurring patterns across sessions | Session end (Stop hook) |

R2-DEEP as sub-agent (via Task tool) means it has NO access to the researcher's reasoning. It sees ONLY claims and evidence. This is architecturally superior to same-agent role-play.

Full config: references/multi-agent-config.md · Role definitions: AGENTS.md


SESSION INITIALIZATION

Banner

VIBE SCIENCE v6.0 NEXUS — Observe · Recall · Operate
HOOKS → SFI → BFP → R2 ENSEMBLE → V0/J0 → GATES (32 total, 8 schema-enforced)
SERENDIPITY RADAR · RESEARCH SPINE · OBSERVER · DQ1-DQ4
PATTERNS · INSTINCTS · TEMPORAL DECAY · HANDOFF PROTOCOL
Detect · Persist · Demolish · Discover · Learn

Hook-Based Context Injection

At session start, the SessionStart hook automatically provides:

  1. [STATE] — Last session summary (actions, claims created/killed)
  2. [ALERTS] — Unresolved observer alerts
  3. [R2 CALIBRATION] — Temporal decay hints about R2's historical weaknesses
  4. [PATTERNS] — Cross-session learned patterns with confidence scores
  5. [PENDING SEEDS] — Serendipity seeds from prior sessions awaiting triage

This context is injected into the agent's system prompt (~700 tokens). No manual loading required.

If `.vibe-science/` exists → RESUME

  1. Read STATE.md, TREE-STATE.json, last 20 lines of PROGRESS.md
  2. Read CLAIM-LEDGER.md frontmatter, SPINE.md last entry
  3. Check pending: R2 demands, gate failures, debug nodes, Observer alerts
  4. Check injected context: patterns, instincts, R2 calibration hints
  5. Resume from "Next Action" in STATE.md
  6. Announce: "Resuming RQ-XXX, cycle N, stage S. Tree: X nodes (Y good). Next: [Z]."

If `.vibe-science/` does NOT exist → INITIALIZE

  1. → Phase 0: SCIENTIFIC BRAINSTORM (mandatory)
  2. Gate B0 must PASS before any OTAE cycle
  3. Create folder structure, populate STATE.md, PROGRESS.md, TREE-STATE.json, SPINE.md

Post-Compaction Recovery

If the context was compacted (auto or manual), the PreCompact hook saved a snapshot to DB:

  • Active claims, pending seeds, spine entry count, STATE.md content
  • Recovery: SessionStart loads last snapshot → agent has enough context to continue

Full protocol: references/context-resilience.md


PHASE 0: SCIENTIFIC BRAINSTORM (Before Everything)

Not optional. Not skippable.

  1. UNDERSTAND — Domain, interests, constraints (ask user, one question at a time)
  2. LANDSCAPE — Rapid literature scan (last 3-5 years), field mapping, open debates
  3. GAPS — Blue ocean hunting: cross-domain analogies, assumption reversal, scale shifting, contradiction hunting
  4. DATA — Reality check: does data exist? Score DATA_AVAILABLE (0-1). LAW 1:
    NO DATA = NO GO
  5. HYPOTHESES — Generate 3-5 testable, falsifiable hypotheses with null hypotheses and predictions
  6. TRIAGE — Score: impact x feasibility x novelty x data readiness x serendipity potential (/25)
  7. R2 REVIEW — Reviewer 2 challenges direction (BLOCKING: must WEAK_ACCEPT)
  8. COMMIT — Lock RQ.md with: question, hypothesis, predictions, success/kill conditions

Gate B0: 3+ gaps with evidence, data confirmed (>= 0.5), falsifiable hypothesis, R2 WEAK_ACCEPT, user approved.

Full protocol: references/brainstorm-engine.md


OTAE-TREE LOOP

OBSERVE → THINK → ACT → EVALUATE → CHECKPOINT → CRYSTALLIZE → loop

Each cycle: ONE meaningful action. Each tree node = one OTAE cycle.

| Phase | Actions | v5.5 Insertions | v6.0 Hooks |
| --- | --- | --- | --- |
| OBSERVE | Read STATE.md + TREE-STATE.json. Check pending gates, R2 demands, debug nodes. | Check Observer alerts. Check SPINE.md last entry. | SessionStart injects context: state, alerts, R2 calibration, patterns, seeds. |
| THINK | Select next node or action. Plan: search, analyze, extract, compute, experiment. | [DD0] If new data: document all columns before use. [L-1] If new direction: literature pre-check. | Check instincts: any learned patterns relevant to current plan? |
| ACT | Execute planned action. Produce artifacts. Debug if buggy (max 3, then prune). | [DQ1] After extraction. [DQ2] After training. [DQ3] After calibration. | PostToolUse auto-logs spine entries, runs observer checks. |
| EVALUATE | Extract claims → CLAIM-LEDGER. Score confidence. Parse metrics. Detect serendipity. | [DQ4] Every finding: numbers match source. [R2 INLINE] 7-point checklist per finding. | Check instincts for relevant patterns. Update pattern confidence. |
| CHECKPOINT | Stage gate (S1-S5). R2 co-pilot (FORCED/BATCH/SHADOW). Serendipity radar. Stop conditions. | [DC0] At stage transitions: design compliance check. | R2 calibration hints inform review priorities. |
| CRYSTALLIZE | Update STATE.md, TREE-STATE.json, PROGRESS.md, CLAIM-LEDGER.md. | [SPINE] Mandatory structured entry. [SSOT] Run sync_check.py. | Stop hook generates narrative, exports STATE.md, extracts patterns. |

v5.0 FORCED Review Path

SFI injection → BFP Phase 1 (blind) → Full review Phase 2 → V0 gate → R3/J0 gate → Schema validation → Normal gate evaluation.

Tree Structure

Tree modes: LINEAR (literature), BRANCHING (experiments), HYBRID (both). Tree search selects next node by confidence + metrics. Each node = one OTAE cycle.

Full protocol: references/loop-otae.md · Tree search: references/tree-search.md


HOOKS ENFORCEMENT

7 hooks enforce the laws mechanically. They run as Node.js scripts triggered by Claude Code events.

| Hook | Event | What It Does | Laws Enforced |
| --- | --- | --- | --- |
| SessionStart | Session begins | Opens DB, creates session, builds progressive context (~700 tokens), loads R2 calibration + patterns + seeds | LAW 7 (resilience), LAW 12 (instinct) |
| UserPromptSubmit | Before each prompt | Identifies agent role, logs prompt hash, performs semantic recall via vector search | LAW 10 (crystallize), LAW 7 (resilience) |
| PostToolUse | After every tool | Gate enforcement (DQ4, CLAIM-LEDGER prerequisites, L-1), permission checks, auto-logging spine entries, observer checks | LAW 3 (gates), LAW 6 (artifacts), LAW 10 (crystallize) |
| Stop | Session ending | Narrative summary, blocks stop if unreviewed claims exist, exports STATE.md, extracts patterns | LAW 4 (R2 co-pilot), LAW 7 (resilience), LAW 12 (instinct) |
| PreCompact | Before compaction | Snapshots active claims, pending seeds, spine count, STATE.md to DB | LAW 7 (resilience), LAW 10 (crystallize) |
| PreToolUse | Before Write/Edit tool | Blocks CLAIM-LEDGER modifications without confounder_status field (regex matcher) | LAW 9 (confounder harness) |
| SubagentStop | Subagent finishes | Checks killed claims have serendipity seeds (Salvagente Rule) | LAW 4 (R2 co-pilot), LAW 5 (serendipity) |

All hooks degrade gracefully if the DB is unavailable. They never hard-crash.
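To illustrate how mechanical these checks are, the PreToolUse rule can be approximated in a few lines. The real hook is a Node.js script and the exact ledger syntax may differ; this Python sketch assumes hypothetical `claim_id:` / `confounder_status:` key-value lines:

```python
import re

def pretooluse_guard(file_path: str, new_content: str) -> dict:
    """Illustrative port of the PreToolUse rule: block CLAIM-LEDGER
    writes that add claims without a confounder_status field (LAW 9)."""
    if "CLAIM-LEDGER" not in file_path:
        return {"decision": "allow"}
    claims = re.findall(r"^claim_id:", new_content, re.MULTILINE)
    statuses = re.findall(r"^confounder_status:", new_content, re.MULTILINE)
    if len(statuses) < len(claims):
        return {"decision": "block",
                "reason": "LAW 9: NO HARNESS = NO CLAIM — add confounder_status"}
    return {"decision": "allow"}
```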

Full protocol: references/hook-system.md


CROSS-SESSION LEARNING

Pattern Extraction (at session end)

The Stop hook extracts recurring patterns from cross-session data:

  1. GATE_FAILURE_CLUSTER — Same gate failing across 2+ sessions → pattern (e.g., "DQ1 fails when zero-variance columns present")
  2. REPEATED_ACTION — Same action+input appearing across 2+ sessions → pattern (e.g., "Same bug fix applied 3 times")
  3. CLAIM_LIFECYCLE — Claims killed for same reason across sessions → pattern (e.g., "Confounders kill first quantitative claim every session")

Patterns are stored in the `research_patterns` DB table with confidence scores. At session start, active patterns are surfaced in the `[PATTERNS]` context block.

Full protocol: references/pattern-extraction.md

Instinct Model (learned behaviors)

Inspired by the ECC instinct system. Atomic behavior patterns with confidence:

  • Observation (0.3): Pattern noticed once
  • Pattern (0.5): Observed 3+ times
  • Instinct (0.7): Confirmed by evidence
  • Strong Instinct (0.9): Never contradicted

Decay: -0.02/week (exponential). Instincts below 0.2 are archived.

Scope: project (this RQ) or global (all RQs).
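The decay can be sketched in two lines, assuming the same exponential form as the R2 calibration weight below (the document states "-0.02/week (exponential)" without the exact formula):

```python
import math

def decayed_confidence(confidence: float, weeks_unconfirmed: float) -> float:
    """Apply the documented -0.02/week decay, assuming the same
    exponential form as the R2 calibration weight."""
    return confidence * math.exp(-0.02 * weeks_unconfirmed)

# A Pattern (0.5) left unconfirmed for a year decays past the 0.2
# archive threshold:
print(round(decayed_confidence(0.5, 52), 3))  # 0.177
```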

Full protocol: references/instinct-model.md

R2 Calibration with Temporal Decay

R2's historical weaknesses are tracked with exponential temporal decay:

weight = exp(-0.02 * ageWeeks)

A weakness from 50 weeks ago contributes only ~37% of its original weight. This prevents stale calibration data from persisting indefinitely.

SessionStart injects calibration hints like: "R2 historically weak on 'batch_effect_check' (decay-weighted score: 2.3). High priority."
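A minimal sketch of the aggregation, assuming weakness observations are stored as (severity, age-in-weeks) pairs — a hypothetical shape for what the skill keeps in its DB:

```python
import math

def decay_weighted_score(events):
    """Sum R2 weakness observations, each down-weighted by age
    using the documented weight = exp(-0.02 * ageWeeks)."""
    return sum(sev * math.exp(-0.02 * weeks) for sev, weeks in events)

# Three misses of the same check at 1, 10 and 50 weeks old:
score = decay_weighted_score([(1.0, 1), (1.0, 10), (1.0, 50)])
print(round(score, 2))  # 2.17 — the 50-week-old miss contributes only ~0.37
```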

Full protocol: references/r2-calibration.md


AGENT HANDOFF PROTOCOL

When transferring work between agents (R2 returning verdict, Explorer reporting branch, stage transitions), use formal handoff documents:

## HANDOFF: [Source Agent] → [Target Agent]
### Context
What was being done, which RQ, which stage, which cycle.
### Findings
Key results, claims affected, metrics.
### Files Modified
File paths with line ranges.
### Open Questions
Unresolved issues requiring attention.
### Recommendations
Suggested next steps.

This prevents context loss during agent transitions and satisfies LAW 7 (resilience) + LAW 10 (crystallize).

Full protocol: references/handoff-protocol.md


5-STAGE EXPERIMENT MANAGER

| Stage | Name | Goal | Max Iter | Gate |
| --- | --- | --- | --- | --- |
| 1 | Preliminary Investigation | First working experiment or initial scan | 20 | S1: >= 1 good node |
| 2 | Hyperparameter Tuning | Optimize best approach | 12 | S2: metric improved, 2+ configs |
| 3 | Research Agenda | Explore creative variants | 12 | S3: all sub-experiments attempted |
| 4 | Ablation & Validation | Validate each component + multi-seed | 18 | S4: all ablated, contributions quantified |
| 5 | Synthesis & Review | Final R2 ensemble + conclusion | 5 | S5: R2 ACCEPT + D2 PASS + all VERIFIED |

Full protocol: references/experiment-manager.md


REVIEWER 2 CO-PILOT

4 domain-agnostic reviewers: R2-Methods, R2-Stats, R2-Domain, R2-Engineering.

7 activation modes:

| Mode | Trigger | Blocking? | Sub-Agent? |
| --- | --- | --- | --- |
| BRAINSTORM | Phase 0 completion | YES — must WEAK_ACCEPT | R2-DEEP |
| FORCED | Major finding, stage transition, pivot, confidence explosion (>0.30/2cyc) | YES | R2-DEEP (SFI+BFP+V0+J0) |
| BATCH | 3 minor findings accumulated | YES | R2-DEEP |
| SHADOW | Every 3 cycles automatically | NO — can ESCALATE to FORCED | R2-DEEP |
| VETO | R2 spots fatal flaw | YES — cannot be overridden except by human | R2-DEEP |
| REDIRECT | R2 identifies better direction | Soft — user chooses | R2-DEEP |
| INLINE | Every finding at formulation time | NO — advisory, but logged | R2-INLINE (sonnet) |

R2 INLINE 7-Point Checklist (v5.5+)

For every finding, before recording in CLAIM-LEDGER:

  1. Numbers match source data? (SSOT)
  2. Sample size adequate and reported?
  3. Alternative explanations considered?
  4. Prior art checked? (not rediscovering known result)
  5. Confounder risk identified? (even if full harness not yet run)
  6. Reproducible? (seed, parameters, data path documented)
  7. Terminology consistent across documents?

R2 Behavioral Requirements

  • ASSUME every claim is wrong
  • SEARCH for prior art, contradictions, artifacts
  • DEMAND confounder harness for every quantitative claim (LAW 9)
  • REFUSE premature closure — minimum 3 falsification attempts per major claim
  • ESCALATE, never soften — each pass MORE demanding
  • SALVAGENTE: When killing a claim, R2 MUST produce a serendipity seed
  • CALIBRATE (v6.0): Check temporal decay hints from SessionStart context. Prioritize historically weak areas.

Full ensemble protocol: references/reviewer2-ensemble.md


SERENDIPITY RADAR

Three-part process: DETECTION → PERSISTENCE → VALIDATION.

Detection (every EVALUATE): 5 scans — anomalies, cross-branch patterns, contradictions, assumption drift, unexpected metrics.

Response: Score >= 10 → QUEUE. Score >= 15 → INTERRUPT (create serendipity node). Unaddressed flag after 5 cycles → ESCALATED.

Salvagente (v5.0): When R2 kills a claim (INSUFFICIENT/CONFOUNDED/PREMATURE), R2 MUST produce a serendipity seed (schema-validated).

Cross-Session Survival (v6.0): Pending seeds are stored in DB and loaded at session start. Seeds that survive across sessions get priority triage.
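The thresholds above amount to a small routing function. A sketch — the escalation bookkeeping is simplified, since the real radar tracks flags across cycles:

```python
def route_seed(score: float, cycles_unaddressed: int = 0) -> str:
    """Route a serendipity observation per the radar thresholds."""
    if score >= 15:
        return "INTERRUPT"   # create a serendipity node immediately
    if cycles_unaddressed >= 5:
        return "ESCALATED"   # flagged but ignored for 5+ cycles
    if score >= 10:
        return "QUEUE"
    return "LOG"
```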

Full protocol: references/serendipity-engine.md


GATES (32 Total)

| Category | Gates | Count | Schema-Enforced |
| --- | --- | --- | --- |
| Pipeline | G0-G6 | 7 | — |
| Literature | L-1, L0-L2 | 4 | L0 (source-validity), L2 (review-completeness) |
| Decision | D0-D2 | 3 | D1 (claim-promotion), D2 (rq-conclusion) |
| Tree | T0-T3 | 4 | — |
| Brainstorm | B0 | 1 | B0 (brainstorm-quality) |
| Stage | S1-S5 | 5 | S4 (stage4-exit), S5 (stage5-exit) |
| Data Quality | DQ1-DQ4 | 4 | — |
| Data Dictionary | DD0 | 1 | — |
| Design Compliance | DC0 | 1 | — |
| Vigilance | V0 | 1 | V0 (vigilance-check) |
| Judge | J0 | 1 | — |
| Total | | 32 | 8 schema-enforced |

Key Gate Summaries

  • G0: Input sanity — data exists, format correct, no corruption
  • G1: Schema compliance — data schema matches expectation
  • DQ1: Post-extraction — no zero-variance, no leakage, cross-checks match
  • DQ2: Post-training — outperforms baseline, no single-feature dominance, stable folds
  • DQ3: Post-calibration — plausible range, not suspiciously perfect, adequate sample
  • DQ4: Post-finding — numbers match source JSON, sample size reported, alternatives listed
  • DD0: Data dictionary — all columns documented before use
  • DC0: Design compliance — execution matches research design
  • L-1: Literature pre-check — prior art searched before committing direction
  • V0: Vigilance — SFI faults caught (RMS >= 0.80, FAR <= 0.10)
  • J0: Judge — R3 meta-review score >= 12/18, no dimension = 0

Full gate definitions: references/gates-complete.md · DQ gate protocol: references/dq-gates.md


ENFORCEMENT SCRIPTS

Python scripts for deterministic checks. Exit code 0 = PASS, non-zero = FAIL. Non-bypassable.

| Script | Purpose | CLI Example |
| --- | --- | --- |
| dq_gate.py | DQ1-DQ4 data quality checks | `python scripts/dq_gate.py --gate DQ1 --data data.json` |
| sync_check.py | SSOT: numbers in markdown match JSON source | `python scripts/sync_check.py --json results.json --md FINDINGS.md` |
| tree_health.py | T3 gate: exploration ratio, good/total ratio | `python scripts/tree_health.py --tree TREE-STATE.json` |
| gate_check.py | Generic gate: validate artifact against JSON Schema | `python scripts/gate_check.py --gate B0 --artifact out.json --schema schemas/brainstorm-quality.schema.json` |
| spine_entry.py | Create/validate Research Spine entries | `python scripts/spine_entry.py --spine SPINE.md --type DATA_LOAD --action "Loaded dataset"` |
| observer.py | Observer checks: orphans, desync, drift, naming | `python scripts/observer.py --project .vibe-science/` |

All scripts: Python 3.8+, stdlib only (no external dependencies). Domain-configurable via `--config domain-config.yaml`.
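Chaining these scripts from another program follows directly from the exit-code convention. A sketch using a stand-in command (the real invocation would be one of the CLI examples above):

```python
import json
import subprocess
import sys

def run_gate(cmd):
    """Run an enforcement script; exit code 0 = PASS, non-zero = FAIL.
    The JSON report is read from stdout either way."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    report = json.loads(proc.stdout)
    report["passed"] = proc.returncode == 0
    return report

# Stand-in that mimics a gate script's stdout; a real call would be e.g.
# ["python", "scripts/dq_gate.py", "--gate", "DQ1", "--data", "data.json"].
fake = [sys.executable, "-c", 'print(\'{"gate": "DQ1", "status": "PASS"}\')']
print(run_gate(fake))
```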

Script Output Format (all scripts return JSON to stdout)

`dq_gate.py` — returns `{"gate": "DQ1", "status": "PASS"|"FAIL", "checks": [{"check": "zero_variance", "passed": true, "detail": "OK", "flagged": []}]}`. Each gate runs 4-5 named checks. Thresholds configurable via `--config`.

`sync_check.py` — returns `{"status": "PASS"|"FAIL", "total_numbers_in_markdown": 12, "matched": 12, "mismatched": 0, "tolerance": 0.001, "mismatches": [...]}`. Skips dates, claim IDs, gate names. Divides percentages by 100.
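The core matching rule can be sketched as follows. This is a toy version: the real script also skips dates, claim IDs and gate names, which this sketch does not:

```python
import re

def numbers_match(markdown, source_values, tolerance=0.001):
    """Toy version of sync_check.py's core rule: every number quoted
    in prose must match some value in the JSON source within tolerance.
    Percentages are divided by 100 before matching."""
    for m in re.finditer(r"(\d+(?:\.\d+)?)(%?)", markdown):
        value = float(m.group(1))
        if m.group(2) == "%":
            value /= 100
        if not any(abs(value - s) <= tolerance for s in source_values):
            return False
    return True

print(numbers_match("AUC was 0.914 (91.4%).", [0.914]))  # True
```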

`tree_health.py` — returns `{"gate": "T3", "status": "PASS"|"FAIL", "checks": [...]}`. Checks: `good_ratio` (>=0.20), `exploration_ratio` (>=0.20), `no_stale_branches` (5+ non-improving = stale), `branch_diversity` (>=2 branches, skipped in LINEAR mode).
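The two ratio checks translate directly to counts over the serialized tree. A sketch over a list of node dicts — the `status` field and its values are assumptions, not the actual TREE-STATE.json schema:

```python
def t3_gate(nodes):
    """Toy T3 check: good_ratio and exploration_ratio both >= 0.20."""
    total = len(nodes)
    good = sum(1 for n in nodes if n["status"] == "good")
    draft = sum(1 for n in nodes if n["status"] == "draft")
    checks = {
        "good_ratio": total > 0 and good / total >= 0.20,
        "exploration_ratio": total > 0 and draft / total >= 0.20,
    }
    return {"gate": "T3",
            "status": "PASS" if all(checks.values()) else "FAIL",
            "checks": checks}
```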

`gate_check.py` — returns `{"gate": "B0", "status": "PASS"|"FAIL", "schema_file": "...", "artifact_file": "...", "errors": [...], "error_count": 0}`. Lightweight validator (no jsonschema lib).

`spine_entry.py` — returns `{"status": "PASS"|"FAIL", "type": "DATA_LOAD", "action": "...", "entry": "### ..."}`. Creates SPINE.md if missing. Use `--validate-only` to check without writing.

`observer.py` — returns `{"status": "OK"|"WARN"|"HALT", "total_alerts": 0, "halt_count": 0, "warn_count": 0, "info_count": 0, "alerts": [...]}`. Exit 1 only on HALT or missing project dir.


FOLDER STRUCTURE

.vibe-science/
├── STATE.md                    # Current state (max 100 lines, rewritten each cycle)
├── PROGRESS.md                 # Append-only log
├── CLAIM-LEDGER.md             # All claims with evidence + confidence
├── SPINE.md                    # Research Spine (structured logbook)
├── ASSUMPTION-REGISTER.md      # All assumptions with risk
├── SERENDIPITY.md              # Unexpected discovery log
├── TREE-STATE.json             # Full tree serialization
├── KNOWLEDGE/                  # Cross-RQ accumulated knowledge
└── RQ-001-[slug]/              # Per Research Question
    ├── RQ.md                   # Question, hypothesis, criteria, kill conditions
    ├── 00-brainstorm/          # Phase 0 outputs
    ├── 01-discovery/           # Literature phase
    ├── 02-analysis/            # Analysis phase
    ├── 03-data/                # Data extraction + validation
    ├── 04-validation/          # Numerical validation
    ├── 05-reviewer2/           # R2 reviews
    ├── 06-runs/                # Run bundles
    ├── 07-audit/               # Decision log + snapshots
    ├── 08-tree/                # Tree search artifacts
    └── 09-writeup/             # Paper drafting

STOP CONDITIONS (checked every cycle)

  1. SUCCESS — All criteria satisfied + all findings R2-approved → Stage 5 → Final R2 → EXIT
  2. NEGATIVE RESULT — Hypothesis disproven or data unavailable → EXIT with documented negative
  3. SERENDIPITY PIVOT — Score >= 15 → triage → create new RQ or queue
  4. DIMINISHING RETURNS — cycles > 15 AND new_finding_rate < 1/3 → WARN → 3 targeted cycles or pivot
  5. DEAD END — All avenues exhausted → EXIT with what was learned
  6. TREE COLLAPSE — T3 fails AND no pending debug → R2 emergency review → pivot or conclude
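Condition 4 is the only numeric one and reduces to a predicate. In this sketch the six-cycle window is an assumption — the spec only fixes the 15-cycle floor and the 1/3 rate:

```python
def diminishing_returns(cycle, findings_in_window, window=6):
    """Stop condition 4: past cycle 15 with fewer than one new
    finding per three cycles."""
    return cycle > 15 and (findings_in_window / window) < (1 / 3)

print(diminishing_returns(18, 1))  # True: 1 finding in 6 cycles
```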

RESOURCE ROUTING TABLE

Load ONLY when needed. Never load all at once.

| Resource | Path | When to Load |
| --- | --- | --- |
| Constitution | references/constitution.md | Full law text needed |
| Brainstorm Engine | references/brainstorm-engine.md | Phase 0 |
| OTAE Loop | references/loop-otae.md | First cycle or complex routing |
| Tree Search | references/tree-search.md | THINK-experiment / tree init |
| Experiment Manager | references/experiment-manager.md | Stage transitions |
| Auto-Experiment | references/auto-experiment.md | ACT-experiment |
| Evidence Engine | references/evidence-engine.md | EVALUATE phase |
| R2 Ensemble | references/reviewer2-ensemble.md | CHECKPOINT-r2 |
| Search Protocol | references/search-protocol.md | ACT-search |
| Serendipity Engine | references/serendipity-engine.md | THINK-brainstorm / CHECKPOINT |
| Knowledge Base | references/knowledge-base.md | Session init / RQ conclusion |
| Data Extraction | references/data-extraction.md | ACT-extract |
| Writeup Engine | references/writeup-engine.md | Stage 5 |
| Audit | references/audit-reproducibility.md | Run manifests |
| All Gates | references/gates-complete.md | EVALUATE phase |
| DQ Gates | references/dq-gates.md | DQ1-DQ4 checks |
| Data Dictionary | references/data-dictionary.md | DD0 — new data |
| Design Compliance | references/design-compliance.md | DC0 — stage transitions |
| Literature Pre-Check | references/literature-precheck.md | L-1 — new directions |
| Research Spine | references/research-spine.md | CRYSTALLIZE |
| SSOT Protocol | references/ssot.md | CRYSTALLIZE |
| Silent Observer | references/silent-observer.md | Observer checks |
| Multi-Agent Config | references/multi-agent-config.md | Session init |
| SFI Protocol | references/seeded-fault-injection.md | FORCED R2 reviews |
| Judge Agent | references/judge-agent.md | J0 gate |
| BFP Protocol | references/blind-first-pass.md | FORCED R2 reviews |
| Schema Validation | references/schema-validation.md | Gate validation |
| Circuit Breaker | references/circuit-breaker.md | R2 deadlocks |
| Hook System | references/hook-system.md | Understanding enforcement |
| Pattern Extraction | references/pattern-extraction.md | Session end / pattern review |
| R2 Calibration | references/r2-calibration.md | R2 review priorities |
| Handoff Protocol | references/handoff-protocol.md | Agent transitions |
| Instinct Model | references/instinct-model.md | Pattern management |
| Context Resilience | references/context-resilience.md | Recovery after compaction |
| Node Schema | assets/node-schema.md | Tree mode init |
| Stage Prompts | assets/stage-prompts.md | Stage-specific generation |
| Metric Parser | assets/metric-parser.md | ACT-experiment |
| Templates | assets/templates.md | CRYSTALLIZE / session init / handoffs |
| Domain Config | assets/domain-config-example.yaml | Domain-specific thresholds |
| Schemas | assets/schemas/*.schema.json | Gate validation |

DEVIATION RULES

| Situation | Action |
| --- | --- |
| Search query typo | AUTO-FIX silently, log |
| Missing database in search | ADD database, log, continue |
| Minor finding | ACCUMULATE — batch review at 3 |
| Major finding | GATE — stop → verification → R2 FORCED |
| Serendipity observation | LOG+TRIAGE → serendipity-engine |
| Cross-branch pattern | SERENDIPITY — score → if >= 15: INTERRUPT — create node |
| Dead end on current path | PIVOT — document → try alternative → escalate if none |
| No data available | STOP — LAW 1: NO DATA = NO GO |
| Confidence explosion (>0.30/2cyc) | FORCED R2 — possible confirmation bias |
| Node buggy 3 times | PRUNE — mark pruned, select next |
| Tree health T3 fails | EMERGENCY — R2 review → strategy revision |
| Stage gate fails | BLOCK — fix, re-gate, advance |
| User corrects direction | OBEY — LAW 11: follow immediately, no argument |
| Architectural change needed | ASK HUMAN — strategic decisions need human input |
| Cross-session pattern detected | INSTINCT — check confidence → if >= 0.5: apply; if < 0.5: investigate |
| Context compacted | RECOVER — load PreCompact snapshot from DB, resume from STATE.md |