Auto-claude-code-research-in-sleep result-to-claim
Use when experiments complete to judge which claims the results support, which they do not, and what evidence is still missing. A secondary Codex agent evaluates results against the intended claims and routes to the next action (pivot, supplement, or confirm). Run it after experiments finish, before writing the paper or running ablations.
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep
T=$(mktemp -d) && git clone --depth=1 https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/skills-codex/result-to-claim" ~/.claude/skills/wanshuiyin-auto-claude-code-research-in-sleep-result-to-claim-69fda9 && rm -rf "$T"
skills/skills-codex/result-to-claim/SKILL.md

Result-to-Claim Gate
Experiments produce numbers; this gate decides what those numbers mean. Collect results from available sources, get an objective judgment, then route based on the verdict.
Context: $ARGUMENTS
Constants
- REVIEWER_MODEL = gpt-5.4
- Used via a secondary Codex agent for objective claim assessment.
When to Use
- After a set of experiments completes (main results, not just sanity checks)
- Before committing to claims in a paper or review response
- When results are ambiguous and you need an objective second opinion
Workflow
Step 1: Collect Results
Gather experiment data from whatever sources are available in the project:
- W&B (preferred): `wandb.Api().run("<entity>/<project>/<run_id>").history()` - metrics, training curves, comparisons
- `EXPERIMENT_LOG.md` - full results table with baselines and verdicts
- `EXPERIMENT_TRACKER.md` - check which experiments are done vs still running
- Log files - `ssh server "tail -100 /path/to/training.log"` if no other sources
- `docs/research_contract.md` or project notes - intended claims and experiment design
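The W&B pull and baseline comparison above can be sketched in Python. `fetch_run_summary` and `metric_deltas` are illustrative helper names, not part of the skill, and the run path and metric names are placeholders:

```python
def fetch_run_summary(run_path: str) -> dict:
    """Final metric values for a finished W&B run (requires a wandb login)."""
    import wandb  # lazy import so the offline helper below works without wandb installed
    run = wandb.Api().run(run_path)  # run_path like "<entity>/<project>/<run_id>"
    return dict(run.summary)


def metric_deltas(ours: dict, baseline: dict) -> dict:
    """Delta vs. baseline for every metric both sides report."""
    return {k: round(ours[k] - baseline[k], 4) for k in ours.keys() & baseline.keys()}
```

Computing deltas only over the shared keys keeps the comparison honest when a baseline reports fewer metrics than the new run; `dict(run.summary)` assumes the usual wandb behavior of exposing final metric values on `run.summary`.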
Assemble the key information:
- What experiments were run (method, dataset, config)
- Main metrics and baseline comparisons (deltas)
- The intended claim these experiments were designed to test
- Any known confounds or caveats
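One way to carry that assembled information into Step 2 is a small container whose fields mirror the evaluation prompt. `ResultBundle` is a hypothetical helper, not part of the skill:

```python
from dataclasses import dataclass, field


@dataclass
class ResultBundle:
    """Evidence bundle handed to the reviewer agent."""
    claim: str             # the intended claim these experiments test
    experiments: list      # method / dataset / config per experiment
    results: dict          # main metrics and comparison deltas
    baselines: dict        # baseline numbers (reproduced or from paper)
    caveats: list = field(default_factory=list)  # known confounds, gaps

    def to_prompt(self) -> str:
        """Render the bundle as the body of the evaluation message."""
        return "\n".join([
            "RESULT-TO-CLAIM EVALUATION",
            f"Intended claim: {self.claim}",
            f"Experiments run: {'; '.join(self.experiments)}",
            f"Results: {self.results}",
            f"Baselines: {self.baselines}",
            f"Known caveats: {'; '.join(self.caveats) or 'none'}",
        ])
```

Keeping caveats as an explicit field makes it harder to "forget" confounds when assembling the prompt.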
Step 2: Secondary Codex Judgment
Send the collected results to a secondary Codex agent for objective evaluation:
    spawn_agent:
      model: REVIEWER_MODEL
      reasoning_effort: xhigh
      message: |
        RESULT-TO-CLAIM EVALUATION

        I need you to judge whether experimental results support the intended claim.

        Intended claim: [the claim these experiments test]
        Experiments run: [list experiments with method, dataset, metrics]
        Results: [paste key numbers, comparison deltas, significance]
        Baselines: [baseline numbers and sources - reproduced or from paper]
        Known caveats: [any confounding factors, limited datasets, missing comparisons]

        Please evaluate:
        1. claim_supported: yes | partial | no
        2. what_results_support: what the data actually shows
        3. what_results_dont_support: where the data falls short of the claim
        4. missing_evidence: specific evidence gaps
        5. suggested_claim_revision: if the claim should be strengthened, weakened, or reframed
        6. next_experiments_needed: specific experiments to fill gaps (if any)
        7. confidence: high | medium | low

        Be honest. Do not inflate claims beyond what the data supports. A single
        positive result on one dataset does not support a general claim.
If delegation is unavailable, run the same evaluation locally and mark the verdict `[pending external review]` instead of blocking the pipeline.
Step 3: Parse and Normalize
Extract structured fields from the response:
- claim_supported: yes | partial | no
- what_results_support: "..."
- what_results_dont_support: "..."
- missing_evidence: "..."
- suggested_claim_revision: "..."
- next_experiments_needed: "..."
- confidence: high | medium | low
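A minimal extractor for those fields might look like this, assuming the reviewer answers in the numbered `field: value` format requested in Step 2 (`parse_verdict` is an illustrative name):

```python
import re

FIELDS = (
    "claim_supported", "what_results_support", "what_results_dont_support",
    "missing_evidence", "suggested_claim_revision", "next_experiments_needed",
    "confidence",
)


def parse_verdict(text: str) -> dict:
    """Pull known 'field: value' lines out of the reviewer's free-text reply."""
    out = {}
    for line in text.splitlines():
        # Accept optional leading list numbering, e.g. "1. claim_supported: partial"
        m = re.match(r"\s*(?:\d+\.\s*)?(\w+):\s*(.+)", line)
        if m and m.group(1) in FIELDS:
            out[m.group(1)] = m.group(2).strip()
    return out
```

Matching only against the known field names means stray prose in the reply is ignored rather than misparsed.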
Step 4: Route Based on Verdict
no - Claim not supported
- Record a postmortem in `findings.md`:
  - What was tested, what failed, and hypotheses for why
  - Constraints for future attempts (what not to try again)
- Update the project pipeline status in project notes
- Decide whether to pivot to the next idea from `IDEA_CANDIDATES.md` or try an alternative approach
partial - Claim partially supported
- Update the working claim to reflect what is supported
- Record the gap in `findings.md`
- Design and run supplementary experiments to fill evidence gaps
- Re-run `/result-to-claim` after supplementary experiments complete
- If the same claim gets multiple `partial` verdicts, record the analysis in `findings.md` and consider narrowing the claim scope or switching ideas
yes - Claim supported
- Record the confirmed claim in project notes
- If ablation studies are incomplete, trigger `/ablation-planner`
- If all evidence is in, move to paper writing
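The routing in Step 4, plus the low-confidence rule from the Rules section, can be condensed into one dispatch function. `route` and the action strings are illustrative, not part of the skill:

```python
def route(verdict: dict) -> str:
    """Map a normalized reviewer verdict to the next pipeline action."""
    if verdict.get("confidence") == "low":
        return "add-experiments"            # inconclusive: gather more evidence first
    supported = verdict.get("claim_supported")
    if supported == "no":
        return "postmortem-and-pivot"       # record findings.md, consider next idea
    if supported == "partial":
        return "supplementary-experiments"  # narrow the claim, fill gaps, re-run gate
    if supported == "yes":
        return "ablation-or-paper"          # trigger /ablation-planner or write up
    return "pending-external-review"        # no parseable verdict: do not commit
```

Checking confidence before the verdict enforces the rule that a low-confidence judgment is inconclusive regardless of its nominal answer.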
Rules
- The secondary Codex agent is the judge, not the local executor. The local executor collects evidence and routes; the reviewer agent evaluates. This prevents post-hoc rationalization.
- Do not inflate claims beyond what the data supports. If the verdict says `partial`, do not round up to `yes`.
- A single positive result on one dataset does not support a general claim. Be honest about scope.
- If `confidence` is low, treat the judgment as inconclusive and add experiments rather than committing to a claim.
- If reviewer delegation is unavailable, make the best local judgment you can and mark it `[pending external review]`.
- Always record the verdict and reasoning in `findings.md`, regardless of outcome.