Claude-skill-registry experiment-design-checklist

Generates a rigorous experiment design given a hypothesis. Use when asked to design experiments, plan experiments, create an experimental setup, or figure out how to test a research hypothesis. Covers controls, baselines, ablations, metrics, statistical tests, and compute estimates.

Install

Source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/experiment-design-checklist" ~/.claude/skills/majiayu000-claude-skill-registry-experiment-design-checklist && rm -rf "$T"

Manifest: skills/data/experiment-design-checklist/SKILL.md

Source content

Experiment Design Checklist

Prevent the "I ran experiments for 3 months and they're meaningless" disaster through rigorous upfront design.

The Core Principle

Before running ANY experiment, you should be able to answer:

  1. What specific claim will this experiment support or refute?
  2. What would convince a skeptical reviewer?
  3. What could go wrong that would invalidate the results?

Process

Step 1: State the Hypothesis Precisely

Convert your research question into falsifiable predictions:

Template:

If [intervention/method], then [measurable outcome], because [mechanism].

Examples:

  • "If we add auxiliary contrastive loss, then downstream task accuracy increases by >2%, because representations become more separable."
  • "If we use learned positional encodings, then performance on sequences >4096 tokens improves, because the model can extrapolate beyond training length."

Null hypothesis: What does "no effect" look like? This is what you're trying to reject.

Step 2: Identify Variables

Independent Variables (what you manipulate):

| Variable | Levels | Rationale |
|---|---|---|
| [Var 1] | [Level A, B, C] | [Why these levels] |

Dependent Variables (what you measure):

| Metric | How Measured | Why This Metric |
|---|---|---|
| [Metric 1] | [Procedure] | [Justification] |

Control Variables (what you hold constant):

| Variable | Fixed Value | Why Fixed |
|---|---|---|
| [Var 1] | [Value] | [Prevents confound X] |
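
One lightweight way to keep these three categories honest is to encode them in the run configuration itself, so a sweep varies only the independent variables while every control stays fixed. A minimal sketch (the variable names and levels are illustrative placeholders, not part of this checklist):

```python
# Hypothetical sketch: expand independent-variable levels into a run grid
# while every control variable stays fixed across runs.
from itertools import product

independent = {                      # what you manipulate
    "aux_loss_weight": [0.0, 0.1, 0.5],
    "encoder_depth": [6, 12],
}
controls = {                         # what you hold constant
    "optimizer": "adamw",
    "lr": 3e-4,
    "batch_size": 256,
}

runs = [
    {**controls, **dict(zip(independent.keys(), combo))}
    for combo in product(*independent.values())
]
print(len(runs), "configurations")   # 3 levels x 2 levels = 6 runs
```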

Step 3: Choose Baselines

Every experiment needs comparisons. No result is meaningful in isolation.

Baseline Hierarchy:

  1. Random/Trivial Baseline

    • What does random chance achieve?
    • Sanity check that the task isn't trivial
  2. Simple Baseline

    • Simplest reasonable approach
    • Often embarrassingly effective
  3. Standard Baseline

    • Well-known method from literature
    • Apples-to-apples comparison
  4. State-of-the-Art Baseline

    • Current best published result
    • Only if you're claiming SOTA
  5. Ablated Self

    • Your method minus key components
    • Shows each component contributes

For each baseline, document:

  • Source (paper, implementation)
  • Hyperparameters used
  • Whether you re-ran or used reported numbers
  • Any modifications made
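
A small provenance record per baseline keeps this information next to the code rather than in someone's memory. A hypothetical sketch; the fields mirror the list above and the example values are placeholders:

```python
# Hypothetical sketch: record baseline provenance so reviewers can see
# exactly what each comparison number means.
from dataclasses import dataclass, field

@dataclass
class Baseline:
    name: str
    source: str                      # paper / implementation the numbers come from
    hyperparameters: dict
    reran: bool                      # True = re-run locally, False = reported numbers
    modifications: list = field(default_factory=list)

resnet50 = Baseline(
    name="ResNet-50",
    source="He et al. 2016; torchvision reference implementation",
    hyperparameters={"lr": 0.1, "epochs": 90, "batch_size": 256},
    reran=True,
    modifications=["same augmentation pipeline as our method"],
)
```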

Step 4: Design Ablations

Ablations answer: "Is each component necessary?"

Ablation Template:

| Variant | What's Removed/Changed | Expected Effect | If No Effect... |
|---|---|---|---|
| Full Model | Nothing | Best performance | - |
| w/o Component A | Remove A | Performance drops X% | A isn't helping |
| w/o Component B | Remove B | Performance drops Y% | B isn't helping |
| Component A only | Only A, no B | Shows A's isolated contribution | - |

Good ablations are:

  • Surgical (one change at a time)
  • Interpretable (clear what was changed)
  • Informative (result tells you something)
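
One way to enforce the "surgical" property is to derive every variant from a single full configuration, overriding exactly what the table row names. A hypothetical sketch with placeholder component names:

```python
# Hypothetical sketch: derive each ablation from the full config by
# overriding only the named settings, so differences stay attributable.
full_config = {
    "use_component_a": True,
    "use_component_b": True,
    "aux_loss_weight": 0.1,
}

def ablate(name, **overrides):
    """Return (variant_name, config) with exactly the given overrides applied."""
    return name, {**full_config, **overrides}

variants = [
    ablate("full_model"),
    ablate("no_component_a", use_component_a=False),
    ablate("no_component_b", use_component_b=False),
    ablate("component_a_only", use_component_b=False, aux_loss_weight=0.0),
]
for name, cfg in variants:
    print(name, cfg)
```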

Step 5: Address Confounds

Things that could explain your results OTHER than your hypothesis:

Common Confounds:

| Confound | How to Check | How to Control |
|---|---|---|
| Hyperparameter tuning advantage | Same tuning budget for all | Report tuning procedure |
| Compute advantage | Matched FLOPs/params | Report compute used |
| Data leakage | Check train/test overlap | Strict separation |
| Random seed luck | Multiple seeds | Report variance |
| Implementation bugs (baseline) | Verify baseline numbers | Use official implementations |
| Cherry-picked examples | Random or systematic selection | Pre-register selection criteria |
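
The data-leakage row is the easiest to check mechanically: give every example a stable key and verify the train and test sets do not intersect. A minimal sketch, assuming text examples and hash-based keys (both are assumptions, not requirements):

```python
# Hypothetical sketch: detect train/test overlap via content hashes.
import hashlib

def example_key(text: str) -> str:
    """Stable identifier for an example; here, a hash of lightly normalized text."""
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

def leaked_examples(train_texts, test_texts):
    train_keys = {example_key(t) for t in train_texts}
    return [t for t in test_texts if example_key(t) in train_keys]

dupes = leaked_examples(["the cat sat", "hello world"], ["Hello world ", "unseen example"])
print(f"{len(dupes)} leaked test examples")   # 1: "Hello world " matches after normalization
```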

Step 6: Statistical Rigor

Sample Size:

  • How many random seeds? (Minimum: 3, better: 5+)
  • How many data splits? (If applicable)
  • Power analysis: Can you detect the expected effect size with the planned number of runs? (see the sketch below)
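
A rough power calculation before launching anything tells you whether the planned number of runs can detect the effect you expect. A sketch using statsmodels; the effect size and α are placeholder assumptions:

```python
# Hypothetical sketch: runs per group needed to detect a given effect size
# with a two-sample t-test at 80% power.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.8,    # Cohen's d you expect (assumption, not measured)
    alpha=0.05,
    power=0.8,
    alternative="two-sided",
)
print(f"~{n_per_group:.1f} runs per group needed")
```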

What to Report:

  • Mean ± standard deviation (or standard error)
  • Confidence intervals where appropriate
  • Statistical significance tests if claiming "better"
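
With per-seed scores in hand, the reported numbers can be computed directly; a minimal sketch with made-up scores:

```python
# Hypothetical sketch: report mean ± sample std and a 95% t-based CI
# over per-seed results for one metric.
import numpy as np
from scipy import stats

seed_scores = np.array([0.812, 0.807, 0.820, 0.815, 0.803])  # 5 seeds, placeholder values
mean = seed_scores.mean()
std = seed_scores.std(ddof=1)                 # sample standard deviation
low, high = stats.t.interval(
    0.95, df=len(seed_scores) - 1, loc=mean, scale=stats.sem(seed_scores)
)
print(f"{mean:.3f} ± {std:.3f}  (95% CI [{low:.3f}, {high:.3f}])")
```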

Appropriate Tests:

| Comparison | Test | Assumptions |
|---|---|---|
| Two methods, normal data | t-test | Normality, equal variance |
| Two methods, unknown dist | Mann-Whitney U | Ordinal data |
| Multiple methods | ANOVA + post-hoc | Normality |
| Multiple methods, unknown dist | Kruskal-Wallis | Ordinal data |
| Paired comparisons | Wilcoxon signed-rank | Same test instances |
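
For reference, the scipy.stats calls corresponding to the rows above; the score arrays are placeholders, and you should pick one test in advance rather than trying them all:

```python
# Hypothetical sketch: the tests from the table above, via scipy.stats.
import numpy as np
from scipy import stats

a = np.array([0.812, 0.807, 0.820, 0.815, 0.803])   # method A, per-seed scores
b = np.array([0.798, 0.801, 0.805, 0.796, 0.802])   # method B, per-seed scores

print(stats.ttest_ind(a, b))       # two methods, roughly normal data
print(stats.mannwhitneyu(a, b))    # two methods, unknown distribution
print(stats.wilcoxon(a, b))        # paired comparison on the same test instances
# For >2 methods: stats.f_oneway(a, b, c) plus a post-hoc test, or stats.kruskal(a, b, c)
```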

Avoid:

  • p-hacking (running until significant)
  • Uncorrected multiple comparisons (apply a Bonferroni or similar correction; see the sketch below)
  • Reporting only favorable metrics
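
If you do make several comparisons against the same baseline, adjust the p-values before claiming significance. A minimal sketch using statsmodels with placeholder p-values:

```python
# Hypothetical sketch: Bonferroni correction over several comparisons.
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.049, 0.003, 0.20]   # one p-value per comparison (placeholders)
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={sig}")
```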

Step 7: Compute Budget

Before running, estimate:

| Component | Estimate | Notes |
|---|---|---|
| Single training run | X GPU-hours | [Details] |
| Hyperparameter search | Y runs × X hours | [Search strategy] |
| Baselines | Z runs × W hours | [Which baselines] |
| Ablations | N variants × X hours | [Which ablations] |
| Seeds | M seeds × above | [How many seeds] |
| Total | T GPU-hours | Buffer: 1.5-2x |
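
The table usually reduces to a few multiplications, so it is worth writing them down explicitly, buffer included. A back-of-envelope sketch with placeholder numbers:

```python
# Hypothetical sketch: back-of-envelope GPU-hour budget with a safety buffer.
single_run_hours = 8          # one training run (placeholder)
hparam_search_runs = 20
baseline_runs = 3
ablation_variants = 4
seeds = 5

core = seeds * (1 + baseline_runs + ablation_variants) * single_run_hours
search = hparam_search_runs * single_run_hours      # usually run with fewer seeds
total = core + search
print(f"core: {core} GPU-h, search: {search} GPU-h, "
      f"with 1.5-2x buffer: {1.5 * total:.0f}-{2 * total:.0f} GPU-h")
```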

Go/No-Go Decision: Is this feasible with available resources?

Step 8: Pre-Registration (Optional but Recommended)

Write down BEFORE running:

  • Exact hypotheses
  • Primary metrics (not chosen post-hoc)
  • Analysis plan
  • What would constitute "success"

This prevents unconsciously moving the goalposts after the results are in.

Output: Experiment Design Document

# Experiment Design: [Title]

## Hypothesis
[Precise statement]

## Variables
### Independent
[Table]

### Dependent
[Table]

### Controls
[Table]

## Baselines
1. [Baseline 1]: [Source, details]
2. [Baseline 2]: [Source, details]

## Ablations
[Table]

## Confound Mitigation
[Table]

## Statistical Plan
- Seeds: [N]
- Tests: [Which tests for which comparisons]
- Significance threshold: [α level]

## Compute Budget
[Table with total estimate]

## Success Criteria
- Primary: [What must be true]
- Secondary: [Nice to have]

## Timeline
- Phase 1: [What, when]
- Phase 2: [What, when]

## Known Risks
1. [Risk 1]: [Mitigation]
2. [Risk 2]: [Mitigation]

Red Flags in Experiment Design

🚩 "We'll figure out the metrics later" 🚩 "One run should be enough" 🚩 "We don't need baselines, it's obviously better" 🚩 "Let's just see what happens" 🚩 "We can always run more if it's not significant" 🚩 No compute estimate before starting 🚩 Vague success criteria