Claude-skill-registry Experiment Design

Comprehensive guide to A/B testing, multivariate testing, statistical significance, and experiment analysis for data-driven product decisions

Install

Source · Clone the upstream repo:
git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/experiment-design" ~/.claude/skills/majiayu000-claude-skill-registry-experiment-design && rm -rf "$T"

Manifest: skills/data/experiment-design/SKILL.md

Source content

Experiment Design

Types of Experiments

1. A/B Test (Two Variants)

What: Compare two versions (A vs B)

Example:

  • Control (A): Blue "Buy Now" button
  • Treatment (B): Green "Buy Now" button

When to Use:

  • Testing single change
  • Clear hypothesis
  • Binary decision (ship or don't ship)

Pros:

  • Simple to implement
  • Easy to analyze
  • Clear winner

Cons:

  • Only tests one change
  • Can't test interactions

2. Multivariate Test (Multiple Changes)

What: Test multiple changes simultaneously

Example:

  • Variable 1: Button color (Blue, Green, Red)
  • Variable 2: Button text ("Buy Now", "Add to Cart", "Get Started")
  • Variants: 3 × 3 = 9 combinations

When to Use:

  • Testing multiple elements
  • Want to find best combination
  • Have enough traffic

Pros:

  • Test interactions between variables
  • Find optimal combination

Cons:

  • Requires much more traffic
  • Complex analysis
  • Longer test duration

3. Sequential Testing

What: Continuously monitor and stop early if clear winner

Example:

  • Start A/B test
  • Check results daily
  • Stop when statistical significance reached (could be day 3 or day 14)

When to Use:

  • Want to ship winners fast
  • High traffic
  • Using tools that support it (Statsig, GrowthBook)

Pros:

  • Faster results
  • Less opportunity cost

Cons:

  • Requires special statistical methods
  • Peeking at a traditional fixed-horizon test is invalid, so you need tooling built for sequential analysis

4. Holdout Groups (Long-Term Effects)

What: Keep small % of users on old experience permanently

Example:

  • 95% of users: New feature
  • 5% of users: Old experience (holdout)

When to Use:

  • Measure long-term effects
  • Detect delayed negative impacts
  • Validate cumulative changes

Pros:

  • Detects long-term issues
  • Measures true impact

Cons:

  • Some users get worse experience
  • Requires ongoing monitoring

When to Experiment

✅ Experiment When:

  1. Significant Features (High Impact)

    • Major redesign
    • New pricing model
    • Core flow changes
  2. Uncertain Outcomes

    • Don't know if it will work
    • Conflicting opinions
    • No clear data
  3. Multiple Solution Options

    • Two different approaches
    • Want to pick the best
  4. Optimization Opportunities

    • Incremental improvements
    • Conversion optimization
    • Engagement optimization

❌ Don't Experiment When:

  1. Obvious Bugs/Fixes

    • Broken functionality
    • Security issues
    • Legal compliance
  2. Very Low Traffic

    • Can't reach statistical significance
    • Would take months
  3. Trivial Changes

    • Copy typo fix
    • Minor styling adjustment
  4. Ethical Issues

    • Manipulative dark patterns
    • Harmful to users

Experiment Design Process

Step 1: Define Hypothesis

Template:

"If we [change], then [metric] will [improve by X%], because [reasoning]."

Example:

"If we change the CTA button from blue to green, then click-through rate will increase by 10%, because green is more attention-grabbing."

Step 2: Choose Metrics

Primary Metric: What you're optimizing

  • Example: Click-through rate

Secondary Metrics: Other important outcomes

  • Example: Conversion rate, revenue per user

Counter Metrics: Watch for negatives

  • Example: Bounce rate, time on page

Step 3: Determine Sample Size

Inputs:

  • Baseline conversion rate: 5%
  • Expected improvement: 10% relative lift (5% → 5.5%)
  • Significance level: 0.05 (95% confidence)
  • Power: 0.80 (80% chance of detecting effect)

Output:

  • Sample size needed: ~31,000 users per variant

Tools:

  • Online sample size calculators (Evan Miller, Optimizely); see "Using Online Calculators" below

Step 4: Set Test Duration

Factors:

  • Sample size needed
  • Daily traffic
  • Weekly patterns (run at least 1-2 weeks)
  • Business cycles

Example:

  • Sample size: 31,000 per variant (62,000 total)
  • Daily traffic: 5,000
  • Duration: 62,000 / 5,000 = 12.4 days → Run for 2 weeks

Step 5: Design Variants

Control (A): Current experience
Treatment (B): New experience

Best Practices:

  • Change only one thing (for A/B test)
  • Make change meaningful (not trivial)
  • Ensure variants are distinct

Step 6: Launch Test

Checklist:

  • Hypothesis documented
  • Metrics instrumented
  • Sample size calculated
  • Randomization working
  • QA tested both variants
  • Monitoring dashboard ready

Step 7: Analyze Results

Check:

  • Statistical significance (p < 0.05)
  • Practical significance (is improvement meaningful?)
  • Secondary metrics (any red flags?)
  • Segment analysis (works for everyone?)

Step 8: Decide (Ship, Iterate, Kill)

Ship if:

  • Positive, significant, no red flags

Iterate if:

  • Mixed results, some segments good

Kill if:

  • Negative, not significant, opportunity cost too high

Choosing Metrics

Primary Metric (What We're Optimizing)

Characteristics:

  • Directly tied to hypothesis
  • Sensitive to change
  • Measurable in test duration

Examples:

  • Click-through rate (CTR)
  • Conversion rate
  • Sign-up completion rate
  • Time to first action

Bad Primary Metrics:

  • Revenue (too noisy, delayed)
  • Retention (takes too long to measure)
  • NPS (survey-based, low sample)

Secondary Metrics (Guardrails, Side Effects)

Purpose: Ensure we're not breaking other things

Examples:

  • Revenue per user
  • Engagement (sessions per user)
  • Feature adoption
  • Customer satisfaction

Counter Metrics (Watch for Negatives)

Purpose: Detect unintended negative consequences

Examples:

  • Bounce rate (users leaving immediately)
  • Error rate (technical issues)
  • Support tickets (confusion)
  • Churn rate (users leaving)

Example: Checkout Flow Test

Hypothesis:

"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%."

Metrics:

  • Primary: Checkout conversion rate
  • Secondary: Average order value, time to complete checkout
  • Counter: Cart abandonment rate, error rate, support tickets

Statistical Significance

P-Value < 0.05 (95% Confidence)

What it Means:

  • If there were truly no difference, a result at least this extreme would occur less than 5% of the time
  • Commonly read as 95% confidence that the effect is real

Example:

  • Control: 5.0% conversion
  • Treatment: 5.5% conversion
  • P-value: 0.03 ✅ (< 0.05, statistically significant)

Interpretation:

"We're 95% confident that the treatment is better than control."

Statistical Power (80%+)

What it Means:

  • 80% chance of detecting an effect if it exists
  • Reduces false negatives

Example:

  • Power: 80%
  • Means: 20% chance of missing a real effect

Minimum Detectable Effect (MDE)

What it Means:

  • Smallest effect size you can reliably detect
  • Depends on sample size

Example:

  • Baseline: 5% conversion
  • Sample size: ~31,000 per variant
  • MDE: 0.5% absolute (10% relative)
  • Can detect: 5.0% → 5.5% or larger

Trade-off:

  • Larger sample size → Smaller MDE (detect smaller effects)
  • Smaller sample size → Larger MDE (only detect big effects)
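
Under the same normal-approximation assumptions as the sample-size formula in the next section, the MDE for a given sample size can be estimated directly. A rough sketch (the z-values are hard-coded for 95% confidence and 80% power):

from math import sqrt

def approximate_mde(baseline_rate, n_per_variant, z_alpha=1.96, z_beta=0.84):
    """Approximate smallest absolute lift detectable at 95% confidence / 80% power."""
    p = baseline_rate
    return (z_alpha + z_beta) * sqrt(2 * p * (1 - p) / n_per_variant)

mde = approximate_mde(0.05, 31000)
# Roughly 0.5 percentage points absolute (~10% relative), matching the example above
print(f"absolute MDE: {mde:.3%}, relative: {mde / 0.05:.0%}")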

Sample Size Calculation

Formula (Simplified)

n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²

Where:
- n = sample size per variant
- Z_α/2 = 1.96 (for 95% confidence)
- Z_β = 0.84 (for 80% power)
- p₁ = baseline conversion rate
- p₂ = expected conversion rate

Example Calculation

Inputs:

  • Baseline conversion rate (p₁): 5% = 0.05
  • Expected improvement: 10% relative lift
  • New conversion rate (p₂): 5.5% = 0.055
  • Significance level (α): 0.05
  • Power (1-β): 0.80

Calculation:

n = (1.96 + 0.84)² × (0.05×0.95 + 0.055×0.945) / (0.05 - 0.055)²
n = 7.84 × (0.0475 + 0.052) / 0.000025
n = 7.84 × 0.0995 / 0.000025
n ≈ 31,200 per variant

Total sample size: 62,400 users
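
The same calculation translated directly into Python (a sketch of the simplified formula above; z-values hard-coded for 95% confidence and 80% power):

from math import ceil

def sample_size_per_variant(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Sample size per variant for a two-proportion test (simplified formula)."""
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return ceil(numerator / (p1 - p2) ** 2)

n = sample_size_per_variant(0.05, 0.055)
print(n, "per variant,", 2 * n, "total")  # ~31,200 per variant, ~62,400 total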

Using Online Calculators

Evan Miller's Calculator:

  1. Go to https://www.evanmiller.org/ab-testing/sample-size.html
  2. Enter baseline conversion rate: 5%
  3. Enter minimum detectable effect: 10% (relative)
  4. Get sample size: ~31,000 per variant

Optimizely Calculator:

  1. Go to Optimizely sample size calculator
  2. Enter baseline: 5%
  3. Enter minimum detectable effect: 0.5% (absolute)
  4. Get sample size: ~31,000 per variant

Test Duration

Minimum Duration: 1-2 Weeks

Why:

  • Capture weekly patterns (weekday vs weekend)
  • Avoid day-of-week bias
  • Account for user behavior cycles

Example:

  • Don't run Monday-Wednesday only
  • Run at least Monday-Sunday (1 full week)

Full Business Cycles

Examples:

  • E-commerce: Include payday (1st and 15th of month)
  • B2B SaaS: Include full week (avoid Friday-only)
  • Seasonal: Avoid holidays (unless testing holiday-specific)

Enough Data for Significance

Formula:

Duration = Sample Size Needed / Daily Traffic

Example:

  • Sample size: 62,000 total
  • Daily traffic: 5,000
  • Duration: 62,000 / 5,000 = 12.4 days
  • Run for: 2 weeks (14 days)
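
A tiny helper for this rule of thumb (a sketch; it just rounds up to whole weeks and enforces the minimum duration discussed above):

import math

def test_duration_days(total_sample_size, daily_traffic, min_days=7):
    """Days needed to reach the target sample size, rounded up to full weeks."""
    days = max(math.ceil(total_sample_size / daily_traffic), min_days)
    return math.ceil(days / 7) * 7  # round up to a whole number of weeks

print(test_duration_days(62000, 5000))  # 12.4 days of traffic -> run for 14 days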

Not Too Long (Opportunity Cost)

Trade-off:

  • Longer test = More confidence
  • Longer test = Delayed learnings, slower iteration

Guideline:

  • Most tests: 1-4 weeks
  • High-traffic sites: 1-2 weeks
  • Low-traffic sites: 2-4 weeks
  • Don't run > 1 month (diminishing returns)

Experiment Variants

Control (Current Experience)

What: The existing experience

Example:

  • Current checkout flow (5 steps)
  • Current button color (blue)
  • Current pricing page

Purpose: Baseline for comparison

Treatment (New Experience)

What: The proposed change

Example:

  • New checkout flow (3 steps)
  • New button color (green)
  • New pricing page

Purpose: Test hypothesis

Multiple Treatments (If Testing Different Approaches)

Example:

  • Control: 5-step checkout
  • Treatment A: 3-step checkout (combine steps)
  • Treatment B: 1-page checkout (all on one page)

Traffic Split:

  • Control: 33%
  • Treatment A: 33%
  • Treatment B: 34%

Analysis:

  • Compare each treatment to control
  • Compare treatments to each other
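
One common way to implement the traffic split above is to hash the user ID into a bucket from 0-99 and map bucket ranges to variants. A minimal sketch (the hashing scheme and variant names are illustrative, not tied to any particular tool):

import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically map a user to a variant using a 33/33/34 split."""
    # Hash the experiment name together with the user ID so different
    # experiments get independent assignments for the same user.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0-99
    if bucket < 33:
        return "control"        # 5-step checkout
    elif bucket < 66:
        return "treatment_a"    # 3-step checkout
    else:
        return "treatment_b"    # 1-page checkout

print(assign_variant("user_123", "checkout_flow_v2"))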

Randomization

User-Level Randomization (Consistent Experience)

What: Each user always sees same variant

How:

const variant = hashUserId(userId) % 2 === 0 ? 'control' : 'treatment';

When to Use:

  • Logged-in users
  • Want consistent experience
  • Testing flows (multi-step)

Pros:

  • Consistent experience
  • No confusion

Cons:

  • Requires user ID

Session-Level (For Anonymous Users)

What: Each session sees same variant (but different sessions can differ)

How:

const variant = hashSessionId(sessionId) % 2 === 0 ? 'control' : 'treatment';

When to Use:

  • Anonymous users
  • Single-page tests

Pros:

  • Works for anonymous users

Cons:

  • Same user can see different variants across sessions

Stratified Sampling (For Segments)

What: Ensure even distribution across segments

Example:

  • Segment 1: Free users (50% control, 50% treatment)
  • Segment 2: Paid users (50% control, 50% treatment)

Why:

  • Avoid imbalanced segments
  • Enable segment analysis

Common Pitfalls

1. Peeking (Stopping Test Early When "Winning")

Problem:

Day 3: Treatment is winning! (p = 0.04) → Ship it!
Day 7: Treatment is losing... (p = 0.12) → Oops.

Why It's Bad:

  • Increases false positive rate
  • P-value fluctuates during test

Solution:

  • Decide sample size upfront
  • Don't look until test completes
  • Or use sequential testing (proper method)
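
The inflated false-positive rate is easy to demonstrate with a simulation: run many A/A tests (no real difference), check the p-value every "day", and count how often any look crosses 0.05. A sketch using NumPy and SciPy; the traffic numbers are arbitrary.

import numpy as np
from scipy.stats import norm

def peeking_false_positive_rate(n_sims=2000, days=14, daily_n=1000, p=0.05, seed=0):
    """Fraction of A/A tests that look 'significant' at ANY daily check."""
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(days):
            conv_a += rng.binomial(daily_n, p)
            conv_b += rng.binomial(daily_n, p)
            n_a += daily_n
            n_b += daily_n
            pooled = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
            z = (conv_b / n_b - conv_a / n_a) / se
            if 2 * (1 - norm.cdf(abs(z))) < 0.05:
                false_positives += 1
                break  # "ship it!" -- stopping at the first significant peek
    return false_positives / n_sims

print(peeking_false_positive_rate())  # typically well above the nominal 5%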

2. Sample Ratio Mismatch (Uneven Splits)

Problem:

Expected: 50% control, 50% treatment
Actual: 48% control, 52% treatment

Why It's Bad:

  • Indicates randomization bug
  • Results may be invalid

Solution:

  • Check sample ratio before analyzing
  • Investigate if mismatch > 1%
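
Beyond the 1% rule of thumb, a chi-square goodness-of-fit test is a common way to check for sample ratio mismatch (an additional technique, not one prescribed by this guide). A sketch using SciPy:

from scipy.stats import chisquare

def srm_check(observed_counts, expected_ratios):
    """Return the chi-square p-value for the observed traffic split."""
    total = sum(observed_counts)
    expected = [total * r for r in expected_ratios]
    _, p_value = chisquare(observed_counts, f_exp=expected)
    return p_value

# Expected 50/50, observed 48/52 out of 100,000 users
p = srm_check([48000, 52000], [0.5, 0.5])
print(f"SRM p-value: {p:.2e}")  # tiny p-value -> investigate the randomizer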

3. Novelty Effect (Users Trying New Thing)

Problem:

Week 1: Treatment is winning! (+20%)
Week 4: Treatment is same as control (0%)

Why It's Bad:

  • Users try new thing out of curiosity
  • Effect fades over time

Solution:

  • Run test longer (2-4 weeks)
  • Use holdout group for long-term measurement
  • Segment by new vs returning users

4. Seasonality (Testing During Holidays)

Problem:

Test during Black Friday: +50% conversion
Test during normal week: +5% conversion

Why It's Bad:

  • Holiday behavior is different
  • Results don't generalize

Solution:

  • Avoid testing during holidays
  • Or run test across multiple weeks (include holiday + normal)

Sequential Testing

What is Sequential Testing?

Traditional A/B Test:

  • Decide sample size upfront
  • Run until sample size reached
  • Analyze once at end

Sequential Testing:

  • Monitor continuously
  • Stop early if clear winner
  • Adjust significance threshold

How It Works

Algorithm:

  • Use adjusted significance threshold (not 0.05)
  • Account for multiple looks
  • Stop when threshold crossed

Example (Simplified):

Day 1: p = 0.10 → Continue
Day 3: p = 0.03 → Continue
Day 5: p = 0.001 → Stop! (clear winner)
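
Real platforms use proper group-sequential or always-valid methods. As a deliberately conservative illustration of "adjust the threshold for multiple looks", the sketch below applies a Bonferroni-style correction: divide the significance level by the number of planned looks. This is not how Statsig or Optimizely implement it.

def sequential_decision(p_values_by_look, alpha=0.05):
    """Stop at the first look whose p-value crosses a Bonferroni-adjusted threshold."""
    threshold = alpha / len(p_values_by_look)  # conservative adjustment for peeking
    for look, p in enumerate(p_values_by_look, start=1):
        if p < threshold:
            return f"stop at look {look} (p={p} < {threshold:.4f})"
    return "no early stop; run to the planned sample size"

# Five planned daily looks (p-values are illustrative)
print(sequential_decision([0.10, 0.06, 0.03, 0.005, 0.001]))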

Tools That Support Sequential Testing

  • Statsig: Built-in sequential testing
  • GrowthBook: Bayesian statistics
  • Optimizely: Stats Engine (sequential)

Benefits

  • Faster results (stop early if clear winner)
  • Less opportunity cost
  • Detect large effects quickly

Drawbacks

  • Requires special tools
  • Can't use traditional p-value
  • More complex

Holdout Groups

What is a Holdout Group?

Definition: Small % of users kept on old experience permanently

Example:

  • 95% of users: New feature
  • 5% of users: Old experience (holdout)

Why Use Holdout Groups?

Measure Long-Term Effects:

  • A/B test shows +10% conversion in 2 weeks
  • Holdout shows +5% conversion after 6 months
  • Learning: Effect diminishes over time

Detect Delayed Negative Impacts:

  • A/B test shows +15% signups
  • Holdout shows +10% churn after 3 months
  • Learning: Feature attracts wrong users

How Long to Keep Holdout?

Guideline:

  • 1-3 months for most features
  • 6-12 months for major changes
  • Permanent for critical features

When to Remove Holdout?

Remove if:

  • No long-term differences detected
  • Opportunity cost too high (5% of users on worse experience)
  • Feature is critical (everyone should have it)

Experiment Analysis

Step 1: Compare Primary Metric

Example:

  • Control: 5.0% conversion
  • Treatment: 5.5% conversion
  • Lift: +10% relative
  • P-value: 0.03 ✅

Decision: Treatment is statistically significantly better.

Step 2: Check Secondary Metrics

Example:

  • Revenue per user: $10.50 (control) vs $11.20 (treatment) ✅
  • Time to checkout: 3.2 min (control) vs 2.8 min (treatment) ✅

Decision: Secondary metrics also improved.

Step 3: Check Counter Metrics

Example:

  • Bounce rate: 30% (control) vs 32% (treatment) ⚠️
  • Error rate: 0.5% (control) vs 0.5% (treatment) ✅

Decision: Slight increase in bounce rate, investigate.

Step 4: Segment Analysis

Did it work for everyone?

Segment       Control   Treatment   Lift
Mobile        4.5%      5.2%        +15% ✅
Desktop       5.5%      5.8%        +5% ✅
Free users    3.0%      3.6%        +20% ✅
Paid users    7.0%      7.1%        +1% ⚠️

Learning: Works great for mobile and free users, minimal impact on paid users.

Step 5: Statistical Significance

Check:

  • P-value < 0.05 ✅
  • Confidence interval doesn't include 0 ✅

Example:

  • Lift: +10%
  • 95% CI: [+5%, +15%]
  • Interpretation: We're 95% confident the true lift is between 5% and 15%.
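
A sketch of how such an interval can be computed: a normal-approximation (Wald) interval for the absolute difference in conversion rates, divided by the control rate to express it as a relative lift. The counts are illustrative; experimentation platforms use more refined methods.

from math import sqrt

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% CI for the relative lift of treatment (B) over control (A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    low, high = diff - z * se, diff + z * se
    return low / p_a, high / p_a  # convert the absolute difference to relative lift

low, high = lift_confidence_interval(1550, 31000, 1705, 31000)
print(f"relative lift 95% CI: [{low:+.1%}, {high:+.1%}]")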

Step 6: Practical Significance

Is the improvement meaningful?

Example:

  • Statistically significant: Yes (p = 0.04)
  • Lift: +0.1% relative (5.0% → 5.005%)
  • Decision: Not practically significant (too small to matter)

Guideline:

  • Small lift but high volume → Ship (e.g., +0.1 percentage points on 1M users = 1,000 more conversions)
  • Large lift but low volume → Maybe ship (e.g., +50% on a baseline of 100 conversions = 50 more conversions)

Decision Framework

Ship If:

  ✅ Positive: Treatment is better than control
  ✅ Significant: P-value < 0.05
  ✅ No Red Flags: Secondary and counter metrics look good
  ✅ Works for Key Segments: At least works for the majority

Example:

  • Conversion: +10% (p = 0.03) ✅
  • Revenue: +8% (p = 0.05) ✅
  • Bounce rate: No change ✅
  • Works for mobile and desktop ✅
  • Decision: Ship!

Iterate If:

  ⚠️ Mixed Results: Some metrics up, some down
  ⚠️ Works for Some Segments Only: E.g., only mobile, not desktop
  ⚠️ Close to Significance: P = 0.06 (just missed)

Example:

  • Conversion: +10% (p = 0.03) ✅
  • Revenue: -5% (p = 0.08) ⚠️
  • Decision: Iterate. Conversion is up but revenue is down. Investigate why.

Kill If:

  ❌ Negative: Treatment is worse than control
  ❌ Not Significant: P-value > 0.05
  ❌ Opportunity Cost Too High: Could be working on better ideas

Example:

  • Conversion: +2% (p = 0.15) ❌
  • Took 4 weeks to test
  • Decision: Kill. Not significant, move on to next idea.

Tools

Feature Flags

LaunchDarkly:

  • Feature flag management
  • Gradual rollouts
  • Kill switches

Split.io:

  • Feature flags + experimentation
  • Real-time metrics

Unleash:

  • Open-source feature flags
  • Self-hosted option

Experimentation Platforms

Optimizely:

  • Full-stack experimentation
  • Visual editor for web
  • Stats Engine (sequential testing)

VWO (Visual Website Optimizer):

  • A/B testing for web
  • Heatmaps, session recordings
  • Visual editor

GrowthBook:

  • Open-source experimentation
  • Bayesian statistics
  • Feature flags

Statsig:

  • Modern experimentation platform
  • Sequential testing
  • Free tier

Analytics

Amplitude:

  • Product analytics
  • Funnel analysis
  • Cohort analysis

Mixpanel:

  • Event-based analytics
  • A/B test analysis
  • Retention analysis

PostHog:

  • Open-source product analytics
  • Feature flags
  • Session replay

A/B Testing for Engineers

1. Feature Flag Implementation

Node.js (LaunchDarkly):

const express = require('express');
const LaunchDarkly = require('launchdarkly-node-server-sdk');

const app = express();
const client = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);

app.get('/checkout', async (req, res) => {
  // Attributes used for targeting and for segmenting results later
  const user = {
    key: req.user.id,
    email: req.user.email,
    custom: {
      plan: req.user.plan
    }
  };

  // Evaluate the flag for this user; the third argument is the fallback value
  const showNewCheckout = await client.variation('new-checkout-flow', user, false);

  if (showNewCheckout) {
    res.render('checkout-new');
  } else {
    res.render('checkout-old');
  }
});

// Wait for the SDK to finish initializing before serving traffic
client.waitForInitialization().then(() => app.listen(3000));

Python (Statsig):

import os

from flask import Flask, render_template
from flask_login import current_user
from statsig import statsig
from statsig.statsig_user import StatsigUser

app = Flask(__name__)
statsig.initialize(os.environ['STATSIG_SERVER_KEY'])

@app.route('/checkout')
def checkout():
    # check_gate expects a StatsigUser, not a plain dict
    user = StatsigUser(
        user_id=current_user.id,
        email=current_user.email,
        custom={'plan': current_user.plan},
    )

    show_new_checkout = statsig.check_gate(user, 'new_checkout_flow')

    if show_new_checkout:
        return render_template('checkout_new.html')
    else:
        return render_template('checkout_old.html')

2. Metric Instrumentation

Segment (Event Tracking):

const Analytics = require('analytics-node');
const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY);

// Track checkout started
analytics.track({
  userId: user.id,
  event: 'Checkout Started',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    cart_value: cart.total,
    items_count: cart.items.length
  }
});

// Track checkout completed
analytics.track({
  userId: user.id,
  event: 'Checkout Completed',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    order_id: order.id,
    revenue: order.total
  }
});

3. Data Pipeline

Architecture:

Application
    ↓ (events)
Segment
    ↓ (forwards to)
├── Amplitude (analytics)
├── Mixpanel (analytics)
├── Data Warehouse (BigQuery, Snowflake)
└── Statsig (experimentation)

4. Results Dashboard

Grafana Dashboard:

{
  "dashboard": {
    "title": "A/B Test: New Checkout Flow",
    "panels": [
      {
        "title": "Conversion Rate by Variant",
        "targets": [
          {
            "expr": "sum(checkout_completed{variant='control'}) / sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_completed{variant='treatment'}) / sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      },
      {
        "title": "Sample Size",
        "targets": [
          {
            "expr": "sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      }
    ]
  }
}

Real Experiment Examples

Example 1: Button Color Test (Classic)

Hypothesis:

"If we change the CTA button from blue to orange, click-through rate will increase by 10%, because orange is more attention-grabbing."

Test:

  • Control: Blue button
  • Treatment: Orange button
  • Sample size: 10,000 per variant
  • Duration: 1 week

Results:

  • Control: 5.2% CTR
  • Treatment: 5.7% CTR
  • Lift: +9.6%
  • P-value: 0.04 ✅

Decision: Ship orange button.

Example 2: Checkout Flow Optimization

Hypothesis:

"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%, because users abandon due to flow length."

Test:

  • Control: 5-step checkout
  • Treatment: 3-step checkout (combined steps)
  • Sample size: 50,000 per variant
  • Duration: 2 weeks

Results:

  • Control: 8.5% conversion
  • Treatment: 9.8% conversion
  • Lift: +15.3%
  • P-value: 0.001 ✅

Secondary Metrics:

  • Time to checkout: 4.2 min → 3.1 min ✅
  • Error rate: 2.1% → 1.8% ✅

Decision: Ship 3-step checkout.

Example 3: Pricing Page Variants

Hypothesis:

"If we show annual pricing first (instead of monthly), annual plan adoption will increase by 25%, because anchoring effect."

Test:

  • Control: Monthly pricing shown first
  • Treatment: Annual pricing shown first
  • Sample size: 20,000 per variant
  • Duration: 3 weeks

Results:

  • Control: 12% annual adoption
  • Treatment: 18% annual adoption
  • Lift: +50%
  • P-value: 0.001 ✅

Counter Metrics:

  • Overall conversion: 10.5% → 10.2% ⚠️ (slight drop)

Decision: Ship, but monitor overall conversion.

Example 4: Onboarding Flow

Hypothesis:

"If we add an interactive tutorial in onboarding, activation rate will increase by 30%, because users don't know how to get started."

Test:

  • Control: No tutorial
  • Treatment: Interactive tutorial (5 steps)
  • Sample size: 15,000 per variant
  • Duration: 2 weeks

Results:

  • Control: 25% activation rate
  • Treatment: 28% activation rate
  • Lift: +12%
  • P-value: 0.08 ❌ (not significant)

Segment Analysis:

  • New users: +20% (p = 0.03) ✅
  • Returning users: +2% (p = 0.5) ❌

Decision: Iterate. Show tutorial only to new users.


Advanced: Bayesian A/B Testing

Traditional (Frequentist) A/B Testing

Approach:

  • Null hypothesis: No difference between A and B
  • P-value: Probability of seeing this result if null is true
  • Reject null if p < 0.05

Interpretation:

"There's a 95% chance the result is not due to random chance."

Bayesian A/B Testing

Approach:

  • Prior belief: What we believe before test
  • Likelihood: Data from test
  • Posterior belief: Updated belief after test

Interpretation:

"There's a 95% probability that B is better than A."

Benefits of Bayesian

  1. Easier to Interpret:

    • "95% probability B is better" (intuitive)
    • vs "p = 0.03" (confusing)
  2. Can Stop Early:

    • No peeking problem
    • Stop when confident enough
  3. Incorporates Prior Knowledge:

    • Use historical data
    • More accurate with small samples

Tools That Use Bayesian

  • GrowthBook: Bayesian by default
  • VWO: Bayesian engine option
  • Google Optimize: Bayesian (deprecated)

Example

Test:

  • Control: 5.0% conversion (1000 users)
  • Treatment: 5.5% conversion (1000 users)

Frequentist:

  • P-value: 0.15 (not significant)
  • Decision: Can't conclude

Bayesian:

  • Probability B > A: 87%
  • Expected lift: +10%
  • Decision: Likely better, but not confident enough (need 95%)

Summary

Quick Reference

Experiment Types:

  • A/B test: Two variants
  • Multivariate: Multiple changes
  • Sequential: Stop early
  • Holdout: Long-term measurement

When to Experiment:

  • Significant features
  • Uncertain outcomes
  • Multiple options
  • Optimization

Process:

  1. Define hypothesis
  2. Choose metrics
  3. Calculate sample size
  4. Set duration
  5. Design variants
  6. Launch
  7. Analyze
  8. Decide

Metrics:

  • Primary: What we're optimizing
  • Secondary: Guardrails
  • Counter: Watch for negatives

Statistical Significance:

  • P-value < 0.05
  • Power > 80%
  • Minimum detectable effect

Common Pitfalls:

  • Peeking
  • Sample ratio mismatch
  • Novelty effect
  • Seasonality

Decision Framework:

  • Ship: Positive, significant, no red flags
  • Iterate: Mixed results
  • Kill: Negative, not significant

Tools:

  • Feature flags: LaunchDarkly, Split.io
  • Experimentation: Optimizely, Statsig, GrowthBook
  • Analytics: Amplitude, Mixpanel, PostHog