Claude-skill-registry Experiment Design

Comprehensive guide to A/B testing, multivariate testing, statistical significance, and experiment analysis for data-driven product decisions

Install

Source · Clone the upstream repo:
git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/experiment-design" ~/.claude/skills/majiayu000-claude-skill-registry-experiment-design && rm -rf "$T"

Manifest: skills/data/experiment-design/SKILL.md

Source content

Experiment Design

Types of Experiments

1. A/B Test (Two Variants)

What: Compare two versions (A vs B)

Example:

  • Control (A): Blue "Buy Now" button
  • Treatment (B): Green "Buy Now" button

When to Use:

  • Testing single change
  • Clear hypothesis
  • Binary decision (ship or don't ship)

Pros:

  • Simple to implement
  • Easy to analyze
  • Clear winner

Cons:

  • Only tests one change
  • Can't test interactions

2. Multivariate Test (Multiple Changes)

What: Test multiple changes simultaneously

Example:

  • Variable 1: Button color (Blue, Green, Red)
  • Variable 2: Button text ("Buy Now", "Add to Cart", "Get Started")
  • Variants: 3 × 3 = 9 combinations

When to Use:

  • Testing multiple elements
  • Want to find best combination
  • Have enough traffic

Pros:

  • Test interactions between variables
  • Find optimal combination

Cons:

  • Requires much more traffic
  • Complex analysis
  • Longer test duration

3. Sequential Testing

What: Continuously monitor and stop early if clear winner

Example:

  • Start A/B test
  • Check results daily
  • Stop when statistical significance reached (could be day 3 or day 14)

When to Use:

  • Want to ship winners fast
  • High traffic
  • Using tools that support it (Statsig, GrowthBook)

Pros:

  • Faster results
  • Less opportunity cost

Cons:

  • Requires special statistical methods
  • Peeking at a traditional fixed-horizon test is invalid, so you need tooling built for sequential analysis

4. Holdout Groups (Long-Term Effects)

What: Keep small % of users on old experience permanently

Example:

  • 95% of users: New feature
  • 5% of users: Old experience (holdout)

When to Use:

  • Measure long-term effects
  • Detect delayed negative impacts
  • Validate cumulative changes

Pros:

  • Detects long-term issues
  • Measures true impact

Cons:

  • Some users get worse experience
  • Requires ongoing monitoring

When to Experiment

✅ Experiment When:

  1. Significant Features (High Impact)

    • Major redesign
    • New pricing model
    • Core flow changes
  2. Uncertain Outcomes

    • Don't know if it will work
    • Conflicting opinions
    • No clear data
  3. Multiple Solution Options

    • Two different approaches
    • Want to pick the best
  4. Optimization Opportunities

    • Incremental improvements
    • Conversion optimization
    • Engagement optimization

❌ Don't Experiment When:

  1. Obvious Bugs/Fixes

    • Broken functionality
    • Security issues
    • Legal compliance
  2. Very Low Traffic

    • Can't reach statistical significance
    • Would take months
  3. Trivial Changes

    • Copy typo fix
    • Minor styling adjustment
  4. Ethical Issues

    • Manipulative dark patterns
    • Harmful to users

Experiment Design Process

Step 1: Define Hypothesis

Template:

"If we [change], then [metric] will [improve by X%], because [reasoning]."

Example:

"If we change the CTA button from blue to green, then click-through rate will increase by 10%, because green is more attention-grabbing."

Step 2: Choose Metrics

Primary Metric: What you're optimizing

  • Example: Click-through rate

Secondary Metrics: Other important outcomes

  • Example: Conversion rate, revenue per user

Counter Metrics: Watch for negatives

  • Example: Bounce rate, time on page

Step 3: Determine Sample Size

Inputs:

  • Baseline conversion rate: 5%
  • Expected improvement: 10% relative lift (5% → 5.5%)
  • Significance level: 0.05 (95% confidence)
  • Power: 0.80 (80% chance of detecting effect)

Output:

  • Sample size needed: ~31,000 users per variant

Tools:

  • Online sample size calculators (Evan Miller, Optimizely); see "Using Online Calculators" below

Step 4: Set Test Duration

Factors:

  • Sample size needed
  • Daily traffic
  • Weekly patterns (run at least 1-2 weeks)
  • Business cycles

Example:

  • Sample size: 31,000 per variant (62,000 total)
  • Daily traffic: 5,000
  • Duration: 62,000 / 5,000 = 12.4 days → Run for 2 weeks

Step 5: Design Variants

Control (A): Current experience
Treatment (B): New experience

Best Practices:

  • Change only one thing (for A/B test)
  • Make change meaningful (not trivial)
  • Ensure variants are distinct

Step 6: Launch Test

Checklist:

  • Hypothesis documented
  • Metrics instrumented
  • Sample size calculated
  • Randomization working
  • QA tested both variants
  • Monitoring dashboard ready

Step 7: Analyze Results

Check:

  • Statistical significance (p < 0.05)
  • Practical significance (is improvement meaningful?)
  • Secondary metrics (any red flags?)
  • Segment analysis (works for everyone?)

Step 8: Decide (Ship, Iterate, Kill)

Ship if:

  • Positive, significant, no red flags

Iterate if:

  • Mixed results, some segments good

Kill if:

  • Negative, not significant, opportunity cost too high

Choosing Metrics

Primary Metric (What We're Optimizing)

Characteristics:

  • Directly tied to hypothesis
  • Sensitive to change
  • Measurable in test duration

Examples:

  • Click-through rate (CTR)
  • Conversion rate
  • Sign-up completion rate
  • Time to first action

Bad Primary Metrics:

  • Revenue (too noisy, delayed)
  • Retention (takes too long to measure)
  • NPS (survey-based, low sample)

Secondary Metrics (Guardrails, Side Effects)

Purpose: Ensure we're not breaking other things

Examples:

  • Revenue per user
  • Engagement (sessions per user)
  • Feature adoption
  • Customer satisfaction

Counter Metrics (Watch for Negatives)

Purpose: Detect unintended negative consequences

Examples:

  • Bounce rate (users leaving immediately)
  • Error rate (technical issues)
  • Support tickets (confusion)
  • Churn rate (users leaving)

Example: Checkout Flow Test

Hypothesis:

"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%."

Metrics:

  • Primary: Checkout conversion rate
  • Secondary: Average order value, time to complete checkout
  • Counter: Cart abandonment rate, error rate, support tickets

Statistical Significance

P-Value < 0.05 (95% Confidence)

What it Means:

  • If there were truly no difference, a result at least this extreme would occur less than 5% of the time
  • Commonly read as 95% confidence that the effect is real

Example:

  • Control: 5.0% conversion
  • Treatment: 5.5% conversion
  • P-value: 0.03 ✅ (< 0.05, statistically significant)

Interpretation:

"We're 95% confident that the treatment is better than control."

Statistical Power (80%+)

What it Means:

  • 80% chance of detecting an effect if it exists
  • Reduces false negatives

Example:

  • Power: 80%
  • Means: 20% chance of missing a real effect

Minimum Detectable Effect (MDE)

What it Means:

  • Smallest effect size you can reliably detect
  • Depends on sample size

Example:

  • Baseline: 5% conversion
  • Sample size: ~31,000 per variant
  • MDE: 0.5% absolute (10% relative)
  • Can detect: 5.0% → 5.5% or larger

Trade-off:

  • Larger sample size → Smaller MDE (detect smaller effects)
  • Smaller sample size → Larger MDE (only detect big effects)
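
Under the same normal-approximation assumptions as the sample-size formula in the next section, the MDE for a given sample size can be estimated directly. A rough sketch (the z-values are hard-coded for 95% confidence and 80% power):

from math import sqrt

def approximate_mde(baseline_rate, n_per_variant, z_alpha=1.96, z_beta=0.84):
    """Approximate smallest absolute lift detectable at 95% confidence / 80% power."""
    p = baseline_rate
    return (z_alpha + z_beta) * sqrt(2 * p * (1 - p) / n_per_variant)

mde = approximate_mde(0.05, 31000)
# Roughly 0.5 percentage points absolute (~10% relative), matching the example above
print(f"absolute MDE: {mde:.3%}, relative: {mde / 0.05:.0%}")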

Sample Size Calculation

Formula (Simplified)

n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²

Where:
- n = sample size per variant
- Z_α/2 = 1.96 (for 95% confidence)
- Z_β = 0.84 (for 80% power)
- p₁ = baseline conversion rate
- p₂ = expected conversion rate

Example Calculation

Inputs:

  • Baseline conversion rate (p₁): 5% = 0.05
  • Expected improvement: 10% relative lift
  • New conversion rate (p₂): 5.5% = 0.055
  • Significance level (α): 0.05
  • Power (1-β): 0.80

Calculation:

n = (1.96 + 0.84)² × (0.05×0.95 + 0.055×0.945) / (0.05 - 0.055)²
n = 7.84 × (0.0475 + 0.052) / 0.000025
n = 7.84 × 0.0995 / 0.000025
n ≈ 31,200 per variant

Total sample size: 62,400 users
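
The same calculation translated directly into Python (a sketch of the simplified formula above; z-values hard-coded for 95% confidence and 80% power):

from math import ceil

def sample_size_per_variant(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Sample size per variant for a two-proportion test (simplified formula)."""
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return ceil(numerator / (p1 - p2) ** 2)

n = sample_size_per_variant(0.05, 0.055)
print(n, "per variant,", 2 * n, "total")  # ~31,200 per variant, ~62,400 total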

Using Online Calculators

Evan Miller's Calculator:

  1. Go to https://www.evanmiller.org/ab-testing/sample-size.html
  2. Enter baseline conversion rate: 5%
  3. Enter minimum detectable effect: 10% (relative)
  4. Get sample size: ~31,000 per variant

Optimizely Calculator:

  1. Go to Optimizely sample size calculator
  2. Enter baseline: 5%
  3. Enter minimum detectable effect: 0.5% (absolute)
  4. Get sample size: ~31,000 per variant

Test Duration

Minimum Duration: 1-2 Weeks

Why:

  • Capture weekly patterns (weekday vs weekend)
  • Avoid day-of-week bias
  • Account for user behavior cycles

Example:

  • Don't run Monday-Wednesday only
  • Run at least Monday-Sunday (1 full week)

Full Business Cycles

Examples:

  • E-commerce: Include payday (1st and 15th of month)
  • B2B SaaS: Include full week (avoid Friday-only)
  • Seasonal: Avoid holidays (unless testing holiday-specific)

Enough Data for Significance

Formula:

Duration = Sample Size Needed / Daily Traffic

Example:

  • Sample size: 62,000 total
  • Daily traffic: 5,000
  • Duration: 62,000 / 5,000 = 12.4 days
  • Run for: 2 weeks (14 days)
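
A tiny helper for this rule of thumb (a sketch; it just rounds up to whole weeks and enforces the minimum duration discussed above):

import math

def test_duration_days(total_sample_size, daily_traffic, min_days=7):
    """Days needed to reach the target sample size, rounded up to full weeks."""
    days = max(math.ceil(total_sample_size / daily_traffic), min_days)
    return math.ceil(days / 7) * 7  # round up to a whole number of weeks

print(test_duration_days(62000, 5000))  # 12.4 days of traffic -> run for 14 days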

Not Too Long (Opportunity Cost)

Trade-off:

  • Longer test = More confidence
  • Longer test = Delayed learnings, slower iteration

Guideline:

  • Most tests: 1-4 weeks
  • High-traffic sites: 1-2 weeks
  • Low-traffic sites: 2-4 weeks
  • Don't run > 1 month (diminishing returns)

Experiment Variants

Control (Current Experience)

What: The existing experience

Example:

  • Current checkout flow (5 steps)
  • Current button color (blue)
  • Current pricing page

Purpose: Baseline for comparison

Treatment (New Experience)

What: The proposed change

Example:

  • New checkout flow (3 steps)
  • New button color (green)
  • New pricing page

Purpose: Test hypothesis

Multiple Treatments (If Testing Different Approaches)

Example:

  • Control: 5-step checkout
  • Treatment A: 3-step checkout (combine steps)
  • Treatment B: 1-page checkout (all on one page)

Traffic Split:

  • Control: 33%
  • Treatment A: 33%
  • Treatment B: 34%

Analysis:

  • Compare each treatment to control
  • Compare treatments to each other
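
One common way to implement the traffic split above is to hash the user ID into a bucket from 0-99 and map bucket ranges to variants. A minimal sketch (the hashing scheme and variant names are illustrative, not tied to any particular tool):

import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically map a user to a variant using a 33/33/34 split."""
    # Hash the experiment name together with the user ID so different
    # experiments get independent assignments for the same user.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0-99
    if bucket < 33:
        return "control"        # 5-step checkout
    elif bucket < 66:
        return "treatment_a"    # 3-step checkout
    else:
        return "treatment_b"    # 1-page checkout

print(assign_variant("user_123", "checkout_flow_v2"))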

Randomization

User-Level Randomization (Consistent Experience)

What: Each user always sees same variant

How:

const variant = hashUserId(userId) % 2 === 0 ? 'control' : 'treatment';

When to Use:

  • Logged-in users
  • Want consistent experience
  • Testing flows (multi-step)

Pros:

  • Consistent experience
  • No confusion

Cons:

  • Requires user ID

Session-Level (For Anonymous Users)

What: Each session sees same variant (but different sessions can differ)

How:

const variant = hashSessionId(sessionId) % 2 === 0 ? 'control' : 'treatment';

When to Use:

  • Anonymous users
  • Single-page tests

Pros:

  • Works for anonymous users

Cons:

  • Same user can see different variants across sessions

Stratified Sampling (For Segments)

What: Ensure even distribution across segments

Example:

  • Segment 1: Free users (50% control, 50% treatment)
  • Segment 2: Paid users (50% control, 50% treatment)

Why:

  • Avoid imbalanced segments
  • Enable segment analysis

Common Pitfalls

1. Peeking (Stopping Test Early When "Winning")

Problem:

Day 3: Treatment is winning! (p = 0.04) → Ship it!
Day 7: Treatment is losing... (p = 0.12) → Oops.

Why It's Bad:

  • Increases false positive rate
  • P-value fluctuates during test

Solution:

  • Decide sample size upfront
  • Don't look until test completes
  • Or use sequential testing (proper method)
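
The inflated false-positive rate is easy to demonstrate with a simulation: run many A/A tests (no real difference), check the p-value every "day", and count how often any look crosses 0.05. A sketch using NumPy and SciPy; the traffic numbers are arbitrary.

import numpy as np
from scipy.stats import norm

def peeking_false_positive_rate(n_sims=2000, days=14, daily_n=1000, p=0.05, seed=0):
    """Fraction of A/A tests that look 'significant' at ANY daily check."""
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(days):
            conv_a += rng.binomial(daily_n, p)
            conv_b += rng.binomial(daily_n, p)
            n_a += daily_n
            n_b += daily_n
            pooled = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
            z = (conv_b / n_b - conv_a / n_a) / se
            if 2 * (1 - norm.cdf(abs(z))) < 0.05:
                false_positives += 1
                break  # "ship it!" -- stopping at the first significant peek
    return false_positives / n_sims

print(peeking_false_positive_rate())  # typically well above the nominal 5%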

2. Sample Ratio Mismatch (Uneven Splits)

Problem:

Expected: 50% control, 50% treatment
Actual: 48% control, 52% treatment

Why It's Bad:

  • Indicates randomization bug
  • Results may be invalid

Solution:

  • Check sample ratio before analyzing
  • Investigate if mismatch > 1%
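
Beyond the 1% rule of thumb, a chi-square goodness-of-fit test is a common way to check for sample ratio mismatch (an additional technique, not one prescribed by this guide). A sketch using SciPy:

from scipy.stats import chisquare

def srm_check(observed_counts, expected_ratios):
    """Return the chi-square p-value for the observed traffic split."""
    total = sum(observed_counts)
    expected = [total * r for r in expected_ratios]
    _, p_value = chisquare(observed_counts, f_exp=expected)
    return p_value

# Expected 50/50, observed 48/52 out of 100,000 users
p = srm_check([48000, 52000], [0.5, 0.5])
print(f"SRM p-value: {p:.2e}")  # tiny p-value -> investigate the randomizer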

3. Novelty Effect (Users Trying New Thing)

Problem:

Week 1: Treatment is winning! (+20%)
Week 4: Treatment is same as control (0%)

Why It's Bad:

  • Users try new thing out of curiosity
  • Effect fades over time

Solution:

  • Run test longer (2-4 weeks)
  • Use holdout group for long-term measurement
  • Segment by new vs returning users

4. Seasonality (Testing During Holidays)

Problem:

Test during Black Friday: +50% conversion
Test during normal week: +5% conversion

Why It's Bad:

  • Holiday behavior is different
  • Results don't generalize

Solution:

  • Avoid testing during holidays
  • Or run test across multiple weeks (include holiday + normal)

Sequential Testing

What is Sequential Testing?

Traditional A/B Test:

  • Decide sample size upfront
  • Run until sample size reached
  • Analyze once at end

Sequential Testing:

  • Monitor continuously
  • Stop early if clear winner
  • Adjust significance threshold

How It Works

Algorithm:

  • Use adjusted significance threshold (not 0.05)
  • Account for multiple looks
  • Stop when threshold crossed

Example (Simplified):

Day 1: p = 0.10 → Continue
Day 3: p = 0.03 → Continue
Day 5: p = 0.001 → Stop! (clear winner)
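
Real platforms use proper group-sequential or always-valid methods. As a deliberately conservative illustration of "adjust the threshold for multiple looks", the sketch below applies a Bonferroni-style correction: divide the significance level by the number of planned looks. This is not how Statsig or Optimizely implement it.

def sequential_decision(p_values_by_look, alpha=0.05):
    """Stop at the first look whose p-value crosses a Bonferroni-adjusted threshold."""
    threshold = alpha / len(p_values_by_look)  # conservative adjustment for peeking
    for look, p in enumerate(p_values_by_look, start=1):
        if p < threshold:
            return f"stop at look {look} (p={p} < {threshold:.4f})"
    return "no early stop; run to the planned sample size"

# Five planned daily looks (p-values are illustrative)
print(sequential_decision([0.10, 0.06, 0.03, 0.005, 0.001]))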

Tools That Support Sequential Testing

  • Statsig: Built-in sequential testing
  • GrowthBook: Bayesian statistics
  • Optimizely: Stats Engine (sequential)

Benefits

  • Faster results (stop early if clear winner)
  • Less opportunity cost
  • Detect large effects quickly

Drawbacks

  • Requires special tools
  • Can't use traditional p-value
  • More complex

Holdout Groups

What is a Holdout Group?

Definition: Small % of users kept on old experience permanently

Example:

  • 95% of users: New feature
  • 5% of users: Old experience (holdout)

Why Use Holdout Groups?

Measure Long-Term Effects:

  • A/B test shows +10% conversion in 2 weeks
  • Holdout shows +5% conversion after 6 months
  • Learning: Effect diminishes over time

Detect Delayed Negative Impacts:

  • A/B test shows +15% signups
  • Holdout shows +10% churn after 3 months
  • Learning: Feature attracts wrong users

How Long to Keep Holdout?

Guideline:

  • 1-3 months for most features
  • 6-12 months for major changes
  • Permanent for critical features

When to Remove Holdout?

Remove if:

  • No long-term differences detected
  • Opportunity cost too high (5% of users on worse experience)
  • Feature is critical (everyone should have it)

Experiment Analysis

Step 1: Compare Primary Metric

Example:

  • Control: 5.0% conversion
  • Treatment: 5.5% conversion
  • Lift: +10% relative
  • P-value: 0.03 ✅

Decision: Treatment is statistically significantly better.

Step 2: Check Secondary Metrics

Example:

  • Revenue per user: $10.50 (control) vs $11.20 (treatment) ✅
  • Time to checkout: 3.2 min (control) vs 2.8 min (treatment) ✅

Decision: Secondary metrics also improved.

Step 3: Check Counter Metrics

Example:

  • Bounce rate: 30% (control) vs 32% (treatment) ⚠️
  • Error rate: 0.5% (control) vs 0.5% (treatment) ✅

Decision: Slight increase in bounce rate, investigate.

Step 4: Segment Analysis

Did it work for everyone?

Segment       Control   Treatment   Lift
Mobile        4.5%      5.2%        +15% ✅
Desktop       5.5%      5.8%        +5% ✅
Free users    3.0%      3.6%        +20% ✅
Paid users    7.0%      7.1%        +1% ⚠️

Learning: Works great for mobile and free users, minimal impact on paid users.

Step 5: Statistical Significance

Check:

  • P-value < 0.05 ✅
  • Confidence interval doesn't include 0 ✅

Example:

  • Lift: +10%
  • 95% CI: [+5%, +15%]
  • Interpretation: We're 95% confident the true lift is between 5% and 15%.
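
A sketch of how such an interval can be computed: a normal-approximation (Wald) interval for the absolute difference in conversion rates, divided by the control rate to express it as a relative lift. The counts are illustrative; experimentation platforms use more refined methods.

from math import sqrt

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% CI for the relative lift of treatment (B) over control (A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    low, high = diff - z * se, diff + z * se
    return low / p_a, high / p_a  # convert the absolute difference to relative lift

low, high = lift_confidence_interval(1550, 31000, 1705, 31000)
print(f"relative lift 95% CI: [{low:+.1%}, {high:+.1%}]")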

Step 6: Practical Significance

Is the improvement meaningful?

Example:

  • Statistically significant: Yes (p = 0.04)
  • Lift: +0.1% relative (5.0% → 5.005%)
  • Decision: Not practically significant (too small to matter)

Guideline:

  • Small lift but high volume → Ship (e.g., +0.1 percentage points on 1M users = 1,000 more conversions)
  • Large lift but low volume → Maybe ship (e.g., +50% on a baseline of 100 conversions = 50 more conversions)

Decision Framework

Ship If:

  ✅ Positive: Treatment is better than control
  ✅ Significant: P-value < 0.05
  ✅ No Red Flags: Secondary and counter metrics look good
  ✅ Works for Key Segments: At least works for the majority

Example:

  • Conversion: +10% (p = 0.03) ✅
  • Revenue: +8% (p = 0.05) ✅
  • Bounce rate: No change ✅
  • Works for mobile and desktop ✅
  • Decision: Ship!

Iterate If:

  ⚠️ Mixed Results: Some metrics up, some down
  ⚠️ Works for Some Segments Only: E.g., only mobile, not desktop
  ⚠️ Close to Significance: P = 0.06 (just missed)

Example:

  • Conversion: +10% (p = 0.03) ✅
  • Revenue: -5% (p = 0.08) ⚠️
  • Decision: Iterate. Conversion is up but revenue is down. Investigate why.

Kill If:

  ❌ Negative: Treatment is worse than control
  ❌ Not Significant: P-value > 0.05
  ❌ Opportunity Cost Too High: Could be working on better ideas

Example:

  • Conversion: +2% (p = 0.15) ❌
  • Took 4 weeks to test
  • Decision: Kill. Not significant, move on to next idea.

Tools

Feature Flags

LaunchDarkly:

  • Feature flag management
  • Gradual rollouts
  • Kill switches

Split.io:

  • Feature flags + experimentation
  • Real-time metrics

Unleash:

  • Open-source feature flags
  • Self-hosted option

Experimentation Platforms

Optimizely:

  • Full-stack experimentation
  • Visual editor for web
  • Stats Engine (sequential testing)

VWO (Visual Website Optimizer):

  • A/B testing for web
  • Heatmaps, session recordings
  • Visual editor

GrowthBook:

  • Open-source experimentation
  • Bayesian statistics
  • Feature flags

Statsig:

  • Modern experimentation platform
  • Sequential testing
  • Free tier

Analytics

Amplitude:

  • Product analytics
  • Funnel analysis
  • Cohort analysis

Mixpanel:

  • Event-based analytics
  • A/B test analysis
  • Retention analysis

PostHog:

  • Open-source product analytics
  • Feature flags
  • Session replay

A/B Testing for Engineers

1. Feature Flag Implementation

Node.js (LaunchDarkly):

const express = require('express');
const LaunchDarkly = require('launchdarkly-node-server-sdk');

const app = express();
const client = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);

app.get('/checkout', async (req, res) => {
  // Attributes used for targeting and for segmenting results later
  const user = {
    key: req.user.id,
    email: req.user.email,
    custom: {
      plan: req.user.plan
    }
  };

  // Evaluate the flag for this user; the third argument is the fallback value
  const showNewCheckout = await client.variation('new-checkout-flow', user, false);

  if (showNewCheckout) {
    res.render('checkout-new');
  } else {
    res.render('checkout-old');
  }
});

// Wait for the SDK to finish initializing before serving traffic
client.waitForInitialization().then(() => app.listen(3000));

Python (Statsig):

import os

from flask import Flask, render_template
from flask_login import current_user
from statsig import statsig
from statsig.statsig_user import StatsigUser

app = Flask(__name__)
statsig.initialize(os.environ['STATSIG_SERVER_KEY'])

@app.route('/checkout')
def checkout():
    # check_gate expects a StatsigUser, not a plain dict
    user = StatsigUser(
        user_id=current_user.id,
        email=current_user.email,
        custom={'plan': current_user.plan},
    )

    show_new_checkout = statsig.check_gate(user, 'new_checkout_flow')

    if show_new_checkout:
        return render_template('checkout_new.html')
    else:
        return render_template('checkout_old.html')

2. Metric Instrumentation

Segment (Event Tracking):

const Analytics = require('analytics-node');
const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY);

// Track checkout started
analytics.track({
  userId: user.id,
  event: 'Checkout Started',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    cart_value: cart.total,
    items_count: cart.items.length
  }
});

// Track checkout completed
analytics.track({
  userId: user.id,
  event: 'Checkout Completed',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    order_id: order.id,
    revenue: order.total
  }
});

3. Data Pipeline

Architecture:

Application
    ↓ (events)
Segment
    ↓ (forwards to)
├── Amplitude (analytics)
├── Mixpanel (analytics)
├── Data Warehouse (BigQuery, Snowflake)
└── Statsig (experimentation)

4. Results Dashboard

Grafana Dashboard:

{
  "dashboard": {
    "title": "A/B Test: New Checkout Flow",
    "panels": [
      {
        "title": "Conversion Rate by Variant",
        "targets": [
          {
            "expr": "sum(checkout_completed{variant='control'}) / sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_completed{variant='treatment'}) / sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      },
      {
        "title": "Sample Size",
        "targets": [
          {
            "expr": "sum(checkout_started{variant='control'})",
            "legendFormat": "Control"
          },
          {
            "expr": "sum(checkout_started{variant='treatment'})",
            "legendFormat": "Treatment"
          }
        ]
      }
    ]
  }
}

Real Experiment Examples

Example 1: Button Color Test (Classic)

Hypothesis:

"If we change the CTA button from blue to orange, click-through rate will increase by 10%, because orange is more attention-grabbing."

Test:

  • Control: Blue button
  • Treatment: Orange button
  • Sample size: 10,000 per variant
  • Duration: 1 week

Results:

  • Control: 5.2% CTR
  • Treatment: 5.7% CTR
  • Lift: +9.6%
  • P-value: 0.04 ✅

Decision: Ship orange button.

Example 2: Checkout Flow Optimization

Hypothesis:

"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%, because users abandon due to flow length."

Test:

  • Control: 5-step checkout
  • Treatment: 3-step checkout (combined steps)
  • Sample size: 50,000 per variant
  • Duration: 2 weeks

Results:

  • Control: 8.5% conversion
  • Treatment: 9.8% conversion
  • Lift: +15.3%
  • P-value: 0.001 ✅

Secondary Metrics:

  • Time to checkout: 4.2 min → 3.1 min ✅
  • Error rate: 2.1% → 1.8% ✅

Decision: Ship 3-step checkout.

Example 3: Pricing Page Variants

Hypothesis:

"If we show annual pricing first (instead of monthly), annual plan adoption will increase by 25%, because anchoring effect."

Test:

  • Control: Monthly pricing shown first
  • Treatment: Annual pricing shown first
  • Sample size: 20,000 per variant
  • Duration: 3 weeks

Results:

  • Control: 12% annual adoption
  • Treatment: 18% annual adoption
  • Lift: +50%
  • P-value: 0.001 ✅

Counter Metrics:

  • Overall conversion: 10.5% → 10.2% ⚠️ (slight drop)

Decision: Ship, but monitor overall conversion.

Example 4: Onboarding Flow

Hypothesis:

"If we add an interactive tutorial in onboarding, activation rate will increase by 30%, because users don't know how to get started."

Test:

  • Control: No tutorial
  • Treatment: Interactive tutorial (5 steps)
  • Sample size: 15,000 per variant
  • Duration: 2 weeks

Results:

  • Control: 25% activation rate
  • Treatment: 28% activation rate
  • Lift: +12%
  • P-value: 0.08 ❌ (not significant)

Segment Analysis:

  • New users: +20% (p = 0.03) ✅
  • Returning users: +2% (p = 0.5) ❌

Decision: Iterate. Show tutorial only to new users.


Advanced: Bayesian A/B Testing

Traditional (Frequentist) A/B Testing

Approach:

  • Null hypothesis: No difference between A and B
  • P-value: Probability of seeing this result if null is true
  • Reject null if p < 0.05

Interpretation:

"There's a 95% chance the result is not due to random chance."

Bayesian A/B Testing

Approach:

  • Prior belief: What we believe before test
  • Likelihood: Data from test
  • Posterior belief: Updated belief after test

Interpretation:

"There's a 95% probability that B is better than A."

Benefits of Bayesian

  1. Easier to Interpret:

    • "95% probability B is better" (intuitive)
    • vs "p = 0.03" (confusing)
  2. Can Stop Early:

    • No peeking problem
    • Stop when confident enough
  3. Incorporates Prior Knowledge:

    • Use historical data
    • More accurate with small samples

Tools That Use Bayesian

  • GrowthBook: Bayesian by default
  • VWO: Bayesian engine option
  • Google Optimize: Bayesian (deprecated)

Example

Test:

  • Control: 5.0% conversion (1000 users)
  • Treatment: 5.5% conversion (1000 users)

Frequentist:

  • P-value: 0.15 (not significant)
  • Decision: Can't conclude

Bayesian:

  • Probability B > A: 87%
  • Expected lift: +10%
  • Decision: Likely better, but not confident enough (need 95%)

Summary

Quick Reference

Experiment Types:

  • A/B test: Two variants
  • Multivariate: Multiple changes
  • Sequential: Stop early
  • Holdout: Long-term measurement

When to Experiment:

  • Significant features
  • Uncertain outcomes
  • Multiple options
  • Optimization

Process:

  1. Define hypothesis
  2. Choose metrics
  3. Calculate sample size
  4. Set duration
  5. Design variants
  6. Launch
  7. Analyze
  8. Decide

Metrics:

  • Primary: What we're optimizing
  • Secondary: Guardrails
  • Counter: Watch for negatives

Statistical Significance:

  • P-value < 0.05
  • Power > 80%
  • Minimum detectable effect

Common Pitfalls:

  • Peeking
  • Sample ratio mismatch
  • Novelty effect
  • Seasonality

Decision Framework:

  • Ship: Positive, significant, no red flags
  • Iterate: Mixed results
  • Kill: Negative, not significant

Tools:

  • Feature flags: LaunchDarkly, Split.io
  • Experimentation: Optimizely, Statsig, GrowthBook
  • Analytics: Amplitude, Mixpanel, PostHog