Claude-skill-registry Experiment Design
Comprehensive guide to A/B testing, multivariate testing, statistical significance, and experiment analysis for data-driven product decisions
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/experiment-design" ~/.claude/skills/majiayu000-claude-skill-registry-experiment-design && rm -rf "$T"
skills/data/experiment-design/SKILL.md
Experiment Design
Types of Experiments
1. A/B Test (Two Variants)
What: Compare two versions (A vs B)
Example:
- Control (A): Blue "Buy Now" button
- Treatment (B): Green "Buy Now" button
When to Use:
- Testing single change
- Clear hypothesis
- Binary decision (ship or don't ship)
Pros:
- Simple to implement
- Easy to analyze
- Clear winner
Cons:
- Only tests one change
- Can't test interactions
2. Multivariate Test (Multiple Changes)
What: Test multiple changes simultaneously
Example:
- Variable 1: Button color (Blue, Green, Red)
- Variable 2: Button text ("Buy Now", "Add to Cart", "Get Started")
- Variants: 3 × 3 = 9 combinations
When to Use:
- Testing multiple elements
- Want to find best combination
- Have enough traffic
Pros:
- Test interactions between variables
- Find optimal combination
Cons:
- Requires much more traffic
- Complex analysis
- Longer test duration
3. Sequential Testing
What: Continuously monitor and stop early if clear winner
Example:
- Start A/B test
- Check results daily
- Stop when statistical significance reached (could be day 3 or day 14)
When to Use:
- Want to ship winners fast
- High traffic
- Using tools that support it (Statsig, GrowthBook)
Pros:
- Faster results
- Less opportunity cost
Cons:
- Requires special statistical methods
- Traditional fixed-horizon p-values are invalid if you peek, so you must commit to the sequential method up front
4. Holdout Groups (Long-Term Effects)
What: Keep small % of users on old experience permanently
Example:
- 95% of users: New feature
- 5% of users: Old experience (holdout)
When to Use:
- Measure long-term effects
- Detect delayed negative impacts
- Validate cumulative changes
Pros:
- Detects long-term issues
- Measures true impact
Cons:
- Some users get worse experience
- Requires ongoing monitoring
When to Experiment
✅ Experiment When:
1. Significant Features (High Impact)
   - Major redesign
   - New pricing model
   - Core flow changes
2. Uncertain Outcomes
   - Don't know if it will work
   - Conflicting opinions
   - No clear data
3. Multiple Solution Options
   - Two different approaches
   - Want to pick the best
4. Optimization Opportunities
   - Incremental improvements
   - Conversion optimization
   - Engagement optimization
❌ Don't Experiment When:
1. Obvious Bugs/Fixes
   - Broken functionality
   - Security issues
   - Legal compliance
2. Very Low Traffic
   - Can't reach statistical significance
   - Would take months
3. Trivial Changes
   - Copy typo fix
   - Minor styling adjustment
4. Ethical Issues
   - Manipulative dark patterns
   - Harmful to users
Experiment Design Process
Step 1: Define Hypothesis
Template:
"If we [change], then [metric] will [improve by X%], because [reasoning]."
Example:
"If we change the CTA button from blue to green, then click-through rate will increase by 10%, because green is more attention-grabbing."
Step 2: Choose Metrics
Primary Metric: What you're optimizing
- Example: Click-through rate
Secondary Metrics: Other important outcomes
- Example: Conversion rate, revenue per user
Counter Metrics: Watch for negatives
- Example: Bounce rate, time on page
Step 3: Determine Sample Size
Inputs:
- Baseline conversion rate: 5%
- Expected improvement: 10% relative lift (5% → 5.5%)
- Significance level: 0.05 (95% confidence)
- Power: 0.80 (80% chance of detecting effect)
Output:
- Sample size needed: ~31,000 users per variant
Tools:
- Evan Miller's calculator: https://www.evanmiller.org/ab-testing/sample-size.html
- Optimizely sample size calculator
Step 4: Set Test Duration
Factors:
- Sample size needed
- Daily traffic
- Weekly patterns (run at least 1-2 weeks)
- Business cycles
Example:
- Sample size: 31,000 per variant (62,000 total)
- Daily traffic: 5,000
- Duration: 62,000 / 5,000 = 12.4 days → Run for 2 weeks
Step 5: Design Variants
Control (A): Current experience
Treatment (B): New experience
Best Practices:
- Change only one thing (for A/B test)
- Make change meaningful (not trivial)
- Ensure variants are distinct
Step 6: Launch Test
Checklist:
- Hypothesis documented
- Metrics instrumented
- Sample size calculated
- Randomization working
- QA tested both variants
- Monitoring dashboard ready
Step 7: Analyze Results
Check:
- Statistical significance (p < 0.05)
- Practical significance (is improvement meaningful?)
- Secondary metrics (any red flags?)
- Segment analysis (works for everyone?)
Step 8: Decide (Ship, Iterate, Kill)
Ship if:
- Positive, significant, no red flags
Iterate if:
- Mixed results, some segments good
Kill if:
- Negative, not significant, opportunity cost too high
Choosing Metrics
Primary Metric (What We're Optimizing)
Characteristics:
- Directly tied to hypothesis
- Sensitive to change
- Measurable in test duration
Examples:
- Click-through rate (CTR)
- Conversion rate
- Sign-up completion rate
- Time to first action
Bad Primary Metrics:
- Revenue (too noisy, delayed)
- Retention (takes too long to measure)
- NPS (survey-based, low sample)
Secondary Metrics (Guardrails, Side Effects)
Purpose: Ensure we're not breaking other things
Examples:
- Revenue per user
- Engagement (sessions per user)
- Feature adoption
- Customer satisfaction
Counter Metrics (Watch for Negatives)
Purpose: Detect unintended negative consequences
Examples:
- Bounce rate (users leaving immediately)
- Error rate (technical issues)
- Support tickets (confusion)
- Churn rate (users leaving)
Example: Checkout Flow Test
Hypothesis:
"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%."
Metrics:
- Primary: Checkout conversion rate
- Secondary: Average order value, time to complete checkout
- Counter: Cart abandonment rate, error rate, support tickets
Statistical Significance
P-Value < 0.05 (95% Confidence)
What it Means:
- If there were truly no difference, a result at least this extreme would occur less than 5% of the time
- Conventionally treated as strong evidence that the effect is real
Example:
- Control: 5.0% conversion
- Treatment: 5.5% conversion
- P-value: 0.03 ✅ (< 0.05, statistically significant)
Interpretation (informal):
"We can be confident the treatment is genuinely better than control."
Statistical Power (80%+)
What it Means:
- 80% chance of detecting an effect if it exists
- Reduces false negatives
Example:
- Power: 80%
- Means: 20% chance of missing a real effect
Minimum Detectable Effect (MDE)
What it Means:
- Smallest effect size you can reliably detect
- Depends on sample size
Example:
- Baseline: 5% conversion
- Sample size: ~31,000 per variant
- MDE: 0.5% absolute (10% relative)
- Can detect: 5.0% → 5.5% or larger
Trade-off:
- Larger sample size → Smaller MDE (detect smaller effects)
- Smaller sample size → Larger MDE (only detect big effects)
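To make this trade-off concrete, the MDE can be approximated by inverting the sample-size formula given in the next section. The sketch below is an approximation (it uses the baseline rate for both variance terms), and the sample sizes are illustrative.

```python
# Approximate absolute MDE for a two-variant test at 95% confidence / 80% power.
from math import sqrt
from statistics import NormalDist

def approx_mde(baseline, n_per_variant, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ≈ 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # ≈ 0.84 for 80% power
    variance = 2 * baseline * (1 - baseline)        # both variants ≈ baseline
    return (z_alpha + z_beta) * sqrt(variance / n_per_variant)

for n in (5_000, 10_000, 31_000, 100_000):
    print(f"n = {n:>7,} per variant → MDE ≈ {approx_mde(0.05, n):.2%} absolute")
# MDE shrinks roughly with 1/sqrt(n): 31,000 per variant gives ≈ 0.5% absolute.
```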
Sample Size Calculation
Formula (Simplified)
n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²

Where:
- n = sample size per variant
- Z_α/2 = 1.96 (for 95% confidence)
- Z_β = 0.84 (for 80% power)
- p₁ = baseline conversion rate
- p₂ = expected conversion rate
Example Calculation
Inputs:
- Baseline conversion rate (p₁): 5% = 0.05
- Expected improvement: 10% relative lift
- New conversion rate (p₂): 5.5% = 0.055
- Significance level (α): 0.05
- Power (1-β): 0.80
Calculation:
n = (1.96 + 0.84)² × (0.05×0.95 + 0.055×0.945) / (0.05 - 0.055)²
n = 7.84 × (0.0475 + 0.052) / 0.000025
n = 7.84 × 0.0995 / 0.000025
n ≈ 31,200 per variant
Total sample size: 62,400 users
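The same calculation can be scripted so it is easy to rerun with different inputs. A minimal standard-library sketch (the function name is illustrative):

```python
# Sample size per variant for comparing two proportions (normal approximation),
# matching the hand calculation above.
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ≈ 1.96
    z_beta = NormalDist().inv_cdf(power)            # ≈ 0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n = sample_size_per_variant(p1=0.05, p2=0.055)
print(f"{n:,} per variant, {2 * n:,} total")   # ≈ 31,000 per variant
```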
Using Online Calculators
Evan Miller's Calculator:
- Go to https://www.evanmiller.org/ab-testing/sample-size.html
- Enter baseline conversion rate: 5%
- Enter minimum detectable effect: 10% (relative)
- Get sample size: ~31,000 per variant
Optimizely Calculator:
- Go to Optimizely sample size calculator
- Enter baseline: 5%
- Enter minimum detectable effect: 0.5% (absolute)
- Get sample size: ~31,000 per variant
Test Duration
Minimum Duration: 1-2 Weeks
Why:
- Capture weekly patterns (weekday vs weekend)
- Avoid day-of-week bias
- Account for user behavior cycles
Example:
- Don't run Monday-Wednesday only
- Run at least Monday-Sunday (1 full week)
Full Business Cycles
Examples:
- E-commerce: Include payday (1st and 15th of month)
- B2B SaaS: Include full week (avoid Friday-only)
- Seasonal: Avoid holidays (unless testing holiday-specific)
Enough Data for Significance
Formula:
Duration = Sample Size Needed / Daily Traffic
Example:
- Sample size: 62,000 total
- Daily traffic: 5,000
- Duration: 62,000 / 5,000 = 12.4 days
- Run for: 2 weeks (14 days)
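A quick sketch of this arithmetic, rounding up to whole weeks so the test always covers full weekday/weekend cycles (numbers are the ones from the example above):

```python
# Convert required sample size + daily traffic into a test duration,
# rounded up to full weeks to cover weekly patterns.
from math import ceil

def duration_in_days(total_sample_size, daily_traffic):
    raw_days = total_sample_size / daily_traffic
    return ceil(raw_days / 7) * 7          # round up to full weeks

days = duration_in_days(total_sample_size=62_000, daily_traffic=5_000)
print(f"Run for {days} days")              # 62,000 / 5,000 = 12.4 → 14 days
```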
Not Too Long (Opportunity Cost)
Trade-off:
- Longer test = More confidence
- Longer test = Delayed learnings, slower iteration
Guideline:
- Most tests: 1-4 weeks
- High-traffic sites: 1-2 weeks
- Low-traffic sites: 2-4 weeks
- Don't run > 1 month (diminishing returns)
Experiment Variants
Control (Current Experience)
What: The existing experience
Example:
- Current checkout flow (5 steps)
- Current button color (blue)
- Current pricing page
Purpose: Baseline for comparison
Treatment (New Experience)
What: The proposed change
Example:
- New checkout flow (3 steps)
- New button color (green)
- New pricing page
Purpose: Test hypothesis
Multiple Treatments (If Testing Different Approaches)
Example:
- Control: 5-step checkout
- Treatment A: 3-step checkout (combine steps)
- Treatment B: 1-page checkout (all on one page)
Traffic Split:
- Control: 33%
- Treatment A: 33%
- Treatment B: 34%
Analysis:
- Compare each treatment to control
- Compare treatments to each other
Randomization
User-Level Randomization (Consistent Experience)
What: Each user always sees same variant
How (a fuller hash-based Python sketch appears at the end of this Randomization section):
const variant = hashUserId(userId) % 2 === 0 ? 'control' : 'treatment';
When to Use:
- Logged-in users
- Want consistent experience
- Testing flows (multi-step)
Pros:
- Consistent experience
- No confusion
Cons:
- Requires user ID
Session-Level (For Anonymous Users)
What: Each session sees same variant (but different sessions can differ)
How:
const variant = hashSessionId(sessionId) % 2 === 0 ? 'control' : 'treatment';
When to Use:
- Anonymous users
- Single-page tests
Pros:
- Works for anonymous users
Cons:
- Same user can see different variants across sessions
Stratified Sampling (For Segments)
What: Ensure even distribution across segments
Example:
- Segment 1: Free users (50% control, 50% treatment)
- Segment 2: Paid users (50% control, 50% treatment)
Why:
- Avoid imbalanced segments
- Enable segment analysis
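Below is a minimal sketch of deterministic, hash-based assignment (the same idea as the hashUserId/hashSessionId snippets above) plus a per-segment balance check for stratified analysis. Everything here is illustrative; production systems usually salt the hash with the experiment name so different experiments get independent splits.

```python
# Deterministic bucketing: the same (experiment, id) pair always maps to the
# same variant, so a user keeps a consistent experience across visits.
import hashlib
from collections import Counter

def assign_variant(experiment: str, unit_id: str, treatment_share: float = 0.5) -> str:
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF      # roughly uniform value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# Stratified sanity check: the split should be roughly 50/50 within each segment.
users = [("u%05d" % i, "free" if i % 3 else "paid") for i in range(30_000)]
counts = Counter((segment, assign_variant("new-checkout-flow", uid))
                 for uid, segment in users)
for segment in ("free", "paid"):
    total = counts[(segment, "control")] + counts[(segment, "treatment")]
    share = counts[(segment, "treatment")] / total
    print(f"{segment}: {share:.1%} in treatment")   # expect ≈ 50% in each segment
```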
Common Pitfalls
1. Peeking (Stopping Test Early When "Winning")
Problem:
Day 3: Treatment is winning! (p = 0.04) → Ship it!
Day 7: Treatment is losing... (p = 0.12) → Oops.
Why It's Bad:
- Increases false positive rate
- P-value fluctuates during test
Solution:
- Decide sample size upfront
- Don't look until test completes
- Or use sequential testing (proper method)
2. Sample Ratio Mismatch (Uneven Splits)
Problem:
Expected: 50% control, 50% treatment
Actual: 48% control, 52% treatment
Why It's Bad:
- Indicates randomization bug
- Results may be invalid
Solution:
- Check sample ratio before analyzing
- Investigate if mismatch > 1%
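A quick way to check for sample ratio mismatch is a chi-square goodness-of-fit test of the observed split against the configured split. A sketch assuming scipy is available; the counts and the p < 0.001 alert threshold are illustrative conventions, not part of the skill.

```python
# Sample ratio mismatch (SRM) check: is the observed split consistent with 50/50?
from scipy.stats import chisquare

control, treatment = 48_000, 52_000            # observed assignments (hypothetical)
total = control + treatment
expected = [total * 0.5, total * 0.5]          # expected under the configured split

stat, p_value = chisquare(f_obs=[control, treatment], f_exp=expected)
if p_value < 0.001:
    print(f"Possible SRM (p = {p_value:.2e}) — fix randomization before trusting results")
else:
    print(f"Split looks fine (p = {p_value:.3f})")
```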
3. Novelty Effect (Users Trying New Thing)
Problem:
Week 1: Treatment is winning! (+20%)
Week 4: Treatment is same as control (0%)
Why It's Bad:
- Users try new thing out of curiosity
- Effect fades over time
Solution:
- Run test longer (2-4 weeks)
- Use holdout group for long-term measurement
- Segment by new vs returning users
4. Seasonality (Testing During Holidays)
Problem:
Test during Black Friday: +50% conversion
Test during normal week: +5% conversion
Why It's Bad:
- Holiday behavior is different
- Results don't generalize
Solution:
- Avoid testing during holidays
- Or run test across multiple weeks (include holiday + normal)
Sequential Testing
What is Sequential Testing?
Traditional A/B Test:
- Decide sample size upfront
- Run until sample size reached
- Analyze once at end
Sequential Testing:
- Monitor continuously
- Stop early if clear winner
- Adjust significance threshold
How It Works
Algorithm:
- Use adjusted significance threshold (not 0.05)
- Account for multiple looks
- Stop when threshold crossed
Example (Simplified):
Day 1: p = 0.10 → Continue
Day 3: p = 0.03 → Continue
Day 5: p = 0.001 → Stop! (clear winner)
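The platforms listed below implement this with always-valid p-values or Bayesian engines. A very simplified group-sequential flavor of the idea is to pre-plan a fixed number of looks and compare each interim p-value against a stricter constant threshold (Pocock-style, roughly 0.016 per look for five looks at an overall α of 0.05). The sketch below only illustrates that idea and is not a substitute for a proper platform.

```python
# Simplified group-sequential monitoring: pre-plan the number of looks and use a
# stricter, constant per-look threshold (Pocock-style) instead of 0.05.
# Illustrative only — real platforms use always-valid or Bayesian methods.

POCOCK_THRESHOLDS = {2: 0.0294, 3: 0.0221, 4: 0.0182, 5: 0.0158}  # overall alpha ≈ 0.05

def sequential_decision(interim_p_values, planned_looks=5):
    threshold = POCOCK_THRESHOLDS[planned_looks]
    for look, p in enumerate(interim_p_values, start=1):
        if p < threshold:
            return f"Stop at look {look}: p = {p} < {threshold}"
    return "No early stop — run to the planned sample size"

print(sequential_decision([0.10, 0.03, 0.001]))   # stops at the third look
```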
Tools That Support Sequential Testing
- Statsig: Built-in sequential testing
- GrowthBook: Bayesian statistics
- Optimizely: Stats Engine (sequential)
Benefits
- Faster results (stop early if clear winner)
- Less opportunity cost
- Detect large effects quickly
Drawbacks
- Requires special tools
- Can't use traditional p-value
- More complex
Holdout Groups
What is a Holdout Group?
Definition: Small % of users kept on old experience permanently
Example:
- 95% of users: New feature
- 5% of users: Old experience (holdout)
Why Use Holdout Groups?
Measure Long-Term Effects:
- A/B test shows +10% conversion in 2 weeks
- Holdout shows +5% conversion after 6 months
- Learning: Effect diminishes over time
Detect Delayed Negative Impacts:
- A/B test shows +15% signups
- Holdout shows +10% churn after 3 months
- Learning: Feature attracts wrong users
How Long to Keep Holdout?
Guideline:
- 1-3 months for most features
- 6-12 months for major changes
- Permanent for critical features
When to Remove Holdout?
Remove if:
- No long-term differences detected
- Opportunity cost too high (5% of users on worse experience)
- Feature is critical (everyone should have it)
Experiment Analysis
Step 1: Compare Primary Metric
Example:
- Control: 5.0% conversion
- Treatment: 5.5% conversion
- Lift: +10% relative
- P-value: 0.03 ✅
Decision: Treatment is statistically significantly better.
Step 2: Check Secondary Metrics
Example:
- Revenue per user: $10.50 (control) vs $11.20 (treatment) ✅
- Time to checkout: 3.2 min (control) vs 2.8 min (treatment) ✅
Decision: Secondary metrics also improved.
Step 3: Check Counter Metrics
Example:
- Bounce rate: 30% (control) vs 32% (treatment) ⚠️
- Error rate: 0.5% (control) vs 0.5% (treatment) ✅
Decision: Slight increase in bounce rate, investigate.
Step 4: Segment Analysis
Did it work for everyone?
| Segment | Control | Treatment | Lift |
|---|---|---|---|
| Mobile | 4.5% | 5.2% | +15% ✅ |
| Desktop | 5.5% | 5.8% | +5% ✅ |
| Free users | 3.0% | 3.6% | +20% ✅ |
| Paid users | 7.0% | 7.1% | +1% ⚠️ |
Learning: Works great for mobile and free users, minimal impact on paid users.
Step 5: Statistical Significance
Check:
- P-value < 0.05 ✅
- Confidence interval doesn't include 0 ✅
Example:
- Lift: +10%
- 95% CI: [+5%, +15%]
- Interpretation: We're 95% confident the true lift is between 5% and 15%.
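A 95% confidence interval for the absolute difference (and the implied relative lift) can be computed from raw counts. A minimal standard-library sketch with hypothetical counts:

```python
# 95% CI for the difference between two conversion rates (normal approximation).
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # ≈ 1.96 for 95%
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(conv_a=1000, n_a=20000, conv_b=1100, n_b=20000)
baseline = 1000 / 20000
print(f"absolute lift CI: [{low:+.2%}, {high:+.2%}]")
print(f"relative lift CI: [{low / baseline:+.0%}, {high / baseline:+.0%}]")
# If the interval excludes 0, the result is statistically significant at 5%.
```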
Step 6: Practical Significance
Is the improvement meaningful?
Example:
- Statistically significant: Yes (p = 0.04)
- Lift: +0.1% (5.0% → 5.005%)
- Decision: Not practically significant (too small to matter)
Guideline:
- Small lift but high volume → Ship (e.g., +0.1% absolute on 1M users = 1,000 more conversions)
- Large lift but low volume → Maybe ship (e.g., +50% on a flow with only 100 baseline conversions = 50 more conversions)
Decision Framework
Ship If:
✅ Positive: Treatment is better than control
✅ Significant: P-value < 0.05
✅ No Red Flags: Secondary and counter metrics look good
✅ Works for Key Segments: Works for at least the majority of segments
Example:
- Conversion: +10% (p = 0.03) ✅
- Revenue: +8% (p = 0.05) ✅
- Bounce rate: No change ✅
- Works for mobile and desktop ✅
- Decision: Ship!
Iterate If:
⚠️ Mixed Results: Some metrics up, some down
⚠️ Works for Some Segments Only: E.g., only mobile, not desktop
⚠️ Close to Significance: p = 0.06 (just missed)
Example:
- Conversion: +10% (p = 0.03) ✅
- Revenue: -5% (p = 0.08) ⚠️
- Decision: Iterate. Conversion is up but revenue is down. Investigate why.
Kill If:
❌ Negative: Treatment is worse than control
❌ Not Significant: P-value > 0.05
❌ Opportunity Cost Too High: Could be working on better ideas
Example:
- Conversion: +2% (p = 0.15) ❌
- Took 4 weeks to test
- Decision: Kill. Not significant, move on to next idea.
Tools
Feature Flags
LaunchDarkly:
- Feature flag management
- Gradual rollouts
- Kill switches
Split.io:
- Feature flags + experimentation
- Real-time metrics
Unleash:
- Open-source feature flags
- Self-hosted option
Experimentation Platforms
Optimizely:
- Full-stack experimentation
- Visual editor for web
- Stats Engine (sequential testing)
VWO (Visual Website Optimizer):
- A/B testing for web
- Heatmaps, session recordings
- Visual editor
GrowthBook:
- Open-source experimentation
- Bayesian statistics
- Feature flags
Statsig:
- Modern experimentation platform
- Sequential testing
- Free tier
Analytics
Amplitude:
- Product analytics
- Funnel analysis
- Cohort analysis
Mixpanel:
- Event-based analytics
- A/B test analysis
- Retention analysis
PostHog:
- Open-source product analytics
- Feature flags
- Session replay
A/B Testing for Engineers
1. Feature Flag Implementation
Node.js (LaunchDarkly):
```js
const LaunchDarkly = require('launchdarkly-node-server-sdk');

const client = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);
await client.waitForInitialization();

app.get('/checkout', async (req, res) => {
  const user = {
    key: req.user.id,
    email: req.user.email,
    custom: { plan: req.user.plan }
  };

  const showNewCheckout = await client.variation('new-checkout-flow', user, false);

  if (showNewCheckout) {
    res.render('checkout-new');
  } else {
    res.render('checkout-old');
  }
});
```
Python (Statsig):
```python
import os

from statsig import statsig, StatsigUser

statsig.initialize(os.environ['STATSIG_SERVER_KEY'])

@app.route('/checkout')
def checkout():
    # check_gate expects a StatsigUser object rather than a plain dict
    user = StatsigUser(
        user_id=current_user.id,
        email=current_user.email,
        custom={'plan': current_user.plan},
    )

    show_new_checkout = statsig.check_gate(user, 'new_checkout_flow')

    if show_new_checkout:
        return render_template('checkout_new.html')
    else:
        return render_template('checkout_old.html')
```
2. Metric Instrumentation
Segment (Event Tracking):
```js
const Analytics = require('analytics-node');
const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY);

// Track checkout started
analytics.track({
  userId: user.id,
  event: 'Checkout Started',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    cart_value: cart.total,
    items_count: cart.items.length
  }
});

// Track checkout completed
analytics.track({
  userId: user.id,
  event: 'Checkout Completed',
  properties: {
    variant: showNewCheckout ? 'treatment' : 'control',
    order_id: order.id,
    revenue: order.total
  }
});
```
3. Data Pipeline
Architecture:
```
Application
  ↓ (events)
Segment
  ↓ (forwards to)
  ├── Amplitude (analytics)
  ├── Mixpanel (analytics)
  ├── Data Warehouse (BigQuery, Snowflake)
  └── Statsig (experimentation)
```
4. Results Dashboard
Grafana Dashboard:
{ "dashboard": { "title": "A/B Test: New Checkout Flow", "panels": [ { "title": "Conversion Rate by Variant", "targets": [ { "expr": "sum(checkout_completed{variant='control'}) / sum(checkout_started{variant='control'})", "legendFormat": "Control" }, { "expr": "sum(checkout_completed{variant='treatment'}) / sum(checkout_started{variant='treatment'})", "legendFormat": "Treatment" } ] }, { "title": "Sample Size", "targets": [ { "expr": "sum(checkout_started{variant='control'})", "legendFormat": "Control" }, { "expr": "sum(checkout_started{variant='treatment'})", "legendFormat": "Treatment" } ] } ] } }
Real Experiment Examples
Example 1: Button Color Test (Classic)
Hypothesis:
"If we change the CTA button from blue to orange, click-through rate will increase by 10%, because orange is more attention-grabbing."
Test:
- Control: Blue button
- Treatment: Orange button
- Sample size: 10,000 per variant
- Duration: 1 week
Results:
- Control: 5.2% CTR
- Treatment: 5.7% CTR
- Lift: +9.6%
- P-value: 0.04 ✅
Decision: Ship orange button.
Example 2: Checkout Flow Optimization
Hypothesis:
"If we reduce checkout from 5 steps to 3 steps, conversion will increase by 15%, because users abandon due to flow length."
Test:
- Control: 5-step checkout
- Treatment: 3-step checkout (combined steps)
- Sample size: 50,000 per variant
- Duration: 2 weeks
Results:
- Control: 8.5% conversion
- Treatment: 9.8% conversion
- Lift: +15.3%
- P-value: 0.001 ✅
Secondary Metrics:
- Time to checkout: 4.2 min → 3.1 min ✅
- Error rate: 2.1% → 1.8% ✅
Decision: Ship 3-step checkout.
Example 3: Pricing Page Variants
Hypothesis:
"If we show annual pricing first (instead of monthly), annual plan adoption will increase by 25%, because anchoring effect."
Test:
- Control: Monthly pricing shown first
- Treatment: Annual pricing shown first
- Sample size: 20,000 per variant
- Duration: 3 weeks
Results:
- Control: 12% annual adoption
- Treatment: 18% annual adoption
- Lift: +50%
- P-value: 0.001 ✅
Counter Metrics:
- Overall conversion: 10.5% → 10.2% ⚠️ (slight drop)
Decision: Ship, but monitor overall conversion.
Example 4: Onboarding Flow
Hypothesis:
"If we add an interactive tutorial in onboarding, activation rate will increase by 30%, because users don't know how to get started."
Test:
- Control: No tutorial
- Treatment: Interactive tutorial (5 steps)
- Sample size: 15,000 per variant
- Duration: 2 weeks
Results:
- Control: 25% activation rate
- Treatment: 28% activation rate
- Lift: +12%
- P-value: 0.08 ❌ (not significant)
Segment Analysis:
- New users: +20% (p = 0.03) ✅
- Returning users: +2% (p = 0.5) ❌
Decision: Iterate. Show tutorial only to new users.
Advanced: Bayesian A/B Testing
Traditional (Frequentist) A/B Testing
Approach:
- Null hypothesis: No difference between A and B
- P-value: Probability of seeing this result if null is true
- Reject null if p < 0.05
Interpretation:
"If there were truly no difference, we would see a result this extreme less than 5% of the time."
Bayesian A/B Testing
Approach:
- Prior belief: What we believe before test
- Likelihood: Data from test
- Posterior belief: Updated belief after test
Interpretation:
"There's a 95% probability that B is better than A."
Benefits of Bayesian
1. Easier to Interpret:
   - "95% probability B is better" (intuitive)
   - vs "p = 0.03" (confusing)
2. Can Stop Early:
   - No peeking problem
   - Stop when confident enough
3. Incorporates Prior Knowledge:
   - Use historical data
   - More accurate with small samples
Tools That Use Bayesian
- GrowthBook: Bayesian by default
- VWO: Bayesian engine option
- Google Optimize: Bayesian (discontinued in 2023)
Example
Test:
- Control: 5.0% conversion (1000 users)
- Treatment: 5.5% conversion (1000 users)
Frequentist:
- P-value: 0.15 (not significant)
- Decision: Can't conclude
Bayesian:
- Probability B > A: 87%
- Expected lift: +10%
- Decision: Likely better, but not confident enough (need 95%)
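The "probability B is better than A" number is typically computed from Beta posteriors over each variant's conversion rate. A minimal Monte Carlo sketch using only the standard library, with uniform Beta(1, 1) priors and hypothetical counts (the resulting probability depends entirely on the data you feed in):

```python
# Bayesian A/B comparison: sample from each variant's Beta posterior and count
# how often the treatment beats the control. Uniform Beta(1, 1) priors assumed.
import random

def probability_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=42):
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)   # posterior for A
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)   # posterior for B
        if rate_b > rate_a:
            wins += 1
    return wins / draws

prob = probability_b_beats_a(conv_a=50, n_a=1000, conv_b=55, n_b=1000)
print(f"P(treatment > control) ≈ {prob:.0%}")
# Ship when this probability crosses your pre-agreed bar (e.g. 95%).
```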
Summary
Quick Reference
Experiment Types:
- A/B test: Two variants
- Multivariate: Multiple changes
- Sequential: Stop early
- Holdout: Long-term measurement
When to Experiment:
- Significant features
- Uncertain outcomes
- Multiple options
- Optimization
Process:
- Define hypothesis
- Choose metrics
- Calculate sample size
- Set duration
- Design variants
- Launch
- Analyze
- Decide
Metrics:
- Primary: What we're optimizing
- Secondary: Guardrails
- Counter: Watch for negatives
Statistical Significance:
- P-value < 0.05
- Power > 80%
- Minimum detectable effect
Common Pitfalls:
- Peeking
- Sample ratio mismatch
- Novelty effect
- Seasonality
Decision Framework:
- Ship: Positive, significant, no red flags
- Iterate: Mixed results
- Kill: Negative, not significant
Tools:
- Feature flags: LaunchDarkly, Split.io
- Experimentation: Optimizely, Statsig, GrowthBook
- Analytics: Amplitude, Mixpanel, PostHog