Claude-Skills ab-test-setup

install
source · Clone the upstream repo
git clone https://github.com/borghei/Claude-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/product-team/ab-test-setup" ~/.claude/skills/borghei-claude-skills-ab-test-setup-2fce94 && rm -rf "$T"
manifest: product-team/ab-test-setup/SKILL.md
source content

A/B Test Setup - Experimentation Design & Analysis

Category: Product Team
Tags: A/B testing, experiments, statistical significance, sample size, feature flags, hypothesis testing

Overview

A/B Test Setup provides the complete framework for designing experiments that produce statistically valid, actionable results. Most A/B tests fail not because the variant was wrong, but because the test was poorly designed: wrong sample size, wrong metric, or someone peeked at results and stopped early. This skill prevents those mistakes.


The Experiment Lifecycle

1. HYPOTHESIZE  →  2. DESIGN  →  3. CALCULATE  →  4. IMPLEMENT
       ↑                                                    │
       │                                                    ▼
7. ITERATE  ←  6. DOCUMENT  ←  5. ANALYZE  ←  [Run to completion]

Step 1: Hypothesis Formulation

The Hypothesis Template

Because [observation or data point],
we believe [specific change]
will cause [measurable outcome]
for [defined audience segment].

We'll know this is true when [primary metric] changes by [minimum detectable effect].
We'll watch [guardrail metrics] to ensure no negative impact.

Good vs Bad Hypotheses

| Quality | Hypothesis | Assessment |
|---|---|---|
| Bad | "Changing the button color might increase clicks" | No data basis, no target, no measurement plan |
| Mediocre | "A green button will get more clicks than blue" | No "why", no target size, no guardrails |
| Good | "Because heatmaps show 40% of users don't notice our CTA, making the button 2x larger with contrasting color will increase CTA clicks by 15%+ for new visitors. Guardrail: page load time stays under 2s." | Data-backed, specific change, measurable outcome, defined audience, guardrail |

Hypothesis Sources (Where to Find Test Ideas)

| Source | What to Look For | Example |
|---|---|---|
| Analytics data | Drop-off points, low-performing pages | "80% of users drop off at step 3 of onboarding" |
| User research | Confusion, frustration, unmet needs | "Users don't understand what the product does from the homepage" |
| Heatmaps / session recordings | Ignored elements, rage clicks | "Nobody scrolls past the fold on the pricing page" |
| Support tickets | Recurring complaints, feature confusion | "Users constantly ask how to invite team members" |
| Competitor analysis | Different approaches to the same problem | "Competitor uses a wizard; we use a form" |
| Sales objections | Common reasons prospects don't convert | "Prospects want to see pricing before signing up" |

Step 2: Test Design

Test Types

| Type | Variants | Traffic Need | Best For |
|---|---|---|---|
| A/B | 2 (control + 1 variant) | Moderate | Single change validation |
| A/B/n | 3+ variants | High | Comparing multiple approaches |
| Multivariate (MVT) | Combinations of changes | Very high | Optimizing multiple elements |
| Split URL | Different pages | Moderate | Major redesigns |
| Bandit | Dynamic allocation | Low-moderate | Revenue optimization |

Default recommendation: Standard A/B test. Only use A/B/n or MVT when you have enough traffic and a specific need.

What to Test (By Impact)

| Category | High Impact | Medium Impact | Low Impact |
|---|---|---|---|
| Copy | Headline/value prop, CTA text | Body copy, social proof | Microcopy, labels |
| Design | Page layout, above-fold content | Visual hierarchy, imagery | Color, font size |
| UX | Number of steps, form fields | Button placement, navigation | Animations, transitions |
| Pricing | Price point, plan names | Feature packaging, anchoring | Billing frequency display |
| Social Proof | Testimonials vs none, logos | Testimonial format, placement | Testimonial count |

Metric Selection

Every test needs three types of metrics:

Primary Metric (1 only)

  • The single metric that determines success
  • Directly tied to the hypothesis
  • Must be measurable within the test duration
  • Examples: signup rate, click-through rate, purchase rate

Secondary Metrics (2-3)

  • Explain why the primary metric moved
  • Provide context for decision-making
  • Examples: time on page, scroll depth, feature adoption rate

Guardrail Metrics (1-3)

  • Things that must NOT get worse
  • Stop the test if significantly negative
  • Examples: error rate, support ticket volume, page load time, refund rate

Step 3: Sample Size Calculation

Quick Reference Table

Minimum visitors PER VARIANT needed (95% confidence, 80% power):

| Baseline Rate | 5% Lift | 10% Lift | 15% Lift | 20% Lift | 50% Lift |
|---|---|---|---|---|---|
| 1% | 620,000 | 156,000 | 70,000 | 39,000 | 6,400 |
| 2% | 305,000 | 77,000 | 34,000 | 19,500 | 3,200 |
| 3% | 200,000 | 51,000 | 23,000 | 12,800 | 2,100 |
| 5% | 116,000 | 29,500 | 13,200 | 7,500 | 1,250 |
| 10% | 54,000 | 13,800 | 6,200 | 3,500 | 600 |
| 20% | 24,000 | 6,200 | 2,800 | 1,600 | 280 |
| 50% | 6,100 | 1,600 | 720 | 410 | 75 |
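
The sample_size_calculator.py script (see Tool Reference) performs this calculation; the sketch below is a minimal standalone version of the underlying normal approximation, so its output may differ slightly from the quick-reference table depending on rounding and the exact formula variant used.

# Minimal sketch of the per-variant sample size calculation
# (normal approximation to the two-proportion z-test, two-tailed).
from scipy.stats import norm

def sample_size_per_variant(baseline: float, mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """baseline: control conversion rate; mde: relative lift, e.g. 0.10 for +10%."""
    p1 = baseline
    p2 = baseline * (1 + mde)
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2
    return int(round(n))

print(sample_size_per_variant(0.05, 0.10))  # about 31,000; same ballpark as the table above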

Duration Calculation

Duration (days) = (Sample size per variant * Number of variants) / Daily traffic to test page

Minimum duration: 7 days (to capture day-of-week effects)
Maximum recommended: 6 weeks (beyond this, external factors contaminate results)
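
Worked example: at a 5% baseline with a 15% minimum detectable effect, the table above calls for roughly 13,200 visitors per variant. With two variants and, say, 2,000 eligible visitors per day, duration = (13,200 × 2) / 2,000 ≈ 14 days, comfortably above the 7-day minimum.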

What If You Don't Have Enough Traffic?

| Situation | Solution |
|---|---|
| Need 100K visitors, get 5K/week | Increase minimum detectable effect (test bolder changes) |
| Very low traffic (<1K/week) | Use qualitative testing (user testing, surveys) instead |
| Medium traffic (5-20K/week) | Run for 4-6 weeks, test big changes only |
| High traffic (50K+/week) | You can test subtle changes, run multiple tests |

Step 4: Implementation

Client-Side Implementation

JavaScript modifies the page after initial render.

Pros: Quick to implement, no deploy needed
Cons: Can cause flicker (flash of original content), blocked by ad blockers
Tools: PostHog, Optimizely, VWO, Google Optimize (since discontinued)

Anti-flicker pattern:

<!-- Add to <head> before any other content renders -->
<style>.ab-test-hide { opacity: 0 !important; }</style>
<script>document.documentElement.classList.add('ab-test-hide');</script>

// In your test script, once the variant has been applied:
document.documentElement.classList.remove('ab-test-hide');

Server-Side Implementation

Variant determined before page renders. No flicker, no client-side dependency.

Pros: No flicker, not blocked by ad blockers, works for logged-in features
Cons: Requires engineering work, deploy needed
Tools: PostHog, LaunchDarkly, Split, Unleash, custom feature flags

Basic feature flag pattern:

# Server-side variant assignment
import hashlib

def get_variant(user_id: str, experiment: str) -> str:
    # Deterministic hash ensures the same user always sees the same variant
    hash_input = f"{user_id}:{experiment}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100  # stable 0-99 bucket per user and experiment

    if bucket < 50:
        return "control"
    else:
        return "variant"

Traffic Allocation

| Strategy | Split | When to Use |
|---|---|---|
| Standard | 50/50 | Default. Maximum statistical power. |
| Conservative | 90/10 or 80/20 | Risky changes, revenue-impacting tests |
| Ramped | Start 95/5, increase to 50/50 | New infrastructure, technical risk |

Critical rules:

  • Users must see the same variant on every visit (sticky assignment by user ID or cookie)
  • Allocation must be balanced across time of day and day of week
  • Never change allocation mid-test
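
For the conservative and ramped strategies, the deterministic-hash pattern from the implementation step extends naturally to uneven splits. A minimal sketch, assuming a hypothetical allocation mapping (this helper is illustrative, not part of the skill's scripts):

import hashlib

def get_variant_weighted(user_id: str, experiment: str, allocation: dict) -> str:
    """allocation maps variant name to percentage, e.g. {"variant": 10, "control": 90}."""
    hash_value = hashlib.md5(f"{user_id}:{experiment}".encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100  # stable 0-99 bucket per user and experiment
    cumulative = 0
    for name, pct in allocation.items():
        cumulative += pct
        if bucket < cumulative:
            return name
    return "control"  # fallback if percentages do not sum to 100

Listing the variant before control means ramping from {"variant": 5, ...} to {"variant": 50, ...} only moves users out of control into the variant, never the reverse, so earlier assignments stay sticky.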

Step 5: Running the Test

Pre-Launch Checklist

  • Hypothesis documented with primary metric and minimum detectable effect
  • Sample size calculated, expected duration estimated
  • Both variants implemented and QA'd on all device types
  • Tracking verified (events fire correctly for both variants)
  • No other tests running on the same page/feature
  • Stakeholders informed of test duration and "no peeking" rule
  • External factor calendar checked (no major launches, holidays, press)

During the Test

DO:

  • Monitor for technical errors (variant not rendering, tracking broken)
  • Check that traffic split is balanced daily
  • Document any external events that might affect results

DO NOT:

  • Look at results before reaching sample size ("peeking problem")
  • Make changes to either variant
  • Add traffic from new sources mid-test
  • Stop the test early because one variant "looks like it's winning"

The Peeking Problem (Critical)

Looking at results before reaching the planned sample size and stopping because one variant looks better leads to a 25-40% false positive rate (vs the intended 5%).

Why: Statistical significance fluctuates wildly with small samples. A variant can show p < 0.05 at 20% of planned sample size and p > 0.30 at full sample.

Solutions:

  1. Pre-commit to sample size and do not check results until reached
  2. If you must monitor: use sequential testing methods (group sequential design, always-valid p-values)
  3. Set calendar reminder for expected completion date -- that is when you look
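
Why this matters in numbers: the illustrative A/A simulation below (not part of the skill's scripts) runs experiments where both variants convert at the same rate, so every "significant" result is a false positive; peeking ten times per experiment pushes the false positive rate well above the nominal 5%.

# A/A simulation: both "variants" convert at the same rate, so any significant
# result is a false positive. Peeking ten times inflates that rate well above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_final, rate = 2_000, 10_000, 0.05
peek_points = np.linspace(1_000, n_final, 10, dtype=int)

peeking_fp, single_look_fp = 0, 0
for _ in range(n_sims):
    a = rng.random(n_final) < rate
    b = rng.random(n_final) < rate

    def p_value(n: int) -> float:
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        return 2 * stats.norm.sf(abs(a[:n].mean() - b[:n].mean()) / se) if se else 1.0

    peeking_fp += any(p_value(n) < 0.05 for n in peek_points)   # stop at the first "win"
    single_look_fp += p_value(n_final) < 0.05                   # look once, at the end

print(f"peeking: {peeking_fp / n_sims:.1%} false positives")        # well above 5%
print(f"single look: {single_look_fp / n_sims:.1%} false positives")  # close to 5%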

Step 6: Analysis

Analysis Checklist

  1. Did we reach planned sample size? If not, results are preliminary only.
  2. Is it statistically significant? p < 0.05 means a difference this large would occur less than 5% of the time if there were no real effect (see the sketch after this checklist).
  3. What's the confidence interval? Tells you the range of likely true effect.
  4. Is the effect size meaningful? A 0.1% lift that's "significant" may not be worth implementing.
  5. Are secondary metrics consistent? Do they support the primary result?
  6. Any guardrail violations? Did anything get worse?
  7. Segment analysis: Different results for mobile vs desktop? New vs returning?
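
Items 2 and 3 are what experiment_analyzer.py computes; for reference, a minimal standalone sketch of the two-proportion z-test and its confidence interval, using placeholder counts:

from math import sqrt
from scipy import stats

# Placeholder counts -- substitute your actual visitors/conversions per variant.
control_n, control_conv = 13_500, 675    # 5.0% conversion
variant_n, variant_conv = 13_500, 770    # 5.7% conversion

p_c, p_v = control_conv / control_n, variant_conv / variant_n

# z-test: pooled rate under the null hypothesis of no difference
p_pool = (control_conv + variant_conv) / (control_n + variant_n)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n))
z = (p_v - p_c) / se_pool
p_value = 2 * stats.norm.sf(abs(z))

# 95% confidence interval for the absolute difference (unpooled variance)
se_diff = sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
ci = ((p_v - p_c) - 1.96 * se_diff, (p_v - p_c) + 1.96 * se_diff)

print(f"lift: {p_v - p_c:+.2%}, p = {p_value:.3f}, 95% CI: [{ci[0]:+.2%}, {ci[1]:+.2%}]")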

Interpreting Results

| Result | Primary Metric | Confidence | Action |
|---|---|---|---|
| Clear winner | Variant +15%, p < 0.01 | High | Implement variant |
| Modest winner | Variant +5%, p < 0.05 | Medium | Implement if easy, else run longer |
| Flat | < 2% difference, p > 0.20 | High (no effect) | Keep control, test something bolder |
| Loser | Variant -10%, p < 0.05 | High | Keep control, investigate why |
| Inconclusive | 5% difference, p = 0.08 | Low | Need more traffic or bolder test |
| Mixed signals | Primary up, guardrail down | Investigate | Dig into segments, do not ship blindly |

Common Analysis Mistakes

| Mistake | Consequence | Prevention |
|---|---|---|
| Stopping at first significance | 25-40% false positive rate | Commit to sample size |
| Cherry-picking segments | Finding "winners" that don't replicate | Pre-register segments of interest |
| Ignoring confidence intervals | Overestimating effect size | Always report CI alongside p-value |
| Multiple comparisons | Inflated Type I error | Bonferroni correction for A/B/n |
| Survivorship bias | Only analyzing users who completed the flow | Include all users from the assignment point |
| Simpson's paradox | Aggregate hides segment reversal | Always check key segments |
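
For the multiple-comparisons row: an A/B/n test with three variants makes two comparisons against control, so the Bonferroni-adjusted significance threshold is 0.05 / 2 = 0.025 per comparison; sample_size_calculator.py includes this correction when sizing multi-variant tests (see Tool Reference).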

Step 7: Documentation

Every test must be documented, regardless of outcome.

Test Documentation Template

EXPERIMENT: [Name]
DATE: [Start] to [End]
OWNER: [Name]

HYPOTHESIS:
Because [observation], we believed [change] would cause [outcome] for [audience].

VARIANTS:
- Control: [description]
- Variant: [description + screenshot]

METRICS:
- Primary: [metric] (baseline: [X]%, MDE: [Y]%)
- Secondary: [metrics]
- Guardrails: [metrics]

RESULTS:
- Sample size: [actual] / [planned]
- Duration: [X] days
- Primary metric: Control [X]% vs Variant [Y]% (p = [Z], CI: [range])
- Secondary metrics: [results]
- Guardrails: [all clear / violation noted]

DECISION: [Ship variant / Keep control / Iterate]

LEARNINGS:
- [What we learned about our users]
- [What we'd do differently next time]

Experiment Prioritization Framework

ICE Scoring

| Factor | Question | Score (1-10) anchor |
|---|---|---|
| Impact | How much will this move the metric? | Big change to primary KPI = 10 |
| Confidence | How sure are we it will work? | Strong data supporting the hypothesis = 10 |
| Ease | How easy is it to implement and measure? | Can ship in a day = 10 |

ICE Score = (Impact + Confidence + Ease) / 3

Rank all test ideas by ICE score. Run highest first.
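
A minimal sketch of that ranking step; the individual Impact/Confidence/Ease scores below are illustrative and chosen to reproduce the ICE values in the backlog template that follows:

# Rank a backlog of test ideas by ICE score (illustrative scores).
ideas = [
    {"name": "Larger CTA increases signups",                 "impact": 9, "confidence": 8, "ease": 8},
    {"name": "Social proof on pricing increases conversion", "impact": 7, "confidence": 7, "ease": 7},
    {"name": "Shorter onboarding increases activation",      "impact": 8, "confidence": 6, "ease": 6},
]
for idea in ideas:
    idea["ice"] = round((idea["impact"] + idea["confidence"] + idea["ease"]) / 3, 1)

for idea in sorted(ideas, key=lambda i: i["ice"], reverse=True):
    print(f'{idea["ice"]:>4}  {idea["name"]}')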

Test Backlog Template

| # | Hypothesis | Primary Metric | ICE | Est. Duration | Status |
|---|---|---|---|---|---|
| 1 | Larger CTA increases signups | Signup rate | 8.3 | 2 weeks | Ready |
| 2 | Social proof on pricing increases conversion | Plan selection rate | 7.0 | 3 weeks | Needs design |
| 3 | Shorter onboarding increases activation | Feature activation | 6.7 | 4 weeks | In backlog |

Proactive Triggers

  • Someone debates between two design options: propose an A/B test instead of settling it by opinion
  • Conversion rate mentioned as underperforming: offer to design a test, not guess at solutions
  • Pricing page changes discussed: always test pricing changes with guardrail metrics
  • Post-launch of any feature: propose follow-up experiment to optimize
  • "Let's just try it and see": redirect to structured hypothesis before implementation

Related Skills

| Skill | Use When |
|---|---|
| analytics-tracking | Setting up event tracking that feeds experiment metrics |
| campaign-analytics | Folding experiment results into broader attribution |
| launch-strategy | Testing within a product launch sequence |
| prompt-engineer-toolkit | A/B testing AI prompts in production |

Tool Reference

sample_size_calculator.py

Calculates required sample size per variant using the normal approximation to the two-proportion z-test. Includes Bonferroni correction for multi-variant tests and duration estimation.

| Flag | Type | Default | Description |
|---|---|---|---|
| --baseline, -b | float | (required) | Baseline conversion rate (e.g. 0.05 for 5%) |
| --mde, -m | float | (required) | Minimum detectable effect as relative lift (e.g. 0.10 for 10%) |
| --alpha, -a | float | 0.05 | Significance level |
| --power, -p | float | 0.80 | Statistical power |
| --variants, -v | int | 2 | Number of variants including control |
| --daily-traffic, -d | int | 0 | Daily eligible traffic for duration estimation |
| --one-tailed | flag | False | Use one-tailed test instead of two-tailed |
| --json | flag | False | Output as JSON |
python scripts/sample_size_calculator.py --baseline 0.05 --mde 0.10
python scripts/sample_size_calculator.py --baseline 0.12 --mde 0.15 --power 0.9 --daily-traffic 5000
python scripts/sample_size_calculator.py --baseline 0.05 --mde 0.10 --variants 3 --json

experiment_analyzer.py

Analyzes A/B test results using the two-proportion z-test with confidence intervals and segment breakdown.

| Flag | Type | Default | Description |
|---|---|---|---|
| input | positional | (required) | CSV file with results, or "sample" to create a sample file |
| --alpha, -a | float | 0.05 | Significance level |
| --json | flag | False | Output as JSON |

CSV format:

variant,visitors,conversions,segment
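
A hypothetical results file in that format (numbers are illustrative only; running the analyzer with "sample" as input creates its own sample file):

variant,visitors,conversions,segment
control,10000,500,mobile
variant,10000,560,mobile
control,15000,900,desktop
variant,15000,945,desktop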

python scripts/experiment_analyzer.py sample
python scripts/experiment_analyzer.py results.csv
python scripts/experiment_analyzer.py results.csv --alpha 0.01 --json

experiment_planner.py

Generates a structured experiment plan from a hypothesis text, including metric selection, sample size, timeline, risks, and documentation template.

| Flag | Type | Default | Description |
|---|---|---|---|
| --hypothesis, -H | string | (required) | Experiment hypothesis text |
| --baseline, -b | float | 0.05 | Baseline conversion rate |
| --mde, -m | float | 0.10 | Minimum detectable effect as relative lift |
| --daily-traffic, -d | int | 0 | Daily eligible traffic |
| --variants, -v | int | 2 | Number of variants including control |
| --json | flag | False | Output as JSON |
python scripts/experiment_planner.py --hypothesis "Larger CTA will increase signups by 15%"
python scripts/experiment_planner.py -H "Simplified checkout boosts conversions" -b 0.08 -m 0.15 -d 3000
python scripts/experiment_planner.py -H "New pricing page" --json

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Sample size is unrealistically large | MDE too small or baseline too low | Increase MDE (test bolder changes) or target a higher-traffic page |
| Test duration exceeds 6 weeks | Insufficient daily traffic | Consider qualitative methods, test bigger changes, or combine traffic from multiple pages |
| p-value hovers around 0.05 | Borderline significance | Do not stop early; run to planned sample size or extend 20% |
| Results significant but lift is tiny (<1%) | Overpowered test | Check practical significance alongside statistical significance |
| Segment results contradict overall | Simpson's paradox | Investigate segment composition; report both overall and segment results |
| Variant performs differently on mobile vs desktop | Device-specific UX issues | Design device-specific variants; increase per-segment sample size |
| Calculator produces negative CI | Very small samples or extreme rates | Ensure sufficient sample size; check data integrity |

Success Criteria

| Criterion | Target | How to Measure |
|---|---|---|
| Tests reach planned sample size | 100% of tests | Compare actual vs planned sample at conclusion |
| False positive rate | <5% | Track post-implementation lift vs test prediction |
| Test velocity | 2+ tests per team per month | Count experiments documented per sprint |
| Documentation completeness | 100% of tests documented | Audit experiment records quarterly |
| Average test duration | <4 weeks | Measure start-to-conclusion calendar days |
| Decision quality | >80% of shipped variants hold gains at 90 days | Post-ship metric tracking |

Scope & Limitations

In scope:

  • Hypothesis formulation and validation
  • Sample size and power calculations
  • Frequentist two-proportion z-tests
  • A/B, A/B/n, and split URL test planning
  • Segment-level analysis
  • Pre/post test documentation

Out of scope:

  • Bayesian A/B testing methods (use dedicated Bayesian tools)
  • Multi-armed bandit algorithms (require real-time allocation infrastructure)
  • Multivariate testing (MVT) analysis (combinatorial explosion requires specialized tools)
  • Server-side feature flag implementation (see engineering skills)
  • Revenue-based metrics requiring transaction-level data
  • Sequential testing / always-valid p-values (use Optimizely Stats Engine or similar)

Integration Points

| Tool / Platform | Integration Method | Use Case |
|---|---|---|
| PostHog / Amplitude | JSON export from experiment_analyzer | Feed results into product analytics |
| Jira / Linear | experiment_planner JSON output | Create experiment tickets with metadata |
| Google Sheets | CSV export from experiment_analyzer | Share results with non-technical stakeholders |
| LaunchDarkly / Unleash | experiment_planner checklist | Pre-launch validation before feature flag rollout |
| Slack / Notion | Copy human-readable output | Async experiment status updates |
| CI/CD pipelines | --json flag on all scripts | Automated experiment health checks |