Claude-Skills ab-test-setup

install
source · Clone the upstream repo
git clone https://github.com/borghei/Claude-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/product-team/ab-test-setup" ~/.claude/skills/borghei-claude-skills-ab-test-setup-2fce94 && rm -rf "$T"
manifest: product-team/ab-test-setup/SKILL.md
source content

A/B Test Setup - Experimentation Design & Analysis

Category: Product Team
Tags: A/B testing, experiments, statistical significance, sample size, feature flags, hypothesis testing

Overview

A/B Test Setup provides the complete framework for designing experiments that produce statistically valid, actionable results. Most A/B tests fail not because the variant was wrong, but because the test was poorly designed: wrong sample size, wrong metric, or someone peeked at results and stopped early. This skill prevents those mistakes.


The Experiment Lifecycle

1. HYPOTHESIZE  →  2. DESIGN  →  3. CALCULATE  →  4. IMPLEMENT
       ↑                                                    │
       │                                                    ▼
7. ITERATE  ←  6. DOCUMENT  ←  5. ANALYZE  ←  [Run to completion]

Step 1: Hypothesis Formulation

The Hypothesis Template

Because [observation or data point],
we believe [specific change]
will cause [measurable outcome]
for [defined audience segment].

We'll know this is true when [primary metric] changes by [minimum detectable effect].
We'll watch [guardrail metrics] to ensure no negative impact.

Good vs Bad Hypotheses

| Quality | Hypothesis | Assessment |
|---|---|---|
| Bad | "Changing the button color might increase clicks" | No data basis, no target, no measurement plan |
| Mediocre | "A green button will get more clicks than blue" | No "why", no target size, no guardrails |
| Good | "Because heatmaps show 40% of users don't notice our CTA, making the button 2x larger with contrasting color will increase CTA clicks by 15%+ for new visitors. Guardrail: page load time stays under 2s." | Data-backed, specific change, measurable outcome, defined audience, guardrail |

Hypothesis Sources (Where to Find Test Ideas)

| Source | What to Look For | Example |
|---|---|---|
| Analytics data | Drop-off points, low-performing pages | "80% of users drop off at step 3 of onboarding" |
| User research | Confusion, frustration, unmet needs | "Users don't understand what the product does from the homepage" |
| Heatmaps / session recordings | Ignored elements, rage clicks | "Nobody scrolls past the fold on the pricing page" |
| Support tickets | Recurring complaints, feature confusion | "Users constantly ask how to invite team members" |
| Competitor analysis | Different approaches to the same problem | "Competitor uses a wizard; we use a form" |
| Sales objections | Common reasons prospects don't convert | "Prospects want to see pricing before signing up" |

Step 2: Test Design

Test Types

| Type | Variants | Traffic Need | Best For |
|---|---|---|---|
| A/B | 2 (control + 1 variant) | Moderate | Single change validation |
| A/B/n | 3+ variants | High | Comparing multiple approaches |
| Multivariate (MVT) | Combinations of changes | Very high | Optimizing multiple elements |
| Split URL | Different pages | Moderate | Major redesigns |
| Bandit | Dynamic allocation | Low-moderate | Revenue optimization |

Default recommendation: Standard A/B test. Only use A/B/n or MVT when you have enough traffic and a specific need.

What to Test (By Impact)

| Category | High Impact | Medium Impact | Low Impact |
|---|---|---|---|
| Copy | Headline/value prop, CTA text | Body copy, social proof | Microcopy, labels |
| Design | Page layout, above-fold content | Visual hierarchy, imagery | Color, font size |
| UX | Number of steps, form fields | Button placement, navigation | Animations, transitions |
| Pricing | Price point, plan names | Feature packaging, anchoring | Billing frequency display |
| Social Proof | Testimonials vs none, logos | Testimonial format, placement | Testimonial count |

Metric Selection

Every test needs three types of metrics:

Primary Metric (1 only)

  • The single metric that determines success
  • Directly tied to the hypothesis
  • Must be measurable within the test duration
  • Examples: signup rate, click-through rate, purchase rate

Secondary Metrics (2-3)

  • Explain why the primary metric moved
  • Provide context for decision-making
  • Examples: time on page, scroll depth, feature adoption rate

Guardrail Metrics (1-3)

  • Things that must NOT get worse
  • Stop the test if significantly negative
  • Examples: error rate, support ticket volume, page load time, refund rate

Step 3: Sample Size Calculation

Quick Reference Table

Minimum visitors PER VARIANT needed (95% confidence, 80% power):

| Baseline Rate | 5% Lift | 10% Lift | 15% Lift | 20% Lift | 50% Lift |
|---|---|---|---|---|---|
| 1% | 620,000 | 156,000 | 70,000 | 39,000 | 6,400 |
| 2% | 305,000 | 77,000 | 34,000 | 19,500 | 3,200 |
| 3% | 200,000 | 51,000 | 23,000 | 12,800 | 2,100 |
| 5% | 116,000 | 29,500 | 13,200 | 7,500 | 1,250 |
| 10% | 54,000 | 13,800 | 6,200 | 3,500 | 600 |
| 20% | 24,000 | 6,200 | 2,800 | 1,600 | 280 |
| 50% | 6,100 | 1,600 | 720 | 410 | 75 |
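
The sample_size_calculator.py script (see Tool Reference) performs this calculation; the sketch below is a minimal standalone version of the underlying normal approximation, so its output may differ slightly from the quick-reference table depending on rounding and the exact formula variant used.

# Minimal sketch of the per-variant sample size calculation
# (normal approximation to the two-proportion z-test, two-tailed).
from scipy.stats import norm

def sample_size_per_variant(baseline: float, mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """baseline: control conversion rate; mde: relative lift, e.g. 0.10 for +10%."""
    p1 = baseline
    p2 = baseline * (1 + mde)
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2
    return int(round(n))

print(sample_size_per_variant(0.05, 0.10))  # about 31,000; same ballpark as the table above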

Duration Calculation

Duration (days) = (Sample size per variant * Number of variants) / Daily traffic to test page

Minimum duration: 7 days (to capture day-of-week effects)
Maximum recommended: 6 weeks (beyond this, external factors contaminate results)
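
Worked example: at a 5% baseline with a 15% minimum detectable effect, the table above calls for roughly 13,200 visitors per variant. With two variants and, say, 2,000 eligible visitors per day, duration = (13,200 × 2) / 2,000 ≈ 14 days, comfortably above the 7-day minimum.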

What If You Don't Have Enough Traffic?

| Situation | Solution |
|---|---|
| Need 100K visitors, get 5K/week | Increase minimum detectable effect (test bolder changes) |
| Very low traffic (<1K/week) | Use qualitative testing (user testing, surveys) instead |
| Medium traffic (5-20K/week) | Run for 4-6 weeks, test big changes only |
| High traffic (50K+/week) | You can test subtle changes, run multiple tests |

Step 4: Implementation

Client-Side Implementation

JavaScript modifies the page after initial render.

Pros: Quick to implement, no deploy needed
Cons: Can cause flicker (flash of original content), blocked by ad blockers
Tools: PostHog, Optimizely, VWO, Google Optimize (since discontinued)

Anti-flicker pattern:

<!-- Add to <head> before any other content renders -->
<style>.ab-test-hide { opacity: 0 !important; }</style>
<script>document.documentElement.classList.add('ab-test-hide');</script>

// In your test script, once the variant has been applied:
document.documentElement.classList.remove('ab-test-hide');

Server-Side Implementation

Variant determined before page renders. No flicker, no client-side dependency.

Pros: No flicker, not blocked by ad blockers, works for logged-in features
Cons: Requires engineering work, deploy needed
Tools: PostHog, LaunchDarkly, Split, Unleash, custom feature flags

Basic feature flag pattern:

# Server-side variant assignment
import hashlib

def get_variant(user_id: str, experiment: str) -> str:
    # Deterministic hash ensures the same user always sees the same variant
    hash_input = f"{user_id}:{experiment}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100  # stable 0-99 bucket per user and experiment

    if bucket < 50:
        return "control"
    else:
        return "variant"

Traffic Allocation

| Strategy | Split | When to Use |
|---|---|---|
| Standard | 50/50 | Default. Maximum statistical power. |
| Conservative | 90/10 or 80/20 | Risky changes, revenue-impacting tests |
| Ramped | Start 95/5, increase to 50/50 | New infrastructure, technical risk |

Critical rules:

  • Users must see the same variant on every visit (sticky assignment by user ID or cookie)
  • Allocation must be balanced across time of day and day of week
  • Never change allocation mid-test
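
For the conservative and ramped strategies, the deterministic-hash pattern from the implementation step extends naturally to uneven splits. A minimal sketch, assuming a hypothetical allocation mapping (this helper is illustrative, not part of the skill's scripts):

import hashlib

def get_variant_weighted(user_id: str, experiment: str, allocation: dict) -> str:
    """allocation maps variant name to percentage, e.g. {"variant": 10, "control": 90}."""
    hash_value = hashlib.md5(f"{user_id}:{experiment}".encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100  # stable 0-99 bucket per user and experiment
    cumulative = 0
    for name, pct in allocation.items():
        cumulative += pct
        if bucket < cumulative:
            return name
    return "control"  # fallback if percentages do not sum to 100

Listing the variant before control means ramping from {"variant": 5, ...} to {"variant": 50, ...} only moves users out of control into the variant, never the reverse, so earlier assignments stay sticky.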

Step 5: Running the Test

Pre-Launch Checklist

  • Hypothesis documented with primary metric and minimum detectable effect
  • Sample size calculated, expected duration estimated
  • Both variants implemented and QA'd on all device types
  • Tracking verified (events fire correctly for both variants)
  • No other tests running on the same page/feature
  • Stakeholders informed of test duration and "no peeking" rule
  • External factor calendar checked (no major launches, holidays, press)

During the Test

DO:

  • Monitor for technical errors (variant not rendering, tracking broken)
  • Check that traffic split is balanced daily
  • Document any external events that might affect results

DO NOT:

  • Look at results before reaching sample size ("peeking problem")
  • Make changes to either variant
  • Add traffic from new sources mid-test
  • Stop the test early because one variant "looks like it's winning"

The Peeking Problem (Critical)

Looking at results before reaching the planned sample size and stopping because one variant looks better leads to a 25-40% false positive rate (vs the intended 5%).

Why: Statistical significance fluctuates wildly with small samples. A variant can show p < 0.05 at 20% of planned sample size and p > 0.30 at full sample.

Solutions:

  1. Pre-commit to sample size and do not check results until reached
  2. If you must monitor: use sequential testing methods (group sequential design, always-valid p-values)
  3. Set calendar reminder for expected completion date -- that is when you look
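
Why this matters in numbers: the illustrative A/A simulation below (not part of the skill's scripts) runs experiments where both variants convert at the same rate, so every "significant" result is a false positive; peeking ten times per experiment pushes the false positive rate well above the nominal 5%.

# A/A simulation: both "variants" convert at the same rate, so any significant
# result is a false positive. Peeking ten times inflates that rate well above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_final, rate = 2_000, 10_000, 0.05
peek_points = np.linspace(1_000, n_final, 10, dtype=int)

peeking_fp, single_look_fp = 0, 0
for _ in range(n_sims):
    a = rng.random(n_final) < rate
    b = rng.random(n_final) < rate

    def p_value(n: int) -> float:
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        return 2 * stats.norm.sf(abs(a[:n].mean() - b[:n].mean()) / se) if se else 1.0

    peeking_fp += any(p_value(n) < 0.05 for n in peek_points)   # stop at the first "win"
    single_look_fp += p_value(n_final) < 0.05                   # look once, at the end

print(f"peeking: {peeking_fp / n_sims:.1%} false positives")        # well above 5%
print(f"single look: {single_look_fp / n_sims:.1%} false positives")  # close to 5%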

Step 6: Analysis

Analysis Checklist

  1. Did we reach planned sample size? If not, results are preliminary only.
  2. Is it statistically significant? p < 0.05 means a difference this large would occur less than 5% of the time if there were no real effect (see the sketch after this checklist).
  3. What's the confidence interval? Tells you the range of likely true effect.
  4. Is the effect size meaningful? A 0.1% lift that's "significant" may not be worth implementing.
  5. Are secondary metrics consistent? Do they support the primary result?
  6. Any guardrail violations? Did anything get worse?
  7. Segment analysis: Different results for mobile vs desktop? New vs returning?
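
Items 2 and 3 are what experiment_analyzer.py computes; for reference, a minimal standalone sketch of the two-proportion z-test and its confidence interval, using placeholder counts:

from math import sqrt
from scipy import stats

# Placeholder counts -- substitute your actual visitors/conversions per variant.
control_n, control_conv = 13_500, 675    # 5.0% conversion
variant_n, variant_conv = 13_500, 770    # 5.7% conversion

p_c, p_v = control_conv / control_n, variant_conv / variant_n

# z-test: pooled rate under the null hypothesis of no difference
p_pool = (control_conv + variant_conv) / (control_n + variant_n)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n))
z = (p_v - p_c) / se_pool
p_value = 2 * stats.norm.sf(abs(z))

# 95% confidence interval for the absolute difference (unpooled variance)
se_diff = sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
ci = ((p_v - p_c) - 1.96 * se_diff, (p_v - p_c) + 1.96 * se_diff)

print(f"lift: {p_v - p_c:+.2%}, p = {p_value:.3f}, 95% CI: [{ci[0]:+.2%}, {ci[1]:+.2%}]")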

Interpreting Results

| Result | Primary Metric | Confidence | Action |
|---|---|---|---|
| Clear winner | Variant +15%, p < 0.01 | High | Implement variant |
| Modest winner | Variant +5%, p < 0.05 | Medium | Implement if easy, else run longer |
| Flat | < 2% difference, p > 0.20 | High (no effect) | Keep control, test something bolder |
| Loser | Variant -10%, p < 0.05 | High | Keep control, investigate why |
| Inconclusive | 5% difference, p = 0.08 | Low | Need more traffic or bolder test |
| Mixed signals | Primary up, guardrail down | Investigate | Dig into segments, do not ship blindly |

Common Analysis Mistakes

| Mistake | Consequence | Prevention |
|---|---|---|
| Stopping at first significance | 25-40% false positive rate | Commit to sample size |
| Cherry-picking segments | Finding "winners" that don't replicate | Pre-register segments of interest |
| Ignoring confidence intervals | Overestimating effect size | Always report CI alongside p-value |
| Multiple comparisons | Inflated Type I error | Bonferroni correction for A/B/n |
| Survivorship bias | Only analyzing users who completed the flow | Include all users from the assignment point |
| Simpson's paradox | Aggregate hides segment reversal | Always check key segments |
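
For the multiple-comparisons row: an A/B/n test with three variants makes two comparisons against control, so the Bonferroni-adjusted significance threshold is 0.05 / 2 = 0.025 per comparison; sample_size_calculator.py includes this correction when sizing multi-variant tests (see Tool Reference).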

Step 7: Documentation

Every test must be documented, regardless of outcome.

Test Documentation Template

EXPERIMENT: [Name]
DATE: [Start] to [End]
OWNER: [Name]

HYPOTHESIS:
Because [observation], we believed [change] would cause [outcome] for [audience].

VARIANTS:
- Control: [description]
- Variant: [description + screenshot]

METRICS:
- Primary: [metric] (baseline: [X]%, MDE: [Y]%)
- Secondary: [metrics]
- Guardrails: [metrics]

RESULTS:
- Sample size: [actual] / [planned]
- Duration: [X] days
- Primary metric: Control [X]% vs Variant [Y]% (p = [Z], CI: [range])
- Secondary metrics: [results]
- Guardrails: [all clear / violation noted]

DECISION: [Ship variant / Keep control / Iterate]

LEARNINGS:
- [What we learned about our users]
- [What we'd do differently next time]

Experiment Prioritization Framework

ICE Scoring

| Factor | Question | Score (1-10) anchor |
|---|---|---|
| Impact | How much will this move the metric? | Big change to primary KPI = 10 |
| Confidence | How sure are we it will work? | Strong data supporting the hypothesis = 10 |
| Ease | How easy is it to implement and measure? | Can ship in a day = 10 |

ICE Score = (Impact + Confidence + Ease) / 3

Rank all test ideas by ICE score. Run highest first.
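
A minimal sketch of that ranking step; the individual Impact/Confidence/Ease scores below are illustrative and chosen to reproduce the ICE values in the backlog template that follows:

# Rank a backlog of test ideas by ICE score (illustrative scores).
ideas = [
    {"name": "Larger CTA increases signups",                 "impact": 9, "confidence": 8, "ease": 8},
    {"name": "Social proof on pricing increases conversion", "impact": 7, "confidence": 7, "ease": 7},
    {"name": "Shorter onboarding increases activation",      "impact": 8, "confidence": 6, "ease": 6},
]
for idea in ideas:
    idea["ice"] = round((idea["impact"] + idea["confidence"] + idea["ease"]) / 3, 1)

for idea in sorted(ideas, key=lambda i: i["ice"], reverse=True):
    print(f'{idea["ice"]:>4}  {idea["name"]}')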

Test Backlog Template

| # | Hypothesis | Primary Metric | ICE | Est. Duration | Status |
|---|---|---|---|---|---|
| 1 | Larger CTA increases signups | Signup rate | 8.3 | 2 weeks | Ready |
| 2 | Social proof on pricing increases conversion | Plan selection rate | 7.0 | 3 weeks | Needs design |
| 3 | Shorter onboarding increases activation | Feature activation | 6.7 | 4 weeks | In backlog |

Proactive Triggers

  • Someone debates between two design options: propose an A/B test instead of settling it by opinion
  • Conversion rate mentioned as underperforming: offer to design a test, not guess at solutions
  • Pricing page changes discussed: always test pricing changes with guardrail metrics
  • Post-launch of any feature: propose follow-up experiment to optimize
  • "Let's just try it and see": redirect to structured hypothesis before implementation

Related Skills

| Skill | Use When |
|---|---|
| analytics-tracking | Setting up event tracking that feeds experiment metrics |
| campaign-analytics | Folding experiment results into broader attribution |
| launch-strategy | Testing within a product launch sequence |
| prompt-engineer-toolkit | A/B testing AI prompts in production |

Tool Reference

sample_size_calculator.py

Calculates required sample size per variant using the normal approximation to the two-proportion z-test. Includes Bonferroni correction for multi-variant tests and duration estimation.

| Flag | Type | Default | Description |
|---|---|---|---|
| --baseline, -b | float | (required) | Baseline conversion rate (e.g. 0.05 for 5%) |
| --mde, -m | float | (required) | Minimum detectable effect as relative lift (e.g. 0.10 for 10%) |
| --alpha, -a | float | 0.05 | Significance level |
| --power, -p | float | 0.80 | Statistical power |
| --variants, -v | int | 2 | Number of variants including control |
| --daily-traffic, -d | int | 0 | Daily eligible traffic for duration estimation |
| --one-tailed | flag | False | Use one-tailed test instead of two-tailed |
| --json | flag | False | Output as JSON |
python scripts/sample_size_calculator.py --baseline 0.05 --mde 0.10
python scripts/sample_size_calculator.py --baseline 0.12 --mde 0.15 --power 0.9 --daily-traffic 5000
python scripts/sample_size_calculator.py --baseline 0.05 --mde 0.10 --variants 3 --json

experiment_analyzer.py

Analyzes A/B test results using the two-proportion z-test with confidence intervals and segment breakdown.

| Flag | Type | Default | Description |
|---|---|---|---|
| input | positional | (required) | CSV file with results, or "sample" to create a sample file |
| --alpha, -a | float | 0.05 | Significance level |
| --json | flag | False | Output as JSON |

CSV format:

variant,visitors,conversions,segment
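
A hypothetical results file in that format (numbers are illustrative only; running the analyzer with "sample" as input creates its own sample file):

variant,visitors,conversions,segment
control,10000,500,mobile
variant,10000,560,mobile
control,15000,900,desktop
variant,15000,945,desktop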

python scripts/experiment_analyzer.py sample
python scripts/experiment_analyzer.py results.csv
python scripts/experiment_analyzer.py results.csv --alpha 0.01 --json

experiment_planner.py

Generates a structured experiment plan from a hypothesis text, including metric selection, sample size, timeline, risks, and documentation template.

| Flag | Type | Default | Description |
|---|---|---|---|
| --hypothesis, -H | string | (required) | Experiment hypothesis text |
| --baseline, -b | float | 0.05 | Baseline conversion rate |
| --mde, -m | float | 0.10 | Minimum detectable effect as relative lift |
| --daily-traffic, -d | int | 0 | Daily eligible traffic |
| --variants, -v | int | 2 | Number of variants including control |
| --json | flag | False | Output as JSON |
python scripts/experiment_planner.py --hypothesis "Larger CTA will increase signups by 15%"
python scripts/experiment_planner.py -H "Simplified checkout boosts conversions" -b 0.08 -m 0.15 -d 3000
python scripts/experiment_planner.py -H "New pricing page" --json

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Sample size is unrealistically large | MDE too small or baseline too low | Increase MDE (test bolder changes) or target a higher-traffic page |
| Test duration exceeds 6 weeks | Insufficient daily traffic | Consider qualitative methods, test bigger changes, or combine traffic from multiple pages |
| p-value hovers around 0.05 | Borderline significance | Do not stop early; run to planned sample size or extend 20% |
| Results significant but lift is tiny (<1%) | Overpowered test | Check practical significance alongside statistical significance |
| Segment results contradict overall | Simpson's paradox | Investigate segment composition; report both overall and segment results |
| Variant performs differently on mobile vs desktop | Device-specific UX issues | Design device-specific variants; increase per-segment sample size |
| Calculator produces negative CI | Very small samples or extreme rates | Ensure sufficient sample size; check data integrity |

Success Criteria

| Criterion | Target | How to Measure |
|---|---|---|
| Tests reach planned sample size | 100% of tests | Compare actual vs planned sample at conclusion |
| False positive rate | <5% | Track post-implementation lift vs test prediction |
| Test velocity | 2+ tests per team per month | Count experiments documented per sprint |
| Documentation completeness | 100% of tests documented | Audit experiment records quarterly |
| Average test duration | <4 weeks | Measure start-to-conclusion calendar days |
| Decision quality | >80% of shipped variants hold gains at 90 days | Post-ship metric tracking |

Scope & Limitations

In scope:

  • Hypothesis formulation and validation
  • Sample size and power calculations
  • Frequentist two-proportion z-tests
  • A/B, A/B/n, and split URL test planning
  • Segment-level analysis
  • Pre/post test documentation

Out of scope:

  • Bayesian A/B testing methods (use dedicated Bayesian tools)
  • Multi-armed bandit algorithms (require real-time allocation infrastructure)
  • Multivariate testing (MVT) analysis (combinatorial explosion requires specialized tools)
  • Server-side feature flag implementation (see engineering skills)
  • Revenue-based metrics requiring transaction-level data
  • Sequential testing / always-valid p-values (use Optimizely Stats Engine or similar)

Integration Points

| Tool / Platform | Integration Method | Use Case |
|---|---|---|
| PostHog / Amplitude | JSON export from experiment_analyzer | Feed results into product analytics |
| Jira / Linear | experiment_planner JSON output | Create experiment tickets with metadata |
| Google Sheets | CSV export from experiment_analyzer | Share results with non-technical stakeholders |
| LaunchDarkly / Unleash | experiment_planner checklist | Pre-launch validation before feature flag rollout |
| Slack / Notion | Copy human-readable output | Async experiment status updates |
| CI/CD pipelines | --json flag on all scripts | Automated experiment health checks |