Claude-skill-registry ab-testing-statistician
Expert in statistical analysis for blind A/B and ABX audio testing. Validates randomization, calculates statistical significance, and ensures proper experimental design. Use when implementing A/B test features or analyzing test results.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/ab-testing-statistician" ~/.claude/skills/majiayu000-claude-skill-registry-ab-testing-statistician && rm -rf "$T"
A/B Testing Statistician
Specialized agent for designing and validating blind audio comparison tests (A/B, Blind AB, ABX) with proper statistical analysis.
Overview of Audio A/B Testing
Test Modes
| Mode | Description | User Knows? | Purpose |
|---|---|---|---|
| AB | Switch between A and B | Yes | Quick comparison, training |
| Blind AB | A and B randomly mapped to Options 1 and 2 | No | Unbiased preference detection |
| ABX | X is secretly either A or B, user guesses | No | Audibility testing (can you hear the difference?) |
Why Blind Testing Matters
Confirmation Bias: Listeners tend to prefer what they expect to be better.
Example:
Non-blind: "This expensive cable sounds clearer!" (placebo effect)
Blind: "I can't tell the difference" (objective reality)
Session Management
Session State (Rust)
#[derive(Clone, Serialize, Deserialize)]
pub struct ABSession {
    pub mode: ABTestMode,          // AB, BlindAB, or ABX
    pub preset_a_name: String,
    pub preset_b_name: String,
    pub trim_db: f32,              // Loudness compensation for B
    pub total_trials: usize,
    pub current_trial: usize,
    pub hidden_mapping: Vec<bool>, // For BlindAB: true = Option1 is A
    pub x_is_a: Vec<bool>,         // For ABX: true = X is A
    pub answers: Vec<ABAnswer>,    // User responses
}

#[derive(Clone, Serialize, Deserialize)]
pub enum ABTestMode {
    AB,      // Non-blind switching
    BlindAB, // Blind preference test
    ABX,     // Blind audibility test
}

#[derive(Clone, Serialize, Deserialize)]
pub struct ABAnswer {
    pub trial: usize,
    pub selected_option: String, // "A", "B", "1", "2", or "X"
    pub timestamp: u64,          // Milliseconds since session start
}
Randomization (Critical!)
BlindAB Mode: Each trial randomly maps A/B to Options 1/2:
pub fn create_blind_ab_session(
    preset_a: String,
    preset_b: String,
    num_trials: usize,
    trim_db: f32,
) -> ABSession {
    use rand::Rng;
    let mut rng = rand::thread_rng();

    // Randomize each trial independently
    let hidden_mapping: Vec<bool> = (0..num_trials)
        .map(|_| rng.gen_bool(0.5)) // 50% chance Option1 = A
        .collect();

    ABSession {
        mode: ABTestMode::BlindAB,
        preset_a_name: preset_a,
        preset_b_name: preset_b,
        trim_db,
        total_trials: num_trials,
        current_trial: 0,
        hidden_mapping,
        x_is_a: vec![],
        answers: vec![],
    }
}
ABX Mode: X is randomly set to A or B for each trial:
pub fn create_abx_session(
    preset_a: String,
    preset_b: String,
    num_trials: usize,
    trim_db: f32,
) -> ABSession {
    use rand::Rng;
    let mut rng = rand::thread_rng();

    // Randomize X for each trial
    let x_is_a: Vec<bool> = (0..num_trials)
        .map(|_| rng.gen_bool(0.5)) // 50% chance X = A
        .collect();

    ABSession {
        mode: ABTestMode::ABX,
        preset_a_name: preset_a,
        preset_b_name: preset_b,
        trim_db,
        total_trials: num_trials,
        current_trial: 0,
        hidden_mapping: vec![],
        x_is_a,
        answers: vec![],
    }
}
Critical Rule: Randomize PER TRIAL, not once for all trials!
❌ Wrong:
let option1_is_a = rng.gen_bool(0.5); // Use same mapping for all trials
✅ Correct:
let hidden_mapping: Vec<bool> = (0..num_trials)
    .map(|_| rng.gen_bool(0.5))
    .collect();
Loudness Compensation (Trim Parameter)
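Problem: The louder option tends to be perceived as "better", partly because perceived tonal balance changes with playback level (Fletcher-Munson equal-loudness curves).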
Solution: Level-match presets before testing
Auto-Calculate Trim
pub fn calculate_auto_trim(
    bands_a: &[ParametricBand],
    preamp_a: f32,
    bands_b: &[ParametricBand],
    preamp_b: f32,
) -> f32 {
    use crate::audio_math::calculate_peak_gain;

    let peak_a = calculate_peak_gain(bands_a, preamp_a);
    let peak_b = calculate_peak_gain(bands_b, preamp_b);

    // Adjust B to match A's peak level
    peak_a - peak_b
}
Apply Trim to Preset B
pub fn apply_preset_with_trim(
    bands: &[ParametricBand],
    preamp: f32,
    trim_db: f32,
) -> Result<(), String> {
    let adjusted_preamp = preamp + trim_db;

    // Apply to EqualizerAPO
    write_eapo_config(bands, adjusted_preamp)?;
    Ok(())
}
Example:
Preset A: Peak gain = -2 dB
Preset B: Peak gain = +1 dB
Trim = -2 - (+1) = -3 dB
Apply Preset B with -3 dB trim → Both have -2 dB peak
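The arithmetic in this example can be checked with a minimal sketch. The helper names `auto_trim_db` and `peak_after_trim` are hypothetical and take the peak gains directly, rather than computing them from bands as `calculate_auto_trim` does.

```rust
/// Trim (in dB) to add to preset B's preamp so that its peak gain
/// matches preset A's. Hypothetical helper mirroring `peak_a - peak_b`.
fn auto_trim_db(peak_a_db: f32, peak_b_db: f32) -> f32 {
    peak_a_db - peak_b_db
}

/// Peak gain of a preset after the trim has been added to its preamp.
fn peak_after_trim(peak_db: f32, trim_db: f32) -> f32 {
    peak_db + trim_db
}
```

With peaks of -2 dB for A and +1 dB for B, `auto_trim_db(-2.0, 1.0)` gives -3 dB, and `peak_after_trim(1.0, -3.0)` brings B back to -2 dB. Note that this matches peak gain only; two presets with equal peaks can still differ in perceived loudness.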
Statistical Analysis
Preference Analysis (BlindAB)
Count how many times each preset was preferred:
pub struct PreferenceResults {
    pub a_selected: usize,
    pub b_selected: usize,
    pub total_trials: usize,
    pub a_percentage: f64,
    pub b_percentage: f64,
    pub p_value: f64, // Statistical significance
}

pub fn analyze_blind_ab(session: &ABSession) -> PreferenceResults {
    let mut a_count: usize = 0;
    let mut b_count: usize = 0;

    for (i, answer) in session.answers.iter().enumerate() {
        let option1_is_a = session.hidden_mapping[i];
        let selected_a = match answer.selected_option.as_str() {
            "1" => option1_is_a,
            "2" => !option1_is_a,
            _ => continue, // Skip anything that is not a valid choice
        };
        if selected_a {
            a_count += 1;
        } else {
            b_count += 1;
        }
    }

    let total = a_count + b_count;
    // Guard against 0/0 = NaN when no valid answers were recorded
    let denom = total.max(1) as f64;
    let a_pct = (a_count as f64 / denom) * 100.0;
    let b_pct = (b_count as f64 / denom) * 100.0;

    // Binomial test: is this significantly different from 50/50?
    let p_value = binomial_test(a_count, total, 0.5);

    PreferenceResults {
        a_selected: a_count,
        b_selected: b_count,
        total_trials: total,
        a_percentage: a_pct,
        b_percentage: b_pct,
        p_value,
    }
}
ABX Analysis (Audibility Test)
Count correct vs incorrect identifications:
pub struct ABXResults {
    pub correct: usize,
    pub incorrect: usize,
    pub total_trials: usize,
    pub accuracy: f64,
    pub p_value: f64,
}

pub fn analyze_abx(session: &ABSession) -> ABXResults {
    let mut correct: usize = 0;
    let mut incorrect: usize = 0;

    for (i, answer) in session.answers.iter().enumerate() {
        let x_is_a = session.x_is_a[i];
        let guessed_a = match answer.selected_option.as_str() {
            "A" => true,
            "B" => false,
            _ => continue, // Skip anything that is not a valid guess
        };
        if guessed_a == x_is_a {
            correct += 1;
        } else {
            incorrect += 1;
        }
    }

    let total = correct + incorrect;
    // Guard against 0/0 = NaN when no valid answers were recorded
    let accuracy = (correct as f64 / total.max(1) as f64) * 100.0;

    // Binomial test (two-tailed): does accuracy differ from 50% guessing?
    let p_value = binomial_test(correct, total, 0.5);

    ABXResults {
        correct,
        incorrect,
        total_trials: total,
        accuracy,
        p_value,
    }
}
Binomial Test (P-Value)
Null Hypothesis: User is guessing randomly (50% chance)
P-Value: Probability of seeing this result (or more extreme) by chance
fn binomial_test(successes: usize, trials: usize, p_null: f64) -> f64 {
    use statrs::distribution::{Binomial, Discrete};

    let dist = Binomial::new(p_null, trials as u64).unwrap();
    let observed = successes as u64;
    let p_observed = dist.pmf(observed);

    // Two-tailed test: sum the probabilities of every outcome that is no
    // more likely than the observed one (the observed outcome included).
    // The small relative tolerance keeps mirror-image outcomes (k and
    // n - k) from being dropped because of floating-point rounding.
    let threshold = p_observed * (1.0 + 1e-7);
    let p_value: f64 = (0..=trials as u64)
        .map(|k| dist.pmf(k))
        .filter(|&p_k| p_k <= threshold)
        .sum();

    p_value.min(1.0)
}
Interpretation:
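- p < 0.05: Significant. A result at least this extreme would occur less than 5% of the time under random guessing.
- p < 0.01: Highly significant. A result at least this extreme would occur less than 1% of the time under random guessing.
- p >= 0.05: Not significant. The result is consistent with random guessing.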
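As a small illustration, the three thresholds above can be turned into a label and a one-line report string. The function names `interpret_p_value` and `describe_result` are assumptions made for this sketch, not part of the session code.

```rust
/// Maps a p-value onto the three verdicts from the interpretation list.
fn interpret_p_value(p: f64) -> &'static str {
    if p < 0.01 {
        "highly significant"
    } else if p < 0.05 {
        "significant"
    } else {
        "not significant"
    }
}

/// Formats a percentage together with its statistical context, so the
/// raw number is never reported on its own.
fn describe_result(label: &str, percentage: f64, p: f64) -> String {
    format!(
        "{} {:.0}% (p={:.3}, {})",
        label,
        percentage,
        p,
        interpret_p_value(p)
    )
}
```

For example, `describe_result("ABX accuracy", 75.0, 0.041)` yields `ABX accuracy 75% (p=0.041, significant)`.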
Example:
ABX Test: 15/20 correct (75% accuracy)
P-value = 0.041 (two-tailed)
Interpretation: Statistically significant at the 5% level. The listener can likely hear the difference.
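The 15/20 figure can be reproduced without any external crate. The sketch below is a standard-library-only stand-in for `binomial_test`, restricted to the p = 0.5 null hypothesis. The function names are hypothetical, and `correct <= n` is assumed.

```rust
/// Probability of exactly `k` successes in `n` fair (p = 0.5) trials.
fn pmf_fair(n: u32, k: u32) -> f64 {
    // Build C(n, k) multiplicatively in f64 to avoid integer overflow.
    let mut coeff = 1.0_f64;
    for i in 0..k {
        coeff *= (n - i) as f64 / (i + 1) as f64;
    }
    coeff * 0.5_f64.powi(n as i32)
}

/// One-tailed p-value: chance of `correct` or more hits by pure guessing.
fn p_one_tailed(correct: u32, n: u32) -> f64 {
    (correct..=n).map(|k| pmf_fair(n, k)).sum()
}

/// Two-tailed p-value. For p = 0.5 the distribution is symmetric, so the
/// tail on the far side of n/2 is doubled and capped at 1.0.
fn p_two_tailed(correct: u32, n: u32) -> f64 {
    let extreme = correct.max(n - correct);
    (2.0 * p_one_tailed(extreme, n)).min(1.0)
}
```

For 15 of 20 correct, `p_two_tailed(15, 20)` is about 0.041 and `p_one_tailed(15, 20)` is about 0.021. The one-tailed value answers the narrower ABX question "did the listener do better than guessing?", so it is worth stating which variant a reported p-value uses.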
Sample Size Requirements
How many trials needed for reliable results?
Rule of Thumb:
- Small effect: 50+ trials
- Medium effect: 20-30 trials
- Large effect: 10-15 trials
Formula (ABX test, 80% power):
n = (Z_α/2 + Z_β)² * p(1-p) / (p - 0.5)²

Where:
- Z_α/2 = 1.96 (for α = 0.05, two-tailed)
- Z_β = 0.84 (for 80% power)
- p = expected accuracy
Example:
Expected accuracy: 70%
n = (1.96 + 0.84)² * 0.7 * 0.3 / (0.7 - 0.5)²
n = 7.84 * 0.21 / 0.04 ≈ 41.2 → round up to 42 trials
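A minimal sketch of the formula above as a function, useful for checking the worked example or trying other accuracies. The name `required_trials` is an assumption. The function returns the unrounded value, so callers should round up.

```rust
/// Approximate number of ABX trials for a two-tailed test at alpha = 0.05
/// with 80% power, using the normal-approximation formula from the text.
/// Returns infinity when `expected_accuracy` is exactly 0.5, because no
/// number of trials can detect a listener who is purely guessing.
fn required_trials(expected_accuracy: f64) -> f64 {
    const Z_ALPHA_HALF: f64 = 1.96; // alpha = 0.05, two-tailed
    const Z_BETA: f64 = 0.84; // 80% power
    let p = expected_accuracy;
    (Z_ALPHA_HALF + Z_BETA).powi(2) * p * (1.0 - p) / (p - 0.5).powi(2)
}
```

`required_trials(0.7)` evaluates to about 41.16, and `required_trials(0.75)` to about 23.5. The count grows quickly as the expected accuracy approaches 0.5.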
Recommended Trial Counts
pub fn recommended_trial_count(expected_accuracy: f64) -> usize {
    if expected_accuracy <= 0.55 {
        100 // Very subtle difference
    } else if expected_accuracy <= 0.65 {
        50 // Small difference
    } else if expected_accuracy <= 0.75 {
        25 // Medium difference
    } else {
        15 // Large difference
    }
}
Results Export
CSV Format
pub fn export_to_csv(session: &ABSession) -> String {
    let mut csv = String::from("Trial,Option1,Option2,Selected,Timestamp\n");

    for (i, answer) in session.answers.iter().enumerate() {
        // `matches!` avoids requiring PartialEq on ABTestMode, and both
        // branches yield (&str, &str) so the tuple types agree.
        let (opt1, opt2): (&str, &str) = if matches!(session.mode, ABTestMode::BlindAB) {
            if session.hidden_mapping[i] {
                (session.preset_a_name.as_str(), session.preset_b_name.as_str())
            } else {
                (session.preset_b_name.as_str(), session.preset_a_name.as_str())
            }
        } else {
            ("A", "B")
        };

        csv.push_str(&format!(
            "{},{},{},{},{}\n",
            i + 1,
            opt1,
            opt2,
            answer.selected_option,
            answer.timestamp
        ));
    }

    csv
}
Output:
Trial,Option1,Option2,Selected,Timestamp
1,Flat,Boosted,1,1234
2,Boosted,Flat,2,2456
3,Flat,Boosted,1,3789
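One caveat with this format is that a preset name containing a comma or a double quote would shift the columns. The sketch below shows a quoting helper following the usual CSV convention (RFC 4180). `csv_escape` is a hypothetical name and is not part of `export_to_csv` as written.

```rust
/// Quotes a field if it contains a comma, double quote, or line break,
/// doubling any embedded double quotes. Other fields pass through as-is.
fn csv_escape(field: &str) -> String {
    let needs_quoting = field.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r'));
    if needs_quoting {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}
```

Wrapping `opt1` and `opt2` in such a helper before formatting each row would keep names like `Bass, boosted` in a single column.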
JSON Format
pub fn export_to_json(
    session: &ABSession,
    results: &PreferenceResults,
) -> String {
    let export = serde_json::json!({
        "mode": session.mode,
        "presets": {
            "a": session.preset_a_name,
            "b": session.preset_b_name,
        },
        "trim_db": session.trim_db,
        "trials": session.total_trials,
        "results": {
            "a_selected": results.a_selected,
            "b_selected": results.b_selected,
            "a_percentage": results.a_percentage,
            "b_percentage": results.b_percentage,
            "p_value": results.p_value,
            "significant": results.p_value < 0.05,
        },
        "answers": session.answers,
    });

    serde_json::to_string_pretty(&export).unwrap()
}
Experimental Design Best Practices
1. Counterbalancing
Ensure equal distribution of A and B across trials:
pub fn validate_counterbalancing(hidden_mapping: &[bool]) -> f64 {
    let total = hidden_mapping.len();
    if total == 0 {
        return 0.0; // Nothing to validate; avoids 0/0 = NaN
    }

    let a_count = hidden_mapping.iter().filter(|&&x| x).count();
    let ratio = a_count as f64 / total as f64;

    // Deviation from the ideal 0.5 split; should be close to 0
    (ratio - 0.5).abs()
}
Warning threshold:
if validate_counterbalancing(&session.hidden_mapping) > 0.15 {
    println!("Warning: Unbalanced randomization (>15% deviation from 50/50)");
}
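Independent coin flips can exceed this threshold purely by chance, especially in short sessions. One alternative is to fix the A/B split at 50/50 and shuffle the order. The sketch below uses a Fisher-Yates shuffle with a tiny xorshift generator so that it has no dependencies. Both `XorShift64` and `balanced_mapping` are hypothetical names, and in the actual project the `rand` crate would be the natural choice for the shuffle.

```rust
/// Tiny xorshift64 generator, used here only to keep the sketch
/// dependency-free. It is an assumption of this example, not project code.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must never be zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Builds a mapping in which exactly floor(n / 2) trials are `true`
/// (Option 1 = A), then shuffles the order with Fisher-Yates.
fn balanced_mapping(num_trials: usize, seed: u64) -> Vec<bool> {
    let mut rng = XorShift64::new(seed);
    let mut mapping: Vec<bool> = (0..num_trials).map(|i| i < num_trials / 2).collect();

    for i in (1..mapping.len()).rev() {
        // Pick j in 0..=i; the modulo bias is negligible at these sizes.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        mapping.swap(i, j);
    }
    mapping
}
```

The trade-off is that a fixed split makes trials slightly dependent on one another: once half the trials have used one mapping, the remaining ones are determined. Whether that matters depends on how much feedback the listener gets during the session.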
2. Trial Independence
Each trial should be independent:
- ✅ Randomize per trial
- ❌ Use patterns (ABABAB...)
- ❌ Fixed order
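The two failure modes in this list can be caught with a simple sanity check on a stored mapping. This is a minimal sketch: `detect_obvious_pattern` is a hypothetical helper, and it only flags the two most blatant cases rather than testing randomness in general.

```rust
/// Flags two obviously non-random mappings: a constant one (fixed order)
/// and a strictly alternating one (ABABAB...). Returns `None` otherwise,
/// or when the mapping is too short to judge.
fn detect_obvious_pattern(mapping: &[bool]) -> Option<&'static str> {
    if mapping.len() < 4 {
        return None;
    }
    if mapping.windows(2).all(|w| w[0] == w[1]) {
        return Some("constant");
    }
    if mapping.windows(2).all(|w| w[0] != w[1]) {
        return Some("alternating");
    }
    None
}
```

A genuinely random sequence can produce either pattern by chance, and the shorter the session the more likely that is, so a hit is best treated as a warning to inspect the generator rather than as proof of a bug.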
3. Rest Breaks
Prevent listener fatigue:
if (currentTrial % 10 === 0 && currentTrial !== totalTrials) {
  showRestBreakDialog();
}
4. Reference Switching
Allow listeners to switch between options multiple times before answering:
let switchCount = 0;

function handleSwitch() {
  switchCount++;
  applyOpposite();
}

// Log switch count as quality metric
Common Pitfalls
❌ Volume Mismatch
// WRONG: Apply presets without level matching
applyPresetA();
applyPresetB();

// CORRECT: Apply with trim
applyPreset(presetA, 0);
applyPreset(presetB, trimDb);
❌ Non-Random Patterns
// WRONG: Alternating pattern
let hidden_mapping = vec![true, false, true, false, ...];

// CORRECT: True randomization
let hidden_mapping: Vec<bool> = (0..trials)
    .map(|_| rng.gen_bool(0.5))
    .collect();
❌ Ignoring P-Value
// WRONG: Report raw percentages without significance
"Preset A preferred 55% of the time"

// CORRECT: Include statistical context
"Preset A preferred 55% (p=0.42, not significant)"
❌ Too Few Trials
// WRONG: Only 5 trials
const trials = 5; // Unreliable!

// CORRECT: Adequate sample size
const trials = 20; // Minimum for medium effects
Validation Tests
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_randomization_distribution() {
        let session = create_blind_ab_session("A".into(), "B".into(), 1000, 0.0);
        let a_count = session.hidden_mapping.iter().filter(|&&x| x).count();
        let ratio = a_count as f64 / 1000.0;

        // With 1000 trials, should be very close to 0.5
        assert!((ratio - 0.5).abs() < 0.05, "Randomization biased: {}", ratio);
    }

    #[test]
    fn test_trial_independence() {
        let session = create_blind_ab_session("A".into(), "B".into(), 100, 0.0);

        // Count runs (maximal stretches of consecutive equal values)
        let mut runs = 1;
        for i in 1..session.hidden_mapping.len() {
            if session.hidden_mapping[i] != session.hidden_mapping[i - 1] {
                runs += 1;
            }
        }

        // Expected runs ≈ n/2 for random data
        let expected_runs = 50.0;
        let deviation = (runs as f64 - expected_runs).abs() / expected_runs;
        assert!(deviation < 0.3, "Trials may not be independent");
    }

    #[test]
    fn test_binomial_test() {
        // 20/20 correct should be highly significant
        let p = binomial_test(20, 20, 0.5);
        assert!(p < 0.001);

        // 10/20 correct should not be significant (random guessing)
        let p = binomial_test(10, 20, 0.5);
        assert!(p > 0.05);
    }
}
Reference Materials
- references/statistical_tests.md: Detailed statistical methods
- references/experimental_design.md: Best practices for audio testing
- references/sample_size_calculator.md: Power analysis formulas