Claude-code-skills-social-science rct-power-calculations
Calculate statistical power and sample sizes for RCTs. Use when user mentions: power analysis, sample size calculation, minimum detectable effect, MDE, statistical power, effect size, clustering effects, design effect, ICC, intra-cluster correlation.
install
source · Clone the upstream repo
git clone https://github.com/sshtomar/claude-code-skills-social-science
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/sshtomar/claude-code-skills-social-science "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/rct-power-calculations" ~/.claude/skills/sshtomar-claude-code-skills-social-science-rct-power-calculations && rm -rf "$T"
manifest:
skills/rct-power-calculations/SKILL.md
<skill_content>
<overview> Power analysis determines the sample size required to detect treatment effects of a given magnitude with specified probability. Adequate power prevents Type II errors (failing to detect real effects). Power calculations must account for clustering (design effects), non-compliance (reduces effective sample), attrition (reduces final sample), and multiple testing (inflates α). Underpowered studies waste resources and risk being misinterpreted as evidence of no effect. Power = 80% is conventional; higher power (90%) is preferable when the study is expensive or a one-shot opportunity. </overview>
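As a minimal, self-contained sketch of the underlying relation (not one of the skill's own templates): with α = 0.05, 80% power, and 50/50 allocation assumed, the same formula can be inverted to ask what MDE a fixed budget buys. The helper name `mde_given_n` is illustrative, not defined in the repo.

```python
# Hedged sketch: smallest detectable standardized effect for a fixed total N,
# two-arm comparison, before any clustering/attrition/compliance adjustments.
import numpy as np
from scipy.stats import norm

def mde_given_n(n_total, alpha=0.05, power=0.80, p=0.50):
    """Minimum detectable effect (in SD units) for a given total sample size."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return (z_alpha + z_power) * np.sqrt(1 / (p * (1 - p) * n_total))

print(f"MDE at N=800: {mde_given_n(800):.2f} SD")  # ≈ 0.20 SD
```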
<mandatory_requirements>
<requirement priority="critical"> <name>Conduct Power Analysis Before Data Collection</name> <description>Calculate required sample size BEFORE starting enrollment or randomization</description> <rationale>Post-hoc power calculations are meaningless. Power must guide design, not rationalize results (Hoenig & Heisey 2001)</rationale> <consequence>Underpowered studies waste resources, fail to detect real effects, mislead policy with false nulls</consequence> </requirement> <requirement priority="critical"> <name>Account for Clustering Design Effects</name> <description>When using cluster randomization, multiply base sample size by design effect: DE = 1 + (m-1) × ICC</description> <rationale>Clustering reduces effective sample size due to within-cluster correlation. Ignoring this causes severe underpowering (Donner & Klar 2000)</rationale> <consequence>Studies underpowered by 2-4x, inability to detect real effects despite large nominal sample</consequence> </requirement> <requirement priority="critical"> <name>Inflate for Attrition and Non-Compliance</name> <description>Adjust sample size for expected attrition and imperfect compliance before finalizing design</description> <rationale>Attrition reduces final sample. Non-compliance attenuates ITT effects. Both reduce power substantially</rationale> <consequence>Final analysis underpowered even if initial enrollment meets naive power target</consequence> </requirement> <requirement priority="high"> <name>Use Realistic Effect Sizes</name> <description>Base MDE on pilot data, similar interventions, or smallest policy-relevant effect, not on wishful thinking</description> <rationale>Overoptimistic effect size assumptions guarantee an underpowered study. True effects in social science typically 0.1-0.3 SD (Vivalt 2020)</rationale> <consequence>Study designed to detect implausibly large effects, fails to detect realistic effects, wasted resources</consequence> </requirement> <requirement priority="high"> <name>Adjust for Multiple Comparisons</name> <description>When testing multiple outcomes, increase sample size or reduce α to maintain family-wise error rate</description> <rationale>Multiple testing inflates Type I error. With 5 independent tests at α=0.05, the family-wise false positive rate is roughly 23%, not 5%</rationale> <consequence>Study powered for single test but not for actual analysis plan, false discoveries</consequence> </requirement></mandatory_requirements>
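The multiple-comparisons requirement above can be made concrete with a short, hedged sketch. Bonferroni (α/k) is one simple, conservative choice; the skill itself does not prescribe a particular correction, and the outcome count `k = 5` and MDE of 0.25 SD are assumptions for illustration.

```python
# Sketch: family-wise error rate without correction, and how a Bonferroni-
# adjusted α propagates into the required sample size (α=0.05, 80% power,
# 50/50 allocation assumed).
import numpy as np
from scipy.stats import norm

def n_total(mde, alpha=0.05, power=0.80, p=0.50):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(np.ceil(z ** 2 / (p * (1 - p) * mde ** 2)))

k = 5  # number of primary outcomes tested
print(f"Unadjusted FWER with {k} independent tests: {1 - 0.95 ** k:.2f}")  # ≈ 0.23
print(f"N at α=0.05:            {n_total(0.25)}")
print(f"N at Bonferroni α=0.05/{k}: {n_total(0.25, alpha=0.05 / k)}")  # ≈ 1.5x the single-test N
```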
<assumptions> <assumption name="Effect Size Estimate"> <description>Assumed MDE or effect size is realistic based on prior evidence or theory</description> <how_to_check>Review similar interventions, conduct pilot, consult domain experts, use smallest policy-relevant effect</how_to_check> <if_violated>Study either grossly overpowered (wasteful) or underpowered (fails to detect real effects)</if_violated> </assumption> <assumption name="ICC Estimate for Cluster Designs"> <description>Assumed intra-cluster correlation reflects true within-cluster similarity</description> <how_to_check>Use baseline data from same clusters, similar studies, or conservative upper bound (ICC=0.10-0.20)</how_to_check> <if_violated>Design effect wrong → sample size wrong → power incorrect</if_violated> </assumption> <assumption name="Attrition and Compliance Rates"> <description>Expected attrition and take-up rates match what actually occurs</description> <how_to_check>Base on pilot data, similar studies in same context, plan retention strategies</how_to_check> <if_violated>Actual power lower than designed power, inability to detect effects</if_violated> </assumption> </assumptions><thinking_process> When conducting power analysis:
- Specify primary outcome and effect size (MDE or expected effect)
- Choose significance level (α=0.05 typical) and desired power (0.80-0.90)
- Calculate base sample size for simple design
- Apply design effect if cluster randomization (multiply by 1+(m-1)×ICC; an ICC estimation sketch follows this checklist)
- Inflate for expected non-compliance (divide by compliance²)
- Inflate for expected attrition (divide by (1-attrition_rate))
- Adjust for multiple testing if applicable
- Check feasibility; if infeasible, reduce scope or accept larger MDE </thinking_process>
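The ICC assumption above recommends grounding the ICC in baseline or pilot data. Below is a minimal one-way ANOVA-style estimator, assuming roughly equal cluster sizes and a continuous outcome; mixed-model or variance-components estimators are common alternatives. The column names `cluster_id` and `outcome` are illustrative, not defined anywhere in the skill.

```python
# Sketch: ANOVA estimator of the intra-cluster correlation from pilot data.
# ICC = (MSB - MSW) / (MSB + (m - 1) * MSW), with m the (average) cluster size.
import numpy as np
import pandas as pd

def estimate_icc(df, cluster_col="cluster_id", outcome_col="outcome"):
    groups = [g[outcome_col].to_numpy() for _, g in df.groupby(cluster_col)]
    k = len(groups)                              # number of clusters
    m = np.mean([len(g) for g in groups])        # average cluster size
    grand_mean = df[outcome_col].mean()
    msb = m * sum((g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (len(df) - k)
    return (msb - msw) / (msb + (m - 1) * msw)

# Usage (hypothetical pilot data): icc_hat = max(estimate_icc(pilot_df), 0)
# then design_effect = 1 + (cluster_size - 1) * icc_hat
```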
<implementation_pattern>
<code_template>
@app.cell
def comprehensive_power_calculation():
    # Full power calculation with all real-world adjustments
    # Demonstrates compounding inflation factors for RCT sample size
    import numpy as np
    from scipy.stats import norm

    # Parameters
    alpha = 0.05
    power = 0.80
    mde = 0.25   # Effect size in SD units (realistic for social programs)
    p = 0.50     # Treatment allocation

    # Base calculation
    z_alpha = norm.ppf(1 - alpha/2)
    z_power = norm.ppf(power)
    n_base = ((z_alpha + z_power)**2) / (p * (1-p) * mde**2)

    print("COMPREHENSIVE POWER ANALYSIS")
    print("=" * 60)
    print(f"Parameters: α={alpha}, power={power}, MDE={mde} SD")
    print(f"\n1. Base sample size: {int(np.ceil(n_base))}")

    # Adjustment 1: Clustering
    icc = 0.05
    cluster_size = 30
    design_effect = 1 + (cluster_size - 1) * icc
    n_after_clustering = n_base * design_effect
    print("\n2. Clustering adjustment:")
    print(f"   ICC={icc}, avg cluster size={cluster_size}")
    print(f"   Design effect: {design_effect:.2f}")
    print(f"   After clustering: {int(np.ceil(n_after_clustering))}")

    # Adjustment 2: Non-compliance
    compliance_rate = 0.70
    n_after_compliance = n_after_clustering / (compliance_rate**2)
    print("\n3. Non-compliance adjustment:")
    print(f"   Expected compliance: {compliance_rate:.0%}")
    print(f"   Inflation factor: {1/compliance_rate**2:.2f}")
    print(f"   After compliance: {int(np.ceil(n_after_compliance))}")

    # Adjustment 3: Attrition
    attrition_rate = 0.20
    n_final = n_after_compliance / (1 - attrition_rate)
    print("\n4. Attrition adjustment:")
    print(f"   Expected attrition: {attrition_rate:.0%}")
    print(f"   Inflation factor: {1/(1-attrition_rate):.2f}")
    print(f"   FINAL REQUIRED N: {int(np.ceil(n_final))}")

    print("\n" + "=" * 60)
    print(f"Total inflation: {n_final/n_base:.2f}x base sample size")
    print(f"Baseline N needed: {int(np.ceil(n_final))}")
    print(f"Endline N expected: {int(np.ceil(n_final * (1-attrition_rate)))}")
    return (int(np.ceil(n_final)),)
</code_template>
</implementation_pattern>
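One efficiency lever listed later in the interpretation guide ("ANCOVA with baseline") can be sketched with the standard approximation that adjusting for a baseline measure of the outcome scales the required sample size by roughly (1 − ρ²), where ρ is the baseline–endline correlation. The correlations below are assumptions for illustration, not values taken from the skill.

```python
# Sketch: approximate sample-size savings from ANCOVA / baseline adjustment.
# Required N scales by roughly (1 - rho**2) under the usual approximation.
import numpy as np
from scipy.stats import norm

def n_total(mde, alpha=0.05, power=0.80, p=0.50):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z ** 2 / (p * (1 - p) * mde ** 2)

n_post_only = n_total(0.25)   # same MDE as the template above
for rho in (0.3, 0.5, 0.7):   # assumed baseline–endline correlations
    n_ancova = n_post_only * (1 - rho ** 2)
    print(f"rho={rho}: N {int(np.ceil(n_post_only))} -> {int(np.ceil(n_ancova))}")
```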
<examples> <example context="cluster_rct_power" difficulty="intermediate"> <description>Calculate power for cluster-randomized trial with design effect</description> <code>
```python
@app.cell
def cluster_rct_power():
    # Cluster RCT power accounting for design effect
    # Shows how clustering can reduce power by 2-4x
    import numpy as np
    from scipy.stats import norm

    alpha = 0.05
    power = 0.80
    mde = 0.30

    # Individual randomization (baseline)
    z_a = norm.ppf(1 - alpha/2)
    z_p = norm.ppf(power)
    n_individual = ((z_a + z_p)**2) / (0.25 * mde**2)

    print("CLUSTER vs. INDIVIDUAL RANDOMIZATION")
    print("=" * 60)
    print(f"Individual randomization: {int(np.ceil(n_individual))} participants")

    # Cluster randomization scenarios
    scenarios = [
        {"icc": 0.05, "cluster_size": 20},
        {"icc": 0.05, "cluster_size": 50},
        {"icc": 0.10, "cluster_size": 30},
        {"icc": 0.20, "cluster_size": 30},
    ]

    for scenario in scenarios:
        icc = scenario["icc"]
        m = scenario["cluster_size"]
        de = 1 + (m - 1) * icc
        n_cluster = n_individual * de
        n_clusters_needed = int(np.ceil(n_cluster / m))
        print(f"\nICC={icc}, cluster size={m}:")
        print(f"  Design effect: {de:.2f}")
        print(f"  Total N needed: {int(np.ceil(n_cluster))}")
        print(f"  Clusters needed: {n_clusters_needed}")
        print(f"  Inflation: {de:.2f}x individual design")

    print("\nKey lesson: Higher ICC or larger clusters → much larger N needed")
    return ()
```
</code> <lesson> Design effect = 1 + (m-1) × ICC. With ICC=0.10 and m=30, DE=3.9, meaning you need 4x the individual-randomized sample size. Cluster randomization trades power for validity when spillovers are a concern. </lesson> </example> </examples> <common_mistakes> <mistake severity="critical"> <what>Conducting power analysis after data collection (post-hoc power)</what> <consequence>Meaningless exercise, doesn't inform anything, misinterpreted as evidence quality</consequence> <prevention>Power analysis is for design phase only. Never calculate "observed power" after study</prevention> </mistake> <mistake severity="critical"> <what>Ignoring clustering design effect in sample size calculation</what> <consequence>Study underpowered by 2-4x, fails to detect real effects despite large nominal sample</consequence> <prevention>ALWAYS multiply by design effect DE = 1+(m-1)×ICC for cluster randomization</prevention> </mistake> <mistake severity="high"> <what>Using overoptimistic effect size assumptions</what> <consequence>Study designed to detect implausibly large effects, fails for realistic effects</consequence> <prevention>Use conservative estimates from pilots, literature, or smallest policy-relevant effect</prevention> </mistake> <mistake severity="high"> <what>Not inflating for attrition and non-compliance</what> <consequence>Final analysis underpowered even if enrollment meets naive target</consequence> <prevention>Always inflate for expected attrition (÷ (1-attrition_rate)) and compliance (÷ compliance²)</prevention> </mistake> <mistake severity="medium"> <what>Treating power as binary threshold (80% good, 79% bad)</what> <consequence>Arbitrary decisions, missing that power is continuous and context-dependent</consequence> <prevention>Power is a continuum. Consider costs, effect importance, and Type I vs. II error tradeoffs</prevention> </mistake> </common_mistakes> <interpretation_guide> <sample_size_rules_of_thumb> For 80% power, α=0.05, 50/50 allocation: - MDE = 0.10 SD: ~3,140 total - MDE = 0.20 SD: ~786 total - MDE = 0.30 SD: ~350 total - MDE = 0.40 SD: ~198 total - MDE = 0.50 SD: ~128 total Then multiply by: - Design effect if clustering: 1+(m-1)×ICC - 1/compliance² for non-compliance - 1/(1-attrition_rate) for attrition </sample_size_rules_of_thumb> <when_study_is_underpowered> If power calculation shows infeasible N: - Increase MDE (accept detecting only larger effects) - Reduce scope (fewer outcomes, simpler design) - Improve efficiency (stratification, ANCOVA with baseline) - Pool with other studies (consortium approach) - Use alternative design (regression discontinuity, DID) - Accept higher Type II error risk (document explicitly) Do NOT: proceed with underpowered study and hope for best. </when_study_is_underpowered> </interpretation_guide> <references> <paper>Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Erlbaum.</paper> <paper>Duflo, E., Glennerster, R., & Kremer, M. (2007). Using randomization in development economics research. Handbook of Development Economics, 4, 3895-3962.</paper> <paper>Donner, A., & Klar, N. (2000). Design and Analysis of Cluster Randomization Trials in Health Research. Arnold Publishers.</paper> <paper>Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. American Statistician, 55(1), 19-24.</paper> <paper>Vivalt, E. (2020). How much can we generalize from impact evaluations? 
Journal of the European Economic Association, 18(6), 3045-3089.</paper> </references> </skill_content>