Claude-code-skills-social-science difference-in-differences
Causal inference with treatment/control groups over time. Use when user mentions: treatment effects, policy evaluation, pre/post comparison, parallel trends, DID, DiD, difference in differences, natural experiment, quasi-experimental.
install
source · Clone the upstream repo
git clone https://github.com/sshtomar/claude-code-skills-social-science
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/sshtomar/claude-code-skills-social-science "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/difference-in-differences" ~/.claude/skills/sshtomar-claude-code-skills-social-science-difference-in-differences && rm -rf "$T"
manifest:
skills/difference-in-differences/SKILL.md · source content
<skill_content>
<overview> Difference-in-Differences (DID) is a causal inference method that estimates treatment effects by comparing changes over time between treated and control groups. It leverages both cross-sectional and temporal variation to identify causal effects when randomization is infeasible. DID is the workhorse of policy evaluation in economics and the social sciences. </overview>
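To make the double difference concrete, here is a toy numeric sketch (the numbers are hypothetical): the DID estimate is the treated group's pre/post change minus the control group's pre/post change.

```python
# Toy 2x2 illustration of the double difference (hypothetical numbers)
treated_pre, treated_post = 10.0, 14.0  # treated-group outcome means
control_pre, control_post = 9.0, 11.0   # control-group outcome means

did = (treated_post - treated_pre) - (control_post - control_pre)
print(did)  # (14 - 10) - (11 - 9) = 2.0, the estimated treatment effect
```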
<mandatory_requirements>
<requirement priority="critical"> <name>Parallel Trends Testing</name> <description>MUST test the parallel trends assumption using pre-treatment data</description> <rationale>Parallel trends is THE identifying assumption. Without it, DID estimates are biased. Violations have invalidated numerous published studies (Roth, 2022)</rationale> <consequence>Biased treatment effect estimates that could be entirely spurious</consequence> </requirement> <requirement priority="critical"> <name>Cluster-Robust Standard Errors</name> <description>MUST cluster standard errors at the level of treatment assignment</description> <rationale>Serial correlation within units severely understates standard errors. Bertrand, Duflo & Mullainathan (2004) show a false-positive rate of roughly 45% with unclustered SEs</rationale> <consequence>Type I error rates can reach 45% instead of the nominal 5%</consequence> </requirement> <requirement priority="critical"> <name>Event Study Specification</name> <description>MUST run an event study with leads and lags when multiple pre-periods exist</description> <rationale>Event studies reveal dynamics, test pre-trends, and detect anticipation effects simultaneously (Sun & Abraham, 2021)</rationale> <consequence>Missing pre-trends violations, anticipation effects, or dynamic treatment effects</consequence> </requirement> <requirement priority="high"> <name>Clear Treatment Timing</name> <description>Document exactly when treatment starts and who is treated</description> <rationale>Ambiguous treatment timing or definitions make results unreplicable and potentially wrong</rationale> <consequence>Wrong attribution of effects to treatment</consequence> </requirement> </mandatory_requirements>
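As a quick illustration of the clustering requirement, here is a minimal simulation sketch (all data are simulated; the AR(1) coefficient and other parameters are illustrative, not part of the skill): serially correlated errors within units make naive standard errors overstate precision relative to cluster-robust ones.

```python
# Sketch: serial correlation within units shrinks naive SEs vs. clustered SEs
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
units, periods = 50, 10
df = pd.DataFrame({
    "unit_id": np.repeat(np.arange(units), periods),
    "time": np.tile(np.arange(periods), units),
})
df["treat"] = (df["unit_id"] < units // 2).astype(int)
df["post"] = (df["time"] >= periods // 2).astype(int)
df["did"] = df["treat"] * df["post"]

# AR(1) errors within each unit induce serial correlation
eps = np.empty(len(df))
for u in range(units):
    e = rng.normal(0, 1, periods)
    for t in range(1, periods):
        e[t] += 0.8 * e[t - 1]
    eps[u * periods:(u + 1) * periods] = e
df["outcome"] = 0.5 * df["did"] + eps

model = smf.ols("outcome ~ did + C(unit_id) + C(time)", data=df)
naive = model.fit()
clustered = model.fit(cov_type="cluster", cov_kwds={"groups": df["unit_id"]})
print(f"Naive SE: {naive.bse['did']:.4f} | Clustered SE: {clustered.bse['did']:.4f}")
```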
<assumptions> <assumption name="Parallel Trends"> <description>Treatment and control groups would follow parallel trajectories absent treatment</description> <how_to_check>Visual inspection of pre-trends, event study coefficients on leads, formal pre-trends test</how_to_check> <if_violated>Consider synthetic control, changes-in-changes, or interactive fixed effects</if_violated> </assumption> <assumption name="No Anticipation"> <description>Units don't change behavior before treatment actually occurs</description> <how_to_check>Check event study leads immediately before treatment</how_to_check> <if_violated>Redefine treatment timing to when announcement occurred</if_violated> </assumption> <assumption name="SUTVA"> <description>No spillovers between treated and control units</description> <how_to_check>Test for spatial correlation, check economic linkages</how_to_check> <if_violated>Exclude contaminated controls or use spatial methods</if_violated> </assumption> <assumption name="Common Support"> <description>Treatment and control groups are comparable</description> <how_to_check>Compare covariate distributions and pre-treatment outcomes</how_to_check> <if_violated>Consider matching or weighting before DID</if_violated> </assumption> </assumptions><thinking_process> When implementing DID:
- Verify panel structure and treatment timing
- Check pre-treatment balance between groups
- Test parallel trends assumption (CRITICAL)
- Implement main DID specification
- Run event study for dynamics
- Conduct robustness checks
- Interpret magnitude and significance carefully </thinking_process>
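The templates below are written as marimo notebook cells (the `@app.cell` decorator). If you adapt them outside this skill, a minimal sketch of the surrounding scaffolding might look like the following; `panel.csv` is a hypothetical placeholder for your own panel dataset:

```python
# Minimal marimo scaffolding for the cells below (sketch; assumes marimo is installed)
import marimo

app = marimo.App()

@app.cell
def load_data():
    import pandas as pd
    # Hypothetical input; the panel needs outcome, treat, post,
    # unit_id, time, and cluster_id columns
    df = pd.read_csv("panel.csv")
    return df,

if __name__ == "__main__":
    app.run()
```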
<implementation_pattern>
<code_template>
@app.cell
def did_with_diagnostics(df):
    # Implementing DID with all required diagnostics to ensure valid causal
    # inference. Following Angrist & Pischke (2009) and recent best practices
    # from Roth et al. (2023) on credible DID designs.
    import pandas as pd
    import numpy as np
    import statsmodels.formula.api as smf
    import matplotlib.pyplot as plt
    import os

    # MANDATORY: Validate panel structure
    required = {"outcome", "treat", "post", "unit_id", "time", "cluster_id"}
    assert required.issubset(df.columns), f"Missing columns: {required - set(df.columns)}"

    # Verify treatment is binary
    assert df["treat"].isin([0, 1]).all(), "Treatment must be binary (0/1)"

    # Check panel balance
    panel_check = df.groupby("unit_id")["time"].nunique()
    is_balanced = (panel_check == panel_check.iloc[0]).all()
    print(f"Panel structure: {'Balanced' if is_balanced else 'Unbalanced'}")

    # MANDATORY: Test parallel trends visually
    pre_data = df[df["post"] == 0].copy()
    trends = pre_data.groupby(["time", "treat"])["outcome"].mean().unstack()

    fig, axes = plt.subplots(1, 2, figsize=(15, 6))

    # Pre-trends plot
    axes[0].plot(trends.index, trends[0], marker='o', label='Control', linewidth=2)
    axes[0].plot(trends.index, trends[1], marker='s', label='Treatment', linewidth=2)
    axes[0].set_xlabel("Time")
    axes[0].set_ylabel("Outcome (mean)")
    axes[0].set_title("Pre-Treatment Trends (Parallel Trends Check)")
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # MANDATORY: Statistical test for parallel trends
    if len(pre_data["time"].unique()) >= 3:
        # Test for differential trends in the pre-period
        pre_data["time_trend"] = pre_data.groupby("unit_id")["time"].transform(
            lambda x: x - x.min()
        )
        pre_data["treat_trend"] = pre_data["treat"] * pre_data["time_trend"]
        trend_test = smf.ols(
            "outcome ~ treat + time_trend + treat_trend + C(unit_id) + C(time)",
            data=pre_data
        ).fit(cov_type="cluster", cov_kwds={"groups": pre_data["cluster_id"]})
        trend_coef = trend_test.params.get("treat_trend", 0)
        trend_pval = trend_test.pvalues.get("treat_trend", 1)
        print("\n" + "=" * 60)
        print("PARALLEL TRENDS TEST")
        print("=" * 60)
        print(f"Differential trend coefficient: {trend_coef:.4f}")
        print(f"P-value: {trend_pval:.4f}")
        if trend_pval < 0.05:
            print("WARNING: Parallel trends assumption may be violated!")
        else:
            print("SUCCESS: No significant differential pre-trends detected")

    # MANDATORY: Main DID estimation with cluster-robust SE
    df_did = df.copy()
    df_did["did"] = df_did["treat"] * df_did["post"]

    # Two-way fixed effects specification
    did_model = smf.ols(
        "outcome ~ did + C(unit_id) + C(time)",
        data=df_did
    ).fit(
        cov_type="cluster",
        cov_kwds={"groups": df_did["cluster_id"]}
    )

    # Extract key results
    did_coef = did_model.params["did"]
    did_se = did_model.bse["did"]
    did_pval = did_model.pvalues["did"]
    ci_lower = did_coef - 1.96 * did_se
    ci_upper = did_coef + 1.96 * did_se

    print("\n" + "=" * 60)
    print("DIFFERENCE-IN-DIFFERENCES RESULTS")
    print("=" * 60)
    print(f"Treatment Effect: {did_coef:.4f}")
    print(f"Clustered SE: {did_se:.4f}")
    print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
    print(f"P-value: {did_pval:.4f}")
    print(f"\nClustering: {df_did['cluster_id'].nunique()} clusters")
    print(f"N observations: {len(df_did):,}")

    # MANDATORY: Event study for dynamics and a further pre-trends check
    # Create relative time variable if it does not exist
    if "rel_time" not in df.columns:
        treatment_time = df[df["post"] == 1].groupby("unit_id")["time"].min()
        df["treatment_time"] = df["unit_id"].map(treatment_time)
        df["rel_time"] = df["time"] - df["treatment_time"]
        df.loc[df["treat"] == 0, "rel_time"] = -999  # Never treated

    # Event study with a reasonable window
    event_window = range(-4, 5)  # 4 pre, 4 post periods
    df_event = df[(df["rel_time"].isin(event_window)) | (df["rel_time"] == -999)].copy()

    def ev_name(k):
        # Formula-safe dummy names: patsy treats "-" as an operator,
        # so a column named "event_-4" would break the formula
        return f"event_m{abs(k)}" if k < 0 else f"event_p{k}"

    # Create event-time dummies (omit -1 as the reference period)
    for k in event_window:
        if k == -1:
            continue
        df_event[ev_name(k)] = (
            (df_event["rel_time"] == k) & (df_event["treat"] == 1)
        ).astype(int)

    # Build formula
    event_vars = [ev_name(k) for k in event_window if k != -1]
    event_formula = "outcome ~ " + " + ".join(event_vars) + " + C(unit_id) + C(time)"
    event_model = smf.ols(event_formula, data=df_event).fit(
        cov_type="cluster", cov_kwds={"groups": df_event["cluster_id"]}
    )

    # Plot event study
    periods = [k for k in event_window if k != -1]
    coefs = [event_model.params.get(ev_name(k), 0) for k in periods]
    ses = [event_model.bse.get(ev_name(k), 0) for k in periods]

    axes[1].axhline(0, color='gray', linestyle='--', alpha=0.5)
    axes[1].axvline(-0.5, color='red', linestyle='--', alpha=0.5, label='Treatment')
    axes[1].errorbar(periods, coefs, yerr=[1.96 * s for s in ses],
                     marker='o', capsize=5, linewidth=2)
    axes[1].set_xlabel("Relative Time to Treatment")
    axes[1].set_ylabel("Effect Estimate")
    axes[1].set_title("Event Study (Pre-trends & Dynamics)")
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    # MANDATORY: Save diagnostic plots
    plt.tight_layout()
    os.makedirs("./images", exist_ok=True)
    plt.savefig("./images/did_diagnostics.png", dpi=144, bbox_inches="tight")
    print("\nSUCCESS: Diagnostic plots saved to ./images/did_diagnostics.png")

    return did_model, event_model, fig,
</code_template>
</implementation_pattern>
<examples> <example context="clean_2x2" difficulty="basic"> <description>Classic 2×2 DID with one treatment group and two periods</description> <code>
```python
@app.cell
def did_2x2_simple(df):
    # Implementing the canonical 2×2 DID where we have one treatment group,
    # one control group, and two time periods. This is the simplest and most
    # transparent DID design (Card & Krueger, 1994).
    import pandas as pd
    import statsmodels.formula.api as smf

    # Verify 2×2 structure
    assert df["treat"].nunique() == 2, "Need exactly 2 groups"
    assert df["time"].nunique() == 2, "Need exactly 2 time periods"
    assert df["post"].nunique() == 2, "Need pre and post indicators"

    # Calculate means for the 2×2 table
    means = df.groupby(["treat", "post"])["outcome"].mean().unstack()
    print("2×2 DID Table")
    print("=" * 50)
    print(means)

    # Manual DID calculation for transparency
    diff_treat = means.loc[1, 1] - means.loc[1, 0]    # Treatment group change
    diff_control = means.loc[0, 1] - means.loc[0, 0]  # Control group change
    did_manual = diff_treat - diff_control
    print("\nManual DID Calculation:")
    print(f"  Treatment group change: {diff_treat:.4f}")
    print(f"  Control group change: {diff_control:.4f}")
    print(f"  DID estimate: {did_manual:.4f}")

    # Regression approach (should match the manual calculation)
    df["did"] = df["treat"] * df["post"]
    model = smf.ols("outcome ~ treat + post + did", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["cluster_id"]}
    )
    print(f"\nRegression DID estimate: {model.params['did']:.4f}")
    print(f"Clustered SE: {model.bse['did']:.4f}")
    print(f"P-value: {model.pvalues['did']:.4f}")

    # Interpretation
    baseline_mean = means.loc[1, 0]  # Pre-treatment mean for the treated group
    pct_effect = (model.params['did'] / baseline_mean) * 100
    print(f"\nEffect size: {pct_effect:.1f}% relative to baseline")
    return model,
```
</code> <output_interpretation> In a 2×2 DID:
- The coefficient represents the average treatment effect on the treated (ATT)
- Compare the manual and regression estimates to verify the implementation
- Report both absolute and relative (%) effects for context </output_interpretation> </example>
<example context="staggered_adoption" difficulty="intermediate"> <description>Staggered treatment adoption with heterogeneous effects</description> <code>
```python
@app.cell
def did_staggered_robust(df):
    # Standard TWFE DID is biased with staggered adoption and heterogeneous
    # effects (Goodman-Bacon, 2021). We implement both standard TWFE and
    # robust alternatives (Callaway & Sant'Anna, 2021).
    import pandas as pd
    import numpy as np
    import statsmodels.formula.api as smf

    # Check for staggered adoption
    treatment_times = df[df["treat"] == 1].groupby("unit_id")["time"].min()
    n_cohorts = treatment_times.nunique()
    print("Staggered Adoption Structure")
    print("=" * 50)
    print(f"Number of treatment cohorts: {n_cohorts}")
    print("\nTreatment timing distribution:")
    print(treatment_times.value_counts().sort_index())

    # Initialize so the cell returns cleanly even without staggered adoption
    twfe, cohort_effects = None, {}

    if n_cohorts > 1:
        print("\nWARNING: Staggered adoption detected!")
        print("Standard TWFE may be biased. Showing both approaches:")

        # 1. Standard TWFE (potentially biased)
        print("\n1. STANDARD TWFE APPROACH")
        print("-" * 40)
        df["treated_post"] = (df["treat"] == 1) & (df["post"] == 1)
        twfe = smf.ols(
            "outcome ~ treated_post + C(unit_id) + C(time)", data=df
        ).fit(cov_type="cluster", cov_kwds={"groups": df["cluster_id"]})
        print(f"TWFE estimate: {twfe.params['treated_post[T.True]']:.4f}")
        print(f"SE: {twfe.bse['treated_post[T.True]']:.4f}")

        # 2. Cohort-specific effects (diagnostic)
        print("\n2. COHORT-SPECIFIC EFFECTS")
        print("-" * 40)
        for cohort_time in treatment_times.unique():
            cohort_units = treatment_times[treatment_times == cohort_time].index
            # Get cohort data plus never-treated controls
            cohort_data = df[
                (df["unit_id"].isin(cohort_units)) | (df["treat"] == 0)
            ].copy()
            cohort_data["post_cohort"] = cohort_data["time"] >= cohort_time
            cohort_data["did_cohort"] = (
                cohort_data["unit_id"].isin(cohort_units) & cohort_data["post_cohort"]
            )
            cohort_model = smf.ols(
                "outcome ~ did_cohort + C(unit_id) + C(time)", data=cohort_data
            ).fit(cov_type="cluster", cov_kwds={"groups": cohort_data["cluster_id"]})
            effect = cohort_model.params.get("did_cohort[T.True]", np.nan)
            cohort_effects[cohort_time] = effect
            print(f"Cohort {cohort_time}: {effect:.4f}")

        # Check for heterogeneity
        effects_array = np.array(list(cohort_effects.values()))
        heterogeneity = effects_array.std()
        print("\nHeterogeneity check:")
        print(f"  SD of cohort effects: {heterogeneity:.4f}")
        if heterogeneity > abs(effects_array.mean()) * 0.5:
            print("  WARNING: Substantial heterogeneity detected!")
            print("  Consider robust DID methods (Callaway-Sant'Anna or Sun-Abraham)")

        # 3. Simple aggregation (equally weighted)
        print("\n3. SIMPLE AVERAGE OF COHORT EFFECTS")
        print("-" * 40)
        simple_avg = effects_array.mean()
        print(f"Average effect: {simple_avg:.4f}")
        print(f"Difference from TWFE: {simple_avg - twfe.params['treated_post[T.True]']:.4f}")

    return twfe, cohort_effects,
```
</code> <lesson> With staggered adoption:
1. TWFE can be biased due to "bad comparisons"
2. Check cohort-specific effects for heterogeneity
3. Consider modern robust estimators (Callaway-Sant'Anna, Sun-Abraham)
4. Report both standard and robust estimates for transparency </lesson> </example>
<example context="failed_pretrends" difficulty="advanced"> <description>Handling DID when parallel trends fails</description> <code>
```python
@app.cell
def did_with_failed_pretrends(df):
    # When parallel trends fails, standard DID is biased. We demonstrate the
    # failure, then show alternative approaches: controlling for differential
    # trends, matching before DID, or synthetic control methods.
    import pandas as pd
    import numpy as np
    import statsmodels.formula.api as smf
    import matplotlib.pyplot as plt
    import os
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import NearestNeighbors

    # First, demonstrate the pre-trends failure
    pre_data = df[df["post"] == 0].copy()
    pre_data["time_numeric"] = pre_data.groupby("unit_id")["time"].transform(
        lambda x: x - x.min()
    )

    # Test for differential trends
    pre_data["treat_x_time"] = pre_data["treat"] * pre_data["time_numeric"]
    trend_test = smf.ols(
        "outcome ~ treat + time_numeric + treat_x_time + C(unit_id)",
        data=pre_data
    ).fit(cov_type="cluster", cov_kwds={"groups": pre_data["cluster_id"]})
    diff_trend = trend_test.params["treat_x_time"]
    p_value = trend_test.pvalues["treat_x_time"]

    print("PRE-TRENDS DIAGNOSTIC")
    print("=" * 60)
    print(f"Differential trend: {diff_trend:.4f} (p={p_value:.4f})")

    if p_value < 0.05:
        print("WARNING: PARALLEL TRENDS VIOLATED - Standard DID is biased!")
        print("\nImplementing alternative approaches:")

        # Approach 1: Control for differential trends
        print("\n1. DID WITH DIFFERENTIAL TRENDS")
        print("-" * 40)
        df_trends = df.copy()
        df_trends["time_numeric"] = df_trends.groupby("unit_id")["time"].transform(
            lambda x: x - x.min()
        )
        df_trends["treat_x_time"] = df_trends["treat"] * df_trends["time_numeric"]
        df_trends["did"] = df_trends["treat"] * df_trends["post"]
        trends_model = smf.ols(
            "outcome ~ did + treat_x_time + C(unit_id) + C(time)",
            data=df_trends
        ).fit(cov_type="cluster", cov_kwds={"groups": df_trends["cluster_id"]})
        print(f"Effect (controlling for trends): {trends_model.params['did']:.4f}")
        print(f"SE: {trends_model.bse['did']:.4f}")
        print("Note: Assumes linear differential trends continue post-treatment")

        # Approach 2: Propensity score matching + DID
        print("\n2. MATCHED DID")
        print("-" * 40)
        # Match on pre-treatment characteristics: level, volatility, trend
        pre_chars = pre_data.groupby("unit_id").agg({
            "outcome": ["mean", "std", lambda x: x.iloc[-1] - x.iloc[0]],
            "treat": "first"
        }).reset_index()
        pre_chars.columns = ["unit_id", "pre_mean", "pre_std", "pre_trend", "treat"]

        # Estimate propensity scores
        X = pre_chars[["pre_mean", "pre_std", "pre_trend"]].values
        y = pre_chars["treat"].values
        ps_model = LogisticRegression(random_state=42)
        ps_model.fit(X, y)
        pre_chars["pscore"] = ps_model.predict_proba(X)[:, 1]

        # Match treated to control units
        treated_units = pre_chars[pre_chars["treat"] == 1]
        control_units = pre_chars[pre_chars["treat"] == 0]

        # 1:1 nearest-neighbor matching on the propensity score
        nn = NearestNeighbors(n_neighbors=1)
        nn.fit(control_units[["pscore"]].values)
        distances, indices = nn.kneighbors(treated_units[["pscore"]].values)
        matched_controls = control_units.iloc[indices.flatten()]["unit_id"].values
        treated_ids = treated_units["unit_id"].values

        # Create the matched sample
        matched_df = df[
            df["unit_id"].isin(treated_ids) | df["unit_id"].isin(matched_controls)
        ].copy()
        print(f"Matched sample: {len(matched_df['unit_id'].unique())} units")
        print(f"  ({len(treated_ids)} treated, {len(matched_controls)} control)")

        # Run DID on the matched sample
        matched_df["did"] = matched_df["treat"] * matched_df["post"]
        matched_model = smf.ols(
            "outcome ~ did + C(unit_id) + C(time)", data=matched_df
        ).fit(cov_type="cluster", cov_kwds={"groups": matched_df["cluster_id"]})
        print(f"Effect (matched DID): {matched_model.params['did']:.4f}")
        print(f"SE: {matched_model.bse['did']:.4f}")

        # Approach 3: Recommend alternatives
        print("\n3. RECOMMENDED ALTERNATIVES")
        print("-" * 40)
        print("Consider these methods when parallel trends fails:")
        print("• Synthetic Control Method (if few treated units)")
        print("• Changes-in-Changes (Athey & Imbens, 2006)")
        print("• Interactive Fixed Effects (Bai, 2009)")
        print("• Instrumental Variables (if available)")

        # Visualize the pre-trends failure and solutions
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))

        # Original pre-trends (failing)
        pre_means = pre_data.groupby(["time", "treat"])["outcome"].mean().unstack()
        axes[0].plot(pre_means.index, pre_means[0], 'o-', label='Control')
        axes[0].plot(pre_means.index, pre_means[1], 's-', label='Treatment')
        axes[0].set_title("Original: Parallel Trends FAILED")
        axes[0].set_xlabel("Time")
        axes[0].set_ylabel("Outcome")
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)

        # After matching
        matched_pre = matched_df[matched_df["post"] == 0]
        matched_means = matched_pre.groupby(["time", "treat"])["outcome"].mean().unstack()
        axes[1].plot(matched_means.index, matched_means[0], 'o-', label='Matched Control')
        axes[1].plot(matched_means.index, matched_means[1], 's-', label='Treatment')
        axes[1].set_title("After Matching: Improved Balance")
        axes[1].set_xlabel("Time")
        axes[1].set_ylabel("Outcome")
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)

        # Comparison of estimates (standard TWFE DID as the baseline)
        df_std = df.copy()
        df_std["did"] = df_std["treat"] * df_std["post"]
        std_model = smf.ols(
            "outcome ~ did + C(unit_id) + C(time)", data=df_std
        ).fit(cov_type="cluster", cov_kwds={"groups": df_std["cluster_id"]})
        methods = ["Standard\nDID", "Trend-\nAdjusted", "Matched\nDID"]
        estimates = [
            std_model.params["did"],
            trends_model.params["did"],
            matched_model.params["did"]
        ]
        axes[2].bar(methods, estimates)
        axes[2].set_title("Comparison of Estimates")
        axes[2].set_ylabel("Effect Estimate")
        axes[2].axhline(0, color='gray', linestyle='--', alpha=0.5)
        axes[2].grid(True, alpha=0.3, axis='y')

        plt.tight_layout()
        os.makedirs("./images", exist_ok=True)
        plt.savefig("./images/did_pretrends_failure.png", dpi=144, bbox_inches="tight")
        print("\nSUCCESS: Diagnostic plots saved to ./images/did_pretrends_failure.png")
        return trends_model, matched_model, fig,
    else:
        print("SUCCESS: Parallel trends holds - proceed with standard DID")
        return None, None, None,
```
</code> <best_practice> When parallel trends fails:
1. NEVER ignore it - standard DID will be biased
2. Try multiple approaches and check sensitivity
3. Be transparent about the failure when reporting
4. Consider that treatment may truly be endogenous
5. Sometimes the honest answer is "we cannot identify the causal effect" </best_practice> </example> </examples>
<common_mistakes> <mistake severity="critical"> <what>Not clustering standard errors</what> <consequence>Type I error rates can reach 45% instead of 5% (Bertrand et al., 2004)</consequence> <prevention>ALWAYS cluster at the level of treatment assignment (usually the unit level)</prevention> </mistake> <mistake severity="critical"> <what>Ignoring the pre-trends test</what> <consequence>The entire estimate may be spurious if trends differ</consequence> <prevention>Always test and visualize pre-trends, and report the result even if it fails</prevention> </mistake> <mistake severity="high"> <what>Using TWFE with heterogeneous effects and staggered adoption</what> <consequence>The estimate is a weighted average that includes "bad comparisons" (negative weights)</consequence> <prevention>Use Callaway-Sant'Anna, Sun-Abraham, or other robust estimators</prevention> </mistake> <mistake severity="high"> <what>Including bad controls (post-treatment variables)</what> <consequence>Bias from controlling for mechanisms</consequence> <prevention>Only control for pre-treatment characteristics</prevention> </mistake> <mistake severity="medium"> <what>Not checking for anticipation effects</what> <consequence>Underestimates the true effect if behavior changes before official treatment</consequence> <prevention>Check event study leads just before treatment</prevention> </mistake> </common_mistakes>
<interpretation_guide> <interpreting_results>
- The DID coefficient is the Average Treatment Effect on the Treated (ATT)
- Units are the same as the outcome variable (be specific!)
- Relative effect = (coefficient / pre-treatment mean) × 100%
- Check both statistical and economic significance </interpreting_results> <red_flags>
- Pre-trend coefficients significantly different from zero
- Event study shows effects before treatment (anticipation)
- Very different estimates across specifications
- Implausibly large effects (>50% changes are rare) </red_flags> <next_steps>
- All checks pass → Report the main estimate with confidence
- Pre-trends fail → Try alternative methods or acknowledge the limitation
- Heterogeneous effects → Report by subgroup
- Dynamic effects → Focus on the event study rather than a single coefficient </next_steps> </interpretation_guide>
<references> <paper>Angrist, J.D. & Pischke, J.S. (2009). "Mostly Harmless Econometrics." Princeton University Press. Canonical DID exposition.</paper> <paper>Bertrand, M., Duflo, E. & Mullainathan, S. (2004). "How Much Should We Trust Differences-in-Differences Estimates?" Quarterly Journal of Economics. Serial correlation and clustering.</paper> <paper>Roth, J., Sant'Anna, P.H.C., Bilinski, A. & Poe, J. (2023). "What's Trending in Difference-in-Differences?" Journal of Econometrics. Modern best practices and pre-testing.</paper> <paper>Goodman-Bacon, A. (2021). "Difference-in-Differences with Variation in Treatment Timing." Journal of Econometrics. TWFE bias.</paper> <paper>Callaway, B. & Sant'Anna, P.H.C. (2021). "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics. Robust estimation.</paper> <paper>Sun, L. & Abraham, S. (2021). "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects." Journal of Econometrics. Event study bias correction.</paper> </references> </skill_content>