Awesome-Agent-Skills-for-Empirical-Research / causal-inference

```bash
# Clone the full repository
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

# Or copy just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/11-James-Traina-compound-science/skills/causal-inference" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-causal-inference && rm -rf "$T"
```

skills/11-James-Traina-compound-science/skills/causal-inference/SKILL.md

Causal Inference
Reference for implementing causal inference methods: from identification strategy to estimation to diagnostics and robustness. Covers the major quasi-experimental and observational methods used in applied economics and quantitative social science.
When to Use This Skill
Use when the user is:
- Choosing an identification strategy for a causal question
- Implementing IV/2SLS, DiD, RDD, synthetic control, or matching
- Debugging specification issues (weak instruments, parallel trends violations, bandwidth sensitivity)
- Running robustness checks or falsification tests
- Working with modern DiD methods for staggered treatment timing
Skip when:
- The task is structural estimation (use the `structural-modeling` skill)
- The task is pure prediction/ML (no causal question)
- The user needs simulation design (use the `numerical-auditor` agent)
Where to Start
- Choosing a method? Jump to Method Selection Guide at the end
- Implementing a specific method? Go directly to that method's section below
- Need full code? See `references/method-implementations.md` for complete implementations
Frameworks
Two complementary frameworks underpin all causal inference:
Potential Outcomes (Rubin): Define Y(1), Y(0) as potential outcomes under treatment and control. The causal effect is τ = Y(1) - Y(0). The fundamental problem: we never observe both for the same unit. All methods are strategies for constructing valid counterfactuals.
DAGs (Pearl): Graphical models encoding conditional independence assumptions. Use d-separation to determine what must be conditioned on (and what must NOT be conditioned on) to identify causal effects. Particularly useful for reasoning about bad controls (colliders, mediators), overcontrol bias, and which instruments satisfy the exclusion restriction.
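To make the fundamental problem concrete, a standard identity (shown here as a sketch, not specific to this skill) splits the naive treated-vs-control comparison into the ATT plus a selection-bias term:

```latex
\underbrace{E[Y \mid D=1] - E[Y \mid D=0]}_{\text{naive comparison}}
= \underbrace{E[Y(1) - Y(0) \mid D=1]}_{\text{ATT}}
+ \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}
```

Each method below is a different strategy for making the selection-bias term (approximately) zero: by design (RDD, IV), by assumption (parallel trends, selection on observables), or by construction (synthetic control).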
Quick Reference: Methods at a Glance
| Method | Key Assumption | Target Parameter | Key Package |
|---|---|---|---|
| IV/2SLS | Exclusion restriction, monotonicity | LATE | `linearmodels` (Py) |
| DiD | Parallel trends | ATT | `did` (R), `fixest` (R) |
| RDD | No manipulation, local continuity | LATE at cutoff | `rdrobust` (all) |
| Synthetic Control | Weights reproduce pre-treatment trends | ATT (single unit) | `Synth` / `augsynth` (R) |
| Matching/AIPW | Selection on observables | ATE or ATT | `econml` (Py) |
Target Parameters
Be precise about what parameter you are estimating:
| Parameter | Definition | Estimated by |
|---|---|---|
| ATE | E[Y(1) - Y(0)] | Randomized experiment, IPW, AIPW |
| ATT | E[Y(1) - Y(0) \| D=1] | DiD, matching, selection-on-observables |
| LATE | E[Y(1) - Y(0) \| compliers] | IV/2SLS (Imbens-Angrist 1994) |
| ATT(g,t) | Group-time specific treatment effect | Staggered DiD (Callaway-Sant'Anna) |
Common mistake: IV estimates LATE, not ATE. DiD estimates ATT, not ATE. This matters for policy interpretation.
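For a binary instrument Z and binary treatment D, the IV estimand is the Wald ratio; under instrument validity and monotonicity (Imbens-Angrist 1994) it identifies the complier average effect (a sketch of the standard result):

```latex
\tau_{\text{LATE}}
= \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]}
= E[Y(1) - Y(0) \mid \text{compliers}]
```

Compliers are the units whose treatment status is shifted by the instrument, which is why the estimate need not extrapolate to always-takers or never-takers.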
Instrumental Variables (IV/2SLS)
Key idea: Find a variable Z that shifts D (first stage) but affects Y only through D (exclusion restriction).
```python
from linearmodels.iv import IV2SLS

result = IV2SLS.from_formula(
    'lwage ~ 1 + exper + expersq + [educ ~ nearc4 + nearc2]',
    data=df
).fit(cov_type='robust')
# Always check first-stage F > 10; report LIML as robustness with weak instruments
```
For R/Stata implementations, weak instrument corrections (LIML, Anderson-Rubin), and overidentification tests, see `references/method-implementations.md`.
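A minimal sketch of the first-stage and reduced-form checks from the checklist below, using statsmodels and the same Card-style column names as the example above (assumed names; adapt to your data):

```python
import statsmodels.formula.api as smf

# First stage: regress the endogenous variable on instruments + exogenous controls,
# then F-test the excluded instruments (rule of thumb: F > 10).
first_stage = smf.ols('educ ~ nearc4 + nearc2 + exper + expersq', data=df).fit(cov_type='HC1')
print(first_stage.f_test('(nearc4 = 0), (nearc2 = 0)'))

# Reduced form: the outcome regressed directly on the instruments.
# If the reduced form is flat, the 2SLS estimate is unlikely to be credible.
reduced_form = smf.ols('lwage ~ nearc4 + nearc2 + exper + expersq', data=df).fit(cov_type='HC1')
print(reduced_form.summary().tables[1])
```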
IV Diagnostics Checklist:
- First-stage F > 10 (or Olea-Pflueger effective F for robust inference)
- Exclusion restriction argued substantively (not testable)
- Monotonicity for LATE interpretation (no defiers)
- Reduced form significant (regress Y directly on Z)
- Overidentification test reported if over-identified
- Compare OLS vs 2SLS — direction and magnitude as expected?
- Report LATE interpretation — who are the compliers?
Difference-in-Differences (DiD)
Key idea: Compare changes over time between treated and control groups, assuming they would have followed parallel trends absent treatment.
```python
import statsmodels.formula.api as smf

# Standard 2x2 DiD
result = smf.ols('y ~ treated + post + treated:post', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['state']}
)
# Coefficient on treated:post is the DiD estimate
```
Staggered treatment timing: With staggered adoption, TWFE can produce sign-reversed estimates due to negative weights. Use:
- Callaway-Sant'Anna (`did` R package): Most flexible aggregation, doubly robust
- Sun-Abraham (`fixest::sunab`): Integrates directly into `feols`; simpler for event studies
- Bacon decomposition (`bacondecomp` R): Diagnose how much weight TWFE puts on contaminated comparisons
For full staggered DiD code (C-SA, Sun-Abraham, BJS24, de Chaisemartin-D'H), see `references/staggered-did.md`. For event study code and HonestDiD pre-trend sensitivity, see `references/method-implementations.md`.
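A minimal event-study sketch for the pre-trends check in the checklist below, using only statsmodels. The column names (y, unit, year, state, event_time) are assumptions; with staggered adoption prefer the Callaway-Sant'Anna or Sun-Abraham estimators above, since this is the plain TWFE version:

```python
import statsmodels.formula.api as smf

# Indicators for each event time, with t = -1 omitted as the reference period.
# event_time is years relative to treatment; NaN for never-treated units,
# which therefore stay in the reference group.
event_times = sorted(int(k) for k in df['event_time'].dropna().unique() if k != -1)
terms = []
for k in event_times:
    name = f'ev_m{abs(k)}' if k < 0 else f'ev_p{k}'
    df[name] = (df['event_time'] == k).astype(int)
    terms.append(name)

es = smf.ols(f"y ~ {' + '.join(terms)} + C(unit) + C(year)", data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['state']}
)
# Coefficients on ev_m2, ev_m3, ... are pre-trends (should be near zero);
# ev_p0, ev_p1, ... trace out the dynamic treatment effect.
```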
DiD Diagnostics Checklist:
- Pre-trends: Event study shows no significant pre-treatment coefficients
- Parallel trends sensitivity: Rambachan-Roth or similar analysis
- Staggered timing: If varies, use C-SA or S-A — NOT naive TWFE
- Clustering at level of treatment assignment (typically state/county)
- Anticipation: Check period just before treatment
- Bacon decomposition if using TWFE
Regression Discontinuity (RDD)
Key idea: Units just above and below a threshold are locally comparable; the jump at the threshold identifies the causal effect.
```python
from rdrobust import rdrobust, rdbwselect, rdplot

# Basic sharp RD with bias-corrected robust CI
result = rdrobust(y=df['outcome'], x=df['running_var'], c=0)
# Reports: point estimate, robust CI, MSE-optimal bandwidth, N left/right

# Density test for manipulation
from rddensity import rddensity
density_test = rddensity(X=df['running_var'], c=0)
```
For fuzzy RDD, bandwidth sensitivity tables, and R/Stata implementations, see `references/method-implementations.md`.
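A sketch of the bandwidth-sensitivity check from the checklist below: re-run rdrobust at multiples of a baseline bandwidth. Here `h_base` is a placeholder for the MSE-optimal bandwidth reported by the baseline fit above:

```python
from rdrobust import rdrobust

# Bandwidth sensitivity: re-estimate at multiples of a baseline bandwidth.
# h_base is a placeholder; take it from the MSE-optimal bandwidth that the
# baseline rdrobust call reports.
h_base = 10.0
fits = {}
for mult in [0.5, 0.75, 1.0, 1.5, 2.0]:
    fits[mult] = rdrobust(y=df['outcome'], x=df['running_var'], c=0, h=mult * h_base)
    print(f'--- bandwidth multiple {mult} ---')
    print(fits[mult])  # point estimate and robust CI should be stable across multiples
```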
RDD Diagnostics Checklist:
- McCrary density test: No manipulation of running variable at cutoff
- Covariate balance: Run RD on predetermined covariates as placebo outcomes
- Bandwidth sensitivity: Results stable across 0.5×, 0.75×, 1×, 1.5×, 2× optimal
- Local linear (p=1) is standard — avoid high-order polynomials
- Donut hole: Drop observations very close to cutoff
- Placebo cutoffs: Run RD where no effect should exist
Synthetic Control
When to use: Single or very few treated units, long pre-treatment series, no obvious comparison group. SC constructs a synthetic counterfactual as a weighted average of donor units.
Key packages: R: `Synth`, `tidysynth`, `augsynth`; Python: `SparseSC`, `SyntheticControlMethods`
Diagnostics: Pre-treatment RMSPE (fit quality), permutation/placebo tests across donor units, leave-one-out stability, time placebo at earlier date.
For full implementation (Synth setup, augsynth, permutation tests), see `references/synthetic-control.md` and `references/method-implementations.md`. The identification-critic agent can evaluate SC identification assumptions.
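For intuition, a from-scratch sketch of the core weight problem: choose simplex-constrained donor weights that reproduce the treated unit's pre-treatment outcomes (scipy only, illustrative data; the packages above add covariate matching, inference, and diagnostics):

```python
import numpy as np
from scipy.optimize import minimize

# y1_pre: (T0,) pre-treatment outcomes of the treated unit
# Y0_pre: (T0, J) pre-treatment outcomes of the J donor units
# Illustrative arrays; replace with your data.
rng = np.random.default_rng(0)
Y0_pre = rng.normal(size=(20, 8))
y1_pre = Y0_pre @ np.full(8, 1 / 8) + rng.normal(scale=0.1, size=20)

def loss(w):
    # Pre-treatment fit: squared distance between treated unit and synthetic unit
    return np.sum((y1_pre - Y0_pre @ w) ** 2)

J = Y0_pre.shape[1]
res = minimize(
    loss,
    x0=np.full(J, 1 / J),
    bounds=[(0, 1)] * J,                                        # weights in [0, 1]
    constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1},   # weights sum to 1
    method='SLSQP',
)
weights = res.x  # synthetic control weights over donor units
```

This is the outcome-only version of the problem; the full Abadie et al. procedure also chooses predictor weights in a nested optimization.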
Matching and Weighting
Key idea: Reweight control group to match treated group on observed characteristics. Only valid under selection-on-observables (no unobserved confounders).
AIPW (doubly robust) is the recommended default — consistent if either the propensity score model or the outcome model is correctly specified.
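A sketch of where the double robustness comes from: the AIPW estimator of the ATE augments the plug-in outcome-model contrast with inverse-propensity-weighted residual corrections,

```latex
\hat{\tau}_{\text{AIPW}} = \frac{1}{n} \sum_{i=1}^{n} \left[
\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)
+ \frac{D_i \left( Y_i - \hat{\mu}_1(X_i) \right)}{\hat{e}(X_i)}
- \frac{(1 - D_i) \left( Y_i - \hat{\mu}_0(X_i) \right)}{1 - \hat{e}(X_i)}
\right]
```

Each correction term has mean zero if either the outcome models or the propensity score is correctly specified, which is what delivers consistency when one of the two is misspecified.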
```python
from econml.dr import LinearDRLearner
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# Doubly robust learner
dr = LinearDRLearner(
    model_regression=GradientBoostingRegressor(),
    model_propensity=GradientBoostingClassifier()
)
dr.fit(Y=df['y'], T=df['treatment'], X=df[covariates], W=None)
ate = dr.ate(df[covariates])
```
For propensity score estimation, IPW/Hajek estimators, and manual AIPW implementation, see `references/method-implementations.md`.
Matching Diagnostics Checklist:
- Covariate balance: Standardized mean differences < 0.1 after weighting (see the sketch after this checklist)
- Common support: Substantial overlap in propensity score distributions
- Sensitivity analysis: Rosenbaum bounds
- No post-treatment covariates in the propensity model
- Trim if propensity scores near 0 or 1
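A minimal sketch of the covariate-balance check from the first item above. The treatment column and covariate list follow the (assumed) names used in the AIPW example; `weights` is a hypothetical name for whatever weight vector your estimator produced:

```python
import numpy as np

def smd(x, d, w=None):
    """Standardized mean difference of covariate x between treated (d=1) and
    control (d=0), optionally using weights w (e.g. inverse-propensity weights)."""
    x, d = np.asarray(x, dtype=float), np.asarray(d)
    w = np.ones(len(x)) if w is None else np.asarray(w, dtype=float)
    m1 = np.average(x[d == 1], weights=w[d == 1])
    m0 = np.average(x[d == 0], weights=w[d == 0])
    v1 = np.average((x[d == 1] - m1) ** 2, weights=w[d == 1])
    v0 = np.average((x[d == 0] - m0) ** 2, weights=w[d == 0])
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

# Target: |SMD| < 0.1 for every covariate after weighting
balance = {c: smd(df[c], df['treatment'], w=weights) for c in covariates}
```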
Method Selection Guide
| Scenario | Recommended Method | Key Assumption |
|---|---|---|
| Random assignment with imperfect compliance | IV/2SLS | Exclusion restriction, monotonicity |
| Policy change at a threshold | RDD | No manipulation, local continuity |
| Policy change at a time point, treated and control groups | DiD | Parallel trends |
| Staggered policy adoption across units | Staggered DiD (C-SA, S-A) | Parallel trends (conditional) |
| Single treated unit, long pre-period | Synthetic control | Weights reproduce pre-treatment |
| Treatment assignment based on observables | Matching/IPW/AIPW | Selection on observables |
Decision heuristic:
- Is there a sharp threshold? → RDD
- Is there an instrument? → IV
- Is there a clean pre/post + treated/control? → DiD
- Only one treated unit? → Synthetic control
- Rich observables, selection on observables plausible? → AIPW
- None of the above → structural model may be needed
Common Anti-Patterns
| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| TWFE with staggered timing and heterogeneous effects | Negative weights, biased estimates | Use Callaway-Sant'Anna or Sun-Abraham |
| Reporting 2SLS without first-stage F | Reader cannot assess instrument strength | Always report first-stage F (and LIML as robustness) |
| High-order polynomial in RDD | Overfitting, poor boundary properties | Use local linear (p=1) with rdrobust |
| Matching on post-treatment variables | Conditioning on outcome of treatment | Only match on pre-treatment covariates |
| Claiming parallel trends hold because pre-event coefficients are insignificant | Low power; absence of evidence ≠ evidence of absence | Use Rambachan-Roth sensitivity analysis |
| IPW with extreme propensity scores (near 0 or 1) | Huge variance, unstable estimates | Trim, use normalized/Hajek weights, or switch to AIPW |
| Reporting only one bandwidth in RDD | Cherry-picking concern | Show results across bandwidth range |
| Cluster-robust SEs with few clusters (< 30-40) | Poor finite-sample coverage | Wild cluster bootstrap (Cameron, Gelbach, Miller 2008) |
Integration with compound-science
- `econometric-reviewer` — Reviews identification strategy, standard errors, and asymptotic properties
- `identification-critic` agent (`/identification-critic`) — Evaluates exclusion restrictions, support conditions, and identification completeness
- `identification-proofs` skill — Formalize an identification argument end-to-end
- `/estimate` — Run a full estimation pipeline with diagnostics
- `sensitivity-analysis.md` (`empirical-playbook` skill) — Oster bounds, specification curve, breakdown frontier for robustness
Additional References
- `references/method-implementations.md` — Full IV/2SLS, DiD event study, RDD, and matching/AIPW implementation code
- `references/staggered-did.md` — Full implementation code for Callaway-Sant'Anna, Sun-Abraham, BJS24, de Chaisemartin-D'Haultfoeuille, and Bacon decomposition
- `references/synthetic-control.md` — Standard SC optimizer, permutation/placebo test code, augmented SC, diagnostics checklist