Awesome-Agent-Skills-for-Empirical-Research · empirical-playbook

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/11-James-Traina-compound-science/skills/empirical-playbook" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-empirical-playboo && rm -rf "$T"
manifest: skills/11-James-Traina-compound-science/skills/empirical-playbook/SKILL.md
source content

Applied Micro Toolkit

Reference for applied micro research design: method selection, diagnostics, inference, pitfalls, reporting standards, and power analysis.

When to Use This Skill

Use when the user is:

  • Choosing between empirical methods for a causal question
  • Evaluating which identification strategy fits their data and setting
  • Running standard diagnostic tests and unsure which ones apply
  • Designing a study and needing to calculate statistical power
  • Reviewing or critiquing an empirical strategy
  • Preparing the "Empirical Strategy" section of a paper
  • Downloading macroeconomic or cross-national data (see references/data-sources.md for FRED/World Bank API access)

Skip when:

  • Implementation details for a specific method are needed (use the causal-inference skill for IV, DiD, RDD, SC, matching)
  • The task is structural estimation (use the structural-modeling skill)
  • The task is manuscript preparation or journal logistics (use the submission-guide skill)
  • The task is a formal identification proof (use the identification-proofs skill)
  • The task is Bayesian model specification (use the bayesian-estimation skill)

After selecting a method, the econometric-reviewer agent can review the implementation and the identification-critic agent can evaluate the identification argument.

Method Selection Decision Tree

Start with the fundamental question: What source of variation identifies the causal effect?

Step 1: What is your source of variation?

| Source of Variation | Method Family | Key Assumption |
| --- | --- | --- |
| Randomized assignment (with full compliance) | Experimental analysis (OLS on treatment indicator) | Random assignment |
| Randomized assignment (with imperfect compliance) | IV / 2SLS using random assignment as instrument | Exclusion restriction, monotonicity |
| Policy change at a sharp threshold | Sharp RDD | Continuity of potential outcomes at cutoff |
| Policy change at a threshold with imperfect compliance | Fuzzy RDD (= IV at the cutoff) | Continuity + monotonicity at cutoff |
| Policy change at a point in time, with affected and unaffected groups | Difference-in-differences | Parallel trends |
| Staggered policy adoption across units over time | Staggered DiD (Callaway-Sant'Anna, Sun-Abraham, etc.) | Parallel trends (conditional on group and time) |
| Rare event affecting a single unit, long pre-treatment data | Synthetic control | Pre-treatment fit implies post-treatment counterfactual |
| Exogenous shifter of treatment that does not affect the outcome directly | IV / 2SLS / GMM | Exclusion restriction, relevance, monotonicity |
| Rich set of observables that plausibly captures all confounders | Matching, IPW, AIPW (selection on observables) | Conditional independence (no unobserved confounders) |
| No credible exogenous variation | Sensitivity analysis, bounds, partial identification | Depends on bounding assumptions |

Step 2: Refinements Within Method Families

Within DiD:

Is treatment timing staggered?
├── No → Classic 2x2 DiD (TWFE is fine)
└── Yes
    ├── Can treatment turn off (reversals)?
    │   ├── Yes → de Chaisemartin-D'Haultfoeuille (2020)
    │   └── No
    │       ├── Do you have never-treated units?
    │       │   ├── Yes → Callaway-Sant'Anna (2021) with never-treated controls
    │       │   └── No → Callaway-Sant'Anna with not-yet-treated controls
    │       │           or Sun-Abraham (2021)
    │       └── Are effects likely heterogeneous across cohorts?
    │           ├── Yes → Callaway-Sant'Anna or Sun-Abraham (NOT TWFE)
    │           └── No → TWFE is OK, but report Bacon decomposition
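
A minimal sketch of the classic 2x2 branch above (the "TWFE is fine" case) on simulated data. The column names (`unit`, `treated`, `post`) and effect sizes are illustrative; the staggered branches call for a dedicated estimator (Callaway-Sant'Anna, Sun-Abraham), not this plain OLS:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated 2x2 panel: 50 units, half treated, true DiD effect = 1.5.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({"unit": np.arange(n) % 50})
df["treated"] = (df["unit"] < 25).astype(int)
df["post"] = (np.arange(n) >= n // 2).astype(int)
df["y"] = (0.5 * df["treated"] + 0.3 * df["post"]
           + 1.5 * df["treated"] * df["post"] + rng.normal(size=n))

# The interaction coefficient is the DiD estimate; cluster SEs at the
# level of treatment assignment (here, the unit).
m = smf.ols("y ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})
print(m.params["treated:post"], m.bse["treated:post"])
```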

Within IV:

How many instruments for how many endogenous regressors?
├── Exactly identified (K instruments = K endogenous)
│   └── 2SLS (= IV = Wald estimator for single instrument)
├── Over-identified (K instruments > K endogenous)
│   ├── 2SLS (default)
│   ├── GMM (efficient, use if heteroskedasticity suspected)
│   └── LIML (less biased with weak instruments)
└── Under-identified (K instruments < K endogenous)
    └── Cannot identify all parameters — need more instruments or fewer endogenous regressors
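
A hedged 2SLS sketch for the just-identified case, using simulated data and the linearmodels package; variable names and coefficients are illustrative. The point is the workflow: estimate by proper 2SLS and report first-stage strength alongside the estimate:

```python
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

# Simulated just-identified IV: z shifts x; u confounds x and y.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
u = rng.normal(size=n)                    # unobserved confounder
x = 0.5 * z + u + rng.normal(size=n)      # first stage, pi = 0.5
y = 1.0 * x + u + rng.normal(size=n)      # true beta = 1.0; OLS biased up
df = pd.DataFrame({"y": y, "x": x, "z": z})

res = IV2SLS.from_formula("y ~ 1 + [x ~ z]", data=df).fit(cov_type="robust")
print(res.params["x"])       # ~1.0
print(res.first_stage)       # first-stage diagnostics, including partial F
```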

Within RDD:

Does crossing the threshold guarantee treatment?
├── Yes → Sharp RDD
└── No → Fuzzy RDD
    └── Is the running variable continuous?
        ├── Yes → Standard rdrobust
        └── No (discrete / few mass points)
            └── Cattaneo-Idrobo-Titiunik (2019) discrete RD methods
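
A minimal sharp-RD sketch on simulated data: a local linear fit with separate slopes on each side of the cutoff, inside a hand-picked bandwidth. In practice rdrobust chooses the bandwidth and applies bias correction; everything here (the DGP, h = 0.25) is illustrative only:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated sharp RD: treatment switches on at r = 0, true jump = 2.0.
rng = np.random.default_rng(0)
n = 5000
r = rng.uniform(-1, 1, n)                 # running variable
d = (r >= 0).astype(int)                  # sharp assignment at the cutoff
y = 1 + 0.8 * r + 2.0 * d + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"y": y, "r": r, "d": d})

# Local linear regression with separate slopes, within bandwidth h.
h = 0.25
local = df[df["r"].abs() < h]
m = smf.ols("y ~ d + r + d:r", data=local).fit(cov_type="HC1")
print(m.params["d"], m.bse["d"])          # jump at the cutoff, ~2.0
```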

Within Matching / Selection on Observables:

Is the selection-on-observables assumption plausible?
├── No → Need a different identification strategy
└── Yes
    ├── Do you need ATE or ATT?
    │   ├── ATE → IPW or AIPW
    │   └── ATT → Matching or IPW with ATT weights
    ├── Is the propensity score model well-specified?
    │   ├── Uncertain → Use AIPW (doubly robust)
    │   └── Confident → IPW or regression adjustment
    └── Many covariates or nonlinear confounding?
        ├── Yes → ML-based methods (causal forests, DML)
        └── No → Parametric PS model + AIPW
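
A minimal AIPW (doubly robust) sketch on simulated data, combining a propensity score model with per-arm outcome regressions; the model choices (logistic, linear) and all numbers are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Simulated selection on observables: x drives both treatment and outcome.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(x[:, 0] - 0.5 * x[:, 1])))      # true propensity
d = rng.binomial(1, p)
y = 1.0 * d + x[:, 0] + x[:, 1] + rng.normal(size=n)  # true ATE = 1.0

# AIPW: outcome model for each arm plus an IPW correction term.
ps = LogisticRegression().fit(x, d).predict_proba(x)[:, 1]
mu1 = LinearRegression().fit(x[d == 1], y[d == 1]).predict(x)
mu0 = LinearRegression().fit(x[d == 0], y[d == 0]).predict(x)
psi = mu1 - mu0 + d * (y - mu1) / ps - (1 - d) * (y - mu0) / (1 - ps)
print(psi.mean(), psi.std(ddof=1) / n ** 0.5)         # ATE ~1.0 and its SE
```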

Standard Diagnostics by Method

Key diagnostics to run for each method family. For full reporting checklists and minimum standards, see references/reporting-standards.md.

| Method | Must-Run Diagnostics | Key Concern |
| --- | --- | --- |
| IV / 2SLS | First-stage F (KP), reduced form, overid test | Weak instruments (F < 10), exclusion restriction |
| DiD (classic) | Pre-trend F-test, event study plot, raw means by group/period | Parallel trends violation |
| Staggered DiD | Bacon decomposition, Callaway-Sant'Anna group-time ATTs | Negative TWFE weights with heterogeneous effects |
| RDD | McCrary density test, covariate balance at cutoff, bandwidth sensitivity | Manipulation of running variable, extrapolation bias |
| Synthetic Control | Pre-fit RMSPE, permutation p-value, leave-one-out | Pre-period fit quality, donor pool sensitivity |
| Matching / AIPW | Overlap plots, Love plot (SMD before/after), Oster/Rosenbaum bounds | Lack of overlap, unobserved confounders |
| Structural | Convergence, identification rank condition, robustness to starting values | Global vs local optimum, identification failure |

For implementation details and diagnostic code by method, see the causal-inference skill.

Inference Frameworks

Clustering Decision Rule

  1. Identify the level at which treatment is assigned → cluster at that level (minimum)
  2. If there are within-cluster correlations beyond treatment (e.g., spatial), consider multi-way clustering
  3. If the number of clusters is small (< 30–40), use wild cluster bootstrap (Cameron-Gelbach-Miller 2008)
  4. If the number of clusters is very small (< 10), cluster-robust methods may not work at all — consider randomization inference or aggregate to the cluster level

| Mistake | Consequence | Fix |
| --- | --- | --- |
| Clustering too fine (individual when treatment is at the state level) | SEs too small; over-rejection | Cluster at the level of treatment assignment |
| Few clusters (< 30–40) with standard cluster-robust SEs | Poor finite-sample properties | Wild cluster bootstrap |
| Not clustering when treatment varies at the group level | SEs dramatically understated | Always cluster at the level of treatment assignment |
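
A quick simulation of the first mistake in the table: treatment assigned at the state level with a within-state error component. The numbers (40 states, 200 people each) are illustrative; the clustered SE comes out several times the heteroskedasticity-robust one:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# State-assigned treatment with a state-level error component.
rng = np.random.default_rng(0)
states = np.repeat(np.arange(40), 200)       # 40 states x 200 individuals
treat = (states < 20).astype(int)            # treatment varies by state only
y = 0.2 * treat + rng.normal(size=40)[states] + rng.normal(size=states.size)
df = pd.DataFrame({"y": y, "treat": treat, "state": states})

model = smf.ols("y ~ treat", data=df)
se_robust = model.fit(cov_type="HC1").bse["treat"]
se_cluster = model.fit(cov_type="cluster",
                       cov_kwds={"groups": df["state"]}).bse["treat"]
print(se_robust, se_cluster)   # clustering at the state level inflates the SE
```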

Design-Based vs Model-Based Inference

| Dimension | Design-Based | Model-Based |
| --- | --- | --- |
| Source of randomness | Treatment assignment mechanism | Outcome draws from a superpopulation |
| Key assumption | Known or modeled treatment assignment | Correct outcome model specification |
| Examples | Experiments, RCTs, RDD, DiD, natural experiments | Structural models, matching, cross-sectional surveys |
| Advantages | Transparent; does not require an outcome model | More powerful; extends to complex settings |

Design-based inference is appropriate when the assignment mechanism is known (experiments, lotteries, cutoffs); model-based inference is appropriate when random sampling from a superpopulation is reasonable. The standard in applied micro is a hybrid: design-based identification combined with model-based inference. Doubly robust methods (AIPW) combine both.

Power Analysis

The key quantity is the Minimum Detectable Effect (MDE) — the smallest effect detectable with 80% power at alpha = 0.05.

Quick MDE formula (equal groups, two-sided test):

MDE = 2.8 × SE(beta_hat), where 2.8 ≈ z_{0.975} + z_{0.80} = 1.96 + 0.84

With equal treatment and control arms and total sample N, SE(beta_hat) = 2 × sigma / sqrt(N), so:

MDE = 5.6 × sigma / sqrt(N)

Required N = (5.6 × sigma / MDE)²

For IV designs, the effective MDE is inflated by the inverse of the first-stage coefficient: MDE_IV ≈ MDE_OLS / |pi|. A weak first stage (small |pi|) dramatically reduces power.

For DiD designs, effective power increases with more post-treatment periods and higher within-group correlation (absorbed by FEs). For RDD, use effective N (observations within bandwidth), not total N.

For cluster-randomized designs, the design effect (1 + (m − 1) × ICC) inflates the variance: with ICC = 0.05 and cluster size m = 50, you need 3.45x as many observations.
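
A back-of-the-envelope check of these formulas (all inputs are illustrative):

```python
from scipy.stats import norm

# MDE = (z_{1-alpha/2} + z_power) * SE(beta_hat); equal arms, total N.
alpha, power, sigma, N = 0.05, 0.80, 1.0, 2000
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)   # ~2.80
mde = z * 2 * sigma / N ** 0.5                  # = 5.6 * sigma / sqrt(N)
print(round(mde, 3))                            # ~0.125 sd

# IV inflation and the cluster design effect from the text.
pi = 0.3
print(mde / abs(pi))                            # MDE_IV ~ MDE_OLS / |pi|
icc, m = 0.05, 50
print(1 + (m - 1) * icc)                        # design effect = 3.45
```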

For full MDE formulas (DiD, IV, RDD, cluster-randomized), power simulation code, and MDE interpretation tables, see references/reporting-standards.md.

Research Design Checklist

Before Touching Data

  • Research question: What causal parameter are you trying to estimate? Write it as a formal estimand.
  • Identification strategy: What source of variation identifies the effect? Draw the DAG.
  • Assumptions: List all identification assumptions explicitly. Which are testable?
  • Threats: For each assumption, what is the most plausible violation? How would you detect it?
  • Power: Given your expected sample size, what is the MDE? Is it policy-relevant?
  • Pre-analysis plan: For prospective studies, register the plan before seeing outcomes.

During Analysis

  • Data cleaning documented: Every sample restriction justified and recorded.
  • Summary statistics: Know your data before running regressions.
  • Main specification: Run the main spec first. Resist the urge to search for significance.
  • Diagnostics: Run all standard diagnostics for your method (see table above).
  • Robustness: Vary specification choices systematically.
  • Magnitude interpretation: Can you explain the coefficient in plain language?

Before Submission

  • All diagnostics reported: See method-specific standards in references/reporting-standards.md.
  • Replication package: Code runs from raw data to all tables and figures.
  • Seeds set: All random number generators seeded for reproducibility.
  • Limitations discussed: What are the strongest objections? Address them in the paper.
  • Literature positioned: Have you cited and compared to the 5 closest papers?

Common Pitfalls

Bad Controls

A "bad control" is a variable that is itself an outcome of treatment. Conditioning on it introduces selection bias.

| Variable Type | Example | Why It Is Bad |
| --- | --- | --- |
| Post-treatment outcome | Controlling for occupation when estimating returns to education | Education affects occupation; conditioning selects on an outcome of treatment |
| Mediator | Controlling for wages when estimating the effect of training on employment | Blocks part of the causal effect |
| Collider | Conditioning on "survived" when estimating health effects | Opens a non-causal path |

Rule of thumb: If you cannot be sure a variable is determined before treatment, do not include it as a control. When in doubt, draw the DAG.
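
A small simulation of the mediator row above: training (d) raises skills (m), and both raise wages (y), so controlling for the mediator recovers only the direct effect rather than the total effect of treatment. All coefficients are illustrative:

```python
import numpy as np
import statsmodels.api as sm

# d -> m -> y plus a direct d -> y path. Total effect = 1.0 + 0.5 = 1.5.
rng = np.random.default_rng(0)
n = 100_000
d = rng.binomial(1, 0.5, n)
m = 1.0 * d + rng.normal(size=n)            # mediator, caused by treatment
y = 1.0 * d + 0.5 * m + rng.normal(size=n)

print(sm.OLS(y, sm.add_constant(d)).fit().params[1])   # ~1.5 (total effect)
X_bad = sm.add_constant(np.column_stack([d, m]))       # m is a bad control
print(sm.OLS(y, X_bad).fit().params[1])                # ~1.0 (direct only)
```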

Staggered DiD with Heterogeneous Effects

| Mistake | Consequence | Fix |
| --- | --- | --- |
| Running TWFE with staggered timing | Already-treated units used as controls; negative weights; estimate can have the wrong sign | Use Callaway-Sant'Anna, Sun-Abraham, or another modern DiD estimator |
| Using a single post-treatment indicator for all cohorts | Masks heterogeneity in treatment effects across cohorts | Estimate group-time ATTs separately, then aggregate |
| Not reporting the Bacon decomposition | Reader cannot assess how much of the TWFE estimate comes from problematic comparisons | Report bacondecomp output |

Forbidden Regressions

Never plug a manually estimated first stage into an OLS second stage (the second-stage SEs are wrong; use proper 2SLS). Never feed a nonlinear first stage into a linear second stage (inconsistent; use a control function). Never include generated regressors without bootstrapping the full two-step procedure.
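
A sketch demonstrating the first forbidden regression on simulated data: the manual two-step reproduces the 2SLS point estimate exactly but not its standard error. The package choice and all numbers are illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS

# Standard endogenous setup: u confounds x and y, z is the instrument.
rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * z + u + rng.normal(size=n)
y = 1.0 * x + u + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x": x, "z": z})

df["xhat"] = smf.ols("x ~ z", data=df).fit().fittedvalues
manual = smf.ols("y ~ xhat", data=df).fit()              # forbidden
proper = IV2SLS.from_formula("y ~ 1 + [x ~ z]",
                             data=df).fit(cov_type="unadjusted")
print(manual.params["xhat"], proper.params["x"])   # identical point estimates
print(manual.bse["xhat"], proper.std_errors["x"])  # manual SE is wrong
```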

Integration

For full minimum reporting standards (method-specific checklists for IV, DiD, RDD, SC, Matching) and complete power analysis code, see references/reporting-standards.md. For sensitivity analysis procedures (Oster bounds, Conley bounds, breakdown frontiers, specification curves), see references/sensitivity-analysis.md.

Agents:

  • econometric-reviewer: Reviews identification strategy, standard errors, and diagnostic results
  • identification-critic: Evaluates identification argument completeness and exclusion restrictions
  • numerical-auditor: Designs power simulations for nonstandard study designs
  • journal-referee: Reviews whether the empirical strategy meets journal standards

Cross-references:

  • identification-proofs skill: Formalize an identification argument for the chosen method
  • references/diagnostic-battery.md: Run the full diagnostic battery for the estimated specification
  • references/sensitivity-analysis.md: Run sensitivity analysis (Oster bounds, specification curve, breakdown frontier)
  • publication-output skill: Format regression tables and diagnostic output for publication