Awesome-Agent-Skills-for-Empirical-Research stata
install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/14-luischanci-claude-code-research-starter/dot-claude/skills/stata" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-stata && rm -rf "$T"
manifest:
skills/14-luischanci-claude-code-research-starter/dot-claude/skills/stata/SKILL.md
Stata Skill
You have access to comprehensive Stata reference files. Do not load all files. Read only the 1-3 files relevant to the user's current task using the routing table below.
Critical Gotchas
These are Stata-specific pitfalls that lead to silent bugs. Internalize these before writing any code.
Missing Values Sort to +Infinity
Stata's system missing value (.) and the extended missing values (.a through .z) compare as greater than every number.
* WRONG: includes observations where income is missing!
gen high_income = (income > 50000)
* RIGHT
gen high_income = (income > 50000) if !missing(income)
* WRONG: missing ages appear in this list
list if age > 60
* RIGHT
list if age > 60 & !missing(age)
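The pitfall can be reproduced with the auto dataset that ships with Stata, where rep78 has a few missing values. This is a minimal sketch; the names naive, safe, high_naive, and high_safe are illustrative and not part of the skill files.

```stata
sysuse auto, clear

* Naive comparison: a missing rep78 counts as larger than 3
count if rep78 > 3
local naive = r(N)

* Guarded comparison: missing values are excluded
count if rep78 > 3 & !missing(rep78)
local safe = r(N)

* The gap between the two counts is exactly the number of missing values
count if missing(rep78)
assert `naive' - `safe' == r(N)

* The same issue when generating an indicator
gen high_naive = (rep78 > 3)                      // 1 where rep78 is missing
gen high_safe  = (rep78 > 3) if !missing(rep78)   // . where rep78 is missing
assert high_naive == 1 if missing(rep78)
assert missing(high_safe) if missing(rep78)
```

The asserts halt the do-file if the claimed behavior does not hold, which makes them a cheap guard to leave in cleaning scripts.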
= vs ==
= is assignment; == is comparison. Mixing them up is a syntax error or a silent bug.
* WRONG: syntax error
gen employed = 1 if status = 1
* RIGHT
gen employed = 1 if status == 1
Local Macro Syntax
Locals use `name' (backtick + single-quote). Globals use $name or ${name}.
Forgetting the closing quote is the #1 macro bug.
local controls "age education income"
regress wage `controls'     // correct
regress wage `controls      // WRONG: missing closing quote
regress wage 'controls'     // WRONG: wrong quote characters
by Requires Prior Sort (Use bysort)
* WRONG: error if data not sorted by id
by id: gen first = (_n == 1)
* RIGHT: bysort sorts automatically
bysort id: gen first = (_n == 1)
* Also RIGHT: explicit sort
sort id
by id: gen first = (_n == 1)
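A related pattern is controlling the order within each group. In the sketch below, which uses the auto dataset that ships with Stata, a variable placed in parentheses after bysort is used for sorting but not for grouping. The variable names cheapest, rank, and n_group are illustrative.

```stata
sysuse auto, clear

* Group by foreign; within each group, sort by price
bysort foreign (price): gen cheapest = (_n == 1)   // flags the lowest-priced car
bysort foreign (price): gen rank = _n              // price rank within group

* _N is the group size when used under by
bysort foreign: gen n_group = _N

* One flagged car per group (domestic and foreign)
count if cheapest == 1
assert r(N) == 2
```

Ties in price leave the within-group order arbitrary, so add a tie-breaking variable inside the parentheses when the result must be reproducible.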
Factor Variable Notation (i. and c.)
Use i. for categorical, c. for continuous. Omitting i. treats categories as continuous.
* WRONG: treats race as continuous (e.g., race=3 has 3x effect of race=1)
regress wage race education
* RIGHT: creates dummies automatically
regress wage i.race education
* Interactions
regress wage i.race##c.education    // full interaction
regress wage i.race#c.education     // interaction only (no main effects)
generate vs replace
generate creates new variables; replace modifies existing ones. Using generate on an existing variable name is an error.
gen x = 1
gen x = 2        // ERROR: x already defined
replace x = 2    // correct
String Comparison Is Case-Sensitive
* May miss "Male", "MALE", etc.
keep if gender == "male"
* Safer
keep if lower(gender) == "male"
merge: Always Check _merge
merge 1:1 id using other.dta
tab _merge              // always inspect
assert _merge == 3      // or handle mismatches
drop _merge
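When mismatches are expected, they can be handled explicitly instead of asserted away. The sketch below builds two hypothetical toy datasets (ids 1 to 3 in the master, ids 2 to 4 in the using file); the names left and right and the variables x and z are made up for illustration.

```stata
* Hypothetical master data: ids 1-3
clear
input id x
1 10
2 20
3 30
end
tempfile left
save `left'

* Hypothetical using data: ids 2-4
clear
input id z
2 200
3 300
4 400
end
tempfile right
save `right'

use `left', clear
merge 1:1 id using `right'
tab _merge

* Count each outcome before deciding what to keep
count if _merge == 1    // master only (id 1)
count if _merge == 2    // using only (id 4)
count if _merge == 3    // matched (ids 2 and 3)

* Keep matched rows plus unmatched master rows, then drop the indicator
drop if _merge == 2
drop _merge
assert _N == 3
```

Which rows to keep depends on the analysis; the point is that the decision is written down rather than left to the default.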
preserve / restore for Temporary Changes
preserve
collapse (mean) income, by(state)
* ... do something with collapsed data ...
restore    // original data is back
Weights Are Not Interchangeable
fweight: frequency weights (replication)
aweight: analytic/regression weights (inverse variance)
pweight: probability/sampling weights (survey data, implies robust SE)
iweight: importance weights (rarely used)
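A minimal sketch of how the weight types differ in practice, using the auto dataset that ships with Stata. Using rep78 and weight as weights is purely illustrative (they are not real frequency or sampling weights), and the scalar names b_aw and se_aw are hypothetical.

```stata
sysuse auto, clear

* fweight: each observation stands for that many identical copies (integers only)
summarize price [fweight = rep78]

* aweight: rescaled to sum to N; conventional standard errors
regress price mpg [aweight = weight]
scalar b_aw  = _b[mpg]
scalar se_aw = _se[mpg]

* pweight: same point estimates, but robust standard errors are reported
regress price mpg [pweight = weight]
assert reldif(_b[mpg], b_aw) < 1e-8

display "aweight SE: " se_aw "   pweight SE: " _se[mpg]
```

The coefficients agree across aweight and pweight while the standard errors generally do not, which is why the choice of weight type matters for inference even when the estimates look identical.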
capture Swallows Errors
capture some_command
if _rc != 0 {
    di as error "Failed with code: " _rc
    exit _rc
}
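A runnable sketch of the same idea, using the auto dataset that ships with Stata and a deliberately nonexistent variable (not_a_variable is a made-up name). It also shows capture noisily, which keeps the output visible while still preventing the do-file from halting.

```stata
sysuse auto, clear

* capture suppresses both the error and the output; _rc holds the return code
capture confirm variable not_a_variable
if _rc != 0 {
    display as text "Variable not found (return code " _rc ")"
}

* capture noisily shows the error message but lets execution continue
capture noisily regress price not_a_variable
display "Return code: " _rc
```

Checking _rc immediately matters because the next capture overwrites it.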
Line Continuation Uses ///
regress y x1 x2 x3 ///
    x4 x5 x6, ///
    vce(robust)
Stored Results: r() vs e() vs s()
r(): r-class commands (summarize, tabulate, etc.)
e(): e-class commands (estimation: regress, logit, etc.)
s(): s-class commands (parsing)
A new estimation command overwrites previous e() results. Store them first:
regress y x1 x2
estimates store model1
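A minimal sketch of reading and preserving stored results, using the auto dataset that ships with Stata. The names mean_price, m1, and m2 are illustrative.

```stata
sysuse auto, clear

* r-class: results stay in r() until the next r-class command runs
summarize price
local mean_price = r(mean)
display "Mean price: " %9.2f `mean_price'

* e-class: each estimation command replaces the contents of e()
regress price mpg
estimates store m1
regress price mpg weight
estimates store m2

* e() now describes m2; restore m1 to bring its results back
estimates restore m1
display "N = " e(N) ", R-squared = " %5.3f e(r2)

* Compare both stored models side by side
estimates table m1 m2, se
```

Copying r() values into locals right away is the safe habit, since almost any later command can overwrite them.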
Routing Table
Read only the files relevant to the user's task. Paths are relative to this SKILL.md file.
Data Operations
| File | Topics & Key Commands |
|---|---|
| , , , , , basic workflow |
| , , ODBC, , web data |
| , , , , , , , , / |
| Variable types, ////, operators, missing values (), / qualifiers |
| , , , , , regex, Unicode |
| , , / formats, , , business calendars |
| , , , , , , distributions, random numbers |
Statistics & Econometrics
| File | Topics & Key Commands |
|---|---|
| , , , , , weighted stats |
| , , , , , , , |
| , /, Hausman test, , dynamic panels |
| , ARIMA, VAR, , , , forecasting |
| , , , , , , , for nonlinear |
| , , , Monte Carlo |
| , , , complex survey design, replicate weights |
| , , FIML, , diagnostics |
| , custom likelihood functions, , gradient-based optimization |
| , moment conditions, , J-test |
Causal Inference
| File | Topics & Key Commands |
|---|---|
| , , ATE/ATT/ATET |
| DiD, parallel trends, event studies, staggered adoption |
| Sharp/fuzzy RD, bandwidth selection, |
| PSM, nearest neighbor, kernel matching, |
| , , treatment models, exclusion restrictions |
Advanced Methods
| File | Topics & Key Commands |
|---|---|
| , , , Kaplan-Meier, parametric models |
| , , CFA, path analysis, , reliability |
| , rank tests, , |
| , , spatial weights, Moran's I |
| , , , cross-validation |
Graphics
| File | Topics & Key Commands |
|---|---|
| , , , , , , , schemes |
Programming
| File | Topics & Key Commands |
|---|---|
| , , , , , , |
| , , classes, , dialog boxes, / |
| Mata basics, when to use Mata vs ado, data types |
| Mata functions, flow control, structures, pointers |
| Matrix creation, decompositions, solvers, |
| , , , performance tips |
Output & Workflow
| File | Topics & Key Commands |
|---|---|
| , , , LaTeX integration, |
| Project structure, master do-files, version control, debugging, common mistakes |
| Python via , R via , shell commands, Git |
Community Packages
| File | What It Does |
|---|---|
| High-dimensional fixed effects OLS (absorbs multiple FE sets efficiently) |
| /: publication-quality regression tables |
| Alternative regression table exporter (Word, Excel, TeX) |
| One-command Word document creation for any Stata output |
| Cross-tabulations and summary tables to file |
| Coefficient plots from stored estimates |
| , , — better graph themes |
| Modern DiD: , , (Callaway-Sant'Anna, de Chaisemartin-D'Haultfoeuille, Borusyak-Jaravel-Spiess) |
| , — event study estimators |
| Robust RD estimation with optimal bandwidth (, , ) |
| Propensity score matching (nearest neighbor, kernel, radius) |
| Synthetic control method (, ) |
| Enhanced IV/2SLS: , with additional diagnostics |
| Dynamic panel GMM (Arellano-Bond/Blundell-Bond) |
| Binned scatter plots with CI (, ) |
| Nonparametric kernel estimation and inference |
| , , collinearity, heteroskedasticity tests |
| Winsorizing and trimming: , |
| (fast collapse/egen), , |
| , , , finding packages |
Common Patterns
Regression Table Workflow
* Estimate models
eststo clear
eststo: regress y x1 x2, vce(robust)
eststo: regress y x1 x2 x3, vce(robust)
eststo: regress y x1 x2 x3 x4, vce(cluster id)

* Export table
esttab using "results.tex", replace ///
    se star(* 0.10 ** 0.05 *** 0.01) ///
    label booktabs ///
    title("Main Results") ///
    mtitles("(1)" "(2)" "(3)")
Panel Data Setup
xtset panelid timevar    // declare panel structure
xtdescribe               // check balance
xtsum outcome            // within/between variation

* Fixed effects
xtreg y x1 x2, fe vce(cluster panelid)

* Or with reghdfe (preferred for multiple FE)
reghdfe y x1 x2, absorb(panelid timevar) vce(cluster panelid)
Difference-in-Differences
* Classic 2x2 DiD
gen post = (year >= treatment_year)
gen treat_post = treated * post
regress y treated post treat_post, vce(cluster id)

* Modern staggered DiD (Callaway & Sant'Anna)
csdid y x1 x2, ivar(id) time(year) gvar(first_treat) agg(event)
csdid_plot
Graph Export
* Publication-quality scatter with fit line
twoway (scatter y x, mcolor(navy%50) msize(small)) ///
    (lfit y x, lcolor(cranberry) lwidth(medthick)), ///
    title("Title Here") ///
    xtitle("X Label") ytitle("Y Label") ///
    legend(off) scheme(s2color)

graph export "figure1.pdf", replace as(pdf)
graph export "figure1.png", replace as(png) width(2400)
Data Cleaning Pipeline
* Load and inspect
import delimited "raw_data.csv", clear varnames(1)
describe
codebook, compact

* Clean
rename *, lower                      // lowercase all varnames
destring income, replace force       // convert string to numeric
replace income = . if income < 0

* Label
label variable income "Annual household income (USD)"
label define yesno 0 "No" 1 "Yes"
label values employed yesno

* Save
compress
save "clean_data.dta", replace
Multiple Imputation
mi set mlong
mi register imputed income education
mi impute chained (regress) income (ologit) education = age i.gender, add(20) rseed(12345)
mi estimate: regress wage income education age i.gender