Awesome-Agent-Skills-for-Empirical-Research stata
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/32-dylantmoore-stata-skill/plugins/stata/skills/stata" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-stata-9f666d && rm -rf "$T"
skills/32-dylantmoore-stata-skill/plugins/stata/skills/stata/SKILL.md
Stata Skill
You have access to comprehensive Stata reference files. Do not load all files. Read only the 1-3 files relevant to the user's current task using the routing table below.
Critical Gotchas
These are Stata-specific pitfalls that lead to silent bugs. Internalize these before writing any code.
Missing Values Sort to +Infinity
Stata's missing values, . and .a-.z, are greater than all numbers, so any > comparison is true for them.

    * WRONG: includes observations where income is missing!
    gen high_income = (income > 50000)

    * RIGHT
    gen high_income = (income > 50000) if !missing(income)

    * WRONG: missing ages appear in this list
    list if age > 60

    * RIGHT
    list if age > 60 & !missing(age)
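As a quick check of this behavior, the sketch below counts observations with and without the missing() guard. It assumes Stata's bundled auto dataset, where rep78 has some missing values; any variable with missing values would serve equally well.

```stata
sysuse auto, clear

* How many values of rep78 are missing?
count if missing(rep78)
local n_miss = r(N)

* Unguarded comparison: missing counts as larger than 3
count if rep78 > 3
local naive = r(N)

* Guarded comparison: only real values above 3
count if rep78 > 3 & !missing(rep78)
local guarded = r(N)

* The unguarded count exceeds the guarded one by the number of missings
assert `naive' == `guarded' + `n_miss'
display "naive = `naive', guarded = `guarded', missing = `n_miss'"
```

For numeric variables, rep78 < . is an equivalent guard, because every non-missing number sorts below the system missing value.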
= vs ==
= is assignment; == is comparison. Mixing them up is a syntax error or silent bug.

    * WRONG: syntax error
    gen employed = 1 if status = 1

    * RIGHT
    gen employed = 1 if status == 1
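A related point that the RIGHT line above can hide: gen ... = 1 if ... leaves every non-matching observation missing rather than 0. The sketch below contrasts the two forms, assuming the bundled auto dataset, where foreign is a complete 0/1 variable; the variable names flag_if and flag_01 are illustrative.

```stata
sysuse auto, clear

* 1 where the condition holds, missing (not 0) everywhere else
generate byte flag_if = 1 if foreign == 1

* 1 where the condition holds, 0 everywhere else
generate byte flag_01 = (foreign == 1)

count if foreign == 0
local n_domestic = r(N)

* Every domestic car is missing in flag_if ...
count if missing(flag_if)
assert r(N) == `n_domestic'

* ... and 0 in flag_01
count if flag_01 == 0
assert r(N) == `n_domestic'
```

If the condition variable itself can be missing, combine the logical form with the !missing() guard from the previous gotcha.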
Local Macro Syntax
Locals use `name' (backtick + single-quote). Globals use $name or ${name}.
Forgetting the closing quote is the #1 macro bug.

    local controls "age education income"
    regress wage `controls'      // correct
    regress wage `controls       // WRONG: missing closing quote
    regress wage 'controls'      // WRONG: wrong quote characters
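Two further macro behaviors are easy to trip over: nested references expand from the inside out, and an undefined local expands to nothing without raising an error. A minimal sketch, assuming the bundled auto dataset and illustrative macro names:

```stata
sysuse auto, clear

local controls "weight length"
global outcome "price"
regress $outcome `controls'

* Nested references expand from the inside out: `i' becomes 2 first
local var1 "mpg"
local var2 "turn"
local i 2
display "`var`i''"              // shows: turn

* An undefined local expands to nothing and raises no error
display "[`no_such_local']"     // shows: []
```

Because of the second behavior, a misspelled macro name in a regression silently drops the intended variables instead of stopping the script.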
by Requires Prior Sort (Use bysort)

    * WRONG: error if data not sorted by id
    by id: gen first = (_n == 1)

    * RIGHT: bysort sorts automatically
    bysort id: gen first = (_n == 1)

    * Also RIGHT: explicit sort
    sort id
    by id: gen first = (_n == 1)
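When the order within each group matters (which row counts as "first"), bysort accepts a secondary sort key in parentheses. The sketch below assumes the bundled auto dataset; the generated variable names are illustrative.

```stata
sysuse auto, clear

* Sort by foreign, then by price within foreign; only foreign defines the groups
bysort foreign (price): generate byte cheapest = (_n == 1)
bysort foreign (price): generate rank_in_group = _n

* The data are already sorted by foreign, so plain by works here
by foreign: generate group_size = _N

* One cheapest car per group, two groups
count if cheapest
assert r(N) == 2
```

Without the parenthesized key, the order of rows inside each group is whatever the sort happened to produce, so _n == 1 is not guaranteed to pick the same row across runs.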
Factor Variable Notation (i. and c.)
Use i. for categorical, c. for continuous. Omitting i. treats categories as continuous.

    * WRONG: treats race as continuous (e.g., race=3 has 3x effect of race=1)
    regress wage race education

    * RIGHT: creates dummies automatically
    regress wage i.race education

    * Interactions
    regress wage i.race##c.education    // full interaction
    regress wage i.race#c.education     // interaction only (no main effects)
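The same notation also controls the base category and works with post-estimation commands. A short sketch, assuming the bundled auto dataset (rep78 as the categorical variable, mpg as the continuous one):

```stata
sysuse auto, clear

regress price i.rep78 mpg        // base level is the lowest value, rep78 == 1
regress price ib3.rep78 mpg      // choose rep78 == 3 as the base instead
testparm i.rep78                 // joint test that all rep78 indicators are zero

* c. is required to interact a continuous variable with itself
regress price c.mpg##c.mpg i.foreign
margins, dydx(mpg)               // average marginal effect, including the squared term
```

Writing the squared term as c.mpg##c.mpg, rather than generating mpg2 by hand, is what lets margins account for both terms when computing the marginal effect.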
generate vs replace
generate creates new variables; replace modifies existing ones. Using generate on an existing variable name is an error.

    gen x = 1
    gen x = 2        // ERROR: x already defined
    replace x = 2    // correct
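For scripts that may be rerun in the same session, one common idiom is to test whether the variable exists and branch accordingly. This is a sketch under the assumption that the bundled auto dataset is in memory; price_k is a hypothetical variable name.

```stata
sysuse auto, clear

* confirm variable sets _rc to 0 if the variable exists, non-zero otherwise
capture confirm variable price_k
if _rc != 0 {
    generate double price_k = price / 1000
}
else {
    replace price_k = price / 1000
}

* Second pass: price_k now exists, so the check succeeds
capture confirm variable price_k
assert _rc == 0
```

A blunter alternative is capture drop price_k followed by generate, which always rebuilds the variable from scratch.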
String Comparison Is Case-Sensitive
    * May miss "Male", "MALE", etc.
    keep if gender == "male"

    * Safer
    keep if lower(gender) == "male"
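Stray whitespace causes the same kind of silent mismatch as case differences. The sketch below builds a small illustrative dataset (the values are made up for the example) and compares an exact match with a normalized one using lower() and strtrim().

```stata
clear
input str8 gender
"male"
"Male"
" MALE "
"female"
end

* Exact match finds only the first row
count if gender == "male"
assert r(N) == 1

* Normalizing case and trimming whitespace finds all three
count if lower(strtrim(gender)) == "male"
assert r(N) == 3
```

Running tab gender before filtering is a cheap way to see which spellings actually occur in the data.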
merge: Always Check _merge
Never skip tab _merge. It costs nothing, and it is the only breakdown of matched and unmatched observations you get when assert fails.

    merge 1:1 id using other.dta
    tab _merge                  // ALWAYS tab before assert
    assert _merge == 3          // on failure, reports only a count of contradictions
    drop _merge
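To see what tab _merge reports when the two files do not line up, the sketch below merges two tiny made-up datasets that share only two of their ids. The data and the tempfile name using_data are illustrative.

```stata
* Using dataset: ids 1, 2, 3
clear
input id x
1 10
2 20
3 30
end
tempfile using_data
save `using_data'

* Master dataset: ids 1, 2, 4
clear
input id y
1 5
2 6
4 7
end

merge 1:1 id using `using_data'
tab _merge

* id 4 is master only, id 3 is using only, ids 1 and 2 match
count if _merge == 3
assert r(N) == 2
```

Here assert _merge == 3 would stop the script, and the tab output is what shows that one unmatched row comes from each side.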
preserve / restore + tempfile for Collapse-Merge-Back
The standard pattern for computing group stats and merging them onto the original data:

    tempfile stats
    preserve
    collapse (mean) avg_x=x, by(group)
    save `stats'
    restore
    merge m:1 group using `stats'
    tab _merge
    assert _merge == 3
    drop _merge

For simple group means, bysort group: egen avg_x = mean(x) avoids the round-trip entirely.
Weights Are Not Interchangeable
- fweight: frequency weights (replication)
- aweight: analytic/regression weights (inverse variance)
- pweight: probability/sampling weights (survey data, implies robust SE)
- iweight: importance weights (rarely used)
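The difference is visible even in summary statistics. The sketch below uses rep78 from the bundled auto dataset as a stand-in weight, which is purely an assumption for illustration since rep78 is not a real weight variable.

```stata
sysuse auto, clear

* fweight: each observation stands for rep78 identical copies
summarize mpg [fweight = rep78]
local n_f = r(N)
local mean_f = r(mean)

* aweight: same weighted mean, but N stays the number of observations
summarize mpg [aweight = rep78]
local n_a = r(N)
local mean_a = r(mean)

assert `n_f' > `n_a'
assert reldif(`mean_f', `mean_a') < 1e-8
display "fweight N = `n_f', aweight N = `n_a'"

* pweight: the regression output reports robust standard errors
regress price mpg [pweight = rep78]
```

The point estimates agree across weight types here, but the implied sample size, and therefore the standard errors, do not.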
capture Swallows Errors

    capture some_command
    if _rc != 0 {
        di as error "Failed with code: " _rc
        exit _rc
    }
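Two refinements to this pattern are often useful: capture noisily keeps the error message visible, and copying _rc into a local protects it from being overwritten by a later capture. A sketch assuming the bundled auto dataset and a deliberately nonexistent variable name:

```stata
sysuse auto, clear

* capture noisily still shows the error message but lets the script continue
capture noisily regress price no_such_var

* _rc is overwritten by the next capture, so copy it right away
local rc = _rc

if `rc' != 0 {
    display as error "regress failed with return code `rc'"
}

* A successful captured command sets _rc to 0
capture confirm variable price
assert _rc == 0
```

Whether to continue or exit after a failure is a design choice; the sketch only reports the code, whereas the pattern above exits with it.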
Line Continuation Uses ///
    regress y x1 x2 x3 ///
        x4 x5 x6, ///
        vce(robust)
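For long commands, an alternative to /// is to switch the command delimiter to a semicolon. This only works inside a do-file, not at the interactive prompt. A minimal sketch, assuming the bundled auto dataset:

```stata
sysuse auto, clear

* Make the semicolon the end-of-command marker
#delimit ;
regress price mpg weight
    length turn,
    vce(robust) ;
#delimit cr

* Back to one command per line
display e(N)
```

Remember to switch back with #delimit cr, otherwise every later command needs a terminating semicolon.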
Stored Results: r() vs e() vs s()
- r(): r-class commands (summarize, tabulate, etc.)
- e(): e-class commands (estimation: regress, logit, etc.)
- s(): s-class commands (parsing)
A new estimation command overwrites previous e() results. Store them first:

    regress y x1 x2
    estimates store model1
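The same overwrite rule applies to r(), so values needed later should be copied into a local immediately. The sketch below shows both cases, assuming the bundled auto dataset; the names m1 and m2 are illustrative.

```stata
sysuse auto, clear

* r(): copy what you need before the next r-class command replaces it
summarize price
local mean_price = r(mean)
count if price > `mean_price'
display "cars priced above the mean: " r(N)

* e(): store each model, then recall it by name
regress price mpg
estimates store m1
regress price mpg weight
estimates store m2

estimates restore m1
display "m1 uses " e(N) " observations; coefficient on mpg = " _b[mpg]
estimates table m1 m2, se
```

return list and ereturn list show everything currently held in r() and e(), which helps when you are unsure what a command stores.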
Running Stata from the Command Line
Claude can execute Stata code by running .do files in batch mode from the terminal. This is how to run Stata non-interactively.
Finding the Stata Binary
Stata on macOS is a .app bundle. The actual binary is inside it. Common locations:

    # Stata 18 / StataNow (most common)
    /Applications/Stata/StataMP.app/Contents/MacOS/stata-mp
    /Applications/StataNow/StataMP.app/Contents/MacOS/stata-mp

    # Other editions (SE, BE)
    /Applications/Stata/StataSE.app/Contents/MacOS/stata-se
    /Applications/Stata/StataBE.app/Contents/MacOS/stata-be

If Stata isn't on $PATH, find it with: mdfind -name "stata-mp" | grep MacOS
Batch Mode (-b)

    # Run a .do file in batch mode; output goes to <filename>.log
    /Applications/Stata/StataMP.app/Contents/MacOS/stata-mp -b do analysis.do

    # If stata-mp is on PATH (e.g., via symlink or alias):
    stata-mp -b do analysis.do

- -b = batch mode (non-interactive, no GUI)
- Output (everything Stata would display) is written to analysis.log in the working directory
- Exit code is meant to be 0 on success and non-zero on error, but Stata often exits with 0 even when the do-file stops on an error, so do not rely on it alone
- The log file contains all output, including error messages; check it after execution
Running Inline Stata Code
To run a quick Stata snippet without keeping a permanent .do file:

    # Write a temp .do file and run it
    cat > /tmp/stata_run.do << 'EOF'
    sysuse auto, clear
    summarize price mpg
    EOF
    stata-mp -b do /tmp/stata_run.do
    cat stata_run.log    # the log is written to the current directory, not /tmp
Checking Results
    # Check if it succeeded
    stata-mp -b do tests/run_tests.do && echo "SUCCESS" || echo "FAILED"

    # Search the log for pass/fail
    grep -E "PASS|FAIL|error|r\([0-9]+\)" run_tests.log
Tips
- clear all at the top of batch scripts: batch mode starts with a fresh Stata session, but clear all ensures no stale state if the script is rerun within an existing session.
- set more off: prevents Stata from pausing for --more-- prompts (fatal in batch mode).
- Log files overwrite silently: analysis.do always writes to analysis.log in the current directory. If you run multiple .do files, check the right log.
- Working directory: Stata's working directory is wherever you run the command from, not where the .do file lives. Use cd in the .do file or absolute paths if needed.
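Taken together, these tips suggest a standard header for any script run with -b. The following is a minimal sketch, not a required layout: the project path and the ANALYSIS_DONE marker are hypothetical, and the body uses the bundled auto dataset as a placeholder for real work.

```stata
* analysis.do: a minimal header for a script run with -b

clear all
set more off

* Hypothetical project path; uncomment and adjust if the script uses relative paths
* cd "/path/to/project"

* Record when and where the script ran, so the log is self-describing
display "Started on `c(current_date)' at `c(current_time)'"
display "Working directory: `c(pwd)'"

sysuse auto, clear
summarize price mpg

* Sentinel line: search the log for it to confirm the script ran to the end
display "ANALYSIS_DONE"
```

Because Stata stops at the first error, the sentinel appears in the log only when every earlier command succeeded.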
Routing Table
Read only the files relevant to the user's task. Paths are relative to this SKILL.md file.
Data Operations
| File | Topics & Key Commands |
|---|---|
| , , , , , basic workflow |
| , , ODBC, , web data |
| , , , , , , , , / |
| Variable types, ////, operators, missing values (), / qualifiers |
| , , , , , regex, Unicode |
| , , / formats, , , business calendars |
| , , , , , , distributions, random numbers |
Statistics & Econometrics
| File | Topics & Key Commands |
|---|---|
| , , , , , weighted stats |
| , , , , , , , |
| , /, Hausman test, , dynamic panels |
| , ARIMA, VAR, , , , forecasting |
| , , , , , , , for nonlinear |
| , , , Monte Carlo |
| , , , complex survey design, replicate weights |
| , , FIML, , diagnostics |
| , custom likelihood functions, , gradient-based optimization |
| , moment conditions, , J-test |
Causal Inference
| File | Topics & Key Commands |
|---|---|
| , , ATE/ATT/ATET |
| DiD, parallel trends, event studies, staggered adoption |
| Sharp/fuzzy RD, bandwidth selection, |
| PSM, nearest neighbor, kernel matching, |
| , , treatment models, exclusion restrictions |
Advanced Methods
| File | Topics & Key Commands |
|---|---|
| , , , Kaplan-Meier, parametric models |
| , , CFA, path analysis, , reliability |
| , rank tests, , |
| , , spatial weights, Moran's I |
| , , , cross-validation |
Graphics
| File | Topics & Key Commands |
|---|---|
| , , , , , , , schemes |
Programming
| File | Topics & Key Commands |
|---|---|
| , , , , , , |
| , , classes, , dialog boxes, / |
| Mata basics, when to use Mata vs ado, data types |
| Mata functions, flow control, structures, pointers |
| Matrix creation, decompositions, solvers, |
| , , , performance tips |
Output & Workflow
| File | Topics & Key Commands |
|---|---|
| , , , LaTeX integration, |
| Project structure, master do-files, version control, debugging, common mistakes |
| Python via , R via , shell commands, Git |
| User wants to report a Stata skill documentation gap or error to the repository |
Community Packages
| File | What It Does |
|---|---|
| High-dimensional fixed effects OLS (absorbs multiple FE sets efficiently) |
| /: publication-quality regression tables |
| Alternative regression table exporter (Word, Excel, TeX) |
| One-command Word document creation for any Stata output |
| Cross-tabulations and summary tables to file |
| Coefficient plots from stored estimates |
| , , — better graph themes |
| Modern DiD: , , (Callaway-Sant'Anna, de Chaisemartin-D'Haultfoeuille, Borusyak-Jaravel-Spiess) |
| , — event study estimators |
| Robust RD estimation with optimal bandwidth (, , ) |
| Propensity score matching (nearest neighbor, kernel, radius) |
| Synthetic control method (, ) |
| Enhanced IV/2SLS: , with additional diagnostics |
| Dynamic panel GMM (Arellano-Bond/Blundell-Bond) |
| Binned scatter plots with CI (, ) |
| Nonparametric kernel estimation and inference |
| , , collinearity, heteroskedasticity tests |
| Winsorizing and trimming: , |
| (fast collapse/egen), , |
| , , , finding packages |
Common Patterns
Regression Table Workflow
    * Estimate models
    eststo clear
    eststo: regress y x1 x2, vce(robust)
    eststo: regress y x1 x2 x3, vce(robust)
    eststo: regress y x1 x2 x3 x4, vce(cluster id)

    * Export table
    esttab using "results.tex", replace ///
        se star(* 0.10 ** 0.05 *** 0.01) ///
        label booktabs ///
        title("Main Results") ///
        mtitles("(1)" "(2)" "(3)")
Panel Data Setup
    xtset panelid timevar          // declare panel structure
    xtdescribe                     // check balance
    xtsum outcome                  // within/between variation

    * Fixed effects
    xtreg y x1 x2, fe vce(cluster panelid)

    * Or with reghdfe (preferred for multiple FE)
    reghdfe y x1 x2, absorb(panelid timevar) vce(cluster panelid)
Difference-in-Differences
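    * Classic 2x2 DiD
    gen post = (year >= treatment_year)
    gen treat_post = treated * post
    regress y treated post treat_post, vce(cluster id)

    * Event study (uniform timing; must interact with treatment group)
    * Factor variables cannot be negative, so shift event time first
    * (the offset 10 is illustrative; it must be at least the longest lead)
    gen rel_shift = rel_time + 10        // rel_time == -1 becomes 9
    reghdfe y ib9.rel_shift#1.treated, absorb(id year) vce(cluster id)
    * Joint test of all event-time terms; restrict to the leads for a pre-trend test
    testparm *.rel_shift#1.treated

    * Modern staggered DiD (Callaway & Sant'Anna)
    csdid y x1 x2, ivar(id) time(year) gvar(first_treat) agg(event)
    csdid_plot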
Graph Export
    * Publication-quality scatter with fit line
    twoway (scatter y x, mcolor(navy%50) msize(small)) ///
        (lfit y x, lcolor(cranberry) lwidth(medthick)), ///
        title("Title Here") ///
        xtitle("X Label") ytitle("Y Label") ///
        legend(off) scheme(s2color)

    graph export "figure1.pdf", replace as(pdf)
    graph export "figure1.png", replace as(png) width(2400)
Data Cleaning Pipeline
    * Load and inspect
    import delimited "raw_data.csv", clear varnames(1)
    describe
    codebook, compact

    * Clean
    rename *, lower                      // lowercase all varnames
    destring income, replace force       // convert string to numeric
    replace income = . if income < 0

    * Label
    label variable income "Annual household income (USD)"
    label define yesno 0 "No" 1 "Yes"
    label values employed yesno

    * Save
    compress
    save "clean_data.dta", replace
Multiple Imputation
    mi set mlong
    mi register imputed income education
    mi impute chained (regress) income (ologit) education = age i.gender, ///
        add(20) rseed(12345)
    mi estimate: regress wage income education age i.gender
Help Improve This Skill
If you produce Stata code with a significant error — wrong syntax, incorrect command usage, or a gotcha you failed to catch — and the issue seems to stem from a gap in these reference files rather than a one-off mistake, consider suggesting to the user that they file an issue on the skill repository. This helps future users.
When to raise this: Only after you've already corrected the error and the user has working code. Frame it as optional: "I made an error with [X] that I think comes from a gap in the Stata skill documentation. If you'd like, I can help you file an issue or a PR so it gets fixed for everyone."
When NOT to raise this: If the user is on Claude Haiku, the error is more likely a model capability issue than a documentation gap. In that case, suggest they try Sonnet or Opus for complex Stata work instead of filing an issue.
If the user agrees, read references/filing-issues.md for instructions on writing a good issue report.