Awesome-Agent-Skills-for-Empirical-Research stata-reference-guide
Comprehensive Stata reference covering syntax, econometrics, and 20+ packages
install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/analysis/econometrics/stata-reference-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-stata-reference-g && rm -rf "$T"
manifest: skills/43-wentorai-research-plugins/skills/analysis/econometrics/stata-reference-guide/SKILL.md
source content
Stata Comprehensive Reference Guide
Overview
Stata is one of the most widely used statistical packages in economics, political science, public health, and sociology research. This guide is a reference covering core syntax, data management, estimation commands, causal inference methods, graphics, Mata programming, and 20+ community-contributed packages. It is designed as a progressive-disclosure reference: jump to the section relevant to your current task rather than reading end to end.
Core Syntax and Data Management
Data Import and Export
* Import CSV with variable names in first row
import delimited "data.csv", clear varnames(1)

* Import Excel (specific sheet and cell range)
import excel "workbook.xlsx", sheet("Sheet1") cellrange(A1:Z1000) firstrow clear

* Import Stata format
use "dataset.dta", clear

* Export to CSV
export delimited "output.csv", replace

* Save as Stata format
save "cleaned_data.dta", replace
Variable Management
* Generate new variables
gen log_income = ln(income)
gen age_sq = age^2
gen treatment_post = treatment * post

* Recode and label
recode education (1/12 = 1 "HS or less") (13/16 = 2 "College") (17/20 = 3 "Graduate"), gen(edu_cat)
label variable edu_cat "Education Category"

* String operations
gen first_name = word(full_name, 1)
gen year_str = string(year)
destring price_str, gen(price) force

* Date handling
gen date = date(date_str, "YMD")
format date %td
gen year = year(date)
gen quarter = quarter(date)
Data Cleaning Patterns
* Identify and handle duplicates
duplicates report id year
duplicates tag id year, gen(dup_flag)
duplicates drop id year, force

* Missing values
misstable summarize
misstable patterns
replace income = . if income < 0    // recode impossible values

* Merge datasets
merge 1:1 id year using "panel_data.dta", keep(match master) nogen
merge m:1 state year using "state_controls.dta", keep(match master) nogen

* Reshape between wide and long (the stub must match in both directions)
reshape long income_, i(id) j(year)
reshape wide income_, i(id) j(year)

* Collapse to group level
collapse (mean) avg_income=income (sd) sd_income=income (count) n=income, by(state year)
Estimation Commands
Linear Regression
* OLS with robust standard errors
reg y x1 x2 x3, robust

* Clustered standard errors
reg y x1 x2 x3, cluster(firm_id)

* Fixed effects (within estimator)
xtset firm_id year              // must declare panel structure first
xtreg y x1 x2 x3, fe cluster(firm_id)

* Absorbing high-dimensional FE (reghdfe)
reghdfe y x1 x2 x3, absorb(firm_id year) cluster(firm_id)

* Instrumental variables (2SLS)
ivregress 2sls y x1 x2 (endog_var = instrument1 instrument2), robust
estat firststage
estat overid
Panel Data Methods
* Panel setup
xtset firm_id year

* Hausman test (FE vs RE)
quietly xtreg y x1 x2, fe
estimates store fe
quietly xtreg y x1 x2, re
estimates store re
hausman fe re

* Dynamic panel GMM (xtabond2)
xtabond2 y L.y x1 x2, gmm(L.y, lag(2 4)) iv(x1 x2) robust twostep

* Tests for serial correlation and overidentification
* Note: xtabond2 reports the Arellano-Bond AR tests and the Sargan/Hansen
* tests in its own output. The estat commands below apply after Stata's
* built-in xtabond, xtdpd, and xtdpdsys instead.
estat abond                     // Arellano-Bond test
estat sargan                    // Sargan test
Causal Inference
* Difference-in-Differences
gen did = treatment * post
reg y did treatment post controls, cluster(state)

* Modern DiD with staggered treatment (csdid)
csdid y x1 x2, ivar(id) time(year) gvar(first_treat) method(dripw)
estat event                     // aggregate ATT(g,t) into event-study estimates
csdid_plot                      // event study plot

* Regression Discontinuity (rdrobust)
rdrobust y running_var, c(0) p(1) kernel(triangular)
rdplot y running_var, c(0) p(1)

* Propensity Score Matching (psmatch2)
psmatch2 treatment x1 x2 x3, outcome(y) logit caliper(0.05) common
pstest x1 x2 x3                 // balance check

* Synthetic Control (synth); the panel must be declared with tsset first
tsset id year                   // panel and time variable names are illustrative
synth y x1 x2 x3 y(1990) y(1991) y(1992), trunit(1) trperiod(1993) fig
Limited Dependent Variables
* Logit/Probit
logit binary_y x1 x2, robust
margins, dydx(*)                // average marginal effects
probit binary_y x1 x2, robust
margins, dydx(*)

* Ordered logit
ologit ordered_y x1 x2, robust
margins, predict(outcome(3)) dydx(x1)

* Tobit (censored regression)
tobit y x1 x2, ll(0)

* Poisson and Negative Binomial
poisson count_y x1 x2, robust
nbreg count_y x1 x2, robust
Community Packages (20+)
Installation
* Install from SSC (Statistical Software Components)
ssc install reghdfe
ssc install estout
ssc install coefplot
ssc install drdid               // required by csdid
ssc install csdid
ssc install rdrobust
ssc install psmatch2
ssc install synth
ssc install ivreg2
ssc install xtabond2
ssc install winsor2
ssc install gtools
ssc install ftools
ssc install binscatter
ssc install binsreg
ssc install grstyle

* Install from GitHub
* (verify the repository URL before use; did_multiplegt is also on SSC)
net install did_multiplegt, from("https://raw.githubusercontent.com/chaisemartinDehejia/did_multiplegt/main")
Publication-Quality Output
* estout / esttab: formatted regression tables
eststo clear
eststo: reg y x1 x2, robust
eststo: reg y x1 x2 x3, robust
eststo: reg y x1 x2 x3, cluster(firm_id)

esttab, se star(* 0.10 ** 0.05 *** 0.01) ///
    title("Main Results") label replace ///
    scalars("r2 R-squared" "N Observations")

* Export to LaTeX
esttab using "table1.tex", replace booktabs ///
    se star(* 0.10 ** 0.05 *** 0.01) label

* Export to CSV/Excel
esttab using "table1.csv", replace se

* coefplot: coefficient visualization
* Legend labels are set per model, because legend(order()) also counts
* the confidence-interval plots and would mislabel the keys.
coefplot (est1, label("Model 1")) (est2, label("Model 2")) (est3, label("Model 3")), ///
    drop(_cons) xline(0) title("Coefficient Estimates")
Graphics
* Scatter with fit line
twoway (scatter y x) (lfit y x), title("Y vs X") ///
    xtitle("X Variable") ytitle("Y Variable")

* Event study plot
coefplot, vertical drop(_cons) yline(0) ///
    title("Event Study") xtitle("Periods Relative to Treatment")

* Binned scatter (binscatter)
binscatter y x, controls(z1 z2) nquantiles(20) ///
    title("Binned Scatter") xtitle("X") ytitle("Y")

* Kernel density
kdensity income if year==2020, normal ///
    title("Income Distribution") xtitle("Income")

* Graph styling (grstyle)
grstyle init
grstyle set plain, horizontal grid
grstyle color background white
grstyle set color economist
Mata Programming
* Basic Mata usage
mata:
    // Matrix operations
    X = st_data(., ("x1", "x2", "x3"))
    y = st_data(., "y")
    n = rows(X)

    // OLS by hand
    X = X, J(n, 1, 1)           // add constant
    beta = invsym(X'X) * X'y
    e = y - X * beta
    sigma2 = (e'e) / (n - cols(X))
    V = sigma2 * invsym(X'X)
    se = sqrt(diagonal(V))
    beta, se
end
Workflow Best Practices
- Always set a random seed before any procedure involving randomness: set seed 12345
- Use preserve / restore for temporary data manipulations within a do-file
- Log your sessions: log using "analysis_log.smcl", replace
- Version control: Start do-files with version 17 (or your version) for reproducibility
- Use tempfiles for intermediate datasets: tempfile merged, then save `merged'
- Profile your code with timer on 1 / timer off 1 / timer list for long-running operations
- Use gtools (greshape, gcollapse, gegen) for 5-10x speedups on large datasets
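The practices above can be combined into one do-file skeleton. The following is a minimal sketch, not a prescribed template: it assumes Stata 17 and uses the auto dataset that ships with Stata, and the names group_means and avg_price are illustrative. Run it as a do-file so the tempfile macro stays in scope.

```stata
* Sketch of a do-file that applies the workflow practices listed above.
* Assumptions: Stata 17, the built-in auto dataset, illustrative names.
version 17                      // pin interpreter behavior for reproducibility
clear all
set seed 12345                  // fix the seed before any random procedure

capture log close               // close a log left open by an earlier run
log using "analysis_log.smcl", replace

sysuse auto, clear

* Temporary manipulation: compute group means without losing the full data
preserve
    collapse (mean) avg_price = price, by(foreign)
    tempfile group_means        // erased automatically when the do-file ends
    save `group_means'
restore

* Bring the intermediate result back into the original data
merge m:1 foreign using `group_means', nogenerate

* Time a long-running step (a small bootstrap stands in for one here)
timer clear 1
timer on 1
bootstrap, reps(50): regress price mpg weight
timer off 1
timer list 1

log close
```

Because the seed is set before bootstrap, rerunning the file should reproduce the same bootstrap standard errors. The tempfile avoids leaving intermediate datasets on disk.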
References
- Stata Official Documentation
- UCLA Stata FAQ
- Stata Journal
- SSC Archive
- dylantmoore/stata-skill — Source for this reference