Awesome-Agent-Skills-for-Empirical-Research stata

install

source · Clone the upstream repo

git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/14-luischanci-claude-code-research-starter/dot-claude/skills/stata" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-stata && rm -rf "$T"

manifest: skills/14-luischanci-claude-code-research-starter/dot-claude/skills/stata/SKILL.md

source content

Stata Skill

You have access to comprehensive Stata reference files. Do not load all files. Read only the 1-3 files relevant to the user's current task using the routing table below.

Critical Gotchas

These are Stata-specific pitfalls that lead to silent bugs. Internalize these before writing any code.

Missing Values Sort to +Infinity

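Stata's system missing value `.` (and the extended missing values `.a` through `.z`) is greater than every number, so any `>` comparison is true for missing observations.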

* WRONG — includes observations where income is missing!
gen high_income = (income > 50000)

* RIGHT
gen high_income = (income > 50000) if !missing(income)

* WRONG — missing ages appear in this list
list if age > 60

* RIGHT
list if age > 60 & !missing(age)
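A minimal sketch of the pitfall on a four-row toy dataset (the values are made up for illustration): the naive flag marks both missing rows as high income, while the guarded version leaves them missing.

```stata
* Toy data: two real incomes, one system missing (.), one extended missing (.a)
clear
input income
30000
60000
.
.a
end

gen byte naive = (income > 50000)                      // 1 for 60000, ., and .a
gen byte safe  = (income > 50000) if !missing(income)  // missing rows stay missing

count if naive == 1    // counts 3 observations
count if safe == 1     // counts 1 observation
```

Comparing the two counts on real data is a quick way to check whether a threshold variable has silently absorbed missing values.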

`=` vs `==`

`=` is assignment; `==` is comparison. Mixing them up is a syntax error or silent bug.

* WRONG — syntax error
gen employed = 1 if status = 1

* RIGHT
gen employed = 1 if status == 1

Local Macro Syntax

Locals use `` `name' `` (backtick + single-quote). Globals use `$name` or `${name}`. Forgetting the closing quote is the #1 macro bug.

local controls "age education income"
regress wage `controls'        // correct
regress wage `controls         // WRONG — missing closing quote
regress wage 'controls'        // WRONG — wrong quote characters
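As a small illustration of the two expansion syntaxes, the sketch below defines one local and one global and displays them. The macro names and values (`controls`, `outdir`, `results`) are placeholders chosen for this example. Run it in a single do-file pass, because locals vanish when the do-file or interactive block ends.

```stata
* Hypothetical macro names and values, for illustration only
local controls "age education"
global outdir "results"

display "`controls'"           // local: opening backtick, closing single quote
display "$outdir/table.tex"    // global: the name ends at the first non-name character
display "${outdir}_v2"         // braces mark where the global's name ends

* Without braces, $outdir_v2 would look up a different global named outdir_v2,
* which is undefined here and expands to an empty string.
display "[$outdir_v2]"
```

Using `${name}` whenever the global is followed directly by text avoids that last failure, which raises no error.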

`by` Requires Prior Sort (Use `bysort`)

* WRONG — error if data not sorted by id
by id: gen first = (_n == 1)

* RIGHT — bysort sorts automatically
bysort id: gen first = (_n == 1)

* Also RIGHT — explicit sort
sort id
by id: gen first = (_n == 1)
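A related point worth sketching: when "first" should mean the earliest record within each group, `bysort` accepts a secondary sort key in parentheses. The toy panel below (the `id` and `year` values are invented for this example) groups by `id` and orders by `year` within each group.

```stata
* Toy data entered deliberately out of order
clear
input id year
2 2001
1 2002
1 2001
2 2000
end

* Group by id, order by year within id; year is a sort key, not a grouping key
bysort id (year): gen byte first = (_n == 1)
bysort id (year): gen n_in_group = _N

list, sepby(id)
```

Without `(year)`, the order of rows inside each `id` is whatever the sort happens to produce, so `_n == 1` would not reliably pick the earliest year.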

Factor Variable Notation (`i.` and `c.`)

Use `i.` for categorical, `c.` for continuous. Omitting `i.` treats categories as continuous.

* WRONG — treats race as continuous (e.g., race=3 has 3x effect of race=1)
regress wage race education

* RIGHT — creates dummies automatically
regress wage i.race education

* Interactions
regress wage i.race##c.education    // full interaction
regress wage i.race#c.education     // interaction only (no main effects)
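To make the `##` shorthand concrete, the sketch below uses the `auto` dataset that ships with Stata (rather than the hypothetical `wage` data above) and checks that `##` gives the same fit as writing out main effects plus the `#` interaction. It also shows `ib#.` for choosing the base category.

```stata
sysuse auto, clear

* a##b expands to a, b, and a#b
regress price i.foreign##c.mpg
local r2_full = e(r2)

regress price i.foreign c.mpg i.foreign#c.mpg
display "R-squared difference: " reldif(e(r2), `r2_full')

* ib3. makes category 3 the omitted base level instead of the lowest value
regress price ib3.rep78 mpg
```

The two regressions are the same model, so the displayed difference should be zero up to rounding.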

`generate` vs `replace`

`generate` creates new variables; `replace` modifies existing ones. Using `generate` on an existing variable name is an error.

gen x = 1
gen x = 2          // ERROR: x already defined
replace x = 2      // correct

String Comparison Is Case-Sensitive

* May miss "Male", "MALE", etc.
keep if gender == "male"

* Safer
keep if lower(gender) == "male"
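Stray whitespace causes the same kind of silent mismatch as case. A minimal sketch on made-up values, assuming a Stata version that provides `strtrim()` (older releases use `trim()`):

```stata
* Toy data with inconsistent case and padding
clear
input str8 gender
"male"
"Male"
" MALE "
"female"
end

* Normalize padding and case before comparing
gen byte is_male = (lower(strtrim(gender)) == "male")
tab gender is_male
```

Tabulating the raw variable against the cleaned flag shows exactly which spellings were folded together.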

`merge` Always Check `_merge`

merge 1:1 id using other.dta
tab _merge                      // always inspect
assert _merge == 3              // or handle mismatches
drop _merge
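When mismatches are expected, `assert _merge == 3` would stop the do-file, so the non-matches need to be handled explicitly. The sketch below builds two tiny datasets (the ids and values are invented) and keeps only matched rows. Run it as one block, since `tempfile` names are local macros.

```stata
* Using dataset: ids 1-3
clear
input id x
1 10
2 20
3 30
end
tempfile using_data
save `using_data'

* Master dataset: ids 2-4
clear
input id y
2 200
3 300
4 400
end

merge 1:1 id using `using_data'
tab _merge                 // 1 = master only, 2 = using only, 3 = matched
keep if _merge == 3        // a deliberate choice: drop unmatched rows
drop _merge
```

Whether to keep, drop, or investigate the unmatched rows depends on the analysis. The point is that the decision is visible in the code.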

`preserve` / `restore` for Temporary Changes

preserve
collapse (mean) income, by(state)
* ... do something with collapsed data ...
restore   // original data is back
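A runnable version of the same pattern on the shipped `auto` dataset, with a check that the original observations really are back after `restore`:

```stata
sysuse auto, clear
local n_before = _N

preserve
collapse (mean) price, by(foreign)   // data in memory is now 2 rows
list
restore

display "Rows before: `n_before', rows after restore: " _N
```

If a do-file ends or errors out between `preserve` and `restore`, Stata restores the data automatically.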

Weights Are Not Interchangeable

`fweight`: frequency weights (replication)
`aweight`: analytic/regression weights (inverse variance)
`pweight`: probability/sampling weights (survey data, implies robust SE)
`iweight`: importance weights (rarely used)
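One way to see what "replication" means for `fweight`: a frequency-weighted summary should match the unweighted summary after physically duplicating each row. A minimal sketch on made-up data (`y` and `freq` are hypothetical names):

```stata
* Each row stands for freq identical observations
clear
input y freq
1 2
4 1
7 3
end

summarize y [fweight = freq]
local mean_weighted = r(mean)
local n_weighted    = r(N)

* Physically replicate the rows and summarize without weights
expand freq
summarize y
display "Weighted mean: `mean_weighted', expanded mean: " r(mean)
```

The same equivalence does not hold for `aweight` or `pweight`, which is why swapping weight types changes standard errors even when point estimates agree.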

`capture` Swallows Errors

capture some_command
if _rc != 0 {
    di as error "Failed with code: " _rc
    exit _rc
}
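Two refinements are sketched here: `capture noisily` still prints the error message, and copying `_rc` into a local right away protects it from being overwritten by a later `capture`. The file name is a deliberately nonexistent placeholder, and falling back to `sysuse auto` is only an illustration of a recovery path.

```stata
* Hypothetical file that is assumed not to exist
capture noisily use "no_such_file.dta", clear
local rc = _rc                      // save immediately; the next capture resets _rc

if `rc' != 0 {
    display as text "use failed with return code `rc'; loading fallback data"
    sysuse auto, clear
}
```

Return code 601 is Stata's "file not found" error, so a script can branch on that value specifically and still stop on anything unexpected.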

Line Continuation Uses `///`

regress y x1 x2 x3 ///
    x4 x5 x6, ///
    vce(robust)
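For comparison, do-files can also switch the command delimiter to a semicolon, which lets a command span lines without `///`. This is an alternative not covered above. It works in do-files and ado-files, not at the interactive prompt. A sketch on the shipped `auto` dataset:

```stata
sysuse auto, clear

#delimit ;
regress price mpg weight
    length,
    vce(robust) ;
#delimit cr

display e(N)
```

Mixing the two styles in one file is easy to get wrong, so most projects pick one. `///` has the advantage of being local to the line it appears on.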

Stored Results: `r()` vs `e()` vs `s()`

`r()`: r-class commands (summarize, tabulate, etc.)
`e()`: e-class commands (estimation: regress, logit, etc.)
`s()`: s-class commands (parsing)

A new estimation command overwrites previous `e()` results. Store them first:

regress y x1 x2
estimates store model1
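A slightly fuller sketch on the shipped `auto` dataset: it copies an `r()` result into a local before another command replaces it, stores two models, and brings the first one back with `estimates restore`. The names `m1` and `m2` are arbitrary.

```stata
sysuse auto, clear

* r() results last only until the next r-class command, so save what you need
summarize price
local mean_price = r(mean)
gen price_c = price - `mean_price'

* e() results are replaced by each new estimation command
regress price mpg
estimates store m1
regress price mpg weight
estimates store m2

* Bring the first model back as the active e() results
estimates restore m1
display "N = " e(N) ", R-squared = " e(r2)

* Side-by-side comparison of the stored models
estimates table m1 m2, b se
```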

Routing Table

Read only the files relevant to the user's task. Paths are relative to this SKILL.md file.

Data Operations

File	Topics & Key Commands
`references/basics-getting-started.md`	`use`, `save`, `describe`, `browse`, `sysuse`, basic workflow
`references/data-import-export.md`	`import delimited`, `import excel`, ODBC, `export`, web data
`references/data-management.md`	`generate`, `replace`, `merge`, `append`, `reshape`, `collapse`, `recode`, `egen`, `encode`/`decode`
`references/variables-operators.md`	Variable types, `byte`/`int`/`long`/`float`/`double`, operators, missing values (`.<.a`), `if`/`in` qualifiers
`references/string-functions.md`	`substr()`, `regexm()`, `strtrim()`, `split`, `ustrlen()`, regex, Unicode
`references/date-time-functions.md`	`date()`, `clock()`, `%td`/`%tc` formats, `mdy()`, `dofm()`, business calendars
`references/mathematical-functions.md`	`round()`, `log()`, `exp()`, `abs()`, `mod()`, `cond()`, distributions, random numbers

Statistics & Econometrics

File	Topics & Key Commands
`references/descriptive-statistics.md`	`summarize`, `tabulate`, `correlate`, `tabstat`, `codebook`, weighted stats
`references/linear-regression.md`	`regress`, `vce(robust)`, `vce(cluster)`, `test`, `lincom`, `margins`, `predict`, `ivregress`
`references/panel-data.md`	`xtset`, `xtreg fe`/`re`, Hausman test, `xtabond`, dynamic panels
`references/time-series.md`	`tsset`, ARIMA, VAR, `dfuller`, `pperron`, `irf`, forecasting
`references/limited-dependent-variables.md`	`logit`, `probit`, `tobit`, `poisson`, `nbreg`, `mlogit`, `ologit`, `margins` for nonlinear
`references/bootstrap-simulation.md`	`bootstrap`, `simulate`, `permute`, Monte Carlo
`references/survey-data-analysis.md`	`svyset`, `svy:`, `subpop()`, complex survey design, replicate weights
`references/missing-data-handling.md`	`mi impute`, `mi estimate`, FIML, `misstable`, diagnostics
`references/maximum-likelihood.md`	`ml model`, custom likelihood functions, `ml init`, gradient-based optimization
`references/gmm-estimation.md`	`gmm`, moment conditions, `estat overid`, J-test

Causal Inference

File	Topics & Key Commands
`references/treatment-effects.md`	`teffects ra/ipw/ipwra/aipw`, `stteffects`, ATE/ATT/ATET
`references/difference-in-differences.md`	DiD, parallel trends, event studies, staggered adoption
`references/regression-discontinuity.md`	Sharp/fuzzy RD, bandwidth selection, `rdplot`
`references/matching-methods.md`	PSM, nearest neighbor, kernel matching, `teffects nnmatch`
`references/sample-selection.md`	`heckman`, `heckprobit`, treatment models, exclusion restrictions

Advanced Methods

File	Topics & Key Commands
`references/survival-analysis.md`	`stset`, `stcox`, `streg`, Kaplan-Meier, parametric models
`references/sem-factor-analysis.md`	`sem`, `gsem`, CFA, path analysis, `alpha`, reliability
`references/nonparametric-methods.md`	`kdensity`, rank tests, `qreg`, `npregress`
`references/spatial-analysis.md`	`spmatrix`, `spregress`, spatial weights, Moran's I
`references/machine-learning.md`	`lasso`, `elasticnet`, `cvlasso`, cross-validation

Graphics

File	Topics & Key Commands
`references/graphics.md`	`twoway`, `scatter`, `line`, `bar`, `histogram`, `graph combine`, `graph export`, schemes

Programming

File	Topics & Key Commands
`references/programming-basics.md`	`local`, `global`, `foreach`, `forvalues`, `program define`, `syntax`, `return`
`references/advanced-programming.md`	`syntax`, `mata`, classes, `_prefix`, dialog boxes, `tempfile`/`tempvar`
`references/mata-introduction.md`	Mata basics, when to use Mata vs ado, data types
`references/mata-programming.md`	Mata functions, flow control, structures, pointers
`references/mata-matrix-operations.md`	Matrix creation, decompositions, solvers, `st_matrix()`
`references/mata-data-access.md`	`st_data()`, `st_view()`, `st_store()`, performance tips

Output & Workflow

File	Topics & Key Commands
`references/tables-reporting.md`	`putexcel`, `putdocx`, `putpdf`, LaTeX integration, `collect`
`references/workflow-best-practices.md`	Project structure, master do-files, version control, debugging, common mistakes
`references/external-tools-integration.md`	Python via `python:`, R via `rsource`, shell commands, Git

Community Packages

File	What It Does
`packages/reghdfe.md`	High-dimensional fixed effects OLS (absorbs multiple FE sets efficiently)
`packages/estout.md`	`esttab`/`estout`: publication-quality regression tables
`packages/outreg2.md`	Alternative regression table exporter (Word, Excel, TeX)
`packages/asdoc.md`	One-command Word document creation for any Stata output
`packages/tabout.md`	Cross-tabulations and summary tables to file
`packages/coefplot.md`	Coefficient plots from stored estimates
`packages/graph-schemes.md`	`grstyle`, `schemepack`, `plotplain`: better graph themes
`packages/did.md`	Modern DiD: `csdid`, `did_multiplegt`, `did_imputation` (Callaway-Sant'Anna, de Chaisemartin-D'Haultfoeuille, Borusyak-Jaravel-Spiess)
`packages/event-study.md`	`eventstudyinteract`, `eventdd`: event study estimators
`packages/rdrobust.md`	Robust RD estimation with optimal bandwidth (`rdrobust`, `rdplot`, `rdbwselect`)
`packages/psmatch2.md`	Propensity score matching (nearest neighbor, kernel, radius)
`packages/synth.md`	Synthetic control method (`synth`, `synth_runner`)
`packages/ivreg2.md`	Enhanced IV/2SLS: `ivreg2`, `xtivreg2` with additional diagnostics
`packages/xtabond2.md`	Dynamic panel GMM (Arellano-Bond/Blundell-Bond)
`packages/binsreg.md`	Binned scatter plots with CI (`binsreg`, `binstest`)
`packages/nprobust.md`	Nonparametric kernel estimation and inference
`packages/diagnostics.md`	`bacondecomp`, `xttest3`, collinearity, heteroskedasticity tests
`packages/winsor.md`	Winsorizing and trimming: `winsor2`, `winsor`
`packages/data-manipulation.md`	`gtools` (fast collapse/egen), `rangestat`, `egenmore`
`packages/package-management.md`	`ssc install`, `net install`, `ado update`, finding packages

Common Patterns

Regression Table Workflow

* Estimate models
eststo clear
eststo: regress y x1 x2, vce(robust)
eststo: regress y x1 x2 x3, vce(robust)
eststo: regress y x1 x2 x3 x4, vce(cluster id)

* Export table
esttab using "results.tex", replace ///
    se star(* 0.10 ** 0.05 *** 0.01) ///
    label booktabs ///
    title("Main Results") ///
    mtitles("(1)" "(2)" "(3)")

Panel Data Setup

xtset panelid timevar          // declare panel structure
xtdescribe                      // check balance
xtsum outcome                   // within/between variation

* Fixed effects
xtreg y x1 x2, fe vce(cluster panelid)
* Or with reghdfe (preferred for multiple FE)
reghdfe y x1 x2, absorb(panelid timevar) vce(cluster panelid)

Difference-in-Differences

* Classic 2x2 DiD
gen post = (year >= treatment_year)
gen treat_post = treated * post
regress y treated post treat_post, vce(cluster id)

* Modern staggered DiD (Callaway & Sant'Anna)
csdid y x1 x2, ivar(id) time(year) gvar(first_treat) agg(event)
csdid_plot
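The classic 2x2 specification can also be written with factor-variable notation, which avoids building the interaction by hand and works with `margins`. The sketch below simulates a small two-period panel and checks that both forms return the same interaction coefficient. The sample size, seed, and the true effect of 2 are arbitrary simulation assumptions.

```stata
* Simulated two-period panel: 200 units, first 100 treated
clear
set seed 12345
set obs 200
gen id = _n
expand 2
bysort id: gen post = (_n == 2)
gen treated = (id <= 100)
gen y = 1 + 0.5*treated + 0.3*post + 2*treated*post + rnormal()

* Hand-built interaction
gen treat_post = treated * post
regress y treated post treat_post, vce(cluster id)
local b_manual = _b[treat_post]

* Same model with factor-variable notation
regress y i.treated##i.post, vce(cluster id)
display "Manual: `b_manual'   Factor notation: " _b[1.treated#1.post]
```

Both regressions estimate the same model, so the two displayed coefficients should agree up to rounding.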

Graph Export

* Publication-quality scatter with fit line
twoway (scatter y x, mcolor(navy%50) msize(small)) ///
       (lfit y x, lcolor(cranberry) lwidth(medthick)), ///
    title("Title Here") ///
    xtitle("X Label") ytitle("Y Label") ///
    legend(off) scheme(s2color)
graph export "figure1.pdf", replace as(pdf)
graph export "figure1.png", replace as(png) width(2400)

Data Cleaning Pipeline

* Load and inspect
import delimited "raw_data.csv", clear varnames(1)
describe
codebook, compact

* Clean
rename *, lower                 // lowercase all varnames
destring income, replace force  // convert string to numeric; force sets nonnumeric entries to missing
replace income = . if income < 0

* Label
label variable income "Annual household income (USD)"
label define yesno 0 "No" 1 "Yes"
label values employed yesno

* Save
compress
save "clean_data.dta", replace
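Because `destring ..., force` converts every nonnumeric entry to missing without listing them, it can help to inspect the offending values first and clean them deliberately. A minimal sketch on made-up strings, where the `"n/a"` code and the thousands separator are assumptions about what dirty data might look like:

```stata
* Toy string variable with a thousands separator and a text code for missing
clear
input str10 income
"1000"
"2,500"
"n/a"
end

* Flag entries that would not convert cleanly
gen byte nonnumeric = missing(real(income)) & !missing(income)
list income if nonnumeric

* Handle each known pattern explicitly, then convert without force
replace income = "" if inlist(lower(strtrim(income)), "n/a", "na")
destring income, replace ignore(",")
confirm numeric variable income
drop nonnumeric
list
```

If an unanticipated pattern remains, `destring` without `force` leaves the variable as a string and `confirm numeric variable` stops the do-file, so the problem surfaces instead of becoming missing data.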

Multiple Imputation

mi set mlong
mi register imputed income education
mi impute chained (regress) income (ologit) education = age i.gender, add(20) rseed(12345)
mi estimate: regress wage income education age i.gender
