Awesome-Agent-Skills-for-Empirical-Research data-analysis

End-to-end data analysis workflow in R or Python — from exploration through regression to publication-ready tables and figures. Make sure to use this skill whenever the user wants to run any empirical analysis, write analysis code, or produce output from data. Triggers include: "analyze this data", "run a regression", "write R code for this", "write Python code for this", "I have a dataset", "help me with this regression", "run a DiD", "run an RDD", "event study", "IV regression", "fit a model", "produce a table", "make a figure", "explore my data", or any request involving a dataset path or empirical estimation.

install

source · Clone the upstream repo

git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/15-Felpix-Studios-social-science-research/skills/data-analysis" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-data-analysis-f12d63 && rm -rf "$T"

manifest: skills/15-Felpix-Studios-social-science-research/skills/data-analysis/SKILL.md

source content

Data Analysis Workflow

Run an end-to-end data analysis in R or Python: load, explore, analyze, and produce publication-ready output.

Input:

$ARGUMENTS

— a dataset path (e.g.,

data/county_panel.csv

) or a description of the analysis goal (e.g., "regress wages on education with state fixed effects using CPS data").

Phase 0: Choose Language

Determine language from

$ARGUMENTS

or ask the user:

User mentions
```
tidyverse
```
,
```
fixest
```
,
```
lm
```
,
```
.R
```
context → R track
User mentions
```
pandas
```
,
```
statsmodels
```
,
```
sklearn
```
,
```
.py
```
or
```
.ipynb
```
context → Python track
Dataset is
```
.csv
```
/
```
.parquet
```
with no language cue → use AskUserQuestion with a single-select menu:
- header: "Language"
- question: "Which language should I use for this analysis?"
- options:
  - label: "R (Recommended)", description: "tidyverse, fixest, ggplot2 — full plugin support with coding conventions and R reviewer"
  - label: "Python", description: "pandas, statsmodels — supported for analysis scripts and figures"
  - label: "Both", description: "R for figures and tables, Python for data processing"

R Track

Constraints

Follow
```
rules/r-code-conventions.md
```
for all standards
Save scripts to
```
scripts/R/
```
with descriptive names
Save all outputs (figures, tables, RDS) to
```
output/
```
Use
```
saveRDS()
```
for every computed object
Run
```
r-reviewer
```
on the generated script before presenting results

Phase 1: Setup and Data Loading

Create R script with proper header (title, author, purpose, inputs, outputs)
Load required packages at top (
```
library()
```
, never
```
require()
```
)
Set seed once at top:
```
set.seed(42)
```

Create output directories:

dir.create("output/analysis", recursive = TRUE, showWarnings = FALSE)

Load and inspect the dataset

Phase 2: Exploratory Data Analysis

```
summary()
```
, missingness rates, variable types
Histograms for key continuous variables
Scatter plots, correlation matrices
Panel trends, pre-treatment comparisons if applicable
Save all diagnostic figures to
```
output/diagnostics/
```

Phase 3: Main Analysis

Panel data: use
```
fixest
```
; cross-section: use
```
lm
```
/
```
glm
```
Cluster SEs at the appropriate level (document why)
Multiple specifications: start simple, progressively add controls
Report standardized effects alongside raw coefficients

Phase 4: Publication-Ready Output

Tables:

modelsummary

(preferred) or

stargazer

— export

.tex

and

.html

Figures:

ggplot2

with project theme; explicit

ggsave(width = X, height = Y)

; save as

.pdf

and

.png

; add

bg = "transparent"

only if output is for Beamer slides

Phase 5: Save and Review

```
saveRDS()
```
for all key objects
Run the
```
r-reviewer
```
agent: "Review the script at scripts/R/[script_name].R"
Address Critical and High issues before presenting results

R Script Template

# ============================================================
# [Descriptive Title]
# Author: [from project context]
# Purpose: [What this script does]
# Inputs:  [Data files]
# Outputs: [Figures, tables, RDS files]
# ============================================================

# 0. Setup ----
library(tidyverse)
library(fixest)
library(modelsummary)

set.seed(42)
dir.create("output/analysis", recursive = TRUE, showWarnings = FALSE)

# 1. Data Loading ----
# 2. Exploratory Analysis ----
# 3. Main Analysis ----
# 4. Tables and Figures ----
# 5. Export ----

Python Track

Constraints

Save scripts to
```
scripts/python/
```
with descriptive names
Save all outputs (figures, tables, pickles) to
```
output/
```
Use
```
joblib.dump()
```
for model objects;
```
.to_parquet()
```
for DataFrames
Use
```
pathlib.Path
```
for all file paths — never hardcode absolute paths
Set random seeds at the top of the script

Phase 1: Setup and Data Loading

Create Python script with header (title, author, purpose, inputs, outputs)
Import all packages at the top of the file
Set seeds:
```
np.random.seed(42)
```
and
```
random.seed(42)
```

Create output directories:

Path("output/analysis").mkdir(parents=True, exist_ok=True)

Load and inspect the dataset with
```
pandas
```

Phase 2: Exploratory Data Analysis

```
df.describe()
```
,
```
df.isnull().sum()
```
,
```
df.dtypes
```
Histograms and distributions with
```
matplotlib
```
/
```
seaborn
```
Scatter plots and correlation matrices
Save diagnostic figures to
```
output/diagnostics/
```

Save summary stats:

df.describe().to_csv("output/diagnostics/summary_stats.csv")

Phase 3: Main Analysis

Cross-section OLS:

smf.ols("y ~ x", data=df).fit(cov_type="HC3")

Panel data:
```
PanelOLS
```
from
```
linearmodels
```
with cluster-robust SEs
Multiple specifications: build incrementally
Document SE choice with a comment

Phase 4: Publication-Ready Output

Tables: Format with

pandas

and export via

.to_latex()

stargazer

(Python port) Figures:

matplotlib

seaborn

; explicit

fig.savefig(path, dpi=300, bbox_inches="tight")

; save as

.pdf

and

.png

Phase 5: Save and Review

```
joblib.dump(model, "output/model.pkl")
```
for fitted models

df_results.to_parquet("output/results.parquet")

for DataFrames

Review the script manually against the Python checklist below before presenting

Python Script Template

# ============================================================
# [Descriptive Title]
# Author: [from project context]
# Purpose: [What this script does]
# Inputs:  [Data files]
# Outputs: [Figures, tables, pickle/parquet files]
# ============================================================

import random
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from pathlib import Path

# Seeds
np.random.seed(42)
random.seed(42)

# Output directories
Path("output/analysis").mkdir(parents=True, exist_ok=True)
Path("output/figures").mkdir(parents=True, exist_ok=True)

# 1. Data Loading
# 2. Exploratory Analysis
# 3. Main Analysis
# 4. Tables and Figures
# 5. Export

Python Quality Checklist

[ ] All imports at top
[ ] Random seeds set (numpy + stdlib)
[ ] All paths use pathlib.Path — no hardcoded strings
[ ] Output directories created with mkdir(exist_ok=True)
[ ] Figures saved with explicit dpi=300, bbox_inches="tight"
[ ] Model objects saved with joblib.dump()
[ ] DataFrames saved as parquet
[ ] Comments explain WHY, not WHAT

Shared Principles

Reproduce, don't guess. If the user specifies a regression, run exactly that.
Show your work. Compute summary statistics before jumping to regression.
Check for issues. Look for multicollinearity, outliers, perfect prediction, missing data.
Use relative paths. All paths relative to repository root.
No hardcoded values. Use variables for sample restrictions, date ranges, thresholds.