Awesome-Agent-Skills-for-Empirical-Research data-analysis
End-to-end R data analysis workflow from exploration through regression to publication-ready tables and figures
install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/28-maxwell2732-paper-replicate-agent-demo/dot-claude/skills/data-analysis" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-data-analysis-ed15d9 && rm -rf "$T"
manifest: skills/28-maxwell2732-paper-replicate-agent-demo/dot-claude/skills/data-analysis/SKILL.md
source content
Data Analysis Workflow
Run an end-to-end data analysis in R: load, explore, analyze, and produce publication-ready output.
Input:
$ARGUMENTS — a dataset path (e.g., data/county_panel.csv) or a description of the analysis goal (e.g., "regress wages on education with state fixed effects using CPS data").
Constraints
- Follow R code conventions in .claude/rules/r-code-conventions.md
- Save all scripts to scripts/R/ with descriptive names
- Save all outputs (figures, tables, RDS) to output/
- Use saveRDS() for every computed object, since Quarto slides may need them
- Use project theme for all figures (check for custom theme in .claude/rules/)
- Run r-reviewer on the generated script before presenting results
Workflow Phases
Phase 1: Setup and Data Loading
- Read .claude/rules/r-code-conventions.md for project standards
- Create R script with proper header (title, author, purpose, inputs, outputs)
- Load required packages at top (library(), never require())
- Set seed once at top: set.seed(42)
- Load and inspect the dataset
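The setup steps above can be sketched roughly as follows. This is a minimal illustration in base R only: the helper name load_dataset and the temporary CSV are hypothetical stand-ins for whatever path arrives in $ARGUMENTS, and library(utils) stands in for the project's real package list.

```r
# 0. Setup ----
library(utils)  # stand-in; a real script would attach tidyverse, fixest, etc. here

set.seed(42)    # set once, at the top of the script

# Hypothetical helper: read a CSV and print a quick inspection.
load_dataset <- function(path) {
  if (!file.exists(path)) stop("Dataset not found: ", path)
  df <- read.csv(path, stringsAsFactors = FALSE)
  cat("Rows:", nrow(df), " Columns:", ncol(df), "\n")
  str(df)          # variable types
  print(head(df))  # first few rows
  df
}

# Demo on a temporary file so the sketch runs without project data.
demo_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    county = rep(1:3, each = 2),
    year   = rep(2019:2020, times = 3),
    wage   = rnorm(6, mean = 20, sd = 2)
  ),
  demo_path,
  row.names = FALSE
)
panel <- load_dataset(demo_path)
```

In a real run the demo block would be replaced by a single call such as load_dataset("data/county_panel.csv"), with the path kept relative to the repository root.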
Phase 2: Exploratory Data Analysis
Generate diagnostic outputs:
- Summary statistics: summary(), missingness rates, variable types
- Distributions: Histograms for key continuous variables
- Relationships: Scatter plots, correlation matrices
- Time patterns: If panel data, plot trends over time
- Group comparisons: If treatment/control, compare pre-treatment means
Save all diagnostic figures to output/diagnostics/.
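A minimal sketch of these diagnostics, assuming ggplot2 is installed. The data frame is simulated purely for illustration, and the variable names (wage, education, treated) and the file name hist_wage.png are hypothetical.

```r
library(ggplot2)

set.seed(42)

# Simulated stand-in data with a few missing wages.
df <- data.frame(
  wage      = rnorm(200, mean = 20, sd = 4),
  education = sample(8:20, 200, replace = TRUE),
  treated   = rep(c(0, 1), each = 100)
)
df$wage[sample(200, 10)] <- NA

# Summary statistics, missingness rates, and variable types.
print(summary(df))
missing_rate <- colMeans(is.na(df))
var_types    <- vapply(df, function(x) class(x)[1], character(1))
print(data.frame(type = var_types, missing_rate = round(missing_rate, 3)))

# Relationships: correlation matrix using pairwise-complete observations.
cor_mat <- cor(df, use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# Group comparison: mean education by treatment status.
print(tapply(df$education, df$treated, mean))

# Distribution: histogram of a key continuous variable, saved to output/diagnostics/.
diag_dir <- file.path("output", "diagnostics")
dir.create(diag_dir, recursive = TRUE, showWarnings = FALSE)
p_hist <- ggplot(df, aes(x = wage)) +
  geom_histogram(bins = 30, na.rm = TRUE) +
  labs(x = "Hourly wage (USD)", y = "Count")
ggsave(file.path(diag_dir, "hist_wage.png"), plot = p_hist, width = 6, height = 4)
```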
Phase 3: Main Analysis
Based on the research question:
- Regression analysis: Use fixest for panel data, lm/glm for cross-section
- Standard errors: Cluster at the appropriate level (document why)
- Multiple specifications: Start simple, progressively add controls
- Effect sizes: Report standardized effects alongside raw coefficients
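As an illustration of this phase, the sketch below builds up three specifications with fixest::feols() on a simulated panel. The data, the variable names (unit, year, x1, x2, y), and the choice to cluster by unit are assumptions made for the example. Standardizing variables before estimation is only one possible convention for reporting standardized effects.

```r
library(fixest)

set.seed(42)

# Simulated panel: 50 units observed over 6 years (illustrative only).
n_units <- 50
n_years <- 6
panel <- data.frame(
  unit = rep(seq_len(n_units), each = n_years),
  year = rep(2015:2020, times = n_units)
)
panel$x1 <- rnorm(nrow(panel))
panel$x2 <- rnorm(nrow(panel))
unit_effect <- rnorm(n_units)[panel$unit]
panel$y <- 0.5 * panel$x1 + 0.2 * panel$x2 + unit_effect + rnorm(nrow(panel))

# Start simple, then add controls and fixed effects.
# SEs are clustered by unit because errors are likely correlated within a unit over time.
m1 <- feols(y ~ x1, data = panel, cluster = ~unit)
m2 <- feols(y ~ x1 + x2, data = panel, cluster = ~unit)
m3 <- feols(y ~ x1 + x2 | unit + year, data = panel, cluster = ~unit)
etable(m1, m2, m3)

# Standardized effects: re-estimate the preferred model on z-scored variables.
panel_std <- transform(
  panel,
  y  = as.numeric(scale(y)),
  x1 = as.numeric(scale(x1)),
  x2 = as.numeric(scale(x2))
)
m3_std <- feols(y ~ x1 + x2 | unit + year, data = panel_std, cluster = ~unit)
print(coef(m3_std))

# Cross-sectional data would use lm() or glm() instead.
cs_fit <- lm(mpg ~ wt + hp, data = mtcars)
```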
Phase 4: Publication-Ready Output
Tables:
- Use modelsummary for regression tables (preferred) or stargazer
- Include all standard elements: coefficients, SEs, significance stars, N, R-squared
- Export as .tex for LaTeX inclusion and .html for quick viewing
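A minimal sketch of the table export, assuming modelsummary and its table-writing backend are installed. The models use the built-in mtcars data, and the output/tables/ directory and the main_results file name are hypothetical.

```r
library(modelsummary)

# Two illustrative specifications on built-in data.
models <- list(
  "Baseline"      = lm(mpg ~ wt, data = mtcars),
  "With controls" = lm(mpg ~ wt + hp, data = mtcars)
)

tab_dir <- file.path("output", "tables")
dir.create(tab_dir, recursive = TRUE, showWarnings = FALSE)

# Write the same table as .tex (for LaTeX) and .html (for quick viewing).
# By default modelsummary prints coefficients with standard errors in parentheses.
for (ext in c("tex", "html")) {
  modelsummary(
    models,
    output  = file.path(tab_dir, paste0("main_results.", ext)),
    stars   = TRUE,                     # significance stars
    gof_map = c("nobs", "r.squared")    # N and R-squared only
  )
}
```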
Figures:
- Use ggplot2 with project theme
- Set bg = "transparent" for Beamer compatibility
- Include proper axis labels (sentence case, units)
- Export with explicit dimensions: ggsave(width = X, height = Y)
- Save as both .pdf and .png
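The figure guidelines above might look like this in practice. theme_project() is a hypothetical stand-in for the project's own theme, and the output/figures/ directory, file name, and 6 x 4 inch dimensions are illustrative assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for the project theme defined under .claude/rules/.
theme_project <- function() {
  theme_minimal(base_size = 12) +
    theme(
      plot.background  = element_rect(fill = "transparent", colour = NA),
      panel.background = element_rect(fill = "transparent", colour = NA)
    )
}

# Axis labels in sentence case, with units.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1,000 lbs)", y = "Fuel economy (miles per gallon)") +
  theme_project()

fig_dir <- file.path("output", "figures")
dir.create(fig_dir, recursive = TRUE, showWarnings = FALSE)

# Save as both .pdf and .png with explicit dimensions and a transparent background.
for (ext in c("pdf", "png")) {
  ggsave(
    file.path(fig_dir, paste0("mpg_vs_weight.", ext)),
    plot = p, width = 6, height = 4, bg = "transparent"
  )
}
```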
Phase 5: Save and Review
- saveRDS() for all key objects (regression results, summary tables, processed data)
- Create output/ subdirectories as needed with dir.create(..., recursive = TRUE)
- Run the r-reviewer agent on the generated script:
Delegate to the r-reviewer agent: "Review the script at scripts/R/[script_name].R"
- Address any Critical or High issues from the review.
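The save step can be sketched in base R as below. The object names and the output/analysis/ directory are illustrative, and the readRDS() call at the end is only a sanity check that the round trip preserves the object.

```r
out_dir <- file.path("output", "analysis")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Illustrative key objects: a fitted model, a summary table, and the processed data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary_table <- data.frame(
  term     = names(coef(fit)),
  estimate = unname(coef(fit))
)

objects_to_save <- list(
  main_model    = fit,
  summary_table = summary_table,
  analysis_data = mtcars
)

# One .rds file per object, named after the object.
for (nm in names(objects_to_save)) {
  saveRDS(objects_to_save[[nm]], file.path(out_dir, paste0(nm, ".rds")))
}

# Sanity check: the saved table reads back unchanged.
restored <- readRDS(file.path(out_dir, "summary_table.rds"))
```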
Script Structure
Follow this template:
# ============================================================
# [Descriptive Title]
# Author: [from project context]
# Purpose: [What this script does]
# Inputs: [Data files]
# Outputs: [Figures, tables, RDS files]
# ============================================================

# 0. Setup ----
library(tidyverse)
library(fixest)
library(modelsummary)

set.seed(42)

dir.create("output/analysis", recursive = TRUE, showWarnings = FALSE)

# 1. Data Loading ----
# [Load and clean data]

# 2. Exploratory Analysis ----
# [Summary stats, diagnostic plots]

# 3. Main Analysis ----
# [Regressions, estimation]

# 4. Tables and Figures ----
# [Publication-ready output]

# 5. Export ----
# [saveRDS for all objects, ggsave for all figures]
Important
- Reproduce, don't guess. If the user specifies a regression, run exactly that.
- Show your work. Print summary statistics before jumping to regression.
- Check for issues. Look for multicollinearity, outliers, perfect prediction.
- Use relative paths. All paths relative to repository root.
- No hardcoded values. Use variables for sample restrictions, date ranges, etc.
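A rough sketch of how a few of these rules could look in a script, using built-in mtcars data. The threshold values, variable names, and the hand-computed variance inflation factors are illustrative assumptions, not project requirements.

```r
# Named parameters instead of hardcoded values (thresholds are hypothetical).
sample_min_hp <- 60
sample_max_hp <- 250
outlier_z_cut <- 3

analysis_data <- subset(mtcars, hp >= sample_min_hp & hp <= sample_max_hp)

# Show your work: summary statistics before any regression.
print(summary(analysis_data[, c("mpg", "wt", "hp", "disp")]))

# Multicollinearity check: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on the remaining predictors.
predictors <- c("wt", "hp", "disp")
vif <- vapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = analysis_data))$r.squared
  1 / (1 - r2)
}, numeric(1))
print(round(vif, 2))

# Outlier check: flag observations with large standardized residuals.
fit <- lm(mpg ~ wt + hp + disp, data = analysis_data)
flagged <- which(abs(rstandard(fit)) > outlier_z_cut)
print(flagged)
```

Keeping the restrictions in named variables at the top means a change to the sample definition is made in one place and shows up in the script header rather than buried in a filter call.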