Awesome-Agent-Skills-for-Empirical-Research data-deposit
Prepare a replication package for the sewage-house-prices project. Generates AEA-compliant README, master script, numbered script order, install script, and deposit checklist. Validates the package against 10 verification checks. This skill should be used when asked to "prepare replication", "data deposit", "create replication package", or "package for submission".
install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/41-sticerd-eee-sewage-econometrics-check/skills/data-deposit" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-data-deposit && rm -rf "$T"
manifest:
skills/41-sticerd-eee-sewage-econometrics-check/skills/data-deposit/SKILL.md
Data Deposit Preparation
Prepare an AEA Data Editor compliant replication package for the sewage-house-prices project.
Input:
$ARGUMENTS — output directory (defaults to Replication/).
Project-Specific Context
Pipeline Structure
The project has a 6-layer data pipeline in `scripts/R/`:
- `01_data_ingestion/` — Raw data collection (EDM archives, APIs)
- `02_data_cleaning/` — Format standardisation, geocoding, validation
- `03_data_enrichment/` — Temporal aggregation, rainfall metrics, dry spill identification
- `04_feature_engineering/` — Spatial matching (house/rental ↔ spill sites)
- `05_data_integration/` — Merging historical and API EDM data
- `06_analysis_datasets/` — Final dataset assembly
- Analysis scripts: `scripts/R/09_analysis/` (6 subdirectories by approach)
- Utilities: `scripts/R/utils/`
- Python scripts: `scripts/python/` (river network processing)
- Docker pipelines: `RiverNetworks/`, `upstream_downstream/`
Data Layout
- `data/raw/` — Original immutable data (EDM, Land Registry, Met Office, shapefiles)
- `data/processed/` — Intermediate pipeline outputs (parquet)
- `data/final/` — Analysis-ready datasets
- `data/cache/` — Postcode geocoding cache
Key Dependencies
- R packages managed via renv (`renv.lock`)
- Python environment via uv in `scripts/python/`
- PostGIS via Docker for river network analysis
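Since `renv.lock` is plain JSON, a rough version inventory can be pulled out with standard tools. A minimal sketch — the lock-file fragment below is invented for illustration, not the project's real lock file:

```shell
#!/bin/sh
# Sketch: extract package names and versions from a renv.lock-style JSON file.
# The lock-file content here is a made-up fragment for demonstration only.
cat > renv.lock <<'EOF'
{
  "Packages": {
    "fixest": { "Package": "fixest", "Version": "0.11.1" },
    "tidyverse": { "Package": "tidyverse", "Version": "2.0.0" }
  }
}
EOF
# Pair each "Package" with the "Version" that follows it on the same line.
grep -oE '"Package": "[^"]+", "Version": "[^"]+"' renv.lock
```

In practice a JSON-aware tool is more robust, but a line-oriented scan like this is enough for a first pass at documenting versions.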
Workflow
Step 1: Inventory
- Read all scripts in `scripts/R/` and parse data file references
- Read `renv.lock` for package versions
- Scan `output/tables/` and `output/figures/` for output files
- Read the manuscript (`docs/overleaf/_main.tex`) for table/figure references
- Check `scripts/python/` for Python dependencies
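The first inventory task — parsing data file references out of the R scripts — can be approximated by grepping for quoted `data/` paths. A sketch against a throwaway demo tree (file names and contents are illustrative, not from the project):

```shell
#!/bin/sh
# Sketch: list every quoted data/ path referenced by scripts under scripts/R/.
# Builds a throwaway demo tree; the real skill would scan the project root.
mkdir -p demo/scripts/R/01_data_ingestion
cat > demo/scripts/R/01_data_ingestion/ingest.R <<'EOF'
spills <- readr::read_csv("data/raw/edm_archive.csv")
arrow::write_parquet(spills, "data/processed/spills.parquet")
EOF
# -r recurse, -h suppress filenames, -o print only the matched path
grep -rhoE '"data/[^"]+"' demo/scripts/R | tr -d '"' | sort -u
```

This misses paths built with `file.path()` or variables, so it is a first pass to be checked by hand, not a complete inventory.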
Step 2: Analyse Dependencies
- Parse script dependencies (which scripts create files that others load)
- Map the execution order (follows the 6-layer pipeline, then analysis scripts)
- Cross-reference the full execution order documented in `ReadMe.md`
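The dependency rule being checked is simple: a script that reads a file must run after the script that writes it. A minimal sketch, assuming arrow-style `read_parquet`/`write_parquet` calls (the demo scripts and file names are invented):

```shell
#!/bin/sh
# Sketch: for one intermediate file, find its producer and its consumers.
# Demo scripts are invented; patterns assume arrow-style parquet calls.
mkdir -p dep
cat > dep/01_ingest.R <<'EOF'
arrow::write_parquet(x, "data/processed/spills.parquet")
EOF
cat > dep/02_match.R <<'EOF'
y <- arrow::read_parquet("data/processed/spills.parquet")
EOF
f='data/processed/spills.parquet'
echo "produced by: $(grep -l "write_parquet(.*\"$f\"" dep/*.R)"
echo "consumed by: $(grep -l "read_parquet(.*\"$f\"" dep/*.R)"
```

Repeating this over every intermediate file yields a producer/consumer map, from which the execution order can be verified against the numbered layers.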
Step 3: Assemble Package
Create in `Replication/` (or the specified directory):
- `README.md` — AEA format:
  - Data availability statement (which data is public vs restricted)
  - Computational requirements (R version, packages, PostGIS, Python)
  - Program descriptions (what each script does)
  - Replication instructions (step-by-step)
  - Expected runtime
- `master.R` — Runs everything in order:

  ```r
  # Master replication script for "Sewage in Our Waters"
  # Estimated runtime: [X hours]
  source(here::here("scripts", "R", "01_data_ingestion", "script.R"))
  # ... through all layers
  source(here::here("scripts", "R", "09_analysis", "subdir", "script.R"))
  ```

- `install_packages.R` — If renv is not used:

  ```r
  install.packages(c("tidyverse", "fixest", "modelsummary", ...))
  ```

- `DEPOSIT_CHECKLIST.md` — Pre-deposit verification
Step 4: Validate
Run the 10 verification checks (equivalent to `/audit-replication`):
- Script execution order is correct
- All data file references resolve
- All output files are generated
- Package versions documented
- No hardcoded absolute paths
- Data provenance documented
- README completeness (AEA format)
- Output cross-reference (every table/figure traced to a script)
- Restricted data properly flagged
- Master script runs without modification
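The hardcoded-path check, for instance, reduces to grepping for user-specific absolute paths; `here::here()`-style paths pass. A sketch with invented demo files:

```shell
#!/bin/sh
# Sketch of the hardcoded-path check: flag user-specific absolute paths.
# Demo files are invented; here::here() paths are the compliant pattern.
mkdir -p chk
cat > chk/good.R <<'EOF'
df <- readr::read_csv(here::here("data", "raw", "prices.csv"))
EOF
cat > chk/bad.R <<'EOF'
df <- readr::read_csv("/Users/someone/project/data/raw/prices.csv")
EOF
if grep -rlE '"(/Users/|/home/)' chk; then
  echo "FAIL: hardcoded absolute paths found above"
else
  echo "PASS: no hardcoded absolute paths"
fi
```

The other checks follow the same shape: a mechanical scan first, then manual review of anything flagged before declaring the package ready.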
Step 5: Present Results
- Package contents — All files in `Replication/`
- Script order — Numbered sequence with dependency graph
- Data availability — Public vs restricted datasets
- Verification result — X/10 checks passed
- Deposit steps — openICPSR / Zenodo instructions
Principles
- AEA Data Editor standards are the target. README format, versions, data access statements.
- Don't rename scripts without approval. Present ordering first, let the user decide.
- Thorough data provenance. Every dataset documented with source, access date, and restrictions.
- Test before declaring ready. Always validate after assembly.
- Document restricted data clearly. Land Registry and Zoopla data may have access restrictions.