Awesome-Agent-Skills-for-Empirical-Research data-deposit

Prepare a replication package for the sewage-house-prices project. Generates an AEA-compliant README, a master script, a numbered script order, an install script, and a deposit checklist, then validates the package against 10 verification checks. Use this skill when asked to "prepare replication", "data deposit", "create replication package", or "package for submission".

install

source · Clone the upstream repo

git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/41-sticerd-eee-sewage-econometrics-check/skills/data-deposit" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-data-deposit && rm -rf "$T"

manifest: skills/41-sticerd-eee-sewage-econometrics-check/skills/data-deposit/SKILL.md

source content

Data Deposit Preparation

Prepare an AEA Data Editor compliant replication package for the sewage-house-prices project.

Input: `$ARGUMENTS`, the output directory (defaults to `Replication/`).

Project-Specific Context

Pipeline Structure

The project has a 6-layer data pipeline in `scripts/R/`:

- `01_data_ingestion/`: Raw data collection (EDM archives, APIs)
- `02_data_cleaning/`: Format standardisation, geocoding, validation
- `03_data_enrichment/`: Temporal aggregation, rainfall metrics, dry spill identification
- `04_feature_engineering/`: Spatial matching (house/rental ↔ spill sites)
- `05_data_integration/`: Merging historical and API EDM data
- `06_analysis_datasets/`: Final dataset assembly

- Analysis scripts: `scripts/R/09_analysis/` (6 subdirectories by approach)
- Utilities: `scripts/R/utils/`
- Python scripts: `scripts/python/` (river network processing)
- Docker pipelines: `RiverNetworks/`, `upstream_downstream/`

Data Layout

data/raw/          — Original immutable data (EDM, Land Registry, Met Office, shapefiles)
data/processed/    — Intermediate pipeline outputs (parquet)
data/final/        — Analysis-ready datasets
data/cache/        — Postcode geocoding cache

Key Dependencies

- R packages managed via `renv` (`renv.lock`)
- Python environment via `uv` in `scripts/python/`
- PostGIS via Docker for river network analysis
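Because `renv.lock` is a JSON file, the pinned R version and package versions can be read programmatically when documenting computational requirements. The following is a minimal sketch, not part of the skill itself: the function name `read_renv_lock` and the example lockfile contents (including the version numbers) are illustrative assumptions.

```python
from __future__ import annotations

import json


def read_renv_lock(lock_text: str) -> tuple[str, dict[str, str]]:
    """Return (R version, {package: version}) from the text of an renv.lock file."""
    lock = json.loads(lock_text)
    r_version = lock.get("R", {}).get("Version", "unknown")
    packages = {
        name: record.get("Version", "unknown")
        for name, record in lock.get("Packages", {}).items()
    }
    # Sort by package name so the listing is stable between runs.
    return r_version, dict(sorted(packages.items()))


# Illustrative lockfile content only; the versions are made up for the example.
EXAMPLE_LOCK = """
{
  "R": {"Version": "4.3.2"},
  "Packages": {
    "here": {"Package": "here", "Version": "1.0.1", "Source": "Repository"},
    "fixest": {"Package": "fixest", "Version": "0.11.2", "Source": "Repository"}
  }
}
"""

r_version, packages = read_renv_lock(EXAMPLE_LOCK)
print(f"R {r_version}")
for name, version in packages.items():
    print(f"{name} {version}")
```

In practice the text would come from the project's own `renv.lock`, for example via `pathlib.Path("renv.lock").read_text()`.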

Workflow

Step 1: Inventory

- Read all scripts in `scripts/R/` and parse data file references
- Read `renv.lock` for package versions
- Scan `output/tables/` and `output/figures/` for output files
- Read the manuscript (`docs/overleaf/_main.tex`) for table/figure references
- Check `scripts/python/` for Python dependencies
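The manuscript scan in this step can be sketched as a search for `\input` and `\includegraphics` commands, followed by a comparison against the files found under `output/`. This is a rough illustration under stated assumptions: the function names, the example LaTeX snippet, and the file names in it are hypothetical, and matching is done on file stems only.

```python
import re
from pathlib import PurePosixPath

# Matches \input{...} and \includegraphics[...]{...}; the optional part is skipped.
REF_PATTERN = re.compile(r"\\(input|includegraphics)\s*(?:\[[^\]]*\])?\s*\{([^}]+)\}")


def manuscript_references(tex):
    """Return (command, path) pairs for each input/includegraphics command,
    ignoring anything after an unescaped % on a line."""
    refs = []
    for line in tex.splitlines():
        code = re.split(r"(?<!\\)%", line, maxsplit=1)[0]
        refs.extend((m.group(1), m.group(2).strip()) for m in REF_PATTERN.finditer(code))
    return refs


def missing_outputs(refs, output_files):
    """Return referenced paths whose file stem matches no generated output file."""
    available = {PurePosixPath(f).stem for f in output_files}
    return [path for _, path in refs if PurePosixPath(path).stem not in available]


# Hypothetical manuscript fragment and output listing, for illustration only.
EXAMPLE_TEX = r"""
\input{tables/main_results}
% \input{tables/old_table}
\includegraphics[width=\textwidth]{figures/spill_map.pdf}
"""
EXAMPLE_OUTPUTS = ["output/tables/main_results.tex"]

refs = manuscript_references(EXAMPLE_TEX)
print(refs)
print(missing_outputs(refs, EXAMPLE_OUTPUTS))
```

A manuscript that also uses `\input` for section files would need those references filtered out before the comparison, since they are not pipeline outputs.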

Step 2: Analyse Dependencies

Parse script dependencies (which scripts create files that others load)
Map the execution order (follows the 6-layer pipeline, then analysis scripts)
Cross-reference the full execution order documented in `ReadMe.md`
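One way to turn "which scripts create files that others load" into an execution order is a topological sort over the script graph. The sketch below assumes the read/write inventory from Step 1 is already available as a dictionary; the script and data file names are hypothetical placeholders, not the project's actual files.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(io_map):
    """io_map maps script -> {"reads": [...], "writes": [...]}.
    Returns the scripts in an order where every producer runs before its consumers.
    Raises graphlib.CycleError if the scripts depend on each other circularly."""
    producer = {}
    for script, io in io_map.items():
        for path in io.get("writes", []):
            producer[path] = script

    sorter = TopologicalSorter()
    for script in sorted(io_map):
        deps = {
            producer[path]
            for path in io_map[script].get("reads", [])
            if path in producer and producer[path] != script
        }
        sorter.add(script, *sorted(deps))
    return list(sorter.static_order())


# Hypothetical inventory: raw inputs have no producer and impose no ordering.
EXAMPLE_IO = {
    "09_analysis/estimate.R": {
        "reads": ["data/final/panel.parquet"],
        "writes": ["output/tables/main.tex"],
    },
    "02_data_cleaning/clean.R": {
        "reads": ["data/raw/edm.csv"],
        "writes": ["data/processed/clean.parquet"],
    },
    "06_analysis_datasets/assemble.R": {
        "reads": ["data/processed/clean.parquet"],
        "writes": ["data/final/panel.parquet"],
    },
}

for step, script in enumerate(execution_order(EXAMPLE_IO), start=1):
    print(step, script)
```

The derived order can then be compared with the order documented in `ReadMe.md`; any disagreement is worth presenting to the user rather than resolving silently.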

Step 3: Assemble Package

Create in `Replication/` (or the specified directory):

README.md — AEA format:
- Data availability statement (which data is public vs restricted)
- Computational requirements (R version, packages, PostGIS, Python)
- Program descriptions (what each script does)
- Replication instructions (step-by-step)
- Expected runtime
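A draft README can be checked against the five items above with a simple keyword search. This is only a sketch: the keywords are assumptions derived from the list above, not the AEA template's exact headings, and the draft text is a placeholder.

```python
# Section label -> keyword that should appear somewhere in the README (assumed keywords).
REQUIRED_SECTIONS = {
    "Data availability statement": "data availability",
    "Computational requirements": "computational requirements",
    "Program descriptions": "program",
    "Replication instructions": "instructions",
    "Expected runtime": "runtime",
}


def missing_readme_sections(readme_text):
    """Return the required sections whose keyword does not occur in the README text."""
    text = readme_text.lower()
    return [section for section, keyword in REQUIRED_SECTIONS.items() if keyword not in text]


# Placeholder draft with two items still missing.
EXAMPLE_README = """# Replication package

## Data availability statement
(to be written)

## Computational requirements
(to be written)

## Program descriptions
(to be written)
"""

print(missing_readme_sections(EXAMPLE_README))
```

A keyword match only shows that a topic is mentioned; whether each section is actually complete still needs a manual read.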

master.R — Runs everything in order:

```r
# Master replication script for "Sewage in Our Waters"
# Estimated runtime: [X hours]

source(here::here("scripts", "R", "01_data_ingestion", "script.R"))
# ... through all layers
source(here::here("scripts", "R", "09_analysis", "subdir", "script.R"))
```

install_packages.R — If renv is not used:

```r
install.packages(c("tidyverse", "fixest", "modelsummary", ...))
```

DEPOSIT_CHECKLIST.md — Pre-deposit verification

Step 4: Validate

Run the 10 verification checks (equivalent to `/audit-replication`):

Script execution order is correct
All data file references resolve
All output files are generated
Package versions documented
No hardcoded absolute paths
Data provenance documented
README completeness (AEA format)
Output cross-reference (every table/figure traced to a script)
Restricted data properly flagged
Master script runs without modification
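As an illustration of one of these checks, the "no hardcoded absolute paths" test can be approximated by scanning script text for quoted strings that start with a drive letter, a home-directory prefix, or `~/`. The pattern below is an assumption about what counts as machine-specific and will miss some cases; the example script and its paths are made up.

```python
import re

# Quoted strings beginning with C:/ or C:\, /Users/, /home/, /mnt/, /Volumes/, or ~/
ABSOLUTE_PATH = re.compile(
    r"""["'](?:[A-Za-z]:[\\/]|/(?:Users|home|mnt|Volumes)/|~/)[^"']*["']"""
)


def find_absolute_paths(script_text):
    """Return (line number, quoted path) for each suspected hardcoded absolute path."""
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        # Naive comment stripping for R and Python: also cuts a '#' inside a string.
        code = line.split("#", 1)[0]
        for match in ABSOLUTE_PATH.finditer(code):
            hits.append((lineno, match.group(0)))
    return hits


# Made-up R script used only to exercise the check.
EXAMPLE_SCRIPT = '''library(here)
df <- read.csv("C:/Users/me/project/data/raw/edm.csv")
prices <- arrow::read_parquet(here::here("data", "final", "prices.parquet"))
setwd("/home/me/sewage")  # machine-specific
'''

for lineno, path in find_absolute_paths(EXAMPLE_SCRIPT):
    print(f"line {lineno}: {path}")
```

Paths built with `here::here()` pass the check, which is consistent with how the master script sources each layer.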

Step 5: Present Results

- Package contents: all files in `Replication/`
- Script order: numbered sequence with dependency graph
- Data availability: public vs restricted datasets
- Verification result: X/10 checks passed
- Deposit steps: openICPSR / Zenodo instructions

Principles

AEA Data Editor standards are the target: README format, package versions, and data access statements.
Don't rename scripts without approval. Present ordering first, let the user decide.
Thorough data provenance. Every dataset documented with source, access date, and restrictions.
Test before declaring ready. Always validate after assembly.
Document restricted data clearly. Land Registry and Zoopla data may have access restrictions.