Medsci-skills replicate-study
Replicate an existing cohort study's methodology on a different database. Extracts study design from a source paper, maps variables to the target DB via harmonization table, generates analysis code, and produces a replication difference report.
install
source · Clone the upstream repo
git clone https://github.com/Aperivue/medsci-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aperivue/medsci-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/replicate-study" ~/.claude/skills/aperivue-medsci-skills-replicate-study && rm -rf "$T"
manifest:
skills/replicate-study/SKILL.md
Replicate Study Skill
You are assisting a medical researcher in replicating an existing published study's methodology on a different database. This is a common research strategy: take a validated methodology from Paper A (e.g., NHIS cohort study) and apply it to Database B (e.g., KNHANES, NHANES, or another cohort) to produce a new paper with the same analytical rigor.
When to Use
- Researcher has a published paper they want to replicate on their own data
- Swapping exposure/outcome variables within the same DB
- Cross-national replication (e.g., Korean study → US data, or vice versa)
- Extending a single-institution study to a national cohort
Inputs
- Source paper: PDF, DOI, or markdown of the paper to replicate
- Target database path: CSV/SAS data file(s) to use
- Harmonization table (optional): CSV mapping source → target variables
  - Default (if KNHANES↔NHANES): `${SKILL_DIR}/references/harmonization_knhanes_nhanes.csv`
Reference Files
- `${SKILL_DIR}/references/methodology_extraction_template.md` — checklist for extracting study design
- `${SKILL_DIR}/references/harmonization_knhanes_nhanes.csv` — KNHANES↔NHANES variable mapping (67 rows)
- `${SKILL_DIR}/references/harmonization_3country.csv` — KNHANES+NHANES+CHNS 3-country mapping (45 rows, if available)
- Upstream templates (read on demand):
  - `medsci-skills/skills/write-paper/references/paper_types/nhis_cohort.md`
  - `medsci-skills/skills/write-paper/references/paper_types/cross_national.md`
  - `medsci-skills/skills/analyze-stats/references/analysis_guides/survey_weighted.md`
  - `medsci-skills/skills/analyze-stats/references/analysis_guides/propensity_score.md`
Workflow
Phase 1: Source Paper Analysis
- Read the source paper (PDF → text, or markdown).
- Extract methodology using the extraction template:
- Study design: cohort / cross-sectional / case-control
- Database: name, country, years, N
- Population: inclusion/exclusion criteria, age range
- Exposure: variable name, definition, coding
- Outcome: variable name, definition, coding
- Covariates: full list with definitions
- Statistical methods: regression type, adjustment model, subgroup analyses
- Survey design: weights, strata, PSU (if applicable)
- Sensitivity analyses: list all
- Output: structured extraction summary for user review.
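The extraction summary above can be captured as a simple structured object before handing it to the user for review. A minimal sketch, assuming nothing about the skill's actual internal schema (the class, field names, and all values below are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class StudyExtraction:
    """Hypothetical container for the Phase 1 extraction summary."""
    design: str                 # cohort / cross-sectional / case-control
    database: dict              # name, country, years, N
    population: dict            # inclusion/exclusion criteria, age range
    exposure: dict              # variable name, definition, coding
    outcome: dict               # variable name, definition, coding
    covariates: list = field(default_factory=list)
    statistical_methods: dict = field(default_factory=dict)
    survey_design: dict = field(default_factory=dict)   # weights, strata, PSU
    sensitivity_analyses: list = field(default_factory=list)

# Illustrative values only, not taken from any real paper
extraction = StudyExtraction(
    design="cross-sectional",
    database={"name": "NHANES", "country": "US", "years": "2017-2018"},
    population={"age_range": "19+", "exclusions": ["pregnancy"]},
    exposure={"name": "depression", "coding": "PHQ-9 >= 10"},
    outcome={"name": "diabetes", "coding": "FPG >= 126 mg/dL or self-report"},
    covariates=["age", "sex", "bmi", "smoking"],
)
```

Keeping the summary structured this way makes the Phase 2 mapping step mechanical: each field maps one-to-one onto rows of the harmonization table.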
Phase 2: Variable Mapping
- Load the harmonization table (CSV with columns: domain, concept, source_var, target_var, notes).
- For each extracted variable (exposure, outcome, covariates):
- Find the matching row in the harmonization table
- Flag: DIRECT_MATCH / RECODE_NEEDED / NOT_AVAILABLE / PROXY_AVAILABLE
- Generate a mapping report:
- Green: directly available (no recoding)
- Yellow: available but needs recoding (document transformation)
- Red: not available in target DB (propose proxy or exclusion)
- Output: variable mapping table for user approval.
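The flagging logic above can be sketched in Python. This is illustrative only, assuming the harmonization CSV columns listed in the workflow (domain, concept, source_var, target_var, notes); the "recode" heuristic is a stand-in for whatever the notes column actually encodes:

```python
def map_variables(extracted_vars, harmonization_rows):
    """Match each extracted source variable against the harmonization table.

    harmonization_rows: iterable of dicts with keys
    domain, concept, source_var, target_var, notes.
    Returns {source_var: (target_var or None, flag)}.
    """
    by_source = {row["source_var"]: row for row in harmonization_rows}
    mapping = {}
    for var in extracted_vars:
        row = by_source.get(var)
        if row is None or not row["target_var"]:
            mapping[var] = (None, "NOT_AVAILABLE")            # red
        elif "recode" in row["notes"].lower():
            mapping[var] = (row["target_var"], "RECODE_NEEDED")  # yellow
        else:
            mapping[var] = (row["target_var"], "DIRECT_MATCH")   # green
    return mapping

# Toy harmonization rows (values illustrative, not from the real 67-row table)
rows = [
    {"domain": "demo", "concept": "age", "source_var": "age",
     "target_var": "RIDAGEYR", "notes": ""},
    {"domain": "exam", "concept": "bmi", "source_var": "HE_BMI",
     "target_var": "BMXBMI", "notes": "recode units"},
]
result = map_variables(["age", "HE_BMI", "smoking"], rows)
print(result)
```

A PROXY_AVAILABLE flag would be assigned the same way, keyed off whatever proxy marker the real table uses; the point is that every extracted variable gets exactly one status for the user to approve.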
Phase 3: Code Generation
- Generate analysis code (Python with `pandas` + R via `subprocess` for survey-weighted analysis):
  a. Data loading & cleaning: read target DB, apply inclusion/exclusion
  b. Variable derivation: recode variables per mapping table
  c. Survey design setup: define svydesign object (strata, PSU, weights)
  d. Table 1: demographics by exposure group (weighted)
  e. Main analysis: replicate the primary model (logistic/Cox/linear regression)
  f. Subgroup analyses: if specified in source paper
  g. Sensitivity analyses: replicate all listed in source paper
- Use `/analyze-stats` templates where available (survey_weighted, propensity_score).
- All code must be self-contained and reproducible.
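For step d, the weighted point estimate behind a Table 1 cell can be sketched with pandas on toy data. This shows only the weighted-mean arithmetic; variance estimation for KNHANES/NHANES still requires the full svydesign (strata, PSU, weights), typically via R's survey package:

```python
import pandas as pd

# Toy stand-in for the target DB; column names are illustrative
df = pd.DataFrame({
    "exposure": [0, 0, 1, 1, 1],
    "age":      [45, 60, 50, 70, 55],
    "weight":   [1.2, 0.8, 1.5, 0.5, 1.0],   # survey sampling weights
})

def weighted_mean(group, var, w="weight"):
    """Survey-weighted mean: sum(w * x) / sum(w)."""
    return (group[var] * group[w]).sum() / group[w].sum()

# Weighted mean age by exposure group (one Table 1 cell per group)
table1 = {g: weighted_mean(sub, "age") for g, sub in df.groupby("exposure")}
print(table1)  # {0: 51.0, 1: 55.0}
```

An unweighted mean on the same toy data would differ, which is exactly why the critical rules below forbid unweighted analysis when the source used weights.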
Phase 4: Difference Report
Generate a structured difference report documenting:
| Section | Content |
|---|---|
| Study Design | Same / Modified (explain) |
| Database | Source DB → Target DB (N, years, country) |
| Population | Inclusion/exclusion differences |
| Variable Mapping | Full mapping table with match status |
| Unavailable Variables | What's missing and how handled |
| Methodological Differences | Any forced changes (e.g., BMI cutoffs, LDL calculation) |
| Expected Differences | Why results may differ (population, measurement, cultural) |
Save as `replication_report.md` in the working directory.
Phase 5: Validation Checklist
Before reporting completion, verify:
- All source paper covariates accounted for (mapped, proxied, or documented as missing)
- Survey weights correctly applied (NEVER analyze unweighted if source used weights)
- Obesity/BMI cutoffs match target population standards (Asian vs WHO)
- Fasting requirements matched (fasting glucose, lipids)
- Age restrictions applied correctly
- Code runs without errors on target data
- Output tables match source paper structure
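Some of these checks can be automated before any model runs. A hypothetical guard for the survey-weight item (the function and default column name are assumptions, not part of the skill):

```python
import pandas as pd

def check_weights(df, weight_col="weight"):
    """Fail fast if survey weights are missing or degenerate before modeling."""
    if weight_col not in df.columns:
        raise ValueError(f"missing survey weight column: {weight_col}")
    col = df[weight_col]
    if col.isna().any() or (col <= 0).any():
        raise ValueError("invalid survey weights (missing or non-positive)")
    return True

demo = pd.DataFrame({"weight": [1.1, 0.9, 1.4]})
print(check_weights(demo))  # True
```

Running a guard like this at the top of the generated script turns a silent unweighted analysis into a hard error.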
Critical Rules
- Never pool data across surveys. Analyze each country's data with its own survey design.
- Document every deviation from the source methodology in the difference report.
- Use Asian BMI cutoffs (≥25 for obesity) when analyzing Korean data, even if the source used WHO cutoffs (≥30).
- LDL calculation: note if source used direct measurement vs Friedewald.
- Weighted analysis is mandatory for KNHANES/NHANES — never run unweighted models.
- IRB: note that KNHANES/NHANES are de-identified public data (IRB exempt or waived).
- Outdated source definitions: if the source paper used a pre-2023 definition that has since been superseded (e.g., NAFLD → MASLD 2023, CKD-EPI 2009 → 2021 race-free), call `/define-variables` to cross-check whether to mirror the legacy definition (pure replication) or upgrade to the current one (extension). Document the choice explicitly in the difference report.
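Two of these rules translate into small helpers. The cutoffs (Asian ≥25 vs WHO ≥30) and the Friedewald equation (LDL = TC - HDL - TG/5 in mg/dL, invalid at TG ≥ 400 mg/dL) are standard, but the functions themselves are illustrative sketches, not part of the skill:

```python
def is_obese(bmi, population="asian"):
    """Apply the population-appropriate obesity cutoff (Asian >=25, WHO >=30)."""
    cutoff = 25.0 if population == "asian" else 30.0
    return bmi >= cutoff

def friedewald_ldl(total_chol, hdl, triglycerides):
    """Estimate LDL (mg/dL) via Friedewald; undefined when TG >= 400 mg/dL."""
    if triglycerides >= 400:
        return None  # Friedewald is invalid here; require direct measurement
    return total_chol - hdl - triglycerides / 5.0

print(is_obese(27, "asian"), is_obese(27, "who"))  # True False
print(friedewald_ldl(200, 50, 150))                # 120.0
```

A BMI of 27 is obese under the Asian cutoff but not under WHO, which is precisely the kind of forced methodological difference the report must document.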
Output Files
{working_dir}/
├── replication_report.md — Structured difference report
├── variable_mapping.csv — Variable mapping table with match status
├── analysis_code.py — Main analysis script (Python + R calls)
├── analysis_code.R — R script for survey-weighted analysis
└── results/
    ├── table1.csv — Demographics table
    ├── main_results.csv — Primary analysis results
    └── subgroup_results.csv — Subgroup analysis results (if applicable)
Example Invocation
/replicate-study
Source paper: Joo 2026 (Psychiatry Research) — depression/diabetes cross-national
Target DB: /path/to/knhanes/HN18.csv
Harmonization: /path/to/harmonization_knhanes_nhanes.csv
Anti-Hallucination
- Never fabricate variable names, dataset column names, or variable codings. If a variable mapping is uncertain, output `[VERIFY: variable_name]` and ask the user to confirm against the data dictionary.
- Never fabricate statistical results — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.
- Never generate references from memory. Use `/search-lit` for all citations.
- If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.