Medsci-skills replicate-study
Replicate an existing cohort study's methodology on a different database. Extracts study design from a source paper, maps variables to the target DB via harmonization table, generates analysis code, and produces a replication difference report.
install
source · Clone the upstream repo
git clone https://github.com/Aperivue/medsci-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aperivue/medsci-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/replicate-study" ~/.claude/skills/aperivue-medsci-skills-replicate-study && rm -rf "$T"
manifest:
skills/replicate-study/SKILL.md
Replicate Study Skill
You are assisting a medical researcher in replicating an existing published study's methodology on a different database. This is a common research strategy: take a validated methodology from Paper A (e.g., NHIS cohort study) and apply it to Database B (e.g., KNHANES, NHANES, or another cohort) to produce a new paper with the same analytical rigor.
When to Use
- Researcher has a published paper they want to replicate on their own data
- Swapping exposure/outcome variables within the same DB
- Cross-national replication (e.g., Korean study → US data, or vice versa)
- Extending a single-institution study to a national cohort
Inputs
- Source paper: PDF, DOI, or markdown of the paper to replicate
- Target database path: CSV/SAS data file(s) to use
- Harmonization table (optional): CSV mapping source → target variables
  - Default (if KNHANES↔NHANES): `${SKILL_DIR}/references/harmonization_knhanes_nhanes.csv`
Reference Files
- `${SKILL_DIR}/references/methodology_extraction_template.md` — checklist for extracting study design
- `${SKILL_DIR}/references/harmonization_knhanes_nhanes.csv` — KNHANES↔NHANES variable mapping (67 rows)
- `${SKILL_DIR}/references/harmonization_3country.csv` — KNHANES+NHANES+CHNS 3-country mapping (45 rows, if available)
- Upstream templates (read on demand):
  - `medsci-skills/skills/write-paper/references/paper_types/nhis_cohort.md`
  - `medsci-skills/skills/write-paper/references/paper_types/cross_national.md`
  - `medsci-skills/skills/analyze-stats/references/analysis_guides/survey_weighted.md`
  - `medsci-skills/skills/analyze-stats/references/analysis_guides/propensity_score.md`
Workflow
Phase 1: Source Paper Analysis
- Read the source paper (PDF → text, or markdown).
- Extract methodology using the extraction template:
- Study design: cohort / cross-sectional / case-control
- Database: name, country, years, N
- Population: inclusion/exclusion criteria, age range
- Exposure: variable name, definition, coding
- Outcome: variable name, definition, coding
- Covariates: full list with definitions
- Statistical methods: regression type, adjustment model, subgroup analyses
- Survey design: weights, strata, PSU (if applicable)
- Sensitivity analyses: list all
- Output: structured extraction summary for user review.
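The extraction summary above can be captured as a simple structured object before handing it to the user for review. A minimal sketch, assuming nothing about the skill's actual internal schema (the class, field names, and all values below are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class StudyExtraction:
    """Hypothetical container for the Phase 1 extraction summary."""
    design: str                 # cohort / cross-sectional / case-control
    database: dict              # name, country, years, N
    population: dict            # inclusion/exclusion criteria, age range
    exposure: dict              # variable name, definition, coding
    outcome: dict               # variable name, definition, coding
    covariates: list = field(default_factory=list)
    statistical_methods: dict = field(default_factory=dict)
    survey_design: dict = field(default_factory=dict)   # weights, strata, PSU
    sensitivity_analyses: list = field(default_factory=list)

# Illustrative values only, not taken from any real paper
extraction = StudyExtraction(
    design="cross-sectional",
    database={"name": "NHANES", "country": "US", "years": "2017-2018"},
    population={"age_range": "19+", "exclusions": ["pregnancy"]},
    exposure={"name": "depression", "coding": "PHQ-9 >= 10"},
    outcome={"name": "diabetes", "coding": "FPG >= 126 mg/dL or self-report"},
    covariates=["age", "sex", "bmi", "smoking"],
)
```

Keeping the summary structured this way makes the Phase 2 mapping step mechanical: each field maps one-to-one onto rows of the harmonization table.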
Phase 2: Variable Mapping
- Load the harmonization table (CSV with columns: domain, concept, source_var, target_var, notes).
- For each extracted variable (exposure, outcome, covariates):
- Find the matching row in the harmonization table
- Flag: DIRECT_MATCH / RECODE_NEEDED / NOT_AVAILABLE / PROXY_AVAILABLE
- Generate a mapping report:
- Green: directly available (no recoding)
- Yellow: available but needs recoding (document transformation)
- Red: not available in target DB (propose proxy or exclusion)
- Output: variable mapping table for user approval.
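The flagging logic above can be sketched in Python. This is illustrative only, assuming the harmonization CSV columns listed in the workflow (domain, concept, source_var, target_var, notes); the "recode" heuristic is a stand-in for whatever the notes column actually encodes:

```python
def map_variables(extracted_vars, harmonization_rows):
    """Match each extracted source variable against the harmonization table.

    harmonization_rows: iterable of dicts with keys
    domain, concept, source_var, target_var, notes.
    Returns {source_var: (target_var or None, flag)}.
    """
    by_source = {row["source_var"]: row for row in harmonization_rows}
    mapping = {}
    for var in extracted_vars:
        row = by_source.get(var)
        if row is None or not row["target_var"]:
            mapping[var] = (None, "NOT_AVAILABLE")            # red
        elif "recode" in row["notes"].lower():
            mapping[var] = (row["target_var"], "RECODE_NEEDED")  # yellow
        else:
            mapping[var] = (row["target_var"], "DIRECT_MATCH")   # green
    return mapping

# Toy harmonization rows (values illustrative, not from the real 67-row table)
rows = [
    {"domain": "demo", "concept": "age", "source_var": "age",
     "target_var": "RIDAGEYR", "notes": ""},
    {"domain": "exam", "concept": "bmi", "source_var": "HE_BMI",
     "target_var": "BMXBMI", "notes": "recode units"},
]
result = map_variables(["age", "HE_BMI", "smoking"], rows)
print(result)
```

A PROXY_AVAILABLE flag would be assigned the same way, keyed off whatever proxy marker the real table uses; the point is that every extracted variable gets exactly one status for the user to approve.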
Phase 3: Code Generation
- Generate analysis code (Python with `pandas` + R via `subprocess` for survey-weighted analysis):
  a. Data loading & cleaning: read target DB, apply inclusion/exclusion
  b. Variable derivation: recode variables per mapping table
  c. Survey design setup: define svydesign object (strata, PSU, weights)
  d. Table 1: demographics by exposure group (weighted)
  e. Main analysis: replicate the primary model (logistic/Cox/linear regression)
  f. Subgroup analyses: if specified in source paper
  g. Sensitivity analyses: replicate all listed in source paper
- Use `/analyze-stats` templates where available (survey_weighted, propensity_score).
- All code must be self-contained and reproducible.
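For step d, the weighted point estimate behind a Table 1 cell can be sketched with pandas on toy data. This shows only the weighted-mean arithmetic; variance estimation for KNHANES/NHANES still requires the full svydesign (strata, PSU, weights), typically via R's survey package:

```python
import pandas as pd

# Toy stand-in for the target DB; column names are illustrative
df = pd.DataFrame({
    "exposure": [0, 0, 1, 1, 1],
    "age":      [45, 60, 50, 70, 55],
    "weight":   [1.2, 0.8, 1.5, 0.5, 1.0],   # survey sampling weights
})

def weighted_mean(group, var, w="weight"):
    """Survey-weighted mean: sum(w * x) / sum(w)."""
    return (group[var] * group[w]).sum() / group[w].sum()

# Weighted mean age by exposure group (one Table 1 cell per group)
table1 = {g: weighted_mean(sub, "age") for g, sub in df.groupby("exposure")}
print(table1)  # {0: 51.0, 1: 55.0}
```

An unweighted mean on the same toy data would differ, which is exactly why the critical rules below forbid unweighted analysis when the source used weights.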
Phase 4: Difference Report
Generate a structured difference report documenting:
| Section | Content |
|---|---|
| Study Design | Same / Modified (explain) |
| Database | Source DB → Target DB (N, years, country) |
| Population | Inclusion/exclusion differences |
| Variable Mapping | Full mapping table with match status |
| Unavailable Variables | What's missing and how handled |
| Methodological Differences | Any forced changes (e.g., BMI cutoffs, LDL calculation) |
| Expected Differences | Why results may differ (population, measurement, cultural) |
Save as `replication_report.md` in the working directory.
Phase 5: Validation Checklist
Before reporting completion, verify:
- All source paper covariates accounted for (mapped, proxied, or documented as missing)
- Survey weights correctly applied (NEVER analyze unweighted if source used weights)
- Obesity/BMI cutoffs match target population standards (Asian vs WHO)
- Fasting requirements matched (fasting glucose, lipids)
- Age restrictions applied correctly
- Code runs without errors on target data
- Output tables match source paper structure
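Some of these checks can be automated before any model runs. A hypothetical guard for the survey-weight item (the function and default column name are assumptions, not part of the skill):

```python
import pandas as pd

def check_weights(df, weight_col="weight"):
    """Fail fast if survey weights are missing or degenerate before modeling."""
    if weight_col not in df.columns:
        raise ValueError(f"missing survey weight column: {weight_col}")
    col = df[weight_col]
    if col.isna().any() or (col <= 0).any():
        raise ValueError("invalid survey weights (missing or non-positive)")
    return True

demo = pd.DataFrame({"weight": [1.1, 0.9, 1.4]})
print(check_weights(demo))  # True
```

Running a guard like this at the top of the generated script turns a silent unweighted analysis into a hard error.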
Critical Rules
- Never pool data across surveys. Analyze each country's data with its own survey design.
- Document every deviation from the source methodology in the difference report.
- Use Asian BMI cutoffs (≥25 for obesity) when analyzing Korean data, even if the source used WHO cutoffs (≥30).
- LDL calculation: note if source used direct measurement vs Friedewald.
- Weighted analysis is mandatory for KNHANES/NHANES — never run unweighted models.
- IRB: note that KNHANES/NHANES are de-identified public data (IRB exempt or waived).
- Outdated source definitions: if the source paper used a pre-2023 definition that has since been superseded (e.g., NAFLD → MASLD 2023, CKD-EPI 2009 → 2021 race-free), call `/define-variables` to cross-check whether to mirror the legacy definition (pure replication) or upgrade to the current one (extension). Document the choice explicitly in the difference report.
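Two of these rules translate into small helpers. The cutoffs (Asian ≥25 vs WHO ≥30) and the Friedewald equation (LDL = TC - HDL - TG/5 in mg/dL, invalid at TG ≥ 400 mg/dL) are standard, but the functions themselves are illustrative sketches, not part of the skill:

```python
def is_obese(bmi, population="asian"):
    """Apply the population-appropriate obesity cutoff (Asian >=25, WHO >=30)."""
    cutoff = 25.0 if population == "asian" else 30.0
    return bmi >= cutoff

def friedewald_ldl(total_chol, hdl, triglycerides):
    """Estimate LDL (mg/dL) via Friedewald; undefined when TG >= 400 mg/dL."""
    if triglycerides >= 400:
        return None  # Friedewald is invalid here; require direct measurement
    return total_chol - hdl - triglycerides / 5.0

print(is_obese(27, "asian"), is_obese(27, "who"))  # True False
print(friedewald_ldl(200, 50, 150))                # 120.0
```

A BMI of 27 is obese under the Asian cutoff but not under WHO, which is precisely the kind of forced methodological difference the report must document.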
Output Files
{working_dir}/
├── replication_report.md — Structured difference report
├── variable_mapping.csv — Variable mapping table with match status
├── analysis_code.py — Main analysis script (Python + R calls)
├── analysis_code.R — R script for survey-weighted analysis
└── results/
    ├── table1.csv — Demographics table
    ├── main_results.csv — Primary analysis results
    └── subgroup_results.csv — Subgroup analysis results (if applicable)
Example Invocation
/replicate-study
Source paper: Joo 2026 (Psychiatry Research) — depression/diabetes cross-national
Target DB: /path/to/knhanes/HN18.csv
Harmonization: /path/to/harmonization_knhanes_nhanes.csv
Anti-Hallucination
- Never fabricate variable names, dataset column names, or variable codings. If a variable mapping is uncertain, output `[VERIFY: variable_name]` and ask the user to confirm against the data dictionary.
- Never fabricate statistical results — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.
- Never generate references from memory. Use `/search-lit` for all citations.
- If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.