Medsci-skills replicate-study

Replicate an existing cohort study's methodology on a different database. The skill extracts the study design from a source paper, maps variables to the target DB via a harmonization table, generates analysis code, and produces a replication difference report.

install
source · Clone the upstream repo
git clone https://github.com/Aperivue/medsci-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aperivue/medsci-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/replicate-study" ~/.claude/skills/aperivue-medsci-skills-replicate-study && rm -rf "$T"
manifest: skills/replicate-study/SKILL.md

Replicate Study Skill

You are assisting a medical researcher in replicating an existing published study's methodology on a different database. This is a common research strategy: take a validated methodology from Paper A (e.g., an NHIS cohort study) and apply it to Database B (e.g., KNHANES, NHANES, or another cohort) to produce a new paper with the same analytical rigor.

When to Use

  • Researcher has a published paper they want to replicate on their own data
  • Swapping exposure/outcome variables within the same DB
  • Cross-national replication (e.g., Korean study → US data, or vice versa)
  • Extending a single-institution study to a national cohort

Inputs

  1. Source paper: PDF, DOI, or markdown of the paper to replicate
  2. Target database path: CSV/SAS data file(s) to use
  3. Harmonization table (optional): CSV mapping source → target variables
    • Default: ${SKILL_DIR}/references/harmonization_knhanes_nhanes.csv (if KNHANES↔NHANES)

Reference Files

  • ${SKILL_DIR}/references/methodology_extraction_template.md: checklist for extracting study design
  • ${SKILL_DIR}/references/harmonization_knhanes_nhanes.csv: KNHANES↔NHANES variable mapping (67 rows)
  • ${SKILL_DIR}/references/harmonization_3country.csv: KNHANES+NHANES+CHNS 3-country mapping (45 rows, if available)
  • Upstream templates (read on demand):
    • medsci-skills/skills/write-paper/references/paper_types/nhis_cohort.md
    • medsci-skills/skills/write-paper/references/paper_types/cross_national.md
    • medsci-skills/skills/analyze-stats/references/analysis_guides/survey_weighted.md
    • medsci-skills/skills/analyze-stats/references/analysis_guides/propensity_score.md

Workflow

Phase 1: Source Paper Analysis

  1. Read the source paper (PDF → text, or markdown).
  2. Extract methodology using the extraction template:
    • Study design: cohort / cross-sectional / case-control
    • Database: name, country, years, N
    • Population: inclusion/exclusion criteria, age range
    • Exposure: variable name, definition, coding
    • Outcome: variable name, definition, coding
    • Covariates: full list with definitions
    • Statistical methods: regression type, adjustment model, subgroup analyses
    • Survey design: weights, strata, PSU (if applicable)
    • Sensitivity analyses: list all
  3. Output: structured extraction summary for user review.
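
The extraction summary in step 3 could be held in a small structure like the one below. This is a minimal sketch, not part of the skill: the class name `StudyExtraction`, its field names, and the `missing_fields` helper are hypothetical, and the fields simply mirror the checklist items above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class StudyExtraction:
    """Hypothetical container mirroring the Phase 1 extraction checklist."""
    study_design: str = ""          # cohort / cross-sectional / case-control
    database: str = ""              # name, country, years, N
    population: str = ""            # inclusion/exclusion criteria, age range
    exposure: str = ""              # variable name, definition, coding
    outcome: str = ""               # variable name, definition, coding
    covariates: list = field(default_factory=list)
    statistical_methods: str = ""   # regression type, adjustment model, subgroups
    survey_design: str = ""         # weights, strata, PSU (if applicable)
    sensitivity_analyses: list = field(default_factory=list)

    def missing_fields(self):
        """Return the names of checklist items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example: a partially filled extraction, ready for user review.
draft = StudyExtraction(study_design="cross-sectional", covariates=["age", "sex"])
print(draft.missing_fields())
```

Listing the empty items makes it easy to show the user what still needs to be pulled from the paper before moving on to variable mapping.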

Phase 2: Variable Mapping

  1. Load the harmonization table (CSV with columns: domain, concept, source_var, target_var, notes).
  2. For each extracted variable (exposure, outcome, covariates):
    • Find the matching row in the harmonization table
    • Flag: DIRECT_MATCH / RECODE_NEEDED / NOT_AVAILABLE / PROXY_AVAILABLE
  3. Generate a mapping report:
    • Green: directly available (no recoding)
    • Yellow: available but needs recoding (document transformation)
    • Red: not available in target DB (propose proxy or exclusion)
  4. Output: variable mapping table for user approval.
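
The flagging step above can be sketched roughly as follows. The column names come from step 1; everything else is an assumption made for illustration: the sample rows and variable names are invented, and the convention that a `notes` value starting with "recode" or "proxy" determines the flag is hypothetical, since the skill does not say how the flag is derived.

```python
import csv
import io

# Invented harmonization rows; real names must come from the data dictionaries.
SAMPLE_TABLE = """domain,concept,source_var,target_var,notes
demographics,age,src_age,tgt_age,
anthropometry,bmi_category,src_bmi_cat,tgt_bmi,recode: derive category from continuous BMI
lifestyle,alcohol_use,src_alcohol,tgt_drink_freq,proxy: frequency only
laboratory,hs_crp,src_crp,,
"""


def classify(row):
    """Assign a match status to one row (the note prefixes are an assumed convention)."""
    if row is None or not row["target_var"].strip():
        return "NOT_AVAILABLE"
    note = row["notes"].strip().lower()
    if note.startswith("proxy"):
        return "PROXY_AVAILABLE"
    if note.startswith("recode"):
        return "RECODE_NEEDED"
    return "DIRECT_MATCH"


def map_variables(extracted_concepts, table_text):
    """Return {concept: status} for every concept extracted from the source paper."""
    rows = {r["concept"]: r for r in csv.DictReader(io.StringIO(table_text))}
    return {c: classify(rows.get(c)) for c in extracted_concepts}


statuses = map_variables(
    ["age", "bmi_category", "alcohol_use", "hs_crp", "sleep_duration"], SAMPLE_TABLE
)
for concept, status in statuses.items():
    print(f"{concept}: {status}")
```

A concept with no row at all (here `sleep_duration`) is treated the same as one with an empty `target_var`, so nothing extracted from the source paper silently drops out of the mapping report.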

Phase 3: Code Generation

  1. Generate analysis code (Python with pandas, plus R via subprocess for survey-weighted analysis):
    a. Data loading & cleaning: read target DB, apply inclusion/exclusion
    b. Variable derivation: recode variables per mapping table
    c. Survey design setup: define svydesign object (strata, PSU, weights)
    d. Table 1: demographics by exposure group (weighted)
    e. Main analysis: replicate the primary model (logistic/Cox/linear regression)
    f. Subgroup analyses: if specified in source paper
    g. Sensitivity analyses: replicate all listed in source paper
  2. Use /analyze-stats templates where available (survey_weighted, propensity_score).
  3. All code must be self-contained and reproducible.
  3. All code must be self-contained and reproducible.
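
One way the Python-plus-R arrangement in step 1 might look is sketched below, covering only the survey design setup and a logistic main model. It assumes R's `survey` package (`svydesign`, `svyglm`) is the intended tool; the helper names, the placeholder column names, and the choice of `quasibinomial()` are illustrative assumptions rather than the skill's actual templates.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_r_script(data_path, outcome, exposure, covariates, weight, strata, psu):
    """Return R source that fits a survey-weighted logistic model."""
    rhs = " + ".join([exposure, *covariates])
    return f"""library(survey)
df <- read.csv("{data_path}")
design <- svydesign(ids = ~{psu}, strata = ~{strata}, weights = ~{weight},
                    data = df, nest = TRUE)
fit <- svyglm({outcome} ~ {rhs}, design = design, family = quasibinomial())
print(summary(fit))
"""


def run_r_script(script_text):
    """Write the script to a temp file and run it with Rscript, if R is installed."""
    if shutil.which("Rscript") is None:
        return None  # R not available; the caller decides how to report this
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "analysis_code.R"
        path.write_text(script_text)
        done = subprocess.run(["Rscript", str(path)], capture_output=True, text=True)
        return done.stdout


script = build_r_script(
    data_path="target.csv",                    # placeholder path
    outcome="outcome", exposure="exposure",    # placeholder column names
    covariates=["age", "sex"],
    weight="wt", strata="strata", psu="psu",   # placeholder design columns
)
print(script)
```

Generating the R source as text keeps the design variables in one place, so the same strata, PSU, and weight columns are reused for Table 1, the main model, and any subgroup runs.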

Phase 4: Difference Report

Generate a structured difference report documenting:

| Section | Content |
|---|---|
| Study Design | Same / Modified (explain) |
| Database | Source DB → Target DB (N, years, country) |
| Population | Inclusion/exclusion differences |
| Variable Mapping | Full mapping table with match status |
| Unavailable Variables | What's missing and how handled |
| Methodological Differences | Any forced changes (e.g., BMI cutoffs, LDL calculation) |
| Expected Differences | Why results may differ (population, measurement, cultural) |
Expected DifferencesWhy results may differ (population, measurement, cultural)

Save as replication_report.md in the working directory.
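
As a rough illustration of assembling the report, the sketch below renders the seven sections in a fixed order and marks any section that has no content yet. The function name, the heading style, and the placeholder text are assumptions; the skill does not prescribe the report's internal layout.

```python
REPORT_SECTIONS = [
    "Study Design",
    "Database",
    "Population",
    "Variable Mapping",
    "Unavailable Variables",
    "Methodological Differences",
    "Expected Differences",
]

PLACEHOLDER = "(not yet documented)"


def render_report(content):
    """Render all report sections in order, marking any that lack content."""
    lines = ["# Replication Difference Report", ""]
    for section in REPORT_SECTIONS:
        lines.append(f"## {section}")
        lines.append(content.get(section, PLACEHOLDER))
        lines.append("")
    return "\n".join(lines)


report = render_report({"Study Design": "Same (cross-sectional)"})
print(report)
# To save: pathlib.Path("replication_report.md").write_text(report)
```

Emitting every section even when empty makes gaps visible in the saved file instead of letting a missing section go unnoticed.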

Phase 5: Validation Checklist

Before reporting completion, verify:

  • All source paper covariates accounted for (mapped, proxied, or documented as missing)
  • Survey weights correctly applied (NEVER analyze unweighted if source used weights)
  • Obesity/BMI cutoffs match target population standards (Asian vs WHO)
  • Fasting requirements matched (fasting glucose, lipids)
  • Age restrictions applied correctly
  • Code runs without errors on target data
  • Output tables match source paper structure
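
The first checklist item lends itself to an automated check. The sketch below is one possible reading of "accounted for", using the match flags from Phase 2; the function name and the example covariates are hypothetical.

```python
# Statuses that count as mapped or proxied.
ACCOUNTED = {"DIRECT_MATCH", "RECODE_NEEDED", "PROXY_AVAILABLE"}


def unaccounted_covariates(source_covariates, mapping_status, documented_missing):
    """Return covariates that are neither mapped, proxied, nor documented as missing."""
    gaps = []
    for cov in source_covariates:
        status = mapping_status.get(cov)
        if status in ACCOUNTED:
            continue
        if status == "NOT_AVAILABLE" and cov in documented_missing:
            continue
        gaps.append(cov)
    return gaps


# Invented example: "smoking" was never mapped, "hs_crp" is missing but documented.
gaps = unaccounted_covariates(
    source_covariates=["age", "sex", "income", "hs_crp", "smoking"],
    mapping_status={
        "age": "DIRECT_MATCH",
        "sex": "DIRECT_MATCH",
        "income": "RECODE_NEEDED",
        "hs_crp": "NOT_AVAILABLE",
    },
    documented_missing={"hs_crp"},
)
print(gaps)  # ['smoking']
```

An empty list means the item passes; anything returned should be mapped, proxied, or written up in the difference report before completion is reported.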

Critical Rules

  1. Never pool data across surveys. Analyze each country's data with its own survey design.
  2. Document every deviation from the source methodology in the difference report.
  3. Use Asian BMI cutoffs (≥25 for obesity) when analyzing Korean data, even if the source used WHO cutoffs (≥30).
  4. LDL calculation: note if source used direct measurement vs Friedewald.
  5. Weighted analysis is mandatory for KNHANES/NHANES — never run unweighted models.
  6. IRB: note that KNHANES/NHANES are de-identified public data (IRB exempt or waived).
  7. Outdated source definitions: if the source paper used a pre-2023 definition that has since been superseded (e.g., NAFLD → MASLD 2023, CKD-EPI 2009 → 2021 race-free), call /define-variables to cross-check whether to mirror the legacy definition (pure replication) or upgrade to current (extension). Document the choice explicitly in the difference report.
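
Rules 3 and 4 can be made concrete with two small helpers. This is a sketch with hypothetical function names: the obesity thresholds are the ones stated in rule 3, and the LDL helper uses the standard Friedewald equation (LDL = total cholesterol − HDL − triglycerides/5, all in mg/dL), which is not considered valid at triglycerides of 400 mg/dL or above.

```python
# Obesity thresholds in kg/m^2, as stated in rule 3.
OBESITY_CUTOFF = {"asian": 25.0, "who": 30.0}


def is_obese(bmi, standard):
    """Classify obesity under the named standard ("asian" or "who")."""
    return bmi >= OBESITY_CUTOFF[standard]


def friedewald_ldl(total_chol, hdl, triglycerides):
    """Estimate LDL-C in mg/dL; return None when triglycerides are >= 400 mg/dL."""
    if triglycerides >= 400:
        return None
    return total_chol - hdl - triglycerides / 5


print(is_obese(27.0, "asian"))        # True
print(is_obese(27.0, "who"))          # False
print(friedewald_ldl(200, 50, 150))   # 120.0
print(friedewald_ldl(200, 50, 450))   # None
```

The same BMI of 27.0 is classified differently under the two standards, which is exactly the kind of forced change the difference report should record.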

Output Files

{working_dir}/
├── replication_report.md     — Structured difference report
├── variable_mapping.csv      — Variable mapping table with match status
├── analysis_code.py          — Main analysis script (Python + R calls)
├── analysis_code.R           — R script for survey-weighted analysis
└── results/
    ├── table1.csv            — Demographics table
    ├── main_results.csv      — Primary analysis results
    └── subgroup_results.csv  — Subgroup analysis results (if applicable)

Example Invocation

/replicate-study

Source paper: Joo 2026 (Psychiatry Research) — depression/diabetes cross-national
Target DB: /path/to/knhanes/HN18.csv
Harmonization: /path/to/harmonization_knhanes_nhanes.csv

Anti-Hallucination

  • Never fabricate variable names, dataset column names, or variable codings. If a variable mapping is uncertain, output [VERIFY: variable_name] and ask the user to confirm against the data dictionary.
  • Never fabricate statistical results — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.
  • Never generate references from memory. Use /search-lit for all citations.
  • If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.
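
The `[VERIFY: variable_name]` convention can be applied mechanically by checking every proposed column name against the target file's actual header. The helper below is a hypothetical sketch, and the header row is invented for the example.

```python
import csv
import io


def verify_columns(requested, header):
    """Return each requested name if present in the header, else a VERIFY marker."""
    available = set(header)
    return [name if name in available else f"[VERIFY: {name}]" for name in requested]


# Invented header row standing in for the first line of the target data file.
header = next(csv.reader(io.StringIO("id,age,sex,bmi\n")))
checked = verify_columns(["age", "bmi", "hdl_chol"], header)
print(checked)  # ['age', 'bmi', '[VERIFY: hdl_chol]']
```

Any name that comes back wrapped in a marker should be confirmed with the user against the data dictionary before it appears in generated code.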