Medsci-skills batch-cohort

Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern — same validated method, swap variables only. Produces batch R/Python code + summary matrix.

Install

Source · clone the upstream repo:
git clone https://github.com/Aperivue/medsci-skills

Claude Code · install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aperivue/medsci-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/batch-cohort" ~/.claude/skills/aperivue-medsci-skills-batch-cohort && rm -rf "$T"

Manifest: skills/batch-cohort/SKILL.md

Source content

Batch Cohort Analysis Skill

You are assisting a medical researcher in generating multiple analysis scripts from a single validated methodology template, each differing only in the exposure/outcome variable combination. This replicates the "80-person research team" pattern: one PI designs the methodology, and many researchers execute the same approach with different variable swaps.

When to Use

  • Researcher has a validated analysis template (e.g., from /replicate-study or /cross-national)
  • Wants to explore multiple exposure → outcome combinations on the same database
  • Goal: systematic variable-swap code generation + batch execution + result matrix

Inputs

  1. Database path(s): CSV/SAS data files (KNHANES, NHANES, NHIS, or any cleaned cohort)
  2. Methodology template: One of:
    • Path to a validated R/Python analysis script (from /replicate-study or /cross-national)
    • A paper type template name: nhis_cohort, cross_national, survey_weighted
    • A source paper to extract methodology from (falls back to /replicate-study Phase 1)
  3. Combination spec: A list of exposure/outcome pairs, provided as:
    • Inline list: exposures: [depression, obesity, smoking]; outcomes: [diabetes, hypertension, CVD]
    • CSV file with columns: exposure, outcome, (optional) subgroup_vars
    • "all" keyword: generates all pairwise combinations from the lists

Optional Inputs

  • Covariate set: Fixed covariate list for all analyses (default: use template's set)
  • Subgroup variables: Variables to stratify by (default: sex, age group)
  • Output format: code_only (just scripts) | execute (run + collect results) | full (code + results + summary)
  • Cross-national mode: If TRUE, generates paired scripts for both countries per combination

Workflow

Phase 1: Template Validation

  1. Read the methodology template (R script or paper type reference).
  2. Identify the slot variables — parts that change per combination:
    • EXPOSURE_VAR: raw variable name in the database
    • EXPOSURE_LABEL: human-readable label for tables/figures
    • EXPOSURE_CODING: how to derive the binary/categorical exposure
    • OUTCOME_VAR: raw variable name
    • OUTCOME_LABEL: human-readable label
    • OUTCOME_CODING: how to derive the binary outcome
  3. Verify the template runs successfully on at least one combination before batch generation.
  4. Output: template summary with identified slots → user approval.
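The slot-filling step can be sketched as a template substitution. This is illustrative only: the slot names follow the convention above, but the R body is a hypothetical stub, not the skill's actual template:

```python
# Hypothetical R template with the skill's slot-variable placeholders.
R_TEMPLATE = """\
# {EXPOSURE_LABEL} -> {OUTCOME_LABEL}
exposure <- df${EXPOSURE_VAR}
outcome  <- ifelse({OUTCOME_CODING}, 1, 0)
"""

slots = {
    "EXPOSURE_VAR": "BP_PHQ_sum",           # assumed column name
    "EXPOSURE_LABEL": "Depression (PHQ>=10)",
    "OUTCOME_LABEL": "Diabetes",
    "OUTCOME_CODING": "df$HE_glu >= 126",   # lab-only stub; real coding adds physician dx
}
script = R_TEMPLATE.format(**slots)
```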

Phase 2: Variable Specification

For each exposure and outcome in the combination spec:

  1. Look up the variable in the database:
    • KNHANES: check variable name exists in the CSV header
    • NHANES: check which table contains the variable (use codebook.csv if available)
    • NHIS: check claims code or variable name
  2. Define coding:
    • Binary: threshold or category mapping (e.g., HE_glu >= 126 → diabetes = 1)
    • Categorical: level definitions (e.g., smoking: current/former/never)
  3. Check covariate overlap: If the exposure IS one of the standard covariates, remove it from the adjustment set for that analysis (no self-adjustment).
  4. Output: combination matrix with all variable specifications.
| # | Exposure | Exposure Coding | Outcome | Outcome Coding | Covariates (adjusted) | Notes |
|---|----------|-----------------|---------|----------------|----------------------|-------|
| 1 | Depression (PHQ≥10) | BP_PHQ sum ≥10 | Diabetes | HE_glu≥126 or HbA1c≥6.5 or DE1_dg=1 | age,sex,edu,income,smoking,alcohol,obesity,CVD | — |
| 2 | Obesity (BMI≥25) | HE_obe ≥4 | Diabetes | same | age,sex,edu,income,smoking,alcohol,depression,CVD | obesity removed from covariates |
| ... | | | | | | |
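The covariate-overlap check in step 3 can be sketched as follows (Python; the covariate list matches Critical Rule 9, the alias map is a hypothetical example):

```python
STANDARD_COVARIATES = ["age", "sex", "edu", "income",
                       "smoking", "alcohol", "obesity", "CVD"]

def adjusted_covariates(exposure, covariates=STANDARD_COVARIATES):
    """Drop the exposure from the adjustment set: no self-adjustment."""
    # Hypothetical alias map: an exposure may overlap a covariate under another name.
    aliases = {"BMI": "obesity"}
    drop = aliases.get(exposure, exposure)
    return [c for c in covariates if c != drop]

print(adjusted_covariates("obesity"))  # obesity removed, 7 covariates remain
```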

Phase 3: Batch Code Generation

For each combination in the matrix:

  1. Clone the template script.
  2. Replace slot variables with the combination-specific values.
  3. Adjust covariates: Remove exposure variable from covariate list if present.
  4. Set output paths: Each combination gets its own results subdirectory.
  5. Generate a master runner script (run_all.R or run_all.sh) that:
    • Executes all N scripts sequentially (or in parallel via future / parallel)
    • Captures errors per script without stopping the batch
    • Logs execution time per analysis
    • Captures errors per script without stopping the batch
    • Logs execution time per analysis
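The runner logic above can be sketched in Python (in practice the skill emits run_all.R or run_all.sh; paths and the glob pattern here are assumptions based on the Output Files layout):

```python
import subprocess
import time
from pathlib import Path

def run_batch(script_dir="scripts", log_path="logs/batch_execution.log"):
    """Run every numbered script; capture failures without stopping the batch."""
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)
    failures = []
    with open(log_path, "w") as log:
        for script in sorted(Path(script_dir).glob("[0-9]*_*.R")):
            start = time.time()
            proc = subprocess.run(["Rscript", str(script)],
                                  capture_output=True, text=True)
            elapsed = time.time() - start
            status = "OK" if proc.returncode == 0 else "FAIL"
            log.write(f"{script.name}\t{status}\t{elapsed:.1f}s\n")
            if proc.returncode != 0:
                # Keep the tail of stderr for the failed_runs report.
                failures.append((script.name, proc.stderr.strip()[-200:]))
    return failures
```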

Phase 4: Batch Execution (if execute or full mode)

  1. Run the master script.
  2. Collect results from each combination's output directory.
  3. Handle failures gracefully:
    • Log which combinations failed and why
    • Common failures: convergence issues, too few events, empty subgroups
    • Suggest fixes for failed combinations

Phase 5: Summary Matrix

Aggregate all results into a single summary:

Main Results Matrix (summary_matrix.csv):

| Exposure | Outcome | N | Events | Model 1 OR (95% CI) | Model 2 OR (95% CI) | Model 3 OR (95% CI) | p-value | Significant |
|---|---|---|---|---|---|---|---|---|
| Depression | Diabetes | 5,811 | 487 | 2.14 (1.52–3.01) | 1.89 (1.33–2.69) | 1.36 (0.91–2.05) | 0.137 | No |
| Obesity | Diabetes | 5,811 | 487 | 3.45 (2.71–4.39) | 3.38 (2.65–4.32) | 3.12 (2.42–4.02) | <0.001 | Yes |
| ... | | | | | | | | |

Subgroup Summary (subgroup_matrix.csv): Same format, stratified by subgroup variables.

Heatmap (optional): Visual matrix of effect sizes × significance, exposure on Y-axis, outcome on X-axis.
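A minimal sketch of the aggregation step (Python; the directory layout follows the Output Files section, but the per-combination CSV columns are assumed, not prescribed):

```python
import csv
from pathlib import Path

def build_summary(results_dir="results", out="summary/summary_matrix.csv"):
    """Concatenate each combination's main_results.csv into one summary matrix."""
    rows = []
    for f in sorted(Path(results_dir).glob("*/main_results.csv")):
        with open(f, newline="") as fh:
            for row in csv.DictReader(fh):
                # Tag each row with its combination directory (e.g. 01_depression_diabetes).
                row["combination"] = f.parent.name
                rows.append(row)
    if rows:
        Path(out).parent.mkdir(parents=True, exist_ok=True)
        with open(out, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```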

Output Files

{working_dir}/batch_{timestamp}/
├── README.md                    — Batch run summary (N combinations, template used, date)
├── combination_matrix.csv       — All exposure/outcome specs with coding
├── template/
│   └── base_template.R          — The validated template (frozen copy)
├── scripts/
│   ├── 01_depression_diabetes.R
│   ├── 02_obesity_diabetes.R
│   ├── ...
│   └── run_all.R                — Master execution script
├── results/
│   ├── 01_depression_diabetes/
│   │   ├── table1.csv
│   │   ├── main_results.csv
│   │   └── subgroup_results.csv
│   ├── 02_obesity_diabetes/
│   │   └── ...
│   └── ...
├── summary/
│   ├── summary_matrix.csv       — Main results across all combinations
│   ├── subgroup_matrix.csv      — Subgroup results across all combinations
│   ├── failed_runs.csv          — Combinations that failed + error messages
│   └── heatmap.png              — Optional effect size × significance visual
└── logs/
    └── batch_execution.log      — Timing + error log

Critical Rules

  1. Never modify the core methodology across combinations — only swap exposure/outcome/covariates.
  2. Remove self-adjustment: If exposure = BMI, remove obesity from covariates. If exposure = education/income, remove the same variable from covariates. If outcome = MetS, consider removing obesity from covariates. Document all removals.
  3. Weighted analysis mandatory for KNHANES/NHANES/NHIS — inherited from template.
  4. Event count check: Before running, verify each outcome has ≥10 events per covariate (EPV rule). Flag underpowered combinations.
  5. Multiple comparisons: When generating >5 combinations, include a Bonferroni-corrected significance column in the summary matrix. Add a note about exploratory vs confirmatory framing.
  6. Reproducibility: Freeze the template version. Include a SHA256 hash of the data file in README.
  7. No p-hacking framing: The summary matrix is for hypothesis generation, not confirmation. State this explicitly in README and any manuscript output.
  8. Outcome definitions MUST include physician diagnosis: Diabetes = FPG≥126 OR HbA1c≥6.5 OR physician-diagnosed (KNHANES: DE1_dg=1, NHANES: DIQ010="Yes"). Hypertension = SBP≥140 OR DBP≥90 OR physician-diagnosed (KNHANES: DI1_dg=1, NHANES: BPQ020="Yes"). Lab-only definitions systematically overestimate exposure→outcome associations (validated: Joo 2026 replication showed US depression→DM wOR 1.92 without vs 1.54 with physician dx).
  9. Full covariate set is default: Always use 8 covariates (age, sex, education, income, smoking, alcohol, obesity, CVD) unless explicitly justified. Minimal models (age+sex+BMI only) overestimate effects due to residual confounding.
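Rules 4 and 5 can be sketched as simple checks (Python; thresholds come straight from the rules above, function names are illustrative):

```python
def epv_ok(n_events, n_covariates, epv_threshold=10):
    """Events-per-variable rule: require >=10 events per covariate."""
    return n_events >= epv_threshold * n_covariates

def bonferroni_significant(p_value, n_combinations, alpha=0.05):
    """Bonferroni-corrected significance, used when generating >5 combinations."""
    return p_value < alpha / n_combinations

print(epv_ok(487, 8))                    # True: 487 events >= 10 * 8 covariates
print(bonferroni_significant(0.001, 9))  # True: 0.001 < 0.05 / 9
```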

Cross-National Batch Mode

When cross_national: true:

  • Generate paired scripts for each combination (Korea + US)
  • Summary matrix includes both countries side-by-side
  • Direction agreement column: ✓ if both countries show same direction of effect
  • Uses /cross-national skill's dual-survey-design approach

Integration with Upstream Skills

| Need | Skill |
|---|---|
| Variable coding lookup | analyze-stats survey_weighted guide |
| Template creation from paper | /replicate-study Phase 1–3 |
| Cross-national paired analysis | /cross-national |
| ICD-10 claims algorithms | analyze-stats nhis_icd10_mapping guide |
| Write manuscript from results | /write-paper (nhis_cohort or cross_national type) |
| Figure generation | /make-figures (forest plot of all combinations) |

Example Invocations

Basic: Single DB, Multiple Exposures × Single Outcome

/batch-cohort

DB: /path/to/knhanes/HN18.csv
Template: /path/to/validated_analysis.R
Exposures: [depression, obesity, smoking, heavy_drinking, low_income, low_education]
Outcome: diabetes
Mode: full

Cross-National: Full Matrix

/batch-cohort

DB Korea: /path/to/knhanes/HN18.csv
DB US: /path/to/nhanes/
Template: cross_national
Exposures: [depression, obesity, smoking]
Outcomes: [diabetes, hypertension, metabolic_syndrome]
cross_national: true
Mode: execute

NHIS Cohort: Claims-Based Batch

/batch-cohort

DB: /path/to/nhis_sample_cohort.csv
Template: nhis_cohort
Exposures: [atrial_fibrillation, heart_failure, COPD, CKD]
Outcomes: [all_cause_mortality, cardiovascular_death, stroke]
Mode: code_only

Anti-Hallucination

  • Never fabricate variable names, dataset column names, or variable codings. If a variable mapping is uncertain, output [VERIFY: variable_name] and ask the user to confirm against the data dictionary.
  • Never fabricate statistical results — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.
  • Never generate references from memory. Use /search-lit for all citations.
  • If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.