Medsci-skills batch-cohort

Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern — same validated method, swap variables only. Produces batch R/Python code + summary matrix.

Install

Source · clone the upstream repo:
git clone https://github.com/Aperivue/medsci-skills

Claude Code · install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aperivue/medsci-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/batch-cohort" ~/.claude/skills/aperivue-medsci-skills-batch-cohort && rm -rf "$T"

Manifest: skills/batch-cohort/SKILL.md

Source content

Batch Cohort Analysis Skill

You are assisting a medical researcher in generating multiple analysis scripts from a single validated methodology template, each differing only in the exposure/outcome variable combination. This replicates the "80-person research team" pattern: one PI designs the methodology, and many researchers execute the same approach with different variable swaps.

When to Use

  • Researcher has a validated analysis template (e.g., from /replicate-study or /cross-national)
  • Wants to explore multiple exposure → outcome combinations on the same database
  • Goal: systematic variable-swap code generation + batch execution + result matrix

Inputs

  1. Database path(s): CSV/SAS data files (KNHANES, NHANES, NHIS, or any cleaned cohort)
  2. Methodology template: One of:
    • Path to a validated R/Python analysis script (from /replicate-study or /cross-national)
    • A paper type template name: nhis_cohort, cross_national, survey_weighted
    • A source paper to extract methodology from (falls back to /replicate-study Phase 1)
  3. Combination spec: A list of exposure/outcome pairs, provided as:
    • Inline list: exposures: [depression, obesity, smoking]; outcomes: [diabetes, hypertension, CVD]
    • CSV file with columns: exposure, outcome, (optional) subgroup_vars
    • "all" keyword: generates all pairwise combinations from the lists

Optional Inputs

  • Covariate set: Fixed covariate list for all analyses (default: use template's set)
  • Subgroup variables: Variables to stratify by (default: sex, age group)
  • Output format: code_only (just scripts) | execute (run + collect results) | full (code + results + summary)
  • Cross-national mode: If TRUE, generates paired scripts for both countries per combination

Workflow

Phase 1: Template Validation

  1. Read the methodology template (R script or paper type reference).
  2. Identify the slot variables — parts that change per combination:
    • EXPOSURE_VAR: raw variable name in the database
    • EXPOSURE_LABEL: human-readable label for tables/figures
    • EXPOSURE_CODING: how to derive the binary/categorical exposure
    • OUTCOME_VAR: raw variable name
    • OUTCOME_LABEL: human-readable label
    • OUTCOME_CODING: how to derive the binary outcome
  3. Verify the template runs successfully on at least one combination before batch generation.
  4. Output: template summary with identified slots → user approval.
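The slot-filling step can be sketched as a template substitution. This is illustrative only: the slot names follow the convention above, but the R body is a hypothetical stub, not the skill's actual template:

```python
# Hypothetical R template with the skill's slot-variable placeholders.
R_TEMPLATE = """\
# {EXPOSURE_LABEL} -> {OUTCOME_LABEL}
exposure <- df${EXPOSURE_VAR}
outcome  <- ifelse({OUTCOME_CODING}, 1, 0)
"""

slots = {
    "EXPOSURE_VAR": "BP_PHQ_sum",           # assumed column name
    "EXPOSURE_LABEL": "Depression (PHQ>=10)",
    "OUTCOME_LABEL": "Diabetes",
    "OUTCOME_CODING": "df$HE_glu >= 126",   # lab-only stub; real coding adds physician dx
}
script = R_TEMPLATE.format(**slots)
```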

Phase 2: Variable Specification

For each exposure and outcome in the combination spec:

  1. Look up the variable in the database:
    • KNHANES: check variable name exists in the CSV header
    • NHANES: check which table contains the variable (use codebook.csv if available)
    • NHIS: check claims code or variable name
  2. Define coding:
    • Binary: threshold or category mapping (e.g., HE_glu >= 126 → diabetes = 1)
    • Categorical: level definitions (e.g., smoking: current/former/never)
  3. Check covariate overlap: If the exposure IS one of the standard covariates, remove it from the adjustment set for that analysis (no self-adjustment).
  4. Output: combination matrix with all variable specifications.
| # | Exposure | Exposure Coding | Outcome | Outcome Coding | Covariates (adjusted) | Notes |
|---|----------|-----------------|---------|----------------|----------------------|-------|
| 1 | Depression (PHQ≥10) | BP_PHQ sum ≥10 | Diabetes | HE_glu≥126 or HbA1c≥6.5 or DE1_dg=1 | age,sex,edu,income,smoking,alcohol,obesity,CVD | — |
| 2 | Obesity (BMI≥25) | HE_obe ≥4 | Diabetes | same | age,sex,edu,income,smoking,alcohol,depression,CVD | obesity removed from covariates |
| ... | | | | | | |
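The covariate-overlap check in step 3 can be sketched as follows (Python; the covariate list matches Critical Rule 9, the alias map is a hypothetical example):

```python
STANDARD_COVARIATES = ["age", "sex", "edu", "income",
                       "smoking", "alcohol", "obesity", "CVD"]

def adjusted_covariates(exposure, covariates=STANDARD_COVARIATES):
    """Drop the exposure from the adjustment set: no self-adjustment."""
    # Hypothetical alias map: an exposure may overlap a covariate under another name.
    aliases = {"BMI": "obesity"}
    drop = aliases.get(exposure, exposure)
    return [c for c in covariates if c != drop]

print(adjusted_covariates("obesity"))  # obesity removed, 7 covariates remain
```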

Phase 3: Batch Code Generation

For each combination in the matrix:

  1. Clone the template script.
  2. Replace slot variables with the combination-specific values.
  3. Adjust covariates: Remove exposure variable from covariate list if present.
  4. Set output paths: Each combination gets its own results subdirectory.
  5. Generate a master runner script (run_all.R or run_all.sh) that:
    • Executes all N scripts sequentially (or in parallel via future / parallel)
    • Captures errors per script without stopping the batch
    • Logs execution time per analysis
    • Captures errors per script without stopping the batch
    • Logs execution time per analysis
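The runner logic above can be sketched in Python (in practice the skill emits run_all.R or run_all.sh; paths and the glob pattern here are assumptions based on the Output Files layout):

```python
import subprocess
import time
from pathlib import Path

def run_batch(script_dir="scripts", log_path="logs/batch_execution.log"):
    """Run every numbered script; capture failures without stopping the batch."""
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)
    failures = []
    with open(log_path, "w") as log:
        for script in sorted(Path(script_dir).glob("[0-9]*_*.R")):
            start = time.time()
            proc = subprocess.run(["Rscript", str(script)],
                                  capture_output=True, text=True)
            elapsed = time.time() - start
            status = "OK" if proc.returncode == 0 else "FAIL"
            log.write(f"{script.name}\t{status}\t{elapsed:.1f}s\n")
            if proc.returncode != 0:
                # Keep the tail of stderr for the failed_runs report.
                failures.append((script.name, proc.stderr.strip()[-200:]))
    return failures
```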

Phase 4: Batch Execution (if execute or full mode)

  1. Run the master script.
  2. Collect results from each combination's output directory.
  3. Handle failures gracefully:
    • Log which combinations failed and why
    • Common failures: convergence issues, too few events, empty subgroups
    • Suggest fixes for failed combinations

Phase 5: Summary Matrix

Aggregate all results into a single summary:

Main Results Matrix (summary_matrix.csv):

| Exposure | Outcome | N | Events | Model 1 OR (95% CI) | Model 2 OR (95% CI) | Model 3 OR (95% CI) | p-value | Significant |
|---|---|---|---|---|---|---|---|---|
| Depression | Diabetes | 5,811 | 487 | 2.14 (1.52–3.01) | 1.89 (1.33–2.69) | 1.36 (0.91–2.05) | 0.137 | No |
| Obesity | Diabetes | 5,811 | 487 | 3.45 (2.71–4.39) | 3.38 (2.65–4.32) | 3.12 (2.42–4.02) | <0.001 | Yes |
| ... | | | | | | | | |

Subgroup Summary (subgroup_matrix.csv): Same format, stratified by subgroup variables.

Heatmap (optional): Visual matrix of effect sizes × significance, exposure on Y-axis, outcome on X-axis.
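A minimal sketch of the aggregation step (Python; the directory layout follows the Output Files section, but the per-combination CSV columns are assumed, not prescribed):

```python
import csv
from pathlib import Path

def build_summary(results_dir="results", out="summary/summary_matrix.csv"):
    """Concatenate each combination's main_results.csv into one summary matrix."""
    rows = []
    for f in sorted(Path(results_dir).glob("*/main_results.csv")):
        with open(f, newline="") as fh:
            for row in csv.DictReader(fh):
                # Tag each row with its combination directory (e.g. 01_depression_diabetes).
                row["combination"] = f.parent.name
                rows.append(row)
    if rows:
        Path(out).parent.mkdir(parents=True, exist_ok=True)
        with open(out, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```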

Output Files

{working_dir}/batch_{timestamp}/
├── README.md                    — Batch run summary (N combinations, template used, date)
├── combination_matrix.csv       — All exposure/outcome specs with coding
├── template/
│   └── base_template.R          — The validated template (frozen copy)
├── scripts/
│   ├── 01_depression_diabetes.R
│   ├── 02_obesity_diabetes.R
│   ├── ...
│   └── run_all.R                — Master execution script
├── results/
│   ├── 01_depression_diabetes/
│   │   ├── table1.csv
│   │   ├── main_results.csv
│   │   └── subgroup_results.csv
│   ├── 02_obesity_diabetes/
│   │   └── ...
│   └── ...
├── summary/
│   ├── summary_matrix.csv       — Main results across all combinations
│   ├── subgroup_matrix.csv      — Subgroup results across all combinations
│   ├── failed_runs.csv          — Combinations that failed + error messages
│   └── heatmap.png              — Optional effect size × significance visual
└── logs/
    └── batch_execution.log      — Timing + error log

Critical Rules

  1. Never modify the core methodology across combinations — only swap exposure/outcome/covariates.
  2. Remove self-adjustment: If exposure = BMI, remove obesity from covariates. If exposure = education/income, remove the same variable from covariates. If outcome = MetS, consider removing obesity from covariates. Document all removals.
  3. Weighted analysis mandatory for KNHANES/NHANES/NHIS — inherited from template.
  4. Event count check: Before running, verify each outcome has ≥10 events per covariate (EPV rule). Flag underpowered combinations.
  5. Multiple comparisons: When generating >5 combinations, include a Bonferroni-corrected significance column in the summary matrix. Add a note about exploratory vs confirmatory framing.
  6. Reproducibility: Freeze the template version. Include a SHA256 hash of the data file in README.
  7. No p-hacking framing: The summary matrix is for hypothesis generation, not confirmation. State this explicitly in README and any manuscript output.
  8. Outcome definitions MUST include physician diagnosis: Diabetes = FPG≥126 OR HbA1c≥6.5 OR physician-diagnosed (KNHANES: DE1_dg=1, NHANES: DIQ010="Yes"). Hypertension = SBP≥140 OR DBP≥90 OR physician-diagnosed (KNHANES: DI1_dg=1, NHANES: BPQ020="Yes"). Lab-only definitions systematically overestimate exposure→outcome associations (validated: Joo 2026 replication showed US depression→DM wOR 1.92 without vs 1.54 with physician dx).
  9. Full covariate set is default: Always use 8 covariates (age, sex, education, income, smoking, alcohol, obesity, CVD) unless explicitly justified. Minimal models (age+sex+BMI only) overestimate effects due to residual confounding.
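Rules 4 and 5 can be sketched as simple checks (Python; thresholds come straight from the rules above, function names are illustrative):

```python
def epv_ok(n_events, n_covariates, epv_threshold=10):
    """Events-per-variable rule: require >=10 events per covariate."""
    return n_events >= epv_threshold * n_covariates

def bonferroni_significant(p_value, n_combinations, alpha=0.05):
    """Bonferroni-corrected significance, used when generating >5 combinations."""
    return p_value < alpha / n_combinations

print(epv_ok(487, 8))                    # True: 487 events >= 10 * 8 covariates
print(bonferroni_significant(0.001, 9))  # True: 0.001 < 0.05 / 9
```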

Cross-National Batch Mode

When cross_national: true:

  • Generate paired scripts for each combination (Korea + US)
  • Summary matrix includes both countries side-by-side
  • Direction agreement column: ✓ if both countries show same direction of effect
  • Uses /cross-national skill's dual-survey-design approach

Integration with Upstream Skills

| Need | Skill |
|---|---|
| Variable coding lookup | analyze-stats survey_weighted guide |
| Template creation from paper | /replicate-study Phase 1–3 |
| Cross-national paired analysis | /cross-national |
| ICD-10 claims algorithms | analyze-stats nhis_icd10_mapping guide |
| Write manuscript from results | /write-paper (nhis_cohort or cross_national type) |
| Figure generation | /make-figures (forest plot of all combinations) |

Example Invocations

Basic: Single DB, Multiple Exposures × Single Outcome

/batch-cohort

DB: /path/to/knhanes/HN18.csv
Template: /path/to/validated_analysis.R
Exposures: [depression, obesity, smoking, heavy_drinking, low_income, low_education]
Outcome: diabetes
Mode: full

Cross-National: Full Matrix

/batch-cohort

DB Korea: /path/to/knhanes/HN18.csv
DB US: /path/to/nhanes/
Template: cross_national
Exposures: [depression, obesity, smoking]
Outcomes: [diabetes, hypertension, metabolic_syndrome]
cross_national: true
Mode: execute

NHIS Cohort: Claims-Based Batch

/batch-cohort

DB: /path/to/nhis_sample_cohort.csv
Template: nhis_cohort
Exposures: [atrial_fibrillation, heart_failure, COPD, CKD]
Outcomes: [all_cause_mortality, cardiovascular_death, stroke]
Mode: code_only

Anti-Hallucination

  • Never fabricate variable names, dataset column names, or variable codings. If a variable mapping is uncertain, output [VERIFY: variable_name] and ask the user to confirm against the data dictionary.
  • Never fabricate statistical results — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.
  • Never generate references from memory. Use /search-lit for all citations.
  • If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.