Medsci-skills batch-cohort
Generate N analysis scripts from a single methodology template × multiple exposure/outcome combinations. The "80-person team" pattern — same validated method, swap variables only. Produces batch R/Python code + summary matrix.
git clone https://github.com/Aperivue/medsci-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aperivue/medsci-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/batch-cohort" ~/.claude/skills/aperivue-medsci-skills-batch-cohort && rm -rf "$T"
skills/batch-cohort/SKILL.mdBatch Cohort Analysis Skill
You are assisting a medical researcher in generating multiple analysis scripts from a single validated methodology template, each differing only in the exposure/outcome variable combination. This replicates the "80-person research team" pattern: one PI designs the methodology, and many researchers execute the same approach with different variable swaps.
When to Use
- Researcher has a validated analysis template (e.g., from /replicate-study or /cross-national)
- Wants to explore multiple exposure → outcome combinations on the same database
- Goal: systematic variable-swap code generation + batch execution + result matrix
Inputs
- Database path(s): CSV/SAS data files (KNHANES, NHANES, NHIS, or any cleaned cohort)
- Methodology template: One of:
- Path to a validated R/Python analysis script (from /replicate-study or /cross-national)
- A paper type template name:
,nhis_cohort
,cross_nationalsurvey_weighted - A source paper to extract methodology from (falls back to /replicate-study Phase 1)
- Combination spec: A list of exposure/outcome pairs, provided as:
- Inline list:
exposures: [depression, obesity, smoking]; outcomes: [diabetes, hypertension, CVD] - CSV file with columns:
,exposure
, (optional)outcomesubgroup_vars
keyword: generates all pairwise combinations from the lists"all"
- Inline list:
Optional Inputs
- Covariate set: Fixed covariate list for all analyses (default: use template's set)
- Subgroup variables: Variables to stratify by (default: sex, age group)
- Output format:
(just scripts) |code_only
(run + collect results) |execute
(code + results + summary)full - Cross-national mode: If TRUE, generates paired scripts for both countries per combination
Workflow
Phase 1: Template Validation
- Read the methodology template (R script or paper type reference).
- Identify the slot variables — parts that change per combination:
: raw variable name in the databaseEXPOSURE_VAR
: human-readable label for tables/figuresEXPOSURE_LABEL
: how to derive binary/categorical exposureEXPOSURE_CODING
: raw variable nameOUTCOME_VAR
: human-readable labelOUTCOME_LABEL
: how to derive binary outcomeOUTCOME_CODING
- Verify the template runs successfully on at least one combination before batch generation.
- Output: template summary with identified slots → user approval.
Phase 2: Variable Specification
For each exposure and outcome in the combination spec:
- Look up the variable in the database:
- KNHANES: check variable name exists in the CSV header
- NHANES: check which table contains the variable (use codebook.csv if available)
- NHIS: check claims code or variable name
- Define coding:
- Binary: threshold or category mapping (e.g.,
)HE_glu >= 126 → diabetes = 1 - Categorical: level definitions (e.g.,
)smoking: current/former/never
- Binary: threshold or category mapping (e.g.,
- Check covariate overlap: If the exposure IS one of the standard covariates, remove it from the adjustment set for that analysis (no self-adjustment).
- Output: combination matrix with all variable specifications.
| # | Exposure | Exposure Coding | Outcome | Outcome Coding | Covariates (adjusted) | Notes | |---|----------|-----------------|---------|----------------|----------------------|-------| | 1 | Depression (PHQ≥10) | BP_PHQ sum ≥10 | Diabetes | HE_glu≥126|HbA1c≥6.5|DE1_dg=1 | age,sex,edu,income,smoking,alcohol,obesity,CVD | — | | 2 | Obesity (BMI≥25) | HE_obe ≥4 | Diabetes | same | age,sex,edu,income,smoking,alcohol,depression,CVD | obesity removed from covariates | | ... | | | | | | |
Phase 3: Batch Code Generation
For each combination in the matrix:
- Clone the template script.
- Replace slot variables with the combination-specific values.
- Adjust covariates: Remove exposure variable from covariate list if present.
- Set output paths: Each combination gets its own results subdirectory.
- Generate a master runner script (
orrun_all.R
) that:run_all.sh- Executes all N scripts sequentially (or in parallel via
/future
)parallel - Captures errors per script without stopping the batch
- Logs execution time per analysis
- Executes all N scripts sequentially (or in parallel via
Phase 4: Batch Execution (if execute
or full
mode)
executefull- Run the master script.
- Collect results from each combination's output directory.
- Handle failures gracefully:
- Log which combinations failed and why
- Common failures: convergence issues, too few events, empty subgroups
- Suggest fixes for failed combinations
Phase 5: Summary Matrix
Aggregate all results into a single summary:
Main Results Matrix (
summary_matrix.csv):
| Exposure | Outcome | N | Events | Model 1 OR (95% CI) | Model 2 OR (95% CI) | Model 3 OR (95% CI) | p-value | Significant |
|---|---|---|---|---|---|---|---|---|
| Depression | Diabetes | 5,811 | 487 | 2.14 (1.52–3.01) | 1.89 (1.33–2.69) | 1.36 (0.91–2.05) | 0.137 | No |
| Obesity | Diabetes | 5,811 | 487 | 3.45 (2.71–4.39) | 3.38 (2.65–4.32) | 3.12 (2.42–4.02) | <0.001 | Yes |
| ... |
Subgroup Summary (
subgroup_matrix.csv): Same format, stratified by subgroup variables.
Heatmap (optional): Visual matrix of effect sizes × significance, exposure on Y-axis, outcome on X-axis.
Output Files
{working_dir}/batch_{timestamp}/ ├── README.md — Batch run summary (N combinations, template used, date) ├── combination_matrix.csv — All exposure/outcome specs with coding ├── template/ │ └── base_template.R — The validated template (frozen copy) ├── scripts/ │ ├── 01_depression_diabetes.R │ ├── 02_obesity_diabetes.R │ ├── ... │ └── run_all.R — Master execution script ├── results/ │ ├── 01_depression_diabetes/ │ │ ├── table1.csv │ │ ├── main_results.csv │ │ └── subgroup_results.csv │ ├── 02_obesity_diabetes/ │ │ └── ... │ └── ... ├── summary/ │ ├── summary_matrix.csv — Main results across all combinations │ ├── subgroup_matrix.csv — Subgroup results across all combinations │ ├── failed_runs.csv — Combinations that failed + error messages │ └── heatmap.png — Optional effect size × significance visual └── logs/ └── batch_execution.log — Timing + error log
Critical Rules
- Never modify the core methodology across combinations — only swap exposure/outcome/covariates.
- Remove self-adjustment: If exposure = BMI, remove obesity from covariates. If exposure = education/income, remove the same variable from covariates. If outcome = MetS, consider removing obesity from covariates. Document all removals.
- Weighted analysis mandatory for KNHANES/NHANES/NHIS — inherited from template.
- Event count check: Before running, verify each outcome has ≥10 events per covariate (EPV rule). Flag underpowered combinations.
- Multiple comparisons: When generating >5 combinations, include a Bonferroni-corrected significance column in the summary matrix. Add a note about exploratory vs confirmatory framing.
- Reproducibility: Freeze the template version. Include a SHA256 hash of the data file in README.
- No p-hacking framing: The summary matrix is for hypothesis generation, not confirmation. State this explicitly in README and any manuscript output.
- Outcome definitions MUST include physician diagnosis: Diabetes = FPG≥126 OR HbA1c≥6.5 OR physician-diagnosed (KNHANES: DE1_dg=1, NHANES: DIQ010="Yes"). Hypertension = SBP≥140 OR DBP≥90 OR physician-diagnosed (KNHANES: DI1_dg=1, NHANES: BPQ020="Yes"). Lab-only definitions systematically overestimate exposure→outcome associations (validated: Joo 2026 replication showed US depression→DM wOR 1.92 without vs 1.54 with physician dx).
- Full covariate set is default: Always use 8 covariates (age, sex, education, income, smoking, alcohol, obesity, CVD) unless explicitly justified. Minimal models (age+sex+BMI only) overestimate effects due to residual confounding.
Cross-National Batch Mode
When
cross_national: true:
- Generate paired scripts for each combination (Korea + US)
- Summary matrix includes both countries side-by-side
- Direction agreement column: ✓ if both countries show same direction of effect
- Uses /cross-national skill's dual-survey-design approach
Integration with Upstream Skills
| Need | Skill |
|---|---|
| Variable coding lookup | survey_weighted guide |
| Template creation from paper | Phase 1–3 |
| Cross-national paired analysis | |
| ICD-10 claims algorithms | nhis_icd10_mapping guide |
| Write manuscript from results | (nhis_cohort or cross_national type) |
| Figure generation | (forest plot of all combinations) |
Example Invocations
Basic: Single DB, Multiple Exposures × Single Outcome
/batch-cohort DB: /path/to/knhanes/HN18.csv Template: /path/to/validated_analysis.R Exposures: [depression, obesity, smoking, heavy_drinking, low_income, low_education] Outcome: diabetes Mode: full
Cross-National: Full Matrix
/batch-cohort DB Korea: /path/to/knhanes/HN18.csv DB US: /path/to/nhanes/ Template: cross_national Exposures: [depression, obesity, smoking] Outcomes: [diabetes, hypertension, metabolic_syndrome] cross_national: true Mode: execute
NHIS Cohort: Claims-Based Batch
/batch-cohort DB: /path/to/nhis_sample_cohort.csv Template: nhis_cohort Exposures: [atrial_fibrillation, heart_failure, COPD, CKD] Outcomes: [all_cause_mortality, cardiovascular_death, stroke] Mode: code_only
Anti-Hallucination
- Never fabricate variable names, dataset column names, or variable codings. If a variable mapping is uncertain, output
and ask the user to confirm against the data dictionary.[VERIFY: variable_name] - Never fabricate statistical results — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.
- Never generate references from memory. Use
for all citations./search-lit - If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.