LLMs-Universal-Life-Science-and-Clinical-Skills- CHIC_ML_Framework_Agent

<!--

<!-- # COPYRIGHT NOTICE # This file is part of the "Universal Biomedical Skills" project. # Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu> # All Rights Reserved. # # This code is proprietary and confidential. # Unauthorized copying of this file, via any medium is strictly prohibited. # # Provenance: Authenticated by MD BABU MIA -->

name: 'chic-ml-framework-agent'
description: 'Machine learning framework for inferring high-risk clonal hematopoiesis from complete blood count data without sequencing, reducing the number needed to sequence for CHIP screening.'
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:

  • read_file
  • run_shell_command

CHIC ML Framework Agent

The CHIC (Clonal Hematopoiesis Inference from Counts) ML Framework Agent uses machine learning to identify individuals with high-risk clonal hematopoiesis (CH) from routine complete blood count (CBC) data alone. Developed and validated on 431,531 UK Biobank participants, CHIC reduces the "number needed to sequence" (NNS) from 727 to 40 individuals per case of high-risk CH, which makes population-scale screening for clonal hematopoiesis of indeterminate potential (CHIP) possible without universal sequencing.

When to Use This Skill

  • When screening large populations for CHIP without sequencing.
  • For prioritizing individuals who should undergo genetic testing.
  • To flag possible undiagnosed CCUS (clonal cytopenia of undetermined significance) or MDS (myelodysplastic syndromes) from routine blood counts.
  • When sequencing resources are limited but CHIP detection is needed.
  • For research studies requiring CHIP prevalence estimation.

Core Capabilities

  1. CBC-Based CHIP Prediction: Predict CHIP presence from blood counts.

  2. High-Risk CH Identification: Focus on clinically relevant clones.

  3. Efficient Screening: Dramatically reduce number needed to sequence.

  4. CCUS/MDS Detection: Flag individuals whose counts suggest undiagnosed CCUS or MDS.

  5. Population Stratification: Risk-stratify for targeted sequencing.

  6. Longitudinal Monitoring: Track CBC changes suggesting CH development.

CHIC Model Features

| CBC Parameter | Importance | CH Association |
|---|---|---|
| Red Cell Distribution Width (RDW) | High | ↑ with CH |
| Mean Corpuscular Volume (MCV) | High | ↑ (macrocytosis) |
| Hemoglobin | Moderate | ↓ in CCUS |
| Platelet Count | Moderate | Variable |
| White Blood Cell Count | Moderate | Often ↑ |
| Mean Platelet Volume (MPV) | Moderate | May ↑ |
| Red Blood Cell Count | Low | ↓ with anemia |

Performance Metrics

| Metric | Value | Interpretation |
|---|---|---|
| AUROC | 0.75 | Good discrimination |
| Sensitivity (high threshold) | 85% | High-risk CH capture |
| NNS (standard) | 727 | Without CHIC |
| NNS (with CHIC) | 40 | 18x improvement |
| PPV (stringent cutoff) | 12% | Enriched population |
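
The two NNS figures are related by simple arithmetic. The sketch below is illustrative only: it assumes NNS is the reciprocal of the fraction of sequenced individuals who turn out to be cases, and the function names are hypothetical rather than part of the CHIC code.

```python
def yield_from_nns(nns: float) -> float:
    """Fraction of sequenced individuals expected to be cases (assumes NNS = 1 / yield)."""
    if nns <= 0:
        raise ValueError("NNS must be positive")
    return 1.0 / nns


def fold_improvement(nns_baseline: float, nns_guided: float) -> float:
    """How many times fewer individuals must be sequenced per case found."""
    if nns_baseline <= 0 or nns_guided <= 0:
        raise ValueError("NNS values must be positive")
    return nns_baseline / nns_guided


if __name__ == "__main__":
    print(f"{fold_improvement(727, 40):.1f}x")  # 18.2x
    print(f"{yield_from_nns(727):.2%}")         # 0.14%
    print(f"{yield_from_nns(40):.2%}")          # 2.50%
```

Under that assumption, moving from an NNS of 727 to 40 corresponds to raising the sequencing yield from about 0.14% to 2.5%.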

Workflow

  1. Input: CBC data (complete blood count panel).

  2. Feature Engineering: Calculate derived indices.

  3. Model Inference: Run CHIC tree-based ensemble.

  4. Risk Scoring: Generate CHIP probability score.

  5. Stratification: Categorize into risk groups.

  6. Recommendations: Prioritize for sequencing.

  7. Output: Risk scores, sequencing prioritization list.
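
Steps 5 to 7 can be sketched once per-sample scores exist. This is a minimal illustration, not the logic of chic_screening.py: the `prioritize` function, the `min_score` cutoff of 0.5, and the example scores are all assumptions made for the sketch.

```python
from typing import Dict, List, Tuple


def prioritize(scores: Dict[str, float], min_score: float = 0.5) -> List[Tuple[str, float]]:
    """Return (sample_id, score) pairs at or above min_score, highest score first."""
    selected = [(sid, score) for sid, score in scores.items() if score >= min_score]
    # Sort by descending score; break ties by sample_id so the order is deterministic.
    return sorted(selected, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Made-up scores for illustration; real scores would come from the CHIC model.
    example_scores = {"001": 0.31, "002": 0.84, "003": 0.62}
    for sample_id, score in prioritize(example_scores):
        print(sample_id, score)
```

Keeping the cutoff as a parameter lets the same ranking serve both a stringent and a more permissive sequencing budget.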

Example Usage

User: "Screen this population cohort using CHIC to identify individuals who should undergo CHIP sequencing."

Agent Action:

python3 Skills/Hematology/CHIC_ML_Framework_Agent/chic_screening.py \
    --cbc_data population_cbc.csv \
    --demographics demographics.csv \
    --model_weights chic_xgboost_ukbb.pt \
    --threshold stringent \
    --output_prioritization true \
    --output chic_screening_results/

Input Data Format

sample_id,age,sex,rbc,hgb,hct,mcv,mch,mchc,rdw,wbc,plt,mpv
001,68,M,4.2,13.5,40.2,95.7,32.1,33.6,14.8,7.2,245,10.2
002,72,F,3.8,11.2,34.5,90.8,29.5,32.5,16.2,8.5,312,9.8
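
A file in this layout can be checked before it is passed to the screening script. The loader below is a minimal sketch using only the standard library; `load_cbc` is a hypothetical helper, and the assumption that every column except `sample_id` and `sex` is numeric is inferred from the two example rows.

```python
import csv
import io
from typing import Dict, List, TextIO, Union

REQUIRED_COLUMNS = ["sample_id", "age", "sex", "rbc", "hgb", "hct", "mcv",
                    "mch", "mchc", "rdw", "wbc", "plt", "mpv"]
NUMERIC_COLUMNS = [c for c in REQUIRED_COLUMNS if c not in ("sample_id", "sex")]


def load_cbc(handle: TextIO) -> List[Dict[str, Union[str, float]]]:
    """Parse CBC rows, failing early if a required column is absent."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows: List[Dict[str, Union[str, float]]] = []
    for record in reader:
        # Keep sample_id as text so leading zeros such as "001" survive.
        row: Dict[str, Union[str, float]] = {
            "sample_id": record["sample_id"],
            "sex": record["sex"],
        }
        for column in NUMERIC_COLUMNS:
            row[column] = float(record[column])
        rows.append(row)
    return rows


SAMPLE = """sample_id,age,sex,rbc,hgb,hct,mcv,mch,mchc,rdw,wbc,plt,mpv
001,68,M,4.2,13.5,40.2,95.7,32.1,33.6,14.8,7.2,245,10.2
002,72,F,3.8,11.2,34.5,90.8,29.5,32.5,16.2,8.5,312,9.8
"""

rows = load_cbc(io.StringIO(SAMPLE))
```

Reading `sample_id` as a string matters here: a numeric parse would turn `001` into `1` and break joins against the demographics file.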

Output Components

| Output | Description | Format |
|---|---|---|
| CHIC Score | Per-individual CH probability | .csv |
| Risk Category | Low/Moderate/High/Very High | .csv |
| Sequencing Priority | Ranked list for testing | .csv |
| Feature Importance | CBC parameters driving score | .json |
| Potential CCUS | Flagged individuals | .csv |
| Population Statistics | Cohort-level summary | .json |

Risk Stratification

| CHIC Score | Risk Category | Recommendation |
|---|---|---|
| >0.8 | Very High | Prioritize sequencing |
| 0.5-0.8 | High | Recommend sequencing |
| 0.2-0.5 | Moderate | Consider if other risk factors |
| <0.2 | Low | Routine monitoring |
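
The score bands translate directly into a lookup. The sketch below is one possible reading of the table: it assumes a score of exactly 0.8 falls in High and exactly 0.2 in Moderate (as the strict `>0.8` and `<0.2` bounds imply), and it assigns the otherwise ambiguous 0.5 to High. The function name is hypothetical.

```python
RECOMMENDATIONS = {
    "Very High": "Prioritize sequencing",
    "High": "Recommend sequencing",
    "Moderate": "Consider if other risk factors",
    "Low": "Routine monitoring",
}


def risk_category(score: float) -> str:
    """Map a CHIC probability score in [0, 1] to a risk category."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score > 0.8:
        return "Very High"
    if score >= 0.5:  # assumption: the shared boundary 0.5 belongs to High
        return "High"
    if score >= 0.2:
        return "Moderate"
    return "Low"


if __name__ == "__main__":
    for value in (0.9, 0.8, 0.5, 0.2, 0.1):
        category = risk_category(value)
        print(value, category, "->", RECOMMENDATIONS[category])
```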

Model Architecture

| Component | Method | Purpose |
|---|---|---|
| Base Model | XGBoost | Primary prediction |
| Secondary | Random Forest | Ensemble member |
| Tertiary | Gradient Boosting | Ensemble member |
| Ensemble | Stacking | Final prediction |
| Calibration | Isotonic regression | Probability calibration |
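
The calibration row refers to isotonic regression, which fits a non-decreasing mapping from raw scores to observed outcome rates. The following is a dependency-free sketch of the pool-adjacent-violators algorithm that underlies it. It illustrates the technique only and is not the CHIC calibration code; it assumes distinct scores, and the data in the usage line are made up.

```python
from typing import List, Sequence, Tuple


def isotonic_fit(scores: Sequence[float], labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Fit a non-decreasing step function to labels ordered by score (PAVA)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    pairs = sorted(zip(scores, labels))
    blocks: List[List[float]] = []  # each block holds [label_sum, count]
    for _, label in pairs:
        blocks.append([float(label), 1.0])
        # Pool with the previous block while its mean exceeds the new block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            label_sum, count = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
    fitted: List[float] = []
    for label_sum, count in blocks:
        fitted.extend([label_sum / count] * int(count))
    return [score for score, _ in pairs], fitted


if __name__ == "__main__":
    xs, ys = isotonic_fit([0.4, 0.1, 0.3, 0.2], [1, 0, 0, 1])
    print(xs)  # [0.1, 0.2, 0.3, 0.4]
    print(ys)  # [0.0, 0.5, 0.5, 1.0]
```

In practice a library implementation such as scikit-learn's would normally be used, fitted on held-out predictions from the stacked ensemble.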

AI/ML Components

Feature Engineering:

  • Derived indices (RDW/MCV ratio, etc.)
  • Age-sex adjusted values
  • Longitudinal trajectory features
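
The source does not give exact feature definitions, so the sketch below shows only plausible forms of two of the items above: the RDW/MCV ratio and a simple per-year change between the first and last measurement. Both helper names and the example trajectory are assumptions for illustration.

```python
from typing import Dict, Sequence


def add_derived_indices(row: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of a CBC row with the RDW/MCV ratio added."""
    if row["mcv"] <= 0:
        raise ValueError("MCV must be positive")
    enriched = dict(row)
    enriched["rdw_mcv_ratio"] = row["rdw"] / row["mcv"]
    return enriched


def change_per_year(values: Sequence[float], years: Sequence[float]) -> float:
    """Average yearly change between the first and last of a series of measurements."""
    if len(values) != len(years) or len(values) < 2:
        raise ValueError("need at least two time-matched measurements")
    span = years[-1] - years[0]
    if span <= 0:
        raise ValueError("measurements must be in increasing time order")
    return (values[-1] - values[0]) / span


if __name__ == "__main__":
    print(add_derived_indices({"rdw": 14.8, "mcv": 95.7})["rdw_mcv_ratio"])
    # Hypothetical RDW readings taken three years apart.
    print(change_per_year([13.9, 14.8], [0.0, 3.0]))
```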

Model Training:

  • UK Biobank 431,531 participants
  • 20,860 sequencing-confirmed CH
  • Tree-based ensemble methods

Threshold Optimization:

  • Sensitivity-specificity tradeoffs
  • NNS optimization
  • Cost-effectiveness modeling

Clinical Applications

| Application | CHIC Role | Benefit |
|---|---|---|
| Population Screening | Prioritize sequencing | Cost reduction |
| Pre-transplant | Flag high-risk donors | Safety |
| Pre-CAR-T | Identify at-risk patients | Outcomes |
| Primary Care | Alert for referral | Early detection |
| Research | Estimate CH prevalence | Study design |

Validation Results

| Cohort | N | AUROC | NNS Improvement |
|---|---|---|---|
| UK Biobank (discovery) | 300,000 | 0.75 | 18x |
| UK Biobank (validation) | 131,531 | 0.74 | 17x |
| External validation | TBD | ~0.72 | ~15x |

Prerequisites

  • Python 3.10+
  • XGBoost, scikit-learn
  • pandas, numpy
  • CHIC model weights
  • CBC reference ranges

Related Skills

  • CHIP_Clonal_Hematopoiesis_Agent - Full CHIP analysis with sequencing
  • MPN_Progression_Monitor_Agent - MPN monitoring
  • Blood_Smear_AI_Agent - Morphology analysis
  • MDS_Classification_Agent - MDS diagnosis

CBC Red Flags for CH

| Finding | CHIC Impact | Significance |
|---|---|---|
| RDW >15% | High score | Anisocytosis |
| MCV >100 fL | Increased | Macrocytosis |
| Unexplained anemia | Flagged | Potential CCUS |
| Cytopenias | Very high | Likely CCUS/MDS |
| Persistent changes | Escalated | Progressive CH |
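
The two findings with explicit numeric cutoffs can be checked with a simple rule. This is a sketch only: `cbc_red_flags` is a hypothetical helper, and the anemia, cytopenia, and persistence rows are left out because they depend on reference ranges and serial data that this document does not specify.

```python
from typing import Dict, List


def cbc_red_flags(row: Dict[str, float]) -> List[str]:
    """Return the threshold-based red flags present in one CBC row."""
    flags: List[str] = []
    if row["rdw"] > 15.0:
        flags.append("RDW >15% (anisocytosis)")
    if row["mcv"] > 100.0:
        flags.append("MCV >100 fL (macrocytosis)")
    return flags


if __name__ == "__main__":
    # Values taken from the two example rows in the input format section.
    print(cbc_red_flags({"rdw": 14.8, "mcv": 95.7}))  # []
    print(cbc_red_flags({"rdw": 16.2, "mcv": 90.8}))  # ['RDW >15% (anisocytosis)']
```

These rules are a sanity check on individual values, not a substitute for the model score, which weighs all CBC parameters together.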

Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| Moderate PPV | False positives | Confirmatory sequencing |
| Training bias | Population-specific | Multi-cohort validation |
| Acute changes | Transient abnormalities | Repeat testing |
| Other causes | Non-CH cytopenias | Clinical context |

Special Considerations

  1. Population Calibration: Model trained on UK Biobank demographics
  2. Acute Illness: Exclude acutely ill patients
  3. Recent Transfusion: Affects CBC parameters
  4. Medications: Consider drug effects on CBC
  5. Serial Monitoring: Longitudinal changes informative

Cost-Effectiveness

| Scenario | Cost per CHIP Detected |
|---|---|
| Universal Sequencing | ~$73,000 |
| CHIC-Guided Sequencing | ~$4,000 |
| Improvement | 18x reduction |
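
These figures are consistent with multiplying NNS by a fixed price per sequencing test. The sketch below assumes a unit cost of $100 per test, a value inferred from $4,000 / 40 and not stated in the source; the function name is hypothetical.

```python
def cost_per_case(nns: float, cost_per_test: float) -> float:
    """Expected sequencing spend to find one case, ignoring all other costs."""
    if nns <= 0 or cost_per_test < 0:
        raise ValueError("NNS must be positive and cost non-negative")
    return nns * cost_per_test


if __name__ == "__main__":
    ASSUMED_COST_PER_TEST = 100.0  # assumption inferred from the table, in USD
    print(cost_per_case(727, ASSUMED_COST_PER_TEST))  # 72700.0
    print(cost_per_case(40, ASSUMED_COST_PER_TEST))   # 4000.0
```

Because the unit cost cancels out, the 18x reduction in cost per case is the same ratio as the NNS improvement.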

Future Directions

| Enhancement | Status | Impact |
|---|---|---|
| Multi-ethnic validation | In progress | Broader applicability |
| Longitudinal CHIC | Development | Dynamic risk |
| Integration with genetics | Research | Combined scoring |
| Point-of-care | Conceptual | Real-time screening |

Author

AI Group - Biomedical AI Platform

<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->