install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Genomics/copy-number/gatk-cnv" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-gatk-cnv && rm -rf "$T"
manifest:
Skills/Genomics/copy-number/gatk-cnv/SKILL.md
source content
<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->
name: bio-copy-number-gatk-cnv
description: Call copy number variants using GATK best practices workflow. Supports both somatic (tumor-normal) and germline CNV detection from WGS or WES data. Use when following GATK best practices or integrating CNV calling with other GATK variant pipelines.
tool_type: cli
primary_tool: gatk
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
GATK CNV Workflow
Somatic CNV Workflow Overview
1. PreprocessIntervals → intervals.interval_list
2. CollectReadCounts → sample.counts.hdf5
3. CreateReadCountPanelOfNormals → pon.hdf5
4. DenoiseReadCounts → sample.denoised.tsv
5. CollectAllelicCounts → sample.allelicCounts.tsv
6. ModelSegments → sample.modelFinal.seg
7. CallCopyRatioSegments → sample.called.seg
Step 1: Preprocess Intervals
    # For WES/targeted
    gatk PreprocessIntervals \
        -R reference.fa \
        -L targets.interval_list \
        --bin-length 0 \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O preprocessed.interval_list

    # For WGS
    gatk PreprocessIntervals \
        -R reference.fa \
        --bin-length 1000 \
        --padding 0 \
        -O wgs.interval_list
Step 2: Collect Read Counts
    # For each sample
    gatk CollectReadCounts \
        -R reference.fa \
        -I sample.bam \
        -L preprocessed.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O sample.counts.hdf5
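Since the command above runs once per sample, a small loop can derive each output name from the BAM file name. The following is a minimal dry-run sketch that only prints the commands for inspection. The helper name `print_count_commands` and the idea of keeping all BAMs in one directory are assumptions for illustration, not part of the upstream skill.

```shell
# Hypothetical helper: print one CollectReadCounts command per BAM in a directory.
# Dry run only: nothing is executed, the commands are written to stdout.
# Arguments: <bam_dir> <interval_list> <reference_fasta>
print_count_commands() (
    bam_dir=$1
    intervals=$2
    reference=$3
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue            # glob matched nothing
        sample=$(basename "$bam" .bam)       # sample.bam -> sample
        echo gatk CollectReadCounts \
            -R "$reference" \
            -I "$bam" \
            -L "$intervals" \
            --interval-merging-rule OVERLAPPING_ONLY \
            -O "${sample}.counts.hdf5"
    done
)
```

Calling `print_count_commands bams preprocessed.interval_list reference.fa` would list one command per BAM; paths containing spaces would need extra quoting before the output could be re-executed.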
Step 3: Create Panel of Normals
    # Combine multiple normal samples
    gatk CreateReadCountPanelOfNormals \
        -I normal1.counts.hdf5 \
        -I normal2.counts.hdf5 \
        -I normal3.counts.hdf5 \
        --minimum-interval-median-percentile 5.0 \
        -O cnv_pon.hdf5
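With more than a handful of normals, typing each `-I` by hand gets error-prone. The sketch below assembles the same command from every `*.counts.hdf5` file in a directory and prints it rather than running it. The helper name `print_pon_command` and the single-directory layout are assumptions made for this example.

```shell
# Hypothetical helper: build a CreateReadCountPanelOfNormals command with one
# -I argument per *.counts.hdf5 file found in a directory, and print it.
# Arguments: <normals_dir> <output_pon>
print_pon_command() (
    dir=$1
    out=$2
    set -- gatk CreateReadCountPanelOfNormals
    for f in "$dir"/*.counts.hdf5; do
        [ -e "$f" ] || continue              # glob matched nothing
        set -- "$@" -I "$f"
    done
    set -- "$@" --minimum-interval-median-percentile 5.0 -O "$out"
    echo "$*"
)
```

For example, `print_pon_command normals cnv_pon.hdf5` prints a single command line that can be reviewed before it is run.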
Step 4: Denoise Read Counts
    # Using panel of normals
    gatk DenoiseReadCounts \
        -I tumor.counts.hdf5 \
        --count-panel-of-normals cnv_pon.hdf5 \
        --standardized-copy-ratios tumor.standardized.tsv \
        --denoised-copy-ratios tumor.denoised.tsv
Step 5: Collect Allelic Counts
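    # From known SNP sites (for LOH detection)
    gatk CollectAllelicCounts \
        -R reference.fa \
        -I tumor.bam \
        -L common_snps.vcf \
        -O tumor.allelicCounts.tsv

    # Repeat for the matched normal; Step 6 reads normal.allelicCounts.tsv
    gatk CollectAllelicCounts \
        -R reference.fa \
        -I normal.bam \
        -L common_snps.vcf \
        -O normal.allelicCounts.tsv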
Step 6: Model Segments
    # Somatic with matched normal allelic counts
    gatk ModelSegments \
        --denoised-copy-ratios tumor.denoised.tsv \
        --allelic-counts tumor.allelicCounts.tsv \
        --normal-allelic-counts normal.allelicCounts.tsv \
        --output-prefix tumor \
        -O results/

    # Output files: tumor.cr.seg, tumor.modelFinal.seg, tumor.hets.tsv
Step 7: Call Copy Ratio Segments
    gatk CallCopyRatioSegments \
        -I results/tumor.cr.seg \
        -O results/tumor.called.seg
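A quick sanity check on the called segments is to tally how many were called amplified, deleted, or neutral. The sketch below assumes the `.called.seg` file is tab-separated, starts with SAM-style `@` header lines, and has a column named `CALL` holding `+`, `-`, or `0`; verify that layout against your own output before relying on it. The function name `summarize_calls` is hypothetical.

```shell
# Hypothetical helper: count segment calls in a .called.seg file.
# Assumes: '@' header lines, then a tab-separated column header containing CALL.
summarize_calls() {
    awk -F'\t' '
        /^@/ { next }                        # skip SAM-style header lines
        !col {                               # first remaining line: locate CALL
            for (i = 1; i <= NF; i++) if ($i == "CALL") col = i
            next
        }
        { n[$col]++ }                        # tally each call symbol
        END {
            printf "amplified=%d deleted=%d neutral=%d\n", n["+"], n["-"], n["0"]
        }
    ' "$1"
}
```

Running `summarize_calls results/tumor.called.seg` prints a one-line summary that is easy to compare across samples.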
Plotting
    # Plot copy ratios and segments
    gatk PlotDenoisedCopyRatios \
        --standardized-copy-ratios tumor.standardized.tsv \
        --denoised-copy-ratios tumor.denoised.tsv \
        --sequence-dictionary reference.dict \
        --minimum-contig-length 46709983 \
        --output-prefix tumor \
        -O plots/

    # Plot segments with allelic information
    gatk PlotModeledSegments \
        --denoised-copy-ratios tumor.denoised.tsv \
        --allelic-counts results/tumor.hets.tsv \
        --segments results/tumor.modelFinal.seg \
        --sequence-dictionary reference.dict \
        --minimum-contig-length 46709983 \
        --output-prefix tumor \
        -O plots/
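The `--minimum-contig-length` value controls which contigs are plotted; 46709983 appears to match the length of chr21 in GRCh38, so shorter alternate and unplaced contigs are left out. For a different reference, the threshold can be checked against the sequence dictionary. The sketch below lists the contigs that meet a given minimum length, assuming the usual `@SQ` lines with `SN:` and `LN:` fields; the function name `contigs_at_least` is hypothetical.

```shell
# Hypothetical helper: list contigs in a sequence dictionary whose length is at
# least <min_length>. Prints "<name><TAB><length>" per contig.
# Arguments: <reference.dict> <min_length>
contigs_at_least() {
    awk -F'\t' -v min="$2" '
        $1 == "@SQ" {
            name = ""; len = 0
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)      # contig name
                if ($i ~ /^LN:/) len = substr($i, 4) + 0   # contig length
            }
            if (len >= min) print name "\t" len
        }
    ' "$1"
}
```

`contigs_at_least reference.dict 46709983` shows exactly which contigs the plotting commands above would keep.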
Germline CNV Workflow
    # For germline: use cohort mode
    # 1. Collect counts (same as above)

    # 2. Determine contig ploidy
    #    Cohort mode takes the intervals (-L) and the ploidy priors. --model is
    #    only supplied in case mode, to reuse a previously fitted ploidy model.
    gatk DetermineGermlineContigPloidy \
        -L preprocessed.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        -I sample1.counts.hdf5 \
        -I sample2.counts.hdf5 \
        --contig-ploidy-priors ploidy_priors.tsv \
        --output-prefix ploidy \
        -O ploidy-calls/
    # Writes ploidy-calls/ploidy-model and ploidy-calls/ploidy-calls

    # 3. Call germline CNVs
    gatk GermlineCNVCaller \
        --run-mode COHORT \
        -L preprocessed.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        -I sample1.counts.hdf5 \
        -I sample2.counts.hdf5 \
        --contig-ploidy-calls ploidy-calls/ploidy-calls \
        --annotated-intervals annotated_intervals.tsv \
        --output-prefix cohort \
        -O germline_cnv_calls/

    # 4. Post-process calls per sample (both genotyped outputs are VCFs)
    gatk PostprocessGermlineCNVCalls \
        --calls-shard-path germline_cnv_calls/cohort-calls \
        --model-shard-path germline_cnv_calls/cohort-model \
        --sample-index 0 \
        --contig-ploidy-calls ploidy-calls/ploidy-calls \
        --sequence-dictionary reference.dict \
        --output-genotyped-intervals sample1.genotyped_intervals.vcf.gz \
        --output-genotyped-segments sample1.genotyped_segments.vcf.gz \
        --output-denoised-copy-ratios sample1.denoised.tsv
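Post-processing runs once per sample, with `--sample-index` counting from 0 in the order the samples were given to the caller. The dry-run sketch below prints one command per index so the whole cohort can be reviewed at once. The helper name `print_postprocess_commands` and the `sample_<index>` output names are assumptions; the output flags should be checked against `gatk PostprocessGermlineCNVCalls --help` for your GATK version.

```shell
# Hypothetical helper: print one PostprocessGermlineCNVCalls command per sample.
# Dry run only: nothing is executed.
# Arguments: <n_samples> <calls_dir> <model_dir> <ploidy_calls_dir> <reference.dict>
print_postprocess_commands() (
    n_samples=$1
    calls_dir=$2
    model_dir=$3
    ploidy_calls=$4
    dict=$5
    i=0
    while [ "$i" -lt "$n_samples" ]; do
        echo gatk PostprocessGermlineCNVCalls \
            --calls-shard-path "$calls_dir" \
            --model-shard-path "$model_dir" \
            --sample-index "$i" \
            --contig-ploidy-calls "$ploidy_calls" \
            --sequence-dictionary "$dict" \
            --output-genotyped-intervals "sample_${i}.intervals.vcf.gz" \
            --output-genotyped-segments "sample_${i}.segments.vcf.gz" \
            --output-denoised-copy-ratios "sample_${i}.denoised.tsv"
        i=$((i + 1))
    done
)
```

For a two-sample cohort, passing `2` as the first argument yields commands for indices 0 and 1.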
Complete Somatic Pipeline Script
    #!/bin/bash
    set -euo pipefail

    REFERENCE=reference.fa
    # Use the same preprocessed intervals that the panel of normals was built from
    INTERVALS=preprocessed.interval_list
    PON=cnv_pon.hdf5
    SNP_SITES=common_snps.vcf

    TUMOR=$1
    NORMAL=$2
    OUTDIR=$3

    mkdir -p "$OUTDIR"

    # Collect read counts
    gatk CollectReadCounts -R "$REFERENCE" -I "$TUMOR" -L "$INTERVALS" \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O "$OUTDIR/tumor.counts.hdf5"
    gatk CollectReadCounts -R "$REFERENCE" -I "$NORMAL" -L "$INTERVALS" \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O "$OUTDIR/normal.counts.hdf5"

    # Denoise
    gatk DenoiseReadCounts -I "$OUTDIR/tumor.counts.hdf5" \
        --count-panel-of-normals "$PON" \
        --standardized-copy-ratios "$OUTDIR/tumor.standardized.tsv" \
        --denoised-copy-ratios "$OUTDIR/tumor.denoised.tsv"

    # Allelic counts
    gatk CollectAllelicCounts -R "$REFERENCE" -I "$TUMOR" -L "$SNP_SITES" \
        -O "$OUTDIR/tumor.allelicCounts.tsv"
    gatk CollectAllelicCounts -R "$REFERENCE" -I "$NORMAL" -L "$SNP_SITES" \
        -O "$OUTDIR/normal.allelicCounts.tsv"

    # Model and call
    gatk ModelSegments \
        --denoised-copy-ratios "$OUTDIR/tumor.denoised.tsv" \
        --allelic-counts "$OUTDIR/tumor.allelicCounts.tsv" \
        --normal-allelic-counts "$OUTDIR/normal.allelicCounts.tsv" \
        --output-prefix tumor \
        -O "$OUTDIR"
    gatk CallCopyRatioSegments \
        -I "$OUTDIR/tumor.cr.seg" \
        -O "$OUTDIR/tumor.called.seg"
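A pipeline like this can run for a long time before failing on a missing input, so a pre-flight check is a cheap addition. The sketch below is one possible guard: it reports every path that is missing or empty and returns non-zero if any were found. The function name `require_files` is hypothetical and not part of the upstream script.

```shell
# Hypothetical pre-flight check: succeed only if every argument is a
# non-empty file; report each offending path on stderr.
require_files() {
    rf_missing=0
    for rf_path in "$@"; do
        if [ ! -s "$rf_path" ]; then
            echo "missing or empty: $rf_path" >&2
            rf_missing=1
        fi
    done
    return "$rf_missing"
}
```

In the script above it could be called right after the variable assignments, for example `require_files "$REFERENCE" "$INTERVALS" "$PON" "$SNP_SITES" "$TUMOR" "$NORMAL" || exit 1`.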
Key Output Files
| File | Description |
|---|---|
| .counts.hdf5 | Raw read counts per interval |
| .denoised.tsv | Denoised log2 copy ratios |
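| .cr.seg | Copy-ratio segments; input to CallCopyRatioSegments |
| .modelFinal.seg | Segments with posterior summaries of log2 copy ratio and minor-allele fraction |
| .called.seg | Copy-ratio segments with amplification (+), deletion (-), or neutral (0) calls |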
| .hets.tsv | Heterozygous SNP allelic counts |
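When reading the log2 copy ratios in these files, it can help to translate a value into an approximate copy number. A common rule of thumb is copies = ploidy × 2^(log2 ratio), which assumes a pure sample and a baseline ploidy of 2; tumor purity and subclonality will pull real values toward the baseline. The helper `log2_to_copies` below is a hypothetical sketch of that arithmetic only, not something GATK provides.

```shell
# Hypothetical helper: convert a log2 copy ratio to an approximate copy number.
# Assumes a pure sample; the baseline ploidy defaults to 2.
# Arguments: <log2_ratio> [ploidy]
log2_to_copies() {
    awk -v x="$1" -v p="${2:-2}" 'BEGIN { printf "%.2f\n", p * exp(x * log(2)) }'
}

log2_to_copies 0     # 2.00 (no change)
log2_to_copies 1     # 4.00 (doubling)
log2_to_copies -1    # 1.00 (single-copy loss)
```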
Related Skills
- copy-number/cnvkit-analysis - Alternative CNV caller
- copy-number/cnv-visualization - Plotting results
- alignment-files/bam-statistics - Input BAM QC
- variant-calling/variant-calling - SNP calling for allelic counts