BioSkills bio-workflows-somatic-variant-pipeline
End-to-end somatic variant calling from tumor-normal paired samples using Mutect2 or Strelka2. Covers preprocessing, variant calling, filtering, and annotation for cancer genomics. Use when calling somatic mutations from tumor-normal pairs.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/workflows/somatic-variant-pipeline" ~/.claude/skills/gptomics-bioskills-bio-workflows-somatic-variant-pipeline && rm -rf "$T"
workflows/somatic-variant-pipeline/SKILL.mdVersion Compatibility
Reference examples tested with: CNVkit 0.9+, Ensembl VEP 111+, GATK 4.5+, SnpEff 5.2+, bcftools 1.19+, picard 3.1+
Before using code patterns, verify installed versions match. If versions differ:
- CLI:
then<tool> --version
to confirm flags<tool> --help
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Somatic Variant Pipeline
"Call somatic mutations from my tumor-normal pair" → Orchestrate alignment, Mutect2 somatic calling, contamination filtering, variant annotation (Funcotator/VEP), TMB calculation, and mutational signature analysis.
Complete workflow for calling somatic mutations from tumor-normal paired samples.
Pipeline Overview
Tumor BAM + Normal BAM │ ├── Preprocessing (if needed) │ └── MarkDuplicates, BQSR │ ├── Variant Calling │ ├── Mutect2 (GATK) - SNVs + indels │ └── Strelka2 - SNVs + indels (faster) │ ├── Filtering │ ├── FilterMutectCalls │ ├── Contamination estimation │ └── Orientation bias filtering │ ├── Annotation │ ├── Funcotator / VEP │ └── Cancer-specific databases │ └── Output: Filtered somatic VCF
Mutect2 Workflow (GATK)
Step 1: Panel of Normals (Optional but Recommended)
# Create PON from multiple normal samples for normal in normal1.bam normal2.bam normal3.bam; do sample=$(basename $normal .bam) gatk Mutect2 \ -R reference.fa \ -I $normal \ --max-mnp-distance 0 \ -O ${sample}.vcf.gz done # Combine into PON gatk GenomicsDBImport \ -R reference.fa \ --genomicsdb-workspace-path pon_db \ -V normal1.vcf.gz \ -V normal2.vcf.gz \ -V normal3.vcf.gz \ -L intervals.bed gatk CreateSomaticPanelOfNormals \ -R reference.fa \ -V gendb://pon_db \ -O pon.vcf.gz
Step 2: Call Somatic Variants
gatk Mutect2 \ -R reference.fa \ -I tumor.bam \ -I normal.bam \ -normal normal_sample_name \ --germline-resource af-only-gnomad.vcf.gz \ --panel-of-normals pon.vcf.gz \ --f1r2-tar-gz f1r2.tar.gz \ -O unfiltered.vcf.gz
Step 3: Learn Orientation Bias
gatk LearnReadOrientationModel \ -I f1r2.tar.gz \ -O read-orientation-model.tar.gz
Step 4: Calculate Contamination
gatk GetPileupSummaries \ -I tumor.bam \ -V small_exac_common.vcf.gz \ -L small_exac_common.vcf.gz \ -O tumor_pileups.table gatk GetPileupSummaries \ -I normal.bam \ -V small_exac_common.vcf.gz \ -L small_exac_common.vcf.gz \ -O normal_pileups.table gatk CalculateContamination \ -I tumor_pileups.table \ -matched normal_pileups.table \ -O contamination.table \ --tumor-segmentation segments.table
Step 5: Filter Variants
gatk FilterMutectCalls \ -R reference.fa \ -V unfiltered.vcf.gz \ --contamination-table contamination.table \ --tumor-segmentation segments.table \ --ob-priors read-orientation-model.tar.gz \ -O filtered.vcf.gz # Extract PASS variants bcftools view -f PASS filtered.vcf.gz -Oz -o somatic_final.vcf.gz
Strelka2 Workflow (Faster Alternative)
# Configure configureStrelkaSomaticWorkflow.py \ --normalBam normal.bam \ --tumorBam tumor.bam \ --referenceFasta reference.fa \ --runDir strelka_run # Execute strelka_run/runWorkflow.py -m local -j 16 # Output files # strelka_run/results/variants/somatic.snvs.vcf.gz # strelka_run/results/variants/somatic.indels.vcf.gz # Merge SNVs and indels bcftools concat \ strelka_run/results/variants/somatic.snvs.vcf.gz \ strelka_run/results/variants/somatic.indels.vcf.gz \ -a -Oz -o strelka_somatic.vcf.gz
Annotation
Funcotator (GATK)
gatk Funcotator \ -R reference.fa \ -V somatic_final.vcf.gz \ -O annotated.vcf.gz \ --output-file-format VCF \ --data-sources-path funcotator_dataSources.v1.7 \ --ref-version hg38
VEP with Cancer Databases
vep -i somatic_final.vcf.gz -o annotated.vcf \ --vcf --cache --offline \ --assembly GRCh38 \ --everything \ --plugin CADD,cadd_scores.tsv.gz \ --custom cosmic.vcf.gz,COSMIC,vcf,exact,0,CNT \ --fork 4
Complete Pipeline Script
#!/bin/bash set -euo pipefail TUMOR_BAM=$1 NORMAL_BAM=$2 NORMAL_NAME=$3 REFERENCE=$4 OUTPUT_PREFIX=$5 GNOMAD=$6 PON=$7 THREADS=16 echo "=== Step 1: Mutect2 calling ===" gatk Mutect2 \ -R $REFERENCE \ -I $TUMOR_BAM \ -I $NORMAL_BAM \ -normal $NORMAL_NAME \ --germline-resource $GNOMAD \ --panel-of-normals $PON \ --f1r2-tar-gz ${OUTPUT_PREFIX}_f1r2.tar.gz \ --native-pair-hmm-threads $THREADS \ -O ${OUTPUT_PREFIX}_unfiltered.vcf.gz echo "=== Step 2: Learn orientation bias ===" gatk LearnReadOrientationModel \ -I ${OUTPUT_PREFIX}_f1r2.tar.gz \ -O ${OUTPUT_PREFIX}_orientation.tar.gz echo "=== Step 3: Pileup summaries ===" gatk GetPileupSummaries \ -I $TUMOR_BAM \ -V $GNOMAD \ -L $GNOMAD \ -O ${OUTPUT_PREFIX}_tumor_pileups.table gatk GetPileupSummaries \ -I $NORMAL_BAM \ -V $GNOMAD \ -L $GNOMAD \ -O ${OUTPUT_PREFIX}_normal_pileups.table echo "=== Step 4: Calculate contamination ===" gatk CalculateContamination \ -I ${OUTPUT_PREFIX}_tumor_pileups.table \ -matched ${OUTPUT_PREFIX}_normal_pileups.table \ -O ${OUTPUT_PREFIX}_contamination.table \ --tumor-segmentation ${OUTPUT_PREFIX}_segments.table echo "=== Step 5: Filter variants ===" gatk FilterMutectCalls \ -R $REFERENCE \ -V ${OUTPUT_PREFIX}_unfiltered.vcf.gz \ --contamination-table ${OUTPUT_PREFIX}_contamination.table \ --tumor-segmentation ${OUTPUT_PREFIX}_segments.table \ --ob-priors ${OUTPUT_PREFIX}_orientation.tar.gz \ -O ${OUTPUT_PREFIX}_filtered.vcf.gz echo "=== Step 6: Extract PASS variants ===" bcftools view -f PASS ${OUTPUT_PREFIX}_filtered.vcf.gz \ -Oz -o ${OUTPUT_PREFIX}_somatic.vcf.gz bcftools index -t ${OUTPUT_PREFIX}_somatic.vcf.gz echo "=== Step 7: Statistics ===" bcftools stats ${OUTPUT_PREFIX}_somatic.vcf.gz > ${OUTPUT_PREFIX}_stats.txt echo "=== Pipeline complete ===" echo "Somatic variants: ${OUTPUT_PREFIX}_somatic.vcf.gz" echo "Stats: ${OUTPUT_PREFIX}_stats.txt"
Tumor-Only Mode
When matched normal is unavailable (e.g., archival FFPE, cell lines):
gatk Mutect2 \ -R reference.fa \ -I tumor.bam \ --germline-resource af-only-gnomad.vcf.gz \ --panel-of-normals pon.vcf.gz \ -O tumor_only.vcf.gz
Higher false positive rate without matched normal -- many germline variants will pass filters. The PoN and gnomAD germline resource become critical for artifact and germline removal respectively.
Consensus Calling (Improved Accuracy)
Running multiple callers and requiring agreement improves both precision and recall:
# Run Mutect2, Strelka2, and MuSE independently, then intersect # Majority voting (2/3 agreement) achieves F1 ~0.927 for SNVs bcftools isec -n+2 -p consensus_dir \ mutect2_pass.vcf.gz strelka2_pass.vcf.gz muse_pass.vcf.gz # For indels: Mutect2 + Strelka2 + VarScan2 with 2/3 agreement
Strict intersection (all agree) sacrifices too much recall; union includes too many false positives. Majority voting provides the best balance.
Emerging: DeepSomatic
DeepSomatic extends DeepVariant's CNN approach to somatic calling with platform-specific models (Illumina, PacBio HiFi, ONT). Published Nature Biotechnology 2025, it achieves higher F1 than existing callers across all platforms and supports tumor-only and FFPE modes.
Key Resources
| Resource | Purpose |
|---|---|
| gnomAD AF-only | Germline filtering |
| Panel of Normals | Technical artifact removal |
| COSMIC | Known cancer mutations |
| Funcotator data sources | Functional annotation |
Quality Metrics
# Variant counts by filter status bcftools query -f '%FILTER\n' filtered.vcf.gz | sort | uniq -c # Ti/Tv ratio (expect ~2-3 for somatic) bcftools stats filtered.vcf.gz | grep TSTV # Variant allele frequency distribution bcftools query -f '%AF\n' somatic_final.vcf.gz | \ awk '{print int($1*100)/100}' | sort -n | uniq -c
Related Skills
- variant-calling/gatk-variant-calling - Germline variant calling
- variant-calling/filtering-best-practices - Filtering strategies
- variant-calling/variant-annotation - VEP/SnpEff annotation
- variant-calling/structural-variant-calling - Somatic SV detection (Manta tumor-normal mode)
- copy-number/cnvkit-analysis - Somatic CNV calling