git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/variant-interpretation-acmg/bioSkills/gatk-variant-calling" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-gatk-variant-calling && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/variant-interpretation-acmg/bioSkills/gatk-variant-calling" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-gatk-variant-calling && rm -rf "$T"
skills/variant-interpretation-acmg/bioSkills/gatk-variant-calling/SKILL.md
name: bio-gatk-variant-calling
description: Variant calling with GATK HaplotypeCaller following best practices. Covers germline SNP/indel calling, GVCF workflow for cohorts, joint genotyping, and variant quality score recalibration (VQSR). Use when calling variants with GATK HaplotypeCaller.
tool_type: cli
primary_tool: gatk
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
GATK Variant Calling
GATK HaplotypeCaller is a widely used caller for germline SNPs and indels. This skill covers the GATK Best Practices workflow for germline short variants.
Prerequisites
BAM files should be preprocessed:
- Mark duplicates
- Base quality score recalibration (BQSR) - optional but recommended
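
Beyond the preprocessing steps above, GATK also expects index files next to the reference and the BAM. The following is a minimal sketch of a pre-flight check, assuming the conventional companion file names (`reference.fa.fai`, `reference.dict`, and `sample.bam.bai` or `sample.bai`). The function name `check_gatk_inputs` is hypothetical and not part of GATK.

```shell
# check_gatk_inputs REF BAM
# Prints each missing companion file and returns non-zero if any is absent.
# Assumed naming: REF.fai, <REF without extension>.dict,
# and either BAM.bai or <BAM without .bam>.bai.
check_gatk_inputs() {
    ref=$1
    bam=$2
    missing=0

    for f in "$ref" "$ref.fai" "${ref%.*}.dict" "$bam"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done

    # The BAM index may be named sample.bam.bai or sample.bai
    if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
        echo "missing: index for $bam" >&2
        missing=1
    fi

    return "$missing"
}
```

A typical call would be `check_gatk_inputs reference.fa sample.bam || exit 1` at the top of a calling script.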
Single-Sample Calling
Basic HaplotypeCaller

    gatk HaplotypeCaller \
        -R reference.fa \
        -I sample.bam \
        -O sample.vcf.gz

With Standard Annotations

    gatk HaplotypeCaller \
        -R reference.fa \
        -I sample.bam \
        -O sample.vcf.gz \
        -A Coverage \
        -A QualByDepth \
        -A FisherStrand \
        -A StrandOddsRatio \
        -A MappingQualityRankSumTest \
        -A ReadPosRankSumTest

Target Intervals (Exome/Panel)

    gatk HaplotypeCaller \
        -R reference.fa \
        -I sample.bam \
        -L targets.interval_list \
        -O sample.vcf.gz

Adjust Calling Confidence

    gatk HaplotypeCaller \
        -R reference.fa \
        -I sample.bam \
        -O sample.vcf.gz \
        --standard-min-confidence-threshold-for-calling 20

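
The value passed to `--standard-min-confidence-threshold-for-calling` is a Phred-scaled confidence, so it can be read as an error probability. As a worked example for the threshold of 20 used above:

$$
P_{\text{error}} = 10^{-Q/10}, \qquad Q = 20 \;\Rightarrow\; P_{\text{error}} = 10^{-2} = 0.01
$$

A higher threshold therefore keeps only calls with a lower estimated chance of being wrong, at the cost of sensitivity.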
GVCF Workflow (Recommended for Cohorts)
The GVCF workflow enables joint genotyping across samples for better variant calls.
Step 1: Generate GVCFs per Sample

    gatk HaplotypeCaller \
        -R reference.fa \
        -I sample.bam \
        -O sample.g.vcf.gz \
        -ERC GVCF

Step 2: Combine GVCFs (GenomicsDBImport)

    # Create a tab-separated sample map file, e.g. sample_map.txt:
    #   sample1	/path/to/sample1.g.vcf.gz
    #   sample2	/path/to/sample2.g.vcf.gz
    gatk GenomicsDBImport \
        --genomicsdb-workspace-path genomicsdb \
        --sample-name-map sample_map.txt \
        -L intervals.interval_list

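
For more than a handful of samples, writing the sample map by hand is error-prone. The sketch below generates it from a directory of GVCFs. It assumes that each file is named `<sample>.g.vcf.gz` and that the file name prefix is the sample name you want in the map; `make_sample_map` is a hypothetical helper, not a GATK command.

```shell
# make_sample_map DIR > sample_map.txt
# Writes one "<sample><TAB><path>" line per *.g.vcf.gz file in DIR.
# Assumption: the file name minus ".g.vcf.gz" is the desired sample name.
make_sample_map() {
    dir=$1
    for gvcf in "$dir"/*.g.vcf.gz; do
        [ -e "$gvcf" ] || continue   # skip the unexpanded glob when DIR has no GVCFs
        name=$(basename "$gvcf" .g.vcf.gz)
        printf '%s\t%s\n' "$name" "$gvcf"
    done
}
```

Passing an absolute directory, for example `make_sample_map /path/to/gvcfs > sample_map.txt`, keeps the paths in the map valid regardless of where GenomicsDBImport is run.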
Alternative: CombineGVCFs (smaller cohorts)

    gatk CombineGVCFs \
        -R reference.fa \
        -V sample1.g.vcf.gz \
        -V sample2.g.vcf.gz \
        -V sample3.g.vcf.gz \
        -O cohort.g.vcf.gz

Step 3: Joint Genotyping

    # From GenomicsDB
    gatk GenotypeGVCFs \
        -R reference.fa \
        -V gendb://genomicsdb \
        -O cohort.vcf.gz

    # From combined GVCF
    gatk GenotypeGVCFs \
        -R reference.fa \
        -V cohort.g.vcf.gz \
        -O cohort.vcf.gz

Variant Quality Score Recalibration (VQSR)
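VQSR trains a model on the annotation values seen at known variant sites and filters calls by their recalibrated score. The model needs a large call set to train on, so whole-genome cohorts work best.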
SNP Recalibration

    # Build SNP model
    gatk VariantRecalibrator \
        -R reference.fa \
        -V cohort.vcf.gz \
        --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
        --resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
        --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
        --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
        -mode SNP \
        -O snp.recal \
        --tranches-file snp.tranches

    # Apply SNP filter
    gatk ApplyVQSR \
        -R reference.fa \
        -V cohort.vcf.gz \
        -O cohort.snp_recal.vcf.gz \
        --recal-file snp.recal \
        --tranches-file snp.tranches \
        --truth-sensitivity-filter-level 99.5 \
        -mode SNP

Indel Recalibration

    # Build Indel model
    gatk VariantRecalibrator \
        -R reference.fa \
        -V cohort.snp_recal.vcf.gz \
        --resource:mills,known=false,training=true,truth=true,prior=12.0 Mills.vcf.gz \
        --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
        -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
        -mode INDEL \
        --max-gaussians 4 \
        -O indel.recal \
        --tranches-file indel.tranches

    # Apply Indel filter
    gatk ApplyVQSR \
        -R reference.fa \
        -V cohort.snp_recal.vcf.gz \
        -O cohort.vqsr.vcf.gz \
        --recal-file indel.recal \
        --tranches-file indel.tranches \
        --truth-sensitivity-filter-level 99.0 \
        -mode INDEL

Hard Filtering (When VQSR Not Suitable)
Use hard filters for small datasets, exomes, or single samples, where there are too few variants to train a VQSR model.
Extract SNPs and Indels

    gatk SelectVariants \
        -R reference.fa \
        -V cohort.vcf.gz \
        --select-type-to-include SNP \
        -O snps.vcf.gz

    gatk SelectVariants \
        -R reference.fa \
        -V cohort.vcf.gz \
        --select-type-to-include INDEL \
        -O indels.vcf.gz

Apply Hard Filters

    # Filter SNPs
    gatk VariantFiltration \
        -R reference.fa \
        -V snps.vcf.gz \
        -O snps.filtered.vcf.gz \
        --filter-expression "QD < 2.0" --filter-name "QD2" \
        --filter-expression "FS > 60.0" --filter-name "FS60" \
        --filter-expression "MQ < 40.0" --filter-name "MQ40" \
        --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
        --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
        --filter-expression "SOR > 3.0" --filter-name "SOR3"

    # Filter Indels
    gatk VariantFiltration \
        -R reference.fa \
        -V indels.vcf.gz \
        -O indels.filtered.vcf.gz \
        --filter-expression "QD < 2.0" --filter-name "QD2" \
        --filter-expression "FS > 200.0" --filter-name "FS200" \
        --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
        --filter-expression "SOR > 10.0" --filter-name "SOR10"

Merge Filtered Variants

    gatk MergeVcfs \
        -I snps.filtered.vcf.gz \
        -I indels.filtered.vcf.gz \
        -O cohort.filtered.vcf.gz

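
VariantFiltration only annotates the FILTER column; it does not remove records. A quick tally of FILTER values shows how many records passed and which filters fired most often. The sketch below uses only `gzip`, `awk`, and `sort`, and assumes a standard tab-separated VCF with FILTER in column 7; `count_filters` is a hypothetical helper name.

```shell
# count_filters FILE.vcf.gz
# Prints "<count><TAB><filter>" for every value seen in the FILTER column.
# Records that failed several filters (e.g. "QD2;FS60") count once per filter.
count_filters() {
    gzip -cd "$1" | awk -F '\t' '
        /^#/ { next }                      # skip header lines
        {
            n = split($7, names, ";")      # column 7 is FILTER
            for (i = 1; i <= n; i++) count[names[i]]++
        }
        END { for (name in count) printf "%d\t%s\n", count[name], name }
    ' | LC_ALL=C sort -k2,2
}
```

Running `count_filters cohort.filtered.vcf.gz` after the merge gives one line per filter name plus a `PASS` line.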
Base Quality Score Recalibration (BQSR)
Preprocessing step, run before HaplotypeCaller, that corrects systematic errors in base quality scores.
Step 1: BaseRecalibrator

    gatk BaseRecalibrator \
        -R reference.fa \
        -I sample.bam \
        --known-sites dbsnp.vcf.gz \
        --known-sites known_indels.vcf.gz \
        -O recal_data.table

Step 2: ApplyBQSR

    gatk ApplyBQSR \
        -R reference.fa \
        -I sample.bam \
        --bqsr-recal-file recal_data.table \
        -O sample.recal.bam

Parallel Processing
Scatter by Interval

    # Split calling across intervals
    for interval in chr{1..22} chrX chrY; do
        gatk HaplotypeCaller \
            -R reference.fa \
            -I sample.bam \
            -L "$interval" \
            -O "sample.${interval}.g.vcf.gz" \
            -ERC GVCF &
    done
    wait

    # Gather GVCFs
    gatk GatherVcfs \
        -I sample.chr1.g.vcf.gz \
        -I sample.chr2.g.vcf.gz \
        ... \
        -O sample.g.vcf.gz

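
GatherVcfs expects its inputs in genomic order, and a shell glob such as `sample.chr*.g.vcf.gz` sorts lexically (chr1, chr10, chr11, ...), which is not that order. The sketch below lists the shard names in the same order as the scatter loop. It assumes the reference orders contigs as chr1 to chr22, chrX, chrY and that shards follow the `sample.<interval>.g.vcf.gz` naming used above; `gvcf_shards` is a hypothetical helper.

```shell
# gvcf_shards SAMPLE
# Prints the per-chromosome GVCF names in the order chr1..chr22, chrX, chrY,
# matching the file names written by the scatter loop.
gvcf_shards() {
    sample=$1
    i=1
    while [ "$i" -le 22 ]; do
        echo "$sample.chr$i.g.vcf.gz"
        i=$((i + 1))
    done
    echo "$sample.chrX.g.vcf.gz"
    echo "$sample.chrY.g.vcf.gz"
}

# Usage sketch (not run here): collect the shards as repeated -I arguments.
#   set --
#   for f in $(gvcf_shards sample); do set -- "$@" -I "$f"; done
#   gatk GatherVcfs "$@" -O sample.g.vcf.gz
```

The usage sketch relies on word splitting, so it only suits sample names without whitespace.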
Native PairHMM Threads

    gatk HaplotypeCaller \
        -R reference.fa \
        -I sample.bam \
        -O sample.vcf.gz \
        --native-pair-hmm-threads 4

CNN Score Variant Filter (Deep Learning)
Alternative to VQSR that scores variants with a convolutional neural network.
Score Variants

    gatk CNNScoreVariants \
        -R reference.fa \
        -V cohort.vcf.gz \
        -O cohort.cnn_scored.vcf.gz \
        --tensor-type reference

Filter by CNN Score

    gatk FilterVariantTranches \
        -V cohort.cnn_scored.vcf.gz \
        -O cohort.cnn_filtered.vcf.gz \
        --resource hapmap.vcf.gz \
        --resource mills.vcf.gz \
        --info-key CNN_1D \
        --snp-tranche 99.95 \
        --indel-tranche 99.4

Complete Single-Sample Pipeline

    #!/bin/bash
    # Stop at the first failing step so later steps never run on missing input
    set -euo pipefail

    SAMPLE=$1            # BAM file name without the .bam extension
    REF=reference.fa
    DBSNP=dbsnp.vcf.gz
    KNOWN_INDELS=known_indels.vcf.gz

    # BQSR
    gatk BaseRecalibrator -R "$REF" -I "${SAMPLE}.bam" \
        --known-sites "$DBSNP" --known-sites "$KNOWN_INDELS" \
        -O "${SAMPLE}.recal.table"

    gatk ApplyBQSR -R "$REF" -I "${SAMPLE}.bam" \
        --bqsr-recal-file "${SAMPLE}.recal.table" \
        -O "${SAMPLE}.recal.bam"

    # Call variants
    gatk HaplotypeCaller -R "$REF" -I "${SAMPLE}.recal.bam" \
        -O "${SAMPLE}.g.vcf.gz" -ERC GVCF

    # Single-sample genotyping
    gatk GenotypeGVCFs -R "$REF" -V "${SAMPLE}.g.vcf.gz" \
        -O "${SAMPLE}.vcf.gz"

    # Hard filter
    gatk VariantFiltration -R "$REF" -V "${SAMPLE}.vcf.gz" \
        -O "${SAMPLE}.filtered.vcf.gz" \
        --filter-expression "QD < 2.0" --filter-name "LowQD" \
        --filter-expression "FS > 60.0" --filter-name "HighFS" \
        --filter-expression "MQ < 40.0" --filter-name "LowMQ"

Key Annotations
| Annotation | Description | Good Values |
|---|---|---|
| QD | Quality by Depth | > 2.0 |
| FS | Fisher Strand | < 60 (SNP), < 200 (Indel) |
| SOR | Strand Odds Ratio | < 3 (SNP), < 10 (Indel) |
| MQ | Mapping Quality | > 40 |
| MQRankSum | MQ Rank Sum Test | > -12.5 |
| ReadPosRankSum | Read Position Rank Sum | > -8.0 (SNP), > -20.0 (Indel) |
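
The thresholds in this table are starting points, and it can help to look at how an annotation is actually distributed in your call set before relying on them. The sketch below pulls one INFO annotation out of a VCF using only `gzip` and `awk`. It assumes a standard tab-separated VCF with INFO in column 8; `info_values` is a hypothetical helper name.

```shell
# info_values KEY FILE.vcf.gz
# Prints the value of INFO annotation KEY for each record that carries it.
info_values() {
    key=$1
    gzip -cd "$2" | awk -F '\t' -v key="$key" '
        /^#/ { next }                                # skip header lines
        {
            n = split($8, fields, ";")               # column 8 is INFO
            for (i = 1; i <= n; i++) {
                # match only fields that start with "KEY=", so MQ does not match MQRankSum
                if (index(fields[i], key "=") == 1)
                    print substr(fields[i], length(key) + 2)
            }
        }
    '
}
```

For example, `info_values QD snps.vcf.gz | sort -n | head` shows the lowest QD values in the SNP set.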
Resource Files
| Resource | Use |
|---|---|
| dbSNP | Known variants (prior=2.0) |
| HapMap | Training/truth SNPs (prior=15.0) |
| Omni | Training SNPs (prior=12.0) |
| 1000G SNPs | Training SNPs (prior=10.0) |
| Mills Indels | Training/truth indels (prior=12.0) |
Related Skills
- variant-calling - bcftools alternative
- alignment-files - BAM preprocessing
- filtering-best-practices - Post-calling filtering
- variant-normalization - Normalize before annotation
- vep-snpeff-annotation - Annotate final calls