Claude-skill-registry bio-variant-calling-joint-calling
Joint genotype calling across multiple samples using GATK CombineGVCFs and GenotypeGVCFs. Essential for cohort studies, population genetics, and leveraging VQSR. Use when performing joint genotyping across multiple samples.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/joint-calling" ~/.claude/skills/majiayu000-claude-skill-registry-bio-variant-calling-joint-calling && rm -rf "$T"
manifest:
skills/data/joint-calling/SKILL.mdsource content
Joint Calling
Call variants jointly across multiple samples for improved accuracy and consistent genotyping.
Why Joint Calling?
- Improved sensitivity - Leverage information across samples
- Consistent genotyping - Same sites called across all samples
- VQSR eligible - Requires cohort for machine learning filtering
- Population analysis - Allele frequencies across cohort
Workflow Overview
Sample BAMs │ ├── HaplotypeCaller (per-sample, -ERC GVCF) │ └── sample1.g.vcf.gz, sample2.g.vcf.gz, ... │ ├── CombineGVCFs or GenomicsDBImport │ └── Combine into cohort database │ ├── GenotypeGVCFs │ └── Joint genotyping │ └── VQSR or Hard Filtering └── Final VCF
Step 1: Per-Sample gVCF Generation
# Generate gVCF for each sample gatk HaplotypeCaller \ -R reference.fa \ -I sample1.bam \ -O sample1.g.vcf.gz \ -ERC GVCF # With intervals (faster) gatk HaplotypeCaller \ -R reference.fa \ -I sample1.bam \ -O sample1.g.vcf.gz \ -ERC GVCF \ -L intervals.bed
Batch Processing
# Process all samples for bam in *.bam; do sample=$(basename $bam .bam) gatk HaplotypeCaller \ -R reference.fa \ -I $bam \ -O ${sample}.g.vcf.gz \ -ERC GVCF & done wait
Step 2a: CombineGVCFs (Small Cohorts)
For <100 samples:
gatk CombineGVCFs \ -R reference.fa \ -V sample1.g.vcf.gz \ -V sample2.g.vcf.gz \ -V sample3.g.vcf.gz \ -O cohort.g.vcf.gz
From Sample Map
# Create sample map file # sample1 /path/to/sample1.g.vcf.gz # sample2 /path/to/sample2.g.vcf.gz ls *.g.vcf.gz | while read f; do echo -e "$(basename $f .g.vcf.gz)\t$f" done > sample_map.txt # Combine with -V for each gatk CombineGVCFs \ -R reference.fa \ $(cat sample_map.txt | cut -f2 | sed 's/^/-V /') \ -O cohort.g.vcf.gz
Step 2b: GenomicsDBImport (Large Cohorts)
For >100 samples, use GenomicsDB:
# Create sample map ls *.g.vcf.gz | while read f; do echo -e "$(basename $f .g.vcf.gz)\t$f" done > sample_map.txt # Import to GenomicsDB (per chromosome for parallelism) gatk GenomicsDBImport \ --sample-name-map sample_map.txt \ --genomicsdb-workspace-path genomicsdb_chr1 \ -L chr1 \ --reader-threads 4 # Or all chromosomes for chr in {1..22} X Y; do gatk GenomicsDBImport \ --sample-name-map sample_map.txt \ --genomicsdb-workspace-path genomicsdb_chr${chr} \ -L chr${chr} & done wait
Update GenomicsDB with New Samples
gatk GenomicsDBImport \ --genomicsdb-update-workspace-path genomicsdb_chr1 \ --sample-name-map new_samples.txt \ -L chr1
Step 3: GenotypeGVCFs
From Combined gVCF
gatk GenotypeGVCFs \ -R reference.fa \ -V cohort.g.vcf.gz \ -O cohort.vcf.gz
From GenomicsDB
gatk GenotypeGVCFs \ -R reference.fa \ -V gendb://genomicsdb_chr1 \ -O chr1.vcf.gz # All chromosomes for chr in {1..22} X Y; do gatk GenotypeGVCFs \ -R reference.fa \ -V gendb://genomicsdb_chr${chr} \ -O chr${chr}.vcf.gz & done wait # Merge chromosomes bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz \ -Oz -o cohort.vcf.gz
With Allele-Specific Annotations
gatk GenotypeGVCFs \ -R reference.fa \ -V gendb://genomicsdb \ -O cohort.vcf.gz \ -G StandardAnnotation \ -G AS_StandardAnnotation
Step 4: Filtering
VQSR (Recommended for >30 Samples)
# SNPs gatk VariantRecalibrator \ -R reference.fa \ -V cohort.vcf.gz \ --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \ --resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \ --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \ --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \ -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \ -mode SNP \ -O snps.recal \ --tranches-file snps.tranches gatk ApplyVQSR \ -R reference.fa \ -V cohort.vcf.gz \ --recal-file snps.recal \ --tranches-file snps.tranches \ -mode SNP \ --truth-sensitivity-filter-level 99.5 \ -O cohort.snps.vcf.gz # Indels gatk VariantRecalibrator \ -R reference.fa \ -V cohort.snps.vcf.gz \ --resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf.gz \ --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \ -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \ -mode INDEL \ -O indels.recal \ --tranches-file indels.tranches gatk ApplyVQSR \ -R reference.fa \ -V cohort.snps.vcf.gz \ --recal-file indels.recal \ --tranches-file indels.tranches \ -mode INDEL \ --truth-sensitivity-filter-level 99.0 \ -O cohort.filtered.vcf.gz
Hard Filtering (Small Cohorts)
# See filtering-best-practices skill gatk VariantFiltration \ -R reference.fa \ -V cohort.vcf.gz \ --filter-expression "QD < 2.0" --filter-name "QD2" \ --filter-expression "FS > 60.0" --filter-name "FS60" \ --filter-expression "MQ < 40.0" --filter-name "MQ40" \ -O cohort.filtered.vcf.gz
Complete Pipeline Script
#!/bin/bash set -euo pipefail REFERENCE=$1 OUTPUT_DIR=$2 THREADS=16 mkdir -p $OUTPUT_DIR/{gvcfs,genomicsdb,vcfs} echo "=== Step 1: Generate gVCFs ===" for bam in data/*.bam; do sample=$(basename $bam .bam) gatk HaplotypeCaller \ -R $REFERENCE \ -I $bam \ -O $OUTPUT_DIR/gvcfs/${sample}.g.vcf.gz \ -ERC GVCF & # Limit parallelism while [ $(jobs -r | wc -l) -ge $THREADS ]; do sleep 1; done done wait echo "=== Step 2: Create sample map ===" ls $OUTPUT_DIR/gvcfs/*.g.vcf.gz | while read f; do echo -e "$(basename $f .g.vcf.gz)\t$(realpath $f)" done > $OUTPUT_DIR/sample_map.txt echo "=== Step 3: GenomicsDBImport ===" gatk GenomicsDBImport \ --sample-name-map $OUTPUT_DIR/sample_map.txt \ --genomicsdb-workspace-path $OUTPUT_DIR/genomicsdb \ -L intervals.bed \ --reader-threads 4 echo "=== Step 4: Joint genotyping ===" gatk GenotypeGVCFs \ -R $REFERENCE \ -V gendb://$OUTPUT_DIR/genomicsdb \ -O $OUTPUT_DIR/vcfs/cohort.vcf.gz echo "=== Step 5: Index ===" bcftools index -t $OUTPUT_DIR/vcfs/cohort.vcf.gz echo "=== Statistics ===" bcftools stats $OUTPUT_DIR/vcfs/cohort.vcf.gz > $OUTPUT_DIR/vcfs/cohort_stats.txt echo "=== Complete ===" echo "Joint VCF: $OUTPUT_DIR/vcfs/cohort.vcf.gz"
Tips
Memory for Large Cohorts
# Increase Java heap gatk --java-options "-Xmx64g" GenotypeGVCFs ... # Batch size for GenomicsDBImport gatk GenomicsDBImport --batch-size 50 ...
Incremental Updates
# Add new samples to existing database gatk GenomicsDBImport \ --genomicsdb-update-workspace-path existing_db \ --sample-name-map new_samples.txt
Related Skills
- variant-calling/gatk-variant-calling - Single-sample calling
- variant-calling/filtering-best-practices - VQSR and hard filtering
- population-genetics/plink-basics - Population analysis of joint calls
- workflows/fastq-to-variants - End-to-end germline pipeline