BioSkills bio-variant-calling-joint-calling

Joint genotype calling across multiple samples using GATK CombineGVCFs and GenotypeGVCFs. Essential for cohort studies, population genetics, and leveraging VQSR. Use when performing joint genotyping across multiple samples.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/variant-calling/joint-calling" ~/.claude/skills/gptomics-bioskills-bio-variant-calling-joint-calling && rm -rf "$T"
manifest: variant-calling/joint-calling/SKILL.md
source content

Version Compatibility

Reference examples tested with: GATK 4.5+, bcftools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

  • CLI: run <tool> --version, then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
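
A quick preflight for the two tools this skill relies on:

# Confirm installed versions before copying the patterns below
gatk --version
bcftools --version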

Joint Calling

"Joint genotype my cohort samples" → Combine per-sample gVCFs into a single cohort callset with consistent genotyping across all sites, enabling VQSR and population-level analysis.

  • CLI:
    gatk HaplotypeCaller -ERC GVCF
    gatk GenomicsDBImport
    gatk GenotypeGVCFs

Why Joint Calling Matters

Single-sample calling discards cross-sample evidence that is critical for accurate genotyping:

  • Statistical power from shared evidence - A site with 2 alt reads in one sample is borderline and would typically be missed. If 50 other samples in the cohort also show 2 alt reads at that site, the evidence is overwhelming and the variant is clearly real. Joint calling aggregates this weak-per-sample signal into strong cohort-level evidence.
  • Genotype refinement via cohort priors - Individual genotype likelihoods are combined with cohort allele frequencies as a Bayesian prior. A heterozygous call at a common variant site (AF=0.3) receives more support than the same call at a site with no other carriers. This prior dramatically improves accuracy for low-coverage samples (a standalone refinement sketch follows this list).
  • Consistent site representation - All samples are genotyped at the same sites, producing homozygous-reference calls where applicable. Without joint calling, a missing genotype is ambiguous: it could mean homozygous-reference or simply insufficient coverage. This "missing = reference" assumption is a common source of false negatives in downstream analysis.
  • VQSR eligibility - Variant quality score recalibration requires cohort-level variant distributions and generally needs 30+ samples to build reliable models.
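
Joint calling applies these priors internally; GATK also exposes prior-based refinement as a standalone step. A minimal sketch, assuming a population resource such as a 1000 Genomes sites VCF is on hand (confirm flag spelling with gatk CalculateGenotypePosteriors --help):

# Refine cohort genotypes with population allele-frequency priors
gatk CalculateGenotypePosteriors \
    -V cohort.vcf.gz \
    --supporting-callsets 1000G.vcf.gz \
    -O cohort.refined.vcf.gz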

Cohort Size Decision Table

Cohort Size      Approach                            Notes
<100             CombineGVCFs or GenomicsDB          Either works; CombineGVCFs is simpler to manage
100-10,000       GenomicsDB + GenotypeGVCFs          Standard GATK Best Practices; shard by chromosome
10,000-100,000   GenomicsDB + GenotypeGVCFs          Heavily sharded and parallelized across intervals
>100,000         DeepVariant + GLnexus, or Hail VDS  GATK becomes unwieldy at this scale; purpose-built tools required

Workflow Overview

Sample BAMs
    │
    ├── HaplotypeCaller (per-sample, -ERC GVCF)
    │   └── sample1.g.vcf.gz, sample2.g.vcf.gz, ...
    │
    ├── CombineGVCFs or GenomicsDBImport
    │   └── Combine into cohort database
    │
    ├── GenotypeGVCFs
    │   └── Joint genotyping
    │
    └── VQSR or Hard Filtering
        └── Final VCF

Step 1: Per-Sample gVCF Generation

# Generate gVCF for each sample
gatk HaplotypeCaller \
    -R reference.fa \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF

# With intervals (faster)
gatk HaplotypeCaller \
    -R reference.fa \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF \
    -L intervals.bed
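
To catch malformed gVCFs before they reach the combine step, ValidateVariants can check reference-block structure; a minimal sketch (confirm the flag spelling with gatk ValidateVariants --help):

# Validate gVCF structure before combining
gatk ValidateVariants \
    -R reference.fa \
    -V sample1.g.vcf.gz \
    --validate-GVCF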

Batch Processing

# Process all samples in parallel (no concurrency cap; see the alternative below)
for bam in *.bam; do
    sample=$(basename "$bam" .bam)
    gatk HaplotypeCaller \
        -R reference.fa \
        -I "$bam" \
        -O "${sample}.g.vcf.gz" \
        -ERC GVCF &
done
wait
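
If GNU parallel is available (an assumption beyond this skill's tool list), the same fan-out with an explicit concurrency cap:

# Hypothetical alternative: at most 8 concurrent HaplotypeCaller jobs
ls *.bam | parallel -j 8 \
    'gatk HaplotypeCaller -R reference.fa -I {} -O {.}.g.vcf.gz -ERC GVCF'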

Step 2a: CombineGVCFs (Small Cohorts)

For <100 samples:

gatk CombineGVCFs \
    -R reference.fa \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz \
    -V sample3.g.vcf.gz \
    -O cohort.g.vcf.gz

From Sample Map

# Create sample map file
# sample1    /path/to/sample1.g.vcf.gz
# sample2    /path/to/sample2.g.vcf.gz

for f in *.g.vcf.gz; do
    echo -e "$(basename "$f" .g.vcf.gz)\t$f"
done > sample_map.txt

# Combine with -V for each
gatk CombineGVCFs \
    -R reference.fa \
    $(cat sample_map.txt | cut -f2 | sed 's/^/-V /') \
    -O cohort.g.vcf.gz
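
CombineGVCFs expects each input gVCF to be indexed; a minimal preflight sketch:

# Create any missing tabix indexes before combining
for f in *.g.vcf.gz; do
    [ -f "$f.tbi" ] || gatk IndexFeatureFile -I "$f"
done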

Step 2b: GenomicsDBImport (Large Cohorts)

For >100 samples, use GenomicsDB:

# Create sample map
for f in *.g.vcf.gz; do
    echo -e "$(basename "$f" .g.vcf.gz)\t$f"
done > sample_map.txt

# Import to GenomicsDB (per chromosome for parallelism)
gatk GenomicsDBImport \
    --sample-name-map sample_map.txt \
    --genomicsdb-workspace-path genomicsdb_chr1 \
    -L chr1 \
    --reader-threads 4

# Or all chromosomes
for chr in {1..22} X Y; do
    gatk GenomicsDBImport \
        --sample-name-map sample_map.txt \
        --genomicsdb-workspace-path genomicsdb_chr${chr} \
        -L chr${chr} &
done
wait

Update GenomicsDB with New Samples

# Do not pass -L when updating: intervals are read from the existing
# workspace and cannot be changed (see caveats below)
gatk GenomicsDBImport \
    --genomicsdb-update-workspace-path genomicsdb_chr1 \
    --sample-name-map new_samples.txt

GenomicsDB Critical Caveats

GenomicsDB is powerful but has sharp edges that can cause data loss or silent failures:

  • No sample replacement - Existing samples cannot be updated or overwritten. Only new samples with different names can be added. To fix a sample, the entire workspace must be recreated.
  • Intervals locked at import time - The genomic intervals specified during the initial import cannot be changed on incremental updates. Adding new regions requires reimporting from scratch.
  • Fragment accumulation - Each incremental batch creates a new database fragment. After thousands of incremental additions, file handle exhaustion becomes likely. Run --consolidate periodically to merge fragments.
  • Corruption risk on failed adds - A failed incremental import can leave the datastore in an inconsistent state. Always back up the workspace directory before running --genomicsdb-update-workspace-path, as in the sketch after this list.
  • Batch size for memory - Use --batch-size 50 to control memory consumption (a commonly recommended value; the GATK default of 0 opens all gVCF readers at once). Larger batches load more gVCFs simultaneously and can exhaust heap space.
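
A minimal backup-and-update sketch (paths are examples; a plain directory copy suffices while the workspace is not in use):

# Snapshot the workspace so a failed incremental import can be rolled back
cp -r genomicsdb_chr1 genomicsdb_chr1.bak

gatk GenomicsDBImport \
    --genomicsdb-update-workspace-path genomicsdb_chr1 \
    --sample-name-map new_samples.txt

# On failure, restore: rm -rf genomicsdb_chr1 && mv genomicsdb_chr1.bak genomicsdb_chr1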

Step 3: GenotypeGVCFs

From Combined gVCF

gatk GenotypeGVCFs \
    -R reference.fa \
    -V cohort.g.vcf.gz \
    -O cohort.vcf.gz

From GenomicsDB

gatk GenotypeGVCFs \
    -R reference.fa \
    -V gendb://genomicsdb_chr1 \
    -O chr1.vcf.gz

# All chromosomes
for chr in {1..22} X Y; do
    gatk GenotypeGVCFs \
        -R reference.fa \
        -V gendb://genomicsdb_chr${chr} \
        -O chr${chr}.vcf.gz &
done
wait

# Merge chromosomes
bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz \
    -Oz -o cohort.vcf.gz

With Allele-Specific Annotations

For larger cohorts where multiallelic sites are common, allele-specific annotations allow VQSR to evaluate each allele independently rather than penalizing a good allele because a co-occurring allele is poor:

gatk GenotypeGVCFs \
    -R reference.fa \
    -V gendb://genomicsdb \
    -O cohort.vcf.gz \
    -G StandardAnnotation \
    -G AS_StandardAnnotation

When allele-specific annotations are present, use -AS mode in VariantRecalibrator and ApplyVQSR for allele-level filtering.
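
A minimal sketch of the allele-specific variant of the Step 4 recalibration below, with the --resource lines elided as ... (confirm flags with gatk VariantRecalibrator --help):

gatk VariantRecalibrator \
    -R reference.fa \
    -V cohort.vcf.gz \
    -AS \
    -an AS_QD -an AS_FS -an AS_SOR -an AS_MQ \
    -mode SNP \
    -O snps.AS.recal \
    --tranches-file snps.AS.tranches \
    ...

gatk ApplyVQSR -AS -mode SNP ...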

Step 4: Filtering

VQSR (Recommended for >30 Samples)

# SNPs
gatk VariantRecalibrator \
    -R reference.fa \
    -V cohort.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
    --resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -O snps.recal \
    --tranches-file snps.tranches

gatk ApplyVQSR \
    -R reference.fa \
    -V cohort.vcf.gz \
    --recal-file snps.recal \
    --tranches-file snps.tranches \
    -mode SNP \
    --truth-sensitivity-filter-level 99.5 \
    -O cohort.snps.vcf.gz

# Indels
gatk VariantRecalibrator \
    -R reference.fa \
    -V cohort.snps.vcf.gz \
    --resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode INDEL \
    -O indels.recal \
    --tranches-file indels.tranches

gatk ApplyVQSR \
    -R reference.fa \
    -V cohort.snps.vcf.gz \
    --recal-file indels.recal \
    --tranches-file indels.tranches \
    -mode INDEL \
    --truth-sensitivity-filter-level 99.0 \
    -O cohort.filtered.vcf.gz

Hard Filtering (Small Cohorts)

# See filtering-best-practices skill
gatk VariantFiltration \
    -R reference.fa \
    -V cohort.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    --filter-expression "MQ < 40.0" --filter-name "MQ40" \
    -O cohort.filtered.vcf.gz
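
Both routes mark failing records in FILTER rather than removing them. To keep only passing sites, a minimal sketch:

# Keep only PASS records from the filtered VCF
bcftools view -f PASS cohort.filtered.vcf.gz -Oz -o cohort.pass.vcf.gz
bcftools index -t cohort.pass.vcf.gz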

Batch Effects in Joint Calling

Joint genotyping mitigates most batch effects because it re-evaluates genotype likelihoods across all samples simultaneously, recalibrating quality scores against the full cohort distribution. However, certain batch effects persist through joint calling because they affect the underlying read data, not the genotyping model:

  • Different library prep protocols - PCR-free vs PCR-based libraries produce different duplicate and error profiles
  • Different capture kits (WES) - Exome kits target different regions; sites outside the intersection have systematically missing data in some batches
  • Significantly different coverage distributions - 10x WGS samples mixed with 30x samples will have systematically different genotype quality at heterozygous sites
  • Different reference genome versions - Mixing GRCh37 and GRCh38 alignments is not valid; all samples must use the same reference
  • Mixing WGS and WES - Fundamentally different coverage profiles; off-target WES regions behave like very-low-coverage WGS

Mitigation: process all samples through an identical upstream pipeline (same aligner, same duplicate marking, same BQSR resources). If batches are unavoidable, include batch as a covariate in downstream association or differential analyses.
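
One quick check for residual batch effects is to compare per-sample summary statistics between batches; a minimal sketch using the PSC (per-sample counts) section of bcftools stats:

# Per-sample counts (nNonRefHom, nHets, average depth, ...); compare distributions across batches
bcftools stats -s - cohort.vcf.gz | grep ^PSC > per_sample_counts.txt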

When to Re-genotype

Scenario                      Action                                                        Rationale
Adding new samples            GenomicsDB incremental add + GenotypeGVCFs on full database   New samples change cohort allele frequencies, improving all genotype calls
Changing reference genome     Full reprocess from alignment                                 gVCF coordinates are reference-specific
Updating caller version       Optional but recommended for consistency                      Different caller versions may produce different quality scores; mixing versions adds noise
Adding new genomic intervals  Reimport from scratch                                         GenomicsDB intervals are locked at initial import; incremental update cannot expand them

DeepVariant + GLnexus Alternative

For cohorts using DeepVariant as the primary caller, GLnexus provides a scalable joint calling pathway that has been validated at 250,000+ samples:

# Step 1: Run DeepVariant per sample to produce gVCFs
# (run_deepvariant is the one-step entrypoint in the DeepVariant Docker image)
run_deepvariant --model_type=WGS \
    --ref=reference.fa --reads=sample.bam \
    --output_vcf=sample.vcf.gz --output_gvcf=sample.g.vcf.gz

# Step 2: Joint call with GLnexus (pre-tuned configs available)
glnexus_cli --config DeepVariantWGS --bed intervals.bed \
    sample1.g.vcf.gz sample2.g.vcf.gz ... | bcftools view - | bgzip -c > cohort.vcf.gz

Pre-tuned GLnexus configs: DeepVariantWGS for whole-genome, DeepVariantWES for whole-exome. These configs encode appropriate genotype quality thresholds and multiallelic handling tuned for DeepVariant output.

Complete Pipeline Script

Goal: Run the full joint calling workflow from BAMs to filtered cohort VCF.

Approach: Generate per-sample gVCFs, import into GenomicsDB, joint genotype, then index and compute statistics.

#!/bin/bash
set -euo pipefail

REFERENCE=$1
OUTPUT_DIR=$2
THREADS=16

# GenomicsDBImport creates its workspace itself and refuses a pre-existing
# directory, so only gvcfs/ and vcfs/ are created here
mkdir -p "$OUTPUT_DIR"/{gvcfs,vcfs}

echo "=== Step 1: Generate gVCFs ==="
for bam in data/*.bam; do
    sample=$(basename "$bam" .bam)
    gatk HaplotypeCaller \
        -R "$REFERENCE" \
        -I "$bam" \
        -O "$OUTPUT_DIR/gvcfs/${sample}.g.vcf.gz" \
        -ERC GVCF &

    # Limit parallelism
    while [ "$(jobs -r | wc -l)" -ge "$THREADS" ]; do sleep 1; done
done
wait

echo "=== Step 2: Create sample map ==="
for f in "$OUTPUT_DIR"/gvcfs/*.g.vcf.gz; do
    echo -e "$(basename "$f" .g.vcf.gz)\t$(realpath "$f")"
done > "$OUTPUT_DIR/sample_map.txt"

echo "=== Step 3: GenomicsDBImport ==="
gatk GenomicsDBImport \
    --sample-name-map "$OUTPUT_DIR/sample_map.txt" \
    --genomicsdb-workspace-path "$OUTPUT_DIR/genomicsdb" \
    -L intervals.bed \
    --reader-threads 4

echo "=== Step 4: Joint genotyping ==="
gatk GenotypeGVCFs \
    -R "$REFERENCE" \
    -V "gendb://$OUTPUT_DIR/genomicsdb" \
    -O "$OUTPUT_DIR/vcfs/cohort.vcf.gz"

echo "=== Step 5: Index ==="
bcftools index -t "$OUTPUT_DIR/vcfs/cohort.vcf.gz"

echo "=== Statistics ==="
bcftools stats "$OUTPUT_DIR/vcfs/cohort.vcf.gz" > "$OUTPUT_DIR/vcfs/cohort_stats.txt"

echo "=== Complete ==="
echo "Joint VCF: $OUTPUT_DIR/vcfs/cohort.vcf.gz"

Tips

Memory for Large Cohorts

# Increase Java heap for GenotypeGVCFs (the default heap is often insufficient for >500 samples)
gatk --java-options "-Xmx64g" GenotypeGVCFs ...

# For GenomicsDBImport, --batch-size controls how many gVCFs are loaded simultaneously
gatk GenomicsDBImport --batch-size 50 ...

Related Skills

  • variant-calling/gatk-variant-calling - Single-sample HaplotypeCaller workflow
  • variant-calling/deepvariant - DeepVariant caller for use with GLnexus pathway
  • variant-calling/filtering-best-practices - VQSR and hard filtering strategies
  • variant-calling/vcf-manipulation - Merging and subsetting joint-called VCFs
  • population-genetics/plink-basics - Population analysis of joint calls
  • workflows/fastq-to-variants - End-to-end germline pipeline