BioSkills bio-variant-calling

Call SNPs and indels from aligned reads using bcftools mpileup and call. Use when detecting variants from BAM files or generating VCF from alignments.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/variant-calling/variant-calling" ~/.claude/skills/gptomics-bioskills-bio-variant-calling && rm -rf "$T"
manifest: variant-calling/variant-calling/SKILL.md
source content

Version Compatibility

Reference examples tested with: bcftools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

  • CLI:
    <tool> --version
    then
    <tool> --help
    to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Variant Calling with bcftools

Call SNPs and indels from aligned reads using bcftools mpileup/call.

When to Use bcftools vs Other Callers

ScenarioRecommended CallerRationale
Quick exploratory analysisbcftoolsFast, minimal setup, good accuracy for SNPs
Non-model organismsbcftools or FreeBayesNo training data required (unlike VQSR); adjustable ploidy
Haploid/bacterial genomesbcftools with
--ploidy 1
Native ploidy support; simpler model appropriate for haploids
Highest accuracy (human)DeepVariant or GATK DRAGEN-modeDeep learning or local assembly outperforms pileup-based calling
Cohort joint genotypingGATK (GVCF workflow)bcftools multi-sample calling does not scale well beyond ~100 samples
Somatic variantsMutect2 or DeepSomaticbcftools is not designed for low-VAF somatic detection

bcftools mpileup/call uses a Bayesian statistical model on pileup data. It is fast and effective for germline SNPs, but its indel calling is less accurate than local-assembly callers (GATK HaplotypeCaller, DeepVariant) in homopolymer and low-complexity regions. For production human germline pipelines, prefer DeepVariant or GATK; use bcftools for rapid exploration, non-model organisms, or when computational resources are limited.

Basic Workflow

BAM file + Reference FASTA
         |
         v
   bcftools mpileup (generate pileup)
         |
         v
   bcftools call (call variants)
         |
         v
   VCF file

bcftools mpileup + call

Goal: Detect SNPs and indels from aligned reads using the bcftools pileup-and-call pipeline.

Approach: Generate per-position pileup likelihoods with mpileup, then call genotypes with the multiallelic caller.

"Call variants from my BAM file" → Generate genotype likelihoods from aligned reads and identify variant sites using a Bayesian caller.

Basic Variant Calling

bcftools mpileup -f reference.fa input.bam | bcftools call -mv -o variants.vcf

Output Compressed VCF

bcftools mpileup -f reference.fa input.bam | bcftools call -mv -Oz -o variants.vcf.gz
bcftools index variants.vcf.gz

Call Specific Region

bcftools mpileup -f reference.fa -r chr1:1000000-2000000 input.bam | \
    bcftools call -mv -o region.vcf

Call from Multiple BAMs

bcftools mpileup -f reference.fa sample1.bam sample2.bam sample3.bam | \
    bcftools call -mv -o variants.vcf

BAM List File

# bams.txt: one BAM path per line
bcftools mpileup -f reference.fa -b bams.txt | bcftools call -mv -o variants.vcf

mpileup Options

Goal: Control pileup generation with quality thresholds, annotations, and region restrictions.

Approach: Set minimum mapping/base quality, request specific FORMAT/INFO tags, and restrict to target regions.

Quality Filtering

bcftools mpileup -f reference.fa \
    -q 20 \           # Min mapping quality
    -Q 20 \           # Min base quality
    input.bam | bcftools call -mv -o variants.vcf

Annotate with Read Depth

bcftools mpileup -f reference.fa -a DP,AD input.bam | bcftools call -mv -o variants.vcf

Full Annotation Set

bcftools mpileup -f reference.fa \
    -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,INFO/AD \
    input.bam | bcftools call -mv -o variants.vcf

Target Regions (BED)

bcftools mpileup -f reference.fa -R targets.bed input.bam | \
    bcftools call -mv -o variants.vcf

Max Depth

bcftools mpileup -f reference.fa -d 1000 input.bam | bcftools call -mv -o variants.vcf

call Options

Calling Models

FlagModelUse Case
-m
Multiallelic callerDefault, recommended for all new analyses
-c
Consensus callerLegacy; only for backward compatibility with older pipelines

The multiallelic caller (

-m
) handles sites with multiple ALT alleles natively and is statistically superior. The consensus caller (
-c
) is retained only for reproducibility of legacy analyses.

Output Variants Only

bcftools mpileup -f reference.fa input.bam | bcftools call -mv -o variants.vcf
# -v outputs variant sites only (not reference calls)

Output All Sites

bcftools mpileup -f reference.fa input.bam | bcftools call -m -o all_sites.vcf
# Without -v, outputs all sites including reference

Ploidy

Correct ploidy is critical -- calling a diploid organism as haploid halves heterozygous sensitivity; calling haploid as diploid produces false heterozygous calls.

# Haploid calling (bacteria, mitochondria, chrY in males)
bcftools mpileup -f reference.fa input.bam | bcftools call -m --ploidy 1 -o variants.vcf

# Ploidy file for mixed-ploidy organisms (e.g., sex chromosomes, polyploids)
# Format: CHROM FROM TO SEX PLOIDY
bcftools mpileup -f reference.fa input.bam | bcftools call -m --ploidy-file ploidy.txt -o variants.vcf

# Built-in ploidy presets
bcftools call -m --ploidy GRCh38 ...  # Human with sex chromosome handling

Prior Probability

# Adjust variant prior (default 1.1e-3 for human)
# Lower for highly inbred lines; higher for diverse/outbred populations
bcftools mpileup -f reference.fa input.bam | bcftools call -m -P 0.001 -o variants.vcf

Common Pipelines

Goal: Run production-ready variant calling workflows for single-sample and multi-sample analyses.

Approach: Chain mpileup and call with quality filters, annotations, and compressed output, optionally parallelized by chromosome.

Standard SNP/Indel Calling

bcftools mpileup -Ou -f reference.fa \
    -q 20 -Q 20 \
    -a FORMAT/DP,FORMAT/AD \
    input.bam | \
bcftools call -mv -Oz -o variants.vcf.gz

bcftools index variants.vcf.gz

Multi-sample Calling

bcftools mpileup -Ou -f reference.fa \
    -a FORMAT/DP,FORMAT/AD \
    sample1.bam sample2.bam sample3.bam | \
bcftools call -mv -Oz -o cohort.vcf.gz

bcftools index cohort.vcf.gz

Calling with Regions

bcftools mpileup -Ou -f reference.fa \
    -R targets.bed \
    -a FORMAT/DP,FORMAT/AD \
    input.bam | \
bcftools call -mv -Oz -o targets.vcf.gz

Parallel by Chromosome

for chr in chr1 chr2 chr3; do
    bcftools mpileup -Ou -f reference.fa -r "$chr" input.bam | \
        bcftools call -mv -Oz -o "${chr}.vcf.gz" &
done
wait

# Concatenate results
bcftools concat -Oz -o all.vcf.gz chr*.vcf.gz
bcftools index all.vcf.gz

Annotation Tags

INFO Tags

TagDescription
DP
Total read depth
AD
Allelic depths
MQ
Mapping quality
FS
Fisher strand bias
SGB
Segregation based metric

FORMAT Tags

TagDescription
GT
Genotype
DP
Read depth per sample
AD
Allelic depths per sample
ADF
Forward strand allelic depths
ADR
Reverse strand allelic depths
GQ
Genotype quality
PL
Phred-scaled likelihoods

Request Specific Annotations

bcftools mpileup -f reference.fa \
    -a FORMAT/DP,FORMAT/AD,FORMAT/SP,INFO/AD \
    input.bam | bcftools call -mv -o variants.vcf

Performance Options

Goal: Speed up variant calling for large datasets.

Approach: Use multi-threading and uncompressed BCF piping to reduce I/O overhead.

Multi-threading

bcftools mpileup -f reference.fa --threads 4 input.bam | \
    bcftools call -mv --threads 4 -o variants.vcf

Uncompressed BCF for Speed

bcftools mpileup -Ou -f reference.fa input.bam | bcftools call -mv -Ou | \
    bcftools filter -Oz -o filtered.vcf.gz

Quick Reference

TaskCommand
Basic calling
bcftools mpileup -f ref.fa in.bam | bcftools call -mv -o out.vcf
With quality filter
bcftools mpileup -f ref.fa -q 20 -Q 20 in.bam | bcftools call -mv
Region
bcftools mpileup -f ref.fa -r chr1:1-1000 in.bam | bcftools call -mv
Multi-sample
bcftools mpileup -f ref.fa s1.bam s2.bam | bcftools call -mv
With annotations
bcftools mpileup -f ref.fa -a DP,AD in.bam | bcftools call -mv

Difficult Region Considerations

bcftools pileup-based calling is particularly vulnerable in:

  • Homopolymers: Runs of identical bases cause indel false positives. Validate indel calls in homopolymers >6bp with a second caller or manual review.
  • Low-complexity regions: ~35 Mb of GRCh38 is flagged as LCR. Consider excluding with
    -T ^lcr.bed
    or flagging calls in these regions.
  • Segmental duplications: Reads from paralogs pile up, creating false SNPs. Check MQ values -- sites with mean MQ <40 suggest ambiguous mapping.
  • High-depth regions: Set
    -d
    (max depth) to 3-4x the expected mean coverage. The default (250) may be too low for targeted panels or too high for low-coverage WGS.

Common Errors

ErrorCauseSolution
no FASTA reference
Missing -fAdd
-f reference.fa
reference mismatch
Wrong referenceUse same reference as alignment
no variants called
Low quality/depthLower quality thresholds or check BAM is not empty

Related Skills

  • vcf-basics - View and query resulting VCF
  • filtering-best-practices - Filter variants by quality
  • variant-normalization - Normalize indels
  • variant-calling/gatk-variant-calling - Higher accuracy with local assembly
  • variant-calling/deepvariant - Deep learning alternative
  • alignment-files/pileup-generation - Alternative pileup generation