SciAgent-Skills bcftools-variant-manipulation
Command-line toolkit for VCF/BCF variant file manipulation. Filter, merge, annotate, query, normalize, and compute statistics on variant call files. Essential for post-variant-calling pipelines: quality filtering, multi-sample merging, rsID annotation, and genotype extraction. Companion to samtools in the HTSlib ecosystem. Use GATK instead for complex indel realignment during variant calling; use VCFtools instead for population genetics statistics.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/bcftools-variant-manipulation" ~/.claude/skills/jaechang-hits-sciagent-skills-bcftools-variant-manipulation && rm -rf "$T"
skills/genomics-bioinformatics/bcftools-variant-manipulation/SKILL.mdbcftools — VCF/BCF Variant Manipulation Toolkit
Overview
bcftools is the standard command-line toolkit for processing VCF (Variant Call Format) and BCF (Binary Call Format) files in the HTSlib ecosystem. It covers the complete post-variant-calling workflow: format conversion, quality filtering, variant normalization, multi-sample merging, annotation with external databases, genotype extraction, and QC statistics. bcftools uses streaming by design — most commands read from stdin and write to stdout, making it ideal for memory-efficient pipelines on large cohorts.
When to Use
- Filtering variants by quality (QUAL, DP, AF) after variant calling
- Merging VCF files from multiple samples into a joint call set
- Adding rsIDs or gene annotations to variant calls
- Extracting specific fields (genotypes, allele depths) as tabular output
- Normalizing indel representations and splitting multi-allelic records
- Calling variants from pileup output (mpileup + call)
- Computing per-sample and overall VCF QC statistics
- Use
instead when calling variants with local realignment in human samplesGATK HaplotypeCaller - Use
instead for population genetics statistics (Fst, LD, Hardy-Weinberg)VCFtools - Use
in the HTSlib pipeline; usebcftools
for duplicate-marking and library metricspicard
Prerequisites
- Installation: bcftools 1.17+ (part of HTSlib suite with samtools)
- Input requirements: VCF or BGzipped+tabix-indexed VCF (
) for region queries.vcf.gz + .vcf.gz.tbi - Companion tools:
for BAM processing;samtools
for VCF indexingtabix
Check before installing: The tool may already be available in the current environment (e.g., inside a
/pixienv). Runcondafirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viacommand -v bcftoolsrather than barepixi run bcftools.bcftools
# Bioconda (recommended — installs HTSlib suite) conda install -c bioconda bcftools # Homebrew (macOS) brew install bcftools # Verify bcftools --version | head -1 # bcftools 1.20 # Index a VCF for region queries bcftools index -t variants.vcf.gz # creates .tbi bcftools index -c variants.vcf.gz # creates .csi (for chromosomes > 512 Mb)
Quick Start
# Typical post-calling workflow: normalize → filter → annotate → extract bcftools norm -d any -f reference.fa variants.vcf.gz \ | bcftools filter -i 'QUAL>20 && DP>10' \ | bcftools annotate -a dbSNP.vcf.gz -c ID \ | bcftools view -O z -o final.vcf.gz # Index the output bcftools index -t final.vcf.gz # Count variants at each stage bcftools stats final.vcf.gz | grep "^SN"
Core API
Module 1: VCF/BCF I/O and Format Conversion
Convert between text VCF and binary BCF; compress and index for random access.
# VCF → compressed BCF (fastest format for piping) bcftools view -O b -o variants.bcf variants.vcf # BCF → VCF (for human-readable output) bcftools view -O v -o variants.vcf variants.bcf # VCF → bgzipped + indexed (standard archive format) bcftools view -O z -W -o variants.vcf.gz variants.vcf # -W automatically creates .tbi index after writing
# Extract specific samples bcftools view -s sample1,sample2 -O z -o subset.vcf.gz variants.vcf.gz # Exclude samples (prefix with ^) bcftools view -s ^outlier_sample -O z -o cleaned.vcf.gz variants.vcf.gz # Extract by region (fast; requires index) bcftools view -r chr1:1000000-2000000 variants.vcf.gz -O v -o chr1_region.vcf # Streaming pipeline: no intermediate files samtools mpileup -Ou input.bam | bcftools call -m -Oz -o calls.vcf.gz
Module 2: Variant Filtering
Apply quality thresholds and FLAG-based filters to retain high-confidence calls.
# Expression-based filter (include) bcftools filter -i 'QUAL>20 && DP>10' variants.vcf.gz -O z -o filtered.vcf.gz # Expression-based filter (exclude) bcftools filter -e 'QUAL<10 || DP<5' variants.vcf.gz -O v -o filtered.vcf # Soft filter: mark but keep (sets FILTER field to label) bcftools filter -s LowQual -e 'QUAL<20' variants.vcf.gz -O z -o soft_filtered.vcf.gz # Variants with QUAL<20 get FILTER="LowQual"; others get FILTER=PASS
# Keep only PASS variants bcftools view -f PASS variants.vcf.gz -O z -o pass_only.vcf.gz # SNP-only output bcftools view --type snps variants.vcf.gz -O z -o snps.vcf.gz # Indel-only output bcftools view --type indels variants.vcf.gz -O z -o indels.vcf.gz # Filter by allele frequency and depth bcftools filter -i 'AF>0.1 && DP>20 && MQ>40' variants.vcf.gz -O z -o confident.vcf.gz # Remove SNPs within 3 bp of indels bcftools filter --SnpGap 3 variants.vcf.gz -O z -o gapfiltered.vcf.gz
Module 3: VCF Query and Extraction
Transform VCF content into tabular text for downstream analysis.
# Extract chrom, position, ref, alt, quality bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' variants.vcf.gz > variants.txt # With header row (-H adds #-prefixed column names) bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' variants.vcf.gz > variants.tsv # Per-sample genotypes and allele depths bcftools query -f '[%SAMPLE\t%GT\t%AD\n]' variants.vcf.gz > genotypes.txt # Output: sample1 0/1 25,18 (ref_depth,alt_depth)
# Rare variants (AF < 1%) bcftools query -i 'AF<0.01' -f '%CHROM\t%POS\t%REF\t%ALT\t%AF\n' \ variants.vcf.gz > rare_variants.txt # Count variants per chromosome bcftools query -f '%CHROM\n' variants.vcf.gz | sort | uniq -c | sort -rn # Extract genotype matrix across all samples bcftools query -f '%CHROM:%POS\t[%GT\t]\n' -H variants.vcf.gz > genotype_matrix.tsv
Module 4: Multi-file Operations
Combine VCF files from multiple samples (merge) or chromosomes (concat).
# Merge: join VCFs from DIFFERENT sample sets (same variants) bcftools merge sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz \ -O z -o cohort.vcf.gz # Merge with auto-indexing and threading bcftools merge -O b -W --threads 4 sample*.vcf.gz > cohort.bcf # Concat: join VCFs from SAME sample set (different chromosomes or batches) bcftools concat chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz -O z -o full.vcf.gz # Concat with overlap handling (from batched calling) bcftools concat -a --threads 4 batch*.vcf.gz -O z -o concat.vcf.gz
# Pipeline: merge → filter → normalize bcftools merge sample1.vcf.gz sample2.vcf.gz \ | bcftools filter -i 'QUAL>20' \ | bcftools norm -d any -f genome.fa \ | bcftools view -O z -o merged_clean.vcf.gz bcftools index -t merged_clean.vcf.gz # Extract genotype matrix from merged cohort bcftools merge cohort*.vcf.gz | bcftools query -f '[%GT\t]\n' > gt_matrix.tsv
Module 5: Variant Annotation
Add identifiers, gene annotations, or external data to VCF records.
# Add rsIDs from dbSNP bcftools annotate -a dbSNP.vcf.gz -c ID variants.vcf.gz -O z -o rsid_annotated.vcf.gz # Annotate with BED file (adds gene names) bcftools annotate -a genes.bed.gz \ -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') \ -c CHROM,FROM,TO,GENE \ variants.vcf.gz -O z -o gene_annotated.vcf.gz # Remove unwanted INFO fields bcftools annotate -x INFO/AC,INFO/AN,INFO/MQ variants.vcf.gz -O v -o stripped.vcf
# Normalize: left-align indels, split multi-allelic records bcftools norm -f reference.fa variants.vcf.gz -O v -o normalized.vcf # Split multi-allelic sites into separate records bcftools norm -m -any variants.vcf.gz -O v -o split.vcf # Deduplicate overlapping records bcftools norm -d any variants.vcf.gz -O z -o deduped.vcf.gz # Full normalize pipeline bcftools norm -m -any variants.vcf.gz | \ bcftools norm -d any -f reference.fa | \ bcftools view -O z -o normalized_split.vcf.gz
Module 6: Statistics and QC
Generate summary metrics and per-sample variant counts.
# Full VCF statistics report bcftools stats variants.vcf.gz > qc.stats.txt # Extract Summary Numbers section only grep "^SN" qc.stats.txt | cut -f3,4 # number of records: 45231 # number of SNPs: 38941 # number of indels: 6290 # ... # Per-sample stats (PSC = per-sample counts) bcftools stats -s - variants.vcf.gz | grep "^PSC" > per_sample.txt # cols: id sample hom_RR het hom_AA ts tv indel missing singleton
# Transition/transversion ratio (genome-wide QC) bcftools stats variants.vcf.gz | grep "Ts/Tv" # Ts/Tv ratio: 2.06 (healthy WGS; <1.8 or >2.2 suggests quality issues) # Check for sample contamination (F-statistic per sample) bcftools stats -s - variants.vcf.gz | grep "^PSC" | awk '{print $2, $9}' # sample F_missing (high = poor sample quality) # Variant calling (mpileup → call pipeline) samtools mpileup -Ou -f genome.fa *.bam | bcftools call -m -v -Oz -o calls.vcf.gz bcftools stats calls.vcf.gz | grep "^SN"
Key Concepts
Output Format Flags
-O v → VCF text (uncompressed) default for human inspection -O z → bgzipped VCF (.vcf.gz) standard for archiving -O b → binary BCF (compressed) fastest for piping -O u → binary BCF (uncompressed) fastest output (no compression)
Rule: Use
-O b or -O u for intermediate pipeline steps (no I/O overhead). Use -O z for files you will store or index with tabix.
Filter Expression Syntax
Expressions use INFO and FORMAT fields with comparison operators:
# INFO fields (one value per variant) QUAL>20 # quality score DP>10 # total depth AF<0.05 # allele frequency MQ>40 # mapping quality # FORMAT fields (per-sample; use [] to iterate) GT=="1/1" # homozygous alternate AD[1]>5 # alt allele depth > 5 GQ>20 # genotype quality # Combined QUAL>20 && DP>10 && AF>0.01 (GT=="0/0" || GT=="1/1") && GQ>30
Common Workflows
Workflow 1: Variant QC and Filtering Pipeline
Goal: Normalize, filter, and annotate a raw variant call set for downstream analysis.
#!/bin/bash VCF="raw_calls.vcf.gz" REF="reference.fa" DBSNP="dbSNP_hg38.vcf.gz" FINAL="variants_filtered_annotated.vcf.gz" # 1. Normalize: left-align indels, split multi-allelic, deduplicate bcftools norm -m -any $VCF \ | bcftools norm -d any -f $REF \ | bcftools view -O z -o normalized.vcf.gz bcftools index -t normalized.vcf.gz # 2. Filter by quality and depth bcftools filter -i 'QUAL>20 && DP>10' normalized.vcf.gz \ | bcftools filter --SnpGap 3 \ | bcftools view -f PASS -O z -o filtered.vcf.gz bcftools index -t filtered.vcf.gz # 3. Annotate with rsIDs bcftools annotate -a $DBSNP -c ID filtered.vcf.gz -O z -o $FINAL bcftools index -t $FINAL # Report variant counts echo "Final variant count:" bcftools stats $FINAL | grep "number of records"
Workflow 2: Multi-sample Cohort Merging and Genotype Extraction
Goal: Merge per-sample VCFs into a cohort VCF; extract a genotype matrix for GWAS.
#!/bin/bash SAMPLES=(sample1 sample2 sample3 sample4 sample5) # 1. Ensure all VCFs are indexed for s in "${SAMPLES[@]}"; do bcftools index -t ${s}.vcf.gz done # 2. Merge into cohort VCF (only sites present in ALL samples: -m none) bcftools merge -m none "${SAMPLES[@]/%/.vcf.gz}" \ -O z --threads 8 -o cohort.vcf.gz bcftools index -t cohort.vcf.gz # 3. Filter: PASS, SNPs only, MAF > 1% bcftools view -f PASS --type snps cohort.vcf.gz \ | bcftools filter -i 'AF>0.01 && AF<0.99' \ | bcftools view -O z -o cohort_snps_filtered.vcf.gz # 4. Extract numeric genotype matrix (for plink/R/Python) bcftools query -H \ -f '%CHROM\t%POS\t%REF\t%ALT\t[%GT\t]\n' \ cohort_snps_filtered.vcf.gz > genotype_matrix.tsv echo "Genotype matrix: $(wc -l < genotype_matrix.tsv) variants x ${#SAMPLES[@]} samples"
Key Parameters
| Parameter | Command | Default | Range/Options | Effect |
|---|---|---|---|---|
| Most | | ,,, | Output format: VCF, bgzip-VCF, BCF, uncompressed BCF |
| Most | — | | Region filter (requires tabix index) |
| Most | All | sample names | Include specific samples (prefix to exclude) |
| filter | — | expression | Include variants matching expression |
| filter | — | expression | Exclude variants matching expression |
| query | — | format string | Custom output format string |
| Most | 1 | 1–N | Compression/decompression threads |
| annotate | — | file path | Annotation source (BED, VCF, TSV) |
| norm | none | , | Split (−) or join (+) multi-allelic records |
| norm | — | ,, | Deduplication strategy |
| view | — | flag | Auto-create index after writing |
| call | — | flag | Output variant sites only (skip reference sites) |
Best Practices
-
Index before region queries:
requires abcftools view -r chr1:...
or.tbi
index. Always run.csi
after creating any bgzipped VCF.bcftools index -t output.vcf.gz -
Normalize before merging or annotation: Different callers represent the same indel differently. Run
before merging to prevent duplicate records.bcftools norm -m -any | bcftools norm -d any -f ref.fa -
Use
for pipeline intermediates: Uncompressed BCF output (-O u
) eliminates compression/decompression overhead in multi-step pipes — typically 2-3× faster than-O u
.-O z -
Verify Ts/Tv ratio after calling:
. For human WGS, expect 2.0–2.1; exome 2.5–3.0. Values outside these ranges indicate quality problems.bcftools stats variants.vcf.gz | grep Ts/Tv -
Filter before merge for large cohorts: Filtering per-sample VCFs before merging reduces memory and I/O. Apply site-level QC (
) to each sample beforeQUAL>20 && DP>5
.bcftools merge -
Check chromosome naming consistency: bcftools fails silently if merging
-style withchr1
-style VCFs. Verify with1
.bcftools view -h file.vcf.gz | grep "^##contig"
Common Recipes
Recipe: Count Variants Before and After Filtering
echo "Before:" $(bcftools view -c 1 raw.vcf.gz | wc -l) echo "PASS only:" $(bcftools view -f PASS raw.vcf.gz | bcftools view -c 1 | wc -l) echo "SNPs PASS:" $(bcftools view -f PASS --type snps raw.vcf.gz | bcftools view -c 1 | wc -l)
Recipe: Extract Heterozygous Sites for One Sample
bcftools view -s SAMPLE_A variants.vcf.gz \ | bcftools filter -i 'GT="0/1"' \ | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' > het_sites.txt echo "$(wc -l < het_sites.txt) heterozygous sites"
Recipe: Compare Two VCF Files for Concordance
# Find variants unique to each file and shared bcftools isec -p isec_dir file1.vcf.gz file2.vcf.gz ls isec_dir/ # 0000.vcf: private to file1 # 0001.vcf: private to file2 # 0002.vcf: shared (from file1 perspective) # 0003.vcf: shared (from file2 perspective) echo "Shared: $(wc -l < isec_dir/0002.vcf) variants"
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
error | VCF not indexed (needed for region queries) | Run |
parse error | Malformed VCF FORMAT or INFO field | Validate: ; check source tool version |
| Empty filter output | Expression too strict or field missing | Test expression: |
| Merge: sample duplication | Duplicate sample names across input VCFs | Rename with before merging |
| Wrong Ts/Tv ratio (<1.8) | Low-quality calls or poor coverage | Apply stricter quality filter (); check alignment quality |
fails with overlap | Overlapping regions in input VCFs | Use flag: |
| Annotation mismatch | chr naming conflict (chr1 vs 1) | Check: ; rename with |
returns empty fields | FORMAT field not populated for sample | Check VCF header: |
Related Skills
- samtools-bam-processing — BAM processing that feeds into bcftools variant calling pipeline
- bedtools-genomic-intervals — intersecting VCF variants with genomic features (genes, regions)
- gget-genomic-databases — Ensembl/NCBI queries to annotate variant gene context
References
- bcftools documentation — complete command reference
- GitHub: samtools/bcftools — source code, releases, issue tracker
- Danecek et al. (2021) "Twelve years of SAMtools and BCFtools" — GigaScience 10(2)
- VCF file format specification — field definitions and encoding rules