SciAgent-Skills bwa-mem2-dna-aligner

Fast short-read DNA aligner for whole-genome, whole-exome, and ChIP-seq alignment to a reference genome. BWA-MEM2 is the 2× faster successor to BWA-MEM; outputs SAM/BAM with read group headers required by GATK. Produces primary alignments with supplementary records for chimeric reads. Use STAR instead for RNA-seq splice-aware alignment; use Bowtie2 as an alternative with comparable accuracy.

install

source · Clone the upstream repo

git clone https://github.com/jaechang-hits/SciAgent-Skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/bwa-mem2-dna-aligner" ~/.claude/skills/jaechang-hits-sciagent-skills-bwa-mem2-dna-aligner && rm -rf "$T"

manifest: skills/genomics-bioinformatics/bwa-mem2-dna-aligner/SKILL.md

source content

BWA-MEM2 — DNA Short-Read Aligner

Overview

BWA-MEM2 aligns short DNA reads (Illumina, 50–250 bp) to a reference genome using the BWT-FM index. It is the standard aligner for whole-genome sequencing (WGS), whole-exome sequencing (WES), ChIP-seq, and ATAC-seq DNA alignment. BWA-MEM2 is 2× faster than the original BWA-MEM while producing identical results. It outputs SAM format with proper read group (

@RG

) headers required by GATK HaplotypeCaller and Picard tools. For paired-end reads, it marks proper pairs and resolves chimeric/split reads into supplementary alignments.

When to Use

Aligning WGS or WES Illumina reads to a reference genome for variant calling (SNP, indel, SV)
ChIP-seq or ATAC-seq DNA alignment to produce BAM files for peak calling with MACS3
Producing GATK-compatible BAM files with
```
@RG
```
read group tags
Aligning reads ≥ 50 bp; for shorter reads (< 50 bp), BWA-backtrack may be more appropriate
Re-aligning legacy FASTQ files to an updated reference genome assembly
Use STAR instead for RNA-seq reads that span splice junctions
Use Bowtie2 as an alternative for local alignment or when index size must be minimized

Prerequisites

Software: bwa-mem2 (conda or pre-compiled binary), samtools
Reference: genome FASTA (e.g., GRCh38, hg19, mm10)
RAM: ~28 GB for human genome index; 6–8 GB for mouse

Check before installing: The tool may already be available in the current environment (e.g., inside a
pixi
/
conda
env). Run
command -v bwa-mem2
first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via
pixi run bwa-mem2
rather than bare
bwa-mem2
.

# Install with conda (recommended)
conda install -c bioconda bwa-mem2 samtools

# Or download pre-compiled binary
wget https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.2.1/bwa-mem2-2.2.1_x64-linux.tar.bz2
tar -jxf bwa-mem2-2.2.1_x64-linux.tar.bz2
export PATH="$PWD/bwa-mem2-2.2.1_x64-linux:$PATH"

# Verify
bwa-mem2 version
# 2.2.1

Quick Start

# 1. Build genome index (~30 min, run once)
bwa-mem2 index GRCh38.fa

# 2. Align paired-end reads and sort
bwa-mem2 mem -t 16 -R "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA" \
    GRCh38.fa sample1_R1.fastq.gz sample1_R2.fastq.gz \
    | samtools sort -@ 8 -o sample1.sorted.bam

# 3. Index the BAM
samtools index sample1.sorted.bam
echo "Aligned reads: $(samtools view -c -F 4 sample1.sorted.bam)"

Workflow

Step 1: Download Reference Genome

Obtain the reference genome FASTA file matching the target assembly.

# Download GRCh38 primary assembly (human)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GRCh38.fa

# Or use ENSEMBL/GENCODE
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.primary_assembly.genome.fa.gz
gunzip GRCh38.primary_assembly.genome.fa.gz

echo "Reference size: $(du -sh GRCh38.fa)"

Step 2: Build BWA-MEM2 Index

Index the reference genome — required once per genome, takes ~25-35 min for human.

# Build index (~28 GB RAM required for human genome)
bwa-mem2 index GRCh38.fa

# This creates: GRCh38.fa.0123, GRCh38.fa.amb, GRCh38.fa.ann,
#               GRCh38.fa.bwt.2bit.64, GRCh38.fa.pac
echo "Index files: $(ls GRCh38.fa.* | wc -l) created"
ls -lh GRCh38.fa.*

Step 3: Align Paired-End Reads

Align FASTQ reads with a read group header required for GATK compatibility.

# Align with read group (required for GATK)
# @RG fields: ID (run ID), SM (sample name), PL (platform), LB (library), PU (flowcell)
bwa-mem2 mem \
    -t 16 \
    -R "@RG\tID:sample1_run1\tSM:sample1\tPL:ILLUMINA\tLB:lib1\tPU:flowcell1" \
    GRCh38.fa \
    sample1_R1.fastq.gz \
    sample1_R2.fastq.gz \
    | samtools sort -@ 8 -m 2G -o sample1.sorted.bam

samtools index sample1.sorted.bam
echo "Alignment complete."
echo "Total reads:   $(samtools view -c sample1.sorted.bam)"
echo "Mapped reads:  $(samtools view -c -F 4 sample1.sorted.bam)"

Step 4: Mark PCR Duplicates

Remove or mark optical and PCR duplicates before variant calling.

# Option A: samtools markdup (fast)
samtools fixmate -m sample1.sorted.bam sample1.fixmate.bam
samtools sort -@ 8 -o sample1.fixmate.sorted.bam sample1.fixmate.bam
samtools markdup -@ 8 sample1.fixmate.sorted.bam sample1.markdup.bam
samtools index sample1.markdup.bam

echo "Duplication rate:"
samtools flagstat sample1.markdup.bam | grep "duplicate"

# Option B: Picard MarkDuplicates (GATK best practices)
picard MarkDuplicates \
    INPUT=sample1.sorted.bam \
    OUTPUT=sample1.markdup.bam \
    METRICS_FILE=sample1.dupmetrics.txt \
    REMOVE_DUPLICATES=false \
    CREATE_INDEX=true
cat sample1.dupmetrics.txt | grep -A2 "ESTIMATED"

Step 5: Assess Alignment Quality

Generate alignment statistics and check key quality metrics.

# Full alignment statistics
samtools flagstat sample1.markdup.bam > sample1.flagstat.txt
cat sample1.flagstat.txt

# Coverage statistics
samtools coverage sample1.markdup.bam | head -30

# Parse key metrics with Python
python3 - << 'EOF'
from pathlib import Path

flagstat = Path("sample1.flagstat.txt").read_text()
for line in flagstat.splitlines():
    if any(kw in line for kw in ["total", "mapped", "properly paired", "duplicate"]):
        print(line)
EOF

Step 6: Complete WGS/WES Pipeline → Variant Calling

Pipe BWA-MEM2 output directly into GATK HaplotypeCaller.

#!/bin/bash
# Complete WGS alignment → variant calling pipeline
GENOME="GRCh38.fa"
SAMPLE="sample1"
R1="data/${SAMPLE}_R1.fastq.gz"
R2="data/${SAMPLE}_R2.fastq.gz"
THREADS=16
OUTDIR="results/${SAMPLE}"
mkdir -p "$OUTDIR"

# Step 1: Align + sort
bwa-mem2 mem -t $THREADS \
    -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:lib1\tPU:run1" \
    $GENOME $R1 $R2 \
    | samtools sort -@ 8 -o $OUTDIR/${SAMPLE}.sorted.bam
samtools index $OUTDIR/${SAMPLE}.sorted.bam

# Step 2: Mark duplicates
samtools fixmate -m $OUTDIR/${SAMPLE}.sorted.bam - \
    | samtools sort -@ 8 \
    | samtools markdup -@ 8 - $OUTDIR/${SAMPLE}.markdup.bam
samtools index $OUTDIR/${SAMPLE}.markdup.bam

# Step 3: GATK variant calling
gatk HaplotypeCaller \
    -R $GENOME \
    -I $OUTDIR/${SAMPLE}.markdup.bam \
    -O $OUTDIR/${SAMPLE}.g.vcf.gz \
    -ERC GVCF \
    --native-pair-hmm-threads 4

echo "Pipeline complete: $OUTDIR/${SAMPLE}.g.vcf.gz"

Key Parameters

Parameter	Default	Range/Options	Effect
`-t`	`1`	1–64	CPU threads; use 8–16 for production runs
`-R`	—	`@RG\tID:...\tSM:...`	Read group string; required for GATK compatibility
`-k`	`19`	10–28	Minimum seed length; lower = more sensitive for shorter reads
`-w`	`100`	50–500	Band width for Smith-Waterman alignment
`-M`	off	flag	Mark split/supplementary reads as secondary (BWA-MEM style); needed for Picard compatibility
`-a`	off	flag	Output all alignments for single-end reads (for seeding; increases file size)
`-p`	off	flag	Treat input as interleaved paired-end FASTQ
`-c`	`500`	100–10000	Skip alignment for MEM count > threshold (reduces multi-mapper noise)
`-T`	`30`	20–60	Minimum alignment score threshold; lower = report more low-quality alignments
`-Y`	off	flag	Use soft clipping for supplementary alignments (recommended for GATK)

Common Recipes

Recipe 1: Batch Align Multiple Samples

#!/bin/bash
# Align all samples in parallel using GNU parallel or sequential loop
GENOME="GRCh38.fa"
SAMPLES=(ctrl_1 ctrl_2 treat_1 treat_2)
THREADS=12

for sample in "${SAMPLES[@]}"; do
    echo "=== Aligning $sample ==="
    bwa-mem2 mem -t $THREADS \
        -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA\tLB:lib1\tPU:run1" \
        $GENOME \
        data/${sample}_R1.fastq.gz \
        data/${sample}_R2.fastq.gz \
        | samtools sort -@ 4 -m 2G -o results/${sample}.sorted.bam
    samtools index results/${sample}.sorted.bam
    
    MAPPED=$(samtools view -c -F 4 results/${sample}.sorted.bam)
    TOTAL=$(samtools view -c results/${sample}.sorted.bam)
    echo "$sample: $MAPPED / $TOTAL reads mapped"
done

Recipe 2: Parse Alignment Metrics with Python

import subprocess
import pandas as pd
from pathlib import Path

samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
metrics = []

for sample in samples:
    bam = f"results/{sample}.sorted.bam"
    result = subprocess.run(
        ["samtools", "flagstat", bam], capture_output=True, text=True
    )
    stats = {}
    for line in result.stdout.splitlines():
        if "total" in line:
            stats["total"] = int(line.split()[0])
        elif "mapped" in line and "%" in line:
            stats["mapped"] = int(line.split()[0])
            stats["pct_mapped"] = float(line.split("(")[1].split("%")[0])
        elif "properly paired" in line:
            stats["properly_paired"] = int(line.split()[0])
    stats["sample"] = sample
    metrics.append(stats)

df = pd.DataFrame(metrics).set_index("sample")
print(df[["total", "mapped", "pct_mapped", "properly_paired"]])
df.to_csv("alignment_metrics.tsv", sep="\t")

Expected Outputs

Output	Format	Description
`*.sorted.bam`	BAM	Coordinate-sorted aligned reads; index with `samtools index`
`*.sorted.bam.bai`	BAI	BAM index; required for random access by GATK and IGV
`*.flagstat.txt`	Text	Alignment summary: total/mapped/paired/duplicate counts and percentages
`*.markdup.bam`	BAM	Duplicate-marked BAM; use as input to GATK HaplotypeCaller
`*.dupmetrics.txt`	Text	Picard duplication metrics with estimated library size

Troubleshooting

Problem	Cause	Solution
Low mapping rate (< 85%)	Genome mismatch, contamination, or low quality reads	Verify genome assembly matches sample; run FastQC; trim adapters with Trim Galore
`@RG` header missing error in GATK	`-R` flag not specified during alignment	Re-align with `-R "@RG\tID:...\tSM:...\tPL:ILLUMINA"`
Out of memory during indexing	Insufficient RAM for genome index	BWA-MEM2 requires ~28 GB for human; use classic bwa with less RAM if needed
Unbalanced paired-end counts	Interleaved FASTQ or file mismatch	Add `-p` for interleaved; verify R1/R2 read counts with `zcat r1.fq.gz \| wc -l`
`[E::bwa_idx_load_from_disk]` error	Index files missing or wrong prefix	Re-run `bwa-mem2 index genome.fa` ; ensure all `.0123` , `.bwt.2bit.64` files exist
Slow alignment speed	Low thread count or slow I/O	Use `-t 16` or more; store data on SSD; pipe directly to `samtools sort`
Supplementary alignments causing issues	Split reads in downstream tools	Add `-M` flag to mark split reads as secondary (Picard compatibility)
GATK base quality score recalibration fails	Missing known variant VCF	Download dbSNP VCF for your genome assembly from NCBI or GATK resource bundle

References

BWA-MEM2 GitHub: bwa-mem2/bwa-mem2 — source code, pre-compiled binaries, and benchmarks
Vasimuddin M et al. (2019) "Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems" — IPDPS 2019. DOI:10.1109/IPDPS.2019.00041
Li H & Durbin R (2009) "Fast and accurate short read alignment with Burrows-Wheeler Aligner" — Bioinformatics 25(14):1754-1760. DOI:10.1093/bioinformatics/btp324
GATK Best Practices for germline short variant discovery — GATK workflow recommending BWA-MEM2 alignment