SciAgent-Skills bwa-mem2-dna-aligner

Fast short-read DNA aligner for whole-genome, whole-exome, and ChIP-seq alignment to a reference genome. BWA-MEM2 is the 2× faster successor to BWA-MEM; outputs SAM/BAM with read group headers required by GATK. Produces primary alignments with supplementary records for chimeric reads. Use STAR instead for RNA-seq splice-aware alignment; use Bowtie2 as an alternative with comparable accuracy.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/bwa-mem2-dna-aligner" ~/.claude/skills/jaechang-hits-sciagent-skills-bwa-mem2-dna-aligner && rm -rf "$T"
manifest: skills/genomics-bioinformatics/bwa-mem2-dna-aligner/SKILL.md
source content

BWA-MEM2 — DNA Short-Read Aligner

Overview

BWA-MEM2 aligns short DNA reads (Illumina, 50–250 bp) to a reference genome using the BWT-FM index. It is the standard aligner for whole-genome sequencing (WGS), whole-exome sequencing (WES), ChIP-seq, and ATAC-seq DNA alignment. BWA-MEM2 is 2× faster than the original BWA-MEM while producing identical results. It outputs SAM format with proper read group (

@RG
) headers required by GATK HaplotypeCaller and Picard tools. For paired-end reads, it marks proper pairs and resolves chimeric/split reads into supplementary alignments.

When to Use

  • Aligning WGS or WES Illumina reads to a reference genome for variant calling (SNP, indel, SV)
  • ChIP-seq or ATAC-seq DNA alignment to produce BAM files for peak calling with MACS3
  • Producing GATK-compatible BAM files with
    @RG
    read group tags
  • Aligning reads ≥ 50 bp; for shorter reads (< 50 bp), BWA-backtrack may be more appropriate
  • Re-aligning legacy FASTQ files to an updated reference genome assembly
  • Use STAR instead for RNA-seq reads that span splice junctions
  • Use Bowtie2 as an alternative for local alignment or when index size must be minimized

Prerequisites

  • Software: bwa-mem2 (conda or pre-compiled binary), samtools
  • Reference: genome FASTA (e.g., GRCh38, hg19, mm10)
  • RAM: ~28 GB for human genome index; 6–8 GB for mouse

Check before installing: The tool may already be available in the current environment (e.g., inside a

pixi
/
conda
env). Run
command -v bwa-mem2
first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via
pixi run bwa-mem2
rather than bare
bwa-mem2
.

# Install with conda (recommended)
conda install -c bioconda bwa-mem2 samtools

# Or download pre-compiled binary
wget https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.2.1/bwa-mem2-2.2.1_x64-linux.tar.bz2
tar -jxf bwa-mem2-2.2.1_x64-linux.tar.bz2
export PATH="$PWD/bwa-mem2-2.2.1_x64-linux:$PATH"

# Verify
bwa-mem2 version
# 2.2.1

Quick Start

# 1. Build genome index (~30 min, run once)
bwa-mem2 index GRCh38.fa

# 2. Align paired-end reads and sort
bwa-mem2 mem -t 16 -R "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA" \
    GRCh38.fa sample1_R1.fastq.gz sample1_R2.fastq.gz \
    | samtools sort -@ 8 -o sample1.sorted.bam

# 3. Index the BAM
samtools index sample1.sorted.bam
echo "Aligned reads: $(samtools view -c -F 4 sample1.sorted.bam)"

Workflow

Step 1: Download Reference Genome

Obtain the reference genome FASTA file matching the target assembly.

# Download GRCh38 primary assembly (human)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GRCh38.fa

# Or use ENSEMBL/GENCODE
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.primary_assembly.genome.fa.gz
gunzip GRCh38.primary_assembly.genome.fa.gz

echo "Reference size: $(du -sh GRCh38.fa)"

Step 2: Build BWA-MEM2 Index

Index the reference genome — required once per genome, takes ~25-35 min for human.

# Build index (~28 GB RAM required for human genome)
bwa-mem2 index GRCh38.fa

# This creates: GRCh38.fa.0123, GRCh38.fa.amb, GRCh38.fa.ann,
#               GRCh38.fa.bwt.2bit.64, GRCh38.fa.pac
echo "Index files: $(ls GRCh38.fa.* | wc -l) created"
ls -lh GRCh38.fa.*

Step 3: Align Paired-End Reads

Align FASTQ reads with a read group header required for GATK compatibility.

# Align with read group (required for GATK)
# @RG fields: ID (run ID), SM (sample name), PL (platform), LB (library), PU (flowcell)
bwa-mem2 mem \
    -t 16 \
    -R "@RG\tID:sample1_run1\tSM:sample1\tPL:ILLUMINA\tLB:lib1\tPU:flowcell1" \
    GRCh38.fa \
    sample1_R1.fastq.gz \
    sample1_R2.fastq.gz \
    | samtools sort -@ 8 -m 2G -o sample1.sorted.bam

samtools index sample1.sorted.bam
echo "Alignment complete."
echo "Total reads:   $(samtools view -c sample1.sorted.bam)"
echo "Mapped reads:  $(samtools view -c -F 4 sample1.sorted.bam)"

Step 4: Mark PCR Duplicates

Remove or mark optical and PCR duplicates before variant calling.

# Option A: samtools markdup (fast)
samtools fixmate -m sample1.sorted.bam sample1.fixmate.bam
samtools sort -@ 8 -o sample1.fixmate.sorted.bam sample1.fixmate.bam
samtools markdup -@ 8 sample1.fixmate.sorted.bam sample1.markdup.bam
samtools index sample1.markdup.bam

echo "Duplication rate:"
samtools flagstat sample1.markdup.bam | grep "duplicate"

# Option B: Picard MarkDuplicates (GATK best practices)
picard MarkDuplicates \
    INPUT=sample1.sorted.bam \
    OUTPUT=sample1.markdup.bam \
    METRICS_FILE=sample1.dupmetrics.txt \
    REMOVE_DUPLICATES=false \
    CREATE_INDEX=true
cat sample1.dupmetrics.txt | grep -A2 "ESTIMATED"

Step 5: Assess Alignment Quality

Generate alignment statistics and check key quality metrics.

# Full alignment statistics
samtools flagstat sample1.markdup.bam > sample1.flagstat.txt
cat sample1.flagstat.txt

# Coverage statistics
samtools coverage sample1.markdup.bam | head -30

# Parse key metrics with Python
python3 - << 'EOF'
from pathlib import Path

flagstat = Path("sample1.flagstat.txt").read_text()
for line in flagstat.splitlines():
    if any(kw in line for kw in ["total", "mapped", "properly paired", "duplicate"]):
        print(line)
EOF

Step 6: Complete WGS/WES Pipeline → Variant Calling

Pipe BWA-MEM2 output directly into GATK HaplotypeCaller.

#!/bin/bash
# Complete WGS alignment → variant calling pipeline
GENOME="GRCh38.fa"
SAMPLE="sample1"
R1="data/${SAMPLE}_R1.fastq.gz"
R2="data/${SAMPLE}_R2.fastq.gz"
THREADS=16
OUTDIR="results/${SAMPLE}"
mkdir -p "$OUTDIR"

# Step 1: Align + sort
bwa-mem2 mem -t $THREADS \
    -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:lib1\tPU:run1" \
    $GENOME $R1 $R2 \
    | samtools sort -@ 8 -o $OUTDIR/${SAMPLE}.sorted.bam
samtools index $OUTDIR/${SAMPLE}.sorted.bam

# Step 2: Mark duplicates
samtools fixmate -m $OUTDIR/${SAMPLE}.sorted.bam - \
    | samtools sort -@ 8 \
    | samtools markdup -@ 8 - $OUTDIR/${SAMPLE}.markdup.bam
samtools index $OUTDIR/${SAMPLE}.markdup.bam

# Step 3: GATK variant calling
gatk HaplotypeCaller \
    -R $GENOME \
    -I $OUTDIR/${SAMPLE}.markdup.bam \
    -O $OUTDIR/${SAMPLE}.g.vcf.gz \
    -ERC GVCF \
    --native-pair-hmm-threads 4

echo "Pipeline complete: $OUTDIR/${SAMPLE}.g.vcf.gz"

Key Parameters

ParameterDefaultRange/OptionsEffect
-t
1
1–64CPU threads; use 8–16 for production runs
-R
@RG\tID:...\tSM:...
Read group string; required for GATK compatibility
-k
19
10–28Minimum seed length; lower = more sensitive for shorter reads
-w
100
50–500Band width for Smith-Waterman alignment
-M
offflagMark split/supplementary reads as secondary (BWA-MEM style); needed for Picard compatibility
-a
offflagOutput all alignments for single-end reads (for seeding; increases file size)
-p
offflagTreat input as interleaved paired-end FASTQ
-c
500
100–10000Skip alignment for MEM count > threshold (reduces multi-mapper noise)
-T
30
20–60Minimum alignment score threshold; lower = report more low-quality alignments
-Y
offflagUse soft clipping for supplementary alignments (recommended for GATK)

Common Recipes

Recipe 1: Batch Align Multiple Samples

#!/bin/bash
# Align all samples in parallel using GNU parallel or sequential loop
GENOME="GRCh38.fa"
SAMPLES=(ctrl_1 ctrl_2 treat_1 treat_2)
THREADS=12

for sample in "${SAMPLES[@]}"; do
    echo "=== Aligning $sample ==="
    bwa-mem2 mem -t $THREADS \
        -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA\tLB:lib1\tPU:run1" \
        $GENOME \
        data/${sample}_R1.fastq.gz \
        data/${sample}_R2.fastq.gz \
        | samtools sort -@ 4 -m 2G -o results/${sample}.sorted.bam
    samtools index results/${sample}.sorted.bam
    
    MAPPED=$(samtools view -c -F 4 results/${sample}.sorted.bam)
    TOTAL=$(samtools view -c results/${sample}.sorted.bam)
    echo "$sample: $MAPPED / $TOTAL reads mapped"
done

Recipe 2: Parse Alignment Metrics with Python

import subprocess
import pandas as pd
from pathlib import Path

samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
metrics = []

for sample in samples:
    bam = f"results/{sample}.sorted.bam"
    result = subprocess.run(
        ["samtools", "flagstat", bam], capture_output=True, text=True
    )
    stats = {}
    for line in result.stdout.splitlines():
        if "total" in line:
            stats["total"] = int(line.split()[0])
        elif "mapped" in line and "%" in line:
            stats["mapped"] = int(line.split()[0])
            stats["pct_mapped"] = float(line.split("(")[1].split("%")[0])
        elif "properly paired" in line:
            stats["properly_paired"] = int(line.split()[0])
    stats["sample"] = sample
    metrics.append(stats)

df = pd.DataFrame(metrics).set_index("sample")
print(df[["total", "mapped", "pct_mapped", "properly_paired"]])
df.to_csv("alignment_metrics.tsv", sep="\t")

Expected Outputs

OutputFormatDescription
*.sorted.bam
BAMCoordinate-sorted aligned reads; index with
samtools index
*.sorted.bam.bai
BAIBAM index; required for random access by GATK and IGV
*.flagstat.txt
TextAlignment summary: total/mapped/paired/duplicate counts and percentages
*.markdup.bam
BAMDuplicate-marked BAM; use as input to GATK HaplotypeCaller
*.dupmetrics.txt
TextPicard duplication metrics with estimated library size

Troubleshooting

ProblemCauseSolution
Low mapping rate (< 85%)Genome mismatch, contamination, or low quality readsVerify genome assembly matches sample; run FastQC; trim adapters with Trim Galore
@RG
header missing error in GATK
-R
flag not specified during alignment
Re-align with
-R "@RG\tID:...\tSM:...\tPL:ILLUMINA"
Out of memory during indexingInsufficient RAM for genome indexBWA-MEM2 requires ~28 GB for human; use classic bwa with less RAM if needed
Unbalanced paired-end countsInterleaved FASTQ or file mismatchAdd
-p
for interleaved; verify R1/R2 read counts with
zcat r1.fq.gz | wc -l
[E::bwa_idx_load_from_disk]
error
Index files missing or wrong prefixRe-run
bwa-mem2 index genome.fa
; ensure all
.0123
,
.bwt.2bit.64
files exist
Slow alignment speedLow thread count or slow I/OUse
-t 16
or more; store data on SSD; pipe directly to
samtools sort
Supplementary alignments causing issuesSplit reads in downstream toolsAdd
-M
flag to mark split reads as secondary (Picard compatibility)
GATK base quality score recalibration failsMissing known variant VCFDownload dbSNP VCF for your genome assembly from NCBI or GATK resource bundle

References