BioSkills bio-alignment-validation

Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/alignment-files/alignment-validation" ~/.claude/skills/gptomics-bioskills-bio-alignment-validation && rm -rf "$T"
manifest: alignment-files/alignment-validation/SKILL.md
source content

Version Compatibility

Reference examples tested with: matplotlib 3.8+, numpy 1.26+, picard 3.1+, pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

  • Python:
    pip show <package>
    then
    help(module.function)
    to check signatures
  • CLI:
    <tool> --version
    then
    <tool> --help
    to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Alignment Validation

Post-alignment quality control to verify alignment quality and identify issues.

"Check alignment quality" → Compute post-alignment QC metrics (mapping rate, pairing, insert size, strand balance) to identify issues before downstream analysis.

  • CLI:
    samtools flagstat
    ,
    samtools stats
    , Picard
    CollectAlignmentSummaryMetrics
  • Python:
    pysam.AlignmentFile
    iteration with metric calculations

Insert Size Distribution

Goal: Verify that the fragment length distribution matches the library preparation protocol.

Approach: Extract template_length from properly paired reads and compare the distribution to expected values for the library type.

samtools stats

samtools stats input.bam > stats.txt
grep "^IS" stats.txt | cut -f2,3 > insert_sizes.txt

Picard CollectInsertSizeMetrics

java -jar picard.jar CollectInsertSizeMetrics \
    I=input.bam \
    O=insert_metrics.txt \
    H=insert_histogram.pdf

Expected Insert Sizes

Library TypeExpected Size
Standard WGS300-500 bp
PCR-free350-550 bp
RNA-seq150-300 bp
ChIP-seq150-300 bp
ATAC-seqMultimodal

Python Insert Size Analysis

import pysam
import numpy as np
import matplotlib.pyplot as plt

def get_insert_sizes(bam_file, max_reads=100000):
    sizes = []
    bam = pysam.AlignmentFile(bam_file, 'rb')
    for i, read in enumerate(bam.fetch()):
        if i >= max_reads:
            break
        if read.is_proper_pair and not read.is_secondary and read.template_length > 0:
            sizes.append(read.template_length)
    bam.close()
    return sizes

sizes = get_insert_sizes('sample.bam')
print(f'Median insert size: {np.median(sizes):.0f}')
print(f'Mean insert size: {np.mean(sizes):.0f}')
print(f'Std dev: {np.std(sizes):.0f}')

plt.hist(sizes, bins=100, range=(0, 1000))
plt.xlabel('Insert Size')
plt.ylabel('Count')
plt.savefig('insert_size_dist.pdf')

Proper Pairing Rate

Percentage of reads correctly paired.

samtools flagstat

samtools flagstat input.bam

samtools flagstat input.bam | grep "properly paired"

Calculate Pairing Rate

proper=$(samtools view -c -f 2 input.bam)
mapped=$(samtools view -c -F 4 input.bam)
rate=$(echo "scale=4; $proper / $mapped * 100" | bc)
echo "Proper pairing rate: ${rate}%"

Expected Rates

MetricGoodMarginalPoor
Proper pair> 90%80-90%< 80%
Mapped> 95%90-95%< 90%
Singletons< 5%5-10%> 10%

GC Bias

GC content correlation with coverage.

Picard CollectGcBiasMetrics

java -jar picard.jar CollectGcBiasMetrics \
    I=input.bam \
    O=gc_bias_metrics.txt \
    CHART=gc_bias_chart.pdf \
    S=gc_summary.txt \
    R=reference.fa

deepTools computeGCBias

computeGCBias \
    -b input.bam \
    --effectiveGenomeSize 2913022398 \
    -g hg38.2bit \
    -o gc_bias.txt \
    --biasPlot gc_bias.pdf

Interpret GC Bias

IssueSymptom
Under-representationLow GC coverage drops
Over-representationHigh GC coverage elevated
PCR biasStrong correlation

Strand Balance

Forward and reverse strand should be balanced.

Calculate Strand Ratio

forward=$(samtools view -c -F 16 input.bam)
reverse=$(samtools view -c -f 16 input.bam)
echo "Forward: $forward"
echo "Reverse: $reverse"
ratio=$(echo "scale=4; $forward / $reverse" | bc)
echo "F/R ratio: $ratio"

Check Strand Bias per Chromosome

for chr in chr1 chr2 chr3; do
    fwd=$(samtools view -c -F 16 input.bam $chr)
    rev=$(samtools view -c -f 16 input.bam $chr)
    echo "$chr: F=$fwd R=$rev ratio=$(echo "scale=2; $fwd/$rev" | bc)"
done

Mapping Quality Distribution

Extract MAPQ Distribution

samtools view input.bam | cut -f5 | sort -n | uniq -c | sort -k2 -n

Calculate Mean MAPQ

samtools view input.bam | awk '{sum+=$5; count++} END {print "Mean MAPQ:", sum/count}'

MAPQ Thresholds

MAPQMeaning
0Multi-mapper
1-10Low confidence
20-30Moderate
40+High confidence
60Unique (BWA)

Chromosome Coverage Balance

Calculate Per-Chromosome Coverage

samtools idxstats input.bam | awk '{print $1, $3/$2}' | head -25

Check for Aneuploidy/Contamination

samtools idxstats input.bam | awk '$3 > 0 {
    sum += $3
    len[$1] = $2
    reads[$1] = $3
} END {
    for (chr in reads) {
        expected = len[chr] / sum * reads[chr]
        ratio = reads[chr] / expected
        if (ratio < 0.8 || ratio > 1.2) print chr, ratio
    }
}'

Mismatch Rate

Picard CollectAlignmentSummaryMetrics

java -jar picard.jar CollectAlignmentSummaryMetrics \
    I=input.bam \
    R=reference.fa \
    O=alignment_summary.txt

Key Metrics

MetricDescriptionGood Value
PCT_PF_READS_ALIGNEDMapped %> 95%
PF_MISMATCH_RATEMismatches< 1%
PF_INDEL_RATEIndels< 0.1%
STRAND_BALANCEStrand ratio~0.5

Comprehensive Validation Script

Goal: Run all key alignment QC checks in a single pass and generate a summary report.

Approach: Combine samtools flagstat, stats, idxstats, and strand counts into one script that outputs pass/warn/fail calls.

#!/bin/bash
BAM=$1
REF=$2
NAME=$(basename $BAM .bam)
OUTDIR=${3:-qc}

mkdir -p $OUTDIR

echo "=== Alignment Validation: $NAME ===" | tee $OUTDIR/report.txt

echo -e "\n--- Flagstat ---" | tee -a $OUTDIR/report.txt
samtools flagstat $BAM | tee -a $OUTDIR/report.txt

echo -e "\n--- Mapping Rate ---" | tee -a $OUTDIR/report.txt
mapped=$(samtools view -c -F 4 $BAM)
total=$(samtools view -c $BAM)
rate=$(echo "scale=2; $mapped / $total * 100" | bc)
echo "Mapping rate: ${rate}%" | tee -a $OUTDIR/report.txt

echo -e "\n--- Proper Pairing ---" | tee -a $OUTDIR/report.txt
proper=$(samtools view -c -f 2 $BAM)
pair_rate=$(echo "scale=2; $proper / $mapped * 100" | bc)
echo "Proper pairing: ${pair_rate}%" | tee -a $OUTDIR/report.txt

echo -e "\n--- Insert Size ---" | tee -a $OUTDIR/report.txt
samtools stats $BAM | grep "insert size average" | tee -a $OUTDIR/report.txt

echo -e "\n--- Strand Balance ---" | tee -a $OUTDIR/report.txt
fwd=$(samtools view -c -F 16 $BAM)
rev=$(samtools view -c -f 16 $BAM)
strand_ratio=$(echo "scale=3; $fwd / $rev" | bc)
echo "Forward: $fwd, Reverse: $rev, Ratio: $strand_ratio" | tee -a $OUTDIR/report.txt

echo -e "\n--- Chromosome Coverage ---" | tee -a $OUTDIR/report.txt
samtools idxstats $BAM | head -25 | tee -a $OUTDIR/report.txt

echo -e "\nReport: $OUTDIR/report.txt"

Python Validation Module

import pysam
import numpy as np
from collections import Counter

class AlignmentValidator:
    def __init__(self, bam_file):
        self.bam = pysam.AlignmentFile(bam_file, 'rb')

    def mapping_rate(self):
        stats = self.bam.get_index_statistics()
        mapped = sum(s.mapped for s in stats)
        unmapped = sum(s.unmapped for s in stats)
        return mapped / (mapped + unmapped) * 100

    def proper_pair_rate(self, sample_size=100000):
        proper = 0
        paired = 0
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if read.is_paired:
                paired += 1
                if read.is_proper_pair:
                    proper += 1
        return proper / paired * 100 if paired > 0 else 0

    def mapq_distribution(self, sample_size=100000):
        mapqs = []
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if not read.is_unmapped:
                mapqs.append(read.mapping_quality)
        return Counter(mapqs)

    def strand_balance(self, sample_size=100000):
        forward = 0
        reverse = 0
        for i, read in enumerate(self.bam.fetch()):
            if i >= sample_size:
                break
            if not read.is_unmapped:
                if read.is_reverse:
                    reverse += 1
                else:
                    forward += 1
        return forward / (forward + reverse) if (forward + reverse) > 0 else 0.5

    def report(self):
        print(f'Mapping rate: {self.mapping_rate():.1f}%')
        print(f'Proper pairing: {self.proper_pair_rate():.1f}%')
        print(f'Strand balance: {self.strand_balance():.3f}')

        mapq_dist = self.mapq_distribution()
        high_qual = sum(v for k, v in mapq_dist.items() if k >= 30)
        total = sum(mapq_dist.values())
        print(f'High MAPQ (>=30): {high_qual/total*100:.1f}%')

    def close(self):
        self.bam.close()

validator = AlignmentValidator('sample.bam')
validator.report()
validator.close()

Quality Thresholds Summary

MetricGoodWarningFail
Mapping rate> 95%90-95%< 90%
Proper pairing> 90%80-90%< 80%
Duplicate rate< 10%10-20%> 20%
Strand balance0.48-0.520.45-0.55Outside
Mean MAPQ> 4030-40< 30
GC bias< 1.2x1.2-1.5x> 1.5x

Related Skills

  • bam-statistics - Basic flagstat and depth
  • duplicate-handling - Mark/remove duplicates
  • alignment-filtering - Filter low-quality reads
  • chip-seq/chipseq-qc - ChIP-specific metrics