BioSkills bio-alignment-files-bam-statistics
Generate alignment statistics using samtools flagstat, stats, depth, and coverage. Use when assessing alignment quality, calculating coverage, or generating QC reports.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/alignment-files/bam-statistics" ~/.claude/skills/gptomics-bioskills-bio-alignment-files-bam-statistics && rm -rf "$T"
alignment-files/bam-statistics/SKILL.mdVersion Compatibility
Reference examples tested with: pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - CLI:
then<tool> --version
to confirm flags<tool> --help
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
BAM Statistics
"Get alignment statistics and coverage from my BAM file" → Generate read counts, mapping rates, per-chromosome statistics, depth profiles, and coverage summaries.
- CLI:
,samtools flagstat
,samtools stats
,samtools depth
(samtools)samtools coverage - Python:
withpysam.AlignmentFile
andpileup()
(pysam)get_index_statistics()
Generate alignment statistics using samtools and pysam.
Quick Summary Commands
| Command | Output | Speed |
|---|---|---|
| Read counts by category | Very fast |
| Per-chromosome counts | Very fast (needs index) |
| Comprehensive statistics | Moderate |
| Per-position depth | Slow (full scan) |
| Per-region coverage | Fast (needs index) |
samtools flagstat
Fast summary of alignment flags.
samtools flagstat input.bam
Output:
10000000 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 50000 + 0 supplementary 0 + 0 duplicates 9800000 + 0 mapped (98.00% : N/A) 9950000 + 0 paired in sequencing 4975000 + 0 read1 4975000 + 0 read2 9700000 + 0 properly paired (97.49% : N/A) 9750000 + 0 with itself and mate mapped 100000 + 0 singletons (1.01% : N/A) 25000 + 0 with mate mapped to a different chr 10000 + 0 with mate mapped to a different chr (mapQ>=5)
Multi-threaded
samtools flagstat -@ 4 input.bam
Output to File
samtools flagstat input.bam > flagstat.txt
samtools idxstats
Per-chromosome read counts (requires index).
samtools idxstats input.bam
Output format:
chrom length mapped unmapped
chr1 248956422 5000000 1000 chr2 242193529 4800000 800 chrM 16569 50000 100 * 0 0 150000
Parse idxstats
# Total mapped reads samtools idxstats input.bam | awk '{sum += $3} END {print sum}' # Mitochondrial percentage samtools idxstats input.bam | awk ' /^chrM/ {mt = $3} {total += $3} END {print mt/total*100 "% mitochondrial"}'
samtools stats
Comprehensive statistics including insert size, base quality, and more.
samtools stats input.bam > stats.txt
View Summary Numbers
samtools stats input.bam | grep "^SN"
Key summary fields:
- Total readsraw total sequences
- Mapped readsreads mapped
- Properly pairedreads mapped and paired
- Mean insert sizeinsert size average
- Insert size spreadinsert size standard deviation
- Mean read lengthaverage length
- Mismatch rateerror rate
Generate Plots (with plot-bamstats)
samtools stats input.bam > stats.txt plot-bamstats -p plots/ stats.txt
Stats for Specific Region
samtools stats input.bam chr1:1000000-2000000 > region_stats.txt
samtools depth
Per-position read depth.
Basic Depth
samtools depth input.bam > depth.txt
Output:
chrom position depth
Depth at Specific Positions
samtools depth -r chr1:1000-2000 input.bam
Include Zero-Depth Positions
samtools depth -a input.bam > depth_with_zeros.txt
Maximum Depth Cap
samtools depth -d 0 input.bam # No cap (default 8000)
Depth from BED Regions
samtools depth -b regions.bed input.bam
Calculate Mean Depth
samtools depth input.bam | awk '{sum += $3; n++} END {print sum/n}'
samtools coverage
Per-chromosome or per-region coverage statistics (faster than depth).
samtools coverage input.bam
Output columns:
- Reference name#rname
- Start positionstartpos
- End positionendpos
- Number of readsnumreads
- Bases with coveragecovbases
- Percentage of bases coveredcoverage
- Mean depthmeandepth
- Mean base qualitymeanbaseq
- Mean mapping qualitymeanmapq
Coverage for Specific Region
samtools coverage -r chr1:1000000-2000000 input.bam
Coverage from BED
samtools coverage -b regions.bed input.bam
Histogram Output
samtools coverage -m input.bam
pysam Python Alternative
Count Reads
import pysam with pysam.AlignmentFile('input.bam', 'rb') as bam: total = mapped = paired = proper = 0 for read in bam: total += 1 if not read.is_unmapped: mapped += 1 if read.is_paired: paired += 1 if read.is_proper_pair: proper += 1 print(f'Total: {total}') print(f'Mapped: {mapped} ({mapped/total*100:.1f}%)') print(f'Properly paired: {proper} ({proper/paired*100:.1f}%)')
Per-Chromosome Counts
import pysam with pysam.AlignmentFile('input.bam', 'rb') as bam: for stat in bam.get_index_statistics(): print(f'{stat.contig}: {stat.mapped} mapped, {stat.unmapped} unmapped')
Calculate Depth at Position
import pysam with pysam.AlignmentFile('input.bam', 'rb') as bam: for pileup in bam.pileup('chr1', 1000000, 1000001): print(f'Position {pileup.pos}: depth {pileup.n}')
Mean Depth in Region
import pysam def mean_depth(bam_path, chrom, start, end): depths = [] with pysam.AlignmentFile(bam_path, 'rb') as bam: for pileup in bam.pileup(chrom, start, end, truncate=True): depths.append(pileup.n) if depths: return sum(depths) / len(depths) return 0 depth = mean_depth('input.bam', 'chr1', 1000000, 2000000) print(f'Mean depth: {depth:.1f}x')
Coverage Statistics
Goal: Compute coverage breadth and depth for a genomic region from a BAM file.
Approach: Iterate pileup columns in the region, count covered positions and accumulate depth, then derive percentages and means.
Reference (pysam 0.22+):
import pysam def coverage_stats(bam_path, chrom, start, end): covered = 0 total_depth = 0 with pysam.AlignmentFile(bam_path, 'rb') as bam: for pileup in bam.pileup(chrom, start, end, truncate=True): covered += 1 total_depth += pileup.n length = end - start pct_covered = covered / length * 100 mean_depth = total_depth / length if length > 0 else 0 return { 'length': length, 'covered_bases': covered, 'pct_covered': pct_covered, 'mean_depth': mean_depth } stats = coverage_stats('input.bam', 'chr1', 1000000, 2000000) print(f'Coverage: {stats["pct_covered"]:.1f}%') print(f'Mean depth: {stats["mean_depth"]:.1f}x')
Insert Size Distribution
Goal: Compute the insert size distribution to assess library preparation quality.
Approach: Iterate properly paired read1 records, accumulate template lengths into a Counter, then compute summary statistics.
Reference (pysam 0.22+):
import pysam from collections import Counter insert_sizes = Counter() with pysam.AlignmentFile('input.bam', 'rb') as bam: for read in bam: if read.is_proper_pair and read.is_read1 and read.template_length > 0: insert_sizes[read.template_length] += 1 sizes = list(insert_sizes.keys()) mean_insert = sum(s * c for s, c in insert_sizes.items()) / sum(insert_sizes.values()) print(f'Mean insert size: {mean_insert:.0f}') print(f'Min: {min(sizes)}, Max: {max(sizes)}')
Quick Reference
| Task | Command |
|---|---|
| Quick counts | |
| Per-chrom counts | |
| Full stats | |
| Coverage summary | |
| Per-position depth | |
| Mean depth | |
Common Metrics
| Metric | Good | Concerning |
|---|---|---|
| Mapping rate | >95% | <80% |
| Proper pair rate | >90% | <70% |
| Duplicate rate | <20% | >40% |
| Error rate | <1% | >2% |
| Coverage uniformity | <2x CV | >3x CV |
Related Skills
- sam-bam-basics - View alignment files
- alignment-indexing - idxstats requires index
- duplicate-handling - Check duplicate rates
- alignment-filtering - Filter before stats
- sequence-io/sequence-statistics - FASTA/FASTQ statistics