BioSkills bio-alignment-indexing
Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/alignment-files/alignment-indexing" ~/.claude/skills/gptomics-bioskills-bio-alignment-indexing && rm -rf "$T"
alignment-files/alignment-indexing/SKILL.md
Version Compatibility
Reference examples tested with: pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
- CLI: <tool> --version, then <tool> --help to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Alignment Indexing
Create indices for random access to alignment files using samtools and pysam.
"Index a BAM file" → Create a .bai/.csi index enabling random access to genomic regions.
- CLI: samtools index file.bam
- Python: pysam.index('file.bam')
Index Types
| Index | Extension | Use Case |
|---|---|---|
| BAI | .bai | Standard BAM index, chromosomes < 512 Mbp |
| CSI | .csi | Large chromosomes, custom bin sizes |
| CRAI | .crai | CRAM index |
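The 512 Mbp threshold corresponds to 2^29 (536,870,912) bases, the largest coordinate range the BAI binning scheme can address. As a minimal sketch, the hypothetical helper below picks the samtools index option from a mapping of contig lengths. The function name, the input shape, and the conservative choice of CSI at exactly 2^29 bases are assumptions made for illustration; in practice the lengths would come from the @SQ header lines.

```python
# BAI bins address coordinates up to 2**29 bases (512 Mbp); longer contigs need CSI.
BAI_MAX_CONTIG_LENGTH = 2 ** 29


def index_args(contig_lengths):
    """Return a samtools index option suited to the longest contig.

    contig_lengths: dict mapping contig name -> length in bases
    (hypothetical input shape, e.g. collected from the @SQ header lines).
    """
    longest = max(contig_lengths.values(), default=0)
    # '-c' requests a CSI index, '-b' a BAI index.
    # Conservative assumption: switch to CSI at exactly 2**29 bases as well.
    return ['-c'] if longest >= BAI_MAX_CONTIG_LENGTH else ['-b']


human = {'chr1': 248956422, 'chr2': 242193529}
large_genome = {'contig_big': 620000000}  # hypothetical contig longer than 512 Mbp

print(index_args(human))
print(index_args(large_genome))
```

The returned list can be prepended to the file name when building the samtools index command line.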
samtools index
Create BAI Index
samtools index input.bam # Creates input.bam.bai
Create CSI Index
samtools index -c input.bam # Creates input.bam.csi
Specify Output Name
samtools index input.bam output.bai
Multi-threaded Indexing
samtools index -@ 4 input.bam
Index CRAM
samtools index input.cram # Creates input.cram.crai
Index Requirements
Indexing requires coordinate-sorted files:
# Check sort order
samtools view -H input.bam | grep "^@HD"
# Should show SO:coordinate

# Sort if needed, then index
samtools sort -o sorted.bam input.bam
samtools index sorted.bam
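The same sort-order check can be scripted. The sketch below reads the SO tag from header text such as the output of samtools view -H; the function name is hypothetical and the header string is a made-up example.

```python
def sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith('@HD'):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split('\t')[1:]:
                tag, _, value = field.partition(':')
                if tag == 'SO':
                    return value
    return None


# Made-up header for illustration.
header = '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n'
order = sort_order(header)
print(order)
if order != 'coordinate':
    print('Sort with samtools sort before indexing')
```

A missing @HD line or SO tag returns None, which should be treated the same as an unsorted file.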
Using Indices for Region Access
Goal: Extract reads overlapping specific genomic coordinates from an indexed BAM.
Approach: With the index present, samtools view or pysam's AlignmentFile.fetch() can jump directly to the relevant file offset instead of scanning the entire file.
samtools view with Region
# Requires index file present
samtools view input.bam chr1:1000000-2000000
Multiple Regions
samtools view input.bam chr1:1000-2000 chr2:3000-4000
Regions from BED File
samtools view -L regions.bed input.bam
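BED intervals are 0-based and half-open, while samtools region strings are 1-based and inclusive. The sketch below converts one into the other, which helps when a few BED rows need to be passed as command-line regions. The function name and the sample BED text are hypothetical.

```python
def bed_to_regions(bed_text):
    """Convert BED intervals (0-based, half-open) to region strings (1-based, inclusive)."""
    regions = []
    for line in bed_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(('#', 'track', 'browser')):
            continue
        chrom, start, end = line.split()[:3]
        # Only the start shifts by one; the half-open end equals the inclusive end.
        regions.append(f'{chrom}:{int(start) + 1}-{int(end)}')
    return regions


bed = 'chr1\t999\t2000\nchr2\t2999\t4000\n'  # made-up intervals
print(bed_to_regions(bed))  # ['chr1:1000-2000', 'chr2:3000-4000']
```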
pysam Python Alternative
Create Index
import pysam

pysam.index('input.bam')  # Creates input.bam.bai
Create CSI Index
# pysam.index() forwards its arguments to samtools index, so pass -c as a string
pysam.index('-c', 'input.bam')  # Creates input.bam.csi
Fetch with Index
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    # fetch() requires index
    for read in bam.fetch('chr1', 1000000, 2000000):
        print(read.query_name)
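Note that the two interfaces use different coordinate conventions: samtools region strings are 1-based and inclusive, whereas fetch() takes 0-based, half-open start and end values. The sketch below converts a region string into fetch() arguments. The helper name is hypothetical, and it assumes contig names that contain no colon.

```python
import re

# contig, optionally followed by :start and optionally -end (commas allowed in numbers)
_REGION = re.compile(r'^(?P<contig>[^:]+)(?::(?P<start>[\d,]+)(?:-(?P<end>[\d,]+))?)?$')


def region_to_fetch_args(region):
    """Convert 'chr1:1000-2000' (1-based, inclusive) to (contig, start, end), 0-based half-open."""
    match = _REGION.match(region)
    if match is None:
        raise ValueError(f'Unrecognized region: {region!r}')
    start = match['start']
    end = match['end']
    # Subtract one from the start only; the inclusive end equals the half-open end.
    start = int(start.replace(',', '')) - 1 if start else None
    end = int(end.replace(',', '')) if end else None
    return match['contig'], start, end


print(region_to_fetch_args('chr1:1000000-2000000'))  # ('chr1', 999999, 2000000)
print(region_to_fetch_args('chr2'))                  # ('chr2', None, None)
```

The resulting tuple can be unpacked into the call, for example bam.fetch(*region_to_fetch_args('chr1:1000000-2000000')).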
Check if Indexed
import pysam
from pathlib import Path

def is_indexed(bam_path):
    # Check the standard (.bam.bai), CSI (.bam.csi), and alternative (.bai) locations
    bam_path = Path(bam_path)
    candidates = [
        Path(str(bam_path) + '.bai'),
        Path(str(bam_path) + '.csi'),
        bam_path.with_suffix('.bai'),
    ]
    return any(path.exists() for path in candidates)

if not is_indexed('input.bam'):
    pysam.index('input.bam')
Fetch Multiple Regions
regions = [('chr1', 1000, 2000), ('chr1', 5000, 6000), ('chr2', 1000, 2000)]

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for chrom, start, end in regions:
        count = sum(1 for _ in bam.fetch(chrom, start, end))
        print(f'{chrom}:{start}-{end}: {count} reads')
Count Reads in Region
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    count = bam.count('chr1', 1000000, 2000000)
    print(f'Reads in region: {count}')
Get Reads Covering Position
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for read in bam.fetch('chr1', 1000000, 1000001):
        if read.reference_start <= 1000000 < read.reference_end:
            print(f'{read.query_name} covers position 1000000')
Index File Locations
samtools looks for indices in two locations:
input.bam.bai  # Standard location
input.bai      # Alternative location
For CRAM:
input.cram.crai
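An index that exists but is older than its data file is a common source of confusing results, and htslib-based tools typically warn about it. The sketch below locates an index in the locations listed above (plus .csi) and compares modification times. Both helper names are hypothetical, and the empty files stand in for real data.

```python
import os
import tempfile
from pathlib import Path


def find_index(alignment_path):
    """Return the first existing index path for an alignment file, or None."""
    path = Path(alignment_path)
    candidates = [Path(str(path) + ext) for ext in ('.bai', '.csi', '.crai')]
    candidates.append(path.with_suffix('.bai'))  # alternative location, e.g. input.bai
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None


def index_is_stale(alignment_path, index_path):
    """True if the index was last modified before the alignment file."""
    return os.path.getmtime(index_path) < os.path.getmtime(alignment_path)


# Demonstration with empty placeholder files and explicit timestamps.
with tempfile.TemporaryDirectory() as tmp:
    bam = Path(tmp) / 'input.bam'
    bai = Path(tmp) / 'input.bam.bai'
    bam.write_bytes(b'')
    bai.write_bytes(b'')
    os.utime(bai, (1000, 1000))  # index written first
    os.utime(bam, (2000, 2000))  # data file modified later
    found = find_index(bam)
    found_name = found.name
    stale = index_is_stale(bam, found)
    print(found_name, stale)  # input.bam.bai True
```

When the check reports a stale index, re-running samtools index on the file is the usual remedy.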
idxstats - Index Statistics
Get Per-Chromosome Counts
samtools idxstats input.bam
Output format:
chr1 248956422 5000000 0
chr2 242193529 4500000 0
* 0 0 10000
Columns: reference name, length, mapped reads, unmapped reads
Sum Total Mapped Reads
samtools idxstats input.bam | awk '{sum += $3} END {print sum}'
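The same totals can be computed in Python from captured idxstats text. This is a minimal sketch with a hypothetical function name; the sample string reproduces the example output above with tab separators, as idxstats emits them.

```python
def parse_idxstats(text):
    """Parse idxstats output into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split('\t')
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


sample = (
    'chr1\t248956422\t5000000\t0\n'
    'chr2\t242193529\t4500000\t0\n'
    '*\t0\t0\t10000\n'
)
rows = parse_idxstats(sample)
total_mapped = sum(row[2] for row in rows)
total_unmapped = sum(row[3] for row in rows)
print(total_mapped, total_unmapped)  # 9500000 10000
```

The final row, named *, holds reads that have no reference assigned, which is why it contributes only to the unmapped total.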
pysam idxstats
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for stat in bam.get_index_statistics():
        print(f'{stat.contig}: {stat.mapped} mapped, {stat.unmapped} unmapped')
FASTA Index (faidx)
Related but different - index reference FASTA for random access:
samtools faidx reference.fa  # Creates reference.fa.fai

# Fetch region from indexed FASTA
samtools faidx reference.fa chr1:1000-2000
pysam FastaFile
with pysam.FastaFile('reference.fa') as ref:
    seq = ref.fetch('chr1', 1000, 2000)
    print(seq)
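To illustrate why a .fai file enables random access, the sketch below uses one index record (name, length, byte offset, bases per line, bytes per line) to seek straight to a region in an uncompressed FASTA. It is a simplified illustration rather than how samtools or pysam implement it; the function name and the tiny FASTA are made up.

```python
import tempfile
from pathlib import Path


def fetch_with_fai(fasta_path, fai_line, start, end):
    """Read bases [start, end) (0-based, half-open) using a single .fai record."""
    fields = fai_line.rstrip('\n').split('\t')
    length, offset, line_bases, line_width = (int(value) for value in fields[1:5])
    end = min(end, length)
    if start >= end:
        return ''

    def byte_pos(position):
        # Full lines before the position, plus the column within its line.
        return offset + (position // line_bases) * line_width + position % line_bases

    with open(fasta_path, 'rb') as handle:
        handle.seek(byte_pos(start))
        raw = handle.read(byte_pos(end) - byte_pos(start))
    # Drop the line breaks that fall inside the requested span.
    return raw.replace(b'\n', b'').replace(b'\r', b'').decode('ascii')


with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / 'reference.fa'
    fasta.write_bytes(b'>chr1\nACGTACGT\nTTGGCCAA\nAC\n')
    # 18 bases, sequence starts at byte 6, 8 bases per line, 9 bytes per line.
    fai_record = 'chr1\t18\t6\t8\t9'
    seq = fetch_with_fai(fasta, fai_record, 6, 10)
    print(seq)  # GTTT
```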
Quick Reference
| Task | samtools | pysam |
|---|---|---|
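| Create BAI | samtools index input.bam | pysam.index('input.bam') |
| Create CSI | samtools index -c input.bam | pysam.index('-c', 'input.bam') |
| Fetch region | samtools view input.bam chr1:1000-2000 | bam.fetch('chr1', 999, 2000) |
| Count in region | samtools view -c input.bam chr1:1000-2000 | bam.count('chr1', 999, 2000) |
| Index stats | samtools idxstats input.bam | bam.get_index_statistics() |
| Index FASTA | samtools faidx reference.fa | Automatic with FastaFile |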
Common Errors
| Error | Cause | Solution |
|---|---|---|
| | Missing index | Run samtools index input.bam |
| | Unsorted BAM | Sort first with samtools sort |
| | Wrong chromosome name | Check names with samtools view -H input.bam |
Related Skills
- sam-bam-basics - View and convert alignment files
- alignment-sorting - Sort BAM files (required before indexing)
- alignment-filtering - Filter by regions using index
- bam-statistics - Use idxstats for quick counts
- sequence-io/read-sequences - Index FASTA with SeqIO.index_db()