BioSkills bio-alignment-sorting

Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/alignment-files/alignment-sorting" ~/.claude/skills/gptomics-bioskills-bio-alignment-sorting && rm -rf "$T"
manifest: alignment-files/alignment-sorting/SKILL.md

Version Compatibility

Reference examples tested with: pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

  • Python: run pip show <package>, then help(module.function) to check signatures
  • CLI: run <tool> --version, then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
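The version check described above can be sketched in Python. This is a minimal sketch, not part of the upstream skill: parse_version and samtools_meets_minimum are hypothetical helper names, and the code assumes that the first line of samtools --version looks like "samtools 1.19.2".

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Return the first dotted version number found in `text` as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def samtools_meets_minimum(minimum=(1, 19)):
    """Return True if the samtools on PATH reports a version >= `minimum`.

    Assumption: the first line of `samtools --version` carries the version.
    """
    if shutil.which("samtools") is None:
        return False
    result = subprocess.run(
        ["samtools", "--version"], capture_output=True, text=True, check=True
    )
    return parse_version(result.stdout.splitlines()[0]) >= minimum
```

The same parse_version helper can be applied to pysam.__version__ to compare against (0, 22).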

Alignment Sorting

Sort alignment files by coordinate or read name using samtools and pysam.

"Sort a BAM file" → Reorder reads by genomic coordinate (for indexing/variant calling) or by name (for paired-end processing).

  • CLI: samtools sort -o sorted.bam input.bam
  • Python: pysam.sort('-o', 'sorted.bam', 'input.bam')

Sort Orders

| Order | Flag | Use Case |
|---|---|---|
| Coordinate | default | Indexing, visualization, variant calling |
| Name | -n | Paired-end processing, fixmate (run before markdup) |
| Tag | -t TAG | Sort by specific tag value |
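The three sort orders map onto a small set of flags, which can be sketched as an argument builder. This is an illustrative sketch only: sort_args is a hypothetical helper, and it uses only the flags listed in this document (-n, -t, -@, -o).

```python
def sort_args(input_path, output_path, order="coordinate", tag=None, threads=0):
    """Build the argument list for `samtools sort` or `pysam.sort`.

    order: "coordinate" (default, no flag), "name" (-n), or "tag" (-t TAG).
    """
    args = []
    if order == "name":
        args.append("-n")
    elif order == "tag":
        if not tag:
            raise ValueError("order='tag' needs a tag such as 'CB'")
        args += ["-t", tag]
    elif order != "coordinate":
        raise ValueError(f"unknown sort order: {order!r}")
    if threads:
        args += ["-@", str(threads)]
    return args + ["-o", output_path, input_path]
```

The result can be passed on as pysam.sort(*sort_args('input.bam', 'namesorted.bam', order='name')) or prefixed with ['samtools', 'sort'] for subprocess.run.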

samtools sort

Sort by Coordinate (Default)

samtools sort -o sorted.bam input.bam

Sort by Read Name

samtools sort -n -o namesorted.bam input.bam

Multi-threaded Sorting

samtools sort -@ 8 -o sorted.bam input.bam

Control Memory Usage

samtools sort -m 4G -@ 4 -o sorted.bam input.bam

Set Temporary Directory

samtools sort -T /tmp/sort_tmp -o sorted.bam input.bam

Specify Output Format

# Output as BAM (default)
samtools sort -O bam -o sorted.bam input.bam

# Output as CRAM
samtools sort -O cram --reference ref.fa -o sorted.cram input.bam

Sort by Tag

# Sort by cell barcode (10x Genomics)
samtools sort -t CB -o sorted_by_barcode.bam input.bam

Pipe from Aligner

bwa mem ref.fa reads.fq | samtools sort -o aligned.bam

samtools collate

Group paired reads together without full sorting (faster than name sort for some workflows):

# Collate paired reads
samtools collate -o collated.bam input.bam

# Write to stdout (-O); /tmp/collate is the prefix for temporary files
samtools collate -O input.bam /tmp/collate > collated.bam

# Uncompressed output (-u) streamed straight into samtools fastq
samtools collate -u -O input.bam /tmp/collate | samtools fastq -1 R1.fq -2 R2.fq -

Check Sort Order

From Header

samtools view -H input.bam | grep "^@HD"
# SO:coordinate = coordinate sorted
# SO:queryname = name sorted
# SO:unsorted = not sorted

Verify Sorted

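# Check coordinate order within each reference (exits 0 if sorted).
# Positions restart on every new reference, so reset prev when $3 changes.
samtools view input.bam | awk '$3 != chr {chr = $3; prev = 0} $4 < prev {exit 1} {prev = $4}'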
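The same check can be sketched in Python over (reference_id, position) pairs. This is a minimal sketch under stated assumptions: is_coordinate_sorted is a hypothetical helper, references are expected in ascending reference_id order (the order of the @SQ lines), and unmapped reads (reference_id of -1) are expected last.

```python
def is_coordinate_sorted(records):
    """Check (reference_id, position) pairs for coordinate order.

    `records` yields tuples in file order. A negative reference_id marks an
    unmapped read, which is expected after all mapped reads.
    """
    prev_key = None
    for ref_id, pos in records:
        # Unmapped reads sort last; otherwise order by reference, then position.
        key = (ref_id < 0, ref_id, pos)
        if prev_key is not None and key < prev_key:
            return False
        prev_key = key
    return True
```

With pysam, the pairs could be produced as ((r.reference_id, r.reference_start) for r in bam) while iterating an open AlignmentFile.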

pysam Python Alternative

Sort with pysam

import pysam

pysam.sort('-o', 'sorted.bam', 'input.bam')

Sort by Name

pysam.sort('-n', '-o', 'namesorted.bam', 'input.bam')

Sort with Options

pysam.sort('-@', '4', '-m', '2G', '-o', 'sorted.bam', 'input.bam')

Manual Sorting in Python

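import pysam

with pysam.AlignmentFile('input.bam', 'rb') as infile:
    header = infile.header.to_dict()
    reads = list(infile)  # holds every read in memory; suitable for small files only

# Record the new order so downstream tools see SO:coordinate
header.setdefault('HD', {'VN': '1.6'})['SO'] = 'coordinate'

# Unmapped reads (reference_id == -1) go last, where samtools sort places them
reads.sort(key=lambda r: (r.reference_id < 0, r.reference_id, r.reference_start))

with pysam.AlignmentFile('sorted.bam', 'wb', header=header) as outfile:
    for read in reads:
        outfile.write(read)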

Check Sort Order in pysam

import pysam

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    hd = bam.header.get('HD', {})
    sort_order = hd.get('SO', 'unknown')
    print(f'Sort order: {sort_order}')

Stream Sort from Aligner

For streaming from aligners, use shell pipes (simpler and more reliable):

import subprocess

subprocess.run(
    'bwa mem ref.fa reads.fq | samtools sort -o aligned.bam',
    shell=True, check=True
)

Or use pysam with a named pipe:

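import os
import pysam
import subprocess

os.mkfifo('aligner.pipe')
try:
    # Let the child shell open the FIFO for writing. Opening it in this
    # process would block until a reader attaches, so pysam.sort would never start.
    aligner = subprocess.Popen('bwa mem ref.fa reads.fq > aligner.pipe', shell=True)
    pysam.sort('-o', 'aligned.bam', 'aligner.pipe')
    if aligner.wait() != 0:
        raise RuntimeError('bwa mem exited with an error')
finally:
    os.unlink('aligner.pipe')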
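A third option, sketched here as an alternative to the two approaches above, connects the two processes with an anonymous pipe and no shell. run_piped is a hypothetical helper name, not part of pysam or samtools.

```python
import subprocess


def run_piped(producer, consumer):
    """Run `producer | consumer` without a shell; return (producer_rc, consumer_rc).

    Both arguments are argv lists, e.g. ['bwa', 'mem', 'ref.fa', 'reads.fq'].
    """
    with subprocess.Popen(producer, stdout=subprocess.PIPE) as prod:
        with subprocess.Popen(consumer, stdin=prod.stdout) as cons:
            # Drop our copy of the read end so the producer sees a broken
            # pipe if the consumer exits early.
            prod.stdout.close()
            consumer_rc = cons.wait()
        producer_rc = prod.wait()
    return producer_rc, consumer_rc
```

A call such as run_piped(['bwa', 'mem', 'ref.fa', 'reads.fq'], ['samtools', 'sort', '-o', 'aligned.bam', '-']) would mirror the shell pipeline, with '-' telling samtools sort to read from stdin. Checking both return codes catches an aligner failure that a plain pipeline would hide.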

samtools merge

Combine multiple BAM files into one.

Basic Merge

samtools merge merged.bam sample1.bam sample2.bam sample3.bam

Merge with Threads

samtools merge -@ 4 merged.bam sample1.bam sample2.bam sample3.bam

Merge from File List

# files.txt contains one BAM path per line
samtools merge -b files.txt merged.bam
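The file list can also be generated from Python. This is a hedged sketch: write_merge_list and merge_command are hypothetical helper names, and the command uses only the merge flags shown in this document (-b, -@, -f).

```python
from pathlib import Path


def write_merge_list(bam_paths, list_path="files.txt"):
    """Write one BAM path per line, the format `samtools merge -b` reads."""
    Path(list_path).write_text("".join(f"{path}\n" for path in bam_paths))
    return list_path


def merge_command(list_path, output_path, threads=0, force=False):
    """Build the argv list for `samtools merge -b LIST OUTPUT`."""
    cmd = ["samtools", "merge"]
    if force:
        cmd.append("-f")  # overwrite an existing output file
    if threads:
        cmd += ["-@", str(threads)]
    return cmd + ["-b", list_path, output_path]
```

The returned list can be handed to subprocess.run(cmd, check=True) once the list file has been written.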

Force Overwrite

samtools merge -f merged.bam sample1.bam sample2.bam

Merge Specific Region

samtools merge -R chr1:1000000-2000000 merged_region.bam sample1.bam sample2.bam

pysam Merge

import pysam

pysam.merge('-f', 'merged.bam', 'sample1.bam', 'sample2.bam', 'sample3.bam')

Common Workflows

Goal: Combine sorting with other alignment processing steps into efficient pipelines.

Approach: Pipe aligner output directly into samtools sort to avoid writing unsorted intermediates, then index for downstream access.

Align and Sort

bwa mem -t 8 ref.fa R1.fq R2.fq | samtools sort -@ 4 -o aligned.bam
samtools index aligned.bam

Re-sort by Name for Duplicate Marking

# Full workflow: sort by name, fixmate, sort by coord, markdup
samtools sort -n -o namesorted.bam input.bam
samtools fixmate -m namesorted.bam fixmate.bam
samtools sort -o sorted.bam fixmate.bam
samtools markdup sorted.bam marked.bam
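The four steps above can be driven from Python as a list of commands. This is a sketch under assumptions: markdup_commands and run_all are hypothetical helpers, and the intermediate file names simply reuse the ones from the shell example.

```python
import os
import subprocess


def markdup_commands(input_bam, output_bam, workdir="."):
    """Return the name-sort, fixmate, coordinate-sort, markdup commands in order."""
    name_sorted = os.path.join(workdir, "namesorted.bam")
    fixmated = os.path.join(workdir, "fixmate.bam")
    coord_sorted = os.path.join(workdir, "sorted.bam")
    return [
        ["samtools", "sort", "-n", "-o", name_sorted, input_bam],
        ["samtools", "fixmate", "-m", name_sorted, fixmated],
        ["samtools", "sort", "-o", coord_sorted, fixmated],
        ["samtools", "markdup", coord_sorted, output_bam],
    ]


def run_all(commands):
    """Run each command in turn, stopping at the first non-zero exit."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling run_all(markdup_commands('input.bam', 'marked.bam')) would execute the workflow. Keeping the command list separate from execution makes it easy to log or inspect the steps first.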

Convert Name-sorted to Coordinate-sorted

samtools sort -o coord_sorted.bam name_sorted.bam
samtools index coord_sorted.bam

Extract FASTQ from Sorted BAM

# Collate first to group pairs
samtools collate -u -O input.bam /tmp/collate | \
    samtools fastq -1 R1.fq -2 R2.fq -0 /dev/null -s /dev/null -

Performance Tips

| Parameter | Effect |
|---|---|
| -@ N | Use N additional threads |
| -m SIZE | Memory per thread (e.g., 4G) |
| -T PREFIX | Temp file location (use fast disk) |
| -l LEVEL | Compression level (0-9; 0 = uncompressed, 1 = fastest, 9 = smallest; default is typically 6) |

Optimal Settings for Large Files

# Use 8 threads, 4GB per thread, low compression for speed
samtools sort -@ 8 -m 4G -l 1 -o sorted.bam input.bam
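Because -m applies per thread, the settings above ask for roughly 8 × 4 GB = 32 GB in total. The arithmetic can be sketched as a helper that derives -m from a memory budget. per_thread_memory is a hypothetical name, and the estimate assumes total usage is approximately threads × -m, which samtools treats as an approximate rather than a hard limit.

```python
def per_thread_memory(total_gb, threads):
    """Return a -m value (in megabytes) so that threads * -m stays within total_gb.

    Assumption: samtools sort uses about max(1, threads) * -m of RAM in total.
    """
    workers = max(1, threads)
    megabytes = int(total_gb * 1024 // workers)
    if megabytes < 1:
        raise ValueError("memory budget too small for the requested threads")
    return f"{megabytes}M"
```

For a 32 GB budget and -@ 8 this yields 4096M, matching the 4G used above. Leaving some headroom below the machine's physical RAM is a sensible precaution.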

Quick Reference

| Task | Command |
|---|---|
| Sort by coordinate | samtools sort -o out.bam in.bam |
| Sort by name | samtools sort -n -o out.bam in.bam |
| Sort with threads | samtools sort -@ 8 -o out.bam in.bam |
| Collate pairs | samtools collate -o out.bam in.bam |
| Merge BAMs | samtools merge out.bam in1.bam in2.bam |
| Check sort order | samtools view -H in.bam \| grep "^@HD" |
| Sort + index | samtools sort -o out.bam in.bam && samtools index out.bam |

Common Errors

| Error | Cause | Solution |
|---|---|---|
| out of memory | Insufficient RAM | Use -m to limit per-thread memory |
| disk full | Temp files filling disk | Use -T to specify different location |
| truncated file | Interrupted sort | Re-run sort from original |

Related Skills

  • sam-bam-basics - View and convert alignment files
  • alignment-indexing - Index after coordinate sorting
  • duplicate-handling - Requires name-sorted input for fixmate
  • alignment-filtering - Filter before or after sorting