BioSkills bio-alignment-io
Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/alignment/alignment-io" ~/.claude/skills/gptomics-bioskills-bio-alignment-io && rm -rf "$T"
alignment/alignment-io/SKILL.md
Version Compatibility
Reference examples tested with: Biopython 1.83+
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
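The version and signature check above can also be scripted. The following is a minimal sketch using only the standard library; the helper names installed_version and describe_callable are hypothetical, and it assumes the package is installed under the distribution name biopython.

```python
import importlib
import inspect
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def describe_callable(module_name, attr_name):
    """Return the signature of module_name.attr_name as text, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
        return str(inspect.signature(getattr(module, attr_name)))
    except (ImportError, AttributeError, TypeError, ValueError):
        return None


# Both calls print None when Biopython is not installed (assumed name: 'biopython')
print("biopython version:", installed_version("biopython"))
print("AlignIO.read signature:", describe_callable("Bio.AlignIO", "read"))
```

Comparing the printed signature with the call in an example shows whether an argument was renamed or removed before any code is adapted.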
Alignment File I/O
Read, write, and convert multiple sequence alignment files in various formats.
Required Import
Goal: Load modules for reading, writing, and manipulating multiple sequence alignments.
Approach: Import AlignIO for file I/O and supporting classes for programmatic alignment construction.
    from Bio import AlignIO
    from Bio.Align import MultipleSeqAlignment
    from Bio.SeqRecord import SeqRecord
    from Bio.Seq import Seq
Supported Formats
| Format | Extension | Read | Write | Description |
|---|---|---|---|---|
| clustal | .aln | Yes | Yes | Clustal W/X output |
| fasta | .fasta, .fa | Yes | Yes | Aligned FASTA |
| phylip | .phy | Yes | Yes | Interleaved PHYLIP |
| phylip-sequential | .phy | Yes | Yes | Sequential PHYLIP |
| phylip-relaxed | .phy | Yes | Yes | PHYLIP with long names |
| stockholm | .sto, .stk | Yes | Yes | Pfam/Rfam annotated |
| nexus | .nex | Yes | Yes | NEXUS format |
| emboss | .txt | Yes | No | EMBOSS tools output |
| fasta-m10 | .txt | Yes | No | FASTA -m 10 output |
| maf | .maf | Yes | Yes | Multiple Alignment Format |
| mauve | .xmfa | Yes | No | progressiveMauve output |
| msf | .msf | Yes | No | GCG MSF format |
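Because AlignIO never infers the format from the file name, a small lookup keyed on the extensions listed in the table above can be convenient. This is a sketch only: guess_format and EXTENSION_TO_FORMAT are hypothetical names, the choice of phylip-relaxed for .phy is an assumption, and .txt is left out because two formats share it.

```python
from pathlib import Path

# Assumed mapping from file extension to AlignIO format string.
# '.phy' is ambiguous (three PHYLIP variants); relaxed is an assumed default.
EXTENSION_TO_FORMAT = {
    ".aln": "clustal",
    ".fasta": "fasta",
    ".fa": "fasta",
    ".phy": "phylip-relaxed",
    ".sto": "stockholm",
    ".stk": "stockholm",
    ".nex": "nexus",
    ".maf": "maf",
    ".xmfa": "mauve",
    ".msf": "msf",
}


def guess_format(path):
    """Return the assumed format string for a path, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TO_FORMAT:
        raise ValueError(f"Cannot infer alignment format from extension {suffix!r}")
    return EXTENSION_TO_FORMAT[suffix]


print(guess_format("family.STO"))
```

Raising on unknown extensions, rather than defaulting to FASTA, keeps a wrong guess from surfacing later as a confusing parse error.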
Reading Alignments
"Read an alignment file" → Parse an alignment file into an alignment object with sequences and metadata accessible.
Goal: Load alignment data from files in various formats (Clustal, PHYLIP, Stockholm, FASTA).
Approach: Use AlignIO.read() for single-alignment files or AlignIO.parse() for files containing multiple alignments.
Single Alignment File

    from Bio import AlignIO

    alignment = AlignIO.read('alignment.aln', 'clustal')
    print(f'Alignment length: {alignment.get_alignment_length()}')
    print(f'Number of sequences: {len(alignment)}')

Multiple Alignments in One File

    for alignment in AlignIO.parse('multi_alignment.sto', 'stockholm'):
        print(f'Alignment with {len(alignment)} sequences, length {alignment.get_alignment_length()}')

Read as List

    alignments = list(AlignIO.parse('alignments.phy', 'phylip'))
    print(f'Read {len(alignments)} alignments')
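The difference between read() and parse() is easiest to see on a file that holds more than one alignment. The sketch below uses an in-memory Stockholm text with two made-up alignments (the sequences are illustrative, not from any real family) and assumes Biopython is installed.

```python
from io import StringIO

from Bio import AlignIO

# Two small, made-up Stockholm alignments concatenated in one text
STOCKHOLM_TEXT = """# STOCKHOLM 1.0
seq1 ACGT-ACGT
seq2 ACGTTACGT
//
# STOCKHOLM 1.0
seq3 GG-CC
seq4 GGACC
seq5 GGTCC
//
"""

# parse() yields every alignment in the handle
alignments = list(AlignIO.parse(StringIO(STOCKHOLM_TEXT), "stockholm"))
shapes = [(len(a), a.get_alignment_length()) for a in alignments]
print(shapes)

# read() expects exactly one alignment and raises ValueError otherwise
try:
    AlignIO.read(StringIO(STOCKHOLM_TEXT), "stockholm")
    read_failed = False
except ValueError as error:
    read_failed = True
    print(f"read() refused the input: {error}")
```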
Writing Alignments
Goal: Save alignment data to files in standard formats for downstream tools or archival.
Approach: Use AlignIO.write() with the target format specifier, supporting single or multiple alignments and file handles.
Write Single Alignment

    AlignIO.write(alignment, 'output.fasta', 'fasta')

Write Multiple Alignments

    alignments = [alignment1, alignment2, alignment3]
    count = AlignIO.write(alignments, 'output.sto', 'stockholm')
    print(f'Wrote {count} alignments')

Write to Handle

    with open('output.aln', 'w') as handle:
        AlignIO.write(alignment, handle, 'clustal')
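Any text handle works as a write target, which makes it possible to check output without touching disk. The following sketch writes a small, made-up alignment to an in-memory buffer and reads it back; it assumes Biopython is installed, and the empty description is set only to keep the FASTA headers short.

```python
from io import StringIO

from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up two-row alignment for illustration
alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACTGACTG"), id="seq1", description=""),
    SeqRecord(Seq("ACTG-CTG"), id="seq2", description=""),
])

buffer = StringIO()
count = AlignIO.write(alignment, buffer, "fasta")  # number of alignments written

# Read the text back to confirm that rows and gaps survived the round trip
round_trip = AlignIO.read(StringIO(buffer.getvalue()), "fasta")
sequences = [str(record.seq) for record in round_trip]
print(count, sequences)
```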
Format Conversion
"Convert alignment format" → Transform an alignment file from one format to another (e.g., Clustal to PHYLIP).
Goal: Convert alignment files between formats for compatibility with different analysis tools.
Approach: Use AlignIO.convert() for direct one-step conversion, or read-modify-write for cases requiring intermediate manipulation.
Direct Conversion (Most Efficient)

    AlignIO.convert('input.aln', 'clustal', 'output.phy', 'phylip')

With Molecule Type Specification

    AlignIO.convert('input.sto', 'stockholm', 'output.nex', 'nexus', molecule_type='DNA')

Manual Conversion (When Modification Needed)

    alignment = AlignIO.read('input.aln', 'clustal')
    # ... modify alignment ...
    AlignIO.write(alignment, 'output.fasta', 'fasta')
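NEXUS output needs to know whether the data are DNA, RNA, or protein, which is why the molecule_type argument matters for that target. A minimal sketch, assuming Biopython is installed and using made-up sequences with in-memory handles in place of files:

```python
from io import StringIO

from Bio import AlignIO

# Made-up single-alignment Stockholm input
STOCKHOLM_TEXT = """# STOCKHOLM 1.0
seq1 ACGT-ACGT
seq2 ACGTTACGT
//
"""

source = StringIO(STOCKHOLM_TEXT)
target = StringIO()

# convert() returns the number of alignments it wrote
count = AlignIO.convert(source, "stockholm", target, "nexus", molecule_type="DNA")
nexus_text = target.getvalue()
print(nexus_text)
```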
Accessing Alignment Data
Goal: Navigate and extract data from alignment objects including sequences, columns, and slices.
Approach: Use iteration, indexing, and column slicing on the alignment object.
    alignment = AlignIO.read('alignment.aln', 'clustal')

    # Iterate over sequences
    for record in alignment:
        print(f'{record.id}: {record.seq}')

    # Access by index
    first_seq = alignment[0]
    last_seq = alignment[-1]

    # Slice columns
    column_slice = alignment[:, 10:20]  # Columns 10-19

    # Get specific column
    column = alignment[:, 5]  # Column 5 as string
Working with Alignment Objects
Get Alignment Properties
    alignment = AlignIO.read('alignment.aln', 'clustal')
    length = alignment.get_alignment_length()
    num_seqs = len(alignment)
    seq_ids = [record.id for record in alignment]

Slice Alignments

    # Get subset of sequences
    subset = alignment[0:5]  # First 5 sequences

    # Get subset of columns
    trimmed = alignment[:, 50:150]  # Columns 50-149

    # Combine slicing
    region = alignment[0:5, 50:150]  # 5 sequences, columns 50-149
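The slicing rules are easier to verify on a tiny alignment whose contents are known. This sketch builds a made-up three-row alignment in memory (assuming Biopython is installed) and shows that a single column index returns a string while a range returns a new alignment; the gap-free column scan is an illustrative extra, not part of the API.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up alignment: seq2 has a gap in column 7, seq3 in column 4
alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACTGACTGACTG"), id="seq1"),
    SeqRecord(Seq("ACTGACT-ACTG"), id="seq2"),
    SeqRecord(Seq("ACTG-CTGACTG"), id="seq3"),
])

column = alignment[:, 4]       # one column, returned as a plain string
block = alignment[0:2, 4:8]    # rows 0-1, columns 4-7, returned as an alignment
block_rows = [str(record.seq) for record in block]

# Indices of columns in which no row has a gap character
gap_free = [
    index
    for index in range(alignment.get_alignment_length())
    if "-" not in alignment[:, index]
]
print(column, block_rows, gap_free)
```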
Creating Alignments Programmatically
Goal: Build an alignment object from sequences defined in code rather than read from a file.
Approach: Construct SeqRecord objects with gap characters and wrap them in a MultipleSeqAlignment.
    from Bio.Align import MultipleSeqAlignment
    from Bio.SeqRecord import SeqRecord
    from Bio.Seq import Seq

    records = [
        SeqRecord(Seq('ACTGACTGACTG'), id='seq1'),
        SeqRecord(Seq('ACTGACT-ACTG'), id='seq2'),
        SeqRecord(Seq('ACTG-CTGACTG'), id='seq3'),
    ]
    alignment = MultipleSeqAlignment(records)
    AlignIO.write(alignment, 'new_alignment.fasta', 'fasta')
Format Selection for Downstream Tools
Choosing the output format depends on which downstream tool consumes the alignment:
| Downstream Tool | Required Format | Biopython Format String |
|---|---|---|
| RAxML-NG, IQ-TREE | PHYLIP (relaxed) | phylip-relaxed |
| MrBayes | NEXUS | nexus |
| PAUP* | NEXUS or PHYLIP | nexus or phylip |
| HMMER, Infernal | Stockholm | stockholm |
| Pfam/Rfam databases | Stockholm | stockholm |
| PAML/codeml | PHYLIP (sequential) | phylip-sequential |
| Most tools | FASTA | fasta |
Annotation Preservation
Not all formats support annotations. Converting between formats can silently discard metadata:
| Format | Sequence Annotations | Column Annotations | Secondary Structure |
|---|---|---|---|
| Stockholm | Yes (GS/GR lines) | Yes (GC lines) | Yes (SS_cons) |
| NEXUS | Partial (SETS block) | Via CHARSET | No |
| Clustal | No (conservation marks not parsed) | No | No |
| PHYLIP | No | No | No |
| FASTA | No | No | No |
Converting Stockholm to FASTA or PHYLIP discards all annotations, secondary structure markup, and per-residue quality scores. If annotations matter, keep a Stockholm master copy.
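The loss is straightforward to demonstrate. The sketch below, assuming Biopython is installed, reads a made-up Stockholm alignment carrying per-residue secondary structure (#=GR ... SS lines), writes it as FASTA, and checks which records still hold the annotation afterwards.

```python
from io import StringIO

from Bio import AlignIO

# Made-up Stockholm text with per-residue secondary structure annotation
STOCKHOLM_TEXT = """# STOCKHOLM 1.0
seq1 GGGAACCC
#=GR seq1 SS <<<..>>>
seq2 GGCAAGCC
#=GR seq2 SS <<<..>>>
//
"""

original = AlignIO.read(StringIO(STOCKHOLM_TEXT), "stockholm")
before = ["secondary_structure" in r.letter_annotations for r in original]

# Convert through FASTA, which has nowhere to store the annotation
fasta_buffer = StringIO()
AlignIO.write(original, fasta_buffer, "fasta")
converted = AlignIO.read(StringIO(fasta_buffer.getvalue()), "fasta")
after = ["secondary_structure" in r.letter_annotations for r in converted]

print("annotated before conversion:", before)
print("annotated after conversion: ", after)
```

No warning is raised at any point, so a check like this is the only signal that metadata was dropped.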
Format-Specific Notes
PHYLIP Format Pitfalls
PHYLIP has two incompatible variants (interleaved vs sequential) and two name-length modes (strict vs relaxed). Confusing these causes silent data corruption.
Strict PHYLIP truncates sequence names to exactly 10 characters. This can silently merge distinct sequences whose names share a 10-character prefix (e.g., Homo_sapiens_chr1 and Homo_sapiens_chr2 both become Homo_sapie).

    # Strict PHYLIP (10-char names, interleaved) -- only for tools requiring it
    alignment = AlignIO.read('file.phy', 'phylip')

    # Sequential PHYLIP (10-char names, one sequence at a time) -- PAML/codeml
    alignment = AlignIO.read('file.phy', 'phylip-sequential')

    # Relaxed PHYLIP (no name limit) -- RAxML-NG, IQ-TREE (recommended default)
    alignment = AlignIO.read('file.phy', 'phylip-relaxed')

    # Always prefer phylip-relaxed for writing unless the downstream tool
    # specifically requires strict format
    AlignIO.write(alignment, 'output.phy', 'phylip-relaxed')
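When strict PHYLIP is unavoidable, name collisions can be detected before writing. The following is a standard-library sketch; find_truncation_collisions is a hypothetical helper, and it assumes names are truncated to their first 10 characters as described above.

```python
from collections import defaultdict

STRICT_NAME_WIDTH = 10  # name width used by strict PHYLIP


def find_truncation_collisions(names, width=STRICT_NAME_WIDTH):
    """Map each truncated name shared by several inputs to the original names."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


names = ["Homo_sapiens_chr1", "Homo_sapiens_chr2", "Mus_musculus"]
print(find_truncation_collisions(names))
```

An empty result means the names remain distinct after truncation; otherwise the sequences need renaming, or the relaxed variant should be used.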
Stockholm Format Annotations
Stockholm format (used by Pfam, Rfam, HMMER) supports four annotation line types:
| Line Prefix | Scope | Description | Example |
|---|---|---|---|
| #=GF | File | Alignment-level metadata (ID, accession, description) | |
| #=GC | Column | Per-column annotation (1 char per alignment column) | |
| #=GS | Sequence | Per-sequence free text (organism, description) | |
| #=GR | Residue | Per-residue annotation (1 char per residue) | |
Common GC annotations: SS_cons (consensus secondary structure), RF (reference coordinates), seq_cons (consensus sequence). RNA families in Rfam use <> for base pairs, . for unpaired.

    alignment = AlignIO.read('pfam.sto', 'stockholm')
    for record in alignment:
        print(record.id, record.annotations)
        if 'secondary_structure' in record.letter_annotations:
            print(f'  SS: {record.letter_annotations["secondary_structure"]}')

    # Column annotations are accessible via alignment.column_annotations
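Before converting a Stockholm file, it can help to see how much annotation it actually carries. This standard-library sketch counts markup lines by scope using the four Stockholm prefixes (#=GF, #=GC, #=GS, #=GR); the function name and the sample text are illustrative only.

```python
from collections import Counter

# Stockholm markup prefixes and the scope each one annotates
PREFIX_SCOPE = {
    "#=GF": "file",
    "#=GC": "column",
    "#=GS": "sequence",
    "#=GR": "residue",
}


def count_annotation_lines(text):
    """Count Stockholm annotation lines in text, grouped by scope."""
    counts = Counter()
    for line in text.splitlines():
        scope = PREFIX_SCOPE.get(line[:4])
        if scope is not None:
            counts[scope] += 1
    return counts


# Made-up example text
SAMPLE = """# STOCKHOLM 1.0
#=GF ID example_family
#=GS seq1 OS Example organism
seq1 GGGAACCC
#=GR seq1 SS <<<..>>>
seq2 GGCAAGCC
#=GC SS_cons <<<..>>>
//
"""

print(dict(count_annotation_lines(SAMPLE)))
```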
Clustal Format
    # Clustal preserves conservation symbols in file but not when parsed
    alignment = AlignIO.read('clustal.aln', 'clustal')
Batch Processing Multiple Files
Goal: Convert a directory of alignment files from one format to another in bulk.
Approach: Glob for input files and iterate, reading each alignment and writing to the target format.
    from pathlib import Path

    from Bio import AlignIO

    input_dir = Path('alignments/')
    output_dir = Path('converted/')
    # Create the output directory if it does not exist yet
    output_dir.mkdir(parents=True, exist_ok=True)

    for input_file in input_dir.glob('*.aln'):
        alignment = AlignIO.read(input_file, 'clustal')
        output_file = output_dir / f'{input_file.stem}.fasta'
        AlignIO.write(alignment, output_file, 'fasta')
Alternative: Bio.Align Module I/O
Goal: Use the modern Bio.Align module for alignment I/O with access to newer features like counts and substitutions.
Approach: Use Align.read(), Align.parse(), and Align.write() which return Alignment objects instead of MultipleSeqAlignment.
The newer Bio.Align module provides its own I/O functions that return Alignment objects (instead of MultipleSeqAlignment). These support additional formats and provide access to modern alignment features.

    from Bio import Align

    # Read single alignment (returns Alignment object)
    alignment = Align.read('alignment.aln', 'clustal')

    # Parse multiple alignments
    for alignment in Align.parse('multi.sto', 'stockholm'):
        print(f'Alignment with {len(alignment)} sequences')

    # Write alignment
    Align.write(alignment, 'output.fasta', 'fasta')
When to Use Which
| Use Case | Module |
|---|---|
| Legacy code, MultipleSeqAlignment needed | Bio.AlignIO |
| Modern features (counts, substitutions) | Bio.Align |
| Format conversion | Either works |
| Working with pairwise alignments | Bio.Align |
Quick Reference: Common Operations
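| Task | Code |
|---|---|
| Read single alignment | AlignIO.read(file, format) |
| Read multiple alignments | AlignIO.parse(file, format) |
| Write alignment(s) | AlignIO.write(alignments, file, format) |
| Convert format | AlignIO.convert(in_file, in_format, out_file, out_format) |
| Get length | alignment.get_alignment_length() |
| Get sequence count | len(alignment) |
| Slice columns | alignment[:, start:end] |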
Common Errors
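| Error | Cause | Solution |
|---|---|---|
| ValueError: No records found in handle | Empty file | Check file path and format |
| ValueError: More than one record found in handle | Multiple alignments with read() | Use parse() instead |
| ValueError: Sequences must all be the same length | Invalid alignment | Ensure all sequences same length |
| ValueError: Unknown format '...' | Unsupported format string | Check supported formats list |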
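For the unequal-length case, it helps to know which sequence is at fault before building the alignment. This is a standard-library sketch; find_length_mismatches is a hypothetical helper that treats the most common length as the expected one, which is an assumption that may not hold when most rows are wrong.

```python
from collections import Counter


def find_length_mismatches(sequences):
    """Return {name: length} for rows whose length differs from the most common one.

    sequences maps sequence names to gapped sequence strings.
    """
    if not sequences:
        return {}
    lengths = {name: len(seq) for name, seq in sequences.items()}
    expected, _ = Counter(lengths.values()).most_common(1)[0]
    return {name: length for name, length in lengths.items() if length != expected}


# Made-up rows: seq3 is missing one column
rows = {"seq1": "ACTG-CTG", "seq2": "ACTGACTG", "seq3": "ACTGCTG"}
print(find_length_mismatches(rows))
```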
Related Skills
- multiple-alignment - Run MSA tools (MAFFT, MUSCLE5, ClustalOmega) to generate alignments
- pairwise-alignment - Create pairwise alignments with PairwiseAligner
- msa-parsing - Analyze alignment content and annotations
- msa-statistics - Calculate conservation and identity
- sequence-io/format-conversion - Convert sequence (non-alignment) formats