BioSkills bio-alignment-msa-parsing
Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/alignment/msa-parsing" ~/.claude/skills/gptomics-bioskills-bio-alignment-msa-parsing && rm -rf "$T"
alignment/msa-parsing/SKILL.md
Version Compatibility
Reference examples tested with: Biopython 1.83+
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
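The version check described above can also be done from Python. The following is a minimal sketch, assuming the distribution is installed under the name biopython; the helper names installed_version and meets_minimum are illustrative and not part of Biopython.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def meets_minimum(installed, minimum):
    """Compare the first two numeric components of two version strings."""
    def as_tuple(text):
        return tuple(int(part) for part in re.findall(r'\d+', text)[:2])
    return as_tuple(installed) >= as_tuple(minimum)


found = installed_version('biopython')
if found is None:
    print('Biopython is not installed')
elif not meets_minimum(found, '1.83'):
    print(f'Biopython {found} is older than 1.83; check signatures with help()')
else:
    print(f'Biopython {found} is within the tested range')
```

Only the major and minor components are compared, which is enough for Biopython's two-part version numbers.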
MSA Parsing and Analysis
Parse multiple sequence alignments to extract information, analyze content, and prepare for downstream analysis.
Required Import
Goal: Load modules for parsing, analyzing, and manipulating multiple sequence alignments.
Approach: Import AlignIO for reading, Counter for column analysis, and alignment classes for constructing modified alignments.
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from collections import Counter
Loading Alignments
Goal: Read an MSA file and inspect its dimensions.
Approach: Use AlignIO.read() specifying the file and format.
from Bio import AlignIO

alignment = AlignIO.read('alignment.fasta', 'fasta')
print(f'{len(alignment)} sequences, {alignment.get_alignment_length()} columns')
Extracting Sequence Information
Get All Sequence IDs
seq_ids = [record.id for record in alignment]
Get Sequences as Strings
sequences = [str(record.seq) for record in alignment]
Get Sequence by ID
def get_sequence_by_id(alignment, seq_id):
    for record in alignment:
        if record.id == seq_id:
            return record
    return None

target = get_sequence_by_id(alignment, 'species_A')
Access Descriptions and Annotations
for record in alignment:
    print(f'ID: {record.id}')
    print(f'Description: {record.description}')
    print(f'Annotations: {record.annotations}')
Column-wise Analysis
Goal: Analyze alignment content column by column to assess composition, conservation, and variability.
Approach: Use column indexing (alignment[:, idx]) and Counter to examine character frequencies at each position.
Get Single Column
column_5 = alignment[:, 5]  # String of characters at 0-based index 5 (the sixth column)
print(column_5)  # e.g., 'AAAGA'
Iterate Over Columns
for col_idx in range(alignment.get_alignment_length()):
    column = alignment[:, col_idx]
    print(f'Column {col_idx}: {column}')
Count Characters in Column
from collections import Counter

def column_composition(alignment, col_idx):
    column = alignment[:, col_idx]
    return Counter(column)

counts = column_composition(alignment, 0)
print(counts)  # Counter({'A': 3, 'G': 1, '-': 1})
Find Conserved Positions
def find_conserved_positions(alignment, threshold=1.0):
    conserved = []
    for col_idx in range(alignment.get_alignment_length()):
        column = alignment[:, col_idx]
        counts = Counter(column)
        most_common_char, most_common_count = counts.most_common(1)[0]
        if most_common_char != '-':
            conservation = most_common_count / len(alignment)
            if conservation >= threshold:
                conserved.append((col_idx, most_common_char))
    return conserved

fully_conserved = find_conserved_positions(alignment, threshold=1.0)
mostly_conserved = find_conserved_positions(alignment, threshold=0.8)
Gap Analysis
Goal: Quantify gap distribution across sequences and columns to identify problematic regions or sequences.
Approach: Count gap characters per sequence and per column, then identify positions exceeding a gap fraction threshold.
Count Gaps Per Sequence
gap_counts = [(record.id, str(record.seq).count('-')) for record in alignment]
for seq_id, gaps in gap_counts:
    print(f'{seq_id}: {gaps} gaps')
Count Gaps Per Column
def gaps_per_column(alignment):
    return [alignment[:, i].count('-') for i in range(alignment.get_alignment_length())]

gap_profile = gaps_per_column(alignment)
Find Gappy Columns
def find_gappy_columns(alignment, threshold=0.5):
    gappy = []
    num_seqs = len(alignment)
    for col_idx in range(alignment.get_alignment_length()):
        column = alignment[:, col_idx]
        gap_fraction = column.count('-') / num_seqs
        if gap_fraction >= threshold:
            gappy.append(col_idx)
    return gappy

columns_to_remove = find_gappy_columns(alignment, threshold=0.5)
Remove Gappy Columns
def remove_gappy_columns(alignment, threshold=0.5):
    num_seqs = len(alignment)
    keep_columns = []
    for col_idx in range(alignment.get_alignment_length()):
        column = alignment[:, col_idx]
        gap_fraction = column.count('-') / num_seqs
        if gap_fraction < threshold:
            keep_columns.append(col_idx)
    new_records = []
    for record in alignment:
        seq_str = str(record.seq)  # Convert once per record, not once per column
        new_seq = ''.join(seq_str[i] for i in keep_columns)
        new_records.append(SeqRecord(Seq(new_seq), id=record.id, description=record.description))
    return MultipleSeqAlignment(new_records)

cleaned = remove_gappy_columns(alignment, threshold=0.5)
Alignment Trimming: Decision Framework
Trimming (removing unreliable columns before downstream analysis) is controversial. Research is split on whether it helps or hurts phylogenetic inference. The approach matters more than whether to trim.
Tool Recommendations (Best to Worst)
| Tool | Approach | When to Use |
|---|---|---|
| ClipKIT | Retains parsimony-informative + constant sites | Default choice; mode consistently outperforms others |
| trimAl | Gap + similarity scoring | When ClipKIT unavailable; equal or better than unfiltered in most tests |
| Gblocks (relaxed params) | Block-based removal | Only with relaxed parameters; default settings are too aggressive |
Key insight: ClipKIT inverts the traditional approach. Instead of removing "bad" columns, it identifies and retains informative ones. This paradigm shift consistently produces better trees.
When NOT to trim: Single-gene phylogenetics with well-curated alignments. Aggressive trimming (>20-30% of sites removed) causes rapid tree deterioration.
# ClipKIT (recommended)
clipkit alignment.fasta -m kpic-smart-gap -o trimmed.fasta
# trimAl automated mode
trimal -in alignment.fasta -out trimmed.fasta -automated1
# trimAl with explicit gap threshold
trimal -in alignment.fasta -out trimmed.fasta -gt 0.5
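The 20-30% guideline can be checked once any trimmer has run. The following is a minimal sketch, assuming the two column counts come from alignment.get_alignment_length() on the untrimmed and trimmed files. The function names and the 0.2 default are illustrative and are not part of ClipKIT or trimAl.

```python
def fraction_removed(original_columns, trimmed_columns):
    """Fraction of alignment columns dropped by trimming."""
    if original_columns <= 0:
        raise ValueError('original alignment must have at least one column')
    if not 0 <= trimmed_columns <= original_columns:
        raise ValueError('trimmed column count must lie between 0 and the original')
    return (original_columns - trimmed_columns) / original_columns


def trimming_verdict(original_columns, trimmed_columns, warn_at=0.2):
    """Label a trimming run; warn_at is a hypothetical cutoff taken from the 20-30% range."""
    removed = fraction_removed(original_columns, trimmed_columns)
    return 'aggressive' if removed > warn_at else 'ok'


print(trimming_verdict(1000, 900))  # 10% of columns removed
print(trimming_verdict(1000, 750))  # 25% of columns removed
```

A run flagged as aggressive is a prompt to compare trees built from the trimmed and untrimmed alignments. It does not show that the trimmed alignment is wrong.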
Gap Handling for Phylogenetics
How gaps are treated in downstream phylogenetic analysis significantly affects tree topology:
| Treatment | Method | Tradeoff |
|---|---|---|
| Missing data (default) | Gaps = unknown character | Most common; can be statistically inconsistent under ML |
| Fifth state | Gap = 5th nucleotide | Biologically problematic (gaps of different lengths treated equally) |
| Simple indel coding | Each unique indel coded as binary character | Most biologically realistic; adds phylogenetic signal |
Indel coding and fifth-state outperform missing-data treatment ~90% of the time on empirical datasets. For important analyses, consider indel coding to capture the phylogenetic information in gap patterns.
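To make the indel-coding row concrete, here is a simplified sketch in the spirit of simple indel coding. It works on plain aligned strings, for example [str(r.seq) for r in alignment]. Each distinct gap run, identified by its start and end column, becomes one binary character. As an assumption of this sketch, a sequence whose longer gap fully contains the indel is scored '?' (inapplicable). Terminal gaps are not treated specially here, although real analyses often score them as missing data. The function names are illustrative.

```python
import re


def gap_runs(seq, gap_char='-'):
    """Return half-open (start, end) intervals of consecutive gap characters."""
    return [m.span() for m in re.finditer(re.escape(gap_char) + '+', seq)]


def simple_indel_matrix(sequences, gap_char='-'):
    """Code each distinct gap run as a presence/absence character per sequence."""
    runs_per_seq = [set(gap_runs(seq, gap_char)) for seq in sequences]
    indels = sorted(set().union(*runs_per_seq))
    matrix = []
    for runs in runs_per_seq:
        row = []
        for start, end in indels:
            if (start, end) in runs:
                row.append('1')  # This sequence has exactly this indel
            elif any(s <= start and end <= e for s, e in runs):
                row.append('?')  # A longer gap covers it: inapplicable (assumption)
            else:
                row.append('0')  # Indel absent
        matrix.append(''.join(row))
    return indels, matrix


seqs = ['ACG--TAC', 'ACGTTTAC', 'A----TAC', 'ACG--TAC']
indels, matrix = simple_indel_matrix(seqs)
print(indels)  # [(1, 5), (3, 5)]
print(matrix)  # ['01', '00', '1?', '01']
```

The resulting binary characters would be appended to the sequence matrix as a separate partition. Dedicated indel-coding tools handle edge cases that this sketch ignores.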
Identifying Unreliable Alignment Regions
Columns exhibiting both high gap fraction AND low conservation are the strongest indicators of alignment uncertainty. These often reflect guide tree artifacts rather than true evolutionary events. Before phylogenetic analysis:
- Flag columns with gap fraction >50%, which may be alignment artifacts
- Check if gappy regions coincide with insertions in a single divergent sequence (remove that sequence and re-align)
- For critical analyses, run GUIDANCE2 or MUSCLE5 ensemble to get per-column confidence scores; mask columns below the reliability threshold (default: 0.93 for GUIDANCE2)
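The first check in the list above can be combined with a conservation test to flag columns that are both gappy and poorly conserved. The following is a minimal sketch on plain aligned strings, for example [str(r.seq) for r in alignment]. It assumes conservation means the share of non-gap residues that match the most common residue. The 0.5 conservation cutoff and the function name are illustrative.

```python
from collections import Counter


def flag_unreliable_columns(sequences, max_gap_fraction=0.5,
                            min_conservation=0.5, gap_char='-'):
    """Return indices of columns with high gap fraction AND low conservation."""
    flagged = []
    num_seqs = len(sequences)
    for col_idx, column in enumerate(zip(*sequences)):
        gap_fraction = column.count(gap_char) / num_seqs
        residues = [char for char in column if char != gap_char]
        if residues:
            top_count = Counter(residues).most_common(1)[0][1]
            conservation = top_count / len(residues)
        else:
            conservation = 0.0  # Gap-only column: nothing to be conserved
        if gap_fraction > max_gap_fraction and conservation < min_conservation:
            flagged.append(col_idx)
    return flagged


seqs = ['AC-G-', 'AC-G-', 'AC-G-', 'AC-G-', 'ACTGT', 'ACAGT', 'ACCGT']
print(flag_unreliable_columns(seqs))  # [2]
```

Column 4 in the example is just as gappy as column 2, but its residues agree, so only column 2 is flagged. A column with a single non-gap residue counts as fully conserved under this definition and is never flagged, so single-sequence insertions still need the separate check described in the list.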
Consensus Sequence
"Get consensus sequence" → Derive a single representative sequence from an MSA based on majority-rule voting at each column.
Goal: Generate a consensus sequence from the alignment using a frequency threshold.
Approach: At each column, select the most common non-gap character if it exceeds the threshold; otherwise mark as ambiguous.
Simple Majority Consensus
def consensus_sequence(alignment, threshold=0.5, gap_char='-', ambiguous='N'):
    consensus = []
    for col_idx in range(alignment.get_alignment_length()):
        column = alignment[:, col_idx]
        counts = Counter(column)
        most_common_char, most_common_count = counts.most_common(1)[0]
        if most_common_char == gap_char:
            counts.pop(gap_char, None)
            if counts:
                most_common_char, most_common_count = counts.most_common(1)[0]
            else:
                most_common_char = gap_char
        if most_common_count / len(alignment) >= threshold:
            consensus.append(most_common_char)
        else:
            consensus.append(ambiguous)
    return ''.join(consensus)

consensus = consensus_sequence(alignment, threshold=0.5)
Note on Bio.Align.AlignInfo
The AlignInfo.SummaryInfo class is deprecated in recent Biopython versions. The custom consensus_sequence() function above is the recommended approach. If you see deprecation warnings when using AlignInfo, use the custom implementation instead.
Extracting Regions
Slice by Column Range
region = alignment[:, 100:200] # Columns 100-199
Slice by Sequence Range
subset = alignment[0:10] # First 10 sequences
Extract Ungapped Regions from Reference
def extract_ungapped_regions(alignment, ref_idx=0):
    ref_seq = str(alignment[ref_idx].seq)
    ungapped_cols = [i for i, char in enumerate(ref_seq) if char != '-']
    new_records = []
    for record in alignment:
        seq_str = str(record.seq)  # Convert once per record, not once per column
        new_seq = ''.join(seq_str[i] for i in ungapped_cols)
        new_records.append(SeqRecord(Seq(new_seq), id=record.id, description=record.description))
    return MultipleSeqAlignment(new_records)

ungapped = extract_ungapped_regions(alignment, ref_idx=0)
Sequence Filtering
Goal: Subset an alignment to retain only sequences matching specific criteria (ID pattern, gap content, uniqueness).
Approach: Iterate over alignment records, apply filter conditions, and reconstruct a new MultipleSeqAlignment from matching records.
Filter by Sequence ID Pattern
import re

def filter_by_id(alignment, pattern):
    regex = re.compile(pattern)
    matching = [record for record in alignment if regex.search(record.id)]
    return MultipleSeqAlignment(matching)

bacteria_only = filter_by_id(alignment, r'^Bac_')
Filter by Gap Content
def filter_by_gap_content(alignment, max_gap_fraction=0.1):
    filtered = []
    for record in alignment:
        gap_fraction = str(record.seq).count('-') / len(record.seq)
        if gap_fraction <= max_gap_fraction:
            filtered.append(record)
    return MultipleSeqAlignment(filtered)

low_gap_seqs = filter_by_gap_content(alignment, max_gap_fraction=0.1)
Remove Duplicate Sequences
def remove_duplicates(alignment):
    seen_seqs = {}
    unique_records = []
    for record in alignment:
        seq_str = str(record.seq)
        if seq_str not in seen_seqs:
            seen_seqs[seq_str] = record.id
            unique_records.append(record)
    return MultipleSeqAlignment(unique_records)

unique_alignment = remove_duplicates(alignment)
Working with Annotations
Stockholm Format Annotations
alignment = AlignIO.read('pfam.sto', 'stockholm')
for record in alignment:
    if 'secondary_structure' in record.letter_annotations:
        ss = record.letter_annotations['secondary_structure']
        print(f'{record.id}: {ss}')
Add Annotations to Records
for record in alignment:
    record.annotations['source'] = 'my_analysis'
    record.annotations['quality'] = 'high'
Position Mapping
Goal: Convert between alignment column coordinates and ungapped sequence coordinates.
Approach: Walk through the sequence tracking gap characters to map between the two coordinate systems.
Map Alignment Position to Sequence Position
def alignment_to_sequence_position(record, align_pos):
    seq_pos = 0
    for i, char in enumerate(str(record.seq)):
        if i == align_pos:
            return seq_pos if char != '-' else None
        if char != '-':
            seq_pos += 1
    return None
Map Sequence Position to Alignment Position
def sequence_to_alignment_position(record, seq_pos):
    current_seq_pos = 0
    for i, char in enumerate(str(record.seq)):
        if char != '-':
            if current_seq_pos == seq_pos:
                return i
            current_seq_pos += 1
    return None
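When many positions need converting for the same record, walking the sequence once and storing both directions avoids repeated scans. The following is a minimal sketch on a plain aligned string, for example str(record.seq). The function name build_position_maps is illustrative, and all coordinates are 0-based, matching the two functions above.

```python
def build_position_maps(aligned_seq, gap_char='-'):
    """Build both coordinate maps for one aligned sequence in a single pass."""
    align_to_seq = []  # Index: alignment column; value: ungapped position or None
    seq_to_align = []  # Index: ungapped position; value: alignment column
    for align_pos, char in enumerate(aligned_seq):
        if char == gap_char:
            align_to_seq.append(None)
        else:
            align_to_seq.append(len(seq_to_align))
            seq_to_align.append(align_pos)
    return align_to_seq, seq_to_align


align_to_seq, seq_to_align = build_position_maps('A--CG-T')
print(align_to_seq)  # [0, None, None, 1, 2, None, 3]
print(seq_to_align)  # [0, 3, 4, 6]
```

Lookups are then plain list indexing, at the cost of holding two lists per sequence.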
Quick Reference: Common Operations
| Task | Code |
|---|---|
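| Get column | alignment[:, idx] |
| Get sequence | alignment[idx] |
| Column count | alignment.get_alignment_length() |
| Sequence count | len(alignment) |
| Find gaps | str(record.seq).count('-') |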
| Consensus | Use custom function |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| IndexError | Column index out of range | Check alignment.get_alignment_length() |
| Unequal sequence lengths | Invalid MSA | Ensure all sequences same length |
| Empty Counter | All gaps in column | Handle gap-only columns |
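Two of the errors in the table can be diagnosed before an alignment object is built. The following is a minimal sketch on plain strings with hypothetical helper names. It assumes the records are available as (id, sequence) pairs, for instance from a FASTA file that failed to load as an alignment.

```python
from collections import defaultdict


def group_ids_by_length(named_sequences):
    """Map each sequence length to the IDs that have it; one key means a valid MSA."""
    by_length = defaultdict(list)
    for seq_id, seq in named_sequences:
        by_length[len(seq)].append(seq_id)
    return dict(by_length)


def safe_column(sequences, col_idx):
    """Return one column as a string, with a descriptive error when out of range."""
    width = len(sequences[0]) if sequences else 0
    if not 0 <= col_idx < width:
        raise IndexError(f'column {col_idx} is outside 0..{width - 1}')
    return ''.join(seq[col_idx] for seq in sequences)


records = [('a', 'AC-G'), ('b', 'ACTG'), ('c', 'ACG')]
print(group_ids_by_length(records))     # {4: ['a', 'b'], 3: ['c']}
print(safe_column(['AC-G', 'ACTG'], 2))  # -T
```

Unlike alignment[:, idx], safe_column rejects negative indices. That is a design choice of this sketch, intended to surface off-by-one mistakes early.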
Related Skills
- multiple-alignment - Run MSA tools (MAFFT, MUSCLE5, ClustalOmega) to generate alignments
- alignment-io - Read/write alignment files in various formats
- pairwise-alignment - Create pairwise alignments
- msa-statistics - Calculate conservation metrics
- phylogenetics/modern-tree-inference - Build trees from processed alignments