BioSkills bio-alignment-msa-statistics
Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/alignment/msa-statistics" ~/.claude/skills/gptomics-bioskills-bio-alignment-msa-statistics && rm -rf "$T"
alignment/msa-statistics/SKILL.mdVersion Compatibility
Reference examples tested with: BioPython 1.83+, numpy 1.26+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function)
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
MSA Statistics
Calculate sequence identity, conservation scores, substitution counts, and other alignment metrics.
Required Import
Goal: Load modules for alignment I/O, substitution scoring, and statistical calculations.
Approach: Import AlignIO for reading alignments, Counter for column analysis, numpy for matrix operations, and math for entropy calculations.
from Bio import AlignIO from Bio.Align import substitution_matrices from collections import Counter import numpy as np import math
Pairwise Identity
"Calculate percent identity" → Compute the fraction of identical aligned residues between sequence pairs.
Goal: Measure sequence similarity as percent identity for individual pairs or across all sequences in an alignment.
Approach: Count matching non-gap positions divided by total aligned positions; optionally compute a full N-by-N identity matrix.
Percent Identity Definitions
There are four common denominators, producing up to 11.5% difference on the same alignment. Combined with different alignment algorithms, variation reaches 22%. Always report which method was used.
| Method | Denominator | Code |
|---|---|---|
| PID1 | Aligned positions including internal gaps | |
| PID2 | Aligned residue pairs only (no gaps) | |
| PID3 | Shorter sequence length (ungapped) | |
| PID4 | Mean sequence length (ungapped) | Best correlation with structural similarity (r=0.86) |
PID2 always gives the highest value; PID4 correlates best with structural similarity and is recommended for evolutionary analyses.
Calculate Identity Between Two Sequences
def pairwise_identity(seq1, seq2, method='pid1'): matches = sum(a == b and a != '-' for a, b in zip(seq1, seq2)) if method == 'pid1': denom = sum(a != '-' or b != '-' for a, b in zip(seq1, seq2)) elif method == 'pid2': denom = sum(a != '-' and b != '-' for a, b in zip(seq1, seq2)) elif method == 'pid3': denom = min(len(seq1.replace('-', '')), len(seq2.replace('-', ''))) elif method == 'pid4': denom = (len(seq1.replace('-', '')) + len(seq2.replace('-', ''))) / 2 return matches / denom if denom > 0 else 0 alignment = AlignIO.read('alignment.fasta', 'fasta') seq1, seq2 = str(alignment[0].seq), str(alignment[1].seq) for method in ['pid1', 'pid2', 'pid3', 'pid4']: print(f'{method}: {pairwise_identity(seq1, seq2, method) * 100:.1f}%')
Identity Matrix for All Sequences
def identity_matrix(alignment): n = len(alignment) matrix = np.zeros((n, n)) for i in range(n): for j in range(i, n): seq_i = str(alignment[i].seq) seq_j = str(alignment[j].seq) ident = pairwise_identity(seq_i, seq_j) matrix[i, j] = matrix[j, i] = ident return matrix alignment = AlignIO.read('alignment.fasta', 'fasta') mat = identity_matrix(alignment) seq_ids = [r.id for r in alignment] print('Pairwise Identity Matrix:') print(f'{"":>10}', ' '.join(f'{s[:8]:>8}' for s in seq_ids)) for i, row in enumerate(mat): print(f'{seq_ids[i][:10]:>10}', ' '.join(f'{v*100:>7.1f}%' for v in row))
Conservation Score
Goal: Quantify per-column and overall alignment conservation to identify conserved and variable regions.
Approach: Calculate the fraction of the most common residue at each column, optionally ignoring gaps, and smooth with a sliding window.
Per-Column Conservation
def column_conservation(alignment, col_idx, ignore_gaps=True): column = alignment[:, col_idx] if ignore_gaps: column = column.replace('-', '') if not column: return 0.0 counts = Counter(column) most_common_count = counts.most_common(1)[0][1] return most_common_count / len(column) alignment = AlignIO.read('alignment.fasta', 'fasta') for i in range(min(20, alignment.get_alignment_length())): cons = column_conservation(alignment, i) print(f'Column {i}: {cons*100:.0f}% conserved')
Average Conservation Across Alignment
def average_conservation(alignment, ignore_gaps=True): scores = [] for col_idx in range(alignment.get_alignment_length()): scores.append(column_conservation(alignment, col_idx, ignore_gaps)) return sum(scores) / len(scores) avg_cons = average_conservation(alignment) print(f'Average conservation: {avg_cons*100:.1f}%')
Conservation Profile
def conservation_profile(alignment, window=10): profile = [] for i in range(alignment.get_alignment_length()): start = max(0, i - window // 2) end = min(alignment.get_alignment_length(), i + window // 2) scores = [column_conservation(alignment, j) for j in range(start, end)] profile.append(sum(scores) / len(scores)) return profile profile = conservation_profile(alignment, window=10)
Substitution Counts
Goal: Tabulate observed substitution frequencies from the alignment for evolutionary analysis or custom scoring matrices.
Approach: Enumerate all pairwise non-gap character comparisons at each column and tally substitution pairs.
Count Substitutions from Alignment
def substitution_counts(alignment): from collections import defaultdict counts = defaultdict(int) for col_idx in range(alignment.get_alignment_length()): column = alignment[:, col_idx] chars = [c for c in column if c != '-'] for i, c1 in enumerate(chars): for c2 in chars[i+1:]: if c1 != c2: pair = tuple(sorted([c1, c2])) counts[pair] += 1 return dict(counts) subs = substitution_counts(alignment) print('Substitution counts:') for pair, count in sorted(subs.items(), key=lambda x: -x[1])[:10]: print(f' {pair[0]}<->{pair[1]}: {count}')
Build Substitution Matrix from MSA
def build_substitution_matrix(alignment): from collections import defaultdict matrix = defaultdict(lambda: defaultdict(int)) for col_idx in range(alignment.get_alignment_length()): column = alignment[:, col_idx] chars = [c for c in column if c != '-'] for c1 in chars: for c2 in chars: matrix[c1][c2] += 1 return {k: dict(v) for k, v in matrix.items()} sub_matrix = build_substitution_matrix(alignment)
Using Alignment.substitutions (Pairwise Alignments)
For pairwise alignments created with
PairwiseAligner, use the built-in .substitutions property:
from Bio.Align import PairwiseAligner aligner = PairwiseAligner(mode='global', match_score=1, mismatch_score=-1) alignments = aligner.align(seq1, seq2) substitutions = alignments[0].substitutions # Returns Array with substitution counts print(substitutions)
Information Content
Goal: Measure column variability using Shannon entropy and derive information content for identifying functionally important positions.
Approach: Compute Shannon entropy from character frequencies per column; information content is max entropy minus observed entropy.
Shannon Entropy Per Column
import math def shannon_entropy(column, ignore_gaps=True): if ignore_gaps: column = column.replace('-', '') if not column: return 0.0 counts = Counter(column) total = len(column) entropy = 0.0 for count in counts.values(): p = count / total if p > 0: entropy -= p * math.log2(p) return entropy alignment = AlignIO.read('alignment.fasta', 'fasta') for i in range(min(20, alignment.get_alignment_length())): column = alignment[:, i] ent = shannon_entropy(column) print(f'Column {i}: entropy = {ent:.2f} bits')
Information Content (Max Entropy - Observed Entropy)
def information_content(column, alphabet_size=4): max_entropy = math.log2(alphabet_size) # 4 for DNA, 20 for protein observed_entropy = shannon_entropy(column) return max_entropy - observed_entropy # DNA alignment for i in range(min(20, alignment.get_alignment_length())): column = alignment[:, i] ic = information_content(column, alphabet_size=4) print(f'Column {i}: IC = {ic:.2f} bits')
Gap Statistics
Goal: Summarize gap distribution across the alignment to assess alignment quality and identify problematic regions.
Approach: Calculate gap fractions per column and aggregate statistics including total gaps, gap-free columns, and gappiest sequence/column.
Gap Fraction Per Column
def gap_profile(alignment): profile = [] for col_idx in range(alignment.get_alignment_length()): column = alignment[:, col_idx] gap_fraction = column.count('-') / len(alignment) profile.append(gap_fraction) return profile gaps = gap_profile(alignment) avg_gaps = sum(gaps) / len(gaps) print(f'Average gap fraction: {avg_gaps*100:.1f}%')
Gap Statistics Summary
def gap_statistics(alignment): num_seqs = len(alignment) num_cols = alignment.get_alignment_length() total_positions = num_seqs * num_cols total_gaps = sum(str(r.seq).count('-') for r in alignment) gaps_per_seq = [str(r.seq).count('-') for r in alignment] gaps_per_col = [alignment[:, i].count('-') for i in range(num_cols)] return { 'total_gaps': total_gaps, 'gap_fraction': total_gaps / total_positions, 'gappiest_seq': max(range(num_seqs), key=lambda i: gaps_per_seq[i]), 'gappiest_col': max(range(num_cols), key=lambda i: gaps_per_col[i]), 'gap_free_cols': sum(1 for g in gaps_per_col if g == 0), } stats = gap_statistics(alignment) print(f"Total gaps: {stats['total_gaps']}") print(f"Gap fraction: {stats['gap_fraction']*100:.1f}%") print(f"Gap-free columns: {stats['gap_free_cols']}")
Alignment Quality Metrics
Goal: Score alignment quality using sum-of-pairs or simple match/mismatch/gap scoring across all columns.
Approach: For each column, score all pairwise residue comparisons and sum across the alignment.
Overall Alignment Score
def alignment_score(alignment, match=1, mismatch=-1, gap=-2): total_score = 0 for col_idx in range(alignment.get_alignment_length()): column = alignment[:, col_idx] for i, c1 in enumerate(column): for c2 in column[i+1:]: if c1 == '-' or c2 == '-': total_score += gap elif c1 == c2: total_score += match else: total_score += mismatch return total_score score = alignment_score(alignment) print(f'Alignment score: {score}')
Sum of Pairs Score
def sum_of_pairs(alignment, substitution_matrix=None): if substitution_matrix is None: substitution_matrix = substitution_matrices.load('BLOSUM62') total = 0 for col_idx in range(alignment.get_alignment_length()): column = alignment[:, col_idx] for i, c1 in enumerate(column): for c2 in column[i+1:]: if c1 != '-' and c2 != '-': total += substitution_matrix.get((c1, c2), 0) return total
Position-Specific Score Matrix (PSSM)
Goal: Build a position-specific score matrix (PSSM) from the alignment for motif analysis or sequence scoring.
Approach: Count non-gap character frequencies at each column, producing a list of per-position dictionaries.
def position_specific_score_matrix(alignment): pssm = [] for col_idx in range(alignment.get_alignment_length()): column = alignment[:, col_idx] counts = Counter(column) if '-' in counts: del counts['-'] pssm.append(dict(counts)) return pssm alignment = AlignIO.read('alignment.fasta', 'fasta') pssm = position_specific_score_matrix(alignment) for i, row in enumerate(pssm[:10]): print(f'Position {i}: {row}')
Alignment Quality Assessment
When to Worry About Alignment Quality
| Warning Sign | Implication | Action |
|---|---|---|
| Average pairwise identity <25% (protein) | Twilight zone; alignment may be unreliable | Use GUIDANCE2 to assess; consider structural alignment |
| >30% of columns have >50% gaps | Possible non-homologous sequences or misalignment | Remove outlier sequences and re-align |
| Identity varies dramatically across regions | Domain architecture mismatch | Align domains separately |
| Conservation pattern absent in expected functional regions | Alignment error or non-homology | Verify with BLAST that sequences are truly homologous |
Quantifying Alignment Uncertainty
For critical downstream analyses (phylogenetics, selection analysis), alignment uncertainty should be quantified rather than assumed. Two approaches:
- GUIDANCE2: Generates ~400 perturbed alignments varying guide trees, gap penalties, and co-optimal solutions. Column confidence = frequency of that column arrangement across perturbations. Default reliability threshold: 0.93.
- MUSCLE5 ensemble: Generates alignments with perturbed HMM parameters. Column confidence = fraction of ensemble members supporting that column. Faster than GUIDANCE2.
Alignment uncertainty propagates directly into evolutionary inference. Different aligners can produce phylogenies supporting fundamentally different biological conclusions. For selection analysis (dN/dS), misaligned codons create artificial nonsynonymous differences, leading to false positive selection signals.
Note on Bio.Align.AlignInfo
The
AlignInfo.SummaryInfo class is deprecated in recent Biopython versions. Use the custom functions in this skill instead:
- For PSSM: use
aboveposition_specific_score_matrix() - For information content: use
function earlier in this skillinformation_content() - For consensus: see msa-parsing skill
Quick Reference: Metrics
| Metric | Description | Range |
|---|---|---|
| Identity | Fraction of identical residues | 0-1 |
| Conservation | Most common residue frequency | 0-1 |
| Shannon Entropy | Variability measure | 0 to log2(alphabet) |
| Information Content | Max entropy - observed entropy | 0 to log2(alphabet) |
| Gap Fraction | Proportion of gaps | 0-1 |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| Empty column after gap removal | Check for gap-only columns |
| Character not in substitution matrix | Handle gaps separately |
| Negative IC | Wrong alphabet size | Use 4 for DNA, 20 for protein |
Related Skills
- multiple-alignment - Run MSA tools and quantify alignment confidence with ensembles
- msa-parsing - Parse, filter, trim, and assess alignment quality
- alignment-io - Read/write alignment files
- pairwise-alignment - Create and score pairwise alignments
- sequence-manipulation/sequence-properties - Sequence-level statistics