Medical-research-skills biopython-alignment
Sequence alignment and alignment file processing with Biopython (Bio.Align/Bio.AlignIO), triggered when you need global/local pairwise alignment, MSA read/write/format conversion, or alignment statistics/filtering.
install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/biopython-alignment" ~/.claude/skills/aipoch-medical-research-skills-biopython-alignment && rm -rf "$T"
manifest:
scientific-skills/Data Analysis/biopython-alignment/SKILL.mdsource content
biopython-alignment
When to Use
- You need global alignment between two protein (or nucleotide) sequences and want a reproducible score and aligned strings.
- You need local alignment to find the best matching fragment/subsequence between two DNA/RNA/protein sequences.
- You need to read, write, or convert multiple sequence alignment (MSA) files (e.g., FASTA/Clustal/Stockholm) using Biopython I/O.
- You want to compute alignment statistics (e.g., identity, coverage, conservation per column) and filter alignments by thresholds.
- You need to apply substitution matrices (e.g., BLOSUM62) and tune gap penalties for biologically meaningful scoring.
Key Features
- Pairwise alignment via
(global and local modes).Bio.Align.PairwiseAligner - Alignment scoring with configurable match/mismatch and gap penalties.
- Protein substitution matrices via
(e.g., BLOSUM/PAM).Bio.Align.substitution_matrices - MSA parsing and serialization via
(read/write/format conversion).Bio.AlignIO - Basic alignment statistics: identity, aligned length, coverage, and MSA column conservation.
Dependencies
biopython>=1.81numpy>=1.21
Example Usage
# -*- coding: utf-8 -*- """ Runnable examples for: 1) Global protein alignment 2) Local DNA alignment (best fragment) 3) MSA parsing + column conservation Requires: biopython, numpy """ from __future__ import annotations from io import StringIO import numpy as np from Bio.Align import PairwiseAligner from Bio.Align import substitution_matrices from Bio import AlignIO def global_protein_alignment(seq_a: str, seq_b: str) -> None: matrix = substitution_matrices.load("BLOSUM62") aligner = PairwiseAligner() aligner.mode = "global" aligner.substitution_matrix = matrix aligner.open_gap_score = -10.0 aligner.extend_gap_score = -0.5 alignments = aligner.align(seq_a, seq_b) best = alignments[0] print("=== Global protein alignment (best) ===") print("Score:", best.score) print(best) def local_dna_alignment_best_fragment(seq_a: str, seq_b: str) -> None: aligner = PairwiseAligner() aligner.mode = "local" aligner.match_score = 2.0 aligner.mismatch_score = -1.0 aligner.open_gap_score = -2.0 aligner.extend_gap_score = -0.5 best = aligner.align(seq_a, seq_b)[0] # Extract the aligned fragment coordinates from the first aligned block. # aligned is a tuple: (aligned_coords_in_seq_a, aligned_coords_in_seq_b) a_blocks, b_blocks = best.aligned a_start, a_end = a_blocks[0] b_start, b_end = b_blocks[0] print("=== Local DNA alignment (best) ===") print("Score:", best.score) print(best) print("Best fragment in seq_a:", seq_a[a_start:a_end], f"(coords {a_start}:{a_end})") print("Best fragment in seq_b:", seq_b[b_start:b_end], f"(coords {b_start}:{b_end})") def msa_column_conservation(fasta_text: str) -> None: handle = StringIO(fasta_text) msa = AlignIO.read(handle, "fasta") # MultipleSeqAlignment # Convert to a 2D array of characters: shape (n_seqs, aln_len) arr = np.array([list(str(rec.seq)) for rec in msa], dtype="U1") n_seqs, aln_len = arr.shape # Conservation per column: fraction of the most common non-gap character. # Treat '-' as gap; ignore gaps when computing the most common residue. conservation = [] for j in range(aln_len): col = arr[:, j] col = col[col != "-"] if col.size == 0: conservation.append(0.0) continue values, counts = np.unique(col, return_counts=True) conservation.append(float(counts.max() / counts.sum())) print("=== MSA column conservation ===") print("n_seqs:", n_seqs, "aln_len:", aln_len) print("conservation:", [round(x, 3) for x in conservation]) def main() -> None: # 1) Global alignment (protein) seq_a = "MKTAYIAKQRQISFVKSHFSRQDILD" seq_b = "MKLAYIAKQRQISFVKSHFTRQDILN" global_protein_alignment(seq_a, seq_b) # 2) Local alignment (DNA) seq_a = "ATGCGTACGTTAGC" seq_b = "GGGATGCGTACGAAAC" local_dna_alignment_best_fragment(seq_a, seq_b) # 3) MSA conservation (FASTA) fasta_text = ">s1\nACGTACGT\n>s2\nACGTTCGT\n>s3\nACGTACGA\n" msa_column_conservation(fasta_text) if __name__ == "__main__": main()
Implementation Details
- Pairwise alignment engine: uses
, which performs dynamic programming alignment under the selected mode:Bio.Align.PairwiseAligner
: aligns full-length sequences end-to-end.mode="global"
: finds the highest-scoring matching region (best subsequence pair).mode="local"
- Scoring configuration:
- For proteins, prefer
(e.g.,substitution_matrix
) plus gap penalties (BLOSUM62
,open_gap_score
).extend_gap_score - For nucleotides, a simple scheme is common:
,match_score
, and gap penalties.mismatch_score
- For proteins, prefer
- Selecting the best alignment:
returns an iterable of alignments sorted by score; usealigner.align(a, b)
for the top-scoring result.[0] - Local “best fragment” extraction:
returns aligned coordinate blocks for each sequence.alignment.aligned- The first block
typically corresponds to the highest-scoring contiguous aligned region; slice the original sequences with these coordinates to obtain the fragment.(start, end)
- MSA I/O and statistics:
parses an alignment into aBio.AlignIO.read(handle, fmt)
.MultipleSeqAlignment- Column conservation can be computed as:
.max_count(non-gap residues in column) / total_non_gap_count(column)
- Operational conventions (recommended):
- Store runtime configuration in
and invoke scripts asconfig/task_config.json
.python scripts/<task_name>.py - Avoid stacking many CLI
parameters; keep parameters in the config file.-- - Always specify
for file I/O; for JSON output useencoding="utf-8"
.ensure_ascii=False
- Store runtime configuration in