Medical-research-skills biopython-alignment

Sequence alignment and alignment file processing with Biopython (Bio.Align/Bio.AlignIO), triggered when you need global/local pairwise alignment, MSA read/write/format conversion, or alignment statistics/filtering.

install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/biopython-alignment" ~/.claude/skills/aipoch-medical-research-skills-biopython-alignment && rm -rf "$T"
manifest: scientific-skills/Data Analysis/biopython-alignment/SKILL.md
source content

Source: https://github.com/aipoch/medical-research-skills

biopython-alignment

When to Use

  • You need global alignment between two protein (or nucleotide) sequences and want a reproducible score and aligned strings.
  • You need local alignment to find the best matching fragment/subsequence between two DNA/RNA/protein sequences.
  • You need to read, write, or convert multiple sequence alignment (MSA) files (e.g., FASTA/Clustal/Stockholm) using Biopython I/O.
  • You want to compute alignment statistics (e.g., identity, coverage, conservation per column) and filter alignments by thresholds.
  • You need to apply substitution matrices (e.g., BLOSUM62) and tune gap penalties for biologically meaningful scoring.

Key Features

  • Pairwise alignment via
    Bio.Align.PairwiseAligner
    (global and local modes).
  • Alignment scoring with configurable match/mismatch and gap penalties.
  • Protein substitution matrices via
    Bio.Align.substitution_matrices
    (e.g., BLOSUM/PAM).
  • MSA parsing and serialization via
    Bio.AlignIO
    (read/write/format conversion).
  • Basic alignment statistics: identity, aligned length, coverage, and MSA column conservation.

Dependencies

  • biopython>=1.81
  • numpy>=1.21

Example Usage

# -*- coding: utf-8 -*-
"""
Runnable examples for:
1) Global protein alignment
2) Local DNA alignment (best fragment)
3) MSA parsing + column conservation

Requires: biopython, numpy
"""

from __future__ import annotations

from io import StringIO
import numpy as np

from Bio.Align import PairwiseAligner
from Bio.Align import substitution_matrices
from Bio import AlignIO


def global_protein_alignment(seq_a: str, seq_b: str) -> None:
    matrix = substitution_matrices.load("BLOSUM62")

    aligner = PairwiseAligner()
    aligner.mode = "global"
    aligner.substitution_matrix = matrix
    aligner.open_gap_score = -10.0
    aligner.extend_gap_score = -0.5

    alignments = aligner.align(seq_a, seq_b)
    best = alignments[0]

    print("=== Global protein alignment (best) ===")
    print("Score:", best.score)
    print(best)


def local_dna_alignment_best_fragment(seq_a: str, seq_b: str) -> None:
    aligner = PairwiseAligner()
    aligner.mode = "local"
    aligner.match_score = 2.0
    aligner.mismatch_score = -1.0
    aligner.open_gap_score = -2.0
    aligner.extend_gap_score = -0.5

    best = aligner.align(seq_a, seq_b)[0]

    # Extract the aligned fragment coordinates from the first aligned block.
    # aligned is a tuple: (aligned_coords_in_seq_a, aligned_coords_in_seq_b)
    a_blocks, b_blocks = best.aligned
    a_start, a_end = a_blocks[0]
    b_start, b_end = b_blocks[0]

    print("=== Local DNA alignment (best) ===")
    print("Score:", best.score)
    print(best)
    print("Best fragment in seq_a:", seq_a[a_start:a_end], f"(coords {a_start}:{a_end})")
    print("Best fragment in seq_b:", seq_b[b_start:b_end], f"(coords {b_start}:{b_end})")


def msa_column_conservation(fasta_text: str) -> None:
    handle = StringIO(fasta_text)
    msa = AlignIO.read(handle, "fasta")  # MultipleSeqAlignment

    # Convert to a 2D array of characters: shape (n_seqs, aln_len)
    arr = np.array([list(str(rec.seq)) for rec in msa], dtype="U1")
    n_seqs, aln_len = arr.shape

    # Conservation per column: fraction of the most common non-gap character.
    # Treat '-' as gap; ignore gaps when computing the most common residue.
    conservation = []
    for j in range(aln_len):
        col = arr[:, j]
        col = col[col != "-"]
        if col.size == 0:
            conservation.append(0.0)
            continue
        values, counts = np.unique(col, return_counts=True)
        conservation.append(float(counts.max() / counts.sum()))

    print("=== MSA column conservation ===")
    print("n_seqs:", n_seqs, "aln_len:", aln_len)
    print("conservation:", [round(x, 3) for x in conservation])


def main() -> None:
    # 1) Global alignment (protein)
    seq_a = "MKTAYIAKQRQISFVKSHFSRQDILD"
    seq_b = "MKLAYIAKQRQISFVKSHFTRQDILN"
    global_protein_alignment(seq_a, seq_b)

    # 2) Local alignment (DNA)
    seq_a = "ATGCGTACGTTAGC"
    seq_b = "GGGATGCGTACGAAAC"
    local_dna_alignment_best_fragment(seq_a, seq_b)

    # 3) MSA conservation (FASTA)
    fasta_text = ">s1\nACGTACGT\n>s2\nACGTTCGT\n>s3\nACGTACGA\n"
    msa_column_conservation(fasta_text)


if __name__ == "__main__":
    main()

Implementation Details

  • Pairwise alignment engine: uses
    Bio.Align.PairwiseAligner
    , which performs dynamic programming alignment under the selected mode:
    • mode="global"
      : aligns full-length sequences end-to-end.
    • mode="local"
      : finds the highest-scoring matching region (best subsequence pair).
  • Scoring configuration:
    • For proteins, prefer
      substitution_matrix
      (e.g.,
      BLOSUM62
      ) plus gap penalties (
      open_gap_score
      ,
      extend_gap_score
      ).
    • For nucleotides, a simple scheme is common:
      match_score
      ,
      mismatch_score
      , and gap penalties.
  • Selecting the best alignment:
    aligner.align(a, b)
    returns an iterable of alignments sorted by score; use
    [0]
    for the top-scoring result.
  • Local “best fragment” extraction:
    • alignment.aligned
      returns aligned coordinate blocks for each sequence.
    • The first block
      (start, end)
      typically corresponds to the highest-scoring contiguous aligned region; slice the original sequences with these coordinates to obtain the fragment.
  • MSA I/O and statistics:
    • Bio.AlignIO.read(handle, fmt)
      parses an alignment into a
      MultipleSeqAlignment
      .
    • Column conservation can be computed as:
      max_count(non-gap residues in column) / total_non_gap_count(column)
      .
  • Operational conventions (recommended):
    • Store runtime configuration in
      config/task_config.json
      and invoke scripts as
      python scripts/<task_name>.py
      .
    • Avoid stacking many CLI
      --
      parameters; keep parameters in the config file.
    • Always specify
      encoding="utf-8"
      for file I/O; for JSON output use
      ensure_ascii=False
      .