LLMs-Universal-Life-Science-and-Clinical-Skills- pairwise-alignment

<!--

install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Sequence_Analysis/alignment/pairwise-alignment" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-pairwise-alignment && rm -rf "$T"
manifest: Skills/Sequence_Analysis/alignment/pairwise-alignment/SKILL.md
source content
<!-- # COPYRIGHT NOTICE # This file is part of the "Universal Biomedical Skills" project. # Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu> # All Rights Reserved. # # This code is proprietary and confidential. # Unauthorized copying of this file, via any medium is strictly prohibited. # # Provenance: Authenticated by MD BABU MIA -->

name: bio-alignment-pairwise description: Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences. tool_type: python primary_tool: Bio.Align measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:

  • read_file
  • run_shell_command

Pairwise Sequence Alignment

Align two sequences using dynamic programming algorithms (Needleman-Wunsch for global, Smith-Waterman for local).

Required Import

from Bio.Align import PairwiseAligner
from Bio.Seq import Seq
from Bio import SeqIO

Core Concepts

ModeAlgorithmUse Case
global
Needleman-WunschFull-length alignment, similar-length sequences
local
Smith-WatermanFind best matching regions, different-length sequences

Creating an Aligner

# Basic aligner with defaults
aligner = PairwiseAligner()

# Configure mode and scoring
aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

# For protein alignment with substitution matrix
from Bio.Align import substitution_matrices
aligner = PairwiseAligner(mode='global', substitution_matrix=substitution_matrices.load('BLOSUM62'))

Performing Alignments

seq1 = Seq('ACCGGTAACGTAG')
seq2 = Seq('ACCGTTAACGAAG')

# Get all optimal alignments
alignments = aligner.align(seq1, seq2)
print(f'Found {len(alignments)} optimal alignments')
print(alignments[0])  # Print first alignment

# Get score only (faster for large sequences)
score = aligner.score(seq1, seq2)

Alignment Output Format

target            0 ACCGGTAACGTAG 13
                  0 |||||.||||.|| 13
query             0 ACCGTTAACGAAG 13

Accessing Alignment Data

alignment = alignments[0]

# Basic properties
print(alignment.score)                    # Alignment score
print(alignment.shape)                    # (num_seqs, alignment_length)
print(len(alignment))                     # Alignment length

# Get aligned sequences with gaps
target_aligned = alignment[0, :]          # First sequence (target) with gaps
query_aligned = alignment[1, :]           # Second sequence (query) with gaps

# Get coordinate mapping
print(alignment.aligned)                  # Array of aligned segment coordinates
print(alignment.coordinates)              # Full coordinate array

Alignment Counts (Identities, Mismatches, Gaps)

alignment = alignments[0]
counts = alignment.counts()

print(f'Identities: {counts.identities}')
print(f'Mismatches: {counts.mismatches}')
print(f'Gaps: {counts.gaps}')

# Calculate percent identity
total_aligned = counts.identities + counts.mismatches
percent_identity = counts.identities / total_aligned * 100
print(f'Percent identity: {percent_identity:.1f}%')

Common Scoring Configurations

DNA/RNA Alignment

aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

Protein Alignment

from Bio.Align import substitution_matrices
blosum62 = substitution_matrices.load('BLOSUM62')
aligner = PairwiseAligner(mode='global', substitution_matrix=blosum62, open_gap_score=-11, extend_gap_score=-1)

Local Alignment (Find Best Region)

aligner = PairwiseAligner(mode='local', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)

Semiglobal (Overlap/Extension)

# Allow free end gaps on query (useful for primer alignment)
aligner = PairwiseAligner(mode='global')
aligner.query_left_open_gap_score = 0
aligner.query_left_extend_gap_score = 0
aligner.query_right_open_gap_score = 0
aligner.query_right_extend_gap_score = 0

Available Substitution Matrices

from Bio.Align import substitution_matrices
print(substitution_matrices.load())  # List all available matrices

# Common matrices
blosum62 = substitution_matrices.load('BLOSUM62')  # General protein
blosum80 = substitution_matrices.load('BLOSUM80')  # Closely related proteins
pam250 = substitution_matrices.load('PAM250')      # Distantly related proteins

Working with SeqRecord Objects

from Bio import SeqIO

records = list(SeqIO.parse('sequences.fasta', 'fasta'))
seq1, seq2 = records[0].seq, records[1].seq

aligner = PairwiseAligner(mode='global', match_score=1, mismatch_score=-1)
alignments = aligner.align(seq1, seq2)

Iterating Over Multiple Alignments

# Limit number of alignments returned (memory efficient)
aligner.max_alignments = 100

for i, alignment in enumerate(alignments):
    print(f'Alignment {i+1}: score={alignment.score}')
    if i >= 4:
        break

Substitution Matrix from Alignment

alignment = alignments[0]
substitutions = alignment.substitutions

# View as array (rows=target, cols=query)
print(substitutions)

# Access specific substitution counts
# substitutions['A', 'T'] gives count of A aligned to T

Export Alignment to Different Formats

alignment = alignments[0]

# Various output formats
print(format(alignment, 'fasta'))     # FASTA format
print(format(alignment, 'clustal'))   # Clustal format
print(format(alignment, 'psl'))       # PSL format (BLAT)
print(format(alignment, 'sam'))       # SAM format

Quick Reference: Scoring Parameters

ParameterDescriptionTypical DNATypical Protein
match_score
Score for identical bases1-2Use matrix
mismatch_score
Penalty for mismatches-1 to -3Use matrix
open_gap_score
Cost to start a gap-5 to -15-10 to -12
extend_gap_score
Cost per gap extension-0.5 to -2-0.5 to -1
substitution_matrix
Scoring matrixN/ABLOSUM62

Common Errors

ErrorCauseSolution
OverflowError
Too many optimal alignmentsSet
aligner.max_alignments
Low scoresWrong scoring schemeUse substitution matrix for proteins
No alignments in local modeScores all negativeEnsure
match_score
> 0

Decision Tree: Choosing Alignment Mode

Need full-length comparison?
├── Yes → Use mode='global'
│   └── Sequences similar length?
│       ├── Yes → Standard global
│       └── No → Consider semiglobal (free end gaps)
└── No → Use mode='local'
    └── Find best matching regions only

Related Skills

  • alignment-io - Save alignments to files in various formats
  • msa-parsing - Work with multiple sequence alignments
  • msa-statistics - Calculate identity, similarity metrics
  • sequence-manipulation/motif-search - Pattern matching in sequences
<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->