SciAgent-Skills biopython-sequence-analysis

Biopython toolkit for sequence analysis workflows: parse FASTA/FASTQ/GenBank with SeqIO (plus GFF3 annotation handling), query NCBI databases via Entrez (esearch/efetch/elink), run remote and local BLAST with result parsing, perform pairwise and multiple sequence alignment (PairwiseAligner, MUSCLE/ClustalW), and build/visualize phylogenetic trees (Phylo module). Use for gene family studies, phylogenomics, comparative genomics, and programmatic NCBI pipelines. For PCR design, restriction digestion, and cloning workflows use biopython-molecular-biology; for SAM/BAM alignments use pysam.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/biopython-sequence-analysis" ~/.claude/skills/jaechang-hits-sciagent-skills-biopython-sequence-analysis && rm -rf "$T"
manifest: skills/genomics-bioinformatics/biopython-sequence-analysis/SKILL.md
source content

Biopython: Sequence Analysis Toolkit

Overview

Biopython provides a comprehensive suite of modules for sequence-centric bioinformatics: reading and writing every major biological file format (FASTA, FASTQ, GenBank, EMBL; GFF3 via a small manual parser or gffutils), querying NCBI databases programmatically, running BLAST searches and parsing results, aligning sequences pairwise or in multiple-sequence alignments, and building and visualizing phylogenetic trees. This skill focuses on analysis workflows — from NCBI data retrieval through alignment to phylogenetic inference.

For PCR primer design, restriction enzyme digestion, cloning simulation, protein structure analysis (Bio.PDB), and molecular weight/Tm calculations, see biopython-molecular-biology.

When to Use

  • Download a gene family from NCBI Nucleotide/Protein, align sequences, and construct a phylogenetic tree
  • Parse GenBank or GFF3 annotation files and extract CDS sequences for a set of features
  • Run a BLAST search against NCBI nt or nr, filter significant hits, and fetch their full sequences
  • Compute pairwise sequence identities or score alignments with BLOSUM62/PAM250 matrices
  • Index a large multi-FASTA or FASTQ file with SeqIO.index() for random-access retrieval without loading all sequences into RAM
  • Convert between sequence formats (FASTA ↔ GenBank ↔ FASTQ ↔ PHYLIP) in a single call
  • Traverse, root, prune, and annotate a Newick or Nexus phylogenetic tree programmatically
  • Use pysam instead when working with SAM/BAM/CRAM alignment files and mapped reads
  • Use scikit-bio instead when you need ecological diversity metrics (UniFrac, beta diversity) or ordination methods such as PCoA
  • Use gget instead for quick gene lookups and one-liner NCBI queries without writing an Entrez pipeline

Prerequisites

  • Python packages: biopython, numpy, matplotlib
  • Optional tools: MUSCLE or ClustalW installed locally (invoked via subprocess in this skill; the legacy Bio.Align.Applications wrappers are deprecated)
  • NCBI access: set Entrez.email before any E-utilities call; obtain a free API key at https://www.ncbi.nlm.nih.gov/account/ for 10 req/s (default is 3 req/s)
  • Local BLAST: BLAST+ installed separately (conda install -c bioconda blast) for offline searches

Check before installing: The tool may already be available in the current environment (e.g., inside a pixi/conda env). Run command -v python first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via pixi run python rather than bare python.

pip install biopython numpy matplotlib
conda install -c bioconda blast  # optional, for local BLAST

Quick Start

from Bio import SeqIO, Entrez
from Bio.SeqUtils import gc_fraction

# Fetch a GenBank record and display basic stats
Entrez.email = "your.email@example.com"
handle = Entrez.efetch(db="nucleotide", id="NM_007294", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"ID: {record.id}")
print(f"Length: {len(record.seq)} bp")
print(f"GC content: {gc_fraction(record.seq)*100:.1f}%")
print(f"Features: {len(record.features)}")
print(f"First feature: {record.features[0].type} at {record.features[0].location}")
# ID: NM_007294.4
# Length: 7207 bp
# GC content: 47.3%
# Features: 9

Core API

Module 1: SeqIO — File Parsing and Format Conversion

Bio.SeqIO reads and writes every major sequence format (FASTA, FASTQ, GenBank, EMBL, PHYLIP, Nexus). SeqIO.parse() returns an iterator of SeqRecord objects; SeqIO.index() builds a dictionary-like index of file offsets for random access on large files (SeqIO.index_db() persists the index to disk).

from Bio import SeqIO

# Parse FASTA and FASTQ
fasta_records = list(SeqIO.parse("sequences.fasta", "fasta"))
print(f"FASTA: {len(fasta_records)} sequences")

# FASTQ: access quality scores
for rec in SeqIO.parse("reads.fastq", "fastq"):
    quals = rec.letter_annotations["phred_quality"]
    avg_q = sum(quals) / len(quals)
    print(f"  {rec.id}: {len(rec.seq)} bp, mean Q={avg_q:.1f}")
    break  # show one example

# Parse GenBank with feature access
for rec in SeqIO.parse("chromosome.gb", "genbank"):
    cdss = [f for f in rec.features if f.type == "CDS"]
    print(f"{rec.id}: {len(cdss)} CDS features")
    for cds in cdss[:3]:
        gene = cds.qualifiers.get("gene", ["unknown"])[0]
        print(f"  {gene}: {cds.location}")
from Bio import SeqIO

# SeqIO.index() for random access without loading all records
# Useful for large reference FASTA files (genomes, nr database subsets)
idx = SeqIO.index("large_genome.fasta", "fasta")
print(f"Index contains {len(idx)} sequences")

# Retrieve specific sequences by ID in O(1)
target = idx["chr1"]
print(f"chr1: {len(target.seq):,} bp")
region = target.seq[1_000_000:1_001_000]
print(f"Region [1M-1M+1kb]: {region[:60]}...")

# Format conversion: GenBank → FASTA in one call
n = SeqIO.convert("annotation.gb", "genbank", "sequences.fasta", "fasta")
print(f"Converted {n} records to FASTA")

Module 2: Seq and SeqRecord — Sequence Objects and Feature Annotations

Bio.Seq represents an immutable biological sequence with methods for complementing, transcription, and translation. Bio.SeqRecord wraps a Seq with an ID, description, and a list of SeqFeature annotations. Features use FeatureLocation with a strand (+1, -1, or None).

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqUtils import gc_fraction, MeltingTemp

dna = Seq("ATGAAACCCGGGTTTTAA")
print(f"GC content: {gc_fraction(dna)*100:.1f}%")
print(f"Reverse complement: {dna.reverse_complement()}")
print(f"Transcript: {dna.transcribe()}")
print(f"Translation: {dna.translate()}")
print(f"Translation (to stop): {dna.translate(to_stop=True)}")

# Tm calculation for a short primer
primer = Seq("ATGAAACCCGGG")
tm = MeltingTemp.Tm_Wallace(primer)
print(f"Tm (Wallace): {tm:.1f}°C")
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Build a SeqRecord with annotated features
gene_seq = Seq("ATGAAACCCGGGTTTTAAATCGATCG" * 10)
record = SeqRecord(gene_seq, id="MY_GENE_001", description="synthetic example gene")

# Annotate a CDS feature
cds = SeqFeature(
    FeatureLocation(0, 18, strand=+1),
    type="CDS",
    qualifiers={"gene": ["myGene"], "product": ["hypothetical protein"]}
)
record.features.append(cds)

# Extract and translate the feature
cds_seq = cds.location.extract(record.seq)
protein = cds_seq.translate(to_stop=True)
print(f"CDS: {cds_seq}")
print(f"Protein: {protein}")

# Slice record preserving feature annotations
sub = record[0:60]
print(f"Subrecord: {len(sub.seq)} bp, {len(sub.features)} features")

Module 3: Entrez — Programmatic NCBI Database Access

Bio.Entrez wraps the NCBI E-utilities API: esearch finds records matching a query, efetch retrieves full records, elink finds related records across databases, and esummary returns document summaries. Always set Entrez.email and respect the 3 req/s rate limit (10 req/s with an API key).

from Bio import Entrez, SeqIO
import time

Entrez.email = "your.email@example.com"
# Entrez.api_key = "YOUR_API_KEY"  # for 10 req/s

# esearch: find IDs matching a query
handle = Entrez.esearch(db="nucleotide", term="BRCA1[Gene] AND Homo sapiens[Organism] AND mRNA[Filter]", retmax=10)
search_results = Entrez.read(handle)
handle.close()

print(f"Total hits: {search_results['Count']}")
ids = search_results["IdList"]
print(f"Retrieved IDs: {ids}")

# efetch: download full GenBank records
for acc_id in ids[:3]:
    handle = Entrez.efetch(db="nucleotide", id=acc_id, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    print(f"  {record.id}: {len(record.seq)} bp — {record.description[:60]}")
    time.sleep(0.4)  # stay within rate limit
from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

# elink: cross-database links (e.g., PubMed article → related nucleotide sequences)
handle = Entrez.elink(dbfrom="pubmed", db="nucleotide", id="29087512")
link_results = Entrez.read(handle)
handle.close()

linked_ids = []
for linkset in link_results:
    for db_links in linkset.get("LinkSetDb", []):
        if db_links["DbTo"] == "nucleotide":
            linked_ids = [lnk["Id"] for lnk in db_links["Link"]]
            break

print(f"Nucleotide sequences linked to PubMed 29087512: {len(linked_ids)}")
print(f"First IDs: {linked_ids[:5]}")

# esummary: lightweight metadata without downloading full records
if linked_ids:
    handle = Entrez.esummary(db="nucleotide", id=",".join(linked_ids[:5]))
    summaries = Entrez.read(handle)
    handle.close()
    for doc in summaries:
        print(f"  {doc['AccessionVersion']}: {doc['Title'][:70]}")

Module 4: BLAST — Remote and Local Sequence Similarity Search

Bio.Blast.NCBIWWW.qblast() submits queries to NCBI BLAST servers and returns XML handles. Bio.Blast.NCBIXML.parse() (or .read() for single queries) yields Blast record objects with alignments and hsps. For large-scale searches, use local BLAST+ via subprocess.

from Bio.Blast import NCBIWWW, NCBIXML
from Bio.Seq import Seq

# Remote BLASTP against Swiss-Prot (small, reviewed database)
query = Seq("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY")
result_handle = NCBIWWW.qblast("blastp", "swissprot", str(query), hitlist_size=10)

# Parse results
blast_record = NCBIXML.read(result_handle)
print(f"Query: {blast_record.query}")
print(f"Database: {blast_record.database}")
print(f"Hits: {len(blast_record.alignments)}")

E_VALUE_THRESH = 1e-5
for alignment in blast_record.alignments[:5]:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            identity_pct = hsp.identities / hsp.align_length * 100
            print(f"\n  Hit: {alignment.title[:70]}")
            print(f"  E-value: {hsp.expect:.2e}, Identity: {identity_pct:.1f}%, Score: {hsp.score}")
            print(f"  Query:  {hsp.query[:60]}")
            print(f"  Match:  {hsp.match[:60]}")
            print(f"  Sbjct:  {hsp.sbjct[:60]}")
import subprocess
from Bio.Blast import NCBIXML

# Local BLAST+ via subprocess (faster for large batches)
# Requires: makeblastdb and blastp installed (conda install -c bioconda blast)

# Step 1: Build a local database from a FASTA file
subprocess.run(
    ["makeblastdb", "-in", "ref_proteins.fasta", "-dbtype", "prot", "-out", "ref_db"],
    check=True
)

# Step 2: Run blastp against local database
result = subprocess.run(
    ["blastp", "-query", "query.fasta", "-db", "ref_db",
     "-outfmt", "5",           # XML output for NCBIXML parsing
     "-evalue", "1e-5",
     "-num_threads", "4",
     "-out", "blast_results.xml"],
    check=True
)

# Step 3: Parse XML results
with open("blast_results.xml") as fh:
    for blast_rec in NCBIXML.parse(fh):
        print(f"Query: {blast_rec.query_id}")
        for aln in blast_rec.alignments[:3]:
            hsp = aln.hsps[0]
            print(f"  Hit: {aln.hit_id}, E={hsp.expect:.2e}, Id={hsp.identities}/{hsp.align_length}")

Module 5: Phylo — Tree Parsing, Manipulation, and Visualization

Bio.Phylo reads Newick, Nexus, PhyloXML, and NeXML formats. Trees are represented as Clade objects with nested children. Key methods: find_clades() (generator over nodes), common_ancestor(), distance(), root_with_outgroup(), and prune(). Visualization uses matplotlib.

from Bio import Phylo
import io

# Parse a Newick tree from a string
newick = "((Homo_sapiens:0.01, Pan_troglodytes:0.012):0.08, (Mus_musculus:0.15, Rattus_norvegicus:0.14):0.12, Drosophila_melanogaster:0.85);"
tree = Phylo.read(io.StringIO(newick), "newick")

print(f"Number of terminals: {tree.count_terminals()}")
print(f"Total branch length: {tree.total_branch_length():.3f}")
print(f"Terminals: {[t.name for t in tree.get_terminals()]}")

# Root with outgroup and compute distances
tree.root_with_outgroup("Drosophila_melanogaster")
human = tree.find_any("Homo_sapiens")
chimp = tree.find_any("Pan_troglodytes")
mouse = tree.find_any("Mus_musculus")
print(f"\nDistance Human–Chimp: {tree.distance(human, chimp):.4f}")
print(f"Distance Human–Mouse: {tree.distance(human, mouse):.4f}")

# Traverse all internal nodes
for clade in tree.find_clades(order="level"):
    if not clade.is_terminal():
        children = [c.name or "internal" for c in clade.clades]
        print(f"  Internal node → {children}")
from Bio import Phylo
import matplotlib.pyplot as plt
import io

newick = "((Homo_sapiens:0.01, Pan_troglodytes:0.012):0.08, (Mus_musculus:0.15, Rattus_norvegicus:0.14):0.12, Drosophila_melanogaster:0.85);"
tree = Phylo.read(io.StringIO(newick), "newick")
tree.root_with_outgroup("Drosophila_melanogaster")
tree.ladderize()   # sort clades by size for a clean layout

# Annotate bootstrap support (if present in node labels)
for clade in tree.find_clades():
    if clade.confidence is not None:
        clade.name = f"{clade.confidence:.0f}"

fig, ax = plt.subplots(figsize=(8, 5))
Phylo.draw(tree, axes=ax, do_show=False, show_confidence=True)
ax.set_title("Primate + Outgroup Phylogeny")
plt.tight_layout()
plt.savefig("phylogeny.png", dpi=150, bbox_inches="tight")
print("Saved phylogeny.png")

Module 6: PairwiseAligner — Pairwise Sequence Alignment

Bio.Align.PairwiseAligner replaces the legacy pairwise2 module (deprecated since Biopython 1.80). It supports local and global alignment, gap open/extend penalties, and any substitution matrix from Bio.Align.substitution_matrices.

from Bio.Align import PairwiseAligner, substitution_matrices

# Global protein alignment with BLOSUM62
aligner = PairwiseAligner()
aligner.mode = "global"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -11
aligner.extend_gap_score = -1

seq1 = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY"
seq2 = "MTEYKLVVVGAVGVGKSALTIQLIQNHFVDEYDPTIEDSY"  # G12V variant

alignments = aligner.align(seq1, seq2)
best = alignments[0]
print(f"Score: {best.score:.1f}")
print(f"Score only (skips traceback, faster): {aligner.score(seq1, seq2):.1f}")
print(best)   # formatted alignment string

aligned1, aligned2 = best[0], best[1]   # aligned strings, gaps included
identity = sum(a == b for a, b in zip(aligned1, aligned2)) / best.shape[1]
print(f"Identity: {identity*100:.1f}%")
from Bio.Align import PairwiseAligner, substitution_matrices

# Local DNA alignment
aligner = PairwiseAligner()
aligner.mode = "local"
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -5
aligner.extend_gap_score = -0.5

query = "ACGTACGTACGT"
subject = "TTTTACGTACGTACGTTTTT"
alignments = list(aligner.align(query, subject))
print(f"Local alignments found: {len(alignments)}")
best = alignments[0]
print(f"Score: {best.score}")
print(f"Aligned region in subject: [{best.coordinates[1][0]}:{best.coordinates[1][-1]}]")
print(best)

# Batch pairwise identity matrix
seqs = ["ACGTACGT", "ACGTATGT", "TTGTACGT", "ACGTACGG"]
n = len(seqs)
aligner2 = PairwiseAligner(mode="global", match_score=1, mismatch_score=-1)
print("\nPairwise identity matrix:")
for i in range(n):
    row = []
    for j in range(n):
        score = aligner2.score(seqs[i], seqs[j])
        max_len = max(len(seqs[i]), len(seqs[j]))
        row.append(f"{score/max_len:.2f}")
    print("  " + "  ".join(row))

Key Concepts

SeqRecord and Feature Coordinates

SeqRecord stores sequence metadata and a list of SeqFeature objects. Features use zero-based, half-open FeatureLocation(start, end, strand) — the same convention as Python slicing. Use feature.location.extract(record.seq) to obtain the strand-correct subsequence. Compound locations (e.g., spliced exons) use CompoundLocation.

from Bio import SeqIO

# Extract all CDS protein translations from a GenBank record
for rec in SeqIO.parse("gene.gb", "genbank"):
    for feat in rec.features:
        if feat.type == "CDS":
            gene = feat.qualifiers.get("gene", ["?"])[0]
            product = feat.qualifiers.get("product", ["unknown"])[0]
            cds_nt = feat.location.extract(rec.seq)
            protein = cds_nt.translate(to_stop=True)
            print(f"{gene} ({product}): {len(protein)} aa — {protein[:10]}...")
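
The compound-location case mentioned above can be sketched with a fully synthetic example (coordinates and sequences are made up for illustration):

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation, CompoundLocation

# Synthetic "gene": exon1 (0-6), intron (6-12), exon2 (12-18)
genomic = Seq("ATGAAA" + "GTAAGT" + "CCCGGG")
exon1 = FeatureLocation(0, 6, strand=+1)
exon2 = FeatureLocation(12, 18, strand=+1)
cds = SeqFeature(CompoundLocation([exon1, exon2]), type="CDS")

# extract() concatenates the exon pieces in order, skipping the intron
spliced = cds.location.extract(genomic)
print(spliced)              # ATGAAACCCGGG
print(spliced.translate())  # MKPG
```

On a minus-strand CompoundLocation, extract() also reverse-complements each part, so the same call always yields the coding-strand sequence.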

Phylo Clade Objects

A Clade is both a node and the subtree rooted at that node. Terminal clades (leaves) have .name set; internal clades may have .confidence (bootstrap support) and .branch_length. Use tree.find_clades() with terminal=True/False to filter. Phylo.read() returns a Tree wrapper whose root Clade is tree.root.
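
A minimal traversal illustrating these attributes, using a toy Newick string (names and branch lengths are placeholders):

```python
import io
from Bio import Phylo

# The numeric internal-node label "85" is parsed as bootstrap confidence
tree = Phylo.read(io.StringIO("((A:1,B:2)85:0.5,C:3);"), "newick")

# Terminal clades carry .name
leaf_names = [c.name for c in tree.find_clades(terminal=True)]
print(leaf_names)                    # ['A', 'B', 'C']

# Internal clades may carry .confidence and .branch_length
for clade in tree.find_clades(terminal=False):
    print(clade.confidence, clade.branch_length)

# The root is itself a Clade, and every Clade is the subtree below it
print(tree.root.count_terminals())   # 3
```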

Common Workflows

Workflow 1: Gene Family Phylogeny — NCBI Download → MUSCLE Alignment → Tree

Goal: Download all BRCA1 orthologs from Vertebrata, align with MUSCLE, and build a neighbor-joining tree.

import subprocess
import time
from Bio import Entrez, SeqIO, Phylo, AlignIO
from Bio.Align import MultipleSeqAlignment
import matplotlib.pyplot as plt

Entrez.email = "your.email@example.com"

# Step 1: Search NCBI Protein for BRCA1 orthologs
handle = Entrez.esearch(
    db="protein",
    term="BRCA1[Gene] AND Vertebrata[Organism] AND RefSeq[Filter]",
    retmax=20
)
results = Entrez.read(handle); handle.close()
ids = results["IdList"]
print(f"Found {results['Count']} sequences, using {len(ids)}")

# Step 2: Fetch sequences in FASTA format
handle = Entrez.efetch(db="protein", id=",".join(ids), rettype="fasta", retmode="text")
with open("brca1_orthologs.fasta", "w") as fh:
    fh.write(handle.read())
handle.close()
print(f"Saved {len(ids)} sequences to brca1_orthologs.fasta")
time.sleep(1)

# Step 3: Align with MUSCLE (must be installed: conda install -c bioconda muscle)
subprocess.run(
    ["muscle", "-align", "brca1_orthologs.fasta", "-output", "brca1_aligned.fasta"],
    check=True
)
alignment = AlignIO.read("brca1_aligned.fasta", "fasta")
print(f"Alignment: {len(alignment)} sequences × {alignment.get_alignment_length()} columns")

# Step 4: Build a neighbor-joining tree using Bio.Phylo + distance matrix
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

calculator = DistanceCalculator("identity")
constructor = DistanceTreeConstructor(calculator, method="nj")
tree = constructor.build_tree(alignment)   # computes the distance matrix internally

# Step 5: Root and save
tree.root_at_midpoint()
tree.ladderize()
Phylo.write(tree, "brca1_nj.nwk", "newick")
print("Saved NJ tree to brca1_nj.nwk")

# Step 6: Visualize
fig, ax = plt.subplots(figsize=(10, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("BRCA1 Ortholog NJ Tree")
plt.tight_layout()
plt.savefig("brca1_tree.png", dpi=150, bbox_inches="tight")
print("Saved brca1_tree.png")

Workflow 2: BLAST Pipeline — FASTA Query → Remote BLAST → Fetch Top Hits

Goal: Take a query FASTA, BLAST against NCBI nr, filter significant hits, retrieve full GenBank records for the top matches, and write a summary CSV.

import csv
import time
from Bio import SeqIO, Entrez
from Bio.Blast import NCBIWWW, NCBIXML

Entrez.email = "your.email@example.com"

# Step 1: Load query sequence
query_record = SeqIO.read("query.fasta", "fasta")
print(f"Query: {query_record.id} ({len(query_record.seq)} aa)")

# Step 2: Remote BLAST against nr protein database
print("Submitting BLAST job (may take 1-5 minutes)...")
result_handle = NCBIWWW.qblast(
    "blastp", "nr", str(query_record.seq),
    hitlist_size=20,
    expect=1e-5,
    word_size=6,
    matrix_name="BLOSUM62"
)

# Step 3: Parse results and filter
E_THRESH = 1e-5
MIN_IDENTITY = 50.0
top_hits = []

blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect > E_THRESH:
            continue
        identity_pct = hsp.identities / hsp.align_length * 100
        if identity_pct < MIN_IDENTITY:
            continue
        accession = alignment.accession
        top_hits.append({
            "accession": accession,
            "title": alignment.title[:80],
            "length": alignment.length,
            "score": hsp.score,
            "evalue": hsp.expect,
            "identity_pct": round(identity_pct, 1),
            "coverage": round(hsp.align_length / blast_record.query_length * 100, 1),
        })

print(f"Filtered to {len(top_hits)} significant hits")

# Step 4: Fetch full GenBank records for top 5 hits
accessions = [h["accession"] for h in top_hits[:5]]
time.sleep(1)
handle = Entrez.efetch(db="protein", id=",".join(accessions), rettype="gb", retmode="text")
gb_records = list(SeqIO.parse(handle, "genbank"))
handle.close()
SeqIO.write(gb_records, "top_blast_hits.gb", "genbank")
print(f"Saved {len(gb_records)} full GenBank records to top_blast_hits.gb")

# Step 5: Write CSV summary
with open("blast_summary.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=top_hits[0].keys())
    writer.writeheader()
    writer.writerows(top_hits)
print(f"Saved blast_summary.csv ({len(top_hits)} hits)")

Workflow 3: FASTQ Quality Filter and Format Pipeline

Goal: Read a FASTQ file, filter reads by mean quality and length, and output a FASTA for downstream alignment.

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

input_fastq = "raw_reads.fastq"
output_fasta = "filtered_reads.fasta"
MIN_QUAL = 20     # mean Phred quality
MIN_LEN = 100     # minimum read length
MAX_LEN = 300     # maximum read length

def passes_qc(record, min_q=MIN_QUAL, min_l=MIN_LEN, max_l=MAX_LEN):
    quals = record.letter_annotations["phred_quality"]
    mean_q = sum(quals) / len(quals)
    length = len(record.seq)
    return mean_q >= min_q and min_l <= length <= max_l

kept = 0
total = 0
with open(output_fasta, "w") as out_fh:
    for record in SeqIO.parse(input_fastq, "fastq"):
        total += 1
        if passes_qc(record):
            # Convert FASTQ → FASTA (drops quality scores)
            SeqIO.write(record, out_fh, "fasta")
            kept += 1

print(f"Total reads: {total:,}")
print(f"Passed QC: {kept:,} ({kept/total*100:.1f}%)")
print(f"Saved to: {output_fasta}")

Key Parameters

| Parameter | Module / Function | Default | Range / Options | Effect |
|---|---|---|---|---|
| retmax | Entrez.esearch() | 20 | 1–10000 | Max IDs returned per search; use the history server for >10000 |
| hitlist_size | NCBIWWW.qblast() | 50 | 1–5000 | Max BLAST alignments returned |
| expect | NCBIWWW.qblast() | 10.0 | 1e-100–1000 | E-value cutoff for BLAST hit reporting |
| matrix_name | NCBIWWW.qblast() | "BLOSUM62" | "BLOSUM45", "BLOSUM80", "PAM250" | Substitution matrix; BLOSUM62 for general use |
| mode | PairwiseAligner | "global" | "global", "local" | Needleman-Wunsch (global) vs Smith-Waterman (local) |
| open_gap_score | PairwiseAligner | 0 | Negative float | Penalty for opening a gap; increase the magnitude to penalize gaps more |
| extend_gap_score | PairwiseAligner | 0 | Negative float | Per-residue gap extension penalty |
| method | DistanceTreeConstructor | "nj" | "nj", "upgma" | NJ (unrooted) vs UPGMA (rooted, assumes a molecular clock) |
| format | Phylo.read/write() | required | "newick", "nexus", "phyloxml" | Tree file format |
| rettype | Entrez.efetch() | required | "fasta", "gb", "xml" | Record format returned; pair with the matching SeqIO format string |
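
To see how the gap parameters above interact, here is a small sketch comparing lenient and strict gap scoring on toy sequences (all values are illustrative, not recommendations):

```python
from Bio.Align import PairwiseAligner

seq1, seq2 = "ACGTTTACGT", "ACGTACGT"   # seq2 lacks the internal TT

# Cheap gaps: the aligner prefers opening a 2-column gap
lenient = PairwiseAligner(mode="global", match_score=1, mismatch_score=-1,
                          open_gap_score=-0.5, extend_gap_score=-0.1)

# Expensive gaps: the unavoidable gap dominates the score
strict = PairwiseAligner(mode="global", match_score=1, mismatch_score=-1,
                         open_gap_score=-10, extend_gap_score=-5)

# A gap of length L costs open_gap_score + (L-1) * extend_gap_score
print(f"lenient score: {lenient.score(seq1, seq2):.1f}")   # 8 matches - 0.6 = 7.4
print(f"strict score:  {strict.score(seq1, seq2):.1f}")    # 8 matches - 15 = -7.0
```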

Best Practices

  1. Always set Entrez.email and respect rate limits. NCBI blocks IPs that exceed 3 req/s without a key. Add time.sleep(0.4) between requests in loops, or use Entrez.api_key for 10 req/s.

    from Bio import Entrez
    import time
    Entrez.email = "your.email@institution.edu"
    Entrez.api_key = "YOUR_API_KEY"    # from https://www.ncbi.nlm.nih.gov/account/
    # In any batch loop: time.sleep(0.11)  # ~9 req/s with key
    
  2. Use SeqIO.index() for large FASTA files instead of list(SeqIO.parse(...)). Loading all records into a list consumes O(N) memory; index() reads only the byte offsets and fetches records on demand.

    # Bad for large files: loads everything into RAM
    # records = {r.id: r for r in SeqIO.parse("genome.fasta", "fasta")}
    
    # Good: on-disk index, O(1) access
    from Bio import SeqIO
    idx = SeqIO.index("genome.fasta", "fasta")
    rec = idx["chr22"]          # fetches only this record
    
  3. Use PairwiseAligner, not the legacy pairwise2. Bio.pairwise2 is deprecated since Biopython 1.80 and will be removed. PairwiseAligner is faster, supports substitution matrices directly, and returns Alignment objects with coordinate arrays.
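
A migration sketch for a typical legacy call (the commented pairwise2 line is shown only for comparison; the score values are arbitrary examples):

```python
from Bio.Align import PairwiseAligner

# Legacy (deprecated):
#   from Bio import pairwise2
#   alignments = pairwise2.align.globalms("ACGT", "ACGA", 2, -1, -5, -0.5)

# Equivalent with PairwiseAligner: the four globalms numbers map to
# match, mismatch, gap-open, and gap-extend scores
aligner = PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -5
aligner.extend_gap_score = -0.5

best = aligner.align("ACGT", "ACGA")[0]
print(best.score)   # 5.0  (3 matches x 2 + 1 mismatch x -1)
print(best)
```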

  4. Close Entrez handles immediately after reading. Handles are HTTP connections; leaving them open risks timeouts and resource exhaustion.

    handle = Entrez.efetch(db="nucleotide", id="NM_007294", rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()    # do not skip this
    
  5. Prefer local BLAST for batch searches. Remote NCBIWWW.qblast() is suitable for ad hoc queries but can queue for minutes on NCBI servers. For screening >100 sequences, build a local BLAST+ database with makeblastdb and call blastp/blastn via subprocess with -outfmt 5 (XML) for NCBIXML parsing.

  6. Root trees before measuring distances. tree.distance() sums branch lengths along the path between two nodes. On an unrooted tree the path is still unique, but midpoint- or outgroup-rooting makes biological sense for visualizations and clade assertions.
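
A quick sanity check of the point above — rerooting changes the tree's orientation but leaves path lengths intact (toy Newick string, arbitrary branch lengths):

```python
import io
from Bio import Phylo

tree = Phylo.read(io.StringIO("((A:1,B:2):1,(C:3,D:8):1);"), "newick")

a, b = tree.find_any("A"), tree.find_any("B")
before = tree.distance(a, b)

tree.root_at_midpoint()   # root halfway along the longest leaf-to-leaf path

# Re-find the clades after rerooting, then measure again
a, b = tree.find_any("A"), tree.find_any("B")
after = tree.distance(a, b)
print(before, after)      # identical: rerooting preserves path lengths
```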

  7. Strip alignment gaps before building SeqRecord collections. BLAST and alignment results may include gap characters (-). Seq operations like .translate() raise errors on gapped sequences; strip gaps with seq.replace("-", "") (the older ungap() method is deprecated).
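
A tiny sketch of the gap-stripping step (assumes a Biopython recent enough that Seq supports replace(), as recommended above):

```python
from Bio.Seq import Seq

gapped = Seq("ATG---AAACCC")
clean = gapped.replace("-", "")   # Seq mirrors str.replace
print(clean)                      # ATGAAACCC
print(clean.translate())          # MKP
```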

Common Recipes

Recipe: Batch Entrez Fetch with History Server

When to use: Downloading more than 500 records from NCBI — avoids URL length limits and keeps search results server-side.

from Bio import Entrez, SeqIO
import time

Entrez.email = "your.email@example.com"

# Search and store results on NCBI history server
handle = Entrez.esearch(db="nucleotide", term="16S rRNA[Gene] AND Bacteria[Organism]",
                        retmax=500, usehistory="y")
results = Entrez.read(handle); handle.close()
webenv = results["WebEnv"]
query_key = results["QueryKey"]
count = int(results["Count"])
print(f"Total: {count} sequences")

# Fetch in batches of 200
batch_size = 200
all_records = []
for start in range(0, min(count, 1000), batch_size):
    handle = Entrez.efetch(
        db="nucleotide", rettype="fasta", retmode="text",
        retstart=start, retmax=batch_size,
        webenv=webenv, query_key=query_key
    )
    batch = list(SeqIO.parse(handle, "fasta"))
    handle.close()
    all_records.extend(batch)
    print(f"  Fetched {len(all_records)}/{min(count, 1000)}")
    time.sleep(0.5)

SeqIO.write(all_records, "16S_sequences.fasta", "fasta")
print(f"Saved {len(all_records)} sequences")

Recipe: Parse GFF3 with Sequence Features

When to use: Extract gene/CDS sequences from a GFF3 annotation paired with a reference FASTA.

from Bio import SeqIO

# Load genome FASTA into indexed dict
genome = SeqIO.to_dict(SeqIO.parse("genome.fasta", "fasta"))

# Parse GFF3 manually (Biopython does not have a native GFF3 parser;
# use the gffutils package for complex queries, or parse directly for simple cases)
genes = []
with open("annotation.gff3") as fh:
    for line in fh:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.strip().split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        chrom, _, feat_type, start, end, _, strand, _, attrs = fields
        start, end = int(start) - 1, int(end)   # GFF3 is 1-based
        if chrom not in genome:
            continue
        seq = genome[chrom].seq[start:end]
        if strand == "-":
            seq = seq.reverse_complement()
        attr_dict = dict(kv.split("=") for kv in attrs.strip().split(";") if "=" in kv)
        gene_id = attr_dict.get("gene_id", attr_dict.get("ID", "unknown"))
        genes.append((gene_id, seq, strand))

print(f"Extracted {len(genes)} CDS features")
for gid, seq, strand in genes[:3]:
    print(f"  {gid} ({strand}): {len(seq)} bp — protein: {seq.translate(to_stop=True)[:8]}...")

Recipe: Compute a Pairwise Identity Matrix from a Multiple Alignment

When to use: Quickly assess sequence diversity within an alignment file (FASTA, PHYLIP, Clustal) and identify outlier sequences.

from Bio import AlignIO
import numpy as np

alignment = AlignIO.read("aligned_sequences.fasta", "fasta")
n = len(alignment)
names = [rec.id for rec in alignment]
length = alignment.get_alignment_length()

# Build identity matrix
identity_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i == j:
            identity_matrix[i, j] = 1.0
            continue
        matches = sum(
            a == b and a != "-"
            for a, b in zip(str(alignment[i].seq), str(alignment[j].seq))
        )
        aligned_cols = sum(a != "-" and b != "-"
                           for a, b in zip(str(alignment[i].seq), str(alignment[j].seq)))
        identity_matrix[i, j] = matches / aligned_cols if aligned_cols else 0.0

print("Pairwise identity matrix:")
print(f"{'':20s} " + "  ".join(f"{nm[:8]:>8s}" for nm in names))  # avoid shadowing n
for i, name in enumerate(names):
    row = "  ".join(f"{identity_matrix[i,j]*100:8.1f}" for j in range(n))
    print(f"{name[:20]:20s} {row}")

# Find most divergent pair
min_id = np.min(identity_matrix[identity_matrix > 0])
idx = np.unravel_index(np.argmin(np.where(identity_matrix > 0, identity_matrix, 1)), identity_matrix.shape)
print(f"\nMost divergent pair: {names[idx[0]]} vs {names[idx[1]]} ({min_id*100:.1f}%)")

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| urllib.error.HTTPError: 429 Too Many Requests | Exceeding the NCBI rate limit (3 req/s) | Add time.sleep(0.4) between calls, or register a free API key at https://www.ncbi.nlm.nih.gov/account/ for 10 req/s |
| RuntimeError: Too many requests were made without pausing | NCBI detects rapid-fire requests | Set Entrez.email (required), reduce loop frequency, and batch IDs into comma-joined strings in a single efetch call |
| Bio.Application.ApplicationError: blastall not found | Legacy blastall replaced by BLAST+ | Replace NcbiblastpCommandline with subprocess.run(["blastp", ...]) and BLAST+ tools |
| AttributeError: 'PairwiseAligner' object has no attribute 'align' | Biopython < 1.78 | Upgrade: pip install --upgrade biopython (PairwiseAligner requires ≥ 1.72; stable from 1.78) |
| Bio.pairwise2 deprecation warning | Using the legacy pairwise2 API | Replace with from Bio.Align import PairwiseAligner (removed in Biopython 1.84+) |
| ValueError: Sequence contains letters not in the alphabet | Non-standard characters (e.g., N, ambiguity codes) in the sequence before translate() | Strip or replace ambiguous bases; use seq.translate(table=1), which handles ambiguous codons |
| Empty BLAST results / StopIteration from NCBIXML.read() | Empty or malformed XML; network timeout; query too short | Check the query length (>10 aa recommended); switch from .read() to .parse() and check the length of blast_record.alignments |
| Phylo.draw() hangs or produces a blank plot | Missing matplotlib backend | Call import matplotlib; matplotlib.use("Agg") before plotting in headless environments |
| TreeConstruction distance matrix dimension mismatch | Alignment contains duplicate IDs | Deduplicate SeqRecord IDs before building the alignment: {r.id: r for r in records}.values() |

Related Skills

  • biopython-molecular-biology — restriction digestion, PCR primer design, protein structure analysis (Bio.PDB), molecular weight and Tm calculations; complement to this skill
  • pysam-genomic-files — SAM/BAM/CRAM alignment file access, pileup, and region queries; use when working with read alignments
  • scikit-bio — ecological diversity statistics (UniFrac, Bray-Curtis), ordination (PCoA), and microbiome analysis; more specialized than Biopython's phylogenetics for diversity analyses
  • gget-genomic-databases — one-liner gene lookups, BLAST, and database queries without setting up an Entrez pipeline
  • etetoolkit — advanced phylogenetic tree visualization, annotation with NCBI taxonomy, and publication-quality tree rendering
