Medical-research-skills biopython
A comprehensive toolbox for computational molecular biology; use it when you need programmatic sequence/structure parsing, batch bioinformatics pipelines, or automated NCBI/BLAST workflows.
install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/biopython" ~/.claude/skills/aipoch-medical-research-skills-biopython && rm -rf "$T"
manifest: scientific-skills/Data Analysis/biopython/SKILL.md
When to Use
Use this skill when you need to:
- Batch-process DNA/RNA/protein sequences (translation, reverse complement, statistics) as part of a custom pipeline.
- Parse, validate, convert, or stream large bioinformatics files (FASTA/FASTQ/GenBank/PDB/mmCIF) without loading everything into memory.
- Programmatically query and download records from NCBI (GenBank, PubMed, Gene, Protein) via Bio.Entrez, respecting rate limits.
- Automate BLAST searches (web or local) and parse results to extract top hits and metadata.
- Build or manipulate phylogenetic trees from alignments or distance matrices (e.g., NJ trees) for downstream analysis.
Note: For quick one-off queries, tools like gget may be more convenient; for multi-service API aggregation, bioservices may be a better fit.
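As a minimal sketch of the first two use cases above (batch statistics and streaming parsing), the snippet below walks a FASTA handle record by record and yields a small summary for each. The helper name summarize_fasta and the in-memory demo data are assumptions made for illustration; in a real pipeline you would pass an open file handle.

```python
from io import StringIO
from typing import Iterator, TextIO

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction


def summarize_fasta(handle: TextIO) -> Iterator[dict]:
    """Yield one summary per record (hypothetical helper).

    SeqIO.parse is an iterator, so only one record is held in memory at a time.
    """
    for record in SeqIO.parse(handle, "fasta"):
        seq = record.seq
        yield {
            "id": record.id,
            "length": len(seq),
            "gc": gc_fraction(seq),
            "revcomp": str(seq.reverse_complement()),
            # Explicit table (1 = NCBI standard code) keeps results reproducible.
            "protein": str(seq.translate(table=1)),
        }


# Hypothetical demo data standing in for a large FASTA file.
demo = StringIO(">a\nATGGCC\n>b\nATGAAATAG\n")
summaries = list(summarize_fasta(demo))
for row in summaries:
    print(row)
```

Because the function is a generator, the caller decides whether to collect the results in a list (as the demo does) or write each summary out as it arrives.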
Key Features
- Sequence objects and utilities: Bio.Seq, Bio.SeqRecord, Bio.SeqUtils (GC fraction, molecular weight, translation, etc.).
- File I/O and format conversion: Bio.SeqIO, Bio.AlignIO for FASTA/FASTQ/GenBank and alignment formats.
- NCBI access: Bio.Entrez for esearch, efetch, elink, and structured parsing via Entrez.read.
- BLAST: Bio.Blast.NCBIWWW for remote BLAST and Bio.Blast.NCBIXML for XML parsing.
- Structural bioinformatics: Bio.PDB for PDB/mmCIF parsing, hierarchy traversal, and geometry calculations.
- Phylogenetics: Bio.Phylo and Bio.Phylo.TreeConstruction for tree I/O, distances, and construction.
Reference guides (if present in this repository) can be consulted for deeper module-specific patterns:
- references/sequence_io.md
- references/alignment.md
- references/databases.md
- references/blast.md
- references/structure.md
- references/phylogenetics.md
- references/advanced.md
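To make the phylogenetics feature concrete, here is a small sketch that builds a neighbor-joining tree from a precomputed distance matrix with Bio.Phylo.TreeConstruction and serializes it as Newick. The taxon names and distances are invented for illustration; with real data the matrix would typically come from a DistanceCalculator applied to an alignment.

```python
from io import StringIO

from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

# Hypothetical pairwise distances: lower-triangular rows, each ending
# with the zero diagonal entry, in the same order as `names`.
names = ["A", "B", "C", "D"]
matrix = [
    [0.0],
    [0.2, 0.0],
    [0.6, 0.6, 0.0],
    [0.7, 0.7, 0.3, 0.0],
]
dm = DistanceMatrix(names=names, matrix=matrix)

# Neighbor joining directly on the distance matrix.
tree = DistanceTreeConstructor().nj(dm)
leaf_names = sorted(leaf.name for leaf in tree.get_terminals())
print(leaf_names)

# Tree I/O: write Newick to an in-memory buffer (a file path also works).
buffer = StringIO()
Phylo.write(tree, buffer, "newick")
newick = buffer.getvalue()
print(newick)
```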
Dependencies
- Python >= 3.9 (Biopython 1.85 supports Python 3.9 and later)
- biopython==1.85
- numpy>=1.20 (required by Biopython)
Install:
python -m pip install "biopython==1.85" "numpy>=1.20"
Example Usage
A complete, runnable example that:
- parses a FASTA file,
- computes GC fraction,
- runs a remote BLAST (optional),
- fetches the top hit from NCBI,
- prints basic results.
Create example_biopython_pipeline.py:

    from __future__ import annotations

    import os
    import time
    from typing import Optional

    from Bio import Entrez, SeqIO
    from Bio.SeqUtils import gc_fraction

    # Optional BLAST (remote). Comment out if you do not want network calls.
    from Bio.Blast import NCBIWWW, NCBIXML


    def configure_entrez() -> None:
        """
        NCBI requires an email. An API key increases rate limits.
        Set these via environment variables to avoid hardcoding secrets.
        """
        email = os.environ.get("NCBI_EMAIL")
        if not email:
            raise RuntimeError("Set NCBI_EMAIL env var (required by NCBI). Example: export NCBI_EMAIL='you@org.org'")
        Entrez.email = email

        api_key = os.environ.get("NCBI_API_KEY")
        if api_key:
            Entrez.api_key = api_key


    def read_first_fasta_record(path: str):
        with open(path, "r", encoding="utf-8") as handle:
            return next(SeqIO.parse(handle, "fasta"))


    def blast_top_accession(sequence: str, program: str = "blastn", database: str = "nt") -> Optional[str]:
        """
        Remote BLAST can be slow and rate-limited.
        For large-scale BLAST, prefer local BLAST+.
        """
        result_handle = NCBIWWW.qblast(program, database, sequence)
        blast_record = NCBIXML.read(result_handle)

        if not blast_record.alignments:
            return None

        # Many BLAST titles include multiple identifiers; accession is usually available directly.
        return blast_record.alignments[0].accession


    def fetch_fasta_by_accession(accession: str) -> str:
        with Entrez.efetch(db="nucleotide", id=accession, rettype="fasta", retmode="text") as handle:
            return handle.read()


    def main() -> None:
        configure_entrez()

        record = read_first_fasta_record("input.fasta")
        seq = record.seq

        print(f"ID: {record.id}")
        print(f"Length: {len(seq)}")
        print(f"GC fraction: {gc_fraction(seq):.2%}")

        # Be polite to NCBI services in batch workflows.
        time.sleep(0.34)

        top_acc = blast_top_accession(str(seq))
        if not top_acc:
            print("No BLAST hits found.")
            return

        print(f"Top BLAST accession: {top_acc}")

        time.sleep(0.34)
        fasta_text = fetch_fasta_by_accession(top_acc)
        print("Top hit FASTA:")
        print(fasta_text)


    if __name__ == "__main__":
        main()
Run:

    export NCBI_EMAIL="your.email@example.com"
    # export NCBI_API_KEY="your_ncbi_api_key"  # optional
    python example_biopython_pipeline.py
Provide an input.fasta in the same directory, e.g.:

    >demo
    ATCGATCGATCGATCGATCG
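The pipeline above simply takes the first alignment as the top hit. A stricter variant might filter hits by HSP metrics first. The sketch below assumes the record shape produced by NCBIXML.read (alignments with an accession and a list of hsps, each exposing expect, identities, and align_length); the function name filter_hits and the thresholds are illustrative, and the demo uses stand-in objects so it runs without a network call.

```python
from types import SimpleNamespace
from typing import List, Tuple


def filter_hits(blast_record, max_evalue: float = 1e-5,
                min_identity: float = 0.9) -> List[Tuple[str, float, float]]:
    """Return (accession, e-value, identity fraction) for HSPs passing both cutoffs.

    Hypothetical helper; thresholds are example values, not recommendations.
    """
    kept = []
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            identity = hsp.identities / hsp.align_length
            if hsp.expect <= max_evalue and identity >= min_identity:
                kept.append((alignment.accession, hsp.expect, identity))
    return kept


# Stand-in objects mimicking a parsed BLAST record (made-up values).
fake_record = SimpleNamespace(alignments=[
    SimpleNamespace(accession="ACC_GOOD", hsps=[
        SimpleNamespace(expect=1e-30, identities=98, align_length=100),
    ]),
    SimpleNamespace(accession="ACC_WEAK", hsps=[
        SimpleNamespace(expect=0.5, identities=40, align_length=100),
    ]),
])

hits = filter_hits(fake_record)
print(hits)
```

With a real search, the same function could be called on the object returned by NCBIXML.read(result_handle) before picking an accession to fetch.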
Implementation Details
- Streaming I/O for large datasets: Prefer iterator-based parsing (SeqIO.parse) to avoid loading entire files into memory. Use SeqIO.read only when exactly one record is expected.
- Entrez configuration and rate limits:
  - Always set Entrez.email (NCBI requirement).
  - Optionally set Entrez.api_key to increase request limits.
  - In batch jobs, add delays (e.g., time.sleep(0.34) as a conservative baseline) and implement retries for transient HTTP failures.
- BLAST considerations:
  - NCBIWWW.qblast(...) is convenient but can be slow and is not ideal for high-throughput workloads.
  - Parse results with NCBIXML.read(...) (single record) or NCBIXML.parse(...) (multiple records).
  - Filter hits by HSP metrics (e-value, identity) by iterating alignment.hsps.
- Sequence statistics and transformations:
  - Use Bio.SeqUtils.gc_fraction(seq) for GC fraction (returns 0–1).
  - Use seq.translate(table=...) with the correct genetic code table for reproducibility.
- Structure parsing (if used):
  - Use Bio.PDB.PDBParser(QUIET=True) to suppress warnings when appropriate.
  - Navigate the SMCRA hierarchy (Structure → Model → Chain → Residue → Atom) for robust traversal and geometry calculations.
- Reproducibility:
  - Record key parameters (file formats, translation table, BLAST program/database, e-value thresholds, NCBI query terms).
  - Cache downloaded records when iterating to avoid repeated network calls.
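The retry and caching advice above can be sketched with the standard library alone. The names fetch_with_retry and CachedFetcher are hypothetical helpers, not part of Biopython; the fetch callable is assumed to be something like a wrapper around Entrez.efetch. Catching OSError is a deliberate simplification here: urllib's URLError and HTTPError are subclasses of it. The demo substitutes a fake fetcher that fails once, so it runs offline.

```python
import time
from typing import Callable, Dict


def fetch_with_retry(fetch: Callable[[str], str], key: str, retries: int = 3,
                     base_delay: float = 0.34,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch(key), retrying on OSError with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch(key)
        except OSError:
            if attempt == retries - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be >= 1")


class CachedFetcher:
    """Memoize results per key so repeated lookups skip the network."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> str:
        if key not in self._cache:
            self._cache[key] = self._fetch(key)
        return self._cache[key]


# Demo with a fake fetcher that fails on its first call (no network needed).
calls = {"n": 0}


def flaky_fetch(accession: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("simulated transient failure")
    return f">{accession}\nATCG\n"


delays = []  # Record requested delays instead of actually sleeping.
fetcher = CachedFetcher(lambda acc: fetch_with_retry(flaky_fetch, acc, sleep=delays.append))
first = fetcher.get("DEMO_1")
second = fetcher.get("DEMO_1")  # Served from the cache.
print(calls["n"], delays, first == second)
```

For long-running jobs, an on-disk cache keyed by accession would survive restarts; the in-memory dictionary is used here only to keep the sketch short.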