Medical-research-skills biopython

A comprehensive toolbox for computational molecular biology; use it when you need programmatic sequence/structure parsing, batch bioinformatics pipelines, or automated NCBI/BLAST workflows.

install

source · Clone the upstream repo

git clone https://github.com/aipoch/medical-research-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/biopython" ~/.claude/skills/aipoch-medical-research-skills-biopython && rm -rf "$T"

manifest: scientific-skills/Data Analysis/biopython/SKILL.md

source content

Source: https://github.com/aipoch/medical-research-skills

When to Use

Use this skill when you need to:

Batch-process DNA/RNA/protein sequences (translation, reverse complement, statistics) as part of a custom pipeline.
Parse, validate, convert, or stream large bioinformatics files (FASTA/FASTQ/GenBank/PDB/mmCIF) without loading everything into memory.
Programmatically query and download records from NCBI (GenBank, PubMed, Gene, Protein) via `Bio.Entrez`, respecting rate limits.
Automate BLAST searches (web or local) and parse results to extract top hits and metadata.
Build or manipulate phylogenetic trees from alignments or distance matrices (e.g., NJ trees) for downstream analysis.
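
As an illustration of the parse/validate/convert use case, here is a minimal sketch that converts FASTQ to FASTA with `SeqIO.convert`. The helper name `fastq_to_fasta` and the in-memory sample reads are assumptions made for this example; in a real pipeline the handles would be open files.

```python
import io

from Bio import SeqIO


def fastq_to_fasta(in_handle, out_handle) -> int:
    """Convert FASTQ records to FASTA and return how many were written."""
    # SeqIO.convert streams record by record and raises ValueError on
    # malformed FASTQ input (for example, mismatched sequence/quality lengths).
    return SeqIO.convert(in_handle, "fastq", out_handle, "fasta")


# Hypothetical sample reads, kept in memory so the sketch runs without files.
FASTQ_TEXT = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nGGCC\n+\nIIII\n"

out = io.StringIO()
n_written = fastq_to_fasta(io.StringIO(FASTQ_TEXT), out)
fasta_text = out.getvalue()
print(n_written)
print(fasta_text)
```

Because the conversion is streamed, the same call should scale to files that do not fit in memory.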

Note: For quick one-off queries, tools like gget may be more convenient; for multi-service API aggregation, bioservices may be a better fit.

Key Features

Sequence objects and utilities: `Bio.Seq`, `Bio.SeqRecord`, `Bio.SeqUtils` (GC fraction, molecular weight, translation, etc.).
File I/O and format conversion: `Bio.SeqIO`, `Bio.AlignIO` for FASTA/FASTQ/GenBank and alignment formats.
NCBI access: `Bio.Entrez` for `esearch`, `efetch`, `elink`, and structured parsing via `Entrez.read`.
BLAST: `Bio.Blast.NCBIWWW` for remote BLAST and `Bio.Blast.NCBIXML` for XML parsing.
Structural bioinformatics: `Bio.PDB` for PDB/mmCIF parsing, hierarchy traversal, and geometry calculations.
Phylogenetics: `Bio.Phylo` and `Bio.Phylo.TreeConstruction` for tree I/O, distances, and construction.
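
To make the phylogenetics entry concrete, the following sketch builds a neighbor-joining tree from a distance matrix with `Bio.Phylo.TreeConstruction`. The taxon names and distances are invented for illustration only.

```python
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

# Hypothetical taxa and pairwise distances.
# DistanceMatrix expects the lower triangle, diagonal included.
names = ["taxonA", "taxonB", "taxonC", "taxonD"]
lower_triangle = [
    [0.0],
    [0.2, 0.0],
    [0.5, 0.6, 0.0],
    [0.6, 0.7, 0.3, 0.0],
]
distance_matrix = DistanceMatrix(names, lower_triangle)

# Build the neighbor-joining tree and list its leaves.
tree = DistanceTreeConstructor().nj(distance_matrix)
leaf_names = sorted(leaf.name for leaf in tree.get_terminals())
print(leaf_names)

# Quick text rendering for inspection in a terminal.
Phylo.draw_ascii(tree)
```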

Reference guides (if present in this repository) can be consulted for deeper module-specific patterns:

`references/sequence_io.md`
`references/alignment.md`
`references/databases.md`
`references/blast.md`
`references/structure.md`
`references/phylogenetics.md`
`references/advanced.md`

Dependencies

Python >= 3.9 (Biopython 1.85 supports Python 3.9 and later)
`biopython==1.85`
`numpy>=1.20` (required by Biopython)

Install:

python -m pip install "biopython==1.85" "numpy>=1.20"

Example Usage

A complete, runnable example that:

parses a FASTA file,
computes GC fraction,
runs a remote BLAST (optional),
fetches the top hit from NCBI,
prints basic results.

Create `example_biopython_pipeline.py`:

from __future__ import annotations

import os
import time
from typing import Optional

from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction

# Optional BLAST (remote). Comment out if you do not want network calls.
from Bio.Blast import NCBIWWW, NCBIXML


def configure_entrez() -> None:
    """
    NCBI requires an email. An API key increases rate limits.
    Set these via environment variables to avoid hardcoding secrets.
    """
    email = os.environ.get("NCBI_EMAIL")
    if not email:
        raise RuntimeError("Set NCBI_EMAIL env var (required by NCBI). Example: export NCBI_EMAIL='you@org.org'")
    Entrez.email = email

    api_key = os.environ.get("NCBI_API_KEY")
    if api_key:
        Entrez.api_key = api_key


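def read_first_fasta_record(path: str):
    """Return the first record in a FASTA file; raise ValueError if it has none."""
    with open(path, "r", encoding="utf-8") as handle:
        # Default of None avoids a bare StopIteration on an empty file.
        record = next(SeqIO.parse(handle, "fasta"), None)
    if record is None:
        raise ValueError(f"No FASTA records found in {path!r}")
    return record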


def blast_top_accession(sequence: str, program: str = "blastn", database: str = "nt") -> Optional[str]:
    """
    Remote BLAST can be slow and rate-limited. For large-scale BLAST, prefer local BLAST+.
    """
    result_handle = NCBIWWW.qblast(program, database, sequence)
    blast_record = NCBIXML.read(result_handle)

    if not blast_record.alignments:
        return None

    # Many BLAST titles include multiple identifiers; accession is usually available directly.
    return blast_record.alignments[0].accession


def fetch_fasta_by_accession(accession: str) -> str:
    with Entrez.efetch(db="nucleotide", id=accession, rettype="fasta", retmode="text") as handle:
        return handle.read()


def main() -> None:
    configure_entrez()

    record = read_first_fasta_record("input.fasta")
    seq = record.seq

    print(f"ID: {record.id}")
    print(f"Length: {len(seq)}")
    print(f"GC fraction: {gc_fraction(seq):.2%}")

    # Be polite to NCBI services in batch workflows.
    time.sleep(0.34)

    top_acc = blast_top_accession(str(seq))
    if not top_acc:
        print("No BLAST hits found.")
        return

    print(f"Top BLAST accession: {top_acc}")

    time.sleep(0.34)
    fasta_text = fetch_fasta_by_accession(top_acc)
    print("Top hit FASTA:")
    print(fasta_text)


if __name__ == "__main__":
    main()

Run:

export NCBI_EMAIL="your.email@example.com"
# export NCBI_API_KEY="your_ncbi_api_key"  # optional
python example_biopython_pipeline.py

Provide an `input.fasta` in the same directory, e.g.:

>demo
ATCGATCGATCGATCGATCG

Implementation Details

Streaming I/O for large datasets: Prefer iterator-based parsing (`SeqIO.parse`) to avoid loading entire files into memory. Use `SeqIO.read` only when exactly one record is expected.
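
The streaming pattern can be sketched as a generator that filters records while they are parsed and hands them straight to `SeqIO.write`. The function name `iter_long_records`, the length threshold, and the sample sequences are assumptions for this example.

```python
import io
from typing import Iterator, TextIO

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord


def iter_long_records(handle: TextIO, min_length: int) -> Iterator[SeqRecord]:
    """Yield FASTA records at least min_length long, one at a time."""
    for record in SeqIO.parse(handle, "fasta"):
        if len(record.seq) >= min_length:
            yield record


# Hypothetical input, kept in memory so the sketch runs without files.
FASTA_TEXT = ">short\nACGT\n>long\nACGTACGTACGT\n"

out = io.StringIO()
# SeqIO.write accepts any iterable of records, so nothing is collected in a list.
n_kept = SeqIO.write(iter_long_records(io.StringIO(FASTA_TEXT), 10), out, "fasta")

# SeqIO.read raises ValueError unless the input holds exactly one record.
single = SeqIO.read(io.StringIO(out.getvalue()), "fasta")
print(n_kept, single.id, len(single.seq))
```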
Entrez configuration and rate limits:
- Always set `Entrez.email` (NCBI requirement).
- Optionally set `Entrez.api_key` to increase request limits (from 3 to 10 requests per second).
- In batch jobs, add delays (e.g., `time.sleep(0.34)` as a conservative baseline, which stays under the 3 requests per second allowed without an API key) and implement retries for transient HTTP failures.
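
One way to combine the delay and retry advice is a small wrapper around any network call. This is a sketch under stated assumptions, not part of Biopython: the name `call_with_retries`, its parameters, and the doubling backoff are all choices made for this example, and the demo uses a stand-in request so it runs offline.

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    *,
    max_attempts: int = 3,
    min_interval: float = 0.34,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), pausing before each attempt and retrying on URL/HTTP errors."""
    delay = min_interval
    for attempt in range(1, max_attempts + 1):
        sleep(delay)  # throttle every attempt, including the first
        try:
            return request()
        except urllib.error.URLError:  # HTTPError is a subclass of URLError
            if attempt == max_attempts:
                raise
            delay *= backoff  # wait longer before the next attempt
    raise ValueError("max_attempts must be at least 1")


# Demo with a stand-in request that fails twice, then succeeds (no network needed).
attempts = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise urllib.error.URLError("simulated transient failure")
    return "ok"


waits = []
result = call_with_retries(flaky_request, sleep=waits.append)
print(result, waits)
```

In a pipeline, the wrapped call would typically be a lambda around an Entrez request, for example `call_with_retries(lambda: Entrez.efetch(db="nucleotide", id=accession, rettype="fasta", retmode="text"))`.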
BLAST considerations:
- `NCBIWWW.qblast(...)` is convenient but can be slow and is not ideal for high-throughput workloads.
- Parse results with `NCBIXML.read(...)` (single record) or `NCBIXML.parse(...)` (multiple records).
- Filter hits by HSP metrics (e-value, identity) by iterating `alignment.hsps`.
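
The HSP filtering step might look like the sketch below. It relies only on the attributes that `NCBIXML` records expose (`alignments`, `hsps`, `expect`, `identities`, `align_length`, `accession`); the function name `iter_good_hits`, the default thresholds, and the stand-in record built from `SimpleNamespace` are assumptions so the example runs without a BLAST search.

```python
from types import SimpleNamespace
from typing import Iterator, Tuple


def iter_good_hits(
    blast_record,
    max_evalue: float = 1e-5,
    min_identity: float = 0.9,
) -> Iterator[Tuple[str, float, float]]:
    """Yield (accession, e-value, identity fraction) for HSPs passing both thresholds."""
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            identity = hsp.identities / hsp.align_length
            if hsp.expect <= max_evalue and identity >= min_identity:
                yield alignment.accession, hsp.expect, identity


# Stand-in for a parsed BLAST record; accessions and scores are invented.
demo_record = SimpleNamespace(
    alignments=[
        SimpleNamespace(
            accession="HIT_A",
            hsps=[SimpleNamespace(expect=1e-30, identities=98, align_length=100)],
        ),
        SimpleNamespace(
            accession="HIT_B",
            hsps=[SimpleNamespace(expect=0.5, identities=40, align_length=100)],
        ),
    ]
)

good_hits = list(iter_good_hits(demo_record))
print(good_hits)
```

With real data, `blast_record` would be the object returned by `NCBIXML.read(...)` or each item yielded by `NCBIXML.parse(...)`.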
Sequence statistics and transformations:
- Use `Bio.SeqUtils.gc_fraction(seq)` for GC fraction (returns 0–1).
- Use `seq.translate(table=...)` with the correct genetic code table for reproducibility.
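
A short sketch of why the table argument matters: the same nucleotide sequence translates differently under NCBI table 1 (standard) and table 2 (vertebrate mitochondrial). The sequence is an arbitrary example chosen to show the difference.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction

coding = Seq("ATGATATGA")

# Table 1 (standard): ATA codes for Ile, TGA is a stop codon.
standard = coding.translate(table=1)
# Table 2 (vertebrate mitochondrial): ATA codes for Met, TGA codes for Trp.
vert_mito = coding.translate(table=2)
print(standard, vert_mito)

# gc_fraction returns a fraction in [0, 1], not a percentage.
gc = gc_fraction(coding)
print(f"GC: {gc:.1%}")

revcomp = coding.reverse_complement()
print(revcomp)
```

Passing `table` explicitly, even when it is the default, keeps the choice visible in code review and in logs.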
Structure parsing (if used):
- Use `Bio.PDB.PDBParser(QUIET=True)` to suppress warnings when appropriate.
- Navigate the SMCRA hierarchy (Structure → Model → Chain → Residue → Atom) for robust traversal and geometry calculations.
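
The hierarchy traversal can be sketched as follows. The helpers `ca_atoms_by_chain` and `pdb_atom_line` are hypothetical names for this example, and the three-atom structure is synthetic, generated in code so that the fixed-width PDB columns are exact; a real workflow would pass a file path to `get_structure`.

```python
import io

from Bio.PDB import PDBParser


def ca_atoms_by_chain(structure):
    """Map chain ID to its C-alpha atoms, walking the first model only."""
    model = next(iter(structure))  # Structure -> Model
    result = {}
    for chain in model:  # Model -> Chain
        # Chain -> Residue -> Atom; skip residues without a CA atom.
        result[chain.id] = [residue["CA"] for residue in chain if "CA" in residue]
    return result


def pdb_atom_line(serial, name, resname, chain, resseq, x, y, z, element):
    """Format one fixed-width PDB ATOM line (demo helper, not a Biopython API)."""
    return (
        "ATOM".ljust(6) + f"{serial:>5d}" + " " + f"{name:<4s}" + " "
        + f"{resname:>3s}" + " " + chain + f"{resseq:>4d}" + " " * 4
        + f"{x:8.3f}{y:8.3f}{z:8.3f}" + f"{1.0:6.2f}{0.0:6.2f}"
        + " " * 10 + f"{element:>2s}"
    )


# Synthetic two-residue chain with invented coordinates.
demo_pdb = "\n".join([
    pdb_atom_line(1, " N", "ALA", "A", 1, 1.0, 1.0, 0.0, "N"),
    pdb_atom_line(2, " CA", "ALA", "A", 1, 0.0, 0.0, 0.0, "C"),
    pdb_atom_line(3, " CA", "GLY", "A", 2, 3.8, 0.0, 0.0, "C"),
]) + "\n"

structure = PDBParser(QUIET=True).get_structure("demo", io.StringIO(demo_pdb))
ca_atoms = ca_atoms_by_chain(structure)
first_ca, second_ca = ca_atoms["A"]

# Subtracting two Atom objects returns the distance between them.
distance = first_ca - second_ca
print(f"CA-CA distance: {distance:.2f}")
```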
Reproducibility:
- Record key parameters (file formats, translation table, BLAST program/database, e-value thresholds, NCBI query terms).
- Cache downloaded records when iterating to avoid repeated network calls.
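
The two reproducibility points can be sketched with the standard library alone: a file-backed cache keyed by accession, and a JSON record of run parameters. The names `cached_fetch` and `write_run_parameters`, the cache layout, and the parameter keys are assumptions for this example; the demo uses a fake fetch function so it runs offline.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(accession: str, fetch: Callable[[str], str], cache_dir: Path) -> str:
    """Return the record text for accession, calling fetch only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{accession}.fasta"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = fetch(accession)
    cache_file.write_text(text, encoding="utf-8")
    return text


def write_run_parameters(path: Path, **parameters) -> None:
    """Save the parameters of a run as sorted, indented JSON."""
    path.write_text(json.dumps(parameters, indent=2, sort_keys=True), encoding="utf-8")


# Demo: a fake fetch function that records how often it is called.
calls = []


def fake_fetch(accession: str) -> str:
    calls.append(accession)
    return f">{accession}\nACGT\n"


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "cache"
    first = cached_fetch("DEMO_ACC", fake_fetch, cache_dir)
    second = cached_fetch("DEMO_ACC", fake_fetch, cache_dir)  # served from disk

    params_path = Path(tmp) / "run_parameters.json"
    write_run_parameters(
        params_path,
        blast_program="blastn",
        blast_database="nt",
        translation_table=1,
        max_evalue=1e-5,
    )
    saved = json.loads(params_path.read_text(encoding="utf-8"))

print(len(calls), first == second, saved["blast_program"])
```

In the example pipeline, `fetch_fasta_by_accession` could be passed as the `fetch` argument so that repeated runs reuse downloaded records.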