Medical-research-skills biopython-entrez
Use Bio.Entrez to access NCBI databases (e.g., PubMed/GenBank) for searching, fetching summaries, and downloading records when your workflow needs to call the NCBI E-utilities API over the network.
install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Evidence Insight/biopython-entrez" ~/.claude/skills/aipoch-medical-research-skills-biopython-entrez && rm -rf "$T"
manifest:
scientific-skills/Evidence Insight/biopython-entrez/SKILL.md
When to Use
- You need to search PubMed for articles by keyword, author, journal, or date range and then retrieve metadata or abstracts.
- You want to download GenBank records (e.g., nucleotide/protein sequences) in batch given accession IDs or search queries (see the sketch after this list).
- You need to convert identifiers or discover related records across NCBI databases (e.g., PubMed ↔ PMC, Gene ↔ Protein) via cross-links.
- You must retrieve lightweight summaries (titles, IDs, basic metadata) before deciding which full records to fetch.
- You are integrating NCBI E-utilities into an automated pipeline and need API key usage and rate-limit-aware requests.
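For example, the batch GenBank download mentioned above might look like this minimal sketch (the accession IDs, output path, and contact e-mail are illustrative placeholders):

```python
import os

from Bio import Entrez

# NCBI asks for a contact e-mail; an API key (optional) raises the rate limit.
Entrez.email = "your-email@example.com"

# Illustrative accession IDs -- replace with your own list or a search result.
accessions = ["NM_000546", "NM_007294"]

# EFetch the records as GenBank flat files in a single request.
handle = Entrez.efetch(
    db="nucleotide",
    id=",".join(accessions),
    rettype="gb",
    retmode="text",
)
records_text = handle.read()
handle.close()

os.makedirs("outputs", exist_ok=True)
with open("outputs/genbank_records.gb", "w", encoding="utf-8") as f:
    f.write(records_text)
```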
Key Features
- Supports core NCBI E-utilities via `Bio.Entrez`: `esearch`, `efetch`, `esummary`, `elink`.
- Query-based searching and ID list retrieval for downstream batch operations.
- Batch downloading of records in common formats (e.g., GenBank, FASTA, XML).
- API key configuration and rate-limit-friendly request patterns.
- XML response parsing using Biopython’s Entrez parsers for structured results.
- Standardized configuration and invocation conventions:
  - Write runtime configuration to `config/task_config.json`.
  - Invoke tasks via `python scripts/<task_name>.py`.
  - Avoid stacking many CLI parameters; prefer config files.
  - Use explicit UTF-8 encoding for file I/O and `ensure_ascii=False` for JSON output.
Dependencies
biopython>=1.80
Example Usage
The following example is a complete, runnable script that:
1) searches PubMed, 2) retrieves summaries for the top results, and 3) writes output to JSON.
1) Create `config/task_config.json`:
{ "email": "your-email@example.com", "api_key": "", "db": "pubmed", "term": "CRISPR Cas9 2020[PDAT]", "retmax": 5, "out_json": "outputs/pubmed_summaries.json" }
2) Create `scripts/pubmed_summaries.py`:
```python
import json
import os
import time
from typing import Any, Dict, List

from Bio import Entrez


def load_config(path: str) -> Dict[str, Any]:
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def ensure_parent_dir(path: str) -> None:
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)


def main() -> None:
    cfg = load_config("config/task_config.json")

    Entrez.email = cfg["email"]
    api_key = cfg.get("api_key") or ""
    if api_key:
        Entrez.api_key = api_key

    db = cfg.get("db", "pubmed")
    term = cfg["term"]
    retmax = int(cfg.get("retmax", 20))
    out_json = cfg.get("out_json", "outputs/pubmed_summaries.json")

    # 1) ESearch: get IDs
    with Entrez.esearch(db=db, term=term, retmax=retmax, usehistory="n") as handle:
        search_result = Entrez.read(handle)
    id_list: List[str] = search_result.get("IdList", [])

    if not id_list:
        ensure_parent_dir(out_json)
        with open(out_json, "w", encoding="utf-8") as f:
            json.dump({"query": term, "count": 0, "items": []}, f,
                      ensure_ascii=False, indent=2)
        return

    # Be polite with NCBI: small delay (especially without API key)
    time.sleep(0.34 if api_key else 0.5)

    # 2) ESummary: get summaries for IDs
    with Entrez.esummary(db=db, id=",".join(id_list), retmode="xml") as handle:
        summary_result = Entrez.read(handle)

    items = []
    for docsum in summary_result:
        items.append({
            "id": str(docsum.get("Id", "")),
            "title": str(docsum.get("Title", "")),
            "pubdate": str(docsum.get("PubDate", "")),
            "source": str(docsum.get("Source", "")),
            "authors": [str(a.get("Name", "")) for a in docsum.get("AuthorList", [])],
        })

    payload = {
        "query": term,
        "count": len(items),
        "items": items,
    }

    ensure_parent_dir(out_json)
    with open(out_json, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    main()
```
3) Run:
python scripts/pubmed_summaries.py
Implementation Details
- Core E-utilities mapping
  - ESearch: builds a query against an NCBI database and returns matching IDs (and optionally `WebEnv`/`QueryKey` for history-based batching).
  - ESummary: returns lightweight document summaries for a list of IDs.
  - EFetch: downloads full records (e.g., GenBank/FASTA/XML) for IDs; choose `rettype`/`retmode` based on the target database.
  - ELink: discovers cross-database relationships (e.g., PubMed → PMC, Gene → Protein); see the sketch below.
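For example, a minimal ELink sketch that finds PMC records linked to a PubMed article (the PMID and e-mail are illustrative placeholders):

```python
from Bio import Entrez

Entrez.email = "your-email@example.com"  # required by NCBI

pmid = "31452104"  # illustrative PubMed ID

# ELink: find PMC records linked to this PubMed record.
handle = Entrez.elink(dbfrom="pubmed", db="pmc", id=pmid)
linksets = Entrez.read(handle)
handle.close()

# Each linkset may contain several LinkSetDb groups; collect the
# linked PMC IDs from all of them.
pmc_ids = [
    link["Id"]
    for linksetdb in linksets[0].get("LinkSetDb", [])
    for link in linksetdb.get("Link", [])
]
print(pmc_ids)
```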
- Batching strategy
  - Prefer ESearch to obtain IDs, then call ESummary/EFetch in chunks (e.g., 100–500 IDs per request, depending on payload size).
  - For large jobs, consider `usehistory="y"` in ESearch and then fetch via `WebEnv`/`QueryKey` to avoid very long ID lists (see the sketch below).
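A minimal sketch of that history-based pattern, assuming an illustrative nucleotide query, chunk size, and output path:

```python
import os
import time

from Bio import Entrez

Entrez.email = "your-email@example.com"

# ESearch with history: NCBI keeps the full ID list on the server side.
handle = Entrez.esearch(
    db="nucleotide",
    term="BRCA1[Gene] AND Homo sapiens[Organism]",
    usehistory="y",
)
search = Entrez.read(handle)
handle.close()

count = int(search["Count"])
webenv = search["WebEnv"]
query_key = search["QueryKey"]

batch_size = 200  # illustrative chunk size
os.makedirs("outputs", exist_ok=True)
with open("outputs/brca1_nucleotide.fasta", "w", encoding="utf-8") as out:
    for start in range(0, count, batch_size):
        # Fetch one chunk by pointing EFetch at the stored history entry.
        fetch = Entrez.efetch(
            db="nucleotide",
            rettype="fasta",
            retmode="text",
            retstart=start,
            retmax=batch_size,
            webenv=webenv,
            query_key=query_key,
        )
        out.write(fetch.read())
        fetch.close()
        time.sleep(0.5)  # stay under NCBI's request limits
```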
- Rate limiting and API key
  - NCBI enforces request limits; using an API key increases allowed throughput.
  - Implement a small delay between requests and retry on transient network errors (HTTP 429/5xx) with backoff (see the sketch below).
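A minimal retry helper along those lines (retry count and delays are illustrative; Biopython surfaces these failures as urllib's HTTPError/URLError):

```python
import time
from urllib.error import HTTPError, URLError

from Bio import Entrez

Entrez.email = "your-email@example.com"
# Entrez.api_key = "..."  # optional; raises the allowed request rate


def entrez_call_with_retry(func, max_retries=3, base_delay=1.0, **kwargs):
    """Call an Entrez function, retrying transient HTTP 429/5xx failures with backoff."""
    for attempt in range(max_retries + 1):
        try:
            return func(**kwargs)
        except HTTPError as err:
            transient = err.code == 429 or 500 <= err.code < 600
            if not transient or attempt == max_retries:
                raise
        except URLError:
            if attempt == max_retries:
                raise
        # Exponential backoff before the next attempt.
        time.sleep(base_delay * (2 ** attempt))


handle = entrez_call_with_retry(Entrez.esearch, db="pubmed",
                                term="sepsis biomarkers", retmax=10)
result = Entrez.read(handle)
handle.close()
print(result["IdList"])
```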
- Parsing
  - Use `Entrez.read(handle)` for structured parsing of XML responses into Python objects.
  - For raw text formats (e.g., FASTA), use `handle.read()` and write to disk with `encoding="utf-8"` where applicable (see the sketch below).
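A minimal sketch contrasting structured XML parsing with raw-text retrieval (the PMID and protein accession are illustrative placeholders):

```python
import os

from Bio import Entrez

Entrez.email = "your-email@example.com"

# Structured XML: Entrez.read turns the response into Python dicts/lists.
handle = Entrez.esummary(db="pubmed", id="31452104", retmode="xml")
docsums = Entrez.read(handle)
handle.close()
print(docsums[0]["Title"])

# Raw text (FASTA): read the handle directly and write it out as UTF-8.
handle = Entrez.efetch(db="protein", id="NP_000537", rettype="fasta", retmode="text")
fasta_text = handle.read()
handle.close()

os.makedirs("outputs", exist_ok=True)
with open("outputs/NP_000537.fasta", "w", encoding="utf-8") as f:
    f.write(fasta_text)
```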
- Configuration and I/O conventions
  - Store runtime parameters in `config/task_config.json` as an intermediate artifact.
  - Avoid complex CLI flags; keep scripts callable as `python scripts/<task_name>.py`.
  - Always specify `encoding="utf-8"` for file I/O and use `ensure_ascii=False` for JSON outputs.
- Reference
  - See `references/databases.md` for database notes and selection guidance.