BioSkills bio-entrez-fetch

Retrieve records from NCBI databases using Biopython Bio.Entrez. Use when downloading sequences, fetching GenBank records, getting document summaries, or parsing NCBI data into Biopython objects.

install

source · Clone the upstream repo

git clone https://github.com/GPTomics/bioSkills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/database-access/entrez-fetch" ~/.claude/skills/gptomics-bioskills-bio-entrez-fetch && rm -rf "$T"

manifest: database-access/entrez-fetch/SKILL.md

source content

Version Compatibility

Reference examples tested with: BioPython 1.83+, Entrez Direct 21.0+

Before using code patterns, verify installed versions match. If versions differ:

Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Entrez Fetch

"Download a sequence from NCBI" → Retrieve a record by accession from an NCBI database and parse it into a usable object.

Python: `Entrez.efetch()` + `SeqIO.read()` (BioPython)

CLI: `efetch -db nucleotide -id NM_007294 -format fasta` (Entrez Direct)

R: `entrez_fetch()` (rentrez)

Retrieve records from NCBI databases using Biopython's Entrez module (EFetch, ESummary utilities).

Required Setup

from Bio import Entrez

Entrez.email = 'your.email@example.com'  # Required by NCBI
Entrez.api_key = 'your_api_key'          # Optional, raises rate limit 3->10 req/sec
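
One way to avoid hard-coding credentials is to read them from the environment. The sketch below is a minimal illustration, not part of Biopython: the helper names `entrez_settings` and `apply_settings` and the variable names `NCBI_EMAIL` and `NCBI_API_KEY` are assumptions made for this example. The request rates are the ones quoted above.

```python
import os


def entrez_settings(env=None):
    """Collect Entrez credentials from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    email = env.get('NCBI_EMAIL')               # hypothetical variable name
    if not email:
        raise ValueError('Set NCBI_EMAIL; NCBI requires a contact address')
    api_key = env.get('NCBI_API_KEY') or None   # hypothetical variable name
    return {
        'email': email,
        'api_key': api_key,
        # Rate limits quoted above: 3 req/sec without a key, 10 with one
        'max_requests_per_second': 10 if api_key else 3,
    }


def apply_settings(entrez_module, settings):
    """Copy the settings onto a module such as Bio.Entrez."""
    entrez_module.email = settings['email']
    if settings['api_key']:
        entrez_module.api_key = settings['api_key']
```

With Biopython installed, `apply_settings(Entrez, entrez_settings())` would stand in for the two assignments above.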

Core Functions

Entrez.efetch() - Retrieve Full Records

Fetch complete records in various formats from any NCBI database.

# Fetch GenBank record by ID
handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='gb', retmode='text')
genbank_text = handle.read()
handle.close()

# Fetch FASTA sequence
handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='fasta', retmode='text')
fasta_text = handle.read()
handle.close()

# Fetch multiple records (comma-separated IDs)
handle = Entrez.efetch(db='nucleotide', id='NM_007294,NM_000059', rettype='fasta', retmode='text')
multi_fasta_text = handle.read()
handle.close()

Key Parameters:

Parameter	Description	Example
`db`	Database name	`'nucleotide'` , `'protein'` , `'pubmed'`
`id`	Record ID(s)	`'NM_007294'` or `'123,456,789'`
`rettype`	Return type	`'fasta'` , `'gb'` , `'abstract'`
`retmode`	Return mode	`'text'` , `'xml'`
`retstart`	Start index	`0`
`retmax`	Max records	`20`
`WebEnv`	History server session	From esearch
`query_key`	History server query	From esearch
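
The `retstart`, `retmax`, `WebEnv`, and `query_key` parameters are typically combined to page through a stored search result. Below is a minimal sketch of the window arithmetic. It assumes a hypothetical helper `page_params` (not a Biopython function), and that `count`, `webenv`, and `query_key` come from an earlier esearch run with `usehistory='y'`. The default batch size is an illustrative choice.

```python
def page_params(count, webenv, query_key, batch_size=200):
    """Yield one efetch keyword dict per page of a History-server result."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, count, batch_size):
        yield {
            'WebEnv': webenv,
            'query_key': query_key,
            'retstart': start,                          # zero-based offset
            'retmax': min(batch_size, count - start),   # last page may be short
        }
```

Each dict can be unpacked into a call such as `Entrez.efetch(db='nucleotide', rettype='fasta', retmode='text', **params)`.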

Common Return Types by Database

Nucleotide/Protein:

rettype	retmode	Description
`'fasta'`	`'text'`	FASTA sequence
`'gb'`	`'text'`	GenBank flat file
`'gp'`	`'text'`	GenPept flat file (protein)
`'gbwithparts'`	`'text'`	GenBank with contig sequences
`'seqid'`	`'text'`	Seq-id only
`'acc'`	`'text'`	Accession only

PubMed:

rettype	retmode	Description
`'abstract'`	`'text'`	Abstract text
`'medline'`	`'text'`	MEDLINE format
(omit)	`'xml'`	Full PubMed XML

Gene:

rettype	retmode	Description
`'gene_table'`	`'text'`	Gene table format
(omit)	`'xml'`	Full gene XML
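
Only some of these return types can be handed to `SeqIO`. As a rough sketch, the mapping below pairs the sequence `rettype` values from the Nucleotide/Protein table with the `SeqIO` format name that reads them. `SEQIO_FORMAT` and `seqio_format` are hypothetical names for this example, and the mapping covers only the types listed here.

```python
# rettype (as listed above) -> Bio.SeqIO format name
SEQIO_FORMAT = {
    'fasta': 'fasta',
    'gb': 'genbank',
    'gbwithparts': 'genbank',
    'gp': 'genbank',   # GenPept uses the GenBank flat-file layout
}


def seqio_format(rettype):
    """Return the SeqIO format for a rettype, or raise if SeqIO cannot parse it."""
    try:
        return SEQIO_FORMAT[rettype]
    except KeyError:
        raise ValueError(
            f'{rettype!r} is not a SeqIO-parseable rettype; read it as text'
        ) from None
```

Types such as `'acc'` or `'seqid'` are plain identifier lists, so the helper rejects them rather than guessing a parser.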

Entrez.esummary() - Document Summaries

Get brief summaries without downloading full records. Usually faster than efetch because the response contains only metadata.

# Get summary for nucleotide record
handle = Entrez.esummary(db='nucleotide', id='NM_007294')
record = Entrez.read(handle)
handle.close()

summary = record[0]  # First (only) record
print(f"Title: {summary['Title']}")
print(f"Length: {summary['Length']}")
print(f"Organism: {summary.get('Organism', 'n/a')}")  # May be absent from default (version 1.0) summaries

Common Summary Fields:

# Nucleotide/Protein
summary['Title']          # Record title/description
summary['Caption']        # Short identifier
summary['Length']         # Sequence length
summary['Organism']       # Source organism (may be absent; read with .get())
summary['TaxId']          # Taxonomy ID
summary['AccessionVersion']  # Full accession.version

# PubMed
summary['Title']          # Article title
summary['AuthorList']     # Authors
summary['Source']         # Journal
summary['PubDate']        # Publication date
summary['DOI']            # Digital Object Identifier
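
Because the available fields differ by database, and some fields may be missing on individual records, reading them with `.get()` avoids a `KeyError`. This is a small sketch that assumes a hypothetical helper `pick_fields`. It works on any mapping, including the dictionary-like records returned by `Entrez.read()`.

```python
def pick_fields(summary, fields, default='n/a'):
    """Return {field: value} for the requested fields, filling gaps with default."""
    return {name: summary.get(name, default) for name in fields}


# Stand-in for one parsed esummary record (values are made up for illustration)
example = {'Caption': 'NM_000000', 'Title': 'Example transcript', 'Length': 1234}
row = pick_fields(example, ['Caption', 'Length', 'Organism'])
print(row)
```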

Parsing with Biopython

Parse into SeqRecord Objects

from Bio import Entrez, SeqIO

Entrez.email = 'your.email@example.com'

# Parse GenBank into SeqRecord
handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='gb', retmode='text')
record = SeqIO.read(handle, 'genbank')
handle.close()

print(f"ID: {record.id}")
print(f"Length: {len(record.seq)}")
print(f"Features: {len(record.features)}")

# Parse FASTA into SeqRecord
handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='fasta', retmode='text')
record = SeqIO.read(handle, 'fasta')
handle.close()

Parse Multiple Records

# Fetch multiple as FASTA
handle = Entrez.efetch(db='nucleotide', id='NM_007294,NM_000059,NM_000546', rettype='fasta', retmode='text')
records = list(SeqIO.parse(handle, 'fasta'))
handle.close()

for record in records:
    print(f"{record.id}: {len(record.seq)} bp")

Parse XML with Entrez.read()

# For structured data, use XML mode
handle = Entrez.efetch(db='gene', id='672', retmode='xml')
records = Entrez.read(handle)
handle.close()

# Navigate nested structure
gene = records[0]
print(f"Gene: {gene['Entrezgene_gene']['Gene-ref']['Gene-ref_locus']}")
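
Deeply nested lookups like the one above fail with `KeyError` or `IndexError` when a record lacks a branch. Here is a sketch of a tolerant accessor. It assumes a hypothetical helper `dig` (not part of Biopython) that walks dictionary keys and list indices.

```python
def dig(node, *path, default=None):
    """Follow keys/indices through nested dicts and lists; return default on a miss."""
    for step in path:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError):
            return default
    return node


# Stand-in mimicking the shape of a parsed gene record (contents are illustrative)
records = [{'Entrezgene_gene': {'Gene-ref': {'Gene-ref_locus': 'EXAMPLE1'}}}]
symbol = dig(records, 0, 'Entrezgene_gene', 'Gene-ref', 'Gene-ref_locus')
missing = dig(records, 0, 'Entrezgene_gene', 'Gene-ref', 'Gene-ref_desc', default='')
```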

Code Patterns

Fetch Sequence by Accession

from Bio import Entrez, SeqIO

Entrez.email = 'your.email@example.com'

def fetch_sequence(accession, db='nucleotide'):
    handle = Entrez.efetch(db=db, id=accession, rettype='fasta', retmode='text')
    record = SeqIO.read(handle, 'fasta')
    handle.close()
    return record

seq = fetch_sequence('NM_007294')
print(f"{seq.id}: {seq.seq[:50]}...")
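
The open/read/close sequence can also be written with `contextlib.closing`, which closes the handle even when parsing raises. In this minimal sketch, `read_with` is a hypothetical helper. The fetch function is passed in as an argument (an assumption of this example) so the same code works with `Entrez.efetch` or with an offline stand-in.

```python
import io
from contextlib import closing


def read_with(fetch, parse, **params):
    """Call fetch(**params), pass the open handle to parse, and always close it."""
    with closing(fetch(**params)) as handle:
        return parse(handle)


# Offline stand-in for Entrez.efetch that returns a file-like object
def fake_fetch(**params):
    return io.StringIO(f">{params['id']}\nACGT\n")


text = read_with(fake_fetch, lambda handle: handle.read(), id='EXAMPLE')
```

With Biopython, the equivalent call would be `read_with(Entrez.efetch, lambda h: SeqIO.read(h, 'fasta'), db='nucleotide', id='NM_007294', rettype='fasta', retmode='text')`.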

Fetch GenBank with Features

def fetch_genbank(accession):
    handle = Entrez.efetch(db='nucleotide', id=accession, rettype='gb', retmode='text')
    record = SeqIO.read(handle, 'genbank')
    handle.close()
    return record

gb = fetch_genbank('NM_007294')
for feature in gb.features:
    if feature.type == 'CDS':
        print(f"CDS: {feature.location}")
        print(f"Product: {feature.qualifiers.get('product', ['?'])[0]}")

Fetch PubMed Abstract

def fetch_abstract(pmid):
    handle = Entrez.efetch(db='pubmed', id=pmid, rettype='abstract', retmode='text')
    abstract = handle.read()
    handle.close()
    return abstract

abstract = fetch_abstract('35412348')
print(abstract)

Get Record Summaries

def get_summaries(db, ids):
    if isinstance(ids, list):
        ids = ','.join(ids)
    handle = Entrez.esummary(db=db, id=ids)
    records = Entrez.read(handle)
    handle.close()
    return records

summaries = get_summaries('nucleotide', ['NM_007294', 'NM_000059'])
for s in summaries:
    print(f"{s['Caption']}: {s['Title'][:50]}... ({s['Length']} bp)")
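
For longer ID lists, it can help to normalize the input and split it into request-sized groups. This is a sketch under stated assumptions: `id_batches` is a hypothetical helper, and the default group size of 200 is an illustrative choice, not a documented limit of `esummary`.

```python
def id_batches(ids, size=200):
    """Yield comma-joined ID strings with at most `size` IDs each."""
    if size < 1:
        raise ValueError('size must be at least 1')
    if isinstance(ids, str):
        ids = ids.split(',')
    # Accept ints as well as strings; drop blanks and surrounding whitespace
    cleaned = [str(i).strip() for i in ids if str(i).strip()]
    for start in range(0, len(cleaned), size):
        yield ','.join(cleaned[start:start + size])
```

Each yielded string can be passed as `ids` to `get_summaries` above.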

Search Then Fetch

Goal: Find records matching a query and download their sequences in one workflow.

Approach: Search with `esearch` to get IDs, then batch-fetch with `efetch` and parse into SeqRecord objects.

Reference (BioPython 1.83+):

handle = Entrez.esearch(db='nucleotide', term='human[orgn] AND insulin[gene] AND mRNA[fkey]', retmax=5)
search_results = Entrez.read(handle)
handle.close()

ids = search_results['IdList']

handle = Entrez.efetch(db='nucleotide', id=','.join(ids), rettype='fasta', retmode='text')
records = list(SeqIO.parse(handle, 'fasta'))
handle.close()

for record in records:
    print(f"{record.id}: {len(record.seq)} bp")
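
Two details are easy to miss in this workflow: `retmax` caps how many IDs come back, and an empty `IdList` should not be sent to `efetch`. The sketch below assumes a hypothetical helper `ids_to_fetch` that inspects the parsed esearch result, which reports the total number of hits under `Count`.

```python
def ids_to_fetch(search_results):
    """Return (ids, remaining): the IDs returned and how many hits were left out."""
    ids = list(search_results.get('IdList', []))
    total = int(search_results.get('Count', len(ids)))   # Count arrives as a string
    return ids, max(total - len(ids), 0)


# Stand-in for a parsed esearch result (values are made up for illustration)
ids, remaining = ids_to_fetch({'Count': '12', 'IdList': ['1', '2', '3', '4', '5']})
if ids:
    print(f"Fetching {len(ids)} records; {remaining} more match the query")
else:
    print("No hits; skipping efetch")
```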

Fetch Protein by Gene ID

Goal: Retrieve protein sequences for a gene, navigating from gene symbol to protein database.

Approach: Search the gene database by symbol, use `elink` to find linked protein IDs, then batch-fetch the protein sequences.

Reference (BioPython 1.83+):

handle = Entrez.esearch(db='gene', term='BRCA1[sym] AND human[orgn]')
result = Entrez.read(handle)
handle.close()
gene_id = result['IdList'][0]

handle = Entrez.elink(dbfrom='gene', db='protein', id=gene_id)
links = Entrez.read(handle)
handle.close()

# Take the first three linked protein IDs (raises IndexError if the gene has no protein links)
protein_ids = [link['Id'] for link in links[0]['LinkSetDb'][0]['Link'][:3]]

handle = Entrez.efetch(db='protein', id=','.join(protein_ids), rettype='fasta', retmode='text')
proteins = list(SeqIO.parse(handle, 'fasta'))
handle.close()
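
The indexing `links[0]['LinkSetDb'][0]` assumes that at least one link set came back. Below is a sketch of a more defensive extraction. It assumes a hypothetical helper `linked_ids` that walks the parsed elink structure; the link name in the sample data is illustrative only.

```python
def linked_ids(linksets, link_name=None, limit=None):
    """Collect linked IDs from a parsed elink result, optionally for one link name."""
    ids = []
    for linkset in linksets:
        for linkdb in linkset.get('LinkSetDb', []):
            if link_name is not None and linkdb.get('LinkName') != link_name:
                continue
            ids.extend(link['Id'] for link in linkdb.get('Link', []))
    return ids[:limit]   # limit=None keeps everything


# Stand-in for Entrez.read(elink_handle) output (IDs are made up for illustration)
sample = [{'LinkSetDb': [
    {'LinkName': 'gene_protein', 'Link': [{'Id': '11'}, {'Id': '12'}]},
    {'LinkName': 'gene_protein_refseq', 'Link': [{'Id': '21'}]},
]}]
first_two = linked_ids(sample, limit=2)
no_links = linked_ids([{'LinkSetDb': []}])
```

An empty list signals that there is nothing to fetch, so the `efetch` call can be skipped.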

Save Fetched Records to File

def download_sequences(ids, output_file, db='nucleotide', rettype='fasta'):
    handle = Entrez.efetch(db=db, id=','.join(ids), rettype=rettype, retmode='text')
    try:
        with open(output_file, 'w') as out:
            out.write(handle.read())
    finally:
        handle.close()  # Close the connection even if writing fails

download_sequences(['NM_007294', 'NM_000059'], 'brca_genes.fasta')

Common Errors

Error	Cause	Solution
`HTTPError 400`	Invalid ID or parameters	Verify ID exists, check rettype
`HTTPError 429`	Rate limit exceeded	Add delays or use API key
Empty result	Record doesn't exist	Verify accession in web browser
`ValueError` in SeqIO	Wrong format specified	Match rettype with SeqIO format
`ExpatError`	Malformed XML, or non-XML data passed to `Entrez.read()`	Use `Entrez.read()` only with `retmode='xml'`; read text responses with `handle.read()`
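
For the rate-limit case, "add delays" can be done with a retry wrapper. This is a minimal sketch, assuming a hypothetical helper `with_retries`; the retryable status codes and the delay schedule are illustrative choices. HTTP failures from `Bio.Entrez` calls surface as `urllib.error.HTTPError`, which is what the wrapper catches.

```python
import time
from urllib.error import HTTPError


def with_retries(call, attempts=3, base_delay=1.0,
                 retry_codes=(429, 500, 502, 503), sleep=time.sleep):
    """Run call(); on a retryable HTTPError wait and try again with backoff."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(attempts):
        try:
            return call()
        except HTTPError as err:
            last_try = attempt == attempts - 1
            if err.code not in retry_codes or last_try:
                raise   # e.g. 400: retrying an invalid request will not help
            sleep(base_delay * 2 ** attempt)   # doubles on each retry
```

A fetch would then be wrapped as `with_retries(lambda: fetch_sequence('NM_007294'))`. The `sleep` argument exists so the waiting can be replaced in tests.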

Decision Tree

Need to retrieve NCBI records?
├── Need full sequence?
│   └── Use efetch with rettype='fasta'
├── Need sequence + annotations?
│   └── Use efetch with rettype='gb' (GenBank)
├── Just need metadata (length, organism)?
│   └── Use esummary (faster)
├── Need PubMed abstract?
│   └── Use efetch with rettype='abstract'
├── Need structured data for parsing?
│   └── Use efetch with retmode='xml' + Entrez.read()
├── Downloading many records?
│   └── See batch-downloads skill
└── Need records from multiple databases?
    └── See entrez-link skill first

Related Skills

entrez-search - Find record IDs before fetching
entrez-link - Find related records in other databases
batch-downloads - Download large numbers of records efficiently
sequence-io/read-sequences - Parse downloaded sequences with SeqIO