BioSkills bio-entrez-fetch
Retrieve records from NCBI databases using Biopython Bio.Entrez. Use when downloading sequences, fetching GenBank records, getting document summaries, or parsing NCBI data into Biopython objects.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/database-access/entrez-fetch" ~/.claude/skills/gptomics-bioskills-bio-entrez-fetch && rm -rf "$T"
database-access/entrez-fetch/SKILL.md
Version Compatibility
Reference examples tested with: BioPython 1.83+, Entrez Direct 21.0+
Before using code patterns, verify that the installed versions match. If versions differ:
- Python: `pip show <package>`, then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
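A minimal sketch of that version check, using only the standard library. The helper names (`version_tuple`, `meets_minimum`, `installed_biopython`) are illustrative and not part of Biopython; the minimum of 1.83 is taken from the tested versions listed above.

```python
from importlib import metadata

MIN_BIOPYTHON = "1.83"  # version the reference examples were tested with


def version_tuple(version):
    """Turn '1.83' or '1.84.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_BIOPYTHON):
    """True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_biopython():
    """Return the installed Biopython version string, or None if absent."""
    try:
        return metadata.version("biopython")
    except metadata.PackageNotFoundError:
        return None


found = installed_biopython()
if found is None:
    print("Biopython is not installed")
elif not meets_minimum(found):
    print(f"Biopython {found} is older than {MIN_BIOPYTHON}; check signatures with help()")
else:
    print(f"Biopython {found} is within the tested range")
```

If the check reports an older version, `help(Entrez.efetch)` shows the signature that is actually installed.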
Entrez Fetch
"Download a sequence from NCBI" → Retrieve a record by accession from an NCBI database and parse it into a usable object.
- Python: `Entrez.efetch()` + `SeqIO.read()` (BioPython)
- CLI: `efetch -db nucleotide -id NM_007294 -format fasta` (Entrez Direct)
- R: `entrez_fetch()` (rentrez)
Retrieve records from NCBI databases using Biopython's Entrez module (EFetch, ESummary utilities).
Required Setup
    from Bio import Entrez

    Entrez.email = 'your.email@example.com'  # Required by NCBI
    Entrez.api_key = 'your_api_key'          # Optional, raises rate limit 3->10 req/sec
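To keep credentials out of source files, the setup can read them from the environment. This is a sketch only: `configure_entrez` is a hypothetical helper, and the variable names `NCBI_EMAIL` and `NCBI_API_KEY` are assumptions, not something Biopython reads on its own.

```python
import os
from types import SimpleNamespace


def configure_entrez(entrez, env=os.environ):
    """Set email (required) and api_key (optional) on an Entrez-like module."""
    email = env.get("NCBI_EMAIL")  # assumed variable name
    if not email:
        raise RuntimeError("Set NCBI_EMAIL before querying NCBI")
    entrez.email = email
    api_key = env.get("NCBI_API_KEY")  # assumed variable name
    if api_key:
        entrez.api_key = api_key
    return entrez


# Demonstration with a stand-in object; in real use pass the module itself:
#     from Bio import Entrez
#     configure_entrez(Entrez)
demo = configure_entrez(
    SimpleNamespace(email=None, api_key=None),
    env={"NCBI_EMAIL": "your.email@example.com"},
)
print(demo.email, demo.api_key)
```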
Core Functions
Entrez.efetch() - Retrieve Full Records
Fetch complete records in various formats from any NCBI database.
    # Fetch GenBank record by ID
    handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='gb', retmode='text')
    genbank_text = handle.read()
    handle.close()

    # Fetch FASTA sequence
    handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='fasta', retmode='text')
    fasta_text = handle.read()
    handle.close()

    # Fetch multiple records (comma-separated IDs)
    handle = Entrez.efetch(db='nucleotide', id='NM_007294,NM_000059', rettype='fasta', retmode='text')
    multi_fasta_text = handle.read()
    handle.close()
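The read-then-close pattern above leaves the handle open if `read()` raises. One way to guard against that is `contextlib.closing`; `read_and_close` below is a hypothetical helper, and an `io.StringIO` stands in for the handle that `Entrez.efetch()` would return.

```python
import io
from contextlib import closing


def read_and_close(handle):
    """Read everything from a handle and close it even if read() fails."""
    with closing(handle):
        return handle.read()


# Stand-in for the object returned by Entrez.efetch(...)
fake_handle = io.StringIO(">NM_007294 example\nACGT\n")
text = read_and_close(fake_handle)
print(text.splitlines()[0])
print(fake_handle.closed)
```

In real use this would be called as `read_and_close(Entrez.efetch(db='nucleotide', id='NM_007294', rettype='fasta', retmode='text'))`.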
Key Parameters:
| Parameter | Description | Example |
|---|---|---|
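| `db` | Database name | `'nucleotide'`, `'protein'`, `'pubmed'` |
| `id` | Record ID(s) | `'NM_007294'` or `'NM_007294,NM_000059'` |
| `rettype` | Return type | `'fasta'`, `'gb'`, `'abstract'` |
| `retmode` | Return mode | `'text'`, `'xml'` |
| `retstart` | Start index | `0` |
| `retmax` | Max records | `100` |
| `webenv` | History server session | From esearch |
| `query_key` | History server query | From esearch |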
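The start-index and max-records parameters work together when a result set is retrieved in slices. A small sketch of how the windows could be computed; `batch_windows` is a hypothetical helper, and the batch size of 100 is an arbitrary choice.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that cover `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


windows = list(batch_windows(total=250, batch_size=100))
print(windows)

# With a history session from esearch (usehistory='y'), each window would map to:
#     Entrez.efetch(db='nucleotide', rettype='fasta', retmode='text',
#                   retstart=start, retmax=size,
#                   webenv=webenv, query_key=query_key)
```

For large downloads, the batch-downloads skill covers the full workflow.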
Common Return Types by Database
Nucleotide/Protein:
| rettype | retmode | Description |
|---|---|---|
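| `fasta` | `text` | FASTA sequence |
| `gb` | `text` | GenBank flat file |
| `gp` | `text` | GenPept flat file (protein) |
| `gbwithparts` | `text` | GenBank with contig sequences |
| `seqid` | `text` | Seq-id only |
| `acc` | `text` | Accession only |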
PubMed:
| rettype | retmode | Description |
|---|---|---|
| `abstract` | `text` | Abstract text |
| `medline` | `text` | MEDLINE format |
| (default) | `xml` | Full PubMed XML |
Gene:
| rettype | retmode | Description |
|---|---|---|
| `gene_table` | `text` | Gene table format |
| (default) | `xml` | Full gene XML |
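When the fetched text is going to be parsed with SeqIO, the return type has to be paired with the right SeqIO format name. The lookup below is a sketch covering only the sequence return types that SeqIO can parse; `SEQIO_FORMAT` and `seqio_format_for` are hypothetical names.

```python
# rettype values whose text output SeqIO can parse, and the parser name to use
SEQIO_FORMAT = {
    "fasta": "fasta",
    "gb": "genbank",
    "gbwithparts": "genbank",
    "gp": "genbank",  # GenPept uses the GenBank flat-file layout
}


def seqio_format_for(rettype):
    """Return the SeqIO format name matching an efetch rettype."""
    try:
        return SEQIO_FORMAT[rettype]
    except KeyError:
        raise ValueError(f"rettype {rettype!r} is not a SeqIO-parsable format") from None


print(seqio_format_for("gb"))
```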
Entrez.esummary() - Document Summaries
Get brief summaries without downloading full records. Faster than efetch.
    # Get summary for nucleotide record
    handle = Entrez.esummary(db='nucleotide', id='NM_007294')
    record = Entrez.read(handle)
    handle.close()

    summary = record[0]  # First (only) record
    print(f"Title: {summary['Title']}")
    print(f"Length: {summary['Length']}")
    # Field availability varies by database, so fall back instead of raising KeyError
    print(f"Organism: {summary.get('Organism', 'n/a')}")
Common Summary Fields:
    # Nucleotide/Protein
    summary['Title']             # Record title/description
    summary['Caption']           # Short identifier
    summary['Length']            # Sequence length
    summary['Organism']          # Source organism
    summary['TaxId']             # Taxonomy ID
    summary['AccessionVersion']  # Full accession.version

    # PubMed
    summary['Title']             # Article title
    summary['AuthorList']        # Authors
    summary['Source']            # Journal
    summary['PubDate']           # Publication date
    summary['DOI']               # Digital Object Identifier
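Not every record carries every field listed above, so direct indexing can raise `KeyError`. A sketch of tolerant access follows; `pick_fields` is a hypothetical helper, and a plain dict with invented values stands in for one parsed summary record.

```python
def pick_fields(summary, fields, default=None):
    """Collect selected summary fields, tolerating ones that are absent."""
    return {name: summary.get(name, default) for name in fields}


# A plain dict stands in for one parsed esummary record (values are invented)
example_summary = {"Caption": "NM_000000", "Title": "Example record", "Length": 1234}
row = pick_fields(example_summary, ["Caption", "Length", "Organism"], default="n/a")
print(row)
```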
Parsing with Biopython
Parse into SeqRecord Objects
    from Bio import Entrez, SeqIO

    Entrez.email = 'your.email@example.com'

    # Parse GenBank into SeqRecord
    handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='gb', retmode='text')
    record = SeqIO.read(handle, 'genbank')
    handle.close()
    print(f"ID: {record.id}")
    print(f"Length: {len(record.seq)}")
    print(f"Features: {len(record.features)}")

    # Parse FASTA into SeqRecord
    handle = Entrez.efetch(db='nucleotide', id='NM_007294', rettype='fasta', retmode='text')
    record = SeqIO.read(handle, 'fasta')
    handle.close()
Parse Multiple Records
    # Fetch multiple as FASTA
    handle = Entrez.efetch(db='nucleotide', id='NM_007294,NM_000059,NM_000546', rettype='fasta', retmode='text')
    records = list(SeqIO.parse(handle, 'fasta'))
    handle.close()

    for record in records:
        print(f"{record.id}: {len(record.seq)} bp")
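A multi-record fetch can return fewer records than were requested, for example when one accession does not exist. The sketch below compares requested accessions against the parsed `record.id` values; `missing_accessions` is a hypothetical helper and the version suffixes in the sample IDs are illustrative.

```python
def missing_accessions(requested, returned_ids):
    """Return requested accessions with no matching returned record ID."""
    # Returned IDs usually carry a version suffix (e.g. 'NM_007294.4'),
    # so compare on the part before the first dot.
    returned = {rid.split(".", 1)[0] for rid in returned_ids}
    return [acc for acc in requested if acc.split(".", 1)[0] not in returned]


requested = ["NM_007294", "NM_000059", "NM_000546"]
returned_ids = ["NM_007294.4", "NM_000546.6"]  # illustrative, as from record.id
print(missing_accessions(requested, returned_ids))
```

With parsed records this would be called as `missing_accessions(requested, [r.id for r in records])`.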
Parse XML with Entrez.read()
    # For structured data, use XML mode
    handle = Entrez.efetch(db='gene', id='672', retmode='xml')
    records = Entrez.read(handle)
    handle.close()

    # Navigate nested structure
    gene = records[0]
    print(f"Gene: {gene['Entrezgene_gene']['Gene-ref']['Gene-ref_locus']}")
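Deeply nested lookups like the one above fail with `KeyError` or `IndexError` as soon as one level is missing. A sketch of a tolerant walker follows; `dig` is a hypothetical helper, and plain lists and dicts with an invented value stand in for the parsed result, on the assumption that the parsed objects index the same way.

```python
def dig(data, *path, default=None):
    """Follow keys and indexes through nested containers; return default if a step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Plain containers stand in for the parsed Entrez.read() result (value is invented)
records = [{"Entrezgene_gene": {"Gene-ref": {"Gene-ref_locus": "EXAMPLE1"}}}]
print(dig(records, 0, "Entrezgene_gene", "Gene-ref", "Gene-ref_locus"))
print(dig(records, 0, "Entrezgene_gene", "Gene-ref", "Gene-ref_desc", default="n/a"))
```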
Code Patterns
Fetch Sequence by Accession
    from Bio import Entrez, SeqIO

    Entrez.email = 'your.email@example.com'

    def fetch_sequence(accession, db='nucleotide'):
        handle = Entrez.efetch(db=db, id=accession, rettype='fasta', retmode='text')
        record = SeqIO.read(handle, 'fasta')
        handle.close()
        return record

    seq = fetch_sequence('NM_007294')
    print(f"{seq.id}: {seq.seq[:50]}...")
Fetch GenBank with Features
    def fetch_genbank(accession):
        handle = Entrez.efetch(db='nucleotide', id=accession, rettype='gb', retmode='text')
        record = SeqIO.read(handle, 'genbank')
        handle.close()
        return record

    gb = fetch_genbank('NM_007294')
    for feature in gb.features:
        if feature.type == 'CDS':
            print(f"CDS: {feature.location}")
            print(f"Product: {feature.qualifiers.get('product', ['?'])[0]}")
Fetch PubMed Abstract
    def fetch_abstract(pmid):
        handle = Entrez.efetch(db='pubmed', id=pmid, rettype='abstract', retmode='text')
        abstract = handle.read()
        handle.close()
        return abstract

    abstract = fetch_abstract('35412348')
    print(abstract)
Get Record Summaries
    def get_summaries(db, ids):
        if isinstance(ids, list):
            ids = ','.join(ids)
        handle = Entrez.esummary(db=db, id=ids)
        records = Entrez.read(handle)
        handle.close()
        return records

    summaries = get_summaries('nucleotide', ['NM_007294', 'NM_000059'])
    for s in summaries:
        print(f"{s['Caption']}: {s['Title'][:50]}... ({s['Length']} bp)")
Search Then Fetch
Goal: Find records matching a query and download their sequences in one workflow.
Approach: Search with `esearch` to get IDs, then batch-fetch with `efetch` and parse into SeqRecord objects.
Reference (BioPython 1.83+):
    # 1. Search for matching record IDs
    handle = Entrez.esearch(db='nucleotide', term='human[orgn] AND insulin[gene] AND mRNA[fkey]', retmax=5)
    search_results = Entrez.read(handle)
    handle.close()
    ids = search_results['IdList']

    # 2. Fetch all hits in one request and parse them
    handle = Entrez.efetch(db='nucleotide', id=','.join(ids), rettype='fasta', retmode='text')
    records = list(SeqIO.parse(handle, 'fasta'))
    handle.close()

    for record in records:
        print(f"{record.id}: {len(record.seq)} bp")
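If the search matches nothing, the workflow above would pass an empty ID string to the fetch step. A sketch of a guard that fails early instead; `join_ids` is a hypothetical helper and the IDs in the demonstration are placeholders.

```python
def join_ids(ids):
    """Return a comma-separated ID string, dropping blanks and duplicates."""
    seen = []
    for raw in ids:
        item = str(raw).strip()
        if item and item not in seen:
            seen.append(item)
    if not seen:
        raise ValueError("no IDs to fetch; the search returned nothing")
    return ",".join(seen)


print(join_ids(["111", " 222 ", "111"]))
```

In the workflow this would replace `','.join(ids)`, so an empty search surfaces as a clear `ValueError` before any request is sent.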
Fetch Protein by Gene ID
Goal: Retrieve protein sequences for a gene, navigating from gene symbol to protein database.
Approach: Search the gene database by symbol, use `elink` to find linked protein IDs, then batch-fetch the protein sequences.
Reference (BioPython 1.83+):
    # 1. Find the Gene ID for the symbol
    handle = Entrez.esearch(db='gene', term='BRCA1[sym] AND human[orgn]')
    result = Entrez.read(handle)
    handle.close()
    gene_id = result['IdList'][0]

    # 2. Follow gene -> protein links and keep the first three
    handle = Entrez.elink(dbfrom='gene', db='protein', id=gene_id)
    links = Entrez.read(handle)
    handle.close()
    protein_ids = [link['Id'] for link in links[0]['LinkSetDb'][0]['Link'][:3]]

    # 3. Fetch the linked proteins as FASTA
    handle = Entrez.efetch(db='protein', id=','.join(protein_ids), rettype='fasta', retmode='text')
    proteins = list(SeqIO.parse(handle, 'fasta'))
    handle.close()
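The expression `links[0]['LinkSetDb'][0]` raises `IndexError` when a gene has no linked proteins. A sketch of a safer extraction that returns an empty list in that case; `linked_ids` is a hypothetical helper, and plain containers with placeholder IDs stand in for the parsed elink result.

```python
def linked_ids(linksets, limit=None):
    """Return target IDs from the first LinkSetDb of a parsed elink result, or [] if none."""
    if not linksets:
        return []
    linksetdbs = linksets[0].get("LinkSetDb", [])
    if not linksetdbs:
        return []
    ids = [link["Id"] for link in linksetdbs[0].get("Link", [])]
    return ids[:limit]  # limit=None keeps every ID


# Plain containers stand in for Entrez.read(handle); IDs are placeholders
with_links = [{"LinkSetDb": [{"Link": [{"Id": "1"}, {"Id": "2"}, {"Id": "3"}]}]}]
without_links = [{"LinkSetDb": []}]
print(linked_ids(with_links, limit=2))
print(linked_ids(without_links))
```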
Save Fetched Records to File
    def download_sequences(ids, output_file, db='nucleotide', rettype='fasta'):
        # The parameter is named rettype to avoid shadowing the built-in format()
        handle = Entrez.efetch(db=db, id=','.join(ids), rettype=rettype, retmode='text')
        with open(output_file, 'w') as out:
            out.write(handle.read())
        handle.close()

    download_sequences(['NM_007294', 'NM_000059'], 'brca_genes.fasta')
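`download_sequences` reads the whole response into memory before writing it. For larger responses, a chunked copy with `shutil.copyfileobj` is one alternative. This is a sketch: `save_handle` is a hypothetical helper and an `io.StringIO` stands in for the text-mode handle from `Entrez.efetch()`.

```python
import io
import os
import shutil
import tempfile


def save_handle(handle, output_file):
    """Copy a text handle to a file in chunks, then close the handle."""
    try:
        with open(output_file, "w") as out:
            shutil.copyfileobj(handle, out)
    finally:
        handle.close()


# Stand-in for Entrez.efetch(..., retmode='text')
fake_handle = io.StringIO(">seq1\nACGT\n>seq2\nGGCC\n")
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
save_handle(fake_handle, path)
with open(path) as saved:
    print(saved.read().count(">"))
```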
Common Errors
| Error | Cause | Solution |
|---|---|---|
| `HTTPError 400` | Invalid ID or parameters | Verify ID exists, check rettype |
| `HTTPError 429` | Rate limit exceeded | Add delays or use API key |
| Empty result | Record doesn't exist | Verify accession in web browser |
| `ValueError` in SeqIO | Wrong format specified | Match rettype with SeqIO format |
| XML parsing error | Use instead |
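For the rate-limit case, "add delays" can be sketched as a retry with exponential backoff. This assumes the failure surfaces as `urllib.error.HTTPError` with code 429; `fetch_with_retry` is a hypothetical helper, and the retry count and base delay are arbitrary choices. A stand-in function simulates two rate-limit failures so the example runs without network access.

```python
import io
import time
import urllib.error


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on HTTP 429 wait with exponential backoff and try again."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a rate-limit error, or the final failure
            if err.code != 429 or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in that fails twice with 429 before succeeding
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.HTTPError(
            "https://example.invalid", 429, "Too Many Requests", {}, io.BytesIO()
        )
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = fetch_with_retry(flaky_fetch, sleep=waits.append)
print(result, waits)
```

In real use the callable would wrap the request, for example `fetch_with_retry(lambda: Entrez.efetch(db='nucleotide', id='NM_007294', rettype='fasta', retmode='text'))`.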
Decision Tree
    Need to retrieve NCBI records?
    ├── Need full sequence?
    │   └── Use efetch with rettype='fasta'
    ├── Need sequence + annotations?
    │   └── Use efetch with rettype='gb' (GenBank)
    ├── Just need metadata (length, organism)?
    │   └── Use esummary (faster)
    ├── Need PubMed abstract?
    │   └── Use efetch with rettype='abstract'
    ├── Need structured data for parsing?
    │   └── Use efetch with retmode='xml' + Entrez.read()
    ├── Downloading many records?
    │   └── See batch-downloads skill
    └── Need records from multiple databases?
        └── See entrez-link skill first
Related Skills
- entrez-search - Find record IDs before fetching
- entrez-link - Find related records in other databases
- batch-downloads - Download large numbers of records efficiently
- sequence-io/read-sequences - Parse downloaded sequences with SeqIO