# claude-skill-registry · bio-batch-downloads

Download large datasets from NCBI efficiently using the history server, batching, and rate limiting. Use when performing bulk sequence downloads, handling large query results, or production-scale data retrieval.

## Install

From source (clone the upstream repo):

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code (install into `~/.claude/skills/`):

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/batch-downloads" ~/.claude/skills/majiayu000-claude-skill-registry-bio-batch-downloads && rm -rf "$T"
```

Manifest: `skills/data/batch-downloads/SKILL.md`
# Batch Downloads

Download large numbers of records from NCBI efficiently using the history server, batching, and proper rate limiting.
## Required Setup

```python
from Bio import Entrez
import time

Entrez.email = 'your.email@example.com'  # Required by NCBI
Entrez.api_key = 'your_api_key'          # Recommended for large downloads
```
## Rate Limits

| Authentication | Requests/Second | Delay Between Requests |
|---|---|---|
| Email only | 3 | 0.34 seconds |
| Email + API key | 10 | 0.1 seconds |
Get an API key at: https://www.ncbi.nlm.nih.gov/account/settings/
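The delay column follows mechanically from whether an API key is configured; the code patterns later in this document inline it as `delay = 0.1 if Entrez.api_key else 0.34`. As a standalone sketch (the `request_delay` helper is our illustration, not a Biopython API):

```python
def request_delay(api_key=None):
    """Inter-request delay derived from NCBI's published rate limits:
    10 requests/second with an API key, 3 without."""
    return 0.1 if api_key else 0.34


# Without a key, sleep 0.34s between requests; with one, 0.1s
assert request_delay() == 0.34
assert request_delay('your_api_key') == 0.1
```

Passing `Entrez.api_key` as the argument keeps the helper in sync with however the session is configured.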
## History Server

The history server stores search results on NCBI servers, enabling efficient batch retrieval without re-sending large ID lists.

### How It Works

- Search with `usehistory='y'`
- Get `WebEnv` (session ID) and `query_key` (result set ID)
- Fetch results in batches using these identifiers
- Results stay available for ~15 minutes
```python
# Search with history
handle = Entrez.esearch(db='nucleotide',
                        term='human[orgn] AND mRNA[fkey]',
                        usehistory='y')
search = Entrez.read(handle)
handle.close()

webenv = search['WebEnv']
query_key = search['QueryKey']
total = int(search['Count'])
print(f"Found {total} records, stored in history")
```
## Core Pattern: Batch Download

```python
from Bio import Entrez
import time

Entrez.email = 'your.email@example.com'

def batch_download(db, term, output_file, rettype='fasta', batch_size=500):
    # Search with history
    handle = Entrez.esearch(db=db, term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()
    webenv = search['WebEnv']
    query_key = search['QueryKey']
    total = int(search['Count'])
    print(f"Downloading {total} records...")

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            print(f"  Fetching {start+1}-{min(start+batch_size, total)}...")
            handle = Entrez.efetch(
                db=db, rettype=rettype, retmode='text',
                retstart=start, retmax=batch_size,
                webenv=webenv, query_key=query_key
            )
            out.write(handle.read())
            handle.close()
            time.sleep(0.34)  # Rate limiting (no API key)

    print(f"Saved to {output_file}")
```
## Code Patterns

### Download All Search Results

```python
from Bio import Entrez
import time

Entrez.email = 'your.email@example.com'
Entrez.api_key = 'your_api_key'  # Optional

def download_search_results(db, term, output_file, rettype='fasta', batch_size=500):
    # Search with history server
    handle = Entrez.esearch(db=db, term=term, usehistory='y', retmax=0)
    search = Entrez.read(handle)
    handle.close()
    webenv = search['WebEnv']
    query_key = search['QueryKey']
    total = int(search['Count'])

    if total == 0:
        print("No records found")
        return

    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            end = min(start + batch_size, total)
            print(f"Downloading {start+1}-{end} of {total}")

            attempts = 3
            for attempt in range(attempts):
                try:
                    handle = Entrez.efetch(
                        db=db, rettype=rettype, retmode='text',
                        retstart=start, retmax=batch_size,
                        webenv=webenv, query_key=query_key
                    )
                    out.write(handle.read())
                    handle.close()
                    break
                except Exception as e:
                    if attempt < attempts - 1:
                        print(f"  Retry {attempt+1}: {e}")
                        time.sleep(5)
                    else:
                        raise

            time.sleep(delay)

    print(f"Downloaded {total} records to {output_file}")

download_search_results('nucleotide',
                        'human[orgn] AND insulin[gene] AND mRNA[fkey]',
                        'insulin_mrna.fasta')
```
### Download by ID List

```python
def download_by_ids(db, ids, output_file, rettype='fasta', batch_size=200):
    total = len(ids)
    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            batch = ids[start:start+batch_size]
            print(f"Downloading {start+1}-{start+len(batch)} of {total}")
            handle = Entrez.efetch(db=db, id=','.join(batch),
                                   rettype=rettype, retmode='text')
            out.write(handle.read())
            handle.close()
            time.sleep(delay)

    print(f"Downloaded {total} records to {output_file}")

# Example with a list of IDs
ids = ['NM_007294', 'NM_000059', 'NM_000546', 'NM_001126112', 'NM_004985']
download_by_ids('nucleotide', ids, 'genes.fasta')
```
### Post IDs to History (EPost)

For very large ID lists, post them to the history server first:

```python
def post_and_download(db, ids, output_file, rettype='fasta', batch_size=500):
    # Post IDs to the history server
    handle = Entrez.epost(db=db, id=','.join(ids))
    result = Entrez.read(handle)
    handle.close()
    webenv = result['WebEnv']
    query_key = result['QueryKey']

    total = len(ids)
    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            end = min(start + batch_size, total)
            print(f"Fetching {start+1}-{end} of {total}")
            handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                   retstart=start, retmax=batch_size,
                                   webenv=webenv, query_key=query_key)
            out.write(handle.read())
            handle.close()
            time.sleep(delay)

    print(f"Downloaded {total} records")
```
### Download GenBank with Progress

```python
from Bio import Entrez, SeqIO
import time

def download_genbank_records(term, output_file, batch_size=100):
    Entrez.email = 'your.email@example.com'

    # Search
    handle = Entrez.esearch(db='nucleotide', term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()
    webenv, query_key = search['WebEnv'], search['QueryKey']
    total = int(search['Count'])

    records = []
    for start in range(0, total, batch_size):
        print(f"Fetching {start+1}-{min(start+batch_size, total)} of {total}")
        handle = Entrez.efetch(db='nucleotide', rettype='gb', retmode='text',
                               retstart=start, retmax=batch_size,
                               webenv=webenv, query_key=query_key)
        batch_records = list(SeqIO.parse(handle, 'genbank'))
        handle.close()
        records.extend(batch_records)
        time.sleep(0.34)

    SeqIO.write(records, output_file, 'genbank')
    print(f"Saved {len(records)} GenBank records")
    return records
```
### Download with Retry Logic

```python
import time
from urllib.error import HTTPError

def robust_download(db, term, output_file, rettype='fasta',
                    batch_size=500, max_retries=3):
    handle = Entrez.esearch(db=db, term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()
    webenv, query_key = search['WebEnv'], search['QueryKey']
    total = int(search['Count'])
    delay = 0.1 if Entrez.api_key else 0.34

    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            for retry in range(max_retries):
                try:
                    handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                           retstart=start, retmax=batch_size,
                                           webenv=webenv, query_key=query_key)
                    data = handle.read()
                    handle.close()
                    if data.strip():
                        out.write(data)
                    break
                except HTTPError as e:
                    if retry == max_retries - 1:
                        raise  # Out of retries: surface the error
                    if e.code == 429:  # Rate limit
                        wait = 10 * (retry + 1)
                        print(f"Rate limited, waiting {wait}s...")
                        time.sleep(wait)
                    else:
                        time.sleep(5)
            time.sleep(delay)

    print(f"Downloaded to {output_file}")
```
### Stream to File (Memory Efficient)

```python
from Bio import Entrez
import time

def stream_download(db, term, output_file, rettype='fasta', batch_size=1000):
    handle = Entrez.esearch(db=db, term=term, usehistory='y')
    search = Entrez.read(handle)
    handle.close()
    webenv, query_key = search['WebEnv'], search['QueryKey']
    total = int(search['Count'])

    downloaded = 0
    with open(output_file, 'w') as out:
        for start in range(0, total, batch_size):
            handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
                                   retstart=start, retmax=batch_size,
                                   webenv=webenv, query_key=query_key)
            # Stream chunks to file instead of holding everything in memory
            while True:
                chunk = handle.read(8192)
                if not chunk:
                    break
                out.write(chunk)
            handle.close()

            downloaded = min(start + batch_size, total)
            print(f"Progress: {downloaded}/{total} ({100*downloaded/total:.1f}%)")
            time.sleep(0.34)
```
## Batch Size Guidelines
| Database | rettype | Recommended Batch |
|---|---|---|
| nucleotide | fasta | 500-1000 |
| nucleotide | gb | 100-200 |
| protein | fasta | 500-1000 |
| protein | gp | 100-200 |
| pubmed | abstract | 1000-2000 |
| pubmed | xml | 200-500 |
Use smaller batches for GenBank and XML formats, which carry more data per record.
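The guideline table can be encoded as a lookup so download scripts pick a sensible batch size automatically. A minimal sketch, using the midpoint of each recommended range (the `suggested_batch_size` helper and its default are our illustration, not part of Biopython):

```python
# Midpoints of the recommended ranges from the guideline table
BATCH_SIZES = {
    ('nucleotide', 'fasta'): 750,
    ('nucleotide', 'gb'): 150,
    ('protein', 'fasta'): 750,
    ('protein', 'gp'): 150,
    ('pubmed', 'abstract'): 1500,
    ('pubmed', 'xml'): 350,
}

def suggested_batch_size(db, rettype, default=200):
    """Return a batch size for a (db, rettype) pair, falling back to a
    conservative default for combinations not covered by the table."""
    return BATCH_SIZES.get((db, rettype), default)


assert suggested_batch_size('nucleotide', 'gb') == 150
assert suggested_batch_size('gene', 'xml') == 200  # unlisted: default
```

The return value can be passed straight into the `batch_size` parameter of the download functions above.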
## Common Errors

| Error | Cause | Solution |
|---|---|---|
| HTTP 429 | Rate limit exceeded | Increase delay, use API key |
| Invalid WebEnv/query_key | Session expired | Re-run the search |
| Incomplete data | Connection interrupted | Add retry logic |
| Memory error | Batch too large | Reduce batch_size |
| Empty response | No more records | Check total vs start |
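The expired-session row is worth automating: if a fetch fails partway through a long download, re-running the search yields a fresh `WebEnv`/`query_key`, and the loop can resume at the same offset. A minimal sketch with the search and fetch steps injected as callables so the refresh logic is isolated (`fetch_all_batches` and its parameters are our illustration, not a Biopython API):

```python
def fetch_all_batches(search, fetch_batch, total, batch_size, max_refreshes=1):
    """Fetch every batch of a result set; on failure, refresh the history
    session (new WebEnv/query_key) and retry the same offset.

    search()                    -> (webenv, query_key)
    fetch_batch(we, qk, start)  -> text for the batch at offset `start`
    """
    webenv, query_key = search()
    chunks = []
    for start in range(0, total, batch_size):
        refreshes = 0
        while True:
            try:
                chunks.append(fetch_batch(webenv, query_key, start))
                break
            except Exception:
                if refreshes >= max_refreshes:
                    raise  # Not an expiry problem: give up
                refreshes += 1
                webenv, query_key = search()  # Session expired: re-search
    return ''.join(chunks)
```

In real use, `search` would wrap `Entrez.esearch(..., usehistory='y')` and `fetch_batch` would wrap `Entrez.efetch` with `retstart`/`retmax`, as in the patterns above.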
## Decision Tree

```
Need to download many NCBI records?
├── Have search query?
│   └── Use esearch with usehistory='y', then batch efetch
├── Have list of IDs?
│   ├── < 200 IDs? → Direct efetch with comma-separated IDs
│   └── >= 200 IDs? → Use epost, then batch efetch
├── Need records as Biopython objects?
│   └── Parse each batch with SeqIO
├── Downloading > 10,000 records?
│   └── Use streaming to avoid memory issues
└── Getting rate limited?
    └── Get API key, add retry logic
```
## Related Skills
- `entrez-search` - Build search queries for batch downloads
- `entrez-fetch` - Single record fetching
- `sra-data` - For raw sequencing data (use the SRA toolkit instead)