BioSkills bio-entrez-search
Search NCBI databases using Biopython Bio.Entrez. Use when finding records by keyword, building complex search queries, discovering database structure, or getting global query counts across databases.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/database-access/entrez-search" ~/.claude/skills/gptomics-bioskills-bio-entrez-search && rm -rf "$T"
database-access/entrez-search/SKILL.md
Version Compatibility
Reference examples tested with: BioPython 1.83+, Entrez Direct 21.0+
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Entrez Search
Search NCBI databases using Biopython's Entrez module (ESearch, EInfo, EGQuery utilities).
"Search NCBI for records" → Query any NCBI database by keyword, organism, or field-qualified terms and retrieve matching record IDs.
- Python: Entrez.esearch(db=..., term=...) (BioPython)
- CLI: esearch -db nucleotide -query "term" (Entrez Direct)
Required Setup
    from Bio import Entrez

    Entrez.email = 'your.email@example.com'  # Required by NCBI
    Entrez.api_key = 'your_api_key'          # Optional, raises rate limit 3->10 req/sec
Core Functions
Entrez.esearch() - Search a Database
Search any NCBI database and get matching record IDs.
    handle = Entrez.esearch(db='nucleotide', term='human[orgn] AND BRCA1[gene]')
    record = Entrez.read(handle)
    handle.close()

    print(f"Found {record['Count']} records")
    print(f"IDs: {record['IdList']}")  # First 20 IDs by default
Key Parameters:
| Parameter | Description | Default |
|---|---|---|
| db | Database to search | Required |
| term | Search query | Required |
| retmax | Max IDs to return | 20 |
| retstart | Starting index (pagination) | 0 |
| usehistory | Store results on server | 'n' |
| sort | Sort order | database-specific |
| datetype | Date field to search | 'pdat' |
| reldate | Records from last N days | None |
| mindate | Start date (YYYY/MM/DD) | None |
| maxdate | End date (YYYY/MM/DD) | None |
ESearch Result Fields:
    record['Count']             # Total matching records (string)
    record['IdList']            # List of record IDs
    record['RetMax']            # Number of IDs returned
    record['RetStart']          # Starting index
    record['QueryKey']          # For history server (if usehistory='y')
    record['WebEnv']            # For history server (if usehistory='y')
    record['TranslationSet']    # Query translations applied
    record['QueryTranslation']  # Final translated query
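Because Count is a string and IdList holds at most retmax IDs, it can help to normalise a parsed result before acting on it. The following is a minimal sketch, assuming the record behaves like the dict returned by Entrez.read(); summarize_esearch is a hypothetical helper name (not part of Biopython) and the IDs are made up.

```python
def summarize_esearch(record):
    """Summarise a parsed ESearch record (hypothetical helper, not a Biopython API)."""
    total = int(record['Count'])      # Count arrives as a string
    returned = len(record['IdList'])  # at most retmax IDs are included
    return {
        'total': total,
        'returned': returned,
        'truncated': returned < total,  # True when IdList does not hold every match
    }


# Offline stand-in for Entrez.read(handle); the IDs are invented for illustration
example = {'Count': '57', 'IdList': ['101', '102', '103'], 'RetMax': '3', 'RetStart': '0'}
print(summarize_esearch(example))
```

A truncated result is the cue to raise retmax, page with retstart, or switch to the history server.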
Entrez.einfo() - Database Information
Get information about available databases or specific database fields.
    # List all available databases
    handle = Entrez.einfo()
    record = Entrez.read(handle)
    handle.close()
    print(record['DbList'])  # ['pubmed', 'protein', 'nucleotide', ...]

    # Get info about specific database
    handle = Entrez.einfo(db='nucleotide')
    record = Entrez.read(handle)
    handle.close()
    print(f"Description: {record['DbInfo']['Description']}")
    print(f"Record count: {record['DbInfo']['Count']}")

    # List searchable fields
    for field in record['DbInfo']['FieldList']:
        print(f"{field['Name']}: {field['Description']}")
Database Info Fields:
    record['DbInfo']['DbName']       # Database name
    record['DbInfo']['Description']  # Database description
    record['DbInfo']['Count']        # Total records in database
    record['DbInfo']['LastUpdate']   # Last update date
    record['DbInfo']['FieldList']    # Searchable fields
    record['DbInfo']['LinkList']     # Available links to other databases
Entrez.egquery() - Global Query
Search across all NCBI databases simultaneously.
    handle = Entrez.egquery(term='CRISPR')
    record = Entrez.read(handle)
    handle.close()

    for result in record['eGQueryResult']:
        count = result['Count']
        # Skip entries whose Count is not numeric (e.g. an error marker)
        if count.isdigit() and int(count) > 0:
            print(f"{result['DbName']}: {count} records")
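To decide which database to query next, the per-database counts can be ranked. This is a sketch only: rank_databases is a hypothetical helper, the counts below are invented, and treating a non-numeric Count as "no hits" is an assumption rather than documented behaviour.

```python
def rank_databases(record, top=5):
    """Return (db_name, count) pairs sorted by descending hit count (hypothetical helper)."""
    hits = []
    for result in record['eGQueryResult']:
        count = result['Count']
        # Assumption: a non-numeric Count means the database could not be searched
        if count.isdigit() and int(count) > 0:
            hits.append((result['DbName'], int(count)))
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:top]


# Offline stand-in for Entrez.read(handle); all counts are made up
example = {'eGQueryResult': [
    {'DbName': 'pubmed', 'Count': '120'},
    {'DbName': 'gene', 'Count': '0'},
    {'DbName': 'sra', 'Count': '4500'},
    {'DbName': 'books', 'Count': 'Error'},
]}
print(rank_databases(example))
```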
Search Query Syntax
NCBI uses a specific query syntax:
Field Tags
    # Search specific fields using [field_name]
    term = 'BRCA1[gene]'          # Gene name field
    term = 'human[orgn]'          # Organism field
    term = 'Homo sapiens[ORGN]'   # Full organism name
    term = 'NM_007294[accn]'      # Accession number
    term = 'Smith J[auth]'        # Author (PubMed)
    term = 'Nature[jour]'         # Journal (PubMed)
    term = '1000:5000[slen]'      # Sequence length range
    term = 'mRNA[fkey]'           # Feature key
Boolean Operators
    term = 'BRCA1 AND human'              # Both terms
    term = 'cancer OR tumor'              # Either term
    term = 'human NOT mouse'              # Exclude term
    term = '(BRCA1 OR BRCA2) AND human'   # Grouping
Date Ranges
    # Using date parameters
    handle = Entrez.esearch(
        db='pubmed',
        term='CRISPR',
        datetype='pdat',       # Publication date
        mindate='2023/01/01',
        maxdate='2024/12/31'
    )

    # Or in query string
    term = 'CRISPR AND 2024[pdat]'
    term = 'CRISPR AND 2023:2024[pdat]'
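When the date window is computed rather than typed by hand, building the mindate/maxdate strings from date objects avoids format slips. A minimal sketch, assuming the YYYY/MM/DD format shown above; date_window is a hypothetical helper name.

```python
from datetime import date


def date_window(start, end, datetype='pdat'):
    """Build ESearch date keyword arguments from two dates (hypothetical helper)."""
    if start > end:
        raise ValueError('start must not be after end')
    return {
        'datetype': datetype,
        'mindate': start.strftime('%Y/%m/%d'),
        'maxdate': end.strftime('%Y/%m/%d'),
    }


params = date_window(date(2023, 1, 1), date(2024, 12, 31))
print(params)
# Intended use: Entrez.esearch(db='pubmed', term='CRISPR', **params)
```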
Wildcards and Phrases
    term = 'immun*'                   # Wildcard
    term = '"breast cancer"[title]'   # Exact phrase
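The syntax rules above (field tags, Boolean operators, grouping, phrases) can be combined programmatically. The helpers below are a sketch with hypothetical names (tagged, any_of, all_of); quoting every multi-word value as a phrase is a design choice here, not an NCBI requirement.

```python
def tagged(value, field):
    """Return value[field], quoting multi-word values as an exact phrase."""
    if ' ' in value and not value.startswith('"'):
        value = f'"{value}"'
    return f'{value}[{field}]'


def any_of(*terms):
    """Join terms with OR, parenthesised so the group survives later AND-ing."""
    joined = ' OR '.join(terms)
    return f'({joined})' if len(terms) > 1 else joined


def all_of(*terms):
    """Join terms with AND."""
    return ' AND '.join(terms)


query = all_of(
    any_of(tagged('BRCA1', 'gene'), tagged('BRCA2', 'gene')),
    tagged('Homo sapiens', 'orgn'),
)
print(query)
```

The resulting string can be passed as the term argument of Entrez.esearch(); checking QueryTranslation afterwards shows how NCBI interpreted it.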
Common Databases
| Database | db value |
|---|---|
| PubMed | pubmed |
| Nucleotide | nucleotide |
| Protein | protein |
| Gene | gene |
| SRA | sra |
| Taxonomy | taxonomy |
| Assembly | assembly |
Code Patterns
Basic Search with Pagination
    from Bio import Entrez

    Entrez.email = 'your.email@example.com'

    def search_ncbi(db, term, max_results=100):
        handle = Entrez.esearch(db=db, term=term, retmax=max_results)
        record = Entrez.read(handle)
        handle.close()
        return record['IdList'], int(record['Count'])

    ids, total = search_ncbi('nucleotide', 'human[orgn] AND insulin[gene]')
    print(f'Retrieved {len(ids)} of {total} total records')
Paginated Search for Large Results
Goal: Retrieve all matching record IDs when the result set exceeds the default return limit.
Approach: First query with retmax=0 to get the total count, then page through results in batches using retstart offsets.
    def search_all_ids(db, term, batch_size=10000):
        all_ids = []

        # First request: fetch only the total count
        handle = Entrez.esearch(db=db, term=term, retmax=0)
        record = Entrez.read(handle)
        handle.close()
        total = int(record['Count'])

        # Page through the result set with retstart offsets.
        # Note: for db='pubmed', ESearch only returns the first 10,000 matches.
        for start in range(0, total, batch_size):
            handle = Entrez.esearch(db=db, term=term, retstart=start, retmax=batch_size)
            record = Entrez.read(handle)
            handle.close()
            all_ids.extend(record['IdList'])

        return all_ids
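A generator variant of the same paging idea yields one page at a time and pauses between requests. This is a sketch, not the skill's method: iter_id_batches and its search callback are hypothetical, the 0.34 s default delay is a conservative assumption for the 3 req/sec limit, and the example runs against an offline stand-in instead of NCBI.

```python
import time


def iter_id_batches(search, total, batch_size=10000, delay=0.34):
    """Yield ID lists page by page.

    `search(retstart, retmax)` is a caller-supplied function (hypothetical)
    that returns a parsed ESearch record containing 'IdList'.
    """
    for start in range(0, total, batch_size):
        record = search(start, batch_size)
        yield record['IdList']
        if start + batch_size < total:
            time.sleep(delay)  # pause only when another request follows


# Offline stand-in for a real search; the IDs are invented
fake_ids = [str(i) for i in range(25)]


def fake_search(retstart, retmax):
    return {'IdList': fake_ids[retstart:retstart + retmax]}


batches = list(iter_id_batches(fake_search, total=len(fake_ids), batch_size=10, delay=0))
print([len(batch) for batch in batches])
```

With Biopython, the callback could wrap Entrez.esearch(db=..., term=..., retstart=..., retmax=...) followed by Entrez.read(), using the total obtained from an initial retmax=0 request.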
Search with History Server (for Large Results)
Goal: Store search results on the NCBI server for efficient subsequent batch fetching without re-sending IDs.
Approach: Run esearch with usehistory='y' to get a WebEnv session key and QueryKey, then pass those to efetch for server-side retrieval.
    # Store results on NCBI server for subsequent fetching
    handle = Entrez.esearch(db='nucleotide', term='human[orgn] AND mRNA[fkey]', usehistory='y')
    record = Entrez.read(handle)
    handle.close()

    webenv = record['WebEnv']
    query_key = record['QueryKey']
    total = int(record['Count'])

    # Use webenv and query_key with efetch for batch downloads
    # See batch-downloads skill for details
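The hand-off from WebEnv/QueryKey to batch fetching can be sketched as a loop over retstart offsets. fetch_in_batches and its fetch callback are hypothetical names, the batch size of 500 is an arbitrary assumption, and the example uses a fake fetch function so it runs offline; the batch-downloads skill remains the reference for the full pattern.

```python
def fetch_in_batches(fetch, total, webenv, query_key, batch_size=500):
    """Yield one fetch result per batch of a history-server result set.

    `fetch` is a caller-supplied callable (hypothetical) that accepts
    retstart, retmax, webenv and query_key keyword arguments.
    """
    for start in range(0, total, batch_size):
        yield fetch(retstart=start, retmax=batch_size,
                    webenv=webenv, query_key=query_key)


# Offline stand-in that records how it was called
calls = []


def fake_fetch(**kwargs):
    calls.append(kwargs)
    return f"batch starting at {kwargs['retstart']}"


results = list(fetch_in_batches(fake_fetch, total=1200,
                                webenv='WEBENV_PLACEHOLDER', query_key='1'))
print(results)
```

With Biopython, fetch could be functools.partial(Entrez.efetch, db='nucleotide', rettype='fasta', retmode='text'); each result is then a handle to read and close.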
Recent Records Only
    # Records from last 30 days
    handle = Entrez.esearch(db='pubmed', term='CRISPR', reldate=30, datetype='pdat')
    record = Entrez.read(handle)
    handle.close()
Get Available Fields for a Database
    def get_search_fields(db):
        handle = Entrez.einfo(db=db)
        record = Entrez.read(handle)
        handle.close()
        return [(f['Name'], f['Description']) for f in record['DbInfo']['FieldList']]

    fields = get_search_fields('nucleotide')
    for name, desc in fields[:10]:
        print(f'{name}: {desc}')
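The field list can also be used to spot likely typos in field tags before a query is sent. A minimal sketch: unknown_tags is a hypothetical helper, it assumes each FieldList entry exposes 'Name' and 'FullName', and the sample list is an abbreviated stand-in rather than real EInfo output. NCBI may accept aliases that this check does not know, so a flagged tag is a hint, not proof of an error.

```python
import re


def unknown_tags(term, field_list):
    """Return field tags in `term` that match no Name or FullName (hypothetical helper)."""
    known = set()
    for field in field_list:
        known.add(field['Name'].lower())
        known.add(field['FullName'].lower())
    tags = re.findall(r'\[([^\[\]]+)\]', term)  # text inside each [...] pair
    return [tag for tag in tags if tag.lower() not in known]


# Abbreviated stand-in for record['DbInfo']['FieldList']
sample_fields = [
    {'Name': 'ORGN', 'FullName': 'Organism'},
    {'Name': 'GENE', 'FullName': 'Gene Name'},
    {'Name': 'ACCN', 'FullName': 'Accession'},
]
print(unknown_tags('human[orgn] AND BRCA1[gene] AND NM_007294[acn]', sample_fields))
```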
Check Query Translation
    handle = Entrez.esearch(db='nucleotide', term='human BRCA1')
    record = Entrez.read(handle)
    handle.close()

    # See how NCBI interpreted your query
    print(f"Your query was translated to: {record['QueryTranslation']}")
    # e.g., '"homo sapiens"[Organism] AND BRCA1[All Fields]'
Common Errors
| Error | Cause | Solution |
|---|---|---|
| | Rate limit exceeded | Add delays or use API key |
| | Invalid query syntax | Check field names and operators |
| Empty IdList | No matches or typo | Check QueryTranslation field |
| | Missing email | Set Entrez.email |
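For the rate-limit case, "add delays" can take the form of a retry with exponential backoff. This is a sketch under stated assumptions: with_retries is a hypothetical helper, it assumes the rate limit surfaces as urllib.error.HTTPError with code 429, and the retry count and delays are illustrative values. The example uses a fake call so it runs offline.

```python
import time
import urllib.error


def with_retries(call, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Run `call()`, retrying on HTTP 429 with exponentially growing delays."""
    for attempt in range(max_tries):
        try:
            return call()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a rate-limit error, or the final failure
            if err.code != 429 or attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Offline stand-in: fails twice with 429, then succeeds
attempts = {'n': 0}


def flaky():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise urllib.error.HTTPError('https://example.invalid', 429,
                                     'Too Many Requests', None, None)
    return 'ok'


waits = []
print(with_retries(flaky, sleep=waits.append), waits)
```

In real use, call would be a small function that runs Entrez.esearch() and Entrez.read() and returns the parsed record.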
Decision Tree
    Need to search NCBI?
    ├── Finding records in one database?
    │   └── Use Entrez.esearch()
    ├── Search across all databases?
    │   └── Use Entrez.egquery()
    ├── Need database field names?
    │   └── Use Entrez.einfo(db='database')
    ├── List all available databases?
    │   └── Use Entrez.einfo() (no db argument)
    ├── Results > 10,000 records?
    │   └── Use usehistory='y', then batch fetch
    └── Need to fetch actual records?
        └── See entrez-fetch skill
Related Skills
- entrez-fetch - Retrieve full records after searching
- entrez-link - Find related records in other databases
- batch-downloads - Download large result sets efficiently
- geo-data - Search GEO expression datasets (specialized search)
- blast-searches - Search by sequence similarity instead of keywords