SciAgent-Skills gene-database
Query NCBI Gene via E-utilities for curated gene records across 1M+ taxa. Retrieve official gene symbols, aliases, RefSeq accessions, summary descriptions, genomic coordinates, GO annotations, and interaction data. Use for gene ID resolution, cross-species queries, and gene function summaries. For sequence retrieval use Ensembl; for expression data use geo-database.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/gene-database" ~/.claude/skills/jaechang-hits-sciagent-skills-gene-database && rm -rf "$T"
skills/genomics-bioinformatics/gene-database/SKILL.md
NCBI Gene Database
Overview
NCBI Gene is the authoritative curated database for gene-centric information, covering 1M+ genes across hundreds of thousands of taxa. Each gene record includes the official symbol, aliases, full name, functional summary, genomic coordinates (GRCh38/GRCh37), RefSeq accessions, GO annotations, interaction partners, and links to related databases. Access is free via E-utilities REST API (no API key required, though recommended).
When to Use
- Resolving gene aliases and synonyms to the current official HGNC/NCBI symbol
- Fetching the NCBI Gene ID (integer) for a gene symbol for downstream API calls (e.g., dbSNP, ClinVar, GEO)
- Retrieving curated gene summaries and function descriptions programmatically
- Pulling RefSeq mRNA (NM_) and protein (NP_) accessions associated with a gene
- Querying GO functional annotations (Biological Process, Molecular Function, Cellular Component)
- Cross-species gene queries using the same Gene ID space
- For expression profiles across conditions use `geo-database`; for variant annotations use `clinvar-database` or `ensembl-database`
Prerequisites
- Python packages: `requests`, `xml.etree.ElementTree` (stdlib), `pandas` (optional)
- Data requirements: gene symbols, NCBI Gene IDs, or tax IDs
- Environment: internet connection; NCBI email required (set `email` parameter)
- Rate limits: 3 req/s unauthenticated; 10 req/s with free NCBI API key

```bash
pip install requests pandas
```
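The email, API key, and rate-limit requirements above can be bundled into one small helper. This is a minimal sketch: `eutils_params` and `min_interval` are hypothetical names rather than part of any NCBI library, and the sketch relies only on the `email` and `api_key` parameters and the 3 and 10 req/s limits stated above.

```python
def eutils_params(email, api_key=None, **extra):
    """Build a parameter dict shared by E-utilities calls (hypothetical helper)."""
    params = {"email": email, **extra}
    if api_key:  # only send api_key when one is configured
        params["api_key"] = api_key
    return params


def min_interval(api_key=None):
    """Seconds to wait between requests: 10 req/s with a key, 3 req/s without."""
    return 1 / 10 if api_key else 1 / 3


params = eutils_params("your@email.com", db="gene", retmode="json")
print(params)
print(f"Sleep {min_interval():.2f}s between unauthenticated calls")
```

Passing the result of `min_interval()` to `time.sleep()` between requests keeps a loop under the documented limit regardless of how fast each response arrives.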
Quick Start
```python
import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def gene_search(query, retmax=5):
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gene", "term": query, "retmax": retmax,
                             "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["esearchresult"]["idlist"]

# Find human BRCA1 gene ID
ids = gene_search("BRCA1[sym] AND Homo sapiens[orgn]")
print(f"Gene IDs for BRCA1: {ids}")
# → ['672']
```
Core API
Query 1: Search by Symbol, Name, or Function
Use ESearch with field tags for precise queries.
```python
import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Exact symbol match for human gene
r = requests.get(f"{BASE}/esearch.fcgi",
                 params={"db": "gene", "email": EMAIL, "retmode": "json",
                         "term": "TP53[sym] AND Homo sapiens[orgn] AND alive[prop]"})
ids = r.json()["esearchresult"]["idlist"]
print(f"TP53 Gene ID: {ids}")
# → ['7157']

# Search by function keyword
r = requests.get(f"{BASE}/esearch.fcgi",
                 params={"db": "gene", "email": EMAIL, "retmode": "json",
                         "term": "CRISPR[title] AND Homo sapiens[orgn]", "retmax": 5})
ids = r.json()["esearchresult"]["idlist"]
print(f"CRISPR-related gene IDs: {ids}")
```
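The query terms above are field-tagged clauses joined with `AND`. A minimal sketch of composing them programmatically follows; `build_term` is a hypothetical helper introduced only for illustration, and it uses only the `[sym]`, `[title]`, `[orgn]`, and `alive[prop]` tags that appear in this document.

```python
def build_term(value, field="sym", organism="Homo sapiens", alive=True):
    """Compose an Entrez query term from field-tagged clauses (hypothetical helper)."""
    clauses = [f"{value}[{field}]"]
    if organism:
        clauses.append(f"{organism}[orgn]")  # restrict to one species
    if alive:
        clauses.append("alive[prop]")        # skip discontinued records
    return " AND ".join(clauses)


print(build_term("TP53"))
# → TP53[sym] AND Homo sapiens[orgn] AND alive[prop]
print(build_term("CRISPR", field="title", alive=False))
# → CRISPR[title] AND Homo sapiens[orgn]
```

The returned string can be passed as the `term` parameter of an ESearch request.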
Query 2: Fetch Gene Summary (JSON/ESummary)
Retrieve key metadata fields for a list of Gene IDs.
```python
import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esummary_gene(gene_ids):
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gene", "id": ",".join(gene_ids),
                            "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["result"]

result = esummary_gene(["672", "675", "7157"])  # BRCA1, BRCA2, TP53
for uid in result.get("uids", []):
    g = result[uid]
    print(f"\n{g.get('name')} (ID {uid})")
    print(f"  Official symbol : {g.get('nomenclaturesymbol', g.get('name'))}")
    print(f"  Chr location    : {g.get('maplocation')}")
    print(f"  Summary (first 100): {g.get('summary', '')[:100]}...")
    print(f"  Aliases: {g.get('otheraliases', 'none')}")
```
Query 3: Fetch Full Gene Record (XML)
Retrieve the complete gene record in XML for RefSeq accessions, GO terms, and interaction data.
```python
import requests
import xml.etree.ElementTree as ET

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_gene_table(gene_id):
    """Fetch the plain-text gene table (not XML) for one Gene ID."""
    r = requests.get(f"{BASE}/efetch.fcgi",
                     params={"db": "gene", "id": gene_id, "rettype": "gene_table",
                             "retmode": "text", "email": EMAIL})
    r.raise_for_status()
    return r.text

# Get gene table (tab-delimited overview)
table = efetch_gene_table("672")
print(table[:500])

# XML for RefSeq accession extraction
r = requests.get(f"{BASE}/efetch.fcgi",
                 params={"db": "gene", "id": "672", "rettype": "xml",
                         "retmode": "xml", "email": EMAIL})
r.raise_for_status()
root = ET.fromstring(r.text)

# Extract RefSeq mRNA accessions
for ref in root.iter("Gene-commentary"):
    acc = ref.find("Gene-commentary_accession")
    ver = ref.find("Gene-commentary_version")
    if acc is not None and acc.text and acc.text.startswith("NM_"):
        print(f"RefSeq mRNA: {acc.text}.{ver.text if ver is not None else ''}")
```
Query 4: Batch Symbol-to-ID Mapping
Map a list of gene symbols to NCBI Gene IDs efficiently.
```python
import requests, time

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def symbols_to_ids(symbols, organism="Homo sapiens"):
    """Map gene symbols to NCBI Gene IDs. Returns dict {symbol: gene_id}."""
    mapping = {}
    for sym in symbols:
        r = requests.get(f"{BASE}/esearch.fcgi",
                         params={"db": "gene", "email": EMAIL, "retmode": "json",
                                 "term": f"{sym}[sym] AND {organism}[orgn] AND alive[prop]"})
        ids = r.json()["esearchresult"]["idlist"]
        mapping[sym] = ids[0] if ids else None
        time.sleep(0.1)
    return mapping

genes = ["EGFR", "KRAS", "BRAF", "PIK3CA", "PTEN"]
id_map = symbols_to_ids(genes)
for sym, gid in id_map.items():
    print(f"{sym:10s} → Gene ID {gid}")
```
Query 5: GO Annotation Retrieval
Parse GO terms from the gene XML record.
```python
import requests
import xml.etree.ElementTree as ET

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

r = requests.get(f"{BASE}/efetch.fcgi",
                 params={"db": "gene", "id": "7157", "rettype": "xml",
                         "retmode": "xml", "email": EMAIL})
r.raise_for_status()
root = ET.fromstring(r.text)

# Extract GO annotations
go_terms = []
for ref in root.iter("Gene-commentary"):
    heading = ref.find("Gene-commentary_heading")
    label = ref.find("Gene-commentary_label")
    # Guard against empty elements: .text is None when a tag has no content
    if heading is not None and heading.text and "Gene Ontology" in heading.text:
        if label is not None and label.text:
            go_terms.append(label.text)

print(f"TP53 GO terms ({len(go_terms)} found):")
for term in go_terms[:10]:
    print(f"  {term}")
```
Query 6: Cross-Species Gene Query
Find orthologs across species using NCBI Gene IDs.
```python
import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def find_ortholog(human_gene_id, target_organism):
    """Return up to 10 homolog Gene IDs linked from a human gene.

    Note: target_organism is not applied as a filter; the links cover all species.
    """
    r = requests.get(f"{BASE}/elink.fcgi",
                     params={"dbfrom": "gene", "db": "gene", "id": human_gene_id,
                             "linkname": "gene_gene_homolog",
                             "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    linksets = r.json().get("linksets", [])
    if not linksets:
        return []
    linksetdbs = linksets[0].get("linksetdbs", [])
    if not linksetdbs:
        return []
    # Links may be plain ID strings or dicts with an "id" key; handle both
    homolog_ids = [str(l["id"]) if isinstance(l, dict) else str(l)
                   for l in linksetdbs[0].get("links", [])]
    return homolog_ids[:10]

# Human TP53 (7157) homologs
homolog_ids = find_ortholog("7157", "Mus musculus")
print(f"Homolog Gene IDs for TP53: {homolog_ids}")
```
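`find_ortholog` accepts a `target_organism` argument but returns homolog IDs for every species. One way to narrow the list is to run the IDs through ESummary and keep only records from the target species. The sketch below assumes each ESummary record carries an `organism` object with a `scientificname` field; treat that field name as an assumption to verify against a live response. `filter_by_organism` is a hypothetical helper, and `sample` is a hand-written stand-in, not data fetched from NCBI.

```python
def filter_by_organism(summary_result, target_organism):
    """Keep Gene IDs whose ESummary record matches the target species.

    Assumes each record has organism.scientificname (verify against real output).
    """
    keep = []
    for uid in summary_result.get("uids", []):
        org = summary_result.get(uid, {}).get("organism", {})
        if org.get("scientificname", "").lower() == target_organism.lower():
            keep.append(uid)
    return keep


# Hand-written stand-in for an ESummary "result" payload (illustration only)
sample = {
    "uids": ["7157", "22059"],
    "7157": {"name": "TP53", "organism": {"scientificname": "Homo sapiens"}},
    "22059": {"name": "Trp53", "organism": {"scientificname": "Mus musculus"}},
}
print(filter_by_organism(sample, "Mus musculus"))
# → ['22059']
```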
Key Concepts
NCBI Gene ID vs. HGNC ID vs. Ensembl ID
NCBI Gene IDs are integers assigned per gene per organism (e.g., human TP53 = 7157). These are distinct from HGNC IDs (e.g., HGNC:11998) and Ensembl IDs (ENSG00000141510). Many downstream NCBI databases (ClinVar, dbSNP, GEO) use NCBI Gene IDs internally.
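Because the three identifier types have different shapes, a pipeline can check which kind it has received before calling an NCBI endpoint. The following is a rough sketch based only on the example formats above; `classify_gene_id` is a hypothetical helper, and the Ensembl pattern covers human `ENSG` gene IDs only (other species use different prefixes).

```python
import re

# Patterns inferred from the examples in this section (assumptions, not a spec)
PATTERNS = {
    "ncbi_gene": re.compile(r"^\d+$"),             # e.g. 7157
    "hgnc": re.compile(r"^HGNC:\d+$"),             # e.g. HGNC:11998
    "ensembl_gene": re.compile(r"^ENSG\d{11}$"),   # e.g. ENSG00000141510
}


def classify_gene_id(identifier):
    """Return the identifier family, or 'unknown' if no pattern matches."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(identifier):
            return kind
    return "unknown"


for example in ["7157", "HGNC:11998", "ENSG00000141510", "TP53"]:
    print(f"{example:18s} {classify_gene_id(example)}")
```

A symbol such as `TP53` falls through to `unknown`, which signals that it still needs an ESearch lookup to obtain a Gene ID.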
alive[prop] Filter
NCBI Gene records for discontinued genes have `status=discontinued`. Always add `AND alive[prop]` to symbol queries to exclude retired entries and avoid retrieving stale data.
Common Workflows
Workflow 1: Build a Gene Annotation Table
Goal: For a list of gene symbols, retrieve Gene ID, official name, chromosomal location, and description.
```python
import requests, time, pandas as pd

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_gene(sym, organism="Homo sapiens"):
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gene", "email": EMAIL, "retmode": "json",
                             "term": f"{sym}[sym] AND {organism}[orgn] AND alive[prop]"})
    ids = r.json()["esearchresult"]["idlist"]
    return ids[0] if ids else None

def batch_summary(gene_ids):
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gene", "id": ",".join(gene_ids),
                            "retmode": "json", "email": EMAIL})
    return r.json()["result"]

symbols = ["BRCA1", "BRCA2", "TP53", "EGFR", "MYC", "KRAS", "PTEN"]

# Step 1: Symbol → Gene ID
id_map = {}
for sym in symbols:
    gid = search_gene(sym)
    id_map[sym] = gid
    time.sleep(0.12)

# Step 2: Batch summary
valid_ids = [v for v in id_map.values() if v]
result = batch_summary(valid_ids)

rows = []
sym_to_id = {v: k for k, v in id_map.items() if v}
for uid in result.get("uids", []):
    g = result[uid]
    rows.append({
        "symbol": sym_to_id.get(uid, g.get("name")),
        "gene_id": uid,
        "full_name": g.get("description"),
        "chr_location": g.get("maplocation"),
        "summary": g.get("summary", "")[:200],
    })

df = pd.DataFrame(rows)
df.to_csv("gene_annotations.csv", index=False)
print(df[["symbol", "gene_id", "full_name", "chr_location"]].to_string(index=False))
```
Workflow 2: Find All Genes in a Pathway Keyword
Goal: Retrieve all human genes associated with a biological keyword from the NCBI Gene summary field.
```python
import requests, time, pandas as pd

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

keyword = "DNA mismatch repair"
r = requests.get(f"{BASE}/esearch.fcgi",
                 params={"db": "gene", "email": EMAIL, "retmode": "json", "retmax": 50,
                         "term": f"{keyword}[title/abstract] AND Homo sapiens[orgn] AND alive[prop]"})
ids = r.json()["esearchresult"]["idlist"]
print(f"Found {len(ids)} genes related to '{keyword}'")

# Fetch summaries
r2 = requests.post(f"{BASE}/esummary.fcgi",
                   data={"db": "gene", "id": ",".join(ids),
                         "retmode": "json", "email": EMAIL})
result = r2.json()["result"]

rows = []
for uid in result.get("uids", []):
    g = result[uid]
    rows.append({"gene_id": uid, "symbol": g.get("name"),
                 "description": g.get("description"),
                 "location": g.get("maplocation")})

df = pd.DataFrame(rows)
print(df.to_string(index=False))
df.to_csv(f"{keyword.replace(' ', '_')}_genes.csv", index=False)
```
Key Parameters
| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| `retmax` | ESearch | `20` | `1`–`10000` | Max records returned |
| `retmode` | ESearch/ESummary | `xml` | `json`, `xml` | Response format |
| `rettype` | EFetch | depends | `gene_table`, `xml` | Record format for full fetch |
| `[sym]` field tag | ESearch | — | gene symbol | Match exact official symbol only |
| `[orgn]` field tag | ESearch | — | organism name or tax ID | Filter by taxonomy |
| `alive[prop]` | ESearch | — | boolean flag | Exclude discontinued gene records |
Best Practices
- Always add `alive[prop]`: Discontinued gene records remain in the database. Without this filter, symbol searches may return outdated records.
- Use Gene IDs in pipelines: Downstream NCBI databases (ClinVar, dbSNP, GEO) accept Gene IDs; avoid re-searching by symbol in each call.
- Use ESummary for metadata, EFetch for full records: ESummary returns JSON with all common fields; EFetch XML is needed only for RefSeq accessions, GO terms, or interaction links.
- Register for a free API key: Raise your rate limit from 3 to 10 req/s at https://www.ncbi.nlm.nih.gov/account/. Pass the key as the `api_key` parameter.
- Batch with ESummary: POST up to 200 Gene IDs per call to ESummary instead of querying one at a time.
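The batching advice above can be sketched as a small chunking generator. `chunked` is a hypothetical helper name, and the 200-ID batch size is taken from the recommendation above; the ID list here is a placeholder for illustration.

```python
def chunked(ids, size=200):
    """Yield successive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


gene_ids = [str(i) for i in range(1, 451)]  # placeholder IDs, not real queries
batches = list(chunked(gene_ids))
print([len(b) for b in batches])
# → [200, 200, 50]
```

Each batch can then be joined with commas and sent as the `id` field of a single ESummary POST.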
Common Recipes
Recipe: Gene ID to RefSeq NM Accession
When to use: Get the canonical mRNA accession for a protein-coding gene.
```python
import requests, re

EMAIL = "your@email.com"
GENE_ID = "672"  # BRCA1

r = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
    params={"db": "gene", "id": GENE_ID, "rettype": "gene_table",
            "retmode": "text", "email": EMAIL}
)
nm_accessions = re.findall(r"NM_\d+\.\d+", r.text)
print(f"RefSeq mRNA accessions: {list(set(nm_accessions))}")
```
Recipe: Retrieve Gene Aliases
When to use: Resolve legacy/alias symbols to the current official NCBI symbol.
```python
import requests

EMAIL = "your@email.com"

# P53 is an alias for TP53
r = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "gene", "email": EMAIL, "retmode": "json",
            "term": "p53[sym] AND Homo sapiens[orgn] AND alive[prop]"}
)
ids = r.json()["esearchresult"]["idlist"]

r2 = requests.post("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
                   data={"db": "gene", "id": ",".join(ids[:1]),
                         "retmode": "json", "email": EMAIL})
g = r2.json()["result"][ids[0]]
print(f"Official symbol : {g.get('nomenclaturesymbol', g.get('name'))}")
print(f"Other aliases   : {g.get('otheraliases')}")
print(f"Designations    : {g.get('otherdesignations', '')[:100]}")
```
Recipe: List All Genes on a Chromosome
When to use: Get all protein-coding genes on a specific human chromosome.
```python
import requests

EMAIL = "your@email.com"

r = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "gene", "email": EMAIL, "retmode": "json", "retmax": 5,
            "term": "17[chr] AND Homo sapiens[orgn] AND protein coding[filter] AND alive[prop]"}
)
result = r.json()["esearchresult"]
print(f"Protein-coding genes on chr17: {result['count']} total")
print(f"Sample IDs: {result['idlist']}")
```
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Empty `idlist` for known symbol | Symbol is an alias, not the official term | Search with a broader field tag than `[sym]`; check aliases via ESummary |
| Wrong species returned | Missing organism filter | Add `Homo sapiens[orgn]` or the target tax ID (e.g., `txid9606[orgn]`) |
| Discontinued gene returned | Missing `alive[prop]` filter | Append `AND alive[prop]` to all symbol queries |
| HTTP `429` rate limit | Too many requests | Add `time.sleep()` between calls; use NCBI API key |
| ESummary response missing an expected key | All IDs invalid/absent | Check that ID values are valid integers, not empty strings |
| XML parse error | Malformed XML for rare genes | Wrap `ET.fromstring` in try/except and retry the request |
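Two of the fixes in the table, pausing between calls and wrapping `ET.fromstring` in try/except, can be sketched as reusable helpers. `safe_parse_xml` and `with_retries` are hypothetical names, and the exponential backoff schedule is an illustrative choice rather than an NCBI recommendation.

```python
import time
import xml.etree.ElementTree as ET


def safe_parse_xml(text):
    """Return the parsed root element, or None if the XML is malformed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


def with_retries(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the pause after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


print(safe_parse_xml("<a><b/></a>").tag)   # well-formed input parses normally
print(safe_parse_xml("<a><b></a>"))        # mismatched tag yields None
```

In practice `fetch` would be a zero-argument callable such as `lambda: requests.get(url, params=params)` followed by `raise_for_status()`, so that HTTP errors trigger the backoff.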
Related Skills
- `geo-database`: Gene Expression Omnibus for retrieving expression data linked to genes found here
- `clinvar-database`: Clinical variant data indexed by NCBI Gene IDs
- `ensembl-database`: Complementary gene annotations with VEP and comparative genomics
- `biopython-molecular-biology`: Biopython Entrez module wraps E-utilities with typed return values
References
- NCBI Gene database — Official homepage and search interface
- E-utilities documentation — Complete API reference for ESearch, ESummary, EFetch
- NCBI Gene field tags — Field tag reference for constructing Entrez queries
- NCBI API Key registration — Free registration for 10 req/s rate limit