SciAgent-Skills gene-database

Query NCBI Gene via E-utilities for curated gene records (1M+ genes across hundreds of thousands of taxa). Retrieve official gene symbols, aliases, RefSeq accessions, summary descriptions, genomic coordinates, GO annotations, and interaction data. Use for gene ID resolution, cross-species queries, and gene function summaries. For sequence retrieval use Ensembl; for expression data use geo-database.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/gene-database" ~/.claude/skills/jaechang-hits-sciagent-skills-gene-database && rm -rf "$T"
manifest: skills/genomics-bioinformatics/gene-database/SKILL.md

NCBI Gene Database

Overview

NCBI Gene is the authoritative curated database for gene-centric information, covering 1M+ genes across hundreds of thousands of taxa. Each gene record includes the official symbol, aliases, full name, functional summary, genomic coordinates (GRCh38/GRCh37), RefSeq accessions, GO annotations, interaction partners, and links to related databases. Access is free via E-utilities REST API (no API key required, though recommended).

When to Use

  • Resolving gene aliases and synonyms to the current official HGNC/NCBI symbol
  • Fetching the NCBI Gene ID (integer) for a gene symbol for downstream API calls (e.g., dbSNP, ClinVar, GEO)
  • Retrieving curated gene summaries and function descriptions programmatically
  • Pulling RefSeq mRNA (NM_) and protein (NP_) accessions associated with a gene
  • Querying GO functional annotations (Biological Process, Molecular Function, Cellular Component)
  • Cross-species gene queries using the same Gene ID space
  • For expression profiles across conditions use geo-database; for variant annotations use clinvar-database or ensembl-database

Prerequisites

  • Python packages: requests, xml.etree.ElementTree (stdlib), pandas (optional)
  • Data requirements: gene symbols, NCBI Gene IDs, or tax IDs
  • Environment: internet connection; NCBI email required (set the email parameter)
  • Rate limits: 3 req/s unauthenticated; 10 req/s with free NCBI API key
pip install requests pandas
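
The rate limits listed above can be enforced client-side with a small throttle. The following is a minimal sketch, not part of the original skill: Throttle, eutils_params, and throttle_for are hypothetical helper names, and the only E-utilities parameters used are email and api_key, both mentioned in this document.

```python
import time

class Throttle:
    """Sleep just long enough to keep calls at or below max_per_second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

def eutils_params(email, api_key=None, **extra):
    """Build a parameter dict; api_key is included only when provided."""
    params = {"email": email, **extra}
    if api_key:
        params["api_key"] = api_key
    return params

def throttle_for(api_key=None):
    # 3 req/s unauthenticated, 10 req/s with a key (limits stated above)
    return Throttle(10 if api_key else 3)
```

Call throttle.wait() immediately before each requests.get or requests.post, and pass eutils_params(...) as params or data.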

Quick Start

import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def gene_search(query, retmax=5):
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gene", "term": query,
                             "retmax": retmax, "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["esearchresult"]["idlist"]

# Find human BRCA1 gene ID
ids = gene_search("BRCA1[sym] AND Homo sapiens[orgn]")
print(f"Gene IDs for BRCA1: {ids}")  # → ['672']

Core API

Query 1: Search by Symbol, Name, or Function

Use ESearch with field tags for precise queries.

import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Exact symbol match for human gene
r = requests.get(f"{BASE}/esearch.fcgi",
                 params={"db": "gene", "email": EMAIL, "retmode": "json",
                         "term": "TP53[sym] AND Homo sapiens[orgn] AND alive[prop]"})
r.raise_for_status()  # fail loudly on HTTP errors such as 429
ids = r.json()["esearchresult"]["idlist"]
print(f"TP53 Gene ID: {ids}")  # → ['7157']

# Search by function keyword
r = requests.get(f"{BASE}/esearch.fcgi",
                 params={"db": "gene", "email": EMAIL, "retmode": "json",
                         "term": "CRISPR[title] AND Homo sapiens[orgn]", "retmax": 5})
r.raise_for_status()
ids = r.json()["esearchresult"]["idlist"]
print(f"CRISPR-related gene IDs: {ids}")

Query 2: Fetch Gene Summary (JSON/ESummary)

Retrieve key metadata fields for a list of Gene IDs.

import requests

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esummary_gene(gene_ids):
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gene", "id": ",".join(gene_ids),
                            "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["result"]

result = esummary_gene(["672", "675", "7157"])  # BRCA1, BRCA2, TP53

for uid in result.get("uids", []):
    g = result[uid]
    print(f"\n{g.get('name')} (ID {uid})")
    print(f"  Official symbol : {g.get('nomenclaturesymbol', g.get('name'))}")
    print(f"  Chr location    : {g.get('maplocation')}")
    print(f"  Summary (first 100): {g.get('summary', '')[:100]}...")
    print(f"  Aliases: {g.get('otheraliases', 'none')}")

Query 3: Fetch Full Gene Record (XML)

Retrieve the complete gene record in XML for RefSeq accessions, GO terms, and interaction data.

import requests
import xml.etree.ElementTree as ET

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_gene_table(gene_id):
    """Fetch the tab-delimited gene table for one Gene ID as plain text."""
    r = requests.get(f"{BASE}/efetch.fcgi",
                     params={"db": "gene", "id": gene_id,
                             "rettype": "gene_table", "retmode": "text", "email": EMAIL})
    r.raise_for_status()
    return r.text

# Get gene table (tab-delimited overview)
table = efetch_gene_table("672")
print(table[:500])

# Full XML record for RefSeq accession extraction
r = requests.get(f"{BASE}/efetch.fcgi",
                 params={"db": "gene", "id": "672",
                         "rettype": "xml", "retmode": "xml", "email": EMAIL})
r.raise_for_status()
root = ET.fromstring(r.text)

# Extract RefSeq mRNA accessions (NM_ prefix)
for ref in root.iter("Gene-commentary"):
    acc = ref.findtext("Gene-commentary_accession")
    ver = ref.findtext("Gene-commentary_version")
    if acc and acc.startswith("NM_"):
        # Append the version only when the record provides one
        print(f"RefSeq mRNA: {acc}.{ver}" if ver else f"RefSeq mRNA: {acc}")

Query 4: Batch Symbol-to-ID Mapping

Map a list of gene symbols to NCBI Gene IDs efficiently.

import requests, time

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def symbols_to_ids(symbols, organism="Homo sapiens"):
    """Map gene symbols to NCBI Gene IDs. Returns dict {symbol: gene_id}."""
    mapping = {}
    for sym in symbols:
        r = requests.get(f"{BASE}/esearch.fcgi",
                         params={"db": "gene", "email": EMAIL, "retmode": "json",
                                 "term": f"{sym}[sym] AND {organism}[orgn] AND alive[prop]"})
        r.raise_for_status()
        ids = r.json()["esearchresult"]["idlist"]
        mapping[sym] = ids[0] if ids else None  # None when no live record matches
        time.sleep(0.35)  # stay under 3 req/s without an API key
    return mapping

genes = ["EGFR", "KRAS", "BRAF", "PIK3CA", "PTEN"]
id_map = symbols_to_ids(genes)
for sym, gid in id_map.items():
    print(f"{sym:10s} → Gene ID {gid}")

Query 5: GO Annotation Retrieval

Parse GO terms from the gene XML record.

import requests
import xml.etree.ElementTree as ET

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

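r = requests.get(f"{BASE}/efetch.fcgi",
                 params={"db": "gene", "id": "7157",
                         "rettype": "xml", "retmode": "xml", "email": EMAIL})
r.raise_for_status()
root = ET.fromstring(r.text)

# Extract GO annotations
go_terms = []
for ref in root.iter("Gene-commentary"):
    # findtext returns None when the heading is absent, so default to ""
    heading = ref.findtext("Gene-commentary_heading") or ""
    # Ignore spacing and case so both "Gene Ontology" and "GeneOntology" match
    if heading.replace(" ", "").lower() != "geneontology":
        continue
    # Assumption: term names sit in nested Other-source_anchor elements
    # under the GO block; verify against a real record before relying on it.
    for anchor in ref.iter("Other-source_anchor"):
        if anchor.text:
            go_terms.append(anchor.text)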

print(f"TP53 GO terms ({len(go_terms)} found):")
for term in go_terms[:10]:
    print(f"  {term}")

Query 6: Cross-Species Gene Query

Find orthologs across species using NCBI Gene IDs.

import requests, time

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

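def find_ortholog(human_gene_id, target_organism):
    """Return up to 10 homolog Gene IDs linked from human_gene_id.

    Note: target_organism is accepted for readability only; the ELink
    results are not filtered by species here.
    """
    r = requests.get(f"{BASE}/elink.fcgi",
                     params={"dbfrom": "gene", "db": "gene",
                             "id": human_gene_id, "linkname": "gene_gene_homolog",
                             "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    linksets = r.json().get("linksets", [])
    if not linksets:
        return []
    linksetdbs = linksets[0].get("linksetdbs", [])
    if not linksetdbs:
        return []  # no links of this type for the gene
    # Links may arrive as plain ID strings or as {"id": ...} objects; accept both
    links = linksetdbs[0].get("links", [])
    homolog_ids = [str(l["id"]) if isinstance(l, dict) else str(l) for l in links]
    return homolog_ids[:10]

# Human TP53 (7157) homologs
homolog_ids = find_ortholog("7157", "Mus musculus")
print(f"Homolog Gene IDs for TP53: {homolog_ids}")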
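
find_ortholog accepts a target_organism argument, but the ELink request carries no species filter, so the returned IDs can belong to any organism. One way to narrow them is to run the IDs through ESummary and keep only those from the target species. The sketch below operates on an already-fetched ESummary result dict; filter_by_organism is a hypothetical helper, and the organism / scientificname field layout is an assumption about the Gene ESummary JSON that should be checked against a live response.

```python
def filter_by_organism(esummary_result, organism):
    """Keep Gene IDs whose ESummary organism name matches `organism`.

    esummary_result is the dict found under the "result" key of an
    ESummary JSON response (same shape as used in Query 2).
    """
    wanted = organism.strip().lower()
    kept = []
    for uid in esummary_result.get("uids", []):
        record = esummary_result.get(uid, {})
        # Assumed layout: {"organism": {"scientificname": "Mus musculus", ...}}
        name = record.get("organism", {}).get("scientificname", "")
        if name.strip().lower() == wanted:
            kept.append(uid)
    return kept
```

For example, filter_by_organism(esummary_gene(homolog_ids), "Mus musculus") would reuse the esummary_gene helper from Query 2.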

Key Concepts

NCBI Gene ID vs. HGNC ID vs. Ensembl ID

NCBI Gene IDs are integers assigned per gene per organism (e.g., human TP53 = 7157). These are distinct from HGNC IDs (e.g., HGNC:11998) and Ensembl IDs (ENSG00000141510). Many downstream NCBI databases (ClinVar, dbSNP, GEO) use NCBI Gene IDs internally.
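
Because the three identifier styles look different, a pipeline can route them before making any API call. This is a minimal sketch with a hypothetical helper name, classify_gene_identifier; the Ensembl pattern covers only human gene IDs of the ENSG form shown above, which is an assumption that does not extend to other species or to transcript and protein IDs.

```python
import re

def classify_gene_identifier(value):
    """Return 'ncbi_gene', 'hgnc', 'ensembl_gene', or 'unknown'."""
    text = str(value).strip()
    if re.fullmatch(r"\d+", text):
        return "ncbi_gene"       # plain integer, e.g. 7157
    if re.fullmatch(r"HGNC:\d+", text):
        return "hgnc"            # e.g. HGNC:11998
    if re.fullmatch(r"ENSG\d{11}", text):
        return "ensembl_gene"    # human Ensembl gene, e.g. ENSG00000141510
    return "unknown"             # likely a symbol; resolve it via ESearch
```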

alive[prop] Filter

NCBI Gene records for discontinued genes have status=discontinued. Always add AND alive[prop] to symbol queries to exclude retired entries and avoid retrieving stale data.
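
To make the filter hard to forget, the search term can be assembled in one place. A minimal sketch, assuming a hypothetical helper named build_gene_term; it uses only the field tags that appear in the queries in this document.

```python
def build_gene_term(symbol, organism="Homo sapiens", alive_only=True):
    """Assemble an ESearch term for db=gene from its parts."""
    parts = [f"{symbol}[sym]", f"{organism}[orgn]"]
    if alive_only:
        parts.append("alive[prop]")  # exclude discontinued records
    return " AND ".join(parts)

print(build_gene_term("TP53"))
# → TP53[sym] AND Homo sapiens[orgn] AND alive[prop]
```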

Common Workflows

Workflow 1: Build a Gene Annotation Table

Goal: For a list of gene symbols, retrieve Gene ID, official name, chromosomal location, and description.

import requests, time, pandas as pd

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_gene(sym, organism="Homo sapiens"):
    """Return the first live Gene ID for a symbol, or None if nothing matches."""
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "gene", "email": EMAIL, "retmode": "json",
                             "term": f"{sym}[sym] AND {organism}[orgn] AND alive[prop]"})
    r.raise_for_status()
    ids = r.json()["esearchresult"]["idlist"]
    return ids[0] if ids else None

def batch_summary(gene_ids):
    """Fetch ESummary records for several Gene IDs in one POST."""
    r = requests.post(f"{BASE}/esummary.fcgi",
                      data={"db": "gene", "id": ",".join(gene_ids),
                            "retmode": "json", "email": EMAIL})
    r.raise_for_status()
    return r.json()["result"]

symbols = ["BRCA1", "BRCA2", "TP53", "EGFR", "MYC", "KRAS", "PTEN"]

# Step 1: Symbol → Gene ID
id_map = {}
for sym in symbols:
    gid = search_gene(sym)
    id_map[sym] = gid
    time.sleep(0.35)  # stay under 3 req/s without an API key

# Step 2: Batch summary
valid_ids = [v for v in id_map.values() if v]
result = batch_summary(valid_ids)

rows = []
id_to_sym = {v: k for k, v in id_map.items() if v}  # Gene ID → queried symbol
for uid in result.get("uids", []):
    g = result[uid]
    rows.append({
        "symbol": id_to_sym.get(uid, g.get("name")),
        "gene_id": uid,
        "full_name": g.get("description"),
        "chr_location": g.get("maplocation"),
        "summary": g.get("summary", "")[:200],
    })

df = pd.DataFrame(rows)
df.to_csv("gene_annotations.csv", index=False)
print(df[["symbol", "gene_id", "full_name", "chr_location"]].to_string(index=False))

Workflow 2: Find All Genes in a Pathway Keyword

Goal: Retrieve all human genes associated with a biological keyword from the NCBI Gene summary field.

import requests, time, pandas as pd

EMAIL = "your@email.com"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

keyword = "DNA mismatch repair"
r = requests.get(f"{BASE}/esearch.fcgi",
                 params={"db": "gene", "email": EMAIL, "retmode": "json",
                         "retmax": 50,
                         "term": f"{keyword}[title/abstract] AND Homo sapiens[orgn] AND alive[prop]"})
r.raise_for_status()
ids = r.json()["esearchresult"]["idlist"]
print(f"Found {len(ids)} genes related to '{keyword}'")

# Fetch summaries (skip the call when nothing matched; ESummary needs at least one ID)
result = {}
if ids:
    r2 = requests.post(f"{BASE}/esummary.fcgi",
                       data={"db": "gene", "id": ",".join(ids), "retmode": "json", "email": EMAIL})
    r2.raise_for_status()
    result = r2.json().get("result", {})

rows = []
for uid in result.get("uids", []):
    g = result[uid]
    rows.append({"gene_id": uid, "symbol": g.get("name"),
                 "description": g.get("description"),
                 "location": g.get("maplocation")})

df = pd.DataFrame(rows)
print(df.to_string(index=False))
df.to_csv(f"{keyword.replace(' ', '_')}_genes.csv", index=False)

Key Parameters

| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| retmax | ESearch | 20 | 1 to 10000 | Max records returned |
| retmode | ESearch/ESummary | "xml" | "json", "xml" | Response format |
| rettype | EFetch | depends | "xml", "gene_table", "text" | Record format for full fetch |
| [sym] field tag | ESearch | | gene symbol | Match exact official symbol only |
| [orgn] field tag | ESearch | | organism name or tax ID | Filter by taxonomy |
| alive[prop] | ESearch | | boolean flag | Exclude discontinued gene records |
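
When a search matches more records than a single retmax allows, results can be read in pages. The sketch below only builds the parameter dicts; esearch_pages is a hypothetical helper, and it relies on retstart, the E-utilities offset parameter, which is not otherwise covered in this document.

```python
def esearch_pages(term, total, email, page_size=500):
    """Yield ESearch parameter dicts that together cover `total` hits."""
    for start in range(0, total, page_size):
        yield {
            "db": "gene",
            "term": term,
            "retmode": "json",
            "email": email,
            "retstart": start,                          # offset of first record
            "retmax": min(page_size, total - start),    # size of this page
        }
```

The total can be read from the count field of a first ESearch response, as done in the chromosome recipe in this document.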

Best Practices

  1. Always add alive[prop]: Discontinued gene records remain in the database. Without this filter, symbol searches may return outdated records.

  2. Use Gene IDs in pipelines: Downstream NCBI databases (ClinVar, dbSNP, GEO) accept Gene IDs; avoid re-searching by symbol in each call.

  3. Use ESummary for metadata, EFetch for full records: ESummary returns JSON with all common fields; EFetch XML is needed only for RefSeq accessions, GO terms, or interaction links.

  4. Register for a free API key: Raise your rate limit from 3 to 10 req/s at https://www.ncbi.nlm.nih.gov/account/. Pass the key as the api_key parameter.

  5. Batch with ESummary: POST up to 200 Gene IDs per call to ESummary instead of querying one at a time.
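
The batching advice in item 5 can be sketched as follows. chunked and esummary_payloads are hypothetical helper names; the block only prepares the POST bodies and leaves the actual requests.post call to the caller.

```python
def chunked(ids, size=200):
    """Yield successive lists of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def esummary_payloads(gene_ids, email, size=200):
    """Build one ESummary POST body per chunk of Gene IDs."""
    return [
        {"db": "gene", "id": ",".join(chunk), "retmode": "json", "email": email}
        for chunk in chunked(gene_ids, size)
    ]
```

Each payload can then be sent with requests.post(f"{BASE}/esummary.fcgi", data=payload), pausing between calls to respect the rate limit.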

Common Recipes

Recipe: Gene ID to RefSeq NM Accession

When to use: Get the canonical mRNA accession for a protein-coding gene.

import requests, re

EMAIL = "your@email.com"
GENE_ID = "672"  # BRCA1

r = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
    params={"db": "gene", "id": GENE_ID, "rettype": "gene_table",
            "retmode": "text", "email": EMAIL}
)
nm_accessions = re.findall(r"NM_\d+\.\d+", r.text)
print(f"RefSeq mRNA accessions: {list(set(nm_accessions))}")

Recipe: Retrieve Gene Aliases

When to use: Resolve legacy/alias symbols to the current official NCBI symbol.

import requests

EMAIL = "your@email.com"

# P53 is an alias for TP53
r = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "gene", "email": EMAIL, "retmode": "json",
            "term": "p53[sym] AND Homo sapiens[orgn] AND alive[prop]"}
)
r.raise_for_status()
ids = r.json()["esearchresult"]["idlist"]

if not ids:
    # Avoid an IndexError below when the search returns nothing
    print("No live human gene record matched; try another field tag")
else:
    r2 = requests.post("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
                       data={"db": "gene", "id": ids[0],
                             "retmode": "json", "email": EMAIL})
    r2.raise_for_status()
    g = r2.json()["result"][ids[0]]
    print(f"Official symbol : {g.get('nomenclaturesymbol', g.get('name'))}")
    print(f"Other aliases   : {g.get('otheraliases')}")
    print(f"Designations    : {g.get('otherdesignations', '')[:100]}")

Recipe: List All Genes on a Chromosome

When to use: Get all protein-coding genes on a specific human chromosome.

import requests

EMAIL = "your@email.com"

r = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "gene", "email": EMAIL, "retmode": "json", "retmax": 5,
            "term": "17[chr] AND Homo sapiens[orgn] AND protein coding[filter] AND alive[prop]"}
)
result = r.json()["esearchresult"]
print(f"Protein-coding genes on chr17: {result['count']} total")
print(f"Sample IDs: {result['idlist']}")

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Empty idlist for known symbol | Symbol is an alias, not the official term | Use [gene name] or [title] field tag; check aliases via ESummary |
| Wrong species returned | Missing organism filter | Add AND Homo sapiens[orgn] or target tax ID (9606[taxid]) |
| Discontinued gene returned | Missing alive[prop] filter | Append AND alive[prop] to all symbol queries |
| HTTP 429 rate limit | Too many requests | Add time.sleep(0.35) between calls; use NCBI API key |
| ESummary missing uids key | All IDs invalid/absent | Check id values are valid integers, not empty strings |
| XML parse error | Malformed XML for rare genes | Wrap ET.fromstring in try/except; retry with rettype=text |
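
Two of the fixes above, backing off on HTTP 429 and guarding XML parsing, can be wrapped in small helpers. This is a sketch under stated assumptions: call_with_backoff and parse_xml_or_none are hypothetical names, the doubling delay schedule is an illustrative choice rather than an NCBI recommendation, and send is any zero-argument callable returning an object with a status_code attribute (for example a lambda around requests.get).

```python
import time
import xml.etree.ElementTree as ET

def call_with_backoff(send, max_tries=4, base_delay=0.35, sleep=time.sleep):
    """Call send() until the response is not HTTP 429 or tries run out."""
    response = None
    for attempt in range(max_tries):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_tries - 1:
            sleep(base_delay * (2 ** attempt))  # 0.35 s, 0.7 s, 1.4 s, ...
    return response  # still 429 after max_tries; caller decides what to do

def parse_xml_or_none(text):
    """Return the parsed root element, or None if the XML is malformed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None
```

Usage: resp = call_with_backoff(lambda: requests.get(url, params=params)), followed by resp.raise_for_status() and parse_xml_or_none(resp.text).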

Related Skills

  • geo-database: Gene Expression Omnibus for retrieving expression data linked to genes found here
  • clinvar-database: Clinical variant data indexed by NCBI Gene IDs
  • ensembl-database: Complementary gene annotations with VEP and comparative genomics
  • biopython-molecular-biology: Biopython Entrez module wraps E-utilities with typed return values

References