SciAgent-Skills gget-genomic-databases

Unified CLI/Python interface to 20+ genomic databases. Use for quick gene lookups (Ensembl search/info/seq), BLAST/BLAT sequence alignment, AlphaFold structure prediction, enrichment analysis (Enrichr), disease/drug associations (OpenTargets), single-cell data (CELLxGENE), cancer genomics (cBioPortal/COSMIC), and expression correlation (ARCHS4). Covers genomics, proteomics, and disease domains. For batch processing or advanced BLAST use biopython; for multi-database Python SDK workflows use bioservices.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/gget-genomic-databases" ~/.claude/skills/jaechang-hits-sciagent-skills-gget-genomic-databases && rm -rf "$T"
manifest: skills/genomics-bioinformatics/gget-genomic-databases/SKILL.md
source content

gget — Unified Genomic Database Access

Overview

gget is a command-line and Python package providing unified access to 20+ genomic databases and analysis methods. Query gene information, sequences, protein structures, expression data, and disease associations through a consistent interface. All modules work as both CLI tools and Python functions, returning DataFrames (Python) or JSON/CSV (CLI).

When to Use

  • Looking up gene information (names, IDs, descriptions) across species from Ensembl
  • Retrieving nucleotide or protein sequences for Ensembl gene/transcript IDs
  • Running BLAST or BLAT searches against standard reference databases
  • Predicting protein 3D structures with AlphaFold2 from amino acid sequences
  • Performing gene set enrichment analysis (GO, KEGG, disease terms) via Enrichr
  • Querying single-cell RNA-seq datasets from CELLxGENE Census
  • Finding disease and drug associations for a gene target via OpenTargets
  • Downloading Ensembl reference genomes and annotations for a species
  • Finding cancer mutations and genomic alterations via cBioPortal or COSMIC
  • Getting tissue expression and correlated genes from ARCHS4
  • For batch processing or advanced BLAST parameters, use biopython instead
  • For programmatic multi-database workflows with rate limiting, use bioservices instead

Prerequisites

  • Python packages: gget
  • Optional setup: Some modules require gget setup <module> before first use (alphafold, cellxgene, elm, gpt)
  • Environment: Clean virtual environment recommended to avoid dependency conflicts
  • API notes: gget queries remote databases, so rate-limit large batch queries with time.sleep(). Databases update biweekly; keep gget updated. Max ~1000 Ensembl IDs per gget.info() call

pip install gget

# Optional: setup modules that need additional dependencies
gget setup alphafold   # ~4GB model parameters, requires OpenMM
gget setup cellxgene   # cellxgene-census package
gget setup elm         # local ELM database

Quick Start

import gget

# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")

# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048"])
print(f"Gene: {info.iloc[0]['primary_gene_name']}")

# Enrichment analysis on a gene list
enrichment = gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology")
print(f"Enriched terms: {len(enrichment)}")

Core API

Module 1: Reference & Gene Search (ref, search, info, seq)

Query Ensembl for gene references, search by keywords, retrieve gene metadata, and fetch sequences.

import gget

# Search for genes by keyword
results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
print(f"Found {len(results)} genes")
print(results[["ensembl_id", "gene_name", "biotype"]].head())

# Get detailed gene information (Ensembl + UniProt + NCBI)
info = gget.info(["ENSG00000012048", "ENSG00000139618"])
print(f"Gene info columns: {list(info.columns)}")
import gget

# Retrieve sequences (gget.seq returns a FASTA-style list: header, sequence, header, ...)
nucleotide_seqs = gget.seq(["ENSG00000012048"])
protein_seqs = gget.seq(["ENSG00000012048"], translate=True, isoforms=True)
print(f"Retrieved {len(protein_seqs) // 2} isoform sequences")

# Download reference genome files (specify release for reproducibility)
ref_links = gget.ref("homo_sapiens", which="gtf", release=112)
print(f"GTF download link: {ref_links}")
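Sequence results are easier to work with once headers and sequences are paired up. The following is a minimal sketch, assuming the result is a flat list that alternates FASTA headers and sequences; `fasta_list_to_dict` is a hypothetical helper (not part of gget), and the example list is made-up stand-in data rather than real output.

```python
def fasta_list_to_dict(entries):
    """Pair alternating header/sequence items into {identifier: sequence}."""
    if len(entries) % 2 != 0:
        raise ValueError("expected alternating header and sequence entries")
    records = {}
    for header, sequence in zip(entries[0::2], entries[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"not a FASTA header: {header!r}")
        # Use the first whitespace-delimited token after '>' as the identifier
        records[header[1:].split()[0]] = sequence
    return records


# Made-up stand-in for a gget.seq() result (assumed layout)
example = [">ENST0001 example isoform 1", "MKWMF", ">ENST0002 example isoform 2", "MKW"]
records = fasta_list_to_dict(example)
print({name: len(seq) for name, seq in records.items()})
```

Keying by identifier makes it straightforward to look up one isoform or to filter by sequence length before alignment.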

Module 2: Sequence Alignment (blast, blat, muscle, diamond)

BLAST/BLAT remote searches, multiple sequence alignment, and fast local alignment.

import gget
import time

# BLAST against SwissProt (remote API — add delay for batch queries)
blast_results = gget.blast(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    database="swissprot", limit=10
)
print(f"Top hit: {blast_results.iloc[0]['Description']}, E-value: {blast_results.iloc[0]['E value']}")
time.sleep(2)  # Rate-limit between BLAST queries

# BLAT — find genomic position (UCSC)
blat_results = gget.blat("ATCGATCGATCGATCGATCG", assembly="human")
print(f"Genomic location: {blat_results.iloc[0]['chromosome']}:{blat_results.iloc[0]['start']}")
import gget

# Multiple sequence alignment with Muscle5 (out= writes the aligned FASTA to disk)
gget.muscle("sequences.fasta", out="aligned.afa")

# Fast local alignment with DIAMOND (local, no rate limit needed)
diamond_results = gget.diamond(
    "GGETISAWESQME",
    reference="reference.fasta",
    sensitivity="very-sensitive",
    threads=4
)
print(f"Alignments found: {len(diamond_results)}")
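For more than one remote query, the delay can be factored into a small loop. This is a sketch under stated assumptions: `run_rate_limited` and `fake_blast` are hypothetical names introduced here, and the stub stands in for a remote call such as `gget.blast` so the example runs offline.

```python
import time


def run_rate_limited(queries, query_fn, delay_seconds=2.0):
    """Call query_fn on each query, sleeping between consecutive calls."""
    results = []
    for i, query in enumerate(queries):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every call except the first
        results.append(query_fn(query))
    return results


def fake_blast(sequence):
    # Stub standing in for a remote query; returns a trivial summary
    return {"query_length": len(sequence)}


hits = run_rate_limited(["MKWMF", "GGETISAWESQME"], fake_blast, delay_seconds=0.1)
print(hits)
```

In real use, `query_fn` could be a lambda wrapping `gget.blast` with the desired database and limit; local tools such as DIAMOND do not need the delay.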

Module 3: Protein Structure (pdb, alphafold, elm)

Download PDB structures, predict structures with AlphaFold2, find linear motifs.

import gget

# Download PDB structure
pdb_data = gget.pdb("7S7U", save=True)

# Predict structure with AlphaFold2 (requires gget setup alphafold)
structure = gget.alphafold(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    plot=True, show_sidechains=True
)
print("Structure prediction complete, PDB file saved")
import gget

# Find Eukaryotic Linear Motifs (requires gget setup elm)
ortholog_df, regex_df = gget.elm("LIAQSIGQASFV")
print(f"Ortholog motifs: {len(ortholog_df)}, Regex motifs: {len(regex_df)}")

Module 4: Expression & Correlation (archs4, cellxgene, bgee)

Gene expression, tissue expression, correlated genes, single-cell data.

import gget

# Tissue expression from ARCHS4
tissue_expr = gget.archs4("ACE2", which="tissue")
print(f"Expression across {len(tissue_expr)} tissues")

# Correlated genes from ARCHS4
correlated = gget.archs4("ACE2", which="correlation")
print(f"Top correlated gene: {correlated.iloc[0]['gene_symbol']}")
import gget

# Single-cell data from CELLxGENE (requires gget setup cellxgene)
adata = gget.cellxgene(
    gene=["ACE2", "TMPRSS2"],
    tissue="lung",
    cell_type="epithelial cell",
    census_version="2023-07-25"  # pin version for reproducibility
)
print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")

# Orthologs and expression from Bgee
orthologs = gget.bgee("ENSG00000169194", type="orthologs")
print(f"Orthologs in {len(orthologs)} species")

Module 5: Disease & Drug Associations (opentargets, enrichr)

Disease associations, drug targets, enrichment analysis.

import gget

# Disease associations from OpenTargets
diseases = gget.opentargets("ENSG00000169194", resource="diseases", limit=10)
print(f"Associated diseases: {len(diseases)}")

# Drug associations
drugs = gget.opentargets("ENSG00000169194", resource="drugs", limit=10)
print(f"Associated drugs: {len(drugs)}")

# OpenTargets resources: diseases, drugs, tractability, pharmacogenetics,
#   expression, depmap, interactions
import gget

# Enrichment analysis via Enrichr
# Database shortcuts: 'pathway' (KEGG), 'transcription' (ChEA),
#   'ontology' (GO_BP), 'diseases_drugs' (GWAS), 'celltypes' (PanglaoDB)
enrichment = gget.enrichr(
    ["ACE2", "AGT", "AGTR1", "TMPRSS2", "DPP4"],
    database="ontology"
)
print(f"Enriched terms: {len(enrichment)}")
print(enrichment[["path_name", "adj_p_val"]].head())
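Enrichment output usually needs a significance cut before interpretation. The sketch below filters plain records by adjusted p-value; the key names `path_name` and `adj_p_val` are assumptions about the result layout, `significant_terms` is a hypothetical helper, and the records are invented for illustration.

```python
def significant_terms(records, alpha=0.05):
    """Keep records below the adjusted p-value cutoff, most significant first."""
    kept = [r for r in records if r["adj_p_val"] < alpha]
    return sorted(kept, key=lambda r: r["adj_p_val"])


# Invented example records (assumed keys, not real Enrichr output)
example_records = [
    {"path_name": "term A", "adj_p_val": 0.20},
    {"path_name": "term B", "adj_p_val": 0.001},
    {"path_name": "term C", "adj_p_val": 0.03},
]
top = significant_terms(example_records)
print([r["path_name"] for r in top])
```

With a DataFrame result, the same filter is a boolean mask on the adjusted p-value column followed by a sort.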

Module 6: Cancer Genomics (cbio, cosmic)

Cancer mutations, copy number alterations, and somatic mutation databases.

import gget

# Search cBioPortal studies
studies = gget.cbio_search(["breast", "lung"])
print(f"Studies found: {len(studies)}")

# Plot cancer genomics heatmap
gget.cbio_plot(
    ["msk_impact_2017"],
    ["AKT1", "ALK", "BRAF"],
    stratification="tissue",
    variation_type="mutation_occurrences"
)
import gget

# COSMIC: requires account + local database download
# First-time: gget.cosmic(searchterm="", download_cosmic=True,
#   email="user@example.com", password="xxx", cosmic_project="cancer")
cosmic_results = gget.cosmic("EGFR", cosmic_tsv_path="cosmic_data.tsv", limit=10)
print(f"COSMIC mutations: {len(cosmic_results)}")

Module 7: Mutation Generation & Utilities (mutate, setup)

Generate mutated sequences and manage module dependencies.

import gget
import pandas as pd

# Generate mutated sequences from mutation annotations
mutations_df = pd.DataFrame({
    "seq_ID": ["seq1", "seq1"],
    "mutation": ["c.4G>T", "c.10del"]
})
mutated = gget.mutate(["ATCGCTAAGCTGATCG"], mutations=mutations_df)
print(f"Generated {len(mutated)} mutated sequences")
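To make the mutation notation concrete, here is a minimal pure-Python sketch of what `c.4G>T` and `c.10del` mean for the example sequence. It only illustrates the 1-based coordinate convention; it is not how `gget.mutate` is implemented, and `apply_mutation` is a hypothetical helper that handles only single-base substitutions and simple deletions.

```python
import re


def apply_mutation(sequence, mutation):
    """Apply a 1-based substitution (c.4G>T) or deletion (c.10del, c.3_5del)."""
    sub = re.fullmatch(r"c\.(\d+)([ACGT])>([ACGT])", mutation)
    if sub:
        pos, ref, alt = int(sub.group(1)), sub.group(2), sub.group(3)
        if sequence[pos - 1] != ref:
            raise ValueError(f"reference base at {pos} is {sequence[pos - 1]}, not {ref}")
        return sequence[:pos - 1] + alt + sequence[pos:]
    deletion = re.fullmatch(r"c\.(\d+)(?:_(\d+))?del", mutation)
    if deletion:
        start = int(deletion.group(1))
        end = int(deletion.group(2)) if deletion.group(2) else start
        return sequence[:start - 1] + sequence[end:]
    raise ValueError(f"unsupported mutation: {mutation}")


reference = "ATCGCTAAGCTGATCG"
substituted = apply_mutation(reference, "c.4G>T")
deleted = apply_mutation(reference, "c.10del")
print(substituted, deleted)
```

Checking the reference base before substituting catches off-by-one errors between 0-based and 1-based coordinates early.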

Key Concepts

Module Overview

gget organizes 20+ modules by domain. The Python interface uses gget.<module>():

| Domain | Modules | Primary Database |
|---|---|---|
| Gene reference | ref, search, info, seq | Ensembl, UniProt, NCBI |
| Sequence alignment | blast, blat, muscle, diamond | NCBI BLAST, UCSC, local |
| Protein structure | pdb, alphafold, elm | RCSB PDB, AlphaFold2, ELM |
| Expression | archs4, cellxgene, bgee | ARCHS4, CZ CELLxGENE, Bgee |
| Disease/drugs | opentargets, enrichr | OpenTargets, Enrichr |
| Cancer | cbio, cosmic | cBioPortal, COSMIC |
| Utilities | mutate, setup, gpt | local / OpenAI |

Output Formats

| Context | Default Format | Alternatives |
|---|---|---|
| Python | DataFrame or dict | json=True for JSON; save=True to file |
| CLI | JSON | -csv for CSV; -o file to save |
| Sequences | FASTA (seq, mutate) | -- |
| Structures | PDB file (pdb, alphafold) | JSON alignment error data |
| Single-cell | AnnData object (cellxgene) | meta_only=True for metadata only |
| Visualization | PNG (cbio plot) | show=True for interactive display |

Enrichr Database Shortcuts

| Shortcut | Full Database Name |
|---|---|
| 'pathway' | KEGG_2021_Human |
| 'transcription' | ChEA_2016 |
| 'ontology' | GO_Biological_Process_2021 |
| 'diseases_drugs' | GWAS_Catalog_2019 |
| 'celltypes' | PanglaoDB_Augmented_2021 |

Custom libraries: pass any Enrichr library name directly (e.g., "Jensen_TISSUES").

OpenTargets Resources

| Resource | Description |
|---|---|
| diseases | Disease associations with evidence scores |
| drugs | Drug associations and clinical trial data |
| tractability | Target tractability assessment |
| pharmacogenetics | Pharmacogenetic variants |
| expression | Baseline tissue expression |
| depmap | DepMap gene-disease effects |
| interactions | Protein-protein interactions |

Reproducibility

Pin database versions for consistent results across analyses:

import gget
# Pin Ensembl release
ref = gget.ref("homo_sapiens", release=112)

# Pin CELLxGENE Census version
adata = gget.cellxgene(gene=["ACE2"], census_version="2023-07-25")

# Always record gget version
print(f"gget version: {gget.__version__}")
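One way to make the pinned versions auditable is to write them next to the results. The sketch below builds a small provenance record with the standard library; `build_provenance` is a hypothetical helper and the version string is a placeholder (in a real run it would be `gget.__version__`).

```python
import json
from datetime import datetime, timezone


def build_provenance(tool_version, ensembl_release, census_version):
    """Collect the pinned versions and a UTC timestamp into one record."""
    return {
        "gget_version": tool_version,
        "ensembl_release": ensembl_release,
        "census_version": census_version,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# "0.0.0" is a placeholder; pass gget.__version__ in a real analysis
record = build_provenance("0.0.0", 112, "2023-07-25")
print(json.dumps(record, indent=2))
```

Saving this JSON alongside each output file lets a later re-run confirm it used the same release and Census version.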

Common Workflows

Workflow 1: Gene Discovery to Functional Analysis

Goal: Find genes of interest, get their sequences, and perform enrichment analysis.

import gget

# 1. Search for genes
results = gget.search(["GABA", "receptor"], species="homo_sapiens")
gene_ids = results["ensembl_id"].tolist()[:10]

# 2. Get detailed information
info = gget.info(gene_ids)
print(f"Retrieved info for {len(info)} genes")

# 3. Get protein sequences
sequences = gget.seq(gene_ids, translate=True)

# 4. Find correlated genes (ARCHS4 expects a gene symbol unless ensembl=True)
correlated = gget.archs4(gene_ids[0], ensembl=True, which="correlation")

# 5. Enrichment analysis on correlated genes
gene_list = correlated["gene_symbol"].tolist()[:50]
enrichment = gget.enrichr(gene_list, database="ontology")
print(f"Top enriched term: {enrichment.iloc[0]['path_name']}")

Workflow 2: Target Validation for Drug Discovery

Goal: Investigate a gene's disease associations, druggability, and cancer mutations.

import gget

gene_id = "ENSG00000169194"  # IL13

# 1. Disease associations
diseases = gget.opentargets(gene_id, resource="diseases", limit=20)

# 2. Drug associations
drugs = gget.opentargets(gene_id, resource="drugs")

# 3. Tractability assessment
tractability = gget.opentargets(gene_id, resource="tractability")

# 4. Protein interactions
interactions = gget.opentargets(gene_id, resource="interactions")
print(f"Diseases: {len(diseases)}, Drugs: {len(drugs)}, Interactions: {len(interactions)}")

# 5. Cancer genomics (gene symbol matching gene_id above)
gget.cbio_plot(["msk_impact_2017"], ["IL13"], stratification="cancer_type")

Workflow 3: Comparative Genomics

Goal: Compare a gene across species using orthologs and sequence alignment.

import gget

# 1. Find orthologs
orthologs = gget.bgee("ENSG00000169194", type="orthologs")

# 2. Get sequences for human and mouse (gget.seq returns [header, sequence, ...])
human_seq = gget.seq("ENSG00000169194", translate=True)
mouse_seq = gget.seq("ENSMUSG00000026091", translate=True)

# 3. Align sequences (pass the sequence strings, not the FASTA-style lists)
alignment = gget.muscle([human_seq[1], mouse_seq[1]])

# 4. Get human protein structure from PDB
pdb_structure = gget.pdb("7S7U")
print("Comparative analysis complete")
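A common follow-up to the alignment step is a percent-identity figure. This is a minimal sketch, assuming two aligned strings of equal length with `-` as the gap character; `percent_identity` is a hypothetical helper, the sequences are invented, and the denominator counts only gap-free columns, which is one of several conventions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns where both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Invented aligned sequences: one gap column, one mismatch
identity = percent_identity("MKW-FKED", "MKWMFRED")
print(f"{identity:.1f}% identity")
```

Reporting which denominator was used matters when comparing identity values across tools.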

Key Parameters

| Parameter | Module(s) | Default | Range / Options | Effect |
|---|---|---|---|---|
| species | search, archs4, cellxgene, enrichr | "homo_sapiens" | Any Ensembl species; shortcuts: 'human', 'mouse' | Target organism |
| limit | blast, opentargets, cosmic | 50 / 100 | 1-1000 | Maximum results returned |
| database | blast, enrichr | varies | blast: nt/nr/swissprot/pdbaa; enrichr: shortcuts or library names | Target database for query |
| which | ref, archs4 | varies | ref: gtf, cdna, dna, cds, pep; archs4: correlation, tissue | Data type to retrieve |
| translate | seq | False | True / False | Return amino acid instead of nucleotide sequences |
| resource | opentargets | "diseases" | diseases, drugs, tractability, pharmacogenetics, expression, depmap, interactions | OpenTargets data type |
| release | ref, search | latest | Integer Ensembl release number | Pin database version for reproducibility |
| census_version | cellxgene | "stable" | "stable", "latest", date string | Pin CELLxGENE Census version |
| sensitivity | diamond, elm | "very-sensitive" | fast to ultra-sensitive | Alignment sensitivity vs speed |
| threads | diamond, elm | 1 | 1-N | CPU threads for alignment |
| multimer_recycles | alphafold | 3 | 3-20 | Higher = more accurate multimer prediction |

Best Practices

  1. Pin database versions for reproducibility: Use release=112 for Ensembl and census_version="2023-07-25" for CELLxGENE to ensure consistent results across analyses.

  2. Rate-limit batch queries: gget queries remote APIs. Add time.sleep(2) between BLAST/BLAT queries in loops. For gget.info(), limit to ~1000 IDs per call.

  3. Keep gget updated: Databases change their structure biweekly. Run pip install --upgrade gget regularly to avoid breakage from schema changes.

  4. Use Python interface for pipelines, CLI for exploration: Python functions return DataFrames suitable for chaining. CLI with -csv is better for quick one-off lookups.

  5. Check PDB before running AlphaFold: gget.pdb() is instant; AlphaFold prediction takes minutes to hours. Always check if the structure already exists in PDB.

  6. Use database shortcuts in enrichr: The shortcuts ('pathway', 'ontology', etc.) map to curated Enrichr libraries. For custom analyses, pass any Enrichr library name directly.

  7. Cache cBioPortal data for repeated analyses: Use the data_dir="./cache" parameter to avoid re-downloading large cancer genomics datasets.
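The "check PDB before running AlphaFold" practice can be sketched as a lookup-then-fallback helper. Everything here is an assumption for illustration: `get_structure` is a hypothetical function, and the two callables are stubs standing in for a PDB download and a structure prediction, so the example runs without network access or model weights.

```python
def get_structure(pdb_id, sequence, fetch_pdb, predict):
    """Return ("pdb", data) if a deposited structure is found, else predict."""
    if pdb_id:
        try:
            structure = fetch_pdb(pdb_id)
        except Exception:
            structure = None  # treat any lookup failure as "not available"
        if structure:
            return "pdb", structure
    return "predicted", predict(sequence)


# Stubs standing in for the real download and prediction calls
def stub_fetch(pdb_id):
    return f"PDB:{pdb_id}"


def stub_predict(sequence):
    return f"PREDICTED:{len(sequence)}aa"


found = get_structure("7S7U", "MKWMF", stub_fetch, stub_predict)
fallback = get_structure(None, "MKWMF", stub_fetch, stub_predict)
print(found, fallback)
```

Passing the fetch and predict steps in as callables keeps the expensive prediction behind an explicit fallback and makes the logic easy to test.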

Common Recipes

Recipe: Batch Gene Information Retrieval

When to use: Need information for many genes at once (up to ~1000 IDs per call).

import gget
import pandas as pd
import time

gene_ids = ["ENSG00000012048", "ENSG00000139618", "ENSG00000141510"]
info = gget.info(gene_ids)
info.to_csv("gene_info_batch.csv")
print(f"Saved info for {len(info)} genes")

# For >1000 genes, batch with rate limiting
all_ids = [f"ENSG{i:011d}" for i in range(2000)]  # placeholder IDs; substitute real Ensembl IDs
results = []
for i in range(0, len(all_ids), 500):
    batch = all_ids[i:i+500]
    results.append(gget.info(batch))
    time.sleep(1)
combined = pd.concat(results)  # merge the per-batch DataFrames into one table

Recipe: Custom Enrichment with Background

When to use: Running enrichment against a custom background gene set.

import gget

# Use specific Enrichr library with background genes
enrichment = gget.enrichr(
    ["ACE2", "AGT", "AGTR1"],
    database="Jensen_TISSUES",
    background_list=["ACE2", "AGT", "AGTR1", "TP53", "BRCA1", "MYC"]
)
print(enrichment[["path_name", "adj_p_val"]].head())

Recipe: AlphaFold Structure Prediction with Visualization

When to use: Predicting and visualizing protein structures with confidence coloring.

import gget

# Predict with visualization (PAE + 3D structure)
result = gget.alphafold(
    "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
    plot=True,
    show_sidechains=True,
    relax=True  # AMBER relaxation for final structure
)
# Output: PDB file + predicted aligned error (PAE) JSON
# PAE heatmap auto-generated with plot=True

Recipe: Download Reference Genome for RNA-seq Pipeline

When to use: Setting up reference files for RNA-seq alignment pipelines.

# Download GTF and cDNA for human (specific release)
gget ref -w gtf,cdna -d -r 112 homo_sapiens

# Download genome DNA
gget ref -w dna -d homo_sapiens

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| ModuleNotFoundError: gget | Package not installed | pip install gget in clean virtual environment |
| gget setup alphafold fails | Python version incompatibility | Use Python 3.8-3.10; check gget --version |
| Empty BLAST results | Sequence too short or no matches | Try longer sequence, different database, or megablast=False (CLI: --megablast_off) |
| cellxgene gene not found | Case-sensitive gene symbols | Use 'ACE2' for human, 'Ace2' for mouse (exact capitalization required) |
| gget info timeout | Too many IDs at once | Limit to ~1000 Ensembl IDs per call; batch with time.sleep() |
| Database structure changed | gget databases update biweekly | pip install --upgrade gget |
| COSMIC authentication error | Missing or expired credentials | Re-enter email/password; check COSMIC account status |
| AlphaFold out of memory | Protein too long for GPU memory | Use shorter sequences or split into domains |
| Different results on re-run | Database updated between runs | Pin versions: release=112 for Ensembl, census_version for CELLxGENE |
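The capitalization issue for single-cell queries can be handled before the call. This is a rough sketch that applies the common naming convention (all caps for human, leading capital for mouse); `format_symbol` is a hypothetical helper, the species strings are assumed values, and real gene symbols have exceptions to this convention, so treat the output as a starting guess.

```python
def format_symbol(symbol, species):
    """Apply the usual symbol casing: ACE2 for human, Ace2 for mouse."""
    if species == "homo_sapiens":
        return symbol.upper()
    if species == "mus_musculus":
        return symbol.capitalize()  # first letter upper, rest lower
    return symbol  # leave other species untouched


print(format_symbol("ace2", "homo_sapiens"), format_symbol("TMPRSS2", "mus_musculus"))
```

Normalizing symbols up front avoids silent empty results when a gene list mixes casing styles.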

Bundled Resources

2 reference files provide extended coverage of capabilities from the original 3 reference files and 3 script files:

  1. references/module_parameters.md
    — Consolidates module_reference.md (468 lines). Covers: detailed parameter tables for all 15+ modules with types, defaults, and return value descriptions; CLI vs Python interface differences; setup requirements per module. Relocated inline: most-used module parameters (Core API code blocks), output format summary (Key Concepts table). Omitted: gget gpt module details — trivial OpenAI wrapper, not genomics-specific.

  2. references/databases_workflows.md
    — Consolidates database_info.md (301 lines) and workflows.md (815 lines). Covers: complete database directory with update frequencies and citation info, extended workflow examples (building reference indices, disease-drug pipeline, multi-species comparative analysis), data consistency and reproducibility guidance. Relocated inline: core database overview (Key Concepts table), top 3 workflows (Common Workflows), reproducibility patterns (Key Concepts). Omitted: scripts/ content (3 files, 590 lines total) — thin wrappers around gget API calls for CLI automation; core patterns absorbed into Core API and Common Workflows.

Related Skills

  • biopython — advanced BLAST parameters, batch sequence processing, GenBank record parsing
  • bioservices — programmatic multi-database queries with built-in rate limiting (UniProt, KEGG, ChEMBL)
  • anndata-data-structure — working with AnnData objects returned by gget.cellxgene()
  • enrichr — deeper enrichment analysis with custom gene set libraries

References