SciAgent-Skills gget-genomic-databases
Unified CLI/Python interface to 20+ genomic databases. Use for quick gene lookups (Ensembl search/info/seq), BLAST/BLAT sequence alignment, AlphaFold structure prediction, enrichment analysis (Enrichr), disease/drug associations (OpenTargets), single-cell data (CELLxGENE), cancer genomics (cBioPortal/COSMIC), and expression correlation (ARCHS4). Covers genomics, proteomics, and disease domains. For batch processing or advanced BLAST use biopython; for multi-database Python SDK workflows use bioservices.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/gget-genomic-databases" ~/.claude/skills/jaechang-hits-sciagent-skills-gget-genomic-databases && rm -rf "$T"
gget: Unified Genomic Database Access
Overview
gget is a command-line and Python package providing unified access to 20+ genomic databases and analysis methods. Query gene information, sequences, protein structures, expression data, and disease associations through a consistent interface. All modules work as both CLI tools and Python functions, returning DataFrames (Python) or JSON/CSV (CLI).
When to Use
- Looking up gene information (names, IDs, descriptions) across species from Ensembl
- Retrieving nucleotide or protein sequences for Ensembl gene/transcript IDs
- Running BLAST or BLAT searches against standard reference databases
- Predicting protein 3D structures with AlphaFold2 from amino acid sequences
- Performing gene set enrichment analysis (GO, KEGG, disease terms) via Enrichr
- Querying single-cell RNA-seq datasets from CELLxGENE Census
- Finding disease and drug associations for a gene target via OpenTargets
- Downloading Ensembl reference genomes and annotations for a species
- Finding cancer mutations and genomic alterations via cBioPortal or COSMIC
- Getting tissue expression and correlated genes from ARCHS4
- For batch processing or advanced BLAST parameters, use biopython instead
- For programmatic multi-database workflows with rate limiting, use bioservices instead
Prerequisites
- Python packages: `gget`
- Optional setup: Some modules require `gget setup <module>` before first use (alphafold, cellxgene, elm, gpt)
- Environment: Clean virtual environment recommended to avoid dependency conflicts
- API notes: gget queries remote databases, so rate-limit large batch queries with `time.sleep()`. Upstream databases change their structure over time and gget is tested against them biweekly; keep gget updated. Max ~1000 Ensembl IDs per `gget.info()` call
    pip install gget

    # Optional: set up modules that need additional dependencies
    gget setup alphafold   # ~4GB model parameters, requires OpenMM
    gget setup cellxgene   # cellxgene-census package
    gget setup elm         # local ELM database
Quick Start
    import gget

    # Search for genes by keyword
    results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
    print(f"Found {len(results)} genes")

    # Get detailed gene information (Ensembl + UniProt + NCBI)
    info = gget.info(["ENSG00000012048"])
    print(f"Gene: {info.iloc[0]['primary_gene_name']}")

    # Enrichment analysis on a gene list
    enrichment = gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology")
    print(f"Enriched terms: {len(enrichment)}")
Core API
Module 1: Reference & Gene Search (ref, search, info, seq)
Query Ensembl for gene references, search by keywords, retrieve gene metadata, and fetch sequences.
    import gget

    # Search for genes by keyword
    results = gget.search(["BRCA1", "tumor suppressor"], species="homo_sapiens")
    print(f"Found {len(results)} genes")
    print(results[["ensembl_id", "gene_name", "biotype"]].head())

    # Get detailed gene information (Ensembl + UniProt + NCBI)
    info = gget.info(["ENSG00000012048", "ENSG00000139618"])
    print(f"Gene info columns: {list(info.columns)}")
    import gget

    # Retrieve sequences
    nucleotide_seqs = gget.seq(["ENSG00000012048"])
    protein_seqs = gget.seq(["ENSG00000012048"], translate=True, isoforms=True)
    # Assumes the result alternates FASTA headers and sequences, so halve the length
    print(f"Retrieved {len(protein_seqs) // 2} isoform sequences")

    # Download reference genome files (specify release for reproducibility)
    ref_links = gget.ref("homo_sapiens", which="gtf", release=112)
    print(f"GTF download link: {ref_links}")
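Sequence results are easier to reuse once they are keyed by identifier. The following is a minimal sketch, assuming `gget.seq` returns a flat list in which FASTA header lines (starting with `>`) are followed by their sequence strings. `fasta_list_to_dict` is a hypothetical helper, not part of gget, and the sample list is made up for illustration.

```python
def fasta_list_to_dict(fasta_lines):
    """Convert ['>id description', 'SEQ', ...] into {'id': 'SEQ', ...}.

    Assumes headers start with '>' and that every line up to the next
    header belongs to the current record (so wrapped sequences also work).
    """
    records = {}
    current_id = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("FASTA header without an identifier")
            # Keep only the first whitespace-delimited token as the record ID
            current_id = parts[0]
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence line found before any FASTA header")
        else:
            records[current_id] += line
    return records


# Illustrative input only (not real gget output)
example = [">tx1 example transcript", "ATGC", "GGTA", ">tx2", "TTAA"]
parsed = fasta_list_to_dict(example)
print(parsed)
```

With a real query, the same helper would be called as `fasta_list_to_dict(gget.seq([...]))`, provided the output follows the assumed layout.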
Module 2: Sequence Alignment (blast, blat, muscle, diamond)
BLAST/BLAT remote searches, multiple sequence alignment, and fast local alignment.
    import gget
    import time

    # BLAST against SwissProt (remote API; add a delay for batch queries)
    blast_results = gget.blast(
        "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
        database="swissprot",
        limit=10
    )
    # Column names follow the NCBI results table; check blast_results.columns if a key is missing
    print(f"Top hit: {blast_results.iloc[0]['Description']}, E-value: {blast_results.iloc[0]['E value']}")
    time.sleep(2)  # Rate-limit between BLAST queries

    # BLAT: find genomic position (UCSC)
    blat_results = gget.blat("ATCGATCGATCGATCGATCG", assembly="human")
    print(f"Genomic location: chr{blat_results.iloc[0]['chromosome']}:{blat_results.iloc[0]['start']}")
    import gget

    # Multiple sequence alignment with Muscle5
    # (the output path is passed via `out`; the file name here is an example)
    aligned = gget.muscle("sequences.fasta", out="aligned.afa")

    # Fast local alignment with DIAMOND (local, no rate limit needed)
    diamond_results = gget.diamond(
        "GGETISAWESQME",
        reference="reference.fasta",
        sensitivity="very-sensitive",
        threads=4
    )
    print(f"Alignments found: {len(diamond_results)}")
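Both alignment calls above read sequences from FASTA files such as `sequences.fasta` and `reference.fasta`. As a sketch of how such a file might be prepared from in-memory sequences, the hypothetical helper below writes a name-to-sequence mapping with wrapped lines. It is not part of gget, and the peptide sequences are made up for the demonstration.

```python
import os
import tempfile


def write_fasta(records, path, width=60):
    """Write {name: sequence} to a FASTA file, wrapping lines at `width`."""
    with open(path, "w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n")
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
    return path


# Demo with made-up peptide sequences, written to a temporary directory
tmp_dir = tempfile.mkdtemp()
fasta_path = write_fasta(
    {"seq_a": "MKWMFKEDHSLEHRCVESAK", "seq_b": "MKWMFKEDHSIEHRCVESAK"},
    os.path.join(tmp_dir, "sequences.fasta"),
)
with open(fasta_path) as handle:
    print(handle.read())
```

The resulting path can then be passed wherever a FASTA file path is expected.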
Module 3: Protein Structure (pdb, alphafold, elm)
Download PDB structures, predict structures with AlphaFold2, find linear motifs.
    import gget

    # Download PDB structure
    pdb_data = gget.pdb("7S7U", save=True)

    # Predict structure with AlphaFold2 (requires gget setup alphafold)
    structure = gget.alphafold(
        "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
        plot=True,
        show_sidechains=True
    )
    print("Structure prediction complete, PDB file saved")
    import gget

    # Find Eukaryotic Linear Motifs (requires gget setup elm)
    ortholog_df, regex_df = gget.elm("LIAQSIGQASFV")
    print(f"Ortholog motifs: {len(ortholog_df)}, Regex motifs: {len(regex_df)}")
Module 4: Expression & Correlation (archs4, cellxgene, bgee)
Gene expression, tissue expression, correlated genes, single-cell data.
    import gget

    # Tissue expression from ARCHS4
    tissue_expr = gget.archs4("ACE2", which="tissue")
    print(f"Expression across {len(tissue_expr)} tissues")

    # Correlated genes from ARCHS4
    correlated = gget.archs4("ACE2", which="correlation")
    print(f"Top correlated gene: {correlated.iloc[0]['gene_symbol']}")
    import gget

    # Single-cell data from CELLxGENE (requires gget setup cellxgene)
    adata = gget.cellxgene(
        gene=["ACE2", "TMPRSS2"],
        tissue="lung",
        cell_type="epithelial cell",
        census_version="2023-07-25"  # pin version for reproducibility
    )
    print(f"Cells: {adata.n_obs}, Genes: {adata.n_vars}")

    # Orthologs and expression from Bgee
    orthologs = gget.bgee("ENSG00000169194", type="orthologs")
    print(f"Orthologs in {len(orthologs)} species")
Module 5: Disease & Drug Associations (opentargets, enrichr)
Disease associations, drug targets, enrichment analysis.
    import gget

    # Disease associations from OpenTargets
    diseases = gget.opentargets("ENSG00000169194", resource="diseases", limit=10)
    print(f"Associated diseases: {len(diseases)}")

    # Drug associations
    drugs = gget.opentargets("ENSG00000169194", resource="drugs", limit=10)
    print(f"Associated drugs: {len(drugs)}")

    # OpenTargets resources: diseases, drugs, tractability, pharmacogenetics,
    # expression, depmap, interactions
    import gget

    # Enrichment analysis via Enrichr
    # Database shortcuts: 'pathway' (KEGG), 'transcription' (ChEA),
    # 'ontology' (GO_BP), 'diseases_drugs' (GWAS), 'celltypes' (PanglaoDB)
    enrichment = gget.enrichr(
        ["ACE2", "AGT", "AGTR1", "TMPRSS2", "DPP4"],
        database="ontology"
    )
    print(f"Enriched terms: {len(enrichment)}")
    # Column names as of recent gget releases; check enrichment.columns if they differ
    print(enrichment[["path_name", "adj_p_val"]].head())
Module 6: Cancer Genomics (cbio, cosmic)
Cancer mutations, copy number alterations, and somatic mutation databases.
    import gget

    # Search cBioPortal studies
    studies = gget.cbio_search(["breast", "lung"])
    print(f"Studies found: {len(studies)}")

    # Plot cancer genomics heatmap
    gget.cbio_plot(
        ["msk_impact_2017"],
        ["AKT1", "ALK", "BRAF"],
        stratification="tissue",
        variation_type="mutation_occurrences"
    )
    import gget

    # COSMIC: requires account + local database download
    # First-time: gget.cosmic(searchterm="", download_cosmic=True,
    #     email="user@example.com", password="xxx", cosmic_project="cancer")
    cosmic_results = gget.cosmic("EGFR", cosmic_tsv_path="cosmic_data.tsv", limit=10)
    print(f"COSMIC mutations: {len(cosmic_results)}")
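Because the first-time COSMIC download takes an email and password, it can help to keep those out of scripts. The sketch below reads them from environment variables. The variable names `COSMIC_EMAIL` and `COSMIC_PASSWORD` and the helper `cosmic_credentials` are hypothetical choices for illustration, not something gget defines.

```python
import os


def cosmic_credentials(env=None):
    """Return (email, password) read from environment variables.

    COSMIC_EMAIL and COSMIC_PASSWORD are hypothetical variable names.
    """
    env = os.environ if env is None else env
    email = env.get("COSMIC_EMAIL")
    password = env.get("COSMIC_PASSWORD")
    missing = [
        name
        for name, value in (("COSMIC_EMAIL", email), ("COSMIC_PASSWORD", password))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return email, password


# Demo with an in-memory mapping instead of the real environment
email, password = cosmic_credentials(
    {"COSMIC_EMAIL": "user@example.com", "COSMIC_PASSWORD": "not-a-real-password"}
)
print(f"Would log in as {email}")
```

The returned values could then be passed as the `email` and `password` arguments shown in the first-time download comment above.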
Module 7: Mutation Generation & Utilities (mutate, setup)
Generate mutated sequences and manage module dependencies.
    import gget
    import pandas as pd

    # Generate mutated sequences from mutation annotations
    mutations_df = pd.DataFrame({
        "seq_ID": ["seq1", "seq1"],
        "mutation": ["c.4G>T", "c.10del"]
    })
    mutated = gget.mutate(["ATCGCTAAGCTGATCG"], mutations=mutations_df)
    print(f"Generated {len(mutated)} mutated sequences")
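To make the mutation notation in the example concrete, here is a small pure-Python illustration of what `c.4G>T` (substitution at position 4) and `c.10del` (deletion of position 10) mean for the example sequence. This is only a sketch of the notation with 1-based positions. It is not how `gget.mutate` is implemented, and the module's actual output may be formatted differently. `apply_mutation` is a hypothetical helper.

```python
import re


def apply_mutation(seq, mutation):
    """Apply one simple mutation to `seq` using 1-based positions.

    Supports only 'c.<pos><ref>><alt>' substitutions and single-base
    'c.<pos>del' deletions.
    """
    substitution = re.fullmatch(r"c\.(\d+)([ACGT])>([ACGT])", mutation)
    if substitution:
        pos = int(substitution.group(1))
        ref, alt = substitution.group(2), substitution.group(3)
        if not 1 <= pos <= len(seq) or seq[pos - 1] != ref:
            raise ValueError(f"reference base mismatch for {mutation}")
        return seq[:pos - 1] + alt + seq[pos:]

    deletion = re.fullmatch(r"c\.(\d+)del", mutation)
    if deletion:
        pos = int(deletion.group(1))
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position out of range for {mutation}")
        return seq[:pos - 1] + seq[pos:]

    raise ValueError(f"unsupported mutation: {mutation}")


wild_type = "ATCGCTAAGCTGATCG"
print(apply_mutation(wild_type, "c.4G>T"))   # ATCTCTAAGCTGATCG
print(apply_mutation(wild_type, "c.10del"))  # ATCGCTAAGTGATCG
```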
Key Concepts
Module Overview
gget organizes 20+ modules by domain. The Python interface uses `gget.<module>()`:
| Domain | Modules | Primary Database |
|---|---|---|
| Gene reference | `ref`, `search`, `info`, `seq` | Ensembl, UniProt, NCBI |
| Sequence alignment | `blast`, `blat`, `muscle`, `diamond` | NCBI BLAST, UCSC, local |
| Protein structure | `pdb`, `alphafold`, `elm` | RCSB PDB, AlphaFold2, ELM |
| Expression | `archs4`, `cellxgene`, `bgee` | ARCHS4, CZ CELLxGENE, Bgee |
| Disease/drugs | `opentargets`, `enrichr` | OpenTargets, Enrichr |
| Cancer | `cbio`, `cosmic` | cBioPortal, COSMIC |
| Utilities | `mutate`, `setup`, `gpt` | local / OpenAI |
Output Formats
| Context | Default Format | Alternatives |
|---|---|---|
| Python | DataFrame or dict | `json=True` for JSON; `save=True` to file |
| CLI | JSON | `-csv` for CSV; `-o` to save |
| Sequences | FASTA (seq, mutate) | -- |
| Structures | PDB file (pdb, alphafold) | JSON alignment error data |
| Single-cell | AnnData object (cellxgene) | `meta_only=True` for metadata only |
| Visualization | PNG (cbio plot) | `show=True` for interactive display |
Enrichr Database Shortcuts
| Shortcut | Full Database Name |
|---|---|
| `pathway` | KEGG_2021_Human |
| `transcription` | ChEA_2016 |
| `ontology` | GO_Biological_Process_2021 |
| `diseases_drugs` | GWAS_Catalog_2019 |
| `celltypes` | PanglaoDB_Augmented_2021 |
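The shortcut-to-library mapping can be captured in a small lookup, which is handy for logging exactly which library an analysis used. A minimal sketch: the shortcut names come from the Module 5 example and the library names from the table above. `resolve_enrichr_library` is a hypothetical helper, not a gget function, and gget already accepts the shortcuts directly.

```python
# Shortcut names and library names as listed in this document
ENRICHR_SHORTCUTS = {
    "pathway": "KEGG_2021_Human",
    "transcription": "ChEA_2016",
    "ontology": "GO_Biological_Process_2021",
    "diseases_drugs": "GWAS_Catalog_2019",
    "celltypes": "PanglaoDB_Augmented_2021",
}


def resolve_enrichr_library(database):
    """Return the full library name for a shortcut, or the input unchanged."""
    return ENRICHR_SHORTCUTS.get(database, database)


for name in ("ontology", "Jensen_TISSUES"):
    print(f"{name} -> {resolve_enrichr_library(name)}")
```

Names that are not shortcuts pass through unchanged, so a full library name can be recorded the same way.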
Custom libraries: pass any Enrichr library name directly (e.g., `"Jensen_TISSUES"`).
OpenTargets Resources
| Resource | Description |
|---|---|
| `diseases` | Disease associations with evidence scores |
| `drugs` | Drug associations and clinical trial data |
| `tractability` | Target tractability assessment |
| `pharmacogenetics` | Pharmacogenetic variants |
| `expression` | Baseline tissue expression |
| `depmap` | DepMap gene-disease effects |
| `interactions` | Protein-protein interactions |
Reproducibility
Pin database versions for consistent results across analyses:
    import gget

    # Pin Ensembl release
    ref = gget.ref("homo_sapiens", release=112)

    # Pin CELLxGENE Census version
    adata = gget.cellxgene(gene=["ACE2"], census_version="2023-07-25")

    # Always record gget version
    print(f"gget version: {gget.__version__}")
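One way to keep the pinned versions alongside the results is a small run manifest saved as JSON. The sketch below uses only the standard library. The manifest keys and the helpers `installed_version` and `build_manifest` are hypothetical, and the package version is read through `importlib.metadata` so the snippet also works where gget is not installed (the version is then recorded as `null`).

```python
import json
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(ensembl_release, census_version):
    """Collect the version pins used for an analysis run."""
    return {
        "gget_version": installed_version("gget"),
        "ensembl_release": ensembl_release,
        "census_version": census_version,
    }


manifest = build_manifest(ensembl_release=112, census_version="2023-07-25")
print(json.dumps(manifest, indent=2))
```

Writing this JSON next to the output files makes it possible to rerun an analysis later against the same releases.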
Common Workflows
Workflow 1: Gene Discovery to Functional Analysis
Goal: Find genes of interest, get their sequences, and perform enrichment analysis.
    import gget

    # 1. Search for genes
    results = gget.search(["GABA", "receptor"], species="homo_sapiens")
    gene_ids = results["ensembl_id"].tolist()[:10]

    # 2. Get detailed information
    info = gget.info(gene_ids)
    print(f"Retrieved info for {len(info)} genes")

    # 3. Get protein sequences
    sequences = gget.seq(gene_ids, translate=True)

    # 4. Find correlated genes (ARCHS4 is queried by gene symbol, not Ensembl ID)
    gene_symbol = info.iloc[0]["primary_gene_name"]
    correlated = gget.archs4(gene_symbol, which="correlation")

    # 5. Enrichment analysis on correlated genes
    gene_list = correlated["gene_symbol"].tolist()[:50]
    enrichment = gget.enrichr(gene_list, database="ontology")
    # Column name as of recent gget releases; check enrichment.columns if it differs
    print(f"Top enriched term: {enrichment.iloc[0]['path_name']}")
Workflow 2: Target Validation for Drug Discovery
Goal: Investigate a gene's disease associations, druggability, and cancer mutations.
    import gget

    gene_id = "ENSG00000169194"  # IL13
    gene_symbol = "IL13"

    # 1. Disease associations
    diseases = gget.opentargets(gene_id, resource="diseases", limit=20)

    # 2. Drug associations
    drugs = gget.opentargets(gene_id, resource="drugs")

    # 3. Tractability assessment
    tractability = gget.opentargets(gene_id, resource="tractability")

    # 4. Protein interactions
    interactions = gget.opentargets(gene_id, resource="interactions")
    print(f"Diseases: {len(diseases)}, Drugs: {len(drugs)}, Interactions: {len(interactions)}")

    # 5. Cancer genomics (the gene must be profiled in the chosen study)
    gget.cbio_plot(["msk_impact_2017"], [gene_symbol], stratification="cancer_type")
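When several OpenTargets resources are queried for the same target, one failed request should not discard the others. The sketch below gathers results per resource and records errors separately. `collect_resources` is a hypothetical helper, and `fake_query` is a stand-in used here so the example runs offline. In practice the query function would be `gget.opentargets`, assuming it accepts `resource` as a keyword argument as in the workflow above.

```python
def collect_resources(gene_id, resources, query_fn):
    """Query each resource, returning (results, errors) keyed by resource."""
    results, errors = {}, {}
    for resource in resources:
        try:
            results[resource] = query_fn(gene_id, resource=resource)
        except Exception as exc:  # keep going if one resource fails
            errors[resource] = str(exc)
    return results, errors


def fake_query(gene_id, resource):
    """Offline stand-in for a remote query (not a gget function)."""
    if resource == "tractability":
        raise RuntimeError("simulated API failure")
    return [f"{gene_id}:{resource}"]


results, errors = collect_resources(
    "ENSG00000169194", ["diseases", "drugs", "tractability"], fake_query
)
print(f"Succeeded: {sorted(results)}; failed: {sorted(errors)}")
```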
Workflow 3: Comparative Genomics
Goal: Compare a gene across species using orthologs and sequence alignment.
    import gget

    # 1. Find orthologs
    orthologs = gget.bgee("ENSG00000169194", type="orthologs")

    # 2. Get protein sequences for human and mouse
    #    (assumes gget.seq returns a list of FASTA headers and sequences)
    human_seq = gget.seq("ENSG00000169194", translate=True)
    mouse_seq = gget.seq("ENSMUSG00000026091", translate=True)

    # 3. Align the sequences (index 1 is the first sequence, after its header)
    alignment = gget.muscle([human_seq[1], mouse_seq[1]])

    # 4. Get human protein structure from PDB
    pdb_structure = gget.pdb("7S7U")
    print("Comparative analysis complete")
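After aligning two orthologous sequences, a simple summary is their percent identity. The sketch below computes it over alignment columns where both sequences have a residue, which is one of several common definitions. `percent_identity` is a hypothetical helper, and the aligned strings are made up rather than taken from a real alignment.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns without a gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned sequences: 6 ungapped columns, 5 of them identical
identity = percent_identity("MKW-FKE", "MKWMFRE")
print(f"Identity: {identity:.1f}%")
```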
Key Parameters
| Parameter | Module(s) | Default | Range / Options | Effect |
|---|---|---|---|---|
| `species` | search, archs4, cellxgene, enrichr | human (`"homo_sapiens"` or `"human"`, depending on module) | Any Ensembl species; shortcuts: 'human', 'mouse' | Target organism |
| `limit` | blast, opentargets, cosmic | varies by module | Positive integer | Maximum results returned |
| `database` | blast, enrichr | varies | blast: nt/nr/swissprot/pdbaa; enrichr: shortcuts or library names | Target database for query |
| `which` | ref, archs4 | varies | ref: `gtf`, `cdna`, `dna`, `cds`, `pep`; archs4: `correlation`, `tissue` | Data type to retrieve |
| `translate` | seq | `False` | `True` / `False` | Return amino acid instead of nucleotide sequences |
| `resource` | opentargets | `"diseases"` | diseases, drugs, tractability, pharmacogenetics, expression, depmap, interactions | OpenTargets data type |
| `release` | ref, search | latest | Integer Ensembl release number | Pin database version for reproducibility |
| `census_version` | cellxgene | `"stable"` | `"stable"`, `"latest"`, date string | Pin CELLxGENE Census version |
| `sensitivity` | diamond, elm | `"very-sensitive"` | `fast` to `ultra-sensitive` | Alignment sensitivity vs speed |
| `threads` | diamond, elm | `1` | Positive integer | CPU threads for alignment |
| `multimer_recycles` | alphafold | `3` | Positive integer | Higher = more accurate multimer prediction |
Best Practices
- Pin database versions for reproducibility: Use `release=112` for Ensembl and `census_version="2023-07-25"` for CELLxGENE to ensure consistent results across analyses.
- Rate-limit batch queries: gget queries remote APIs. Add `time.sleep(2)` between BLAST/BLAT queries in loops. For `gget.info()`, limit to ~1000 IDs per call.
- Keep gget updated: Upstream databases change their structure over time, and gget is tested against them biweekly. Run `pip install --upgrade gget` regularly to avoid breakage from schema changes.
- Use Python interface for pipelines, CLI for exploration: Python functions return DataFrames suitable for chaining. CLI with `-csv` is better for quick one-off lookups.
- Check PDB before running AlphaFold: `gget.pdb()` is fast; AlphaFold prediction takes minutes to hours. Always check if the structure already exists in PDB.
- Use database shortcuts in enrichr: The shortcuts (`'pathway'`, `'ontology'`, etc.) map to curated Enrichr libraries. For custom analyses, pass any Enrichr library name directly.
- Cache cBioPortal data for repeated analyses: Use the `data_dir="./cache"` parameter to avoid re-downloading large cancer genomics datasets.
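Rate limiting pairs naturally with retries, since remote queries can fail transiently. The sketch below wraps any callable with exponential backoff. `call_with_retries` is a hypothetical helper, not part of gget, and `flaky` simulates a remote call so the example runs offline. The sleep function is injectable so the demo does not actually wait.

```python
import time


def call_with_retries(fn, *args, retries=3, base_delay=2.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on failure.

    Catches every Exception for simplicity; a real pipeline would likely
    narrow this to network-related errors.
    """
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...


# Demo: a stand-in that fails twice before succeeding
calls = {"n": 0}


def flaky(x):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return x * 2


waits = []
result = call_with_retries(flaky, 21, sleep=waits.append)
print(result, waits)  # 42 [2.0, 4.0]
```

With gget, the same wrapper could be applied as `call_with_retries(gget.blast, sequence, database="swissprot")`.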
Common Recipes
Recipe: Batch Gene Information Retrieval
When to use: Need information for many genes at once (up to ~1000 IDs per call).
    import gget
    import time
    import pandas as pd

    gene_ids = ["ENSG00000012048", "ENSG00000139618", "ENSG00000141510"]
    info = gget.info(gene_ids)
    info.to_csv("gene_info_batch.csv")
    print(f"Saved info for {len(info)} genes")

    # For >1000 genes, batch with rate limiting
    # (placeholder IDs for illustration; substitute a list of real Ensembl IDs)
    all_ids = [f"ENSG{i:011d}" for i in range(2000)]
    results = []
    for i in range(0, len(all_ids), 500):
        batch = all_ids[i:i+500]
        results.append(gget.info(batch))
        time.sleep(1)

    # Combine the per-batch DataFrames into one table
    all_info = pd.concat(results)
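The batching loop in this recipe can be factored into a reusable helper. The sketch below is a generic version under the same idea: split the IDs into fixed-size chunks, call a query function on each, and pause between calls. `chunked` and `query_in_batches` are hypothetical helpers, and the demo uses `len` as a stand-in query so it runs without network access.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def query_in_batches(ids, query_fn, batch_size=500, delay=1.0, sleep=time.sleep):
    """Run query_fn on each batch of IDs, sleeping between (not after) batches."""
    batches = list(chunked(ids, batch_size))
    results = []
    for index, batch in enumerate(batches):
        results.append(query_fn(batch))
        if index < len(batches) - 1:
            sleep(delay)
    return results


# Demo: `len` stands in for a remote query, and sleeping is disabled
ids = [f"ID{i}" for i in range(1200)]
sizes = query_in_batches(ids, len, batch_size=500, sleep=lambda seconds: None)
print(sizes)  # [500, 500, 200]
```

Passing `gget.info` as `query_fn` would return one DataFrame per batch, which can then be combined with `pandas.concat`.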
Recipe: Custom Enrichment with Background
When to use: Running enrichment against a custom background gene set.
    import gget

    # Use specific Enrichr library with background genes
    enrichment = gget.enrichr(
        ["ACE2", "AGT", "AGTR1"],
        database="Jensen_TISSUES",
        background_list=["ACE2", "AGT", "AGTR1", "TP53", "BRCA1", "MYC"]
    )
    # Column names as of recent gget releases; check enrichment.columns if they differ
    print(enrichment[["path_name", "adj_p_val"]].head())
Recipe: AlphaFold Structure Prediction with Visualization
When to use: Predicting and visualizing protein structures with confidence coloring.
    import gget

    # Predict with visualization (PAE + 3D structure)
    result = gget.alphafold(
        "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR",
        plot=True,
        show_sidechains=True,
        relax=True  # AMBER relaxation for final structure
    )
    # Output: PDB file + predicted aligned error (PAE) JSON
    # PAE heatmap auto-generated with plot=True
Recipe: Download Reference Genome for RNA-seq Pipeline
When to use: Setting up reference files for RNA-seq alignment pipelines.
    # Download GTF and cDNA for human (specific release)
    # Multiple file types are given as one comma-separated -w value
    gget ref -w gtf,cdna -d -r 112 homo_sapiens

    # Download genome DNA
    gget ref -w dna -d homo_sapiens
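When the reference links are fetched from Python rather than downloaded with `-d`, the URLs usually need to be pulled out of the returned structure before handing them to a pipeline. The sketch below assumes the result is a nested dictionary of the form `{species: {file_type: {"ftp": url, ...}}}`. That layout is an assumption to verify against your gget version. `extract_ftp_links` is a hypothetical helper, and the sample data uses placeholder URLs rather than real Ensembl links.

```python
def extract_ftp_links(ref_result):
    """Flatten {species: {file_type: {"ftp": url}}} into {(species, file_type): url}.

    The nested layout is an assumption about the structure of the result.
    """
    links = {}
    for species, files in ref_result.items():
        for file_type, details in files.items():
            if isinstance(details, dict) and "ftp" in details:
                links[(species, file_type)] = details["ftp"]
    return links


# Placeholder data in the assumed layout (not real Ensembl URLs)
sample = {
    "homo_sapiens": {
        "annotation_gtf": {"ftp": "https://example.org/placeholder.gtf.gz"},
        "transcriptome_cdna": {"ftp": "https://example.org/placeholder.cdna.fa.gz"},
    }
}
links = extract_ftp_links(sample)
for (species, file_type), url in links.items():
    print(f"{species} {file_type}: {url}")
```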
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| `import gget` fails | Package not installed | `pip install gget` in clean virtual environment |
| `gget setup <module>` fails | Python version incompatibility | Use Python 3.8-3.10; check the active interpreter with `python --version` |
| Empty BLAST results | Sequence too short or no matches | Try a longer sequence or a different database |
| Gene not found | Case-sensitive gene symbols | Use uppercase symbols for human (e.g., `ACE2`) and title case for mouse (e.g., `Ace2`); exact capitalization required |
| `gget.info()` timeout | Too many IDs at once | Limit to ~1000 Ensembl IDs per call; batch with `time.sleep()` |
| Database structure changed | Upstream databases change over time | `pip install --upgrade gget` |
| COSMIC authentication error | Missing or expired credentials | Re-enter email/password; check COSMIC account status |
| AlphaFold out of memory | Protein too long for GPU memory | Use shorter sequences or split into domains |
| Different results on re-run | Database updated between runs | Pin versions: `release=112` for Ensembl, `census_version="2023-07-25"` for CELLxGENE |
Bundled Resources
Two reference files provide extended coverage of capabilities from the original 3 reference files and 3 script files:
- `references/module_parameters.md`: Consolidates module_reference.md (468 lines). Covers: detailed parameter tables for all 15+ modules with types, defaults, and return value descriptions; CLI vs Python interface differences; setup requirements per module. Relocated inline: most-used module parameters (Core API code blocks), output format summary (Key Concepts table). Omitted: gget gpt module details (trivial OpenAI wrapper, not genomics-specific).
- `references/databases_workflows.md`: Consolidates database_info.md (301 lines) and workflows.md (815 lines). Covers: complete database directory with update frequencies and citation info, extended workflow examples (building reference indices, disease-drug pipeline, multi-species comparative analysis), data consistency and reproducibility guidance. Relocated inline: core database overview (Key Concepts table), top 3 workflows (Common Workflows), reproducibility patterns (Key Concepts). Omitted: scripts/ content (3 files, 590 lines total), which are thin wrappers around gget API calls for CLI automation; core patterns absorbed into Core API and Common Workflows.
Related Skills
- biopython: advanced BLAST parameters, batch sequence processing, GenBank record parsing
- bioservices: programmatic multi-database queries with built-in rate limiting (UniProt, KEGG, ChEMBL)
- anndata-data-structure: working with AnnData objects returned by `gget.cellxgene()`
- enrichr: deeper enrichment analysis with custom gene set libraries
References
- gget documentation — official docs and tutorials
- gget GitHub — source code, issues
- Luebbert, L. & Pachter, L. (2023). Efficient querying of genomic reference databases with gget. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac836