BioSkills bio-geo-data
Query NCBI Gene Expression Omnibus (GEO) for expression datasets using Biopython Bio.Entrez. Use when finding microarray/RNA-seq datasets, downloading expression data, or linking GEO series to SRA runs.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/database-access/geo-data" ~/.claude/skills/gptomics-bioskills-bio-geo-data && rm -rf "$T"
Version Compatibility
Reference examples tested with: BioPython 1.83+, Entrez Direct 21.0+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>`, then `help(module.function)` to check signatures
- CLI: `<tool> --version`, then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
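The introspection step can be sketched with the standard library alone. The helper name `describe_api` is hypothetical (it is not part of BioPython); it reports the installed distribution version and the signature of a callable so an example can be adapted to the API that is actually installed.

```python
import importlib
import inspect
from importlib import metadata


def describe_api(dist_name, module_name, attr_path):
    """Return (installed_version, signature_text) for a callable.

    dist_name   -- distribution name as pip knows it, e.g. 'biopython'
    module_name -- importable module, e.g. 'Bio.Entrez'
    attr_path   -- dotted attribute path inside the module, e.g. 'esearch'
    """
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # nothing installed under that distribution name

    # Walk from the module down to the requested attribute.
    obj = importlib.import_module(module_name)
    for part in attr_path.split('.'):
        obj = getattr(obj, part)

    try:
        signature = str(inspect.signature(obj))
    except (TypeError, ValueError):
        signature = None  # some builtins expose no inspectable signature
    return version, signature
```

For this skill, a call such as `describe_api('biopython', 'Bio.Entrez', 'esearch')` would show which version is installed and which arguments `esearch` accepts.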
GEO Data
Query and access Gene Expression Omnibus datasets using Biopython's Entrez module.
"Find expression datasets in GEO" → Search GEO for microarray/RNA-seq datasets by organism, tissue, or condition, then download expression matrices or link to SRA runs.
- Python: `Entrez.esearch(db='gds', term=...)` (BioPython)
- CLI: `esearch -db gds -query "term"` (Entrez Direct)
Required Setup
    from Bio import Entrez

    Entrez.email = 'your.email@example.com'  # Required by NCBI
    Entrez.api_key = 'your_api_key'  # Optional
GEO Database Types
| Database | db value | Description |
|---|---|---|
| GEO DataSets | `gds` | Curated datasets (GDS*) |
| GEO Profiles | `geoprofiles` | Individual gene profiles |
GEO Record Types:
| Prefix | Type | Description |
|---|---|---|
| GSE | Series | Complete study/experiment |
| GSM | Sample | Individual sample |
| GPL | Platform | Array/sequencing platform |
| GDS | DataSet | Curated, normalized dataset |
Searching GEO
Search GEO DataSets (GDS)
    from Bio import Entrez

    Entrez.email = 'your.email@example.com'

    # Search curated datasets
    handle = Entrez.esearch(db='gds', term='breast cancer AND Homo sapiens[orgn]', retmax=10)
    record = Entrez.read(handle)
    handle.close()

    print(f"Found {record['Count']} datasets")
    print(f"IDs: {record['IdList']}")
Search GEO Series (GSE)
    # Search GEO Series via the gds database
    # Use the entry_type filter
    handle = Entrez.esearch(db='gds', term='RNA-seq[title] AND human[orgn] AND gse[entry_type]', retmax=10)
    record = Entrez.read(handle)
    handle.close()
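Both searches above cap results with `retmax=10`. When a query matches more records than one page holds, the usual E-utilities approach is to step `retstart` forward. The following is a minimal sketch of that loop; `iter_all_ids` and the `search_page` callback are hypothetical names, and the callback is injected so the paging logic can be exercised without network access.

```python
def iter_all_ids(search_page, term, page_size=100, limit=None):
    """Yield record IDs page by page.

    search_page(term, retstart, retmax) must return a mapping with
    'Count' and 'IdList', the shape Entrez.read gives for esearch.
    """
    start = 0
    yielded = 0
    while True:
        page = search_page(term, start, page_size)
        ids = page['IdList']
        if not ids:
            return
        for uid in ids:
            if limit is not None and yielded >= limit:
                return
            yield uid
            yielded += 1
        start += len(ids)
        if start >= int(page['Count']):  # 'Count' arrives as a string
            return


# Possible adapter around Bio.Entrez (assumption, not executed here):
# def entrez_page(term, retstart, retmax):
#     handle = Entrez.esearch(db='gds', term=term,
#                             retstart=retstart, retmax=retmax)
#     try:
#         return Entrez.read(handle)
#     finally:
#         handle.close()
```

Passing `limit` keeps an exploratory query from walking through thousands of pages.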
Common Search Fields
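| Field | Description | Example |
|---|---|---|
| `[orgn]` | Organism | `human[orgn]` |
| `[title]` | Dataset title | `RNA-seq[title]` |
| | Description text | |
| | Platform GPL | |
| `[entry_type]` | Record type | `gse[entry_type]`, `gds[entry_type]` |
| `[gdstype]` | Study type | `expression profiling by array[gdstype]` |
| | PubMed ID | |
| | Publication date | |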
GDS Types
    # Expression profiling by array
    term = 'expression profiling by array[gdstype] AND cancer'

    # RNA-seq expression
    term = 'expression profiling by high throughput sequencing[gdstype]'

    # ChIP-seq
    term = 'genome binding/occupancy profiling[gdstype]'
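Search terms in this skill are built by joining field-tagged clauses with `AND`. A small builder can keep that consistent; this is a sketch only, the function name `build_geo_term` is hypothetical, and it covers just the field tags that appear in the examples here (`[orgn]`, `[gdstype]`, `[entry_type]`).

```python
def build_geo_term(keywords=None, organism=None, entry_type=None, gds_type=None):
    """Compose a GEO DataSets query string from optional pieces."""
    parts = []
    if keywords:
        # Parenthesize free text so an OR inside it cannot leak into other clauses.
        parts.append(f'({keywords})')
    if organism:
        parts.append(f'{organism}[orgn]')
    if gds_type:
        parts.append(f'{gds_type}[gdstype]')
    if entry_type:
        parts.append(f'{entry_type}[entry_type]')
    return ' AND '.join(parts)


term = build_geo_term(
    keywords='breast cancer',
    organism='Homo sapiens',
    entry_type='gse',
    gds_type='expression profiling by high throughput sequencing',
)
```

The resulting string can be passed as `term=` to `Entrez.esearch(db='gds', ...)`.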
Fetching GEO Information
Get GEO DataSet Summary
    # Fetch summary for GDS records
    handle = Entrez.esummary(db='gds', id='200024320')
    record = Entrez.read(handle)
    handle.close()

    summary = record[0]
    print(f"Accession: {summary['Accession']}")
    print(f"Title: {summary['title']}")
    print(f"Summary: {summary['summary'][:200]}...")
    print(f"Organism: {summary['taxon']}")
    print(f"Platform: {summary['GPL']}")
    print(f"Samples: {summary['n_samples']}")
Summary Fields
    summary['Accession']   # GSE/GDS accession
    summary['title']       # Dataset title
    summary['summary']     # Description
    summary['taxon']       # Organism
    summary['GPL']         # Platform ID
    summary['n_samples']   # Number of samples
    summary['FTPLink']     # FTP download link
    summary['PubMedIds']   # Associated publications
    summary['gdsType']     # Dataset type
    summary['ptechType']   # Platform technology
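Not every record carries every field (the Common Errors table notes that `FTPLink` can be missing), so indexing with `summary[...]` may raise `KeyError`. A tolerant flattening step might look like the sketch below; `summarize_record` is a hypothetical helper, and it relies only on the record behaving like a dict.

```python
def summarize_record(summary):
    """Pick commonly used fields out of one esummary record, tolerating gaps."""
    n_samples = summary.get('n_samples')
    return {
        'accession': summary.get('Accession'),
        'title': summary.get('title', ''),
        'organism': summary.get('taxon', ''),
        'platform': summary.get('GPL', ''),
        # Counts may arrive as strings or integer-like objects.
        'n_samples': int(n_samples) if n_samples not in (None, '') else None,
        # Treat an empty link the same as a missing one.
        'ftp_link': summary.get('FTPLink') or None,
        'pubmed_ids': [str(p) for p in summary.get('PubMedIds', [])],
    }
```

Downstream code can then test `record['ftp_link'] is None` instead of wrapping each access in `try`/`except`.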
Code Patterns
Search and List GEO Series
Goal: Find GEO experiment series matching a query and retrieve their metadata summaries.
Approach: Search the gds database with entry_type filtering, then fetch summaries for matched IDs to extract accessions, titles, organisms, and sample counts.
    from Bio import Entrez

    Entrez.email = 'your.email@example.com'

    def search_geo(term, entry_type='gse', max_results=20):
        full_term = f'{term} AND {entry_type}[entry_type]'
        handle = Entrez.esearch(db='gds', term=full_term, retmax=max_results)
        search = Entrez.read(handle)
        handle.close()

        if not search['IdList']:
            return []

        handle = Entrez.esummary(db='gds', id=','.join(search['IdList']))
        summaries = Entrez.read(handle)
        handle.close()

        results = []
        for s in summaries:
            results.append({
                'accession': s['Accession'],
                'title': s['title'],
                'organism': s['taxon'],
                'samples': s['n_samples'],
                'platform': s['GPL']
            })
        return results

    datasets = search_geo('breast cancer RNA-seq AND human[orgn]')
    for ds in datasets:
        print(f"{ds['accession']}: {ds['title'][:60]}... ({ds['samples']} samples)")
Find RNA-Seq Datasets
    def find_rnaseq_datasets(organism, keywords, max_results=20):
        term = (
            f'{keywords} AND {organism}[orgn] '
            'AND expression profiling by high throughput sequencing[gdstype] '
            'AND gse[entry_type]'
        )
        handle = Entrez.esearch(db='gds', term=term, retmax=max_results)
        search = Entrez.read(handle)
        handle.close()

        if not search['IdList']:
            return []

        handle = Entrez.esummary(db='gds', id=','.join(search['IdList']))
        summaries = Entrez.read(handle)
        handle.close()
        return summaries

    datasets = find_rnaseq_datasets('Homo sapiens', 'COVID-19')
    for ds in datasets:
        print(f"{ds['Accession']}: {ds['n_samples']} samples - {ds['title'][:50]}...")
Get GSE Download Link
    def get_geo_ftp(gse_accession):
        '''Get FTP download link for a GSE'''
        handle = Entrez.esearch(db='gds', term=f'{gse_accession}[accn]')
        search = Entrez.read(handle)
        handle.close()

        if not search['IdList']:
            return None

        handle = Entrez.esummary(db='gds', id=search['IdList'][0])
        summary = Entrez.read(handle)[0]
        handle.close()
        return summary.get('FTPLink')

    ftp_link = get_geo_ftp('GSE123456')
    print(f"Download from: {ftp_link}")
Link GEO to SRA
Goal: Find the raw sequencing runs (SRA accessions) associated with a GEO series for downloading FASTQ files.
Approach: Search for the GSE accession in the gds database, use ELink to cross-reference to SRA, then fetch SRA summaries to extract run metadata.
    def geo_to_sra(gse_accession):
        '''Find SRA runs associated with a GEO series'''
        # Search GEO
        handle = Entrez.esearch(db='gds', term=f'{gse_accession}[accn]')
        search = Entrez.read(handle)
        handle.close()

        if not search['IdList']:
            return []

        # Link to SRA
        handle = Entrez.elink(dbfrom='gds', db='sra', id=search['IdList'][0])
        links = Entrez.read(handle)
        handle.close()

        if not links[0]['LinkSetDb']:
            return []

        sra_ids = [link['Id'] for link in links[0]['LinkSetDb'][0]['Link']]

        # Get SRA accessions
        handle = Entrez.esummary(db='sra', id=','.join(sra_ids[:50]))
        summaries = Entrez.read(handle)
        handle.close()

        runs = []
        for s in summaries:
            expxml = s.get('ExpXml', '')
            if 'SRR' in str(expxml) or 'SRX' in str(expxml):
                runs.append(s)
        return runs

    sra_data = geo_to_sra('GSE123456')
    print(f"Found {len(sra_data)} SRA records")
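`geo_to_sra` returns whole summary records, while a downloader usually needs bare run accessions. The sketch below pulls them out with a regular expression. It assumes each SRA summary exposes a `Runs` field holding an XML fragment such as `<Run acc="SRR..." .../>`; that field name and layout are an assumption to verify against a real response, and `extract_run_accessions` is a hypothetical helper.

```python
import re

# Run accessions from the three INSDC archives: SRR (NCBI), ERR (ENA), DRR (DDBJ).
RUN_ACCESSION = re.compile(r'\b[SED]RR\d+\b')


def extract_run_accessions(sra_summaries):
    """Collect unique run accessions, preserving first-seen order.

    Assumes each summary is dict-like and stores run information as
    text under the 'Runs' key (an assumption, not confirmed by this skill).
    """
    found = {}
    for summary in sra_summaries:
        text = str(summary.get('Runs', ''))
        for accession in RUN_ACCESSION.findall(text):
            found.setdefault(accession, None)
    return list(found)
```

The resulting list could then be handed to whatever download step the sra-data skill describes.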
Search by PubMed ID
    def geo_from_pubmed(pmid):
        '''Find GEO datasets associated with a publication'''
        handle = Entrez.elink(dbfrom='pubmed', db='gds', id=pmid)
        links = Entrez.read(handle)
        handle.close()

        if not links[0]['LinkSetDb']:
            return []

        gds_ids = [link['Id'] for link in links[0]['LinkSetDb'][0]['Link']]

        handle = Entrez.esummary(db='gds', id=','.join(gds_ids))
        summaries = Entrez.read(handle)
        handle.close()
        return summaries

    datasets = geo_from_pubmed('35412348')
    for ds in datasets:
        print(f"{ds['Accession']}: {ds['title']}")
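`geo_from_pubmed` joins every linked ID into a single `esummary` call, which can become a very long request for heavily linked records. One way to keep requests bounded is to fetch in batches, sketched below. The names `chunked` and `fetch_in_batches`, the batch size of 200, and the pause length are illustrative assumptions; the fetch function is injected so the batching can be tested offline.

```python
import time


def chunked(ids, size=200):
    """Split a sequence of IDs into consecutive lists of at most `size`."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_in_batches(fetch_batch, ids, size=200, pause=0.34):
    """Call fetch_batch('id1,id2,...') per batch and concatenate the results.

    `pause` is a conservative delay between batches; NCBI asks for at most
    3 requests per second without an API key.
    """
    results = []
    for index, batch in enumerate(chunked(ids, size)):
        if index:
            time.sleep(pause)
        results.extend(fetch_batch(','.join(batch)))
    return results


# Possible adapter around Bio.Entrez (assumption, not executed here):
# def gds_summaries(id_string):
#     handle = Entrez.esummary(db='gds', id=id_string)
#     try:
#         return Entrez.read(handle)
#     finally:
#         handle.close()
```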
Download GEO Data (GEOparse)
Goal: Download a complete GEO series including sample metadata and expression data for local analysis.
Approach: Use GEOparse to fetch and parse the GSE record, then access sample metadata via the gsms dictionary and expression values via pivot_samples.
    # pip install GEOparse
    import GEOparse

    # Download and parse GSE
    gse = GEOparse.get_GEO('GSE123456')

    # Access metadata
    print(f"Title: {gse.metadata['title'][0]}")
    print(f"Samples: {len(gse.gsms)}")

    # Get sample metadata
    for gsm_name, gsm in gse.gsms.items():
        print(f"{gsm_name}: {gsm.metadata['title'][0]}")

    # Get expression table
    if gse.gpls:
        gpl_name = list(gse.gpls.keys())[0]
        expression_table = gse.pivot_samples('VALUE')
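Sample annotations in GEO typically sit in the `characteristics_ch1` metadata entry as `key: value` strings (for example `tissue: liver`). The sketch below turns the `gsms` dictionary into one flat row per sample. `parse_characteristics` and `sample_table` are hypothetical helpers; they assume only that each sample object has a `metadata` dict of lists, as in the loop above.

```python
def parse_characteristics(entries):
    """Convert ['tissue: liver', 'age: 5 weeks'] into a dict.

    Entries without a colon are kept under the '_unparsed' key.
    """
    parsed = {}
    for entry in entries:
        key, sep, value = entry.partition(':')  # split on the first colon only
        if sep:
            parsed[key.strip().lower()] = value.strip()
        else:
            parsed.setdefault('_unparsed', []).append(entry.strip())
    return parsed


def sample_table(gsms):
    """Build a list of row dicts, one per sample, from a gsms-like mapping."""
    rows = []
    for name, gsm in gsms.items():
        meta = gsm.metadata
        row = {'sample': name, 'title': meta.get('title', [''])[0]}
        row.update(parse_characteristics(meta.get('characteristics_ch1', [])))
        rows.append(row)
    return rows
```

With a parsed series, `sample_table(gse.gsms)` would give rows that can be handed to `csv.DictWriter` or a DataFrame constructor.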
Download Options
Direct FTP Download
    # Download entire GSE
    wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/

    # Download specific file types (quote the URL so the shell leaves the wildcard to wget)
    wget 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/*counts*.txt.gz'
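The directory names in these URLs follow a pattern: the series accession with its last three digits replaced by `nnn` (`GSE123456` sits under `GSE123nnn`, and accessions below `GSE1000` sit under `GSEnnn`). A sketch of a URL builder based on that layout follows; `geo_series_urls` is a hypothetical helper, and the subdirectory names should be checked against the live FTP listing for the series in question.

```python
import re


def geo_series_urls(gse):
    """Return the main download directories for a GSE accession."""
    match = re.fullmatch(r'GSE(\d+)', gse.strip().upper())
    if not match:
        raise ValueError(f'not a GSE accession: {gse!r}')
    digits = match.group(1)
    # Drop the last three digits; short accessions collapse to 'GSEnnn'.
    stub = 'GSE' + digits[:-3] + 'nnn'
    base = f'https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/GSE{digits}/'
    return {
        'base': base,
        'matrix': base + 'matrix/',  # series matrix files
        'suppl': base + 'suppl/',    # supplementary files
        'soft': base + 'soft/',      # SOFT family files
        'miniml': base + 'miniml/',  # MINiML (XML) family files
    }
```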
Series Matrix Files
    import gzip
    import urllib.request

    def download_series_matrix(gse):
        '''Download series matrix file'''
        # GEO groups series by dropping the last three digits (GSE123456 -> GSE123nnn);
        # accessions below GSE1000 live under GSEnnn
        gse_prefix = 'GSE' + gse[3:][:-3] + 'nnn'
        url = f'https://ftp.ncbi.nlm.nih.gov/geo/series/{gse_prefix}/{gse}/matrix/{gse}_series_matrix.txt.gz'
        filename = f'{gse}_series_matrix.txt.gz'
        urllib.request.urlretrieve(url, filename)
        return filename
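Once downloaded, a series matrix file is tab-delimited text: metadata lines start with `!`, and the expression table sits between `!series_matrix_table_begin` and `!series_matrix_table_end`. The standard-library sketch below splits the two parts; the function names are hypothetical, and in practice a DataFrame library may be more convenient for the table.

```python
import csv
import gzip


def parse_series_matrix(lines):
    """Split series matrix lines into (metadata, header, rows).

    metadata maps each '!' key to a list of value lists, one per
    occurrence, because keys such as Sample_characteristics_ch1 repeat.
    """
    metadata = {}
    table_lines = []
    in_table = False
    for line in lines:
        line = line.rstrip('\r\n')
        if line.startswith('!series_matrix_table_begin'):
            in_table = True
        elif line.startswith('!series_matrix_table_end'):
            in_table = False
        elif in_table:
            table_lines.append(line)
        elif line.startswith('!'):
            key, _, rest = line[1:].partition('\t')
            values = [v.strip('"') for v in rest.split('\t')] if rest else []
            metadata.setdefault(key, []).append(values)
    rows = list(csv.reader(table_lines, delimiter='\t'))
    header, body = (rows[0], rows[1:]) if rows else ([], [])
    return metadata, header, body


def read_series_matrix(path):
    """Open a plain or gzip-compressed series matrix file and parse it."""
    opener = gzip.open if str(path).endswith('.gz') else open
    with opener(path, 'rt', encoding='utf-8', errors='replace') as handle:
        return parse_series_matrix(handle)
```

Values in the table are returned as strings; converting them to floats is left to the caller because missing values may appear as empty cells.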
Common Errors
| Error | Cause | Solution |
|---|---|---|
| Empty results | Wrong entry_type | Add `gse[entry_type]` or `gds[entry_type]` |
| No FTPLink | Superseries or no data | Check if series has supplementary files |
| No SRA link | Microarray data | SRA only for sequencing data |
Decision Tree
    Need GEO expression data?
    ├── Looking for curated datasets?
    │   └── Search gds with gds[entry_type]
    ├── Looking for any experiment?
    │   └── Search gds with gse[entry_type]
    ├── Want RNA-seq specifically?
    │   └── Add 'expression profiling by high throughput sequencing[gdstype]'
    ├── Have a publication?
    │   └── Link pubmed -> gds
    ├── Need raw sequencing data?
    │   └── Link gds -> sra, then use sra-data skill
    ├── Need processed expression matrix?
    │   └── Download series matrix or use GEOparse
    └── Need full metadata?
        └── Use GEOparse library
Related Skills
- entrez-search - General database searching
- entrez-link - Link GEO to SRA and other databases
- sra-data - Download raw sequencing data from linked SRA
- batch-downloads - Download multiple GEO records