BioSkills bio-expression-matrix-gene-id-mapping
Convert between gene identifier systems including Ensembl, Entrez, HGNC symbols, and UniProt. Use when mapping IDs for pathway analysis or matching different data sources.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/expression-matrix/gene-id-mapping" ~/.claude/skills/gptomics-bioskills-bio-expression-matrix-gene-id-mapping && rm -rf "$T"
expression-matrix/gene-id-mapping/SKILL.mdVersion Compatibility
Reference examples tested with: pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - R:
thenpackageVersion('<pkg>')
to verify parameters?function_name
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Gene ID Mapping
Python: mygene
Goal: Convert between gene identifier systems (Ensembl, Entrez, Symbol, UniProt) using the MyGene.info API.
Approach: Query mygene with source IDs, specifying scopes and target fields, to build an ID mapping dictionary.
"Convert my Ensembl gene IDs to gene symbols" → Query a gene annotation service to map between identifier systems, handling one-to-many mappings.
import mygene import pandas as pd mg = mygene.MyGeneInfo() # Ensembl to Symbol ensembl_ids = ['ENSG00000141510', 'ENSG00000012048', 'ENSG00000141736'] results = mg.querymany(ensembl_ids, scopes='ensembl.gene', fields='symbol', species='human') mapping = {r['query']: r.get('symbol', None) for r in results} # {'ENSG00000141510': 'TP53', 'ENSG00000012048': 'BRCA1', 'ENSG00000141736': 'ERBB2'} # Symbol to Entrez symbols = ['TP53', 'BRCA1', 'ERBB2'] results = mg.querymany(symbols, scopes='symbol', fields='entrezgene', species='human') mapping = {r['query']: r.get('entrezgene', None) for r in results} # Ensembl to multiple fields results = mg.querymany(ensembl_ids, scopes='ensembl.gene', fields=['symbol', 'entrezgene', 'uniprot'], species='human')
Python: pyensembl
Goal: Map gene identifiers using a local Ensembl database for offline, fast lookups.
Approach: Load a specific Ensembl release and query gene objects by ID or name.
from pyensembl import EnsemblRelease # Load Ensembl release (downloads automatically first time) ensembl = EnsemblRelease(110, species='human') # or 'mouse' # Gene ID to symbol gene = ensembl.gene_by_id('ENSG00000141510') print(gene.gene_name) # TP53 # Symbol to gene ID gene = ensembl.genes_by_name('TP53')[0] print(gene.gene_id) # ENSG00000141510 # Batch conversion def ensembl_to_symbol(ensembl_ids, release=110): ens = EnsemblRelease(release, species='human') mapping = {} for eid in ensembl_ids: try: gene = ens.gene_by_id(eid.split('.')[0]) # Remove version mapping[eid] = gene.gene_name except ValueError: mapping[eid] = None return mapping
Python: gseapy
import gseapy as gp # Ensembl to Symbol using Enrichr gene_list = ['ENSG00000141510', 'ENSG00000012048'] converted = gp.biomart.ensembl2name(gene_list, organism='hsapiens')
R: biomaRt
Goal: Map gene identifiers using the Ensembl BioMart web service in R.
Approach: Connect to the Ensembl BioMart and retrieve attribute mappings for a list of gene IDs.
library(biomaRt) # Connect to Ensembl ensembl <- useEnsembl(biomart='genes', dataset='hsapiens_gene_ensembl') # Ensembl to Symbol ensembl_ids <- c('ENSG00000141510', 'ENSG00000012048', 'ENSG00000141736') results <- getBM( attributes=c('ensembl_gene_id', 'hgnc_symbol', 'entrezgene_id'), filters='ensembl_gene_id', values=ensembl_ids, mart=ensembl ) # Symbol to Ensembl symbols <- c('TP53', 'BRCA1', 'ERBB2') results <- getBM( attributes=c('hgnc_symbol', 'ensembl_gene_id'), filters='hgnc_symbol', values=symbols, mart=ensembl ) # All available attributes listAttributes(ensembl)
R: org.db Packages
Goal: Map gene identifiers using Bioconductor organism annotation packages for fast local lookups.
Approach: Use mapIds from AnnotationDbi with organism-specific org.db packages.
library(org.Hs.eg.db) # Human library(AnnotationDbi) # Ensembl to Symbol ensembl_ids <- c('ENSG00000141510', 'ENSG00000012048') symbols <- mapIds(org.Hs.eg.db, keys=ensembl_ids, keytype='ENSEMBL', column='SYMBOL') # Symbol to Entrez symbols <- c('TP53', 'BRCA1') entrez <- mapIds(org.Hs.eg.db, keys=symbols, keytype='SYMBOL', column='ENTREZID') # Available keytypes keytypes(org.Hs.eg.db) # ENSEMBL, ENSEMBLPROT, ENSEMBLTRANS, ENTREZID, SYMBOL, UNIPROT, etc.
Apply Mapping to Count Matrix
Goal: Replace gene IDs in a count matrix index with a different identifier type.
Approach: Map IDs via mygene, update the DataFrame index, and aggregate duplicates by summing.
"Convert the gene IDs in my count matrix from Ensembl to symbols" → Map the row index to a new ID type, handling version suffixes and duplicate mappings by summation.
import pandas as pd import mygene def map_count_matrix_ids(counts, from_type='ensembl.gene', to_type='symbol', species='human'): '''Map gene IDs in count matrix index.''' mg = mygene.MyGeneInfo() # Remove version numbers from Ensembl IDs clean_ids = [g.split('.')[0] for g in counts.index] # Query mygene results = mg.querymany(clean_ids, scopes=from_type, fields=to_type, species=species) # Build mapping mapping = {} for r in results: if to_type in r: mapping[r['query']] = r[to_type] # Apply mapping new_index = [mapping.get(g.split('.')[0], g) for g in counts.index] counts_mapped = counts.copy() counts_mapped.index = new_index # Handle duplicates (sum) counts_mapped = counts_mapped.groupby(counts_mapped.index).sum() return counts_mapped # Usage counts_symbols = map_count_matrix_ids(counts, 'ensembl.gene', 'symbol')
R Equivalent
Goal: Replace gene IDs in an R count matrix using biomaRt with duplicate aggregation.
Approach: Query BioMart for the mapping, merge with the count matrix, and sum duplicate rows.
library(biomaRt) map_count_matrix_ids <- function(counts, from_type='ensembl_gene_id', to_type='hgnc_symbol') { ensembl <- useEnsembl(biomart='genes', dataset='hsapiens_gene_ensembl') # Remove version numbers clean_ids <- gsub('\\..*', '', rownames(counts)) # Get mapping mapping <- getBM( attributes=c(from_type, to_type), filters=from_type, values=clean_ids, mart=ensembl ) # Merge and aggregate duplicates counts$gene_id <- clean_ids merged <- merge(counts, mapping, by.x='gene_id', by.y=from_type, all.x=TRUE) merged$gene_id <- NULL # Use symbol as rowname, sum duplicates rownames(merged) <- merged[[to_type]] merged[[to_type]] <- NULL counts_mapped <- aggregate(. ~ rownames(merged), data=merged, FUN=sum) rownames(counts_mapped) <- counts_mapped[,1] counts_mapped <- counts_mapped[,-1] return(counts_mapped) }
Handle Unmapped IDs
Goal: Track and gracefully handle gene IDs that fail to map to the target identifier system.
Approach: Keep original IDs for unmapped genes and report mapping success rate.
def robust_id_mapping(gene_ids, from_type, to_type, species='human'): '''Map IDs with fallback for unmapped genes.''' import mygene mg = mygene.MyGeneInfo() clean_ids = [g.split('.')[0] for g in gene_ids] results = mg.querymany(clean_ids, scopes=from_type, fields=to_type, species=species) mapping = {} unmapped = [] for r in results: original = gene_ids[clean_ids.index(r['query'])] if to_type in r: mapping[original] = r[to_type] else: mapping[original] = original # Keep original if unmapped unmapped.append(original) print(f'Mapped: {len(gene_ids) - len(unmapped)}/{len(gene_ids)}') print(f'Unmapped: {len(unmapped)}') return mapping, unmapped
Common ID Types and Database Selection
| Type | Example | Use Case | Stability |
|---|---|---|---|
| Ensembl Gene | ENSG00000141510 | RNA-seq, GTF files | Stable across releases (versioned) |
| Ensembl Transcript | ENST00000269305 | Transcript-level analysis | Stable (versioned) |
| Entrez Gene | 7157 | NCBI databases, KEGG | Stable (never reused) |
| HGNC Symbol | TP53 | Human readable display | Changes frequently |
| UniProt | P04637 | Protein databases | Stable (versioned releases) |
| RefSeq | NM_000546 | NCBI RefSeq | Stable (versioned) |
Database Selection Guide
| Scenario | Recommended ID | Why |
|---|---|---|
| Computational key / primary index | Ensembl Gene ID | Stable, versioned, consistent with GTF |
| Pathway analysis (KEGG, Reactome) | Entrez Gene ID | Required by most pathway databases |
| GO enrichment | Entrez or Ensembl | Both supported by clusterProfiler |
| Display labels (plots, tables) | HGNC Symbol | Human-readable |
| Cross-database integration | Ensembl | Best-connected hub across databases |
| Protein-level analysis | UniProt | Primary protein database |
Best practice: use stable IDs (Ensembl or Entrez) as computational keys. Use symbols only as display labels. Always pin mappings to a specific database release and archive the cross-reference table for reproducibility.
Gene Symbol Instability
Gene symbols change regularly as nomenclature committees update names. NCBI updates daily; Bioconductor org.db packages update every 6 months. For the most current mappings, query mygene.info or download gene_info from NCBI FTP directly. Never use symbols as the primary key in a pipeline -- always join on stable IDs and add symbols as a display column.
PAR Gene Complications
Pseudo-autosomal region (PAR) genes exist on both X and Y chromosomes with identical sequences. In Ensembl GTF files, PAR genes have coordinates on both chromosomes, potentially creating duplicate entries in count matrices. Reads from PAR regions cannot be unambiguously assigned to X or Y.
# Check for PAR gene duplicates in a count matrix par_genes_human = ['SHOX', 'IL3RA', 'SLC25A6', 'P2RY8', 'AKAP17A', 'ASMT', 'DHRSX'] duplicated_ids = counts.index[counts.index.duplicated()].unique() if len(duplicated_ids) > 0: print(f'Duplicate gene entries found: {len(duplicated_ids)}') # Sum duplicates (standard approach for PAR genes) counts = counts.groupby(counts.index).sum()
Some reference genomes mask the Y-chromosome PAR to avoid double-counting. Check whether the GTF includes PAR genes on both chromosomes before building count matrices.
Cross-Species Ortholog Mapping
Goal: Map gene IDs between species for cross-species comparisons or integration.
Approach: Use Ensembl Compara (via biomaRt) to find orthologs, selecting the appropriate stringency level.
library(biomaRt) human <- useEnsembl(biomart='genes', dataset='hsapiens_gene_ensembl') mouse <- useEnsembl(biomart='genes', dataset='mmusculus_gene_ensembl') # Human to mouse one-to-one orthologs orthologs <- getLDS( attributes=c('hgnc_symbol', 'ensembl_gene_id'), filters='ensembl_gene_id', values=human_gene_ids, mart=human, attributesL=c('mgi_symbol', 'ensembl_gene_id'), martL=mouse )
| Strategy | When to use | Trade-off |
|---|---|---|
| One-to-one orthologs only | Cross-species scRNA-seq integration | Most conservative; loses genes without clear orthologs |
| Include one-to-many | Broader gene coverage needed | Must select: highest homology confidence or highest expression |
| Include many-to-many | Maximum inclusivity | Introduces ambiguity; use with caution |
For cross-species scRNA-seq integration, use only one-to-one orthologs (standard practice).
Build tx2gene for tximport
Goal: Create the transcript-to-gene mapping table required by tximport for gene-level summarization from Salmon/kallisto output.
Approach: Extract transcript-gene relationships from a GTF file or Ensembl BioMart.
# From GTF (recommended for consistency with quantification index) library(GenomicFeatures) txdb <- makeTxDbFromGFF('annotation.gtf') k <- keys(txdb, keytype='TXNAME') tx2gene <- AnnotationDbi::select(txdb, k, 'GENEID', 'TXNAME') # From BioMart library(biomaRt) mart <- useMart('ensembl', dataset='hsapiens_gene_ensembl') tx2gene <- getBM( attributes=c('ensembl_transcript_id_version', 'ensembl_gene_id_version'), mart=mart ) colnames(tx2gene) <- c('TXNAME', 'GENEID')
# Python: extract tx2gene from GTF import pandas as pd def tx2gene_from_gtf(gtf_path): '''Extract transcript-to-gene mapping from GTF.''' records = [] with open(gtf_path) as f: for line in f: if line.startswith('#') or '\ttranscript\t' not in line: continue attrs = line.strip().split('\t')[8] gene_id = [a.split('"')[1] for a in attrs.split(';') if 'gene_id' in a][0] tx_id = [a.split('"')[1] for a in attrs.split(';') if 'transcript_id' in a][0] records.append({'TXNAME': tx_id, 'GENEID': gene_id}) return pd.DataFrame(records).drop_duplicates()
Related Skills
- expression-matrix/counts-ingest - Load count data
- expression-matrix/metadata-joins - Add annotations
- rna-quantification/tximport-workflow - Uses tx2gene mapping
- pathway-analysis/go-enrichment - Requires Entrez IDs
- pathway-analysis/kegg-pathways - Requires Entrez IDs