BioSkills bio-expression-matrix-gene-id-mapping

Convert between gene identifier systems including Ensembl, Entrez, HGNC symbols, and UniProt. Use when mapping IDs for pathway analysis or matching different data sources.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/expression-matrix/gene-id-mapping" ~/.claude/skills/gptomics-bioskills-bio-expression-matrix-gene-id-mapping && rm -rf "$T"
manifest: expression-matrix/gene-id-mapping/SKILL.md
source content

Version Compatibility

Reference examples tested with: pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

  • Python:
    pip show <package>
    then
    help(module.function)
    to check signatures
  • R:
    packageVersion('<pkg>')
    then
    ?function_name
    to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Gene ID Mapping

Python: mygene

Goal: Convert between gene identifier systems (Ensembl, Entrez, Symbol, UniProt) using the MyGene.info API.

Approach: Query mygene with source IDs, specifying scopes and target fields, to build an ID mapping dictionary.

"Convert my Ensembl gene IDs to gene symbols" → Query a gene annotation service to map between identifier systems, handling one-to-many mappings.

import mygene
import pandas as pd

mg = mygene.MyGeneInfo()

# Ensembl to Symbol
ensembl_ids = ['ENSG00000141510', 'ENSG00000012048', 'ENSG00000141736']
results = mg.querymany(ensembl_ids, scopes='ensembl.gene', fields='symbol', species='human')
mapping = {r['query']: r.get('symbol', None) for r in results}
# {'ENSG00000141510': 'TP53', 'ENSG00000012048': 'BRCA1', 'ENSG00000141736': 'ERBB2'}

# Symbol to Entrez
symbols = ['TP53', 'BRCA1', 'ERBB2']
results = mg.querymany(symbols, scopes='symbol', fields='entrezgene', species='human')
mapping = {r['query']: r.get('entrezgene', None) for r in results}

# Ensembl to multiple fields
results = mg.querymany(ensembl_ids, scopes='ensembl.gene',
    fields=['symbol', 'entrezgene', 'uniprot'], species='human')

Python: pyensembl

Goal: Map gene identifiers using a local Ensembl database for offline, fast lookups.

Approach: Load a specific Ensembl release and query gene objects by ID or name.

from pyensembl import EnsemblRelease

# Load Ensembl release (downloads automatically first time)
ensembl = EnsemblRelease(110, species='human')  # or 'mouse'

# Gene ID to symbol
gene = ensembl.gene_by_id('ENSG00000141510')
print(gene.gene_name)  # TP53

# Symbol to gene ID
gene = ensembl.genes_by_name('TP53')[0]
print(gene.gene_id)  # ENSG00000141510

# Batch conversion
def ensembl_to_symbol(ensembl_ids, release=110):
    ens = EnsemblRelease(release, species='human')
    mapping = {}
    for eid in ensembl_ids:
        try:
            gene = ens.gene_by_id(eid.split('.')[0])  # Remove version
            mapping[eid] = gene.gene_name
        except ValueError:
            mapping[eid] = None
    return mapping

Python: gseapy

import gseapy as gp

# Ensembl to Symbol using Enrichr
gene_list = ['ENSG00000141510', 'ENSG00000012048']
converted = gp.biomart.ensembl2name(gene_list, organism='hsapiens')

R: biomaRt

Goal: Map gene identifiers using the Ensembl BioMart web service in R.

Approach: Connect to the Ensembl BioMart and retrieve attribute mappings for a list of gene IDs.

library(biomaRt)

# Connect to Ensembl
ensembl <- useEnsembl(biomart='genes', dataset='hsapiens_gene_ensembl')

# Ensembl to Symbol
ensembl_ids <- c('ENSG00000141510', 'ENSG00000012048', 'ENSG00000141736')
results <- getBM(
    attributes=c('ensembl_gene_id', 'hgnc_symbol', 'entrezgene_id'),
    filters='ensembl_gene_id',
    values=ensembl_ids,
    mart=ensembl
)

# Symbol to Ensembl
symbols <- c('TP53', 'BRCA1', 'ERBB2')
results <- getBM(
    attributes=c('hgnc_symbol', 'ensembl_gene_id'),
    filters='hgnc_symbol',
    values=symbols,
    mart=ensembl
)

# All available attributes
listAttributes(ensembl)

R: org.db Packages

Goal: Map gene identifiers using Bioconductor organism annotation packages for fast local lookups.

Approach: Use mapIds from AnnotationDbi with organism-specific org.db packages.

library(org.Hs.eg.db)  # Human
library(AnnotationDbi)

# Ensembl to Symbol
ensembl_ids <- c('ENSG00000141510', 'ENSG00000012048')
symbols <- mapIds(org.Hs.eg.db, keys=ensembl_ids, keytype='ENSEMBL', column='SYMBOL')

# Symbol to Entrez
symbols <- c('TP53', 'BRCA1')
entrez <- mapIds(org.Hs.eg.db, keys=symbols, keytype='SYMBOL', column='ENTREZID')

# Available keytypes
keytypes(org.Hs.eg.db)
# ENSEMBL, ENSEMBLPROT, ENSEMBLTRANS, ENTREZID, SYMBOL, UNIPROT, etc.

Apply Mapping to Count Matrix

Goal: Replace gene IDs in a count matrix index with a different identifier type.

Approach: Map IDs via mygene, update the DataFrame index, and aggregate duplicates by summing.

"Convert the gene IDs in my count matrix from Ensembl to symbols" → Map the row index to a new ID type, handling version suffixes and duplicate mappings by summation.

import pandas as pd
import mygene

def map_count_matrix_ids(counts, from_type='ensembl.gene', to_type='symbol', species='human'):
    '''Map gene IDs in count matrix index.'''
    mg = mygene.MyGeneInfo()

    # Remove version numbers from Ensembl IDs
    clean_ids = [g.split('.')[0] for g in counts.index]

    # Query mygene
    results = mg.querymany(clean_ids, scopes=from_type, fields=to_type, species=species)

    # Build mapping
    mapping = {}
    for r in results:
        if to_type in r:
            mapping[r['query']] = r[to_type]

    # Apply mapping
    new_index = [mapping.get(g.split('.')[0], g) for g in counts.index]
    counts_mapped = counts.copy()
    counts_mapped.index = new_index

    # Handle duplicates (sum)
    counts_mapped = counts_mapped.groupby(counts_mapped.index).sum()

    return counts_mapped

# Usage
counts_symbols = map_count_matrix_ids(counts, 'ensembl.gene', 'symbol')

R Equivalent

Goal: Replace gene IDs in an R count matrix using biomaRt with duplicate aggregation.

Approach: Query BioMart for the mapping, merge with the count matrix, and sum duplicate rows.

library(biomaRt)

map_count_matrix_ids <- function(counts, from_type='ensembl_gene_id', to_type='hgnc_symbol') {
    ensembl <- useEnsembl(biomart='genes', dataset='hsapiens_gene_ensembl')

    # Remove version numbers
    clean_ids <- gsub('\\..*', '', rownames(counts))

    # Get mapping
    mapping <- getBM(
        attributes=c(from_type, to_type),
        filters=from_type,
        values=clean_ids,
        mart=ensembl
    )

    # Merge and aggregate duplicates
    counts$gene_id <- clean_ids
    merged <- merge(counts, mapping, by.x='gene_id', by.y=from_type, all.x=TRUE)
    merged$gene_id <- NULL

    # Use symbol as rowname, sum duplicates
    rownames(merged) <- merged[[to_type]]
    merged[[to_type]] <- NULL
    counts_mapped <- aggregate(. ~ rownames(merged), data=merged, FUN=sum)
    rownames(counts_mapped) <- counts_mapped[,1]
    counts_mapped <- counts_mapped[,-1]

    return(counts_mapped)
}

Handle Unmapped IDs

Goal: Track and gracefully handle gene IDs that fail to map to the target identifier system.

Approach: Keep original IDs for unmapped genes and report mapping success rate.

def robust_id_mapping(gene_ids, from_type, to_type, species='human'):
    '''Map IDs with fallback for unmapped genes.'''
    import mygene
    mg = mygene.MyGeneInfo()

    clean_ids = [g.split('.')[0] for g in gene_ids]
    results = mg.querymany(clean_ids, scopes=from_type, fields=to_type, species=species)

    mapping = {}
    unmapped = []
    for r in results:
        original = gene_ids[clean_ids.index(r['query'])]
        if to_type in r:
            mapping[original] = r[to_type]
        else:
            mapping[original] = original  # Keep original if unmapped
            unmapped.append(original)

    print(f'Mapped: {len(gene_ids) - len(unmapped)}/{len(gene_ids)}')
    print(f'Unmapped: {len(unmapped)}')

    return mapping, unmapped

Common ID Types and Database Selection

TypeExampleUse CaseStability
Ensembl GeneENSG00000141510RNA-seq, GTF filesStable across releases (versioned)
Ensembl TranscriptENST00000269305Transcript-level analysisStable (versioned)
Entrez Gene7157NCBI databases, KEGGStable (never reused)
HGNC SymbolTP53Human readable displayChanges frequently
UniProtP04637Protein databasesStable (versioned releases)
RefSeqNM_000546NCBI RefSeqStable (versioned)

Database Selection Guide

ScenarioRecommended IDWhy
Computational key / primary indexEnsembl Gene IDStable, versioned, consistent with GTF
Pathway analysis (KEGG, Reactome)Entrez Gene IDRequired by most pathway databases
GO enrichmentEntrez or EnsemblBoth supported by clusterProfiler
Display labels (plots, tables)HGNC SymbolHuman-readable
Cross-database integrationEnsemblBest-connected hub across databases
Protein-level analysisUniProtPrimary protein database

Best practice: use stable IDs (Ensembl or Entrez) as computational keys. Use symbols only as display labels. Always pin mappings to a specific database release and archive the cross-reference table for reproducibility.

Gene Symbol Instability

Gene symbols change regularly as nomenclature committees update names. NCBI updates daily; Bioconductor org.db packages update every 6 months. For the most current mappings, query mygene.info or download gene_info from NCBI FTP directly. Never use symbols as the primary key in a pipeline -- always join on stable IDs and add symbols as a display column.

PAR Gene Complications

Pseudo-autosomal region (PAR) genes exist on both X and Y chromosomes with identical sequences. In Ensembl GTF files, PAR genes have coordinates on both chromosomes, potentially creating duplicate entries in count matrices. Reads from PAR regions cannot be unambiguously assigned to X or Y.

# Check for PAR gene duplicates in a count matrix
par_genes_human = ['SHOX', 'IL3RA', 'SLC25A6', 'P2RY8', 'AKAP17A', 'ASMT', 'DHRSX']
duplicated_ids = counts.index[counts.index.duplicated()].unique()
if len(duplicated_ids) > 0:
    print(f'Duplicate gene entries found: {len(duplicated_ids)}')
    # Sum duplicates (standard approach for PAR genes)
    counts = counts.groupby(counts.index).sum()

Some reference genomes mask the Y-chromosome PAR to avoid double-counting. Check whether the GTF includes PAR genes on both chromosomes before building count matrices.

Cross-Species Ortholog Mapping

Goal: Map gene IDs between species for cross-species comparisons or integration.

Approach: Use Ensembl Compara (via biomaRt) to find orthologs, selecting the appropriate stringency level.

library(biomaRt)

human <- useEnsembl(biomart='genes', dataset='hsapiens_gene_ensembl')
mouse <- useEnsembl(biomart='genes', dataset='mmusculus_gene_ensembl')

# Human to mouse one-to-one orthologs
orthologs <- getLDS(
    attributes=c('hgnc_symbol', 'ensembl_gene_id'),
    filters='ensembl_gene_id',
    values=human_gene_ids,
    mart=human,
    attributesL=c('mgi_symbol', 'ensembl_gene_id'),
    martL=mouse
)
StrategyWhen to useTrade-off
One-to-one orthologs onlyCross-species scRNA-seq integrationMost conservative; loses genes without clear orthologs
Include one-to-manyBroader gene coverage neededMust select: highest homology confidence or highest expression
Include many-to-manyMaximum inclusivityIntroduces ambiguity; use with caution

For cross-species scRNA-seq integration, use only one-to-one orthologs (standard practice).

Build tx2gene for tximport

Goal: Create the transcript-to-gene mapping table required by tximport for gene-level summarization from Salmon/kallisto output.

Approach: Extract transcript-gene relationships from a GTF file or Ensembl BioMart.

# From GTF (recommended for consistency with quantification index)
library(GenomicFeatures)
txdb <- makeTxDbFromGFF('annotation.gtf')
k <- keys(txdb, keytype='TXNAME')
tx2gene <- AnnotationDbi::select(txdb, k, 'GENEID', 'TXNAME')

# From BioMart
library(biomaRt)
mart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
tx2gene <- getBM(
    attributes=c('ensembl_transcript_id_version', 'ensembl_gene_id_version'),
    mart=mart
)
colnames(tx2gene) <- c('TXNAME', 'GENEID')
# Python: extract tx2gene from GTF
import pandas as pd

def tx2gene_from_gtf(gtf_path):
    '''Extract transcript-to-gene mapping from GTF.'''
    records = []
    with open(gtf_path) as f:
        for line in f:
            if line.startswith('#') or '\ttranscript\t' not in line:
                continue
            attrs = line.strip().split('\t')[8]
            gene_id = [a.split('"')[1] for a in attrs.split(';') if 'gene_id' in a][0]
            tx_id = [a.split('"')[1] for a in attrs.split(';') if 'transcript_id' in a][0]
            records.append({'TXNAME': tx_id, 'GENEID': gene_id})
    return pd.DataFrame(records).drop_duplicates()

Related Skills

  • expression-matrix/counts-ingest - Load count data
  • expression-matrix/metadata-joins - Add annotations
  • rna-quantification/tximport-workflow - Uses tx2gene mapping
  • pathway-analysis/go-enrichment - Requires Entrez IDs
  • pathway-analysis/kegg-pathways - Requires Entrez IDs