BioSkills bio-pathway-go-enrichment

Gene Ontology over-representation analysis using clusterProfiler enrichGO. Use when identifying biological functions enriched in a gene list from differential expression or other analyses. Supports all three ontologies (BP, MF, CC), multiple ID types, and customizable statistical thresholds.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/pathway-analysis/go-enrichment" ~/.claude/skills/gptomics-bioskills-bio-pathway-go-enrichment && rm -rf "$T"
manifest: pathway-analysis/go-enrichment/SKILL.md

Version Compatibility

Reference examples tested with: R stats (base), clusterProfiler 4.10+

Before using code patterns, verify installed versions match. If versions differ:

  • R: run packageVersion('<pkg>'), then ?function_name to verify parameters

If code fails with an R error such as "could not find function" or "unused argument", introspect the installed package and adapt the example to match the actual API rather than retrying.
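
A version check along these lines can be run before using the examples. This is a minimal sketch: the helper name meets_min_version is hypothetical (it is not part of any package), and the 4.10.0 threshold is taken from the "tested with" note above.

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or newer.
meets_min_version <- function(pkg, min_version) {
    if (!requireNamespace(pkg, quietly = TRUE)) return(FALSE)
    packageVersion(pkg) >= min_version
}

if (meets_min_version("clusterProfiler", "4.10.0")) {
    # List the arguments the installed enrichGO() actually accepts
    print(names(formals(clusterProfiler::enrichGO)))
} else {
    message("clusterProfiler >= 4.10.0 not found; check ?enrichGO before adapting the examples")
}
```

If an argument used in an example is missing from the printed names, the installed version differs from the one the example was written for.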

GO Over-Representation Analysis

When to Use ORA vs GSEA

| Scenario | Method | Why |
| --- | --- | --- |
| Clear DE gene list with arbitrary cutoff (padj + FC) | ORA, but consider GSEA instead | ORA discards magnitude; GSEA uses all genes ranked by statistic |
| Genes from co-expression module, GWAS loci, screen hits | ORA | No ranking available; ORA is appropriate |
| All genes with DE statistics available | GSEA (gseGO) | Avoids arbitrary cutoff; detects subtle coordinated changes |
| Very few DE genes (< 20) | GSEA | ORA has no power with small lists |
| RNA-seq with known length bias | GOseq (goseq package) | Standard ORA ignores length bias; longer genes are more likely DE |

ORA converts continuous measures into binary (significant/not), losing information. When in doubt, run both ORA and GSEA and compare.
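
The difference in input can be sketched with a toy DE table. All values here are made up, the column names follow the DESeq2-style columns used elsewhere in this document, and ranking by log2 fold change is only one possible choice of ranking statistic.

```r
# Toy DE table (made-up values)
de_results <- data.frame(
    gene_id = c("geneA", "geneB", "geneC", "geneD", "geneE"),
    log2FoldChange = c(2.5, -1.8, 0.4, -0.3, 1.1),
    padj = c(0.001, 0.01, 0.20, 0.60, 0.04)
)

# ORA input: a binary decision per gene (in the list or not)
is_sig <- de_results$padj < 0.05 & abs(de_results$log2FoldChange) > 1
ora_genes <- de_results$gene_id[is_sig]

# GSEA input: every gene, as a named numeric vector sorted in decreasing order
ranked <- setNames(de_results$log2FoldChange, de_results$gene_id)
ranked <- sort(ranked, decreasing = TRUE)

print(ora_genes)
print(ranked)
```

The ORA vector keeps three gene names and nothing else, while the ranked vector retains the magnitude and direction for all five genes.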

Core Pattern

Goal: Identify enriched Gene Ontology terms in a gene list from differential expression or similar analyses.

Approach: Test for over-representation of GO terms using the hypergeometric test via clusterProfiler enrichGO.

"Run GO enrichment on my gene list" → Test whether biological process, molecular function, or cellular component terms are over-represented among significant genes.

library(clusterProfiler)
library(org.Hs.eg.db)  # Human - change for other organisms

ego <- enrichGO(
    gene = gene_list,           # Character vector of gene IDs
    OrgDb = org.Hs.eg.db,       # Organism annotation database
    keyType = 'ENTREZID',       # ID type: ENSEMBL, SYMBOL, ENTREZID, etc.
    ont = 'BP',                 # BP, MF, CC, or ALL
    pAdjustMethod = 'BH',       # p-value adjustment method
    pvalueCutoff = 0.05,
    qvalueCutoff = 0.2
)
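
The hypergeometric test named in the approach can be worked through for a single term in base R. This is an illustrative sketch with made-up counts, not a reproduction of clusterProfiler's internals; it shows that the upper-tail hypergeometric probability matches a one-sided Fisher's exact test on the corresponding 2x2 table.

```r
# Made-up counts for one GO term
N <- 12000  # genes in the background universe
M <- 150    # background genes annotated to the term
n <- 300    # genes in the query list
k <- 12     # query genes annotated to the term

# P(X >= k) under the hypergeometric distribution
p_hyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# Same test expressed as a one-sided Fisher's exact test on the 2x2 table
# (rows: in term / not in term; columns: in query / not in query)
tab <- matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(hypergeometric = p_hyper, fisher = p_fisher))
```

With these counts the expected overlap by chance is n * M / N = 3.75 genes, so observing 12 is about a 3.2-fold enrichment.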

Prepare Gene List from DE Results

Goal: Extract significant gene IDs from differential expression results and convert to the format required by enrichGO.

Approach: Filter DE results by adjusted p-value and fold change, then convert gene symbols to Entrez IDs using bitr.

library(dplyr)

de_results <- read.csv('de_results.csv')

sig_genes <- de_results %>%
    filter(padj < 0.05, abs(log2FoldChange) > 1) %>%
    pull(gene_id)

# If using gene symbols, convert to Entrez IDs
gene_ids <- bitr(sig_genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
gene_list <- gene_ids$ENTREZID

ID Conversion with bitr

Goal: Convert between gene identifier types (Ensembl, Symbol, Entrez) for compatibility with enrichment tools.

Approach: Use clusterProfiler bitr to map between ID types using organism annotation databases.

# Check available key types
keytypes(org.Hs.eg.db)

# Convert between ID types
converted <- bitr(genes, fromType = 'ENSEMBL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)

# Multiple output types
converted <- bitr(genes, fromType = 'SYMBOL', toType = c('ENTREZID', 'ENSEMBL'), OrgDb = org.Hs.eg.db)

Background Universe (Critical)

Goal: Improve enrichment specificity by restricting the background to genes actually tested in the experiment.

Approach: Pass all expressed genes (not just significant ones) as the universe parameter to enrichGO.

The background must be genes that could have appeared in the list. Getting this wrong is the single most common ORA error (95% of published analyses fail to specify an appropriate background). Using the whole genome (~20,000 genes) when only 12,000 were expressed inflates significance for tissue-specific pathways.

| Experiment Type | Correct Background |
| --- | --- |
| RNA-seq | All genes with detectable expression (e.g., > 1 CPM in >= N samples) |
| Microarray | All probes on the array (mapped to genes) |
| Proteomics | All detected proteins |
| Targeted panel | Only genes on the panel |

# Background = all genes that were tested (NOT the full genome)
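# For DESeq2: a non-NA pvalue means the gene was tested (all-zero and outlier genes get NA);
# filter on !is.na(padj) instead to also require that the gene passed independent filtering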
all_tested <- de_results$gene_id[!is.na(de_results$pvalue)]
universe_ids <- bitr(all_tested, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)

ego <- enrichGO(
    gene = gene_list,
    universe = universe_ids$ENTREZID,
    OrgDb = org.Hs.eg.db,
    keyType = 'ENTREZID',
    ont = 'BP',
    pAdjustMethod = 'BH',
    pvalueCutoff = 0.05
)
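
To see why the choice of background matters, the same overlap can be scored against two universes with base R's phyper. The counts below are made up for illustration: a tissue-specific term with 200 annotated genes genome-wide, 180 of which are expressed in the experiment.

```r
# Made-up counts for a tissue-specific term
k <- 15      # query genes annotated to the term
n <- 500     # query list size

# Whole genome as background: 200 of 20,000 genes carry the term
p_genome <- phyper(k - 1, 200, 20000 - 200, n, lower.tail = FALSE)

# Expressed genes as background: 180 of 12,000 genes carry the term
p_expressed <- phyper(k - 1, 180, 12000 - 180, n, lower.tail = FALSE)

print(c(genome = p_genome, expressed = p_expressed))
```

The expected overlap is 5 genes against the genome but 7.5 against the expressed set, so the whole-genome background gives the smaller (more optimistic) p-value for the same 15 observed genes.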

Warning: clusterProfiler silently drops unannotated genes from the background. To prevent this, set the following option before running enrichGO:

options(enrichment_force_universe = TRUE)

All Three Ontologies

# Run all ontologies at once
ego_all <- enrichGO(
    gene = gene_list,
    OrgDb = org.Hs.eg.db,
    keyType = 'ENTREZID',
    ont = 'ALL',  # BP, MF, and CC combined
    pAdjustMethod = 'BH',
    pvalueCutoff = 0.05
)

# Results include ONTOLOGY column
head(as.data.frame(ego_all))

Make Results Readable

# Convert Entrez IDs to gene symbols in results
ego_readable <- setReadable(ego, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')

# Or use readable = TRUE directly (only works with ENTREZID input)
ego <- enrichGO(
    gene = gene_list,
    OrgDb = org.Hs.eg.db,
    keyType = 'ENTREZID',
    ont = 'BP',
    readable = TRUE  # Converts to symbols
)

Extract and Export Results

# View top results
head(ego)

# Convert to data frame
results_df <- as.data.frame(ego)

# Key columns: ID, Description, GeneRatio, BgRatio, pvalue, p.adjust, qvalue, geneID, Count

# Export to CSV
write.csv(results_df, 'go_enrichment_results.csv', row.names = FALSE)

# Filter for specific criteria
sig_terms <- results_df[results_df$p.adjust < 0.01 & results_df$Count >= 5, ]

Simplify Redundant Terms

Goal: Remove highly similar GO terms to reduce redundancy in enrichment results.

Approach: Cluster GO terms by semantic similarity and retain representative terms using the simplify function.

GO terms form a DAG (directed acyclic graph), not a flat list. If "mitotic cell cycle" is enriched, parent terms ("cell cycle", "cell cycle process") will also be enriched because they contain supersets of the same genes. Always simplify before interpretation.

# Remove redundant GO terms (keeps representative terms)
ego_simplified <- simplify(ego, cutoff = 0.7, by = 'p.adjust', select_fun = min)

# measure options: 'Wang' (default, graph-based, stable across releases),
# 'Resnik', 'Lin', 'Jiang', 'Rel' (IC-based, depend on annotation version)
ego_simplified <- simplify(ego, cutoff = 0.7, measure = 'Wang')

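Limitations: simplify() does NOT work with ont='ALL'; run BP, MF, and CC separately. A cutoff of 0.7 is a reasonable default. The cutoff is a similarity threshold: terms more similar than the cutoff are treated as redundant, so a lower cutoff removes more terms (more aggressive) and a higher cutoff retains more.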
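
The idea behind similarity-based pruning can be illustrated on a toy example in base R. This is a simplified sketch and not clusterProfiler's implementation: the similarity values and adjusted p-values are made up, and drop_redundant is a hypothetical helper written for this illustration.

```r
# Toy semantic-similarity matrix between four enriched terms (made-up values)
terms <- c("mitotic cell cycle", "cell cycle process", "cell cycle", "immune response")
sim <- matrix(
    c(1.00, 0.85, 0.75, 0.10,
      0.85, 1.00, 0.90, 0.12,
      0.75, 0.90, 1.00, 0.15,
      0.10, 0.12, 0.15, 1.00),
    nrow = 4, byrow = TRUE, dimnames = list(terms, terms)
)
p_adjust <- setNames(c(1e-8, 1e-6, 1e-5, 1e-3), terms)

# For each term, collect the terms whose similarity exceeds the cutoff and
# mark all but the one with the smallest adjusted p-value for removal.
drop_redundant <- function(sim, p_adjust, cutoff = 0.7) {
    to_remove <- character()
    for (term in rownames(sim)) {
        similar <- names(which(sim[term, ] > cutoff))  # includes the term itself
        if (length(similar) < 2) next
        best <- similar[which.min(p_adjust[similar])]
        to_remove <- union(to_remove, setdiff(similar, best))
    }
    setdiff(rownames(sim), to_remove)
}

kept <- drop_redundant(sim, p_adjust, cutoff = 0.7)
print(kept)
```

Here the three cell-cycle terms collapse to the one with the smallest adjusted p-value, and the unrelated term is kept.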

Different Organisms

# Mouse
library(org.Mm.eg.db)
ego_mouse <- enrichGO(gene = genes, OrgDb = org.Mm.eg.db, ont = 'BP')

# Zebrafish
library(org.Dr.eg.db)
ego_zfish <- enrichGO(gene = genes, OrgDb = org.Dr.eg.db, ont = 'BP')

# Yeast
library(org.Sc.sgd.db)
ego_yeast <- enrichGO(gene = genes, OrgDb = org.Sc.sgd.db, ont = 'BP', keyType = 'ORF')

Group GO Terms by Ancestor

Goal: Classify genes by broad GO slim categories for a high-level functional overview.

Approach: Use groupGO to assign genes to GO terms at a specific hierarchy level.

# Classify genes by GO slim categories
ggo <- groupGO(
    gene = gene_list,
    OrgDb = org.Hs.eg.db,
    ont = 'BP',
    level = 3,  # GO hierarchy level
    readable = TRUE
)

Key Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| gene | required | Vector of gene IDs |
| OrgDb | required | Organism database |
| keyType | ENTREZID | Input ID type |
| ont | MF | BP, MF, CC, or ALL |
| pvalueCutoff | 0.05 | P-value threshold |
| qvalueCutoff | 0.2 | Q-value (FDR) threshold |
| pAdjustMethod | BH | BH, bonferroni, etc. |
| universe | none | Background genes; all annotated genes if omitted |
| minGSSize | 10 | Min genes per term |
| maxGSSize | 500 | Max genes per term |
| readable | FALSE | Convert to symbols |

Interpreting Results

Always examine effect size alongside p-values. A large term (around 500 genes) can reach a very small p-value with only a modest fold enrichment such as 1.2x, while a 10-gene term with 4x enrichment at p = 0.01 may be biologically more interesting.

  • Fold enrichment = GeneRatio / BgRatio. Values > 2 suggest strong enrichment.
  • Count: number of query genes in the term. Very large counts (> 50) may indicate overly broad terms.
  • Setting minGSSize = 10 and maxGSSize = 500 filters out uninformative extremes.
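
Fold enrichment can be computed from the GeneRatio and BgRatio columns, which are stored as "numerator/denominator" strings. The sketch below uses a toy data frame shaped like as.data.frame(ego); the term IDs and counts are made up, and ratio_to_numeric is a hypothetical helper. Some recent clusterProfiler releases may already include a fold enrichment column, so it is worth checking names(results_df) first.

```r
# Toy rows shaped like as.data.frame(ego); IDs and counts are made up
results_df <- data.frame(
    ID = c("term_A", "term_B"),
    GeneRatio = c("12/300", "15/300"),
    BgRatio = c("150/12000", "500/12000"),
    stringsAsFactors = FALSE
)

# Turn a "numerator/denominator" string into a number
ratio_to_numeric <- function(x) {
    parts <- strsplit(x, "/", fixed = TRUE)
    vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}

results_df$FoldEnrichment <- ratio_to_numeric(results_df$GeneRatio) /
    ratio_to_numeric(results_df$BgRatio)
print(results_df)
```

In this toy table term_A has a 3.2x enrichment and the larger term_B only 1.2x, which is the kind of contrast that p-values alone can hide.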

Gene ID Mapping Pitfalls

  • Many-to-many mappings: one Ensembl gene can map to multiple Entrez IDs. Deduplicate after bitr() to avoid counting genes multiple times.
  • Lost genes: if > 15% of genes fail to convert, results may be unreliable. Always report the conversion rate.
  • Best practice: use the same ID type throughout the pipeline. Convert at the last step if possible.
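
The conversion rate and deduplication steps can be sketched in base R. The mapping below is a mock of what bitr() might return (a two-column data frame with unmapped IDs dropped); the IDs are placeholders, and keeping the first mapping per input ID is an arbitrary choice made for illustration.

```r
# Placeholder input IDs and a mock bitr()-style result:
# one ID maps twice, one ID is missing.
input_ids <- c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D")
mapping <- data.frame(
    ENSEMBL = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C"),
    ENTREZID = c("101", "102", "103", "104"),
    stringsAsFactors = FALSE
)

# Share of input IDs that received at least one mapping
conversion_rate <- mean(input_ids %in% mapping$ENSEMBL)

# Keep the first mapping per input ID, then drop repeated targets
mapping_unique <- mapping[!duplicated(mapping$ENSEMBL), ]
gene_list <- unique(mapping_unique$ENTREZID)

cat(sprintf("Converted %.0f%% of input IDs\n", 100 * conversion_rate))
print(gene_list)
```

In this mock, 25% of the IDs are lost, which is above the 15% threshold mentioned above and would be worth reporting.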

RNA-seq Gene Length Bias

In RNA-seq, longer transcripts produce more fragments, increasing statistical power to detect DE. This systematically biases ORA toward pathways enriched in long genes (extracellular matrix, cell adhesion) and against short-gene pathways (ribosomal, mitochondrial). Standard normalization (RPKM, TMM) does NOT fix this.

For length-corrected GO enrichment, use GOseq:

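library(goseq)
# de_vector: named 0/1 vector (1 = DE, 0 = not DE); names are Ensembl gene IDs
# gene_lengths: numeric vector of gene lengths in the same order as de_vector
pwf <- nullp(de_vector, 'hg38', 'ensGene', bias.data = gene_lengths)
goseq_results <- goseq(pwf, 'hg38', 'ensGene', method = 'Wallenius')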
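
The inputs for the GOseq call can be prepared in base R. This is a minimal sketch on a toy table: the gene IDs and lengths are placeholders, the length column name is an assumption, and treating genes with NA padj as untested is one possible convention.

```r
# Toy DE table; IDs and lengths are placeholders
de_results <- data.frame(
    gene_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
    padj = c(0.001, 0.30, NA, 0.02),
    length = c(2500, 800, 1200, 5400)
)

# Drop genes that were not tested, then flag DE genes as 1 and the rest as 0
tested <- de_results[!is.na(de_results$padj), ]
de_vector <- setNames(as.integer(tested$padj < 0.05), tested$gene_id)
gene_lengths <- tested$length   # same order as de_vector

print(de_vector)
```

Keeping non-DE genes in the vector matters here, because they define the background against which the length bias is estimated.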

Related Skills

  • kegg-pathways - KEGG pathway enrichment
  • gsea - Gene Set Enrichment Analysis for GO
  • enrichment-visualization - Visualize enrichment results
  • differential-expression/de-results - Generate input gene lists