Claude-skill-registry bio-pathway-gsea
Gene Set Enrichment Analysis using clusterProfiler gseGO and gseKEGG. Use when analyzing ranked gene lists to find coordinated expression changes in gene sets without arbitrary significance cutoffs. Detects subtle but coordinated expression changes.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/gsea" ~/.claude/skills/majiayu000-claude-skill-registry-bio-pathway-gsea && rm -rf "$T"
manifest:
skills/data/gsea/SKILL.mdsource content
Gene Set Enrichment Analysis (GSEA)
Core Concept
GSEA uses all genes ranked by a statistic (log2FC, signed p-value) rather than a subset of significant genes. It finds gene sets where members are enriched at the top or bottom of the ranked list.
Prepare Ranked Gene List
library(clusterProfiler) library(org.Hs.eg.db) de_results <- read.csv('de_results.csv') # Create named vector: values = statistic, names = gene IDs gene_list <- de_results$log2FoldChange names(gene_list) <- de_results$gene_id # Sort in decreasing order (REQUIRED) gene_list <- sort(gene_list, decreasing = TRUE)
Convert Gene IDs for GSEA
# Convert symbols to Entrez IDs gene_ids <- bitr(names(gene_list), fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db) # Create ranked list with Entrez IDs gene_list_entrez <- gene_list[names(gene_list) %in% gene_ids$SYMBOL] names(gene_list_entrez) <- gene_ids$ENTREZID[match(names(gene_list_entrez), gene_ids$SYMBOL)] gene_list_entrez <- sort(gene_list_entrez, decreasing = TRUE)
Alternative Ranking Statistics
# Signed p-value (recommended for detecting both up and down) gene_list <- -log10(de_results$pvalue) * sign(de_results$log2FoldChange) names(gene_list) <- de_results$gene_id gene_list <- sort(gene_list, decreasing = TRUE) # Wald statistic (from DESeq2) gene_list <- de_results$stat names(gene_list) <- de_results$gene_id gene_list <- sort(gene_list, decreasing = TRUE)
GSEA with GO
gse_go <- gseGO( geneList = gene_list_entrez, OrgDb = org.Hs.eg.db, ont = 'BP', # BP, MF, CC, or ALL minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE, pAdjustMethod = 'BH' ) # Make readable gse_go <- setReadable(gse_go, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')
GSEA with KEGG
gse_kegg <- gseKEGG( geneList = gene_list_entrez, organism = 'hsa', minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE ) # Make readable gse_kegg <- setReadable(gse_kegg, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')
GSEA with Custom Gene Sets
# Read GMT file (Gene Matrix Transposed) gene_sets <- read.gmt('msigdb_hallmarks.gmt') gse_custom <- GSEA( geneList = gene_list_entrez, TERM2GENE = gene_sets, minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05 )
MSigDB Gene Sets
# Use msigdbr package for MSigDB gene sets library(msigdbr) # Hallmark gene sets hallmarks <- msigdbr(species = 'Homo sapiens', category = 'H') hallmarks_t2g <- hallmarks[, c('gs_name', 'entrez_gene')] gse_hallmark <- GSEA( geneList = gene_list_entrez, TERM2GENE = hallmarks_t2g, pvalueCutoff = 0.05 ) # Other categories: C1 (positional), C2 (curated), C3 (motif), C5 (GO), C6 (oncogenic), C7 (immunologic)
Understanding Results
# View results head(gse_go) results <- as.data.frame(gse_go) # Key columns: # - NES: Normalized Enrichment Score (positive = upregulated, negative = downregulated) # - pvalue: Nominal p-value # - p.adjust: FDR-adjusted p-value # - core_enrichment: Leading edge genes
Interpreting NES (Normalized Enrichment Score)
| NES | Interpretation |
|---|---|
| Positive (> 0) | Gene set enriched in upregulated genes |
| Negative (< 0) | Gene set enriched in downregulated genes |
| NES |
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| geneList | required | Named, sorted numeric vector |
| OrgDb | required | Organism database (for gseGO) |
| organism | hsa | KEGG organism code (for gseKEGG) |
| ont | BP | Ontology: BP, MF, CC, ALL |
| minGSSize | 10 | Min genes in gene set |
| maxGSSize | 500 | Max genes in gene set |
| pvalueCutoff | 0.05 | P-value threshold |
| pAdjustMethod | BH | Adjustment method |
| nPerm | 10000 | Permutations (if permutation test used) |
| eps | 1e-10 | Boundary for p-value calculation |
Export Results
results_df <- as.data.frame(gse_go) write.csv(results_df, 'gsea_go_results.csv', row.names = FALSE) # Get leading edge genes for a term leading_edge <- strsplit(results_df$core_enrichment[1], '/')[[1]]
Notes
- Must be sorted - gene list must be sorted in decreasing order
- Named vector - names are gene IDs, values are statistics
- No arbitrary cutoffs - uses all genes, not just significant ones
- NES sign matters - positive = upregulated enrichment
- Leading edge - core_enrichment contains driving genes
Related Skills
- go-enrichment - Over-representation analysis for GO
- kegg-pathways - Over-representation analysis for KEGG
- enrichment-visualization - GSEA plots, ridge plots