Claude-skill-registry clustermarkers
Finds differentially expressed genes (markers) for clusters of T/B cells using Seurat's FindMarkers function. Performs statistical testing between clusters, identifies cluster-defining genes, and automatically runs pathway enrichment analysis (via Enrichr) on significant markers. Generates publication-ready visualizations including volcano plots, dot plots, heatmaps, and enrichment plots.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/clustermarkers" ~/.claude/skills/majiayu000-claude-skill-registry-clustermarkers && rm -rf "$T"
skills/data/clustermarkers/SKILL.md
ClusterMarkers Process Configuration
Purpose
Finds differentially expressed genes (markers) for clusters of T/B cells using Seurat's FindMarkers function. Performs statistical testing between clusters, identifies cluster-defining genes, and automatically runs pathway enrichment analysis (via Enrichr) on significant markers. Generates publication-ready visualizations including volcano plots, dot plots, heatmaps, and enrichment plots.
When to Use
- After SeuratClustering: Essential for cluster interpretation and annotation
- Cluster annotation: Identify marker genes to assign biological meaning to clusters
- Publication preparation: Generate marker tables, volcano plots, and enrichment figures
- Cell type characterization: Understand functional differences between cell populations
- Comparative analysis: Compare clusters to find unique gene expression signatures
Configuration Structure
Process Enablement
[ClusterMarkers]
cache = true  # Cache results for faster re-runs with different visualizations
Input Specification
[ClusterMarkers.in]
srtobj = ["SeuratClustering"]  # Seurat object with cluster assignments
Environment Variables
Core Parameters
[ClusterMarkers.envs]
# Number of cores for parallel computation
ncores = 1  # int; Parallelize Seurat procedures
# Subset cells before marker finding (R expression)
subset = "seurat_clusters %in% c('c1', 'c2', 'c3')"  # Optional
# Cache location for intermediate results
cache = "/tmp"  # Path; Set to false to disable caching
# Assay to use for marker finding
assay = "RNA"  # Default: uses active assay
# Error on no markers found
error = false  # bool; If true, fail if no markers found
Statistical Test Selection
[ClusterMarkers.envs]
# Statistical test for differential expression
test.use = "wilcox"  # Default
Available tests:
- "wilcox": Wilcoxon rank sum test (default, fast)
- "wilcox_limma": Limma implementation (Seurat v4 compatibility)
- "MAST": GLM with cellular detection rate covariate (recommended)
- "DESeq2": Negative binomial model (robust, requires counts)
- "roc": ROC analysis (AUC-based classification)
- "t": Student's t-test
- "tobit": Tobit test for censored data
- "bimod": Likelihood-ratio test for bimodal expression
- "poisson": Poisson distribution (UMI datasets only)
- "negbinom": Negative binomial (UMI datasets only)
- "LR": Logistic regression (latent.vars supported)
Test selection guidelines:
- Default: "wilcox" for speed and reliability
- Publication-quality: "MAST" for single-cell-specific modeling
- Bulk-like DE: "DESeq2" for rigorous statistical testing
- UMI data: "negbinom" or "poisson" for count-based models
- Classification: "roc" for AUC-based marker ranking
Threshold Parameters (Seurat FindMarkers)
[ClusterMarkers.envs]
# Minimum log2 fold change threshold
logfc.threshold = 0.25  # float; Default: 0.25
# Minimum percentage of cells expressing gene
min.pct = 0.1  # float; Range: 0.0-1.0
# Minimum difference in detection percentage
min.diff.pct = -inf  # float; Default: no limit (TOML spells infinity in lowercase)
# Only positive markers (higher in ident.1 group)
only.pos = false  # bool; Default: false (both directions)
# Maximum cells per identity (downsampling)
max.cells.per.ident = inf  # No downsampling by default
# Minimum cells expressing gene (poisson/negbinom tests)
min.cells.feature = 3  # int
# Minimum cells per group
min.cells.group = 3  # int
Note: Use - to replace . in parameter names (e.g., logfc-threshold, not logfc.threshold). In TOML, an unquoted dotted key is read as a path into nested tables.
Significant Markers Filter (for Enrichment)
[ClusterMarkers.envs]
# Filter markers for enrichment analysis (R expression)
sigmarkers = "p_val_adj < 0.05 & avg_log2FC > 0"  # Default
# Variables available: p_val, avg_log2FC, pct.1, pct.2, p_val_adj
# Example: "p_val_adj < 0.05 & abs(avg_log2FC) > 1" (both directions)
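To make the two filter expressions concrete, here is a small Python sketch that applies the same logic to a made-up marker table. The gene names and values are hypothetical, and the real filter is an R expression evaluated by the pipeline, so this only illustrates which rows each expression would keep.

```python
# Hypothetical marker rows; column names follow the variables listed above.
markers = [
    {"gene": "GeneA", "p_val_adj": 0.001, "avg_log2FC": 1.8},
    {"gene": "GeneB", "p_val_adj": 0.030, "avg_log2FC": -1.4},
    {"gene": "GeneC", "p_val_adj": 0.200, "avg_log2FC": 2.1},
    {"gene": "GeneD", "p_val_adj": 0.010, "avg_log2FC": 0.3},
]


def default_filter(row):
    # Mirrors "p_val_adj < 0.05 & avg_log2FC > 0" (upregulated only)
    return row["p_val_adj"] < 0.05 and row["avg_log2FC"] > 0


def both_directions_filter(row):
    # Mirrors "p_val_adj < 0.05 & abs(avg_log2FC) > 1" (up or down)
    return row["p_val_adj"] < 0.05 and abs(row["avg_log2FC"]) > 1


up_only = [r["gene"] for r in markers if default_filter(r)]
both = [r["gene"] for r in markers if both_directions_filter(r)]

print(up_only)  # ['GeneA', 'GeneD']
print(both)     # ['GeneA', 'GeneB']
```

The default expression keeps weakly upregulated genes such as GeneD, while the abs() form drops them but admits strongly downregulated genes such as GeneB.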
Enrichment Analysis
[ClusterMarkers.envs]
# Databases for pathway enrichment
dbs = ["KEGG_2021_Human", "MSigDB_Hallmark_2020"]  # Default
# Enrichment style
enrich_style = "enrichr"  # Options: "enrichr", "clusterprofiler", "clusterProfiler"
Available databases (enrichit):
- "KEGG_2021_Human", "KEGG": KEGG pathways
- "MSigDB_Hallmark_2020", "Hallmark": MSigDB Hallmark gene sets
- "GO_Biological_Process_2025": Gene Ontology Biological Process
- "GO_Cellular_Component_2025": Gene Ontology Cellular Component
- "GO_Molecular_Function_2025": Gene Ontology Molecular Function
- "Reactome_Pathways_2024", "Reactome": Reactome pathways
- "WikiPathways_2024_Human", "WikiPathways": WikiPathways
- "BioCarta_2016": BioCarta pathways
More databases: https://maayanlab.cloud/Enrichr/#libraries
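As background on what an enrichment p-value measures: over-representation tools such as Enrichr score the overlap between a marker list and a gene set with a Fisher exact test, whose one-sided form is a hypergeometric tail probability. The sketch below computes that tail in plain Python. It is not the enrichit or Enrichr implementation, and the universe size and counts are made-up illustration values.

```python
from math import comb


def overrepresentation_p(universe, set_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn from `universe` genes,
    of which set_size belong to the gene set (hypergeometric upper tail)."""
    max_overlap = min(set_size, list_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 100 significant markers, a 200-gene pathway,
# 10 genes in common, 20000 genes in the background.
p = overrepresentation_p(universe=20000, set_size=200, list_size=100, overlap=10)
print(f"{p:.3e}")
```

A stricter sigmarkers filter shrinks list_size, which is one reason very short marker lists can return no significant terms.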
Visualization Parameters
[ClusterMarkers.envs]
# Marker plots configuration
marker_plots_defaults = {order_by = "desc(avg_log2FC)"}
# All markers plots (across clusters); TOML inline tables use = rather than :
allmarker_plots = {"Top 10 markers of all clusters" = {plot_type = "heatmap"}}
# Enrichment plots (all clusters)
allenrich_plots = {}  # Empty by default
# Marker plots (per cluster)
marker_plots = {}  # Default: volcano plots and dot plots
# Enrichment plots (per cluster)
enrich_plots = {}  # Default: bar plot
# Overlap analysis (venn/upset)
overlaps = {}  # Empty by default
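The "Top 10 markers of all clusters" heatmap relies on ranking markers by descending avg_log2FC within each cluster. A minimal Python sketch of that selection step follows, using a hypothetical marker table. The actual ranking is done by the plotting code in R, so this only illustrates the ordering rule.

```python
from collections import defaultdict

# Hypothetical (cluster, gene, avg_log2FC) rows.
rows = [
    ("c1", "GeneA", 2.5),
    ("c1", "GeneB", 0.8),
    ("c1", "GeneC", 1.9),
    ("c2", "GeneD", 1.1),
    ("c2", "GeneE", 3.0),
]


def top_markers(rows, n):
    """Return the n genes with the highest avg_log2FC for each cluster."""
    by_cluster = defaultdict(list)
    for cluster, gene, log2fc in rows:
        by_cluster[cluster].append((gene, log2fc))
    return {
        cluster: [
            gene
            for gene, _ in sorted(genes, key=lambda item: item[1], reverse=True)[:n]
        ]
        for cluster, genes in by_cluster.items()
    }


top2 = top_markers(rows, n=2)
print(top2)  # {'c1': ['GeneA', 'GeneC'], 'c2': ['GeneE', 'GeneD']}
```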
External References
Seurat FindMarkers
https://satijalab.org/seurat/reference/findmarkers
- Core differential expression function
- Statistical tests: wilcox, MAST, DESeq2, ROC, t-test, etc.
- Threshold parameters control sensitivity and speed
Enrichr Databases
https://maayanlab.cloud/Enrichr/#libraries
- Comprehensive gene set enrichment collection
- KEGG, GO, Reactome, MSigDB, WikiPathways
biopipen MarkersFinder
https://pwwang.github.io/biopipen/api/biopipen.ns.scrna/#biopipen.ns.scrna.MarkersFinder
- Parent process with extended functionality
- Visualization: biopipen.utils::VizDEGs, scplotter::EnrichmentPlot
Configuration Examples
Minimal Configuration
[ClusterMarkers]

[ClusterMarkers.in]
srtobj = ["SeuratClustering"]

Result: Default wilcox test, standard thresholds, Hallmark + KEGG enrichment
Standard Marker Finding (Wilcoxon)
[ClusterMarkers]

[ClusterMarkers.in]
srtobj = ["SeuratClustering"]

[ClusterMarkers.envs]
test.use = "wilcox"
logfc.threshold = 0.25
min.pct = 0.1
sigmarkers = "p_val_adj < 0.05 & avg_log2FC > 0"
Publication-Ready MAST Analysis
[ClusterMarkers]

[ClusterMarkers.in]
srtobj = ["SeuratClustering"]

[ClusterMarkers.envs]
test.use = "MAST"
logfc.threshold = 0.25
min.pct = 0.1
sigmarkers = "p_val_adj < 0.01 & abs(avg_log2FC) > 1"
ncores = 4
DESeq2 for Robust Analysis
[ClusterMarkers]

[ClusterMarkers.in]
srtobj = ["SeuratClustering"]

[ClusterMarkers.envs]
test.use = "DESeq2"
logfc.threshold = 0.5  # More stringent
min.pct = 0.15
sigmarkers = "p_val_adj < 0.05 & avg_log2FC > 0.5"
Note: DESeq2 requires count data in the Seurat object
Stringent Thresholds for High-Confidence Markers
[ClusterMarkers.envs]
logfc.threshold = 0.58  # About 1.5-fold change (2^0.58)
min.pct = 0.25  # Detected in at least 25% of cells in either group
min.diff.pct = 0.1  # 10% difference in detection
only.pos = true  # Positive markers only
sigmarkers = "p_val_adj < 0.01 & avg_log2FC > 1"
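Because the thresholds are on a log2 scale, it can help to convert them back to plain fold changes when choosing values. The sketch below is simple arithmetic with hypothetical helper names and is not part of the pipeline.

```python
import math


def fold_change(log2fc):
    """Convert a log2 fold change into a linear fold change."""
    return 2 ** log2fc


def log2_threshold(fold):
    """Convert a desired linear fold change into a log2 threshold."""
    return math.log2(fold)


for lfc in (0.25, 0.58, 1.0):
    print(f"log2FC {lfc} -> {fold_change(lfc):.2f}-fold")

print(f"1.5-fold -> log2FC {log2_threshold(1.5):.3f}")
```

So 0.25 corresponds to roughly a 19% increase, 0.58 to about 1.5-fold, and 1.0 to exactly 2-fold.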
Subset Specific Clusters
[ClusterMarkers.envs]
# Only analyze clusters c1, c2, c3 to save computation
subset = "seurat_clusters %in% c('c1', 'c2', 'c3')"
Custom Enrichment Databases
[ClusterMarkers.envs]
# Use different pathway databases
dbs = ["Reactome_Pathways_2024", "GO_Biological_Process_2025"]
enrich_style = "clusterprofiler"
Positive Markers Only (Cluster-Specific)
[ClusterMarkers.envs]
only.pos = true
sigmarkers = "p_val_adj < 0.05 & avg_log2FC > 0"
Downsample Large Clusters
[ClusterMarkers.envs]
max.cells.per.ident = 5000  # Limit to 5000 cells per cluster
random.seed = 42  # Reproducible downsampling
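The idea behind pairing a per-cluster cap with a fixed seed can be sketched in a few lines of Python. This is an illustration only: Seurat does its own sampling in R, and the function and cell names here are hypothetical.

```python
import random


def downsample(cells_by_cluster, max_cells, seed):
    """Cap each cluster at max_cells, sampling with a seeded generator."""
    rng = random.Random(seed)
    sampled = {}
    # Sort cluster names so the order of random draws is deterministic.
    for cluster in sorted(cells_by_cluster):
        cells = cells_by_cluster[cluster]
        if len(cells) <= max_cells:
            sampled[cluster] = list(cells)  # small clusters are kept whole
        else:
            sampled[cluster] = rng.sample(cells, max_cells)
    return sampled


clusters = {
    "c1": [f"c1_cell{i}" for i in range(12)],
    "c2": [f"c2_cell{i}" for i in range(3)],
}

first = downsample(clusters, max_cells=5, seed=42)
second = downsample(clusters, max_cells=5, seed=42)
print(first == second)   # True
print(len(first["c1"]))  # 5
print(len(first["c2"]))  # 3
```

With the same seed, repeated runs select the same cells, so marker lists do not change between runs purely because of sampling.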
Common Patterns
Pattern 1: Quick Wilcoxon Test (Default)
[ClusterMarkers]

[ClusterMarkers.in]
srtobj = ["SeuratClustering"]
Use case: Initial exploration, speed priority
Pattern 2: Publication-Quality MAST
[ClusterMarkers]

[ClusterMarkers.in]
srtobj = ["SeuratClustering"]

[ClusterMarkers.envs]
test.use = "MAST"
logfc.threshold = 0.25
min.pct = 0.1
ncores = 8
Use case: Single-cell publication, accounts for detection rate
Pattern 3: Both Positive and Negative Markers
[ClusterMarkers.envs]
only.pos = false
sigmarkers = "p_val_adj < 0.05 & abs(avg_log2FC) > 0.5"
Use case: Find genes upregulated and downregulated in each cluster
Pattern 4: Stringent Top Markers
[ClusterMarkers.envs]
logfc.threshold = 1.0  # 2-fold change
min.pct = 0.3
sigmarkers = "p_val_adj < 0.001 & avg_log2FC > 1"
only.pos = true
Use case: High-confidence cluster markers for annotation
Pattern 5: Custom Enrichment with Multiple DBs
[ClusterMarkers.envs]
dbs = [
    "KEGG_2021_Human",
    "MSigDB_Hallmark_2020",
    "GO_Biological_Process_2025",
    "Reactome_Pathways_2024"
]
enrich_style = "enrichr"
Pattern 6: ROC Analysis for Classification
[ClusterMarkers.envs]
test.use = "roc"
logfc.threshold = 0.1
sigmarkers = "p_val_adj < 0.05 & avg_log2FC > 0"
Use case: Find markers with highest AUC for classification
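For intuition about what AUC-based ranking measures, the sketch below computes the AUC of a single gene as the probability that a randomly chosen in-cluster cell expresses it more highly than a randomly chosen cell from outside the cluster. The expression values are made up, and this pairwise loop is for illustration rather than efficiency. The power formula follows the Seurat documentation, abs(AUC - 0.5) * 2. Seurat's roc test reports AUC and power rather than p-values, so it is worth checking any sigmarkers expression against the columns the pipeline actually returns for this test.

```python
def auc(in_cluster, other):
    """Fraction of (in, out) cell pairs where the in-cluster value is higher.
    Ties count as half a win."""
    wins = 0.0
    for x in in_cluster:
        for y in other:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(in_cluster) * len(other))


def power(auc_value):
    """Predictive power as described for Seurat's roc test."""
    return abs(auc_value - 0.5) * 2


# Hypothetical expression values for one gene.
perfect = auc([3.0, 4.0, 5.0], [0.0, 1.0, 2.0])
uninformative = auc([1.0, 2.0], [1.0, 2.0])

print(perfect, power(perfect))              # 1.0 1.0
print(uninformative, power(uninformative))  # 0.5 0.0
```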
Dependencies
Upstream Processes
- Required: SeuratClustering (provides cluster assignments)
- Alternative: SeuratSubClustering (if sub-clustering analysis)
- Context: Runs after TOrBCellSelection if T/B cell selection is enabled
Downstream Processes
- CellTypeAnnotation: Uses markers for automated cell type assignment
- SeuratMap2Ref: Reference-based annotation may use marker profiles
- ScFGSEA: Gene set enrichment on identified markers
- ModuleScoreCalculator: Score marker genes across cells
Validation Rules
Statistical Test Constraints
- test.use must be one of: wilcox, wilcox_limma, MAST, DESeq2, roc, t, tobit, bimod, poisson, negbinom, LR
- DESeq2 requires count data (automatically uses counts slot)
- MAST, poisson, negbinom, and LR support latent.vars for additional covariates
Threshold Validation
- logfc.threshold: ≥ 0 (typical range: 0.1-1.0)
- min.pct: 0.0-1.0 (typical: 0.1-0.3)
- min.diff.pct: ≥ -Inf (typical: 0.05-0.2)
- min.cells.feature: ≥ 1 (default: 3)
- min.cells.group: ≥ 1 (default: 3)
sigmarkers Expression
- Must be valid R/dplyr expression
- Available variables: p_val, avg_log2FC, pct.1, pct.2, p_val_adj
- Use & for AND, | for OR, ! for NOT
Database Constraints
- dbs must be valid enrichit database names or GMT file paths
- Custom GMT files: use absolute paths or paths relative to config file
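The statistical-test and threshold rules above can be checked before a run with a few lines of Python. The sketch below is a hypothetical pre-flight helper, not part of the pipeline: it assumes the settings have already been loaded into a flat dictionary keyed by parameter name, and it covers only the test and threshold rules.

```python
ALLOWED_TESTS = {
    "wilcox", "wilcox_limma", "MAST", "DESeq2", "roc", "t",
    "tobit", "bimod", "poisson", "negbinom", "LR",
}


def validate(envs):
    """Return a list of human-readable problems; empty means all rules pass."""
    errors = []
    test = envs.get("test.use", "wilcox")
    if test not in ALLOWED_TESTS:
        errors.append(f"test.use: unknown test {test!r}")
    if envs.get("logfc.threshold", 0.25) < 0:
        errors.append("logfc.threshold must be >= 0")
    if not 0.0 <= envs.get("min.pct", 0.1) <= 1.0:
        errors.append("min.pct must be between 0.0 and 1.0")
    for key in ("min.cells.feature", "min.cells.group"):
        if envs.get(key, 3) < 1:
            errors.append(f"{key} must be >= 1")
    return errors


good = validate({"test.use": "MAST", "logfc.threshold": 0.25, "min.pct": 0.1})
bad = validate({"test.use": "anova", "min.pct": 1.5, "min.cells.group": 0})

print(good)  # []
for problem in bad:
    print(problem)
```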
Troubleshooting
Issue: Too Many Markers Found
Symptoms: Thousands of markers per cluster, making cluster-specific signals hard to pick out
Solutions:
[ClusterMarkers.envs]
logfc.threshold = 0.5  # Increase fold change threshold
min.pct = 0.25  # Increase expression percentage
min.diff.pct = 0.15  # Increase detection difference
sigmarkers = "p_val_adj < 0.01 & avg_log2FC > 1"  # Stricter filter
Issue: No Markers Found
Symptoms: Empty marker tables, no enrichment results
Solutions:
[ClusterMarkers.envs]
logfc.threshold = 0.1  # Lower threshold
min.pct = 0.05  # Lower expression requirement
min.diff.pct = -inf  # Remove detection difference (TOML spells infinity in lowercase)
sigmarkers = "p_val_adj < 0.1 & avg_log2FC > 0.1"  # Looser filter
Issue: Slow Performance
Symptoms: Marker finding takes hours
Solutions:
[ClusterMarkers.envs]
ncores = 8  # Use more cores
logfc.threshold = 0.5  # Higher threshold reduces genes tested
max.cells.per.ident = 5000  # Downsample large clusters
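Stricter thresholds save time because genes are screened before any statistical test is run. The Python sketch below illustrates the detection-rate part of that screen as described in the Seurat documentation: a gene is kept if it is detected in at least min.pct of cells in either group, and if the detection rates differ by at least min.diff.pct. The gene names and rates are hypothetical, and this is not Seurat's code.

```python
def passes_prefilter(pct_1, pct_2, min_pct=0.1, min_diff_pct=float("-inf")):
    """Return True if a gene would go on to the statistical test."""
    higher, lower = max(pct_1, pct_2), min(pct_1, pct_2)
    return higher >= min_pct and (higher - lower) >= min_diff_pct


# Hypothetical detection rates (pct.1, pct.2) per gene.
genes = {
    "GeneA": (0.60, 0.10),
    "GeneB": (0.08, 0.02),
    "GeneC": (0.30, 0.28),
    "GeneD": (0.26, 0.05),
}

default_kept = [g for g, (p1, p2) in genes.items() if passes_prefilter(p1, p2)]
strict_kept = [
    g
    for g, (p1, p2) in genes.items()
    if passes_prefilter(p1, p2, min_pct=0.25, min_diff_pct=0.15)
]

print(default_kept)  # ['GeneA', 'GeneC', 'GeneD']
print(strict_kept)   # ['GeneA', 'GeneD']
```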
Issue: DESeq2 Fails with Integrated Data
Symptoms: DESeq2 error on integrated Seurat object
Cause: DESeq2 requires count data, but integrated objects have an empty counts slot
Solution:
# Use SCTransform counts instead of integrated data
[SeuratPreparing.envs]
method = "SCTransform"
integration_method = null  # Skip integration for DESeq2

[ClusterMarkers.envs]
test.use = "DESeq2"
Alternative: Use MAST or wilcox on integrated data
Issue: Enrichment Analysis Returns No Results
Symptoms: Empty enrichment tables/plots
Solutions:
[ClusterMarkers.envs]
# Check whether the sigmarkers filter is too strict
sigmarkers = "p_val_adj < 0.1 & avg_log2FC > 0"
# Add more databases
dbs = ["KEGG_2021_Human", "MSigDB_Hallmark_2020", "Reactome_Pathways_2024"]
Issue: NA p-values in Results
Symptoms: Some markers have NA p-values
Cause: Insufficient cells per group or low expression variance
Solutions:
[ClusterMarkers.envs]
min.cells.group = 10  # Increase minimum cells
min.cells.feature = 5  # Increase minimum expressing cells
Issue: Different Test Methods Return Similar Results
Symptoms: wilcox and MAST return nearly identical gene lists
Cause: Strong markers are robust across methods
Solution: Use ROC analysis for alternative ranking:
[ClusterMarkers.envs]
test.use = "roc"
Issue: Computationally Expensive Enrichment
Symptoms: Enrichment step takes very long
Solutions:
[ClusterMarkers.envs]
# Limit markers for enrichment
sigmarkers = "p_val_adj < 0.01 & avg_log2FC > 1"
# Use fewer databases
dbs = ["MSigDB_Hallmark_2020"]
# Subset clusters for analysis
subset = "seurat_clusters %in% c('c1', 'c2')"
Best Practices
- Start with default wilcox test for initial exploration
- Use MAST for publications (single-cell-specific modeling)
- Set appropriate thresholds: logfc.threshold = 0.25-0.5, min.pct = 0.1-0.2
- Filter for enrichment: Use sigmarkers to limit to high-confidence markers
- Customize enrichment databases: Choose databases relevant to your study
- Use only.pos = false to see both upregulated and downregulated genes
- Parallelize with ncores for large datasets
- Subset clusters when analyzing many clusters to save computation
- Validate markers: Check expression patterns in visualization
- Reproducibility: Set random.seed for downsampling
Related Processes
- ClusterMarkersOfAllCells: Marker finding before T/B cell selection
- MarkersFinder: Extended parent process with more flexibility
- TopExpressingGenes: Top expressed genes per cluster (non-DE)
- SeuratClustering: Required upstream process for cluster assignments
- CellTypeAnnotation: Uses markers for automated annotation