OpenClaw-Medical-Skills tooluniverse-single-cell
Production-ready single-cell and expression matrix analysis using scanpy, anndata, and scipy. Performs scRNA-seq QC, normalization, PCA, UMAP, Leiden/Louvain clustering, differential expression (Wilcoxon, t-test, DESeq2), cell type annotation, per-cell-type statistical analysis, gene-expression correlation, batch correction (Harmony), trajectory inference, and cell-cell communication analysis. NEW: Analyzes ligand-receptor interactions between cell types using OmniPath (CellPhoneDB, CellChatDB), scores communication strength, identifies signaling cascades, and handles multi-subunit receptor complexes. Integrates with ToolUniverse gene annotation tools (HPA, Ensembl, MyGene, UniProt) and enrichment tools (gseapy, PANTHER, STRING). Supports h5ad, 10X, CSV/TSV count matrices, and pre-annotated datasets. Use when analyzing single-cell RNA-seq data, studying cell-cell interactions, performing cell type differential expression, computing gene-expression correlations by cell type, analyzing tumor-immune communication, or answering questions about scRNA-seq datasets.
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/tooluniverse-single-cell" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-tooluniverse-single-cell && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/tooluniverse-single-cell" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-tooluniverse-single-cell && rm -rf "$T"
skills/tooluniverse-single-cell/SKILL.md- pip install
Single-Cell Genomics and Expression Matrix Analysis
Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse. Designed for both full scRNA-seq workflows (raw counts to annotated cell types) and targeted expression-level analyses (per-cell-type DE, correlation, ANOVA, clustering).
IMPORTANT: This skill handles complex multi-workflow analysis. Most implementation details have been moved to
references/ for progressive disclosure. This document focuses on high-level decision-making and workflow orchestration.
When to Use This Skill
Apply when users:
- Have scRNA-seq data (h5ad, 10X, CSV count matrices) and want analysis
- Ask about cell type identification, clustering, or annotation
- Need differential expression analysis by cell type or condition
- Want gene-expression correlation analysis (e.g., gene length vs expression by cell type)
- Ask about PCA, UMAP, t-SNE for expression data
- Need Leiden/Louvain clustering on expression matrices
- Want statistical comparisons between cell types (t-test, ANOVA, fold change)
- Ask about marker genes for cell populations
- Need batch correction (Harmony, combat)
- Want trajectory or pseudotime analysis
- Ask about cell-cell communication (ligand-receptor interactions)
- Questions mention "single-cell", "scRNA-seq", "cell type", "h5ad"
- Questions involve immune cell types (CD4, CD8, CD14, CD19, monocytes, etc.)
BixBench Coverage: 18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36)
NOT for (use other skills instead):
- Bulk RNA-seq DESeq2 analysis only → Use
tooluniverse-rnaseq-deseq2 - Gene enrichment only (no expression data) → Use
tooluniverse-gene-enrichment - VCF/variant analysis → Use
tooluniverse-variant-analysis - Statistical modeling (regression, survival) → Use
tooluniverse-statistical-modeling
Core Principles
- Data-first approach - Load, inspect, and validate data before any analysis
- AnnData-centric - All data flows through anndata objects for consistency
- Cell type awareness - Many questions require per-cell-type subsetting and analysis
- Statistical rigor - Proper normalization, multiple testing correction, effect sizes
- Scanpy standard pipeline - Follow established best practices for scRNA-seq
- Flexible input - Handle h5ad, 10X, CSV/TSV, pre-processed and raw data
- Question-driven - Parse what the user is actually asking and extract the specific answer
- Enrichment integration - Chain DE results into GO/KEGG/Reactome enrichment when requested
- Large dataset support - Efficient handling of datasets with >100k cells
Required Python Packages
# Core (MUST be installed) import scanpy as sc import anndata as ad import pandas as pd import numpy as np from scipy import stats from scipy.cluster.hierarchy import linkage, fcluster, dendrogram from scipy.spatial.distance import pdist from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler from statsmodels.stats.multitest import multipletests # Enrichment (for GO/KEGG/Reactome follow-up) import gseapy as gp # Optional import harmonypy # batch correction
Installation:
pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels
High-Level Workflow Decision Tree
START: User question about scRNA-seq data │ ├─ Q1: What type of analysis is needed? │ │ │ ├─ FULL PIPELINE (raw counts → annotated clusters) │ │ └─ Workflow: QC → Normalize → HVG → PCA → Cluster → Annotate → DE │ │ See: references/scanpy_workflow.md │ │ │ ├─ DIFFERENTIAL EXPRESSION (per-cell-type comparison) │ │ └─ Workflow: Load → Normalize → Per-CT DE → Report │ │ Pattern: Most common BixBench pattern (bix-33) │ │ See: Section "Per-Cell-Type Differential Expression" below │ │ │ ├─ CORRELATION ANALYSIS (gene property vs expression) │ │ └─ Workflow: Load → Filter genes → Compute correlation │ │ Pattern: Gene length vs expression (bix-22) │ │ See: Section "Statistical Analysis on Expression Data" below │ │ │ ├─ CLUSTERING & PCA (expression matrix analysis) │ │ └─ Workflow: Load → Transform → PCA/Cluster → Report │ │ See: references/clustering_guide.md │ │ │ ├─ CELL COMMUNICATION (ligand-receptor interactions) │ │ └─ Workflow: Load → Get L-R pairs → Score → Identify signaling │ │ See: references/cell_communication.md (DETAILED) │ │ │ └─ TRAJECTORY ANALYSIS (pseudotime) │ └─ Workflow: Load → Normalize → Trajectory → Pseudotime │ See: references/trajectory_analysis.md │ ├─ Q2: What data format is available? │ ├─ h5ad file → sc.read_h5ad() → Check contents (counts, metadata, clusters) │ ├─ 10X files → sc.read_10x_mtx() or sc.read_10x_h5() │ ├─ CSV/TSV → pd.read_csv() → Convert to AnnData (check orientation!) │ └─ Other → See: references/scanpy_workflow.md "Data Loading" │ └─ Q3: Are there pre-computed results to use? ├─ Has cell type annotations → Skip clustering, go to analysis ├─ Has PCA/UMAP → Skip dimensionality reduction ├─ Has DE results → Skip DE, analyze results └─ Raw counts only → Full pipeline needed
Common Analysis Patterns (BixBench)
Pattern 1: Per-Cell-Type Differential Expression
Question: "Which immune cell type has the most DEGs after treatment?"
Workflow:
import scanpy as sc # Load and normalize adata = sc.read_h5ad("data.h5ad") sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) # Per-cell-type DE cell_types = adata.obs['cell_type'].unique() de_results = {} for ct in cell_types: adata_ct = adata[adata.obs['cell_type'] == ct].copy() # Check sufficient cells n_treat = (adata_ct.obs['condition'] == 'treatment').sum() n_ctrl = (adata_ct.obs['condition'] == 'control').sum() if n_treat < 3 or n_ctrl < 3: continue # Run DE sc.tl.rank_genes_groups(adata_ct, groupby='condition', groups=['treatment'], reference='control', method='wilcoxon') df = sc.get.rank_genes_groups_df(adata_ct, group='treatment') # Count significant sig = df[df['pvals_adj'] < 0.05] de_results[ct] = {'n_sig': len(sig), 'results': df} print(f"{ct}: {len(sig)} DEGs") # Answer: Which has most? top_ct = max(de_results, key=lambda x: de_results[x]['n_sig']) print(f"Answer: {top_ct} ({de_results[top_ct]['n_sig']} DEGs)")
BixBench: bix-33
See: references/scanpy_workflow.md "Differential Expression"
Pattern 2: Gene Property vs Expression Correlation
Question: "What is the Pearson correlation between gene length and expression in CD4 T cells?"
Workflow:
import scanpy as sc import pandas as pd import numpy as np from scipy import stats from scipy.sparse import issparse # Load data adata = sc.read_h5ad("data.h5ad") # Load gene annotations gene_info = pd.read_csv("gene_info.tsv", sep='\t', index_col=0) common = adata.var_names.intersection(gene_info.index) adata.var['gene_length'] = gene_info.loc[common, 'gene_length'].reindex(adata.var_names) adata.var['gene_type'] = gene_info.loc[common, 'gene_type'].reindex(adata.var_names) # Filter to protein-coding genes mask = adata.var['gene_type'] == 'protein_coding' adata_pc = adata[:, mask].copy() # Per-cell-type correlation cell_types = ['CD4 T cells', 'CD8 T cells', 'CD14 Monocytes'] # etc. for ct in cell_types: adata_ct = adata_pc[adata_pc.obs['cell_type'] == ct] # Mean expression per gene X = adata_ct.X.toarray() if issparse(adata_ct.X) else adata_ct.X mean_expr = np.mean(X, axis=0) gene_lengths = adata_ct.var['gene_length'].values # Remove NaN valid = ~np.isnan(gene_lengths) & ~np.isnan(mean_expr) # Pearson correlation r, p = stats.pearsonr(gene_lengths[valid], mean_expr[valid]) print(f"{ct}: r = {r:.6f}, p = {p:.2e}, n = {valid.sum()} genes")
BixBench: bix-22
See: SKILL_OLD.md "Phase 6: Statistical Analysis on Expression Data"
Pattern 3: PCA on Expression Matrix
Question: "What percentage of variance is explained by PC1 after log10 transform?"
Workflow:
import pandas as pd import numpy as np from sklearn.decomposition import PCA # Load expression matrix df = pd.read_csv("expression.csv", index_col=0) # Orient: samples as rows, genes as columns if df.shape[0] > df.shape[1] * 5: df = df.T # Genes were rows, transpose # Log10 transform with pseudocount X = np.log10(df.values + 1) # Run PCA n_components = min(X.shape[0], X.shape[1]) pca = PCA(n_components=n_components) pca.fit(X) # Variance explained print(f"PC1: {pca.explained_variance_ratio_[0]*100:.2f}% variance") print(f"PC1+PC2: {sum(pca.explained_variance_ratio_[:2])*100:.2f}%") print(f"Top 10 PCs: {sum(pca.explained_variance_ratio_[:10])*100:.2f}%")
BixBench: bix-27
See: references/clustering_guide.md "PCA Analysis"
Pattern 4: Statistical Comparison Between Cell Types
Question: "What is the t-statistic comparing LFCs between CD4/CD8 and other cell types?"
Workflow:
from scipy import stats # After running per-cell-type DE (Pattern 1): # Extract LFCs for different cell type groups # Group 1: CD4/CD8 cells cd4_lfc = de_results['CD4 T cells']['results']['log2FoldChange'].values cd8_lfc = de_results['CD8 T cells']['results']['log2FoldChange'].values cd4_cd8_lfc = np.concatenate([cd4_lfc, cd8_lfc]) # Group 2: Other cells other_lfc = [] for ct in ['CD14 Monocytes', 'NK cells', 'B cells']: other_lfc.append(de_results[ct]['results']['log2FoldChange'].values) other_lfc = np.concatenate(other_lfc) # Welch's t-test (unequal variances) t_stat, p_val = stats.ttest_ind(cd4_cd8_lfc, other_lfc, equal_var=False) print(f"t-statistic: {t_stat:.4f}") print(f"p-value: {p_val:.4e}")
BixBench: bix-31
See: SKILL_OLD.md "Phase 6.3: T-Tests Between Groups"
Pattern 5: ANOVA Across Cell Types
Question: "What is the F-statistic for miRNA expression across immune cell types?"
Workflow:
import pandas as pd from scipy import stats # Load miRNA expression df = pd.read_csv("mirna_expr.csv", index_col=0) meta = pd.read_csv("metadata.csv", index_col=0) # Exclude PBMCs meta_filtered = meta[meta['cell_type'] != 'PBMC'] df_filtered = df[meta_filtered.index] # Group by cell type cell_types = meta_filtered['cell_type'].unique() groups = {} for ct in cell_types: samples = meta_filtered[meta_filtered['cell_type'] == ct].index groups[ct] = df_filtered[samples].values.flatten() # One-way ANOVA f_stat, p_val = stats.f_oneway(*groups.values()) print(f"F-statistic: {f_stat:.4f}") print(f"p-value: {p_val:.4e}")
BixBench: bix-36
See: SKILL_OLD.md "Phase 6.4: ANOVA Across Groups"
Pattern 6: Cell-Cell Communication Analysis
Question: "Which ligand-receptor interactions are strongest between tumor and T cells?"
Workflow:
from tooluniverse import ToolUniverse tu = ToolUniverse() tu.load_tools() # Step 1: Get ligand-receptor pairs from OmniPath result = tu.run_tool( "OmniPath_get_ligand_receptor_interactions", databases="CellPhoneDB,CellChatDB" ) lr_pairs = pd.DataFrame(result['data']['interactions']) # Step 2: Filter to expressed pairs # (genes present in dataset, mean expression > 0.05) expressed_lr = lr_pairs[ lr_pairs['source_genesymbol'].isin(adata.var_names) & lr_pairs['target_genesymbol'].isin(adata.var_names) ] # Step 3: Score communication between cell types # (mean ligand expr in sender * mean receptor expr in receiver) communication_scores = score_cell_communication( adata, expressed_lr, cell_type_col='cell_type' ) # Step 4: Filter to tumor-T cell interactions tumor_tcell = communication_scores[ ((communication_scores['sender'] == 'Tumor') & (communication_scores['receiver'].str.contains('T cell'))) | ((communication_scores['receiver'] == 'Tumor') & (communication_scores['sender'].str.contains('T cell'))) ] # Step 5: Top interactions top_interactions = tumor_tcell.nlargest(20, 'score') print(top_interactions[['sender', 'receiver', 'ligand', 'receptor', 'score']])
See: references/cell_communication.md (COMPLETE workflow with all helper functions)
Scanpy vs Seurat Equivalents
For users familiar with Seurat (R):
| Operation | Seurat (R) | Scanpy (Python) |
|---|---|---|
| Load data | | |
| Normalize | | |
| Find HVGs | | |
| Scale | | |
| PCA | | |
| Neighbors | | |
| Cluster | | or |
| UMAP | | |
| Find markers | | |
| DE test | | |
| Batch correction | | |
See: references/seurat_workflow.md for complete Seurat → Scanpy translation
When to Use ToolUniverse Tools
Gene Annotation and Validation
- HPA_search_genes_by_query: Search for cell-type marker genes
- MyGene_query_genes / MyGene_batch_query: Gene ID conversion, gene info
- ensembl_lookup_gene: Get Ensembl gene details
- UniProt_get_function_by_accession: Protein function lookup
Cell-Cell Communication (NEW)
- OmniPath_get_ligand_receptor_interactions: Get validated L-R pairs (CellPhoneDB, CellChatDB)
- OmniPath_get_signaling_interactions: Downstream signaling cascades
- OmniPath_get_complexes: Multi-subunit receptor composition
- OmniPath_get_cell_communication_annotations: Pathway categories
Enrichment Analysis (Post-DE)
- PANTHER_enrichment: GO enrichment (BP, MF, CC) with curation
- STRING_functional_enrichment: Network-based enrichment
- ReactomeAnalysis_pathway_enrichment: Curated Reactome pathways
See: references/cell_communication.md for complete OmniPath integration examples
Data Loading Best Practices
Critical: Matrix Orientation
AnnData expects: cells/samples as rows (obs), genes as columns (var)
import scanpy as sc import pandas as pd import anndata as ad # Load h5ad (already oriented) adata = sc.read_h5ad("data.h5ad") # Load CSV/TSV (check orientation!) df = pd.read_csv("counts.csv", index_col=0) # Heuristic: If genes > samples by 5x, transpose if df.shape[0] > df.shape[1] * 5: print("Transposing: genes were rows") df = df.T adata = ad.AnnData(df)
Load Metadata
meta = pd.read_csv("metadata.csv", index_col=0) # Align indices common = adata.obs_names.intersection(meta.index) adata = adata[common].copy() for col in meta.columns: adata.obs[col] = meta.loc[common, col]
See: references/scanpy_workflow.md "Phase 1: Data Loading"
Quality Control Checklist
# QC metrics adata.var['mt'] = adata.var_names.str.startswith(('MT-', 'mt-')) sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True) # Filter cells sc.pp.filter_cells(adata, min_genes=200) # Min genes per cell adata = adata[adata.obs['pct_counts_mt'] < 20].copy() # Max mito % # Filter genes sc.pp.filter_genes(adata, min_cells=3) # Min cells per gene print(f"After QC: {adata.n_obs} cells x {adata.n_vars} genes")
See: references/scanpy_workflow.md "Phase 2: Quality Control"
Differential Expression Decision Tree
Q: What type of DE analysis? Single-Cell DE (many cells per condition): ├─ Use: sc.tl.rank_genes_groups() ├─ Methods: wilcoxon (default), t-test, logreg ├─ Best for: Per-cell-type DE, marker gene finding └─ See: references/scanpy_workflow.md "Differential Expression" Pseudo-Bulk DE (aggregate counts by sample): ├─ Use: DESeq2 via PyDESeq2 ├─ Best for: Sample-level comparisons, replicates └─ See: SKILL_OLD.md "Phase 5.3: DESeq2-based DE" Statistical Tests Only: ├─ Use: scipy.stats (ttest_ind, f_oneway, pearsonr) ├─ Best for: Correlation, ANOVA, t-tests on summaries └─ See: "Statistical Analysis on Expression Data" below
Statistical Analysis on Expression Data
For BixBench questions requiring specific statistical tests:
Pearson/Spearman Correlation
from scipy import stats # Gene property vs expression r, p = stats.pearsonr(gene_lengths, mean_expression) r_s, p_s = stats.spearmanr(gene_lengths, mean_expression)
T-Tests
# Welch's t-test (unequal variance) t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False) # Student's t-test (equal variance) t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=True)
ANOVA
# One-way ANOVA across multiple groups f_stat, p_val = stats.f_oneway(group1, group2, group3, ...)
Multiple Testing Correction
from statsmodels.stats.multitest import multipletests # Benjamini-Hochberg (FDR) reject, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh') # Bonferroni reject, pvals_adj, _, _ = multipletests(pvals, method='bonferroni')
See: SKILL_OLD.md "Phase 6: Statistical Analysis on Expression Data" for complete examples
Marker Gene Identification
# Find marker genes for each cluster sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon') # Get results for cluster 0 markers = sc.get.rank_genes_groups_df(adata, group='0') top_markers = markers.head(10) # Annotate clusters using markers marker_dict = { 'T cells': ['CD3D', 'CD3E', 'CD8A'], 'B cells': ['CD19', 'MS4A1', 'CD79A'], 'Monocytes': ['CD14', 'LYZ', 'S100A9'], } # Score and assign cell types
See: references/marker_identification.md for complete workflow
Batch Correction with Harmony
import harmonypy # After PCA sc.tl.pca(adata, n_comps=50) # Run Harmony on PCA ho = harmonypy.run_harmony( adata.obsm['X_pca'][:, :30], adata.obs, 'batch', # batch column random_state=0 ) # Store corrected PCs adata.obsm['X_pca_harmony'] = ho.Z_corr.T # Re-cluster on corrected PCs sc.pp.neighbors(adata, use_rep='X_pca_harmony') sc.tl.leiden(adata, resolution=0.5) sc.tl.umap(adata)
See: references/scanpy_workflow.md "Batch Correction"
Report Generation
Always extract the specific answer to the user's question:
# Example: "Which cell type has the most DEGs?" report = f""" # Analysis Results ## Per-Cell-Type Differential Expression | Cell Type | Significant DEGs (padj < 0.05) | |-----------|-------------------------------| {chr(10).join([f"| {ct} | {res['n_sig']} |" for ct, res in de_results.items()])} ## Answer **{top_ct}** has the highest number of significantly differentially expressed genes with **{de_results[top_ct]['n_sig']} DEGs** (Wilcoxon test, BH-corrected p < 0.05). """
Troubleshooting Common Issues
| Issue | Solution |
|---|---|
| |
| Sparse matrix errors | Use : |
| Wrong matrix orientation | Check: more genes than samples? Transpose if needed |
| NaN in correlation | Filter: |
| Too few cells for DE | Need >= 3 cells per condition per cell type |
| Gene names don't match | Use MyGene for ID conversion |
| Memory error (large datasets) | Use to reduce features |
See: references/troubleshooting.md for detailed solutions
Reference Documentation
Core Workflows:
- references/scanpy_workflow.md - Complete scanpy pipeline (QC, normalize, PCA, cluster, DE)
- references/seurat_workflow.md - Seurat → Scanpy translation guide
- references/clustering_guide.md - Leiden, Louvain, hierarchical, bootstrap consensus
- references/marker_identification.md - Marker genes, cell type annotation
- references/trajectory_analysis.md - Pseudotime, trajectory inference
Advanced Topics:
- references/cell_communication.md - Complete OmniPath/CellPhoneDB workflow (L-R interactions, signaling)
- references/troubleshooting.md - Common errors, package issues, data format problems
Utility Scripts:
- scripts/qc_metrics.py - QC calculations, filtering thresholds
- scripts/normalize_data.py - Normalization methods
- scripts/find_markers.py - Marker gene identification
Complete Workflow Example
import scanpy as sc # 1. Load data adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5") # 2. QC adata.var['mt'] = adata.var_names.str.startswith('MT-') sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True) adata = adata[adata.obs['pct_counts_mt'] < 20].copy() sc.pp.filter_cells(adata, min_genes=200) sc.pp.filter_genes(adata, min_cells=3) # 3. Normalize sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) adata.raw = adata.copy() # 4. HVG + PCA sc.pp.highly_variable_genes(adata, n_top_genes=2000) sc.tl.pca(adata, n_comps=50) # 5. Cluster sc.pp.neighbors(adata, n_pcs=30) sc.tl.leiden(adata, resolution=0.5) sc.tl.umap(adata) # 6. Find markers sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon') markers = sc.get.rank_genes_groups_df(adata, group='0') # 7. Annotate (manual or automatic) # 8. Per-cell-type DE (if conditions present) # 9. Cell communication analysis (if needed)
See: references/scanpy_workflow.md for detailed explanations of each step
Summary
This skill provides:
- Complete scRNA-seq pipeline (QC → clustering → annotation → DE)
- Per-cell-type differential expression with multiple methods
- Gene property correlation analysis by cell type
- Statistical comparisons (t-test, ANOVA, correlation)
- Expression matrix clustering (hierarchical, bootstrap consensus, PCA)
- Cell-cell communication analysis (OmniPath, CellPhoneDB, CellChatDB)
- Batch correction (Harmony, ComBat)
- Trajectory inference and pseudotime
- Integration with ToolUniverse for annotation and enrichment
BixBench Coverage: 18+ questions across 5 projects (bix-22, bix-27, bix-31, bix-33, bix-36)
For detailed workflows, see references/ directory.