LLMs-Universal-Life-Science-and-Clinical-Skills- sc-grn
install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Transcriptomics/sc-grn" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-sc-grn && rm -rf "$T"
manifest:
Skills/Transcriptomics/sc-grn/SKILL.mdsource content
🕸️ Single-Cell Gene Regulatory Network Inference
You are SC GRN, a specialised OmicsClaw agent for inferring gene regulatory networks from single-cell expression data.
Why This Exists
- Without it: GRN inference requires running multiple tools (GRNBoost2, RcisTarget, AUCell) and stitching outputs together
- With it: Unified GRN pipeline with automatic method selection
- Why OmicsClaw: Standardised TF→target network for downstream regulon analysis
Core Capabilities
- pySCENIC / GRNBoost2: Gradient boosting for TF-target co-expression inference
- cisTarget pruning: Filter by cis-regulatory motif enrichment near target gene promoters
- AUCell scoring: Score regulon activity per cell using area under the recovery curve
- Correlation fallback: Pearson correlation-based TF-target scoring when SCENIC is not available
- Network output: CSV adjacency list (TF, target, importance) + regulon activity matrix
Pipeline Overview
| Step | Tool | Description |
|---|---|---|
| 1. GRN inference | GRNBoost2 | Co-expression modules between TFs and targets |
| 2. Regulon pruning | cisTarget | Filter by cis-regulatory motif enrichment |
| 3. Activity scoring | AUCell | Score regulon activity per cell |
Workflow
- Score: Run boosting matrices for target coexpression logic.
- Prune: Motif filtering from compiled cis-regulatory structures.
- Threshold: Score regulon values per single-cell identity.
- Visualise: Render heatmap and dimensionality profiles.
- Report: Tabulate key defining transcriptional factors.
CLI Reference
python skills/singlecell/grn/sc_grn.py \ --input <processed.h5ad> --output <dir> python omicsclaw.py run sc-grn --demo
Algorithm / Methodology
Step 1: GRN Inference with GRNBoost2
Using arboreto_with_multiprocessing.py (Recommended)
Native Arboreto is broken with dask >= 2.0. Use the bundled multiprocessing script:
python arboreto_with_multiprocessing.py \ filtered.loom \ allTFs_hg38.txt \ --method grnboost2 \ --output adj.tsv \ --num_workers 8 \ --seed 42
Python API (if dask < 2.0)
from arboreto.algo import grnboost2 import pandas as pd adjacencies = grnboost2(expr_matrix, tf_names=tf_names, verbose=True) adjacencies.to_csv('adj.tsv', sep='\t', index=False)
Step 2: Regulon Pruning with cisTarget
Goal: Filter raw co-expression modules to retain only TF-target links supported by cis-regulatory motif enrichment near target gene promoters.
from pyscenic.prune import prune2df, df2regulons from ctxcore.rnkdb import FeatherRankingDatabase import glob import pickle # Load ranking databases db_fnames = glob.glob('*.genes_vs_motifs.rankings.feather') dbs = [FeatherRankingDatabase(fname) for fname in db_fnames] # Load motif annotations motif_annotations_fname = 'motifs-v9-nr.hgnc-m0.001-o0.0.tbl' adjacencies = pd.read_csv('adj.tsv', sep='\t') # Prune: only keep TF-target links supported by cis-regulatory motifs df = prune2df(dbs, adjacencies, motif_annotations_fname) regulons = df2regulons(df) with open('regulons.pkl', 'wb') as f: pickle.dump(regulons, f) print(f'Found {len(regulons)} regulons') for reg in sorted(regulons, key=lambda r: -len(r))[:10]: print(f' {reg.name}: {len(reg)} targets')
Step 3: AUCell Activity Scoring
Goal: Score the activity of each regulon in every individual cell.
from pyscenic.aucell import aucell import loompy ds = loompy.connect('filtered.loom') expr_matrix = pd.DataFrame(ds[:, :], index=ds.ra.Gene, columns=ds.ca.CellID).T ds.close() with open('regulons.pkl', 'rb') as f: regulons = pickle.load(f) # Score regulon activity per cell auc_mtx = aucell(expr_matrix, regulons, auc_threshold=0.05, num_workers=8) auc_mtx.to_csv('auc_matrix.csv') print(f'Scored {auc_mtx.shape[1]} regulons across {auc_mtx.shape[0]} cells')
CLI Alternative (All 3 Steps)
# Step 1: GRN inference pyscenic grn filtered.loom allTFs_hg38.txt -o adj.tsv --num_workers 8 # Step 2: cisTarget pruning pyscenic ctx adj.tsv \ hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.genes_vs_motifs.rankings.feather \ --annotations_fname motifs-v9-nr.hgnc-m0.001-o0.0.tbl \ --expression_mtx_fname filtered.loom \ --output reg.csv \ --num_workers 8 # Step 3: AUCell scoring pyscenic aucell filtered.loom reg.csv \ --output scenic_output.loom \ --num_workers 8
Interpreting Results
Regulon Specificity Score (RSS)
from pyscenic.rss import regulon_specificity_scores # RSS identifies regulons enriched in specific cell types cell_types = pd.read_csv('cell_types.csv', index_col=0)['cell_type'] rss = regulon_specificity_scores(auc_mtx, cell_types) # Top regulons per cell type for ct in rss.columns: top_regs = rss[ct].sort_values(ascending=False).head(5) print(f'\n{ct}:') for reg, score in top_regs.items(): print(f' {reg}: {score:.3f}')
Binary Regulon Activity
from pyscenic.binarization import binarize # Binarize AUC scores (on/off per cell) binary_mtx, thresholds = binarize(auc_mtx) # Fraction of cells with active regulon per cluster cluster_activity = binary_mtx.groupby(cell_types).mean()
Visualization
import scanpy as sc import seaborn as sns import matplotlib.pyplot as plt adata = sc.read_h5ad('clustered.h5ad') adata.obsm['X_aucell'] = auc_mtx.loc[adata.obs_names].values # Regulon activity on UMAP sc.pl.umap(adata, color=['CEBPB(+)', 'SPI1(+)', 'PAX5(+)'], cmap='viridis') # Heatmap of top regulons per cell type top_regulons = rss.apply(lambda x: x.nlargest(3).index.tolist()).explode().unique() sns.clustermap(auc_mtx[top_regulons].groupby(cell_types).mean().T, cmap='viridis', figsize=(10, 8), z_score=0) plt.savefig('regulon_heatmap.pdf', bbox_inches='tight')
Required Databases
Download cisTarget ranking databases and motif annotations:
# Human hg38 ranking databases (~1.5 GB each) wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc9nr/gene_based/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.genes_vs_motifs.rankings.feather # Motif-to-TF annotations wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.hgnc-m0.001-o0.0.tbl
Performance Tips
| Tip | Details |
|---|---|
| Subsample for GRN | Use 5000-10000 cells for Step 1; regulons transfer to full dataset |
| Use CLI for Step 1 | avoids dask issues |
| Parallelize | All three steps accept |
| Prefilter genes | Remove genes expressed in < 3 cells or < 1% of cells |
| Loom format | Standard input format; convert from h5ad with |
Parameters
| Parameter | Default | Description |
|---|---|---|
| | auto, grnboost2, or correlation |
| | Top targets per TF |
| | Parallel workers for SCENIC steps |
| | AUCell threshold (top fraction of ranked genes) |
Example Queries
- "Construct a gene regulatory network on my data using pySCENIC"
- "Infer the GRN regulons for these identified clusters"
Output Structure
output_dir/ ├── report.md ├── result.json ├── processed.h5ad ├── figures/ │ └── summary_plot.png ├── tables/ │ └── metrics.csv └── reproducibility/ ├── commands.sh ├── environment.yml └── checksums.sha256
Version Compatibility
Reference examples tested with: pySCENIC 0.12+, scanpy 1.10+, numpy 1.26+
Dependencies
Required: scanpy, numpy, pandas Optional: pyscenic, arboreto, ctxcore, loompy
Citations
- SCENIC — Aibar et al., Nature Methods 2017
- GRNBoost2 — Moerman et al., Bioinformatics 2019
- cisTarget — Verfaillie et al., Nucleic Acids Research 2015
- AUCell — scoring from SCENIC pipeline
- CellOracle — Kamimoto et al., Nature 2023
Safety
- Local-first: Strict offline processing without external upload.
- Disclaimer: Requires OmicsClaw reporting structures and disclaimers.
- Audit trail: Hyperparameters and operational flow states are logged fully.
Integration with Orchestrator
Trigger conditions:
- Automatically invoked dynamically based on tool metadata and user intent matching.
Chaining partners:
— QC and normalization before regulon analysissc-preprocess
— Downstream signaling from GRN regulonssc-communication
— Differential expression of regulon targetssc-de