LLMs-Universal-Life-Science-and-Clinical-Skills- sc-preprocessing
install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Transcriptomics/sc-preprocessing" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-sc-preprocessing && rm -rf "$T"
manifest: Skills/Transcriptomics/sc-preprocessing/SKILL.md
🧫 Single-Cell Preprocessing
You are SC Preprocessing, the foundation skill for single-cell analysis in OmicsClaw. Your role is to load scRNA-seq data, perform quality control filtering, normalization, and clustering.
Why This Exists
- Without it: Users write 30+ lines of boilerplate Scanpy code per dataset
- With it: One command handles QC → normalize → HVG → PCA → UMAP → Leiden
- Why OmicsClaw: Standardised preprocessing ensures reproducibility across downstream single-cell skills
Core Capabilities
- QC filtering: min genes/cells, mitochondrial percentage thresholds
- Normalization: Library-size normalization + log1p (or SCTransform in R)
- HVG selection: Seurat-flavored highly variable gene detection
- Embedding: PCA → neighbors → UMAP
- Clustering: Leiden community detection
- Confounder regression: Optional regress-out of technical variation
Input Formats
| Format | Extension | Required | Example |
|---|---|---|---|
| AnnData | `.h5ad` | Count matrix in `X` | |
| 10x H5 | `.h5` | Filtered feature matrix | |
| 10x MTX | directory | `matrix.mtx` + barcodes + features | |
| Demo | n/a | `--demo` flag | Built-in PBMC3k |
Workflow
- Calculate Metrics: Compute per-cell UMI counts, features, and mitochondrial percentage.
- Filter: Remove low-quality cells and uninformative genes.
- Normalize: Library size normalization and log-transformation.
- Embed & Cluster: Compute PCA, neighborhood graph, UMAP, and Leiden communities.
- Report: Produce `report.md` detailing cell drop-out rates and visualization plots.
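As an illustration of the Report step, the sketch below turns per-step cell counts into a small Markdown table of removal rates. The function name `retention_summary`, the step labels, and the cell counts are hypothetical placeholders; the actual layout of the OmicsClaw report is not specified here.

```python
def retention_summary(steps):
    """Build a Markdown table from (label, cells_remaining) pairs.

    `steps` is in pipeline order and starts with the unfiltered count.
    """
    header = "| Step | Cells before | Cells after | Removed |\n|---|---|---|---|"
    rows = []
    # Pair each step with its predecessor to get before/after counts.
    for (_, before), (label, after) in zip(steps, steps[1:]):
        removed_pct = 100.0 * (before - after) / before if before else 0.0
        rows.append(f"| {label} | {before} | {after} | {removed_pct:.1f}% |")
    return "\n".join([header, *rows])


# Toy numbers for illustration only, not measured results.
steps = [("raw", 1000), ("min_genes", 950), ("max_mt", 900)]
summary = retention_summary(steps)
print(summary)
```

Reporting the rate per step, rather than one overall figure, makes it easier to see which threshold is responsible when a dataset loses an unexpected share of its cells.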
CLI Reference
```bash
python skills/singlecell/preprocessing/sc_preprocess.py \
  --input <data.h5ad> --output <dir>
python skills/singlecell/preprocessing/sc_preprocess.py --demo --output /tmp/demo
python omicsclaw.py run sc-preprocessing --demo
```
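For orientation, a command-line interface with these three flags could be declared roughly as follows. This is a sketch, not the contents of `sc_preprocess.py`: only `--input`, `--output`, and `--demo` come from the commands above, while the mutual exclusion of `--input` and `--demo`, the required `--output`, and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the CLI reference."""
    parser = argparse.ArgumentParser(
        prog="sc_preprocess.py",
        description="Single-cell preprocessing (illustrative sketch)",
    )
    # Assumption: a run uses either a user-supplied file or the demo dataset.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--input", help="path to the input data, e.g. data.h5ad")
    source.add_argument("--demo", action="store_true", help="use the built-in demo data")
    # Assumption: an output directory is always required.
    parser.add_argument("--output", required=True, help="directory for results")
    return parser


args = build_parser().parse_args(["--demo", "--output", "/tmp/demo"])
print(args)
```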
Algorithm / Methodology
Scanpy (Python)
Goal: Preprocess scRNA-seq data through QC filtering, normalization, and feature selection using Scanpy.
Approach: Calculate per-cell quality metrics, filter low-quality cells/genes, normalize library sizes, identify highly variable genes, and scale for downstream analysis.
Calculate QC Metrics
```python
import scanpy as sc
import numpy as np

# Calculate mitochondrial gene percentage
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Key metrics added to adata.obs:
# - n_genes_by_counts: genes detected per cell
# - total_counts: total UMI counts per cell
# - pct_counts_mt: percentage mitochondrial
```
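To make the three metrics concrete, here is a dependency-free sketch that computes them for a toy count matrix. The gene names, counts, and the helper `qc_metrics` are made up for illustration. Reporting 0.0 for a cell with no counts is a choice of this sketch, not a statement about Scanpy's behavior.

```python
# Toy data: rows are cells, columns are genes (hypothetical values).
gene_names = ["MT-CO1", "MT-ND1", "CD3E", "LYZ"]
counts = [
    [10, 0, 5, 5],  # cell 0
    [0, 0, 0, 8],   # cell 1
    [1, 1, 1, 1],   # cell 2
]

# Same rule as above: mitochondrial genes are those prefixed with 'MT-'.
is_mt = [name.startswith("MT-") for name in gene_names]


def qc_metrics(row, is_mt):
    """Return per-cell QC metrics for one row of raw counts."""
    total = sum(row)
    n_genes = sum(1 for c in row if c > 0)
    mt_total = sum(c for c, mt in zip(row, is_mt) if mt)
    pct_mt = 100.0 * mt_total / total if total else 0.0
    return {
        "n_genes_by_counts": n_genes,
        "total_counts": total,
        "pct_counts_mt": pct_mt,
    }


metrics = [qc_metrics(row, is_mt) for row in counts]
for i, m in enumerate(metrics):
    print(i, m)
```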
Visualize QC Metrics
```python
import matplotlib.pyplot as plt

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
```
Filter Cells and Genes
```python
# Filter cells by QC metrics
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, max_genes=5000)

# Filter by mitochondrial percentage
adata = adata[adata.obs['pct_counts_mt'] < 20, :].copy()

# Filter genes
sc.pp.filter_genes(adata, min_cells=3)

print(f'After filtering: {adata.n_obs} cells, {adata.n_vars} genes')
```
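The cell-level filters amount to a boolean mask over the QC metrics. The sketch below applies the same thresholds (200, 5000, 20%) to a few hypothetical cells. The helper `keep_cell` and the metric values are illustrative, and the inclusive gene-count bounds are an assumption of this sketch.

```python
def keep_cell(metrics, min_genes=200, max_genes=5000, max_mt=20.0):
    """Return True if a cell passes all three QC thresholds."""
    enough_genes = metrics["n_genes_by_counts"] >= min_genes
    not_too_many = metrics["n_genes_by_counts"] <= max_genes
    healthy = metrics["pct_counts_mt"] < max_mt
    return enough_genes and not_too_many and healthy


# Hypothetical per-cell metrics.
cells = [
    {"n_genes_by_counts": 150, "pct_counts_mt": 3.0},   # too few genes
    {"n_genes_by_counts": 1200, "pct_counts_mt": 4.5},  # passes
    {"n_genes_by_counts": 6200, "pct_counts_mt": 2.0},  # too many genes
    {"n_genes_by_counts": 900, "pct_counts_mt": 35.0},  # high mitochondrial share
]

mask = [keep_cell(c) for c in cells]
print(mask)
```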
Normalization
```python
# Store raw counts before normalization
adata.raw = adata.copy()
adata.layers['counts'] = adata.X.copy()

# Library size normalization (normalize to 10,000 counts per cell)
sc.pp.normalize_total(adata, target_sum=1e4)

# Log transform
sc.pp.log1p(adata)
```
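The arithmetic behind these two calls can be written out for a single cell: each count is multiplied by `target_sum / total_counts`, then transformed with `log(1 + x)`. The sketch below uses plain Python and a made-up count vector. `normalize_log1p` is a hypothetical helper, not a Scanpy function.

```python
import math


def normalize_log1p(row, target_sum=1e4):
    """Scale one cell's counts to `target_sum`, then apply log(1 + x)."""
    total = sum(row)
    if total == 0:
        # Choice of this sketch: leave an all-zero cell as zeros.
        return [0.0 for _ in row]
    scale = target_sum / total
    return [math.log1p(c * scale) for c in row]


row = [1, 3, 0, 6]  # hypothetical raw counts, total = 10
norm = normalize_log1p(row)

# Undoing the log recovers values that sum to the target library size.
recovered_total = sum(math.expm1(v) for v in norm)
print(norm, recovered_total)
```

Zero counts stay exactly zero after both steps, so sparsity is preserved up to this point in the pipeline.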
Highly Variable Genes
```python
# Identify highly variable genes (default: top 2000)
# flavor='seurat_v3' expects raw counts (hence layer='counts')
# and needs the scikit-misc package installed
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor='seurat_v3', layer='counts')

# Visualize
sc.pl.highly_variable_genes(adata)

# Check results
print(f'Highly variable genes: {adata.var.highly_variable.sum()}')
```
Scaling and Embedding
```python
# Subset to HVGs
adata = adata[:, adata.var.highly_variable].copy()

# Scale to unit variance and zero mean
sc.pp.scale(adata, max_value=10)

# PCA, neighbors, UMAP, clustering
sc.tl.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=1.0)
```
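Scaling is a per-gene z-score with a cap on extreme values. The sketch below shows the idea for one gene across four cells, using only the standard library. `scale_gene` is a hypothetical helper, and using the sample standard deviation and clipping only the upper tail are assumptions of this sketch rather than a description of Scanpy's internals.

```python
import statistics


def scale_gene(values, max_value=10.0):
    """Z-score one gene's expression across cells and cap large values."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    if sd == 0:
        # A constant gene carries no signal; return zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [min((v - mean) / sd, max_value) for v in values]


expr = [0.0, 0.0, 0.0, 4.0]  # hypothetical log-normalized values
scaled = scale_gene(expr)
capped = scale_gene(expr, max_value=1.0)
print(scaled, capped)
```

The cap limits how much a single outlier cell can dominate the principal components computed in the next step.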
Regress Out Confounders (Optional)
```python
# Regress out unwanted variation (here: sequencing depth and mitochondrial fraction)
# Run this after log-normalization and before sc.pp.scale
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
```
Complete Pipeline
```python
import scanpy as sc

adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

# QC
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)

# Filter
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs['pct_counts_mt'] < 20, :].copy()

# Store raw
adata.raw = adata.copy()

# Normalize
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# HVGs
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Scale
adata = adata[:, adata.var.highly_variable].copy()
sc.pp.scale(adata, max_value=10)
```
Seurat (R)
Goal: Preprocess scRNA-seq data through QC filtering, normalization, and feature selection using Seurat.
Standard Log Normalization Pipeline
```r
library(Seurat)

counts <- Read10X(data.dir = 'filtered_feature_bc_matrix/')
seurat_obj <- CreateSeuratObject(counts = counts, min.cells = 3, min.features = 200)

# QC
seurat_obj[['percent.mt']] <- PercentageFeatureSet(seurat_obj, pattern = '^MT-')

# Filter
seurat_obj <- subset(seurat_obj,
                     subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 20)

# Normalize
seurat_obj <- NormalizeData(seurat_obj)

# HVGs
seurat_obj <- FindVariableFeatures(seurat_obj, nfeatures = 2000)

# Scale
seurat_obj <- ScaleData(seurat_obj)
```
SCTransform Pipeline (Recommended)
```r
library(Seurat)

counts <- Read10X(data.dir = 'filtered_feature_bc_matrix/')
seurat_obj <- CreateSeuratObject(counts = counts, min.cells = 3, min.features = 200)

# QC
seurat_obj[['percent.mt']] <- PercentageFeatureSet(seurat_obj, pattern = '^MT-')

# Filter
seurat_obj <- subset(seurat_obj,
                     subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 20)

# SCTransform (does normalization, HVG, and scaling)
seurat_obj <- SCTransform(seurat_obj, vars.to.regress = 'percent.mt', verbose = FALSE)
```
QC Thresholds Reference
| Metric | Typical Range | Notes |
|---|---|---|
| min_genes | 200-500 | Remove empty droplets |
| max_genes | 2500-5000 | Remove doublets |
| max_mt | 5-20% | Remove dying cells (tissue-dependent) |
| min_cells | 3-10 | Remove rarely detected genes |
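One way to keep thresholds explicit and reviewable is to hold them in a small config object and flag any value that falls outside the typical ranges in the table. The sketch below is illustrative: `QCThresholds`, `TYPICAL_RANGES`, and `outside_typical` are hypothetical names, and the defaults simply mirror the values used in the examples in this document.

```python
from dataclasses import dataclass

# Typical ranges copied from the QC thresholds table.
TYPICAL_RANGES = {
    "min_genes": (200, 500),
    "max_genes": (2500, 5000),
    "max_mt": (5, 20),
    "min_cells": (3, 10),
}


@dataclass(frozen=True)
class QCThresholds:
    min_genes: int = 200
    max_genes: int = 5000
    max_mt: float = 20.0  # percent
    min_cells: int = 3

    def outside_typical(self):
        """Return the names of thresholds that fall outside the typical range."""
        flagged = []
        for name, (low, high) in TYPICAL_RANGES.items():
            if not low <= getattr(self, name) <= high:
                flagged.append(name)
        return flagged


print(QCThresholds().outside_typical())
print(QCThresholds(max_mt=40.0).outside_typical())
```

A flagged value is not necessarily wrong, since the table notes that the mitochondrial cutoff is tissue-dependent. The check only marks settings that deserve a second look.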
Method Comparison
| Step | Scanpy | Seurat (Standard) | Seurat (SCTransform) |
|---|---|---|---|
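| Normalize | `normalize_total` + `log1p` | `NormalizeData` | `SCTransform` |
| HVGs | `highly_variable_genes` | `FindVariableFeatures` | (included) |
| Scale | `scale` | `ScaleData` | (included) |
| Regress | `regress_out` | `ScaleData(vars.to.regress)` | `SCTransform(vars.to.regress)` |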
Parameters
| Parameter | Default | Description |
|---|---|---|
| | | Min genes per cell |
| | | Min cells per gene |
| | | Max mitochondrial % |
| | | Number of HVGs |
| | | PCA components |
| | | Leiden resolution |
Example Queries
- "Run single cell preprocessing on this 10x h5 data"
- "Perform QC and clustering: filter out cells with >20% mito"
- "Normalize and cluster this PBMC count matrix using Scanpy"
Output Structure
```
output_dir/
├── report.md
├── processed.h5ad
├── result.json
├── figures/
│   ├── qc_violin.png
│   ├── hvg_plot.png
│   ├── umap_clusters.png
│   └── umap_genes.png
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    └── checksums.sha256
```
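A file such as `checksums.sha256` is typically a list of SHA-256 digests, one per output file. The sketch below shows one way such a manifest could be produced with the standard library, in the `digest  path` layout used by `sha256sum`. How OmicsClaw actually writes this file is not documented here, so `sha256_manifest` and the demo files are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_manifest(output_dir):
    """Return 'digest  relative/path' lines for every file under output_dir."""
    root = Path(output_dir)
    lines = []
    # Sort for a stable, diff-friendly manifest.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
    return "\n".join(lines) + "\n"


# Demo on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "figures").mkdir()
    (Path(tmp) / "report.md").write_text("demo\n")
    (Path(tmp) / "figures" / "qc_violin.png").write_bytes(b"")
    manifest = sha256_manifest(tmp)

print(manifest)
```

Comparing a manifest from a rerun against the stored one is a quick check that the same inputs and parameters reproduced the same outputs byte for byte.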
Version Compatibility
Reference examples tested with: scanpy 1.10+, numpy 1.26+, matplotlib 3.8+
Dependencies
Required: scanpy >= 1.9, anndata >= 0.11, numpy, pandas, matplotlib
Citations
- Scanpy — Wolf et al., Genome Biology 2018
- Seurat — Hao et al., Cell 2021
- SCTransform — Hafemeister & Satija, Genome Biology 2019
- Leiden algorithm — Traag et al., 2019
Safety
- Local-first: Strict offline processing without transmitting sample profiles.
- Disclaimer: Reproducible OmicsClaw reports clearly state parameter origins.
- Audit trail: Logs record every step, down to the random seed integers used in embedding.
Integration with Orchestrator
Trigger conditions:
- "preprocess", "QA/QC", "Scanpy pipeline", "filter normalize"
Chaining partners:
- `sc-doublet`: Doublet detection before preprocessing
- `sc-annotate`: Cell type annotation after clustering
- `sc-integrate`: Batch integration for multi-sample data