LLMs-Universal-Life-Science-and-Clinical-Skills- sc-preprocessing

install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Transcriptomics/sc-preprocessing" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-sc-preprocessing && rm -rf "$T"
manifest: Skills/Transcriptomics/sc-preprocessing/SKILL.md
source content

🧫 Single-Cell Preprocessing

You are SC Preprocessing, the foundation skill for single-cell analysis in OmicsClaw. Your role is to load scRNA-seq data, perform quality control filtering, normalization, and clustering.

Why This Exists

  • Without it: Users write 30+ lines of boilerplate Scanpy code per dataset
  • With it: One command handles QC → normalize → HVG → PCA → UMAP → Leiden
  • Why OmicsClaw: Standardised preprocessing ensures reproducibility across downstream single-cell skills

Core Capabilities

  1. QC filtering: min genes/cells, mitochondrial percentage thresholds
  2. Normalization: Library-size normalization + log1p (or SCTransform in R)
  3. HVG selection: Seurat-flavored highly variable gene detection
  4. Embedding: PCA → neighbors → UMAP
  5. Clustering: Leiden community detection
  6. Confounder regression: Optional regress-out of technical variation

Input Formats

FormatExtensionRequiredExample
AnnData
.h5ad
Count matrix in X
raw_sc.h5ad
10x H5
.h5
Filtered feature matrix
filtered_feature_bc_matrix.h5
10x MTXdirectory
matrix.mtx.gz
+ barcodes + features
filtered_feature_bc_matrix/
Demon/a
--demo
flag
Built-in PBMC3k

Workflow

  1. Calculate Metrics: Compute per-cell UMI counts, features, and mitochondrial percentage.
  2. Filter: Remove low-quality cells and uninformative genes.
  3. Normalize: Library size normalization and log-transformation.
  4. Embed & Cluster: Compute PCA, neighborhood graph, UMAP, and Leiden communities.
  5. Report: Produce
    report.md
    detailing cell drop-out rates and visualization plots.

CLI Reference

python skills/singlecell/preprocessing/sc_preprocess.py \
  --input <data.h5ad> --output <dir>
python skills/singlecell/preprocessing/sc_preprocess.py --demo --output /tmp/demo
python omicsclaw.py run sc-preprocessing --demo

Algorithm / Methodology

Scanpy (Python)

Goal: Preprocess scRNA-seq data through QC filtering, normalization, and feature selection using Scanpy.

Approach: Calculate per-cell quality metrics, filter low-quality cells/genes, normalize library sizes, identify highly variable genes, and scale for downstream analysis.

Calculate QC Metrics

import scanpy as sc
import numpy as np

# Calculate mitochondrial gene percentage
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Key metrics added to adata.obs:
# - n_genes_by_counts: genes detected per cell
# - total_counts: total UMI counts per cell
# - pct_counts_mt: percentage mitochondrial

Visualize QC Metrics

import matplotlib.pyplot as plt

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

Filter Cells and Genes

# Filter cells by QC metrics
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, max_genes=5000)

# Filter by mitochondrial percentage
adata = adata[adata.obs['pct_counts_mt'] < 20, :].copy()

# Filter genes
sc.pp.filter_genes(adata, min_cells=3)

print(f'After filtering: {adata.n_obs} cells, {adata.n_vars} genes')

Normalization

# Store raw counts before normalization
adata.raw = adata.copy()
adata.layers['counts'] = adata.X.copy()

# Library size normalization (normalize to 10,000 counts per cell)
sc.pp.normalize_total(adata, target_sum=1e4)

# Log transform
sc.pp.log1p(adata)

Highly Variable Genes

# Identify highly variable genes (default: top 2000)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor='seurat_v3', layer='counts')

# Visualize
sc.pl.highly_variable_genes(adata)

# Check results
print(f'Highly variable genes: {adata.var.highly_variable.sum()}')

Scaling and Embedding

# Subset to HVGs
adata = adata[:, adata.var.highly_variable].copy()

# Scale to unit variance and zero mean
sc.pp.scale(adata, max_value=10)

# PCA, neighbors, UMAP, clustering
sc.tl.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=1.0)

Regress Out Confounders (Optional)

# Regress out unwanted variation (e.g., cell cycle, mitochondrial)
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])

Complete Pipeline

import scanpy as sc

adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

# QC
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)

# Filter
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs['pct_counts_mt'] < 20, :].copy()

# Store raw
adata.raw = adata.copy()

# Normalize
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# HVGs
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Scale
adata = adata[:, adata.var.highly_variable].copy()
sc.pp.scale(adata, max_value=10)

Seurat (R)

Goal: Preprocess scRNA-seq data through QC filtering, normalization, and feature selection using Seurat.

Standard Log Normalization Pipeline

library(Seurat)

counts <- Read10X(data.dir = 'filtered_feature_bc_matrix/')
seurat_obj <- CreateSeuratObject(counts = counts, min.cells = 3, min.features = 200)

# QC
seurat_obj[['percent.mt']] <- PercentageFeatureSet(seurat_obj, pattern = '^MT-')

# Filter
seurat_obj <- subset(seurat_obj,
    subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 20)

# Normalize
seurat_obj <- NormalizeData(seurat_obj)

# HVGs
seurat_obj <- FindVariableFeatures(seurat_obj, nfeatures = 2000)

# Scale
seurat_obj <- ScaleData(seurat_obj)

SCTransform Pipeline (Recommended)

library(Seurat)

counts <- Read10X(data.dir = 'filtered_feature_bc_matrix/')
seurat_obj <- CreateSeuratObject(counts = counts, min.cells = 3, min.features = 200)

# QC
seurat_obj[['percent.mt']] <- PercentageFeatureSet(seurat_obj, pattern = '^MT-')

# Filter
seurat_obj <- subset(seurat_obj,
    subset = nFeature_RNA > 200 & nFeature_RNA < 5000 & percent.mt < 20)

# SCTransform (does normalization, HVG, and scaling)
seurat_obj <- SCTransform(seurat_obj, vars.to.regress = 'percent.mt', verbose = FALSE)

QC Thresholds Reference

MetricTypical RangeNotes
min_genes200-500Remove empty droplets
max_genes2500-5000Remove doublets
max_mt5-20%Remove dying cells (tissue-dependent)
min_cells3-10Remove rarely detected genes

Method Comparison

StepScanpySeurat (Standard)Seurat (SCTransform)
Normalize
normalize_total
+
log1p
NormalizeData
SCTransform
HVGs
highly_variable_genes
FindVariableFeatures
(included)
Scale
scale
ScaleData
(included)
Regress
regress_out
ScaleData(vars.to.regress)
SCTransform(vars.to.regress)

Parameters

ParameterDefaultDescription
--min-genes
200
Min genes per cell
--min-cells
3
Min cells per gene
--max-mt-pct
20.0
Max mitochondrial %
--n-top-hvg
2000
Number of HVGs
--n-pcs
50
PCA components
--leiden-resolution
1.0
Leiden resolution

Example Queries

  • "Run single cell preprocessing on this 10x h5 data"
  • "Perform QC and clustering: filter out cells with >20% mito"
  • "Normalize and cluster this PBMC count matrix using Scanpy"

Output Structure

output_dir/
├── report.md
├── processed.h5ad
├── result.json
├── figures/
│   ├── qc_violin.png
│   ├── hvg_plot.png
│   ├── umap_clusters.png
│   └── umap_genes.png
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    └── checksums.sha256

Version Compatibility

Reference examples tested with: scanpy 1.10+, numpy 1.26+, matplotlib 3.8+

Dependencies

Required: scanpy >= 1.9, anndata >= 0.11, numpy, pandas, matplotlib

Citations

Safety

  • Local-first: Strict offline processing without transmitting sample profiles.
  • Disclaimer: Reproducible OmicsClaw reports clearly state parameter origins.
  • Audit trail: Logging traces down to seed integers used in embedding.

Integration with Orchestrator

Trigger conditions:

  • "preprocess", "QA/QC", "Scanpy pipeline", "filter normalize"

Chaining partners:

  • sc-doublet
    — Doublet detection before preprocessing
  • sc-annotate
    — Cell type annotation after clustering
  • sc-integrate
    — Batch integration for multi-sample data