LLMs-Universal-Life-Science-and-Clinical-Skills- sc-batch-integration
install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Transcriptomics/sc-batch-integration" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-sc-batch-integrati && rm -rf "$T"
manifest: Skills/Transcriptomics/sc-batch-integration/SKILL.md
🔗 Single-Cell Batch Integration
Integrate multiple scRNA-seq datasets to remove batch effects while preserving biological variation.
Why This Exists
- Without it: Multi-sample analysis is dominated by technical batch effects
- With it: Corrected embedding space where clusters reflect biology, not batches
- Why OmicsClaw: Automated integration with method selection and evaluation metrics
Tool Comparison
| Tool | Speed | Scalability | Best For |
|---|---|---|---|
| Harmony | Fast | Good | Quick integration, most use cases |
| scVI | Moderate | Excellent | Large datasets, deep learning |
| Seurat CCA/RPCA | Moderate | Good | Conserved biology across batches |
| fastMNN | Fast | Good | MNN-based correction |
| BBKNN | Fast | Good | Lightweight, k-NN correction |
Workflow
- Calculate: Prepare modalities and normalize batch representations.
- Execute: Run chosen integration mechanism across sample blocks.
- Assess: Quantify batch mixing versus bio-preservation.
- Generate: Save corrected matrices and compute UMAP graph.
- Report: Synthesize report with mixing scoring metadata.
CLI Reference
    python skills/singlecell/batch-integration/sc_integrate.py \
      --input <merged.h5ad> --output <dir>

    python omicsclaw.py run sc-batch-integration --demo
Algorithm / Methodology
Harmony (Python — Scanpy)
Goal: Remove batch effects by iteratively correcting PCA embeddings.
    import scanpy as sc
    import scanpy.external as sce

    adata = sc.read_h5ad('merged.h5ad')

    # Standard preprocessing
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, batch_key='batch')
    # Copy the subset so scaling operates on real data rather than a view
    adata = adata[:, adata.var.highly_variable].copy()
    sc.pp.scale(adata)
    sc.tl.pca(adata)

    # Run Harmony (needs harmonypy); writes adata.obsm['X_pca_harmony']
    sce.pp.harmony_integrate(adata, key='batch')

    # Use corrected embedding
    sc.pp.neighbors(adata, use_rep='X_pca_harmony')
    sc.tl.umap(adata)
    sc.tl.leiden(adata)
Harmony (R — Seurat)
    library(Seurat)
    library(harmony)

    merged <- merge(sample1, y = list(sample2, sample3),
                    add.cell.ids = c('S1', 'S2', 'S3'))
    merged <- NormalizeData(merged)
    merged <- FindVariableFeatures(merged)
    merged <- ScaleData(merged)
    merged <- RunPCA(merged)

    # Run Harmony on PCA embeddings
    merged <- RunHarmony(merged, group.by.vars = 'orig.ident', dims.use = 1:30)

    # Use harmony embeddings for downstream
    merged <- RunUMAP(merged, reduction = 'harmony', dims = 1:30)
    merged <- FindNeighbors(merged, reduction = 'harmony', dims = 1:30)
    merged <- FindClusters(merged, resolution = 0.5)
scVI (Python)
Goal: Integrate batches using a deep generative model that learns a shared latent space.
    import scvi
    import scanpy as sc

    # scVI models raw counts, so adata.X should hold unnormalized counts
    adata = sc.read_h5ad('merged.h5ad')

    scvi.model.SCVI.setup_anndata(adata, batch_key='batch')
    model = scvi.model.SCVI(adata, n_latent=30, n_layers=2)
    model.train(max_epochs=100, early_stopping=True)

    # Store the latent representation and use it for downstream steps
    adata.obsm['X_scVI'] = model.get_latent_representation()
    sc.pp.neighbors(adata, use_rep='X_scVI')
    sc.tl.umap(adata)
    sc.tl.leiden(adata)
scANVI (with Cell Type Labels)
    scvi.model.SCANVI.setup_anndata(adata, batch_key='batch',
                                    labels_key='cell_type',
                                    unlabeled_category='Unknown')
    model = scvi.model.SCANVI(adata, n_latent=30)
    model.train(max_epochs=100)
    adata.obs['predicted_type'] = model.predict()
Seurat CCA Integration (R)
    library(Seurat)

    obj_list <- SplitObject(merged, split.by = 'batch')
    obj_list <- lapply(obj_list, function(x) {
      x <- NormalizeData(x)
      x <- FindVariableFeatures(x, nfeatures = 2000)
      return(x)
    })

    anchors <- FindIntegrationAnchors(object.list = obj_list, dims = 1:30)
    integrated <- IntegrateData(anchorset = anchors, dims = 1:30)

    DefaultAssay(integrated) <- 'integrated'
    integrated <- ScaleData(integrated)
    integrated <- RunPCA(integrated)
    integrated <- RunUMAP(integrated, dims = 1:30)
Seurat RPCA (Faster for Large Datasets)
    # RPCA needs ScaleData() and RunPCA() run on every object in obj_list first
    anchors <- FindIntegrationAnchors(object.list = obj_list, dims = 1:30,
                                      reduction = 'rpca')
    integrated <- IntegrateData(anchorset = anchors, dims = 1:30)
Evaluate Integration
Mixing Metrics (R)
    library(lisi)

    lisi_scores <- compute_lisi(Embeddings(merged, 'harmony'),
                                merged@meta.data,
                                c('batch', 'cell_type'))
    mean(lisi_scores$batch)      # Want high (batches mixed)
    mean(lisi_scores$cell_type)  # Want low (types preserved)
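For readers working in Python, the idea behind the LISI scores above can be sketched with NumPy alone. This is a simplified, unweighted variant: it takes each cell's k nearest neighbors and computes the inverse Simpson's index of their labels, whereas the `lisi` package weights neighbors with a perplexity-based kernel. The function name `simple_lisi`, the default `k=15`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def simple_lisi(embedding, labels, k=15):
    """Unweighted inverse Simpson's index of labels among each cell's k nearest neighbors.

    Illustration only: the lisi R package uses perplexity-based neighbor
    weighting, which this sketch omits.
    """
    X = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances (O(n^2) memory, fine for toy data)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # exclude each cell from its own neighborhood
    nn = np.argsort(d2, axis=1)[:, :k]
    scores = np.empty(X.shape[0])
    for i, idx in enumerate(nn):
        _, counts = np.unique(labels[idx], return_counts=True)
        p = counts / counts.sum()
        scores[i] = 1.0 / np.sum(p ** 2)
    return scores

# Toy data: two well-separated cell types, each containing both batches
rng = np.random.default_rng(0)
n = 100
cell_type = np.repeat(['A', 'B'], n)
batch = np.tile(['b1', 'b2'], n)
emb = rng.normal(size=(2 * n, 2)) + np.where(cell_type == 'A', 0.0, 20.0)[:, None]

batch_lisi = simple_lisi(emb, batch).mean()          # nears 2 when two batches mix
celltype_lisi = simple_lisi(emb, cell_type).mean()   # nears 1 when types stay separate
```

With two labels the score ranges from 1 (neighbors all share one label) to 2 (neighbors split evenly), which is why a high batch score and a low cell-type score together suggest good integration.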
Silhouette Score (Python)
    from sklearn.metrics import silhouette_score

    batch_sil = silhouette_score(adata.obsm['X_scVI'], adata.obs['batch'])         # Want low
    celltype_sil = silhouette_score(adata.obsm['X_scVI'], adata.obs['cell_type'])  # Want high
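Because the two silhouette values above point in opposite directions (low is good for batch, high is good for cell type), it can help to rescale them onto a common 0 to 1 range before comparing methods. The sketch below shows one possible convention. The function names, the `bio_weight=0.6` default, and the input values are all assumptions chosen for illustration, not part of the skill or measured results.

```python
def rescale_silhouettes(batch_sil, celltype_sil):
    """Map raw silhouette widths (range -1..1) onto 0..1 scores where higher is better."""
    batch_mixing = 1.0 - abs(batch_sil)            # 1 when batches are indistinguishable
    bio_conservation = (celltype_sil + 1.0) / 2.0  # 1 when cell types are well separated
    return batch_mixing, bio_conservation

def overall_score(batch_sil, celltype_sil, bio_weight=0.6):
    """Weighted mean of the two rescaled scores; bio_weight is a hypothetical default."""
    batch_mixing, bio_conservation = rescale_silhouettes(batch_sil, celltype_sil)
    return bio_weight * bio_conservation + (1.0 - bio_weight) * batch_mixing

# Made-up silhouette values for two hypothetical integration runs
well_integrated = overall_score(batch_sil=0.02, celltype_sil=0.55)
poorly_integrated = overall_score(batch_sil=0.40, celltype_sil=0.10)
```

A single combined number is convenient for ranking runs, but it hides the trade-off, so the two component scores are worth reporting alongside it.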
When to Use Each Method
| Scenario | Recommended |
|---|---|
| Quick integration, most cases | Harmony |
| Large datasets (>500k cells) | scVI or Harmony |
| Strong batch effects | scVI |
| Reference mapping | Seurat anchors or scANVI |
| Preserving rare populations | fastMNN |
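The table above can be read as a small decision rule. The sketch below encodes it as a Python function. The function name, argument names, and the order in which scenarios are checked are assumptions, since the table does not say which row wins when several apply.

```python
def recommend_method(n_cells, strong_batch_effects=False,
                     reference_mapping=False, rare_populations=False):
    """Return the methods the table suggests for a scenario.

    The first matching scenario wins; this precedence is an assumption.
    """
    if reference_mapping:
        return ['Seurat anchors', 'scANVI']
    if rare_populations:
        return ['fastMNN']
    if strong_batch_effects:
        return ['scVI']
    if n_cells > 500_000:
        return ['scVI', 'Harmony']
    return ['Harmony']  # quick integration, most cases

default_choice = recommend_method(n_cells=20_000)
large_choice = recommend_method(n_cells=800_000)
```

Treat the result as a starting point: the evaluation metrics in the previous section remain the way to confirm that a chosen method worked on a given dataset.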
Parameters
| Parameter | Default | Description |
|---|---|---|
| | | harmony, scvi, scanorama, bbknn |
| | | Column with batch labels |
| | | Latent dimensions (scVI) |
Example Queries
- "Run Harmony integration on my cell clusters"
- "Use scVI to eliminate technical batch effects"
Output Structure
    output_dir/
    ├── report.md
    ├── result.json
    ├── processed.h5ad
    ├── figures/
    │   └── summary_plot.png
    ├── tables/
    │   └── metrics.csv
    └── reproducibility/
        ├── commands.sh
        ├── environment.yml
        └── checksums.sha256
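The `reproducibility/checksums.sha256` file in the layout above suggests that outputs can be verified after the fact. How OmicsClaw actually writes that file is not shown here, so the following is only a sketch of one way to produce and check such a manifest with the standard library. The helper names `write_checksums` and `verify_checksums` and the `sha256sum`-style line format are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path

MANIFEST = 'reproducibility/checksums.sha256'  # path taken from the layout above

def write_checksums(output_dir, manifest_name=MANIFEST):
    """Hash every file under output_dir (except the manifest) into a sha256sum-style file."""
    output_dir = Path(output_dir)
    manifest = output_dir / manifest_name
    manifest.parent.mkdir(parents=True, exist_ok=True)
    lines = []
    for path in sorted(output_dir.rglob('*')):
        if path.is_file() and path != manifest:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(output_dir).as_posix()}")
    manifest.write_text("".join(line + "\n" for line in lines))
    return manifest

def verify_checksums(output_dir, manifest_name=MANIFEST):
    """Return relative paths whose current hash differs from the manifest entry."""
    output_dir = Path(output_dir)
    mismatched = []
    for line in (output_dir / manifest_name).read_text().splitlines():
        if not line:
            continue
        digest, rel_path = line.split("  ", 1)
        current = hashlib.sha256((output_dir / rel_path).read_bytes()).hexdigest()
        if current != digest:
            mismatched.append(rel_path)
    return mismatched

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'report.md').write_text('# demo report\n')
    write_checksums(tmp)
    unchanged = verify_checksums(tmp)   # nothing modified yet
    (Path(tmp) / 'report.md').write_text('edited\n')
    changed = verify_checksums(tmp)     # report.md no longer matches
```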
Version Compatibility
Reference examples tested with: scanpy 1.10+, scvi-tools 1.1+, anndata 0.10+
Dependencies
- Required: scanpy >= 1.9, anndata
- Optional: scvi-tools, harmonypy, bbknn
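Since the integration backends are optional, it can be useful to check which ones are importable before choosing a method. The sketch below uses only the standard library. The mapping from method name to import name (`harmonypy`, `scvi`, `bbknn`) reflects each package's usual module name, and the helper name `available_methods` is hypothetical.

```python
from importlib.util import find_spec

# Method name -> top-level module to look for (assumed import names)
METHOD_MODULES = {
    'harmony': 'harmonypy',
    'scvi': 'scvi',       # installed by the scvi-tools package
    'bbknn': 'bbknn',
}

def available_methods(method_modules=METHOD_MODULES):
    """Report which optional backends can be imported in this environment."""
    return {method: find_spec(module) is not None
            for method, module in method_modules.items()}

availability = available_methods()
for method, ok in availability.items():
    print(f"{method}: {'available' if ok else 'not installed'}")
```

`find_spec` only locates the module without importing it, so the check stays fast even when heavy packages such as scvi-tools are present.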
Citations
- Harmony — Korsunsky et al., Nature Methods 2019
- scVI — Lopez et al., Nature Methods 2018
- Seurat v3 — Stuart et al., Cell 2019
- BBKNN — Polanski et al., Bioinformatics 2020
Safety
- Local-first: Strict offline processing without external upload.
- Disclaimer: Reports must follow the OmicsClaw reporting structure and include its disclaimers.
- Audit trail: Hyperparameters and workflow state transitions are logged in full.
Integration with Orchestrator
Trigger conditions:
- Invoked automatically when the user's intent matches this tool's metadata.
Chaining partners:
- sc-preprocess: QC before integration
- sc-annotate: Annotation after integration
- sc-doublet: Doublet removal before integration