LLMs-Universal-Life-Science-and-Clinical-Skills- sc-batch-integration

install

source · Clone the upstream repo

git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Transcriptomics/sc-batch-integration" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-sc-batch-integrati && rm -rf "$T"

manifest: Skills/Transcriptomics/sc-batch-integration/SKILL.md

🔗 Single-Cell Batch Integration

Integrate multiple scRNA-seq datasets to remove batch effects while preserving biological variation.

Why This Exists

Without it: Multi-sample analysis is dominated by technical batch effects
With it: Corrected embedding space where clusters reflect biology, not batches
Why OmicsClaw: Automated integration with method selection and evaluation metrics

Tool Comparison

Tool	Speed	Scalability	Best For
Harmony	Fast	Good	Quick integration, most use cases
scVI	Moderate	Excellent	Large datasets, deep learning
Seurat CCA/RPCA	Moderate	Good	Conserved biology across batches
fastMNN	Fast	Good	MNN-based correction
BBKNN	Fast	Good	Lightweight, k-NN correction

Workflow

Calculate: Normalize the merged data and compute the per-batch inputs the chosen method needs.
Execute: Run the chosen integration method across batches.
Assess: Quantify batch mixing against preservation of biological signal.
Generate: Save the corrected matrices and compute the neighbor graph and UMAP.
Report: Write a report that includes the mixing metrics.

CLI Reference

python skills/singlecell/batch-integration/sc_integrate.py \
  --input <merged.h5ad> --output <dir>
python omicsclaw.py run sc-batch-integration --demo

Algorithm / Methodology

Harmony (Python — Scanpy)

Goal: Remove batch effects by iteratively correcting PCA embeddings.

import scanpy as sc
import scanpy.external as sce

adata = sc.read_h5ad('merged.h5ad')

# Standard preprocessing
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, batch_key='batch')
# Copy so that scaling modifies a real AnnData object rather than a view
adata = adata[:, adata.var.highly_variable].copy()
sc.pp.scale(adata)
sc.tl.pca(adata)

# Run Harmony
sce.pp.harmony_integrate(adata, key='batch')

# Use corrected embedding
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.umap(adata)
sc.tl.leiden(adata)

Harmony (R — Seurat)

library(Seurat)
library(harmony)

merged <- merge(sample1, y = list(sample2, sample3), add.cell.ids = c('S1', 'S2', 'S3'))
merged <- NormalizeData(merged)
merged <- FindVariableFeatures(merged)
merged <- ScaleData(merged)
merged <- RunPCA(merged)

# Run Harmony on PCA embeddings
merged <- RunHarmony(merged, group.by.vars = 'orig.ident', dims.use = 1:30)

# Use harmony embeddings for downstream
merged <- RunUMAP(merged, reduction = 'harmony', dims = 1:30)
merged <- FindNeighbors(merged, reduction = 'harmony', dims = 1:30)
merged <- FindClusters(merged, resolution = 0.5)

scVI (Python)

Goal: Integrate batches using a deep generative model that learns a shared latent space.

import scvi
import scanpy as sc

adata = sc.read_h5ad('merged.h5ad')

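# scVI models raw counts: adata.X must hold unnormalized counts
# (or pass the layer= argument if the counts are stored in a layer)
scvi.model.SCVI.setup_anndata(adata, batch_key='batch')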
model = scvi.model.SCVI(adata, n_latent=30, n_layers=2)
model.train(max_epochs=100, early_stopping=True)

adata.obsm['X_scVI'] = model.get_latent_representation()

sc.pp.neighbors(adata, use_rep='X_scVI')
sc.tl.umap(adata)
sc.tl.leiden(adata)

scANVI (with Cell Type Labels)

# Cells whose adata.obs['cell_type'] equals 'Unknown' are treated as unlabeled
scvi.model.SCANVI.setup_anndata(adata, batch_key='batch', labels_key='cell_type',
                                 unlabeled_category='Unknown')
model = scvi.model.SCANVI(adata, n_latent=30)
model.train(max_epochs=100)

adata.obs['predicted_type'] = model.predict()

Seurat CCA Integration (R)

library(Seurat)

obj_list <- SplitObject(merged, split.by = 'batch')
obj_list <- lapply(obj_list, function(x) {
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, nfeatures = 2000)
    return(x)
})

anchors <- FindIntegrationAnchors(object.list = obj_list, dims = 1:30)
integrated <- IntegrateData(anchorset = anchors, dims = 1:30)

DefaultAssay(integrated) <- 'integrated'
integrated <- ScaleData(integrated)
integrated <- RunPCA(integrated)
integrated <- RunUMAP(integrated, dims = 1:30)

Seurat RPCA (Faster for Large Datasets)

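features <- SelectIntegrationFeatures(object.list = obj_list)
obj_list <- lapply(obj_list, function(x) RunPCA(ScaleData(x, features = features), features = features))  # RPCA needs PCA on each object
anchors <- FindIntegrationAnchors(object.list = obj_list, anchor.features = features, dims = 1:30, reduction = 'rpca')
integrated <- IntegrateData(anchorset = anchors, dims = 1:30)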

Evaluate Integration

Mixing Metrics (R)

library(lisi)
lisi_scores <- compute_lisi(Embeddings(merged, 'harmony'),
                            merged@meta.data, c('batch', 'cell_type'))
mean(lisi_scores$batch)      # Want high, approaching the number of batches (batches mixed)
mean(lisi_scores$cell_type)  # Want low, approaching 1 (cell types preserved)
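
The LISI scores above are inverse Simpson's indices computed over each cell's local neighborhood. The following is a minimal, dependency-free Python sketch of that idea, not the `lisi` package's algorithm: it assumes a plain unweighted k-nearest-neighbor neighborhood, whereas the package weights neighbors with a perplexity-based Gaussian kernel, so its numbers will differ. The function name `simple_lisi` and the toy coordinates are hypothetical.

```python
import math
from collections import Counter

def simple_lisi(coords, labels, k=4):
    """Unweighted inverse Simpson's index of `labels` among each point's k nearest neighbors.

    Assumption: every neighbor counts equally (no kernel weighting).
    """
    scores = []
    for i, point in enumerate(coords):
        # Sort all other points by Euclidean distance and keep the k closest
        nearest = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(coords) if j != i
        )[:k]
        counts = Counter(labels[j] for _, j in nearest)
        n = len(nearest)
        simpson = sum((c / n) ** 2 for c in counts.values())
        scores.append(1.0 / simpson)
    return scores

# Toy 1-D "embedding": two batches alternate along a line (well mixed)
mixed_coords = [(float(x),) for x in range(10)]
mixed_labels = ["A", "B"] * 5

# Two distant groups, one batch each (not mixed)
split_coords = [(float(x),) for x in range(5)] + [(100.0 + x,) for x in range(5)]
split_labels = ["A"] * 5 + ["B"] * 5

mixed_mean = sum(simple_lisi(mixed_coords, mixed_labels)) / len(mixed_coords)
split_mean = sum(simple_lisi(split_coords, split_labels)) / len(split_coords)
```

With two batches, `mixed_mean` lands close to 2 and `split_mean` is exactly 1, which matches the "want high for batch, want low for cell type" reading of the scores.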

Silhouette Score (Python)

from sklearn.metrics import silhouette_score

batch_sil = silhouette_score(adata.obsm['X_scVI'], adata.obs['batch'])      # Want low
celltype_sil = silhouette_score(adata.obsm['X_scVI'], adata.obs['cell_type'])  # Want high

When to Use Each Method

Scenario	Recommended
Quick integration, most cases	Harmony
Large datasets (>500k cells)	scVI or Harmony
Strong batch effects	scVI
Reference mapping	Seurat anchors or scANVI
Preserving rare populations	fastMNN
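
The table above can be encoded as a simple lookup when method selection needs to be scripted. This is an illustrative sketch only: the scenario keys and the helper `recommend_methods` are hypothetical names, not part of the skill's interface, and the values are copied from the table.

```python
# Scenario keys are hypothetical shorthand for the table rows above
RECOMMENDATIONS = {
    "quick": ["Harmony"],
    "large_dataset": ["scVI", "Harmony"],       # >500k cells
    "strong_batch_effects": ["scVI"],
    "reference_mapping": ["Seurat anchors", "scANVI"],
    "rare_populations": ["fastMNN"],
}

def recommend_methods(scenario):
    """Return the recommended methods for a scenario, in table order."""
    try:
        return RECOMMENDATIONS[scenario]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"Unknown scenario {scenario!r}; expected one of: {known}")

large = recommend_methods("large_dataset")
rare = recommend_methods("rare_populations")
```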

Parameters

Parameter	Default	Description
`--method`	`harmony`	Integration method: harmony, scvi, scanorama, or bbknn
`--batch-key`	`batch`	Column with batch labels
`--n-latent`	`30`	Latent dimensions (scVI)
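
As a rough illustration of how these flags and their defaults could be declared, here is a minimal `argparse` sketch. It is not the source of `sc_integrate.py`; the help strings and the choice to make `--input` and `--output` required are assumptions.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the parameter table and the CLI reference."""
    parser = argparse.ArgumentParser(prog="sc_integrate.py")
    parser.add_argument("--input", required=True, help="Merged .h5ad file")
    parser.add_argument("--output", required=True, help="Output directory")
    parser.add_argument(
        "--method",
        default="harmony",
        choices=["harmony", "scvi", "scanorama", "bbknn"],
        help="Integration method",
    )
    parser.add_argument("--batch-key", default="batch", help="Column with batch labels")
    parser.add_argument("--n-latent", type=int, default=30, help="Latent dimensions (scVI)")
    return parser

# Parse an example command line instead of sys.argv
args = build_parser().parse_args(["--input", "merged.h5ad", "--output", "results"])
```

Restricting `--method` with `choices` makes an unsupported method fail at argument parsing rather than partway through a run.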

Example Queries

"Run Harmony integration on my cell clusters"
"Use scVI to eliminate technical batch effects"

Output Structure

output_dir/
├── report.md
├── result.json
├── processed.h5ad
├── figures/
│   └── summary_plot.png
├── tables/
│   └── metrics.csv
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    └── checksums.sha256
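
A file such as `reproducibility/checksums.sha256` can be produced with the standard library alone. The sketch below assumes a `sha256sum`-compatible line format (`<hex digest>  <relative path>`); the actual format written by the skill is not documented here, and `write_checksums` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large .h5ad files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_checksums(output_dir):
    """Write reproducibility/checksums.sha256 covering every other file under output_dir."""
    root = Path(output_dir)
    manifest = root / "reproducibility" / "checksums.sha256"
    manifest.parent.mkdir(parents=True, exist_ok=True)
    lines = [
        f"{sha256_of(path)}  {path.relative_to(root).as_posix()}"
        for path in sorted(root.rglob("*"))
        if path.is_file() and path != manifest
    ]
    manifest.write_text("\n".join(lines) + "\n")
    return lines

# Demo on a throwaway directory containing a single report file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.md").write_text("# demo\n")
    entries = write_checksums(tmp)
    manifest_exists = (Path(tmp) / "reproducibility" / "checksums.sha256").is_file()
```

Under the assumed format, the outputs can later be verified by running `sha256sum -c reproducibility/checksums.sha256` from inside the output directory.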

Version Compatibility

Reference examples tested with: scanpy 1.10+, scvi-tools 1.1+, anndata 0.10+

Dependencies

Required: scanpy >= 1.9, anndata
Optional: scvi-tools, harmonypy, bbknn

Citations

Harmony — Korsunsky et al., Nature Methods 2019
scVI — Lopez et al., Nature Methods 2018
Seurat v3 — Stuart et al., Cell 2019
BBKNN — Polanski et al., Bioinformatics 2020

Safety

Local-first: All processing runs offline; no data is uploaded externally.
Disclaimer: Reports follow the OmicsClaw reporting structure and include its disclaimers.
Audit trail: Hyperparameters and each workflow step are logged in full.

Integration with Orchestrator

Trigger conditions:

Invoked automatically when the tool metadata matches the user's intent.

Chaining partners:

`sc-preprocess`: QC before integration
`sc-annotate`: Annotation after integration
`sc-doublet`: Doublet removal before integration
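
The before/after constraints listed above amount to a small dependency graph. How the orchestrator actually schedules skills is not described here, so the following is only an illustrative sketch that derives one valid run order with the standard-library `graphlib` module; the `DEPENDENCIES` mapping is an assumption built from the list.

```python
from graphlib import TopologicalSorter

# Each skill maps to the skills that must finish before it starts
DEPENDENCIES = {
    "sc-batch-integration": {"sc-preprocess", "sc-doublet"},
    "sc-annotate": {"sc-batch-integration"},
}

# static_order() yields every skill after all of its prerequisites
order = list(TopologicalSorter(DEPENDENCIES).static_order())
```

The relative order of `sc-preprocess` and `sc-doublet` is left open because the list only states that both run before integration.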