BioSkills bio-differential-expression-batch-correction

Remove batch effects from RNA-seq data using ComBat, ComBat-Seq, limma removeBatchEffect, and SVA for unknown batch variables. Use when correcting batch effects in expression data.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/differential-expression/batch-correction" ~/.claude/skills/gptomics-bioskills-bio-differential-expression-batch-correction && rm -rf "$T"
manifest: differential-expression/batch-correction/SKILL.md
source content

Version Compatibility

Reference examples tested with: DESeq2 1.42+, ggplot2 3.5+, limma 3.58+, scanpy 1.10+

Before using code patterns, verify installed versions match. If versions differ:

  • R:
    packageVersion('<pkg>')
    then
    ?function_name
    to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Batch Effect Correction

ComBat-Seq (Count Data)

Goal: Remove batch effects from raw count data while preserving biological group differences.

Approach: Apply ComBat-Seq's negative binomial regression to adjust counts, keeping the integer nature of the data.

"Remove batch effects from my RNA-seq counts" → Adjust raw count matrix for known batch labels using negative binomial modeling, preserving biological condition effects.

library(sva)

# counts: raw count matrix (genes x samples)
# batch: vector of batch labels
# group: vector of biological condition (optional, to preserve)

corrected_counts <- ComBat_seq(counts = as.matrix(counts),
                                batch = batch,
                                group = condition,
                                full_mod = TRUE)

# Result is batch-corrected count matrix
# Use for visualization, clustering, but NOT for DE (use design formula instead)

ComBat (Normalized Data)

Goal: Remove batch effects from normalized (log-transformed or TPM) expression data.

Approach: Apply parametric empirical Bayes adjustment to normalized expression while protecting biological covariates.

library(sva)

# For normalized expression (log-transformed, TPM, etc.)
# NOT for raw counts

# Create model matrix
mod <- model.matrix(~ condition, data = metadata)
mod0 <- model.matrix(~ 1, data = metadata)

# Run ComBat
corrected_expr <- ComBat(dat = as.matrix(normalized_expr),
                          batch = metadata$batch,
                          mod = mod,
                          par.prior = TRUE)

limma removeBatchEffect

Goal: Produce batch-corrected expression values for visualization while preserving group differences.

Approach: Regress out the batch effect from normalized expression using limma's linear model.

library(limma)

# For visualization/clustering only
# Preserves group differences while removing batch

design <- model.matrix(~ condition, data = metadata)
corrected_expr <- removeBatchEffect(normalized_expr,
                                     batch = metadata$batch,
                                     design = design)

# For PCA, heatmaps, etc.

DESeq2 Design Formula (Recommended for DE)

Goal: Account for batch effects during DE testing without modifying the count data.

Approach: Include batch as a covariate in the DESeq2 design formula so batch variance is modeled, not removed.

library(DESeq2)

# Include batch in design formula - preferred for DE analysis
dds <- DESeqDataSetFromMatrix(countData = counts,
                               colData = metadata,
                               design = ~ batch + condition)

# Batch is modeled, not removed
# DE results are adjusted for batch
dds <- DESeq(dds)
res <- results(dds, contrast = c('condition', 'treatment', 'control'))

Surrogate Variable Analysis (SVA)

Goal: Discover and correct for unknown sources of variation (hidden batch effects).

Approach: Estimate surrogate variables from the residual variation not explained by the biological model.

"Correct for unknown batch effects in my expression data" → Estimate latent surrogate variables capturing unwanted variation, then include them as covariates in the DE model.

library(sva)

# When batch is unknown, estimate surrogate variables
mod <- model.matrix(~ condition, data = metadata)
mod0 <- model.matrix(~ 1, data = metadata)

# Estimate number of surrogate variables
n_sv <- num.sv(normalized_expr, mod, method = 'leek')

# Estimate surrogate variables
svobj <- sva(normalized_expr, mod, mod0, n.sv = n_sv)

# Add SVs to design for DE
design_with_sv <- cbind(mod, svobj$sv)

SVA with DESeq2

Goal: Integrate surrogate variables into DESeq2 to adjust for hidden confounders during DE testing.

Approach: Estimate SVs from normalized counts, add them to colData, and update the design formula.

library(DESeq2)
library(sva)

# Normalize for SV estimation
dds <- DESeqDataSetFromMatrix(countData = counts, colData = metadata, design = ~ condition)
dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)

# Estimate SVs
mod <- model.matrix(~ condition, data = metadata)
mod0 <- model.matrix(~ 1, data = metadata)
svobj <- sva(norm_counts, mod, mod0)

# Add SVs to colData
for (i in seq_len(ncol(svobj$sv))) {
    colData(dds)[[paste0('SV', i)]] <- svobj$sv[, i]
}

# Update design
sv_formula <- as.formula(paste('~', paste(paste0('SV', 1:ncol(svobj$sv)), collapse = ' + '), '+ condition'))
design(dds) <- sv_formula

# Run DESeq2
dds <- DESeq(dds)

Visualize Batch Effects

Goal: Confirm batch effect removal by comparing PCA plots before and after correction.

Approach: Run PCA on pre- and post-correction expression, coloring points by batch and condition.

library(ggplot2)

# PCA before correction
pca_before <- prcomp(t(normalized_expr), scale. = TRUE)
pca_df <- data.frame(PC1 = pca_before$x[, 1], PC2 = pca_before$x[, 2],
                     batch = metadata$batch, condition = metadata$condition)

p1 <- ggplot(pca_df, aes(PC1, PC2, color = batch, shape = condition)) +
    geom_point(size = 3) + ggtitle('Before Correction')

# PCA after correction
pca_after <- prcomp(t(corrected_expr), scale. = TRUE)
pca_df_after <- data.frame(PC1 = pca_after$x[, 1], PC2 = pca_after$x[, 2],
                           batch = metadata$batch, condition = metadata$condition)

p2 <- ggplot(pca_df_after, aes(PC1, PC2, color = batch, shape = condition)) +
    geom_point(size = 3) + ggtitle('After Correction')

library(patchwork)
p1 + p2

Quantify Batch Effect

Goal: Measure the proportion of variance attributable to batch versus biological condition.

Approach: Correlate principal components with batch and condition labels, or use PVCA.

# PVCA - Principal Variance Component Analysis
library(pvca)

# Proportion of variance explained by batch vs condition
pvcaObj <- pvcaBatchAssess(normalized_expr, metadata, threshold = 0.6,
                            theInteractionTerms = c('batch', 'condition'))

# Or manual approach
pca <- prcomp(t(normalized_expr), scale. = TRUE)
variance_explained <- summary(pca)$importance[2, 1:5]

# Correlation of PCs with batch
cor(pca$x[, 1], as.numeric(as.factor(metadata$batch)))

Harmony (Single-Cell Integration)

Goal: Integrate single-cell data from multiple batches into a shared embedding.

Approach: Apply Harmony to PCA embeddings to iteratively remove batch effects while preserving cell-type structure.

library(harmony)
library(Seurat)

# For single-cell data with multiple batches
seurat_obj <- RunHarmony(seurat_obj, group.by.vars = 'batch', reduction = 'pca',
                          dims.use = 1:30)

# Use harmony reduction for downstream
seurat_obj <- RunUMAP(seurat_obj, reduction = 'harmony', dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction = 'harmony', dims = 1:30)

Critical: When NOT to Use Corrected Counts

Never use batch-corrected counts as input to DESeq2 or edgeR. Two-step correction (ComBat/ComBat-Seq then DE) introduces correlation structure in residuals, inflating or deflating significance. Include batch as a covariate in the design formula (

~ batch + condition
) instead.

limma::removeBatchEffect()
is designed for visualization only — it does not produce counts suitable for DE testing.

TaskUse Raw Counts + DesignUse Corrected Values
Differential expressionYes (
~ batch + condition
)
No
Pathway analysis on DE resultsYes (from design-based DE)No
PCA, UMAP, heatmapsYes
ClusteringYes
Machine learning featuresYes
Cross-study meta-analysisYes (ComBat-Seq)

Confounding and Limitations

SituationDetectable?Solution
Batch partially correlated with conditionYes (PCA)Include batch in design; results are valid but power is reduced
Batch perfectly correlated with condition (all treated in batch 1, all control in batch 2)Yes (PCA)Unfixable — cannot separate batch from condition
Unknown batch variablesSometimes (PCA shows unexplained clustering)Use SVA to estimate surrogate variables
Over-correction removing biological signalHard to detectCompare results with and without correction; balanced designs prevent this

Related Skills

  • differential-expression/deseq2-basics - DE with batch in design
  • single-cell/clustering - Integration methods
  • expression-matrix/normalization - Data normalization and transformation