BioSkills bio-expression-matrix-normalization
Normalize and transform RNA-seq count matrices for differential expression, visualization, and clustering. Covers between-sample (TMM, RLE, upper quartile), within-sample (TPM, FPKM), variance-stabilizing (VST, rlog), and single-cell (scran) methods. Use when choosing or applying normalization to expression data.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/expression-matrix/normalization" ~/.claude/skills/gptomics-bioskills-bio-expression-matrix-normalization && rm -rf "$T"
expression-matrix/normalization/SKILL.mdVersion Compatibility
Reference examples tested with: DESeq2 1.42+, edgeR 4.0+, pandas 2.2+, numpy 1.26+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - R:
thenpackageVersion('<pkg>')
to verify parameters?function_name
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Expression Matrix Normalization
Why Normalization Matters: Composition Bias
Simple library-size normalization (dividing by total counts) fails because RNA populations differ between samples. If a treatment massively upregulates a few genes, those genes consume a disproportionate share of sequencing reads, making every other gene appear downregulated even if unchanged. This composition bias is not a theoretical edge case -- it occurs routinely in stress responses, viral infection, and knockout experiments.
All robust normalization methods (TMM, RLE/DESeq2, upper quartile) address composition bias by estimating scaling factors from the majority of non-differentially-expressed genes.
Normalization Decision Table
| Task | Method | Tool | Input |
|---|---|---|---|
| DE analysis (DESeq2) | RLE / median-of-ratios | handles internally | Raw integer counts |
| DE analysis (edgeR) | TMM | | Raw integer counts |
| DE analysis (limma-voom) | TMM + voom | then | Raw integer counts |
| PCA, heatmaps, clustering | VST or rlog | or | DESeqDataSet |
| PCA/clustering (edgeR/limma) | log-CPM | | DGEList |
| WGCNA | VST or log-CPM | | DESeqDataSet |
| GSVA / ssGSEA | log2(TPM+1) or VST | Precomputed | TPM or DESeqDataSet |
| Machine learning / biomarkers | VST | | DESeqDataSet |
| Reporting expression levels | TPM (within-sample) | Quantification tool output | N/A |
| Cross-sample comparison | DESeq2 normalized counts | | DESeqDataSet |
| Single-cell RNA-seq | scran deconvolution | | SingleCellExperiment |
Critical rule: DE tools (DESeq2, edgeR, limma-voom) normalize internally from raw counts. Never provide pre-normalized data to these tools.
Between-Sample Normalization
RLE / Median of Ratios (DESeq2)
Goal: Estimate per-sample size factors that correct for library size and composition bias.
Approach: Compute a pseudo-reference sample (geometric mean per gene across all samples), then take the median of per-gene ratios between each sample and the reference. The median is robust to the minority of DE genes.
library(DESeq2) dds <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~condition) dds <- estimateSizeFactors(dds) sizeFactors(dds) # Normalized counts for export/visualization norm_counts <- counts(dds, normalized=TRUE)
Size factor interpretation: a factor of 1.2 means the sample has 20% more sequencing depth (after composition adjustment) than the reference. Genes with zero counts in any sample are excluded from the geometric mean calculation.
Size factor alternatives for special cases:
| Scenario | Solution |
|---|---|
| High zero-inflation (single-cell) | |
| Very small libraries | |
| Known stable reference genes | |
| Majority of genes DE (e.g., prokaryotic stress) | Spike-in normalization or |
TMM (edgeR)
Goal: Compute normalization factors that account for composition bias using trimmed mean of M-values.
Approach: Select a reference sample, compute gene-wise log-ratios (M-values) and average expression (A-values) relative to the reference, trim extreme values, and compute a weighted mean scaling factor.
library(edgeR) y <- DGEList(counts=counts, group=metadata$condition) y <- calcNormFactors(y, method='TMM') # TMM trims 30% of M-values and 5% of A-values by default # Normalization factors are stored in y$samples$norm.factors # Effective library size = lib.size * norm.factors
# Python equivalent using edgepy or manual calculation import numpy as np import pandas as pd def tmm_norm_factors(counts): '''Simplified TMM normalization factors.''' lib_sizes = counts.sum(axis=0) ref_idx = np.argmin(np.abs(np.log(lib_sizes) - np.mean(np.log(lib_sizes)))) ref = counts.iloc[:, ref_idx] factors = np.ones(counts.shape[1]) for i in range(counts.shape[1]): if i == ref_idx: continue sample = counts.iloc[:, i] keep = (ref > 0) & (sample > 0) m = np.log2(sample[keep] / lib_sizes[i]) - np.log2(ref[keep] / lib_sizes[ref_idx]) a = 0.5 * (np.log2(sample[keep] / lib_sizes[i]) + np.log2(ref[keep] / lib_sizes[ref_idx])) m_lo, m_hi = np.percentile(m, [15, 85]) a_lo, a_hi = np.percentile(a, [2.5, 97.5]) keep2 = (m >= m_lo) & (m <= m_hi) & (a >= a_lo) & (a <= a_hi) factors[i] = 2 ** np.average(m[keep2]) return factors / np.exp(np.mean(np.log(factors)))
Upper Quartile
y <- calcNormFactors(y, method='upperquartile')
Simpler than TMM; uses the 75th percentile of non-zero counts. Less robust to asymmetric DE but adequate when composition bias is mild.
Within-Sample Normalization (TPM, FPKM)
Why FPKM/RPKM Is Problematic
FPKM normalizes by total mapped reads AND gene length, enabling comparison of Gene A to Gene B within one sample. However, the per-million scaling factor depends on the RNA composition of that sample. If one sample has a few massively expressed genes consuming 20% of reads, all other genes' FPKM values are deflated relative to another sample. Wagner et al. (2012, Theory Biosci) showed that average FPKM varies between samples even for the same genome.
TPM (Transcripts Per Million) partially fixes this by normalizing to a constant sum (1 million per sample), but it still represents relative abundance within a sample's sequenced population, not absolute abundance. Neither TPM nor FPKM should be used for differential expression testing.
| Unit | Within-sample comparable | Between-sample comparable | Use for DE |
|---|---|---|---|
| Raw counts | No | No | Yes (with DE tools) |
| CPM | No (no length correction) | Only with TMM/RLE factors | No |
| FPKM/RPKM | Yes | No | No |
| TPM | Yes | Partially (same composition caveat) | No |
| Normalized counts | No | Yes | Via DE tool |
| VST/rlog | No | Yes | No (use for viz) |
Computing TPM
Goal: Convert raw counts to TPM for within-sample gene expression comparison.
Approach: Divide each gene's count by its effective length (in kilobases), then normalize so the column sums to 1 million.
import pandas as pd import numpy as np def counts_to_tpm(counts, gene_lengths): '''Convert raw counts to TPM. gene_lengths in bp.''' rate = counts.div(gene_lengths / 1000, axis=0) # RPK tpm = rate.div(rate.sum(axis=0), axis=1) * 1e6 return tpm
counts_to_tpm <- function(counts, gene_lengths) { rate <- counts / (gene_lengths / 1000) # RPK tpm <- t(t(rate) / colSums(rate)) * 1e6 return(tpm) }
Variance-Stabilizing Transformations
VST (DESeq2)
Goal: Transform counts so that variance is approximately constant across the mean, suitable for PCA, heatmaps, clustering, and distance-based analyses.
Approach: Fit the dispersion-mean trend and apply a variance-stabilizing function. The
blind parameter controls whether the design is used during dispersion estimation.
library(DESeq2) # blind=TRUE: re-estimates dispersions ignoring design; use for unbiased QC vsd <- vst(dds, blind=TRUE) # blind=FALSE: uses dispersions from the full model; recommended for most analyses # Prevents the transformation from treating design-based differences as noise vsd <- vst(dds, blind=FALSE) vst_matrix <- assay(vsd)
The
blind parameter decision:
: for initial QC where the goal is unbiased sample assessment (does PCA structure match expected groups?)blind=TRUE
: for all downstream use (heatmaps, WGCNA, machine learning) because it preserves biologically meaningful varianceblind=FALSE
rlog (DESeq2)
Goal: Apply regularized log transformation that handles low counts and variable library sizes better than VST.
Approach: Fit a shrinkage model per gene per sample that pulls low-information estimates toward the mean.
# Better than VST when library sizes vary widely (>10-fold range) # Much slower -- O(genes x samples); impractical for >100 samples rld <- rlog(dds, blind=FALSE) rlog_matrix <- assay(rld)
| Criterion | VST | rlog |
|---|---|---|
| Speed | Fast | Slow (O(genes x samples)) |
| >100 samples | Recommended | Not recommended |
| Unequal library sizes | Adequate | Better |
| Low-count genes | May be noisy | Better shrinkage |
| Default choice | Yes | Only when library sizes vary >10-fold |
log-CPM (edgeR)
Goal: Produce log-transformed CPM values suitable for visualization, with dampened low-count noise.
Approach: Add a scaled prior count before log transformation to prevent log(0) and reduce noise from low-count genes.
library(edgeR) # Default prior.count=0.25; use prior.count >= 2 for visualization log_cpm <- cpm(y, log=TRUE, prior.count=2)
import numpy as np def log_cpm(counts, prior_count=2): '''Log2 CPM with prior count for noise dampening.''' lib_sizes = counts.sum(axis=0) # Scale prior proportional to library size (edgeR convention) cpm_vals = (counts + prior_count) / (lib_sizes + 2 * prior_count) * 1e6 return np.log2(cpm_vals)
The
prior.count parameter matters: the default (0.25) is adequate for statistical use, but for heatmaps and PCA, a larger prior (2-5) dampens noise from genes with very few counts, reducing visual artifacts from low-expression noise.
GC Content and Gene Length Bias
Standard normalization (TMM, RLE) does not correct for sample-specific GC content bias or gene length bias. These biases arise from library preparation (fragmentation, PCR) and create systematic differences in read coverage that correlate with gene properties.
When this matters: gene set enrichment analysis is particularly susceptible. A 2019 PLOS Biology study showed that uncorrected sample-specific gene-length bias causes recurrent false positives in GSEA.
EDASeq
Goal: Remove GC content and gene length bias using within-lane normalization.
Approach: Two rounds of full-quantile normalization -- first within GC-content bins per sample, then between samples.
library(EDASeq) # Requires gene-level GC content feature_data <- data.frame(gc=gene_gc_content, length=gene_lengths, row.names=rownames(counts)) data <- newSeqExpressionSet(counts=as.matrix(counts), featureData=feature_data, phenoData=coldata) # Within-lane GC normalization, then between-lane normalization data_norm <- withinLaneNormalization(data, 'gc', which='full') data_norm <- betweenLaneNormalization(data_norm, which='full') normalized_counts <- counts(data_norm)
cqn (Conditional Quantile Normalization)
library(cqn) # Combines GC and length correction with quantile normalization cqn_result <- cqn(counts, x=gene_gc_content, lengths=gene_lengths) # Use offset in edgeR y$offset <- cqn_result$glm.offset
Single-Cell Normalization
scran Deconvolution
Goal: Estimate per-cell size factors that handle the high zero-inflation of single-cell data.
Approach: Pool cells in overlapping windows (sorted by library size), compute pool-level size factors, then deconvolve back to individual cells. Rough pre-clustering prevents mixing distinct cell types with very different transcriptome sizes.
library(scran) library(scater) # Pre-cluster to avoid mixing cell types clusters <- quickCluster(sce) # Deconvolution-based size factors sce <- computeSumFactors(sce, clusters=clusters) # Apply normalization sce <- logNormCounts(sce)
import scanpy as sc # scanpy's normalize_total is simpler but less robust than scran # For rigorous analysis, use scran via rpy2 or anndata2ri sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata)
Standard bulk normalization (TMM, RLE) fails on single-cell data because many genes have zero counts due to dropout, violating the assumption that most genes are detected in most samples. scanpy's
normalize_total is a simple CPM-like normalization that works for exploratory analysis but does not handle composition bias as well as scran.
Pre-Filtering Before Normalization
Goal: Remove genes with insufficient expression to reduce noise and multiple testing burden.
Approach: Apply expression thresholds that account for experimental design (group sizes).
edgeR: filterByExpr (Design-Aware)
library(edgeR) # Automatically determines thresholds based on design and library sizes # min.count=10, min.total.count=15 by default # Uses smallest group size from design matrix keep <- filterByExpr(y, design=model.matrix(~condition, data=metadata)) y <- y[keep, , keep.lib.sizes=FALSE]
filterByExpr is design-aware: it requires a gene to have CPM above a threshold in at least n samples, where n is the smallest group size. This is more principled than arbitrary thresholds.
DESeq2: Independent Filtering
DESeq2 performs automatic independent filtering at the
results() step, optimizing a mean-expression cutoff to maximize adjusted p-value rejections. Manual pre-filtering (rowSums(counts(dds)) >= 10) is purely for computational speed and memory -- it does not affect statistical results because independent filtering handles it downstream.
# Speed-only pre-filter (does not affect statistical results) keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep, ] # The alpha in results() should match the planned FDR cutoff # because independent filtering is optimized at this threshold res <- results(dds, alpha=0.05)
Key distinction: edgeR requires explicit filtering (via
filterByExpr) because its quasi-likelihood pipeline does not have DESeq2's independent filtering. Forgetting filterByExpr in edgeR inflates the multiple testing burden.
Common Mistakes
| Mistake | Why it fails | Correct approach |
|---|---|---|
| Providing normalized data to DESeq2/edgeR | DE tools model raw count distributions; normalized input violates assumptions | Always use raw integer counts |
| Using FPKM for cross-sample DE | Composition bias makes FPKM incomparable between samples | Use TMM/RLE-normalized counts via DE tools |
| Using VST/rlog values for DE testing | Transformed values lose the count distribution properties DE tools need | VST/rlog for visualization only |
| Applying log2(counts+1) globally | Pseudocount of 1 is arbitrary; does not stabilize variance properly | Use VST, rlog, or edgeR's log-CPM with prior.count |
| Normalizing single-cell data with TMM | High dropout violates TMM assumptions | Use scran deconvolution or scanpy normalize_total |
Forgetting in edgeR | Inflated multiple testing burden | Always filter before |
Using for downstream VST | Over-shrinks transformed values by ignoring known design | Use for heatmaps, WGCNA, ML |
Related Skills
- expression-matrix/counts-ingest - Load count data before normalization
- differential-expression/deseq2-basics - DESeq2 handles normalization internally
- differential-expression/edger-basics - edgeR TMM normalization
- single-cell/preprocessing - Single-cell normalization workflows
- rna-quantification/count-matrix-qc - QC before normalization