SciAgent-Skills deseq2-differential-expression
Differential expression analysis for bulk RNA-seq using R/Bioconductor DESeq2. Negative binomial GLM with empirical Bayes shrinkage, Wald and LRT tests, multi-factor designs, interaction terms, Salmon tximeta import, apeglm LFC shrinkage, MA/volcano/heatmap visualization. The R gold standard for DE analysis with native Bioconductor integration. Use pydeseq2-differential-expression for Python-based pipelines; use edgeR for TMM normalization.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/deseq2-differential-expression" ~/.claude/skills/jaechang-hits-sciagent-skills-deseq2-differential-expression && rm -rf "$T"
skills/genomics-bioinformatics/deseq2-differential-expression/SKILL.mdDESeq2 Differential Expression Analysis (R/Bioconductor)
Overview
DESeq2 is the Bioconductor R package for differential gene expression analysis from bulk RNA-seq count data. It fits a negative binomial generalized linear model per gene, estimates dispersion parameters using empirical Bayes shrinkage across genes, and tests differential expression using Wald tests (two-group) or likelihood ratio tests (complex designs). DESeq2 is the R gold standard for RNA-seq DE analysis, with native Bioconductor integration for seamless import from Salmon (tximeta/tximport), featureCounts, or HTSeq.
When to Use
- Identifying differentially expressed genes between two experimental conditions (treated vs. control, disease vs. healthy) from bulk RNA-seq count data
- Analyzing multi-factor designs that account for batch effects or covariates (e.g.,
)~ batch + condition - Testing complex hypotheses with interaction terms (e.g., time × treatment) or reduced models using likelihood ratio tests (LRT)
- Importing Salmon pseudoalignment output via tximeta or tximport for transcript-level uncertainty propagation
- Performing LFC shrinkage with apeglm for ranked gene lists, volcano plots, and downstream pathway analysis
- Conducting time-series experiments or any design with more than two levels requiring model comparison
- Working in an R/Bioconductor ecosystem where integration with SummarizedExperiment, clusterProfiler, or EnhancedVolcano is needed
- Use pydeseq2-differential-expression instead for Python-based pipelines with the same statistical model
- Use edgeR for negative binomial DE with TMM normalization, quasi-likelihood F-tests, or TREAT testing
- Use gseapy-gene-enrichment after DE to interpret results at the pathway level
Prerequisites
- R packages:
(Bioconductor),DESeq2
ortximeta
(Salmon import),tximport
(LFC shrinkage),apeglm
,pheatmap
,ggplot2EnhancedVolcano - Data requirements: Raw (unnormalized) integer count matrix — gene rows × sample columns — plus a sample metadata data frame with matching column names. If using Salmon: per-sample
files and a transcript-to-gene mappingquant.sf - Environment: R ≥ 4.2, Bioconductor ≥ 3.16
# Install Bioconductor packages if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(c("DESeq2", "tximeta", "tximport", "apeglm", "EnhancedVolcano")) install.packages(c("pheatmap", "ggplot2", "dplyr"))
Quick Start
Complete two-group comparison from a count matrix in under 20 lines.
library(DESeq2) # Load count matrix (genes x samples) and metadata counts <- as.matrix(read.csv("counts.csv", row.names = 1)) coldata <- read.csv("metadata.csv", row.names = 1) coldata$condition <- factor(coldata$condition) # Build DESeqDataSet, run full pipeline dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition) dds <- dds[rowSums(counts(dds)) >= 10, ] # pre-filter dds <- DESeq(dds) # Extract results (treated vs. control) res <- results(dds, contrast = c("condition", "treated", "control"), alpha = 0.05) summary(res) # LFC shrinkage for visualization res_shrunk <- lfcShrink(dds, contrast = c("condition", "treated", "control"), type = "apeglm", coef = 2)
Workflow
Step 1: Prepare Count Matrix and Sample Metadata
Build a
DESeqDataSet from a gene × sample count matrix and a colData data frame. Column names of the count matrix must match row names of colData.
library(DESeq2) library(dplyr) # Load raw count matrix (genes as rows, samples as columns) counts <- as.matrix(read.csv("featureCounts_matrix.csv", row.names = 1)) # Or from featureCounts tab-delimited output — skip first comment line # fc <- read.table("featurecounts_output.txt", header = TRUE, skip = 1, row.names = 1) # counts <- as.matrix(fc[, 7:ncol(fc)]) # columns 7+ are sample counts # Load sample metadata coldata <- data.frame( condition = factor(c("control", "control", "control", "treated", "treated", "treated")), batch = factor(c("A", "A", "B", "A", "B", "B")), row.names = colnames(counts) ) # Build DESeqDataSet dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition) cat("Samples:", ncol(dds), "\n") cat("Genes:", nrow(dds), "\n") cat("Condition levels:", levels(coldata$condition), "\n")
Step 2: Import from Salmon via tximeta
When reads were quantified with Salmon, use
tximeta to import transcript-level estimates with proper offset correction that accounts for transcript length and GC bias.
library(tximeta) # Build a coldata with a "files" column pointing to quant.sf files quant_dirs <- file.path("salmon_output", coldata$sample_id, "quant.sf") coldata$files <- quant_dirs coldata$names <- coldata$sample_id # Import with tximeta — automatically fetches transcript metadata from Ensembl se <- tximeta(coldata) # SummarizedExperiment with transcript-level data # Summarize to gene level (requires Ensembl or custom txdb) gse <- summarizeToGene(se) # gene-level SummarizedExperiment # Build DESeqDataSet from the SummarizedExperiment dds <- DESeqDataSet(gse, design = ~ condition) cat("Gene-level SE dimensions:", dim(gse), "\n") cat("Assay names:", assayNames(gse), "\n") # Assay "counts" = estimated counts; "abundance" = TPM; "length" = eff. lengths
Step 3: Pre-Filtering and Quality Control
Remove genes with very low counts to improve statistical power and reduce the multiple testing burden. Explore sample quality with PCA on variance-stabilized counts.
library(ggplot2) # Pre-filter: keep genes with at least 10 reads total keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep, ] cat("Genes after pre-filtering:", nrow(dds), "\n") # Variance-stabilizing transformation for QC visualization vsd <- vst(dds, blind = TRUE) # blind = TRUE for exploratory QC # PCA plot pca_data <- plotPCA(vsd, intgroup = c("condition", "batch"), returnData = TRUE) percent_var <- round(100 * attr(pca_data, "percentVar")) ggplot(pca_data, aes(PC1, PC2, color = condition, shape = batch)) + geom_point(size = 3) + xlab(paste0("PC1: ", percent_var[1], "% variance")) + ylab(paste0("PC2: ", percent_var[2], "% variance")) + ggtitle("PCA — Variance-Stabilized Counts") + theme_bw() ggsave("pca_plot.pdf", width = 6, height = 5) cat("Saved pca_plot.pdf\n")
Step 4: Run DESeq() — Normalization, Dispersion, and Model Fitting
DESeq() runs three sequential steps: (1) median-of-ratios size factor estimation, (2) gene-wise and shrunken dispersion estimation, (3) negative binomial GLM fitting and Wald statistics.
# Run full DESeq2 pipeline dds <- DESeq(dds) # Inspect size factors (library normalization) cat("Size factors:\n") print(sizeFactors(dds)) # Expected range: 0.3–3.0; outliers indicate library quality issues # Dispersion plot — should show decreasing dispersion trend with mean plotDispEsts(dds, main = "Dispersion Estimates") # Inspect fitted model coefficients resultsNames(dds) # [1] "Intercept" "condition_treated_vs_control"
Step 5: Extract Results and Apply FDR Correction
Use
results() to extract Wald test statistics. Specify the contrast explicitly for clarity. Independent filtering maximizes the number of detectable genes.
# Extract results for the contrast of interest res <- results(dds, contrast = c("condition", "treated", "control"), alpha = 0.05, # FDR threshold for independent filtering lfcThreshold = 0, # Test H0: |LFC| = 0 (Wald test) independentFilter = TRUE) # Maximize power via adaptive filtering # Summary: how many genes are up/down/NA summary(res) # out of N with nonzero total read count: # adjusted p-value < 0.05: ... # LFC > 0 (up): ... # LFC < 0 (down): ... # outliers [Cook's distance]: ... # low counts [below independent filtering threshold]: ... # Convert to data frame and sort by adjusted p-value res_df <- as.data.frame(res) res_df <- res_df[order(res_df$padj, na.last = TRUE), ] cat("Significant genes (padj < 0.05):", sum(res_df$padj < 0.05, na.rm = TRUE), "\n") cat("Upregulated:", sum(res_df$padj < 0.05 & res_df$log2FoldChange > 0, na.rm = TRUE), "\n") cat("Downregulated:", sum(res_df$padj < 0.05 & res_df$log2FoldChange < 0, na.rm = TRUE), "\n") write.csv(res_df, "deseq2_all_results.csv")
Step 6: LFC Shrinkage with apeglm
Shrink log2 fold changes using the adaptive t prior (apeglm). Use shrunk LFCs for MA plots, volcano plots, and gene ranking — NOT for significance calling (padj comes from the unshrunk Wald test).
library(apeglm) # apeglm shrinkage — specify coefficient name from resultsNames(dds) res_shrunk <- lfcShrink(dds, coef = "condition_treated_vs_control", type = "apeglm", res = res) # pass original results for speed # Compare shrunk vs. unshrunk LFC cat("Max |LFC| unshrunk:", max(abs(res$log2FoldChange), na.rm = TRUE), "\n") cat("Max |LFC| shrunk: ", max(abs(res_shrunk$log2FoldChange), na.rm = TRUE), "\n") # Export shrunk results res_shrunk_df <- as.data.frame(res_shrunk) res_shrunk_df <- res_shrunk_df[order(res_shrunk_df$padj, na.last = TRUE), ] write.csv(res_shrunk_df, "deseq2_shrunk_results.csv") # Filter significant with effect size threshold sig_genes <- res_shrunk_df[ !is.na(res_shrunk_df$padj) & res_shrunk_df$padj < 0.05 & abs(res_shrunk_df$log2FoldChange) > 1, ] cat("Significant (padj<0.05, |LFC|>1):", nrow(sig_genes), "\n") write.csv(sig_genes, "deseq2_significant.csv")
Step 7: Visualize — MA Plot, Volcano Plot, and Heatmap
library(EnhancedVolcano) library(pheatmap) # --- MA Plot (base DESeq2) --- plotMA(res_shrunk, ylim = c(-5, 5), main = "MA Plot (apeglm-shrunk LFC)", colSig = "firebrick", colNonSig = "grey60") # Blue points are significantly DE genes; y-axis is shrunk log2FC # --- Volcano Plot (EnhancedVolcano) --- EnhancedVolcano(res_shrunk_df, lab = rownames(res_shrunk_df), x = "log2FoldChange", y = "padj", pCutoff = 0.05, FCcutoff = 1.0, xlim = c(-6, 6), title = "Treated vs. Control", subtitle = "apeglm-shrunk LFC | BH-adjusted p-value", legendLabels = c("NS", "LFC", "padj", "padj & LFC")) ggsave("volcano_plot.pdf", width = 8, height = 7) # --- Heatmap of top 50 DE genes --- top50 <- head(rownames(sig_genes[order(sig_genes$padj), ]), 50) mat <- assay(vsd)[top50, ] # variance-stabilized expression mat <- mat - rowMeans(mat) # center by gene mean annotation_col <- data.frame( Condition = coldata$condition, Batch = coldata$batch, row.names = colnames(mat) ) pheatmap(mat, annotation_col = annotation_col, cluster_rows = TRUE, cluster_cols = TRUE, show_rownames = TRUE, fontsize_row = 6, filename = "heatmap_top50.pdf", width = 8, height = 10) cat("Saved volcano_plot.pdf and heatmap_top50.pdf\n")
Step 8: Multi-Factor Design and Interaction Terms
For designs with batch correction, paired samples, or interaction effects between two variables.
# --- Batch-corrected two-group comparison --- dds_batch <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ batch + condition) dds_batch <- dds_batch[rowSums(counts(dds_batch)) >= 10, ] dds_batch <- DESeq(dds_batch) res_batch <- results(dds_batch, contrast = c("condition", "treated", "control"), alpha = 0.05) cat("Significant (batch-corrected):", sum(res_batch$padj < 0.05, na.rm = TRUE), "\n") # --- Interaction design: test whether treatment effect differs between genotypes --- # H0: the effect of treatment is the same in WT and KO coldata$genotype <- factor(coldata$genotype) # "WT" or "KO" dds_int <- DESeqDataSetFromMatrix( countData = counts, colData = coldata, design = ~ genotype + condition + genotype:condition ) dds_int <- DESeq(dds_int) resultsNames(dds_int) # Interaction coefficient: "genotypeKO.conditiontreated" # Extract interaction term — genes where treatment effect differs by genotype res_int <- results(dds_int, name = "genotypeKO.conditiontreated", alpha = 0.05) cat("Genes with significant interaction:", sum(res_int$padj < 0.05, na.rm = TRUE), "\n") # --- Likelihood Ratio Test for complex designs (e.g., time series) --- # LRT compares full model vs. reduced model dds_lrt <- DESeq(dds_int, test = "LRT", reduced = ~ genotype + condition) res_lrt <- results(dds_lrt) cat("Genes with genotype:condition interaction (LRT):", sum(res_lrt$padj < 0.05, na.rm = TRUE), "\n")
Key Parameters
| Parameter | Function | Default | Range / Options | Effect |
|---|---|---|---|---|
| | (required) | or | Specifies the GLM formula; put batch/covariates before variable of interest |
| | | or coefficient name | Defines the comparison; always specify explicitly for clarity |
| | | – | FDR threshold used for independent filtering; does not change padj values |
| | | – | Test H0: |
| | | / | Adaptive mean-count filtering to maximize the number of rejectible hypotheses |
| | | , , | Shrinkage prior; apeglm is fastest and best calibrated |
| | (required) | coefficient name from | Which model coefficient to shrink; must match output |
| | | , | Wald: two-group comparisons; LRT: model comparison for complex designs |
| | | formula without the term to test | Required for LRT; specifies the null model |
| / | | / | for QC/exploration; for downstream analysis using dispersion info |
Key Concepts
Negative Binomial Model and Dispersion
DESeq2 models each gene's count as Negative Binomial(mean = µ, dispersion = α), where mean µ depends on the experimental design through a log-linear model. The key innovation is dispersion shrinkage: gene-wise dispersion estimates are shrunk toward a fitted trend across all genes using an empirical Bayes approach, borrowing information across genes to stabilize estimates from small sample sizes.
Size Factor Normalization
DESeq2 uses the median-of-ratios method: for each sample, compute the ratio of each gene's count to the geometric mean of that gene across all samples, then take the median ratio across all genes as the size factor. This is robust to differentially expressed genes and does not require any assumptions about the fraction of DE genes.
Wald Test vs. Likelihood Ratio Test
- Wald test (default): tests a single coefficient
; fast and appropriate for two-group comparisons and specific contrasts in multi-factor designsβ = 0 - LRT: compares the full model to a
model by the likelihood ratio; appropriate for testing whether any level of a multi-level factor (e.g., 5 time points) has an effect, or for testing interaction terms as a groupreduced
Cook's Distance Outlier Detection
DESeq2 computes Cook's distance per gene per sample to flag potential expression outliers. Genes where any sample has a Cook's distance above a threshold are automatically replaced with
NA in the results. With refit = TRUE (default), outlier samples are excluded and the model is refit for that gene.
Common Recipes
Recipe: Multiple Contrasts from One Fitted Model
Efficient when comparing multiple treatment groups — fit
DESeq() once, extract multiple contrasts.
# Fit once dds_multi <- DESeqDataSetFromMatrix(counts, coldata, design = ~ condition) dds_multi <- dds_multi[rowSums(counts(dds_multi)) >= 10, ] dds_multi <- DESeq(dds_multi) # Extract contrasts — all vs. "control" contrasts_list <- list( A_vs_ctrl = c("condition", "treatment_A", "control"), B_vs_ctrl = c("condition", "treatment_B", "control"), C_vs_ctrl = c("condition", "treatment_C", "control") ) results_list <- lapply(names(contrasts_list), function(nm) { res <- results(dds_multi, contrast = contrasts_list[[nm]], alpha = 0.05) res_shrunk <- lfcShrink(dds_multi, contrast = contrasts_list[[nm]], type = "apeglm", res = res) write.csv(as.data.frame(res_shrunk), paste0("results_", nm, ".csv")) cat(nm, "— significant:", sum(res$padj < 0.05, na.rm = TRUE), "\n") as.data.frame(res_shrunk) }) names(results_list) <- names(contrasts_list)
Recipe: Paired Sample Design
Eliminate within-subject variability by including subject as a blocking factor.
# Paired design: each subject has pre- and post-treatment samples coldata$subject <- factor(c("S1","S1","S2","S2","S3","S3")) coldata$timepoint <- factor(c("pre","post","pre","post","pre","post")) dds_paired <- DESeqDataSetFromMatrix(counts, coldata, design = ~ subject + timepoint) dds_paired <- dds_paired[rowSums(counts(dds_paired)) >= 10, ] dds_paired <- DESeq(dds_paired) res_paired <- results(dds_paired, contrast = c("timepoint", "post", "pre"), alpha = 0.05) cat("Significant (paired):", sum(res_paired$padj < 0.05, na.rm = TRUE), "\n")
Recipe: Rank Genes for GSEA Using Shrunk LFC × -log10(pvalue)
Create a pre-ranked gene list for input to GSEA or gseapy's
prerank function.
# Build ranking metric: signed -log10(pvalue) with LFC direction res_rank <- as.data.frame(res_shrunk) res_rank <- res_rank[!is.na(res_rank$pvalue) & !is.na(res_rank$log2FoldChange), ] res_rank$rank_metric <- sign(res_rank$log2FoldChange) * (-log10(res_rank$pvalue + 1e-300)) res_rank <- res_rank[order(res_rank$rank_metric, decreasing = TRUE), ] # Export as two-column TSv for fgsea or gseapy prerank write.table(data.frame(gene = rownames(res_rank), score = res_rank$rank_metric), "gsea_ranked_list.rnk", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE) cat("Ranked gene list saved:", nrow(res_rank), "genes\n")
Recipe: Save and Reload Fitted DESeq2 Object
Avoid re-running the expensive
DESeq() step in subsequent sessions.
# Save fitted DDS object saveRDS(dds, "dds_fitted.rds") cat("Saved fitted DESeqDataSet to dds_fitted.rds\n") # Reload in a new session dds_loaded <- readRDS("dds_fitted.rds") # Continue with results(), lfcShrink(), etc. res_reloaded <- results(dds_loaded, contrast = c("condition", "treated", "control"))
Expected Outputs
| File | Description |
|---|---|
| Full results table: baseMean, log2FoldChange, lfcSE, stat, pvalue, padj for all genes |
| Results with apeglm-shrunk LFC; use for ranking and visualization |
| Filtered results: padj < 0.05 and |
| PCA of variance-stabilized counts; used to check batch structure and outliers |
| Gene-wise vs. fitted dispersions; should show tight cloud around trend |
| Volcano plot with significance and LFC thresholds labeled |
| Hierarchically clustered heatmap of top 50 DE genes |
| Pre-ranked gene list for pathway enrichment (fgsea/gseapy) |
| Serialized DESeqDataSet for checkpoint/resume |
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| A design factor has only one level | Check ; ensure all factor levels are present |
| Confounded design (e.g., batch perfectly correlated with condition) | Run ; drop the confounded variable or collect more samples |
All or most | Genes flagged by independent filtering (low mean count) or Cook's outliers | Check for count of filtered genes; relax or verify data quality |
| No significant genes despite strong biology | Under-powered experiment or high within-group variability | Verify n ≥ 3 per group; check PCA for outliers; inspect p-value histogram (should show spike near 0) |
: coefficient not found | name does not match | Run and copy the exact coefficient string into |
fails with convergence warning | Extreme LFC estimates (sparse data or complete separation) | Switch to which is more robust to these cases |
| Very large size factors (> 5) | Extremely different library sizes, or normalized counts accidentally used | Check ; ensure input is raw integer counts from aligner/counter |
| Dispersion plot shows outlier cloud far from trend | Noisy or failed libraries; incorrect metadata grouping | Examine PCA; remove outlier samples; check that formula matches biological groups |
| Volcano plot: no labeled points | defaults label too few genes | Adjust and ; or set |
References
- DESeq2 Bioconductor page — Package home, vignette, and reference manual
- DESeq2 vignette — Comprehensive worked example covering all design types
- Love et al. (2014) Genome Biology — Original DESeq2 paper: moderated estimation of fold change and dispersion for RNA-seq data
- Zhu et al. (2019) Bioinformatics — apeglm LFC shrinkage for DESeq2
- tximeta Bioconductor page — Salmon import with transcript-level metadata
- EnhancedVolcano Bioconductor page — Publication-quality volcano plots for DE results