SciAgent-Skills deseq2-differential-expression

Differential expression analysis for bulk RNA-seq using R/Bioconductor DESeq2. Negative binomial GLM with empirical Bayes shrinkage, Wald and LRT tests, multi-factor designs, interaction terms, Salmon tximeta import, apeglm LFC shrinkage, MA/volcano/heatmap visualization. The R gold standard for DE analysis with native Bioconductor integration. Use pydeseq2-differential-expression for Python-based pipelines; use edgeR for TMM normalization.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/deseq2-differential-expression" ~/.claude/skills/jaechang-hits-sciagent-skills-deseq2-differential-expression && rm -rf "$T"
manifest: skills/genomics-bioinformatics/deseq2-differential-expression/SKILL.md
source content

DESeq2 Differential Expression Analysis (R/Bioconductor)

Overview

DESeq2 is the Bioconductor R package for differential gene expression analysis from bulk RNA-seq count data. It fits a negative binomial generalized linear model per gene, estimates dispersion parameters using empirical Bayes shrinkage across genes, and tests differential expression using Wald tests (two-group) or likelihood ratio tests (complex designs). DESeq2 is the R gold standard for RNA-seq DE analysis, with native Bioconductor integration for seamless import from Salmon (tximeta/tximport), featureCounts, or HTSeq.

When to Use

  • Identifying differentially expressed genes between two experimental conditions (treated vs. control, disease vs. healthy) from bulk RNA-seq count data
  • Analyzing multi-factor designs that account for batch effects or covariates (e.g.,
    ~ batch + condition
    )
  • Testing complex hypotheses with interaction terms (e.g., time × treatment) or reduced models using likelihood ratio tests (LRT)
  • Importing Salmon pseudoalignment output via tximeta or tximport for transcript-level uncertainty propagation
  • Performing LFC shrinkage with apeglm for ranked gene lists, volcano plots, and downstream pathway analysis
  • Conducting time-series experiments or any design with more than two levels requiring model comparison
  • Working in an R/Bioconductor ecosystem where integration with SummarizedExperiment, clusterProfiler, or EnhancedVolcano is needed
  • Use pydeseq2-differential-expression instead for Python-based pipelines with the same statistical model
  • Use edgeR for negative binomial DE with TMM normalization, quasi-likelihood F-tests, or TREAT testing
  • Use gseapy-gene-enrichment after DE to interpret results at the pathway level

Prerequisites

  • R packages:
    DESeq2
    (Bioconductor),
    tximeta
    or
    tximport
    (Salmon import),
    apeglm
    (LFC shrinkage),
    pheatmap
    ,
    ggplot2
    ,
    EnhancedVolcano
  • Data requirements: Raw (unnormalized) integer count matrix — gene rows × sample columns — plus a sample metadata data frame with matching column names. If using Salmon: per-sample
    quant.sf
    files and a transcript-to-gene mapping
  • Environment: R ≥ 4.2, Bioconductor ≥ 3.16
# Install Bioconductor packages
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("DESeq2", "tximeta", "tximport", "apeglm",
                        "EnhancedVolcano"))
install.packages(c("pheatmap", "ggplot2", "dplyr"))

Quick Start

Complete two-group comparison from a count matrix in under 20 lines.

library(DESeq2)

# Load count matrix (genes x samples) and metadata
counts <- as.matrix(read.csv("counts.csv", row.names = 1))
coldata <- read.csv("metadata.csv", row.names = 1)
coldata$condition <- factor(coldata$condition)

# Build DESeqDataSet, run full pipeline
dds <- DESeqDataSetFromMatrix(countData = counts,
                               colData   = coldata,
                               design    = ~ condition)
dds <- dds[rowSums(counts(dds)) >= 10, ]   # pre-filter
dds <- DESeq(dds)

# Extract results (treated vs. control)
res <- results(dds, contrast = c("condition", "treated", "control"),
               alpha = 0.05)
summary(res)
# LFC shrinkage for visualization
res_shrunk <- lfcShrink(dds, contrast = c("condition", "treated", "control"),
                         type = "apeglm", coef = 2)

Workflow

Step 1: Prepare Count Matrix and Sample Metadata

Build a

DESeqDataSet
from a gene × sample count matrix and a colData data frame. Column names of the count matrix must match row names of colData.

library(DESeq2)
library(dplyr)

# Load raw count matrix (genes as rows, samples as columns)
counts <- as.matrix(read.csv("featureCounts_matrix.csv", row.names = 1))
# Or from featureCounts tab-delimited output — skip first comment line
# fc <- read.table("featurecounts_output.txt", header = TRUE, skip = 1, row.names = 1)
# counts <- as.matrix(fc[, 7:ncol(fc)])  # columns 7+ are sample counts

# Load sample metadata
coldata <- data.frame(
    condition = factor(c("control", "control", "control",
                          "treated", "treated", "treated")),
    batch     = factor(c("A", "A", "B", "A", "B", "B")),
    row.names = colnames(counts)
)

# Build DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
                               colData   = coldata,
                               design    = ~ condition)

cat("Samples:", ncol(dds), "\n")
cat("Genes:", nrow(dds), "\n")
cat("Condition levels:", levels(coldata$condition), "\n")

Step 2: Import from Salmon via tximeta

When reads were quantified with Salmon, use

tximeta
to import transcript-level estimates with proper offset correction that accounts for transcript length and GC bias.

library(tximeta)

# Build a coldata with a "files" column pointing to quant.sf files
quant_dirs <- file.path("salmon_output", coldata$sample_id, "quant.sf")
coldata$files <- quant_dirs
coldata$names <- coldata$sample_id

# Import with tximeta — automatically fetches transcript metadata from Ensembl
se <- tximeta(coldata)           # SummarizedExperiment with transcript-level data

# Summarize to gene level (requires Ensembl or custom txdb)
gse <- summarizeToGene(se)       # gene-level SummarizedExperiment

# Build DESeqDataSet from the SummarizedExperiment
dds <- DESeqDataSet(gse, design = ~ condition)

cat("Gene-level SE dimensions:", dim(gse), "\n")
cat("Assay names:", assayNames(gse), "\n")
# Assay "counts" = estimated counts; "abundance" = TPM; "length" = eff. lengths

Step 3: Pre-Filtering and Quality Control

Remove genes with very low counts to improve statistical power and reduce the multiple testing burden. Explore sample quality with PCA on variance-stabilized counts.

library(ggplot2)

# Pre-filter: keep genes with at least 10 reads total
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
cat("Genes after pre-filtering:", nrow(dds), "\n")

# Variance-stabilizing transformation for QC visualization
vsd <- vst(dds, blind = TRUE)    # blind = TRUE for exploratory QC

# PCA plot
pca_data <- plotPCA(vsd, intgroup = c("condition", "batch"), returnData = TRUE)
percent_var <- round(100 * attr(pca_data, "percentVar"))

ggplot(pca_data, aes(PC1, PC2, color = condition, shape = batch)) +
    geom_point(size = 3) +
    xlab(paste0("PC1: ", percent_var[1], "% variance")) +
    ylab(paste0("PC2: ", percent_var[2], "% variance")) +
    ggtitle("PCA — Variance-Stabilized Counts") +
    theme_bw()

ggsave("pca_plot.pdf", width = 6, height = 5)
cat("Saved pca_plot.pdf\n")

Step 4: Run DESeq() — Normalization, Dispersion, and Model Fitting

DESeq()
runs three sequential steps: (1) median-of-ratios size factor estimation, (2) gene-wise and shrunken dispersion estimation, (3) negative binomial GLM fitting and Wald statistics.

# Run full DESeq2 pipeline
dds <- DESeq(dds)

# Inspect size factors (library normalization)
cat("Size factors:\n")
print(sizeFactors(dds))
# Expected range: 0.3–3.0; outliers indicate library quality issues

# Dispersion plot — should show decreasing dispersion trend with mean
plotDispEsts(dds, main = "Dispersion Estimates")

# Inspect fitted model coefficients
resultsNames(dds)
# [1] "Intercept" "condition_treated_vs_control"

Step 5: Extract Results and Apply FDR Correction

Use

results()
to extract Wald test statistics. Specify the contrast explicitly for clarity. Independent filtering maximizes the number of detectable genes.

# Extract results for the contrast of interest
res <- results(dds,
               contrast          = c("condition", "treated", "control"),
               alpha             = 0.05,   # FDR threshold for independent filtering
               lfcThreshold      = 0,      # Test H0: |LFC| = 0 (Wald test)
               independentFilter = TRUE)   # Maximize power via adaptive filtering

# Summary: how many genes are up/down/NA
summary(res)
# out of N with nonzero total read count:
# adjusted p-value < 0.05: ...
# LFC > 0 (up): ...
# LFC < 0 (down): ...
# outliers [Cook's distance]: ...
# low counts [below independent filtering threshold]: ...

# Convert to data frame and sort by adjusted p-value
res_df <- as.data.frame(res)
res_df <- res_df[order(res_df$padj, na.last = TRUE), ]

cat("Significant genes (padj < 0.05):", sum(res_df$padj < 0.05, na.rm = TRUE), "\n")
cat("Upregulated:", sum(res_df$padj < 0.05 & res_df$log2FoldChange > 0, na.rm = TRUE), "\n")
cat("Downregulated:", sum(res_df$padj < 0.05 & res_df$log2FoldChange < 0, na.rm = TRUE), "\n")

write.csv(res_df, "deseq2_all_results.csv")

Step 6: LFC Shrinkage with apeglm

Shrink log2 fold changes using the adaptive t prior (apeglm). Use shrunk LFCs for MA plots, volcano plots, and gene ranking — NOT for significance calling (padj comes from the unshrunk Wald test).

library(apeglm)

# apeglm shrinkage — specify coefficient name from resultsNames(dds)
res_shrunk <- lfcShrink(dds,
                         coef = "condition_treated_vs_control",
                         type = "apeglm",
                         res  = res)    # pass original results for speed

# Compare shrunk vs. unshrunk LFC
cat("Max |LFC| unshrunk:", max(abs(res$log2FoldChange), na.rm = TRUE), "\n")
cat("Max |LFC| shrunk:  ", max(abs(res_shrunk$log2FoldChange), na.rm = TRUE), "\n")

# Export shrunk results
res_shrunk_df <- as.data.frame(res_shrunk)
res_shrunk_df <- res_shrunk_df[order(res_shrunk_df$padj, na.last = TRUE), ]
write.csv(res_shrunk_df, "deseq2_shrunk_results.csv")

# Filter significant with effect size threshold
sig_genes <- res_shrunk_df[
    !is.na(res_shrunk_df$padj) &
    res_shrunk_df$padj < 0.05 &
    abs(res_shrunk_df$log2FoldChange) > 1, ]
cat("Significant (padj<0.05, |LFC|>1):", nrow(sig_genes), "\n")
write.csv(sig_genes, "deseq2_significant.csv")

Step 7: Visualize — MA Plot, Volcano Plot, and Heatmap

library(EnhancedVolcano)
library(pheatmap)

# --- MA Plot (base DESeq2) ---
plotMA(res_shrunk, ylim = c(-5, 5),
       main = "MA Plot (apeglm-shrunk LFC)",
       colSig = "firebrick", colNonSig = "grey60")
# Blue points are significantly DE genes; y-axis is shrunk log2FC

# --- Volcano Plot (EnhancedVolcano) ---
EnhancedVolcano(res_shrunk_df,
    lab        = rownames(res_shrunk_df),
    x          = "log2FoldChange",
    y          = "padj",
    pCutoff    = 0.05,
    FCcutoff   = 1.0,
    xlim       = c(-6, 6),
    title      = "Treated vs. Control",
    subtitle   = "apeglm-shrunk LFC | BH-adjusted p-value",
    legendLabels = c("NS", "LFC", "padj", "padj & LFC"))

ggsave("volcano_plot.pdf", width = 8, height = 7)

# --- Heatmap of top 50 DE genes ---
top50 <- head(rownames(sig_genes[order(sig_genes$padj), ]), 50)
mat   <- assay(vsd)[top50, ]            # variance-stabilized expression
mat   <- mat - rowMeans(mat)            # center by gene mean

annotation_col <- data.frame(
    Condition = coldata$condition,
    Batch     = coldata$batch,
    row.names = colnames(mat)
)

pheatmap(mat,
         annotation_col  = annotation_col,
         cluster_rows    = TRUE,
         cluster_cols    = TRUE,
         show_rownames   = TRUE,
         fontsize_row    = 6,
         filename        = "heatmap_top50.pdf",
         width           = 8,
         height          = 10)
cat("Saved volcano_plot.pdf and heatmap_top50.pdf\n")

Step 8: Multi-Factor Design and Interaction Terms

For designs with batch correction, paired samples, or interaction effects between two variables.

# --- Batch-corrected two-group comparison ---
dds_batch <- DESeqDataSetFromMatrix(countData = counts,
                                     colData   = coldata,
                                     design    = ~ batch + condition)
dds_batch <- dds_batch[rowSums(counts(dds_batch)) >= 10, ]
dds_batch <- DESeq(dds_batch)

res_batch <- results(dds_batch,
                      contrast = c("condition", "treated", "control"),
                      alpha    = 0.05)
cat("Significant (batch-corrected):", sum(res_batch$padj < 0.05, na.rm = TRUE), "\n")

# --- Interaction design: test whether treatment effect differs between genotypes ---
# H0: the effect of treatment is the same in WT and KO
coldata$genotype <- factor(coldata$genotype)  # "WT" or "KO"

dds_int <- DESeqDataSetFromMatrix(
    countData = counts, colData = coldata,
    design = ~ genotype + condition + genotype:condition
)
dds_int <- DESeq(dds_int)
resultsNames(dds_int)
# Interaction coefficient: "genotypeKO.conditiontreated"

# Extract interaction term — genes where treatment effect differs by genotype
res_int <- results(dds_int, name = "genotypeKO.conditiontreated", alpha = 0.05)
cat("Genes with significant interaction:", sum(res_int$padj < 0.05, na.rm = TRUE), "\n")

# --- Likelihood Ratio Test for complex designs (e.g., time series) ---
# LRT compares full model vs. reduced model
dds_lrt <- DESeq(dds_int, test = "LRT", reduced = ~ genotype + condition)
res_lrt  <- results(dds_lrt)
cat("Genes with genotype:condition interaction (LRT):",
    sum(res_lrt$padj < 0.05, na.rm = TRUE), "\n")

Key Parameters

ParameterFunctionDefaultRange / OptionsEffect
design
DESeqDataSetFromMatrix()
(required)
~ var
or
~ cov + var
Specifies the GLM formula; put batch/covariates before variable of interest
contrast
results()
NULL
c(var, level1, level2)
or coefficient name
Defines the comparison; always specify explicitly for clarity
alpha
results()
0.1
0.01
0.10
FDR threshold used for independent filtering; does not change padj values
lfcThreshold
results()
0
0
2
Test H0:
independentFilter
results()
TRUE
TRUE
/
FALSE
Adaptive mean-count filtering to maximize the number of rejectible hypotheses
type
lfcShrink()
"apeglm"
"apeglm"
,
"ashr"
,
"normal"
Shrinkage prior; apeglm is fastest and best calibrated
coef
lfcShrink()
(required)coefficient name from
resultsNames()
Which model coefficient to shrink; must match
resultsNames()
output
test
DESeq()
"Wald"
"Wald"
,
"LRT"
Wald: two-group comparisons; LRT: model comparison for complex designs
reduced
DESeq()
NULL
formula without the term to testRequired for LRT; specifies the null model
blind
vst()
/
rlog()
TRUE
TRUE
/
FALSE
TRUE
for QC/exploration;
FALSE
for downstream analysis using dispersion info

Key Concepts

Negative Binomial Model and Dispersion

DESeq2 models each gene's count as Negative Binomial(mean = µ, dispersion = α), where mean µ depends on the experimental design through a log-linear model. The key innovation is dispersion shrinkage: gene-wise dispersion estimates are shrunk toward a fitted trend across all genes using an empirical Bayes approach, borrowing information across genes to stabilize estimates from small sample sizes.

Size Factor Normalization

DESeq2 uses the median-of-ratios method: for each sample, compute the ratio of each gene's count to the geometric mean of that gene across all samples, then take the median ratio across all genes as the size factor. This is robust to differentially expressed genes and does not require any assumptions about the fraction of DE genes.

Wald Test vs. Likelihood Ratio Test

  • Wald test (default): tests a single coefficient
    β = 0
    ; fast and appropriate for two-group comparisons and specific contrasts in multi-factor designs
  • LRT: compares the full model to a
    reduced
    model by the likelihood ratio; appropriate for testing whether any level of a multi-level factor (e.g., 5 time points) has an effect, or for testing interaction terms as a group

Cook's Distance Outlier Detection

DESeq2 computes Cook's distance per gene per sample to flag potential expression outliers. Genes where any sample has a Cook's distance above a threshold are automatically replaced with

NA
in the results. With
refit = TRUE
(default), outlier samples are excluded and the model is refit for that gene.

Common Recipes

Recipe: Multiple Contrasts from One Fitted Model

Efficient when comparing multiple treatment groups — fit

DESeq()
once, extract multiple contrasts.

# Fit once
dds_multi <- DESeqDataSetFromMatrix(counts, coldata, design = ~ condition)
dds_multi <- dds_multi[rowSums(counts(dds_multi)) >= 10, ]
dds_multi <- DESeq(dds_multi)

# Extract contrasts — all vs. "control"
contrasts_list <- list(
    A_vs_ctrl = c("condition", "treatment_A", "control"),
    B_vs_ctrl = c("condition", "treatment_B", "control"),
    C_vs_ctrl = c("condition", "treatment_C", "control")
)

results_list <- lapply(names(contrasts_list), function(nm) {
    res <- results(dds_multi, contrast = contrasts_list[[nm]], alpha = 0.05)
    res_shrunk <- lfcShrink(dds_multi, contrast = contrasts_list[[nm]],
                             type = "apeglm", res = res)
    write.csv(as.data.frame(res_shrunk), paste0("results_", nm, ".csv"))
    cat(nm, "— significant:", sum(res$padj < 0.05, na.rm = TRUE), "\n")
    as.data.frame(res_shrunk)
})
names(results_list) <- names(contrasts_list)

Recipe: Paired Sample Design

Eliminate within-subject variability by including subject as a blocking factor.

# Paired design: each subject has pre- and post-treatment samples
coldata$subject   <- factor(c("S1","S1","S2","S2","S3","S3"))
coldata$timepoint <- factor(c("pre","post","pre","post","pre","post"))

dds_paired <- DESeqDataSetFromMatrix(counts, coldata,
                                      design = ~ subject + timepoint)
dds_paired <- dds_paired[rowSums(counts(dds_paired)) >= 10, ]
dds_paired <- DESeq(dds_paired)

res_paired <- results(dds_paired,
                       contrast = c("timepoint", "post", "pre"),
                       alpha    = 0.05)
cat("Significant (paired):", sum(res_paired$padj < 0.05, na.rm = TRUE), "\n")

Recipe: Rank Genes for GSEA Using Shrunk LFC × -log10(pvalue)

Create a pre-ranked gene list for input to GSEA or gseapy's

prerank
function.

# Build ranking metric: signed -log10(pvalue) with LFC direction
res_rank <- as.data.frame(res_shrunk)
res_rank <- res_rank[!is.na(res_rank$pvalue) & !is.na(res_rank$log2FoldChange), ]
res_rank$rank_metric <- sign(res_rank$log2FoldChange) *
                         (-log10(res_rank$pvalue + 1e-300))
res_rank <- res_rank[order(res_rank$rank_metric, decreasing = TRUE), ]

# Export as two-column TSv for fgsea or gseapy prerank
write.table(data.frame(gene  = rownames(res_rank),
                        score = res_rank$rank_metric),
            "gsea_ranked_list.rnk",
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
cat("Ranked gene list saved:", nrow(res_rank), "genes\n")

Recipe: Save and Reload Fitted DESeq2 Object

Avoid re-running the expensive

DESeq()
step in subsequent sessions.

# Save fitted DDS object
saveRDS(dds, "dds_fitted.rds")
cat("Saved fitted DESeqDataSet to dds_fitted.rds\n")

# Reload in a new session
dds_loaded <- readRDS("dds_fitted.rds")
# Continue with results(), lfcShrink(), etc.
res_reloaded <- results(dds_loaded, contrast = c("condition", "treated", "control"))

Expected Outputs

FileDescription
deseq2_all_results.csv
Full results table: baseMean, log2FoldChange, lfcSE, stat, pvalue, padj for all genes
deseq2_shrunk_results.csv
Results with apeglm-shrunk LFC; use for ranking and visualization
deseq2_significant.csv
Filtered results: padj < 0.05 and
pca_plot.pdf
PCA of variance-stabilized counts; used to check batch structure and outliers
dispersion_plot.pdf
Gene-wise vs. fitted dispersions; should show tight cloud around trend
volcano_plot.pdf
Volcano plot with significance and LFC thresholds labeled
heatmap_top50.pdf
Hierarchically clustered heatmap of top 50 DE genes
gsea_ranked_list.rnk
Pre-ranked gene list for pathway enrichment (fgsea/gseapy)
dds_fitted.rds
Serialized DESeqDataSet for checkpoint/resume

Troubleshooting

ProblemCauseSolution
Error: design contains one or more variables with all samples having the same value
A design factor has only one levelCheck
table(coldata$condition)
; ensure all factor levels are present
Model matrix not full rank
Confounded design (e.g., batch perfectly correlated with condition)Run
table(coldata$batch, coldata$condition)
; drop the confounded variable or collect more samples
All or most
padj = NA
Genes flagged by independent filtering (low mean count) or Cook's outliersCheck
summary(res)
for count of filtered genes; relax
alpha
or verify data quality
No significant genes despite strong biologyUnder-powered experiment or high within-group variabilityVerify n ≥ 3 per group; check PCA for outliers; inspect p-value histogram (should show spike near 0)
Error in lfcShrink
: coefficient not found
coef
name does not match
resultsNames(dds)
Run
resultsNames(dds)
and copy the exact coefficient string into
coef =
apeglm
fails with convergence warning
Extreme LFC estimates (sparse data or complete separation)Switch to
type = "ashr"
which is more robust to these cases
Very large size factors (> 5)Extremely different library sizes, or normalized counts accidentally usedCheck
colSums(counts(dds))
; ensure input is raw integer counts from aligner/counter
Dispersion plot shows outlier cloud far from trendNoisy or failed libraries; incorrect metadata groupingExamine PCA; remove outlier samples; check that
design
formula matches biological groups
Volcano plot: no labeled points
EnhancedVolcano
defaults label too few genes
Adjust
pCutoff
and
FCcutoff
; or set
selectLab = rownames(sig_genes)[1:20]

References