SciAgent-Skills pydeseq2-differential-expression

Bulk RNA-seq differential expression analysis with PyDESeq2. Load count matrices, normalize, fit negative binomial models, Wald test with BH-FDR correction, LFC shrinkage, volcano/MA plots. Use for two-group comparisons, multi-factor designs with batch correction, and multiple contrast testing.

install

source · Clone the upstream repo

git clone https://github.com/jaechang-hits/SciAgent-Skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/pydeseq2-differential-expression" ~/.claude/skills/jaechang-hits-sciagent-skills-pydeseq2-differential-expression && rm -rf "$T"

manifest: skills/genomics-bioinformatics/pydeseq2-differential-expression/SKILL.md

source content

PyDESeq2 Differential Expression Analysis

Overview

PyDESeq2 is a Python reimplementation of the R DESeq2 package for differential gene expression analysis from bulk RNA-seq count data. It fits negative binomial generalized linear models per gene, estimates dispersion with empirical Bayes shrinkage, and performs Wald tests with Benjamini-Hochberg FDR correction. This skill covers the full pipeline from raw counts to publication-ready result tables and visualizations.

When to Use

Identifying differentially expressed genes between two or more experimental conditions from bulk RNA-seq
Performing two-group comparisons (e.g., treated vs control) with proper statistical testing
Running multi-factor designs that account for batch effects or covariates (e.g.,
```
~batch + condition
```
)
Applying log2 fold change shrinkage (apeGLM) for ranking and visualization
Generating volcano plots, MA plots, and heatmaps from differential expression results
Converting R-based DESeq2 workflows to a pure Python environment
Integrating DE analysis into larger Python bioinformatics pipelines (e.g., with scanpy, pandas)
Use DESeq2 (R/Bioconductor) or edgeR instead for the reference R implementations with the broadest method support and community validation

Prerequisites

Python packages:

pydeseq2>=0.4

pandas>=1.4

numpy>=1.23

scipy>=1.11

scikit-learn>=1.1

anndata>=0.8

Data requirements: Raw (unnormalized) integer count matrix (samples x genes) + sample metadata DataFrame
Environment: Python 3.10+; optional
```
matplotlib
```
,
```
seaborn
```
for visualization

pip install pydeseq2 matplotlib seaborn

Workflow

Step 1: Data Loading and Validation

Load the count matrix and metadata. PyDESeq2 expects counts as a samples x genes DataFrame with non-negative integers, and metadata as a samples x variables DataFrame with matching indices.

import pandas as pd

# Load data — typical CSV has genes as rows, samples as columns
counts_raw = pd.read_csv("counts.csv", index_col=0)
metadata = pd.read_csv("metadata.csv", index_col=0)

# Transpose if needed: PyDESeq2 requires samples x genes
if counts_raw.shape[0] > counts_raw.shape[1]:
    counts_df = counts_raw.T  # genes x samples → samples x genes
else:
    counts_df = counts_raw

# Validate alignment
common_samples = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common_samples]
metadata = metadata.loc[common_samples]

print(f"Samples: {counts_df.shape[0]}, Genes: {counts_df.shape[1]}")
print(f"Metadata columns: {list(metadata.columns)}")
print(f"Condition counts:\n{metadata['condition'].value_counts()}")

Step 2: Gene Filtering

Remove lowly expressed genes to improve statistical power and reduce multiple testing burden.

# Filter genes with total counts below threshold
min_total_counts = 10
gene_counts = counts_df.sum(axis=0)
genes_to_keep = gene_counts[gene_counts >= min_total_counts].index
counts_df = counts_df[genes_to_keep]

# Optional: require minimum counts in a minimum number of samples
min_count_per_sample = 5
min_samples = 3
genes_expressed = (counts_df >= min_count_per_sample).sum(axis=0) >= min_samples
counts_df = counts_df.loc[:, genes_expressed]

print(f"Genes after filtering: {counts_df.shape[1]}")

Step 3: DeseqDataSet Initialization and Fitting

Create the DESeq dataset object, specify the design formula, and run the full pipeline (size factor estimation, dispersion estimation, model fitting).

from pydeseq2.dds import DeseqDataSet

dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",   # Wilkinson-style formula
    refit_cooks=True,      # Refit after Cook's outlier removal
    n_cpus=4               # Parallel threads
)

# Run: size factors → dispersions → trend → MAP shrinkage → LFC fitting
dds.deseq2()

# Inspect normalization
print(f"Size factors (first 5): {dds.obsm['size_factors'][:5]}")
print(f"Size factor range: {dds.obsm['size_factors'].min():.2f} - {dds.obsm['size_factors'].max():.2f}")

Step 4: Statistical Testing (Wald Test)

Perform Wald tests to identify differentially expressed genes. Specify the contrast as

[variable, test_level, reference_level]

from pydeseq2.ds import DeseqStats

ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],
    alpha=0.05,              # FDR threshold
    cooks_filter=True,       # Filter Cook's outliers
    independent_filter=True  # Independent filtering for power
)

ds.summary()

# Access full results
results = ds.results_df
print(f"Total genes tested: {len(results)}")
print(f"Significant (padj < 0.05): {(results.padj < 0.05).sum()}")

Step 5: LFC Shrinkage (Optional)

Apply apeGLM shrinkage to reduce noise in log2 fold change estimates. Use shrunk values for visualization and ranking, not for significance calls.

# Apply shrinkage — modifies results_df.log2FoldChange in place
ds.lfc_shrink()

# Compare pre/post shrinkage effect
print(f"Max |LFC| after shrinkage: {results.log2FoldChange.abs().max():.2f}")
print(f"Genes with |LFC| > 2: {(results.log2FoldChange.abs() > 2).sum()}")

Step 6: Result Filtering and Export

Filter significant genes and export results for downstream analysis.

import numpy as np

# Significance + effect size filter
significant = results[
    (results.padj < 0.05) &
    (results.log2FoldChange.abs() > 1.0)
].copy()

# Separate up/down-regulated
up = significant[significant.log2FoldChange > 0].sort_values("padj")
down = significant[significant.log2FoldChange < 0].sort_values("padj")

print(f"Upregulated: {len(up)}, Downregulated: {len(down)}")

# Export
results.to_csv("deseq2_all_results.csv")
significant.to_csv("deseq2_significant.csv")
up.to_csv("deseq2_upregulated.csv")
down.to_csv("deseq2_downregulated.csv")
print("Results exported to CSV files")

Step 7: Visualization — Volcano Plot

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(10, 7))
res = results.dropna(subset=["padj"]).copy()
res["-log10_padj"] = -np.log10(res.padj)

# Color categories
is_sig = (res.padj < 0.05) & (res.log2FoldChange.abs() > 1.0)
is_up = is_sig & (res.log2FoldChange > 0)
is_down = is_sig & (res.log2FoldChange < 0)

ax.scatter(res.loc[~is_sig, "log2FoldChange"], res.loc[~is_sig, "-log10_padj"],
           c="grey", alpha=0.3, s=8, label="Not significant")
ax.scatter(res.loc[is_up, "log2FoldChange"], res.loc[is_up, "-log10_padj"],
           c="firebrick", alpha=0.6, s=12, label=f"Up ({is_up.sum()})")
ax.scatter(res.loc[is_down, "log2FoldChange"], res.loc[is_down, "-log10_padj"],
           c="steelblue", alpha=0.6, s=12, label=f"Down ({is_down.sum()})")

ax.axhline(-np.log10(0.05), ls="--", c="black", alpha=0.4)
ax.axvline(-1, ls="--", c="black", alpha=0.4)
ax.axvline(1, ls="--", c="black", alpha=0.4)
ax.set_xlabel("Log2 Fold Change")
ax.set_ylabel("-Log10(Adjusted P-value)")
ax.set_title("Volcano Plot — Treated vs Control")
ax.legend()
plt.tight_layout()
plt.savefig("volcano_plot.png", dpi=300)
print("Saved volcano_plot.png")

Step 8: Visualization — MA Plot

fig, ax = plt.subplots(figsize=(10, 7))

ax.scatter(np.log10(res.loc[~is_sig, "baseMean"] + 1),
           res.loc[~is_sig, "log2FoldChange"],
           c="grey", alpha=0.3, s=8)
ax.scatter(np.log10(res.loc[is_sig, "baseMean"] + 1),
           res.loc[is_sig, "log2FoldChange"],
           c="firebrick", alpha=0.6, s=12)

ax.axhline(0, ls="--", c="black", alpha=0.5)
ax.set_xlabel("Log10(Mean Normalized Count + 1)")
ax.set_ylabel("Log2 Fold Change")
ax.set_title("MA Plot")
plt.tight_layout()
plt.savefig("ma_plot.png", dpi=300)
print("Saved ma_plot.png")

Key Parameters

Parameter	Default	Range / Options	Effect
`design`	(required)	Wilkinson formula	Model formula; put covariates before variable of interest
`contrast`	`None`	`[var, test, ref]`	Which comparison to test; `None` uses last coefficient
`alpha`	`0.05`	`0.01` – `0.10`	FDR threshold for significance calling
`refit_cooks`	`True`	`True` / `False`	Refit model after removing Cook's distance outliers
`cooks_filter`	`True`	`True` / `False`	Apply Cook's distance filtering during testing
`independent_filter`	`True`	`True` / `False`	Independent filtering to optimize detection power
`n_cpus`	`1`	`1` – `N`	Number of parallel threads for dispersion fitting
`min_total_counts` (user)	`10`	`5` – `50`	Gene filtering: minimum total reads across all samples
`lfc_shrink()`	off	call after `summary()`	apeGLM shrinkage; reduces noisy LFC estimates

Common Recipes

Recipe: Multiple Contrasts from One Model

When comparing multiple treatment groups against a shared control.

dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition")
dds.deseq2()

contrasts = {
    "A_vs_ctrl": ["condition", "treatment_A", "control"],
    "B_vs_ctrl": ["condition", "treatment_B", "control"],
    "C_vs_ctrl": ["condition", "treatment_C", "control"],
}

for name, contrast in contrasts.items():
    ds = DeseqStats(dds, contrast=contrast, alpha=0.05)
    ds.summary()
    n_sig = (ds.results_df.padj < 0.05).sum()
    ds.results_df.to_csv(f"results_{name}.csv")
    print(f"{name}: {n_sig} significant genes")

Recipe: Batch-Corrected Analysis

When samples come from multiple batches or sequencing runs.

# Verify batch is not confounded with condition
print(pd.crosstab(metadata["batch"], metadata["condition"]))

dds = DeseqDataSet(
    counts=counts_df, metadata=metadata,
    design="~batch + condition"  # batch first, then variable of interest
)
dds.deseq2()

ds = DeseqStats(dds, contrast=["condition", "treated", "control"])
ds.summary()
print(f"Significant genes (batch-corrected): {(ds.results_df.padj < 0.05).sum()}")

Recipe: P-value Distribution QC

Diagnostic check — a healthy analysis shows a flat histogram with a spike near 0.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# P-value histogram
axes[0].hist(ds.results_df.pvalue.dropna(), bins=50, edgecolor="black")
axes[0].set_xlabel("P-value")
axes[0].set_ylabel("Frequency")
axes[0].set_title("P-value Distribution")

# Dispersion plot
axes[1].scatter(
    np.log10(dds.varm["_normed_means"] + 1),
    np.log10(dds.varm["dispersions"]),
    alpha=0.3, s=5
)
axes[1].set_xlabel("Log10(Mean Expression + 1)")
axes[1].set_ylabel("Log10(Dispersion)")
axes[1].set_title("Dispersion vs Mean")

plt.tight_layout()
plt.savefig("qc_diagnostics.png", dpi=300)
print("Saved qc_diagnostics.png")

Recipe: Save and Reload DESeq Dataset

For resuming analysis without re-running the expensive fitting step.

import pickle

# Save
with open("dds_fitted.pkl", "wb") as f:
    pickle.dump(dds.to_picklable_anndata(), f)
print("Saved fitted DESeqDataSet")

# Reload
with open("dds_fitted.pkl", "rb") as f:
    adata = pickle.load(f)
# Note: reload requires re-constructing DeseqDataSet from the AnnData

Expected Outputs

```
deseq2_all_results.csv
```
— Full results table (baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) for all tested genes
```
deseq2_significant.csv
```
— Filtered results (padj < 0.05 and |LFC| > 1)
```
deseq2_upregulated.csv
```
— Significant upregulated genes sorted by padj
```
deseq2_downregulated.csv
```
— Significant downregulated genes sorted by padj
```
volcano_plot.png
```
— Volcano plot with significance and fold change thresholds
```
ma_plot.png
```
— MA plot showing fold change vs mean expression
```
qc_diagnostics.png
```
— P-value distribution and dispersion plot

Troubleshooting

Problem	Cause	Solution
`ValueError: Index mismatch`	Sample names differ between counts and metadata	Use `counts_df.index.intersection(metadata.index)` to align
All genes have `padj = NaN`	Genes have zero variance or all zero counts	Apply stricter gene filtering (increase `min_total_counts` )
`Design matrix is not full rank`	Confounded variables (e.g., all treated in one batch)	Check with `pd.crosstab()` ; simplify design or remove confounded variable
No significant genes found	Small effect size, high variability, or low sample size	Check p-value distribution; relax alpha or
`MemoryError` during fitting	Too many genes or very large dataset	Pre-filter more aggressively; reduce `n_cpus` ; use machine with more RAM
Very large size factors (>5)	Extreme library size differences	Verify raw counts are unnormalized; check for contamination or failed libraries
Shrinkage produces unexpected LFCs	Calling `lfc_shrink()` before `summary()`	Always call `ds.summary()` first, then `ds.lfc_shrink()`

References

PyDESeq2 Documentation — Official API reference and tutorials
Muzellec et al. (2023) — PyDESeq2: a Python package for bulk differential expression analysis with DESeq2. Bioinformatics.
Love et al. (2014) — Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology.
DESeq2 Vignette (Bioconductor) — Comprehensive R DESeq2 guide with design formula examples