SciAgent-Skills pydeseq2-differential-expression
Bulk RNA-seq differential expression analysis with PyDESeq2. Load count matrices, normalize, fit negative binomial models, Wald test with BH-FDR correction, LFC shrinkage, volcano/MA plots. Use for two-group comparisons, multi-factor designs with batch correction, and multiple contrast testing.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/pydeseq2-differential-expression" ~/.claude/skills/jaechang-hits-sciagent-skills-pydeseq2-differential-expression && rm -rf "$T"
skills/genomics-bioinformatics/pydeseq2-differential-expression/SKILL.mdPyDESeq2 Differential Expression Analysis
Overview
PyDESeq2 is a Python reimplementation of the R DESeq2 package for differential gene expression analysis from bulk RNA-seq count data. It fits negative binomial generalized linear models per gene, estimates dispersion with empirical Bayes shrinkage, and performs Wald tests with Benjamini-Hochberg FDR correction. This skill covers the full pipeline from raw counts to publication-ready result tables and visualizations.
When to Use
- Identifying differentially expressed genes between two or more experimental conditions from bulk RNA-seq
- Performing two-group comparisons (e.g., treated vs control) with proper statistical testing
- Running multi-factor designs that account for batch effects or covariates (e.g.,
)~batch + condition - Applying log2 fold change shrinkage (apeGLM) for ranking and visualization
- Generating volcano plots, MA plots, and heatmaps from differential expression results
- Converting R-based DESeq2 workflows to a pure Python environment
- Integrating DE analysis into larger Python bioinformatics pipelines (e.g., with scanpy, pandas)
- Use DESeq2 (R/Bioconductor) or edgeR instead for the reference R implementations with the broadest method support and community validation
Prerequisites
- Python packages:
,pydeseq2>=0.4
,pandas>=1.4
,numpy>=1.23
,scipy>=1.11
,scikit-learn>=1.1anndata>=0.8 - Data requirements: Raw (unnormalized) integer count matrix (samples x genes) + sample metadata DataFrame
- Environment: Python 3.10+; optional
,matplotlib
for visualizationseaborn
pip install pydeseq2 matplotlib seaborn
Workflow
Step 1: Data Loading and Validation
Load the count matrix and metadata. PyDESeq2 expects counts as a samples x genes DataFrame with non-negative integers, and metadata as a samples x variables DataFrame with matching indices.
import pandas as pd # Load data — typical CSV has genes as rows, samples as columns counts_raw = pd.read_csv("counts.csv", index_col=0) metadata = pd.read_csv("metadata.csv", index_col=0) # Transpose if needed: PyDESeq2 requires samples x genes if counts_raw.shape[0] > counts_raw.shape[1]: counts_df = counts_raw.T # genes x samples → samples x genes else: counts_df = counts_raw # Validate alignment common_samples = counts_df.index.intersection(metadata.index) counts_df = counts_df.loc[common_samples] metadata = metadata.loc[common_samples] print(f"Samples: {counts_df.shape[0]}, Genes: {counts_df.shape[1]}") print(f"Metadata columns: {list(metadata.columns)}") print(f"Condition counts:\n{metadata['condition'].value_counts()}")
Step 2: Gene Filtering
Remove lowly expressed genes to improve statistical power and reduce multiple testing burden.
# Filter genes with total counts below threshold min_total_counts = 10 gene_counts = counts_df.sum(axis=0) genes_to_keep = gene_counts[gene_counts >= min_total_counts].index counts_df = counts_df[genes_to_keep] # Optional: require minimum counts in a minimum number of samples min_count_per_sample = 5 min_samples = 3 genes_expressed = (counts_df >= min_count_per_sample).sum(axis=0) >= min_samples counts_df = counts_df.loc[:, genes_expressed] print(f"Genes after filtering: {counts_df.shape[1]}")
Step 3: DeseqDataSet Initialization and Fitting
Create the DESeq dataset object, specify the design formula, and run the full pipeline (size factor estimation, dispersion estimation, model fitting).
from pydeseq2.dds import DeseqDataSet dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~condition", # Wilkinson-style formula refit_cooks=True, # Refit after Cook's outlier removal n_cpus=4 # Parallel threads ) # Run: size factors → dispersions → trend → MAP shrinkage → LFC fitting dds.deseq2() # Inspect normalization print(f"Size factors (first 5): {dds.obsm['size_factors'][:5]}") print(f"Size factor range: {dds.obsm['size_factors'].min():.2f} - {dds.obsm['size_factors'].max():.2f}")
Step 4: Statistical Testing (Wald Test)
Perform Wald tests to identify differentially expressed genes. Specify the contrast as
[variable, test_level, reference_level].
from pydeseq2.ds import DeseqStats ds = DeseqStats( dds, contrast=["condition", "treated", "control"], alpha=0.05, # FDR threshold cooks_filter=True, # Filter Cook's outliers independent_filter=True # Independent filtering for power ) ds.summary() # Access full results results = ds.results_df print(f"Total genes tested: {len(results)}") print(f"Significant (padj < 0.05): {(results.padj < 0.05).sum()}")
Step 5: LFC Shrinkage (Optional)
Apply apeGLM shrinkage to reduce noise in log2 fold change estimates. Use shrunk values for visualization and ranking, not for significance calls.
# Apply shrinkage — modifies results_df.log2FoldChange in place ds.lfc_shrink() # Compare pre/post shrinkage effect print(f"Max |LFC| after shrinkage: {results.log2FoldChange.abs().max():.2f}") print(f"Genes with |LFC| > 2: {(results.log2FoldChange.abs() > 2).sum()}")
Step 6: Result Filtering and Export
Filter significant genes and export results for downstream analysis.
import numpy as np # Significance + effect size filter significant = results[ (results.padj < 0.05) & (results.log2FoldChange.abs() > 1.0) ].copy() # Separate up/down-regulated up = significant[significant.log2FoldChange > 0].sort_values("padj") down = significant[significant.log2FoldChange < 0].sort_values("padj") print(f"Upregulated: {len(up)}, Downregulated: {len(down)}") # Export results.to_csv("deseq2_all_results.csv") significant.to_csv("deseq2_significant.csv") up.to_csv("deseq2_upregulated.csv") down.to_csv("deseq2_downregulated.csv") print("Results exported to CSV files")
Step 7: Visualization — Volcano Plot
import matplotlib.pyplot as plt import numpy as np fig, ax = plt.subplots(figsize=(10, 7)) res = results.dropna(subset=["padj"]).copy() res["-log10_padj"] = -np.log10(res.padj) # Color categories is_sig = (res.padj < 0.05) & (res.log2FoldChange.abs() > 1.0) is_up = is_sig & (res.log2FoldChange > 0) is_down = is_sig & (res.log2FoldChange < 0) ax.scatter(res.loc[~is_sig, "log2FoldChange"], res.loc[~is_sig, "-log10_padj"], c="grey", alpha=0.3, s=8, label="Not significant") ax.scatter(res.loc[is_up, "log2FoldChange"], res.loc[is_up, "-log10_padj"], c="firebrick", alpha=0.6, s=12, label=f"Up ({is_up.sum()})") ax.scatter(res.loc[is_down, "log2FoldChange"], res.loc[is_down, "-log10_padj"], c="steelblue", alpha=0.6, s=12, label=f"Down ({is_down.sum()})") ax.axhline(-np.log10(0.05), ls="--", c="black", alpha=0.4) ax.axvline(-1, ls="--", c="black", alpha=0.4) ax.axvline(1, ls="--", c="black", alpha=0.4) ax.set_xlabel("Log2 Fold Change") ax.set_ylabel("-Log10(Adjusted P-value)") ax.set_title("Volcano Plot — Treated vs Control") ax.legend() plt.tight_layout() plt.savefig("volcano_plot.png", dpi=300) print("Saved volcano_plot.png")
Step 8: Visualization — MA Plot
fig, ax = plt.subplots(figsize=(10, 7)) ax.scatter(np.log10(res.loc[~is_sig, "baseMean"] + 1), res.loc[~is_sig, "log2FoldChange"], c="grey", alpha=0.3, s=8) ax.scatter(np.log10(res.loc[is_sig, "baseMean"] + 1), res.loc[is_sig, "log2FoldChange"], c="firebrick", alpha=0.6, s=12) ax.axhline(0, ls="--", c="black", alpha=0.5) ax.set_xlabel("Log10(Mean Normalized Count + 1)") ax.set_ylabel("Log2 Fold Change") ax.set_title("MA Plot") plt.tight_layout() plt.savefig("ma_plot.png", dpi=300) print("Saved ma_plot.png")
Key Parameters
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
| (required) | Wilkinson formula | Model formula; put covariates before variable of interest |
| | | Which comparison to test; uses last coefficient |
| | – | FDR threshold for significance calling |
| | / | Refit model after removing Cook's distance outliers |
| | / | Apply Cook's distance filtering during testing |
| | / | Independent filtering to optimize detection power |
| | – | Number of parallel threads for dispersion fitting |
(user) | | – | Gene filtering: minimum total reads across all samples |
| off | call after | apeGLM shrinkage; reduces noisy LFC estimates |
Common Recipes
Recipe: Multiple Contrasts from One Model
When comparing multiple treatment groups against a shared control.
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition") dds.deseq2() contrasts = { "A_vs_ctrl": ["condition", "treatment_A", "control"], "B_vs_ctrl": ["condition", "treatment_B", "control"], "C_vs_ctrl": ["condition", "treatment_C", "control"], } for name, contrast in contrasts.items(): ds = DeseqStats(dds, contrast=contrast, alpha=0.05) ds.summary() n_sig = (ds.results_df.padj < 0.05).sum() ds.results_df.to_csv(f"results_{name}.csv") print(f"{name}: {n_sig} significant genes")
Recipe: Batch-Corrected Analysis
When samples come from multiple batches or sequencing runs.
# Verify batch is not confounded with condition print(pd.crosstab(metadata["batch"], metadata["condition"])) dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~batch + condition" # batch first, then variable of interest ) dds.deseq2() ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary() print(f"Significant genes (batch-corrected): {(ds.results_df.padj < 0.05).sum()}")
Recipe: P-value Distribution QC
Diagnostic check — a healthy analysis shows a flat histogram with a spike near 0.
import matplotlib.pyplot as plt fig, axes = plt.subplots(1, 2, figsize=(12, 5)) # P-value histogram axes[0].hist(ds.results_df.pvalue.dropna(), bins=50, edgecolor="black") axes[0].set_xlabel("P-value") axes[0].set_ylabel("Frequency") axes[0].set_title("P-value Distribution") # Dispersion plot axes[1].scatter( np.log10(dds.varm["_normed_means"] + 1), np.log10(dds.varm["dispersions"]), alpha=0.3, s=5 ) axes[1].set_xlabel("Log10(Mean Expression + 1)") axes[1].set_ylabel("Log10(Dispersion)") axes[1].set_title("Dispersion vs Mean") plt.tight_layout() plt.savefig("qc_diagnostics.png", dpi=300) print("Saved qc_diagnostics.png")
Recipe: Save and Reload DESeq Dataset
For resuming analysis without re-running the expensive fitting step.
import pickle # Save with open("dds_fitted.pkl", "wb") as f: pickle.dump(dds.to_picklable_anndata(), f) print("Saved fitted DESeqDataSet") # Reload with open("dds_fitted.pkl", "rb") as f: adata = pickle.load(f) # Note: reload requires re-constructing DeseqDataSet from the AnnData
Expected Outputs
— Full results table (baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) for all tested genesdeseq2_all_results.csv
— Filtered results (padj < 0.05 and |LFC| > 1)deseq2_significant.csv
— Significant upregulated genes sorted by padjdeseq2_upregulated.csv
— Significant downregulated genes sorted by padjdeseq2_downregulated.csv
— Volcano plot with significance and fold change thresholdsvolcano_plot.png
— MA plot showing fold change vs mean expressionma_plot.png
— P-value distribution and dispersion plotqc_diagnostics.png
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Sample names differ between counts and metadata | Use to align |
All genes have | Genes have zero variance or all zero counts | Apply stricter gene filtering (increase ) |
| Confounded variables (e.g., all treated in one batch) | Check with ; simplify design or remove confounded variable |
| No significant genes found | Small effect size, high variability, or low sample size | Check p-value distribution; relax alpha or |
during fitting | Too many genes or very large dataset | Pre-filter more aggressively; reduce ; use machine with more RAM |
| Very large size factors (>5) | Extreme library size differences | Verify raw counts are unnormalized; check for contamination or failed libraries |
| Shrinkage produces unexpected LFCs | Calling before | Always call first, then |
References
- PyDESeq2 Documentation — Official API reference and tutorials
- Muzellec et al. (2023) — PyDESeq2: a Python package for bulk differential expression analysis with DESeq2. Bioinformatics.
- Love et al. (2014) — Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology.
- DESeq2 Vignette (Bioconductor) — Comprehensive R DESeq2 guide with design formula examples