AlterLab-Academic-Skills alterlab-pydeseq2
Differential gene expression analysis (Python DESeq2). Identify DE genes from bulk RNA-seq counts, Wald tests, FDR correction, volcano/MA plots, for RNA-seq analysis. Part of the AlterLab Academic Skills suite.
git clone https://github.com/AlterLab-IEU/AlterLab-Academic-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/AlterLab-IEU/AlterLab-Academic-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bioinformatics/alterlab-pydeseq2" ~/.claude/skills/alterlab-ieu-alterlab-academic-skills-alterlab-pydeseq2 && rm -rf "$T"
skills/bioinformatics/alterlab-pydeseq2/SKILL.mdPyDESeq2
Overview
PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.
When to Use This Skill
This skill should be used when:
- Analyzing bulk RNA-seq count data for differential expression
- Comparing gene expression between experimental conditions (e.g., treated vs control)
- Performing multi-factor designs accounting for batch effects or covariates
- Converting R-based DESeq2 workflows to Python
- Integrating differential expression analysis into Python-based pipelines
- Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2"
Quick Start Workflow
For users who want to perform a standard differential expression analysis:
import pandas as pd from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats # 1. Load data counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose to samples × genes metadata = pd.read_csv("metadata.csv", index_col=0) # 2. Filter low-count genes genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10] counts_df = counts_df[genes_to_keep] # 3. Initialize and fit DESeq2 dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~condition", refit_cooks=True ) dds.deseq2() # 4. Perform statistical testing ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary() # 5. Access results results = ds.results_df significant = results[results.padj < 0.05] print(f"Found {len(significant)} significant genes")
Core Workflow Steps
Step 1: Data Preparation
Input requirements:
- Count matrix: Samples × genes DataFrame with non-negative integer read counts
- Metadata: Samples × variables DataFrame with experimental factors
Common data loading patterns:
# From CSV (typical format: genes × samples, needs transpose) counts_df = pd.read_csv("counts.csv", index_col=0).T metadata = pd.read_csv("metadata.csv", index_col=0) # From TSV counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T # From AnnData import anndata as ad adata = ad.read_h5ad("data.h5ad") counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names) metadata = adata.obs
Data filtering:
# Remove low-count genes genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10] counts_df = counts_df[genes_to_keep] # Remove samples with missing metadata samples_to_keep = ~metadata.condition.isna() counts_df = counts_df.loc[samples_to_keep] metadata = metadata.loc[samples_to_keep]
Step 2: Design Specification
The design formula specifies how gene expression is modeled.
Single-factor designs:
design = "~condition" # Simple two-group comparison
Multi-factor designs:
design = "~batch + condition" # Control for batch effects design = "~age + condition" # Include continuous covariate design = "~group + condition + group:condition" # Interaction effects
Design formula guidelines:
- Use Wilkinson formula notation (R-style)
- Put adjustment variables (e.g., batch) before the main variable of interest
- Ensure variables exist as columns in the metadata DataFrame
- Use appropriate data types (categorical for discrete variables)
Step 3: DESeq2 Fitting
Initialize the DeseqDataSet and run the complete pipeline:
from pydeseq2.dds import DeseqDataSet dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~condition", refit_cooks=True, # Refit after removing outliers n_cpus=1 # Parallel processing (adjust as needed) ) # Run the complete DESeq2 pipeline dds.deseq2()
What
does:deseq2()
- Computes size factors (normalization)
- Fits genewise dispersions
- Fits dispersion trend curve
- Computes dispersion priors
- Fits MAP dispersions (shrinkage)
- Fits log fold changes
- Calculates Cook's distances (outlier detection)
- Refits if outliers detected (optional)
Step 4: Statistical Testing
Perform Wald tests to identify differentially expressed genes:
from pydeseq2.ds import DeseqStats ds = DeseqStats( dds, contrast=["condition", "treated", "control"], # Test treated vs control alpha=0.05, # Significance threshold cooks_filter=True, # Filter outliers independent_filter=True # Filter low-power tests ) ds.summary()
Contrast specification:
- Format:
[variable, test_level, reference_level] - Example:
tests treated vs control["condition", "treated", "control"] - If
, uses the last coefficient in the designNone
Result DataFrame columns:
: Mean normalized count across samplesbaseMean
: Log2 fold change between conditionslog2FoldChange
: Standard error of LFClfcSE
: Wald test statisticstat
: Raw p-valuepvalue
: Adjusted p-value (FDR-corrected via Benjamini-Hochberg)padj
Step 5: Optional LFC Shrinkage
Apply shrinkage to reduce noise in fold change estimates:
ds.lfc_shrink() # Applies apeGLM shrinkage
When to use LFC shrinkage:
- For visualization (volcano plots, heatmaps)
- For ranking genes by effect size
- When prioritizing genes for follow-up experiments
Important: Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.
Step 6: Result Export
Save results and intermediate objects:
import pickle # Export results as CSV ds.results_df.to_csv("deseq2_results.csv") # Save significant genes only significant = ds.results_df[ds.results_df.padj < 0.05] significant.to_csv("significant_genes.csv") # Save DeseqDataSet for later use with open("dds_result.pkl", "wb") as f: pickle.dump(dds.to_picklable_anndata(), f)
Common Analysis Patterns
Two-Group Comparison
Standard case-control comparison:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition") dds.deseq2() ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary() results = ds.results_df significant = results[results.padj < 0.05]
Multiple Comparisons
Testing multiple treatment groups against control:
dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition") dds.deseq2() treatments = ["treatment_A", "treatment_B", "treatment_C"] all_results = {} for treatment in treatments: ds = DeseqStats(dds, contrast=["condition", treatment, "control"]) ds.summary() all_results[treatment] = ds.results_df sig_count = len(ds.results_df[ds.results_df.padj < 0.05]) print(f"{treatment}: {sig_count} significant genes")
Accounting for Batch Effects
Control for technical variation:
# Include batch in design dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition") dds.deseq2() # Test condition while controlling for batch ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary()
Continuous Covariates
Include continuous variables like age or dosage:
# Ensure continuous variable is numeric metadata["age"] = pd.to_numeric(metadata["age"]) dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition") dds.deseq2() ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary()
Using the Analysis Script
This skill includes a complete command-line script for standard analyses:
# Basic usage python scripts/run_deseq2_analysis.py \ --counts counts.csv \ --metadata metadata.csv \ --design "~condition" \ --contrast condition treated control \ --output results/ # With additional options python scripts/run_deseq2_analysis.py \ --counts counts.csv \ --metadata metadata.csv \ --design "~batch + condition" \ --contrast condition treated control \ --output results/ \ --min-counts 10 \ --alpha 0.05 \ --n-cpus 4 \ --plots
Script features:
- Automatic data loading and validation
- Gene and sample filtering
- Complete DESeq2 pipeline execution
- Statistical testing with customizable parameters
- Result export (CSV, pickle)
- Optional visualization (volcano and MA plots)
Refer users to
scripts/run_deseq2_analysis.py when they need a standalone analysis tool or want to batch process multiple datasets.
Result Interpretation
Identifying Significant Genes
# Filter by adjusted p-value significant = ds.results_df[ds.results_df.padj < 0.05] # Filter by both significance and effect size sig_and_large = ds.results_df[ (ds.results_df.padj < 0.05) & (abs(ds.results_df.log2FoldChange) > 1) ] # Separate up- and down-regulated upregulated = significant[significant.log2FoldChange > 0] downregulated = significant[significant.log2FoldChange < 0] print(f"Upregulated: {len(upregulated)}") print(f"Downregulated: {len(downregulated)}")
Ranking and Sorting
# Sort by adjusted p-value top_by_padj = ds.results_df.sort_values("padj").head(20) # Sort by absolute fold change (use shrunk values) ds.lfc_shrink() ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange) top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20) # Sort by a combined metric ds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange) top_combined = ds.results_df.sort_values("score", ascending=False).head(20)
Quality Metrics
# Check normalization (size factors should be close to 1) print("Size factors:", dds.obsm["size_factors"]) # Examine dispersion estimates import matplotlib.pyplot as plt plt.hist(dds.varm["dispersions"], bins=50) plt.xlabel("Dispersion") plt.ylabel("Frequency") plt.title("Dispersion Distribution") plt.show() # Check p-value distribution (should be mostly flat with peak near 0) plt.hist(ds.results_df.pvalue.dropna(), bins=50) plt.xlabel("P-value") plt.ylabel("Frequency") plt.title("P-value Distribution") plt.show()
Visualization Guidelines
Volcano Plot
Visualize significance vs effect size:
import matplotlib.pyplot as plt import numpy as np results = ds.results_df.copy() results["-log10(padj)"] = -np.log10(results.padj) plt.figure(figsize=(10, 6)) significant = results.padj < 0.05 plt.scatter( results.loc[~significant, "log2FoldChange"], results.loc[~significant, "-log10(padj)"], alpha=0.3, s=10, c='gray', label='Not significant' ) plt.scatter( results.loc[significant, "log2FoldChange"], results.loc[significant, "-log10(padj)"], alpha=0.6, s=10, c='red', label='padj < 0.05' ) plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5) plt.xlabel("Log2 Fold Change") plt.ylabel("-Log10(Adjusted P-value)") plt.title("Volcano Plot") plt.legend() plt.savefig("volcano_plot.png", dpi=300)
MA Plot
Show fold change vs mean expression:
plt.figure(figsize=(10, 6)) plt.scatter( np.log10(results.loc[~significant, "baseMean"] + 1), results.loc[~significant, "log2FoldChange"], alpha=0.3, s=10, c='gray' ) plt.scatter( np.log10(results.loc[significant, "baseMean"] + 1), results.loc[significant, "log2FoldChange"], alpha=0.6, s=10, c='red' ) plt.axhline(0, color='blue', linestyle='--', alpha=0.5) plt.xlabel("Log10(Base Mean + 1)") plt.ylabel("Log2 Fold Change") plt.title("MA Plot") plt.savefig("ma_plot.png", dpi=300)
Troubleshooting Common Issues
Data Format Problems
Issue: "Index mismatch between counts and metadata"
Solution: Ensure sample names match exactly
print("Counts samples:", counts_df.index.tolist()) print("Metadata samples:", metadata.index.tolist()) # Take intersection if needed common = counts_df.index.intersection(metadata.index) counts_df = counts_df.loc[common] metadata = metadata.loc[common]
Issue: "All genes have zero counts"
Solution: Check if data needs transposition
print(f"Counts shape: {counts_df.shape}") # If genes > samples, transpose is needed if counts_df.shape[1] < counts_df.shape[0]: counts_df = counts_df.T
Design Matrix Issues
Issue: "Design matrix is not full rank"
Cause: Confounded variables (e.g., all treated samples in one batch)
Solution: Remove confounded variable or add interaction term
# Check confounding print(pd.crosstab(metadata.condition, metadata.batch)) # Either simplify design or add interaction design = "~condition" # Remove batch # OR design = "~condition + batch + condition:batch" # Model interaction
No Significant Genes
Diagnostics:
# Check dispersion distribution plt.hist(dds.varm["dispersions"], bins=50) plt.show() # Check size factors print(dds.obsm["size_factors"]) # Look at top genes by raw p-value print(ds.results_df.nsmallest(20, "pvalue"))
Possible causes:
- Small effect sizes
- High biological variability
- Insufficient sample size
- Technical issues (batch effects, outliers)
Reference Documentation
For comprehensive details beyond this workflow-oriented guide:
-
API Reference (
): Complete documentation of PyDESeq2 classes, methods, and data structures. Use when needing detailed parameter information or understanding object attributes.references/api_reference.md -
Workflow Guide (
): In-depth guide covering complete analysis workflows, data loading patterns, multi-factor designs, troubleshooting, and best practices. Use when handling complex experimental designs or encountering issues.references/workflow_guide.md
Load these references into context when users need:
- Detailed API documentation:
Read references/api_reference.md - Comprehensive workflow examples:
Read references/workflow_guide.md - Troubleshooting guidance:
(see Troubleshooting section)Read references/workflow_guide.md
Key Reminders
-
Data orientation matters: Count matrices typically load as genes × samples but need to be samples × genes. Always transpose with
if needed..T -
Sample filtering: Remove samples with missing metadata before analysis to avoid errors.
-
Gene filtering: Filter low-count genes (e.g., < 10 total reads) to improve power and reduce computational time.
-
Design formula order: Put adjustment variables before the variable of interest (e.g.,
not"~batch + condition"
)."~condition + batch" -
LFC shrinkage timing: Apply shrinkage after statistical testing and only for visualization/ranking purposes. P-values remain based on unshrunken estimates.
-
Result interpretation: Use
for significance, not raw p-values. The Benjamini-Hochberg procedure controls false discovery rate.padj < 0.05 -
Contrast specification: The format is
where test_level is compared against reference_level.[variable, test_level, reference_level] -
Save intermediate objects: Use pickle to save DeseqDataSet objects for later use or additional analyses without re-running the expensive fitting step.
Installation and Requirements
uv pip install pydeseq2
System requirements:
- Python 3.10-3.11
- pandas 1.4.3+
- numpy 1.23.0+
- scipy 1.11.0+
- scikit-learn 1.1.1+
- anndata 0.8.0+
Optional for visualization:
- matplotlib
- seaborn
Additional Resources
- Official Documentation: https://pydeseq2.readthedocs.io
- GitHub Repository: https://github.com/owkin/PyDESeq2
- Publication: Muzellec et al. (2023) Bioinformatics, DOI: 10.1093/bioinformatics/btad547
- Original DESeq2 (R): Love et al. (2014) Genome Biology, DOI: 10.1186/s13059-014-0550-8