Medical-research-skills pydeseq2

Differential gene expression analysis for bulk RNA-seq count matrices using a DESeq2-like workflow in Python; use when you need Wald tests, FDR correction, and optional LFC shrinkage for condition/batch/covariate designs.

install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/pydeseq2" ~/.claude/skills/aipoch-medical-research-skills-pydeseq2 && rm -rf "$T"
manifest: scientific-skills/Data Analysis/pydeseq2/SKILL.md
source content

Source: https://github.com/aipoch/medical-research-skills

When to Use

Use this skill when you need to run DESeq2-style differential expression in Python, especially in these scenarios:

  1. Case vs control bulk RNA-seq from a raw integer count matrix (e.g., treated vs control).
  2. Multi-factor designs to adjust for batch effects or covariates (e.g.,
    ~ batch + condition
    ,
    ~ age + condition
    ).
  3. DESeq2 migration when converting an R DESeq2 workflow into a Python pipeline.
  4. Pipeline integration where results must stay in Python objects (pandas/AnnData) for downstream QC, plots, or reporting.
  5. Requests mentioning “DESeq2”, “differential expression”, “Wald test”, “FDR/padj”, “volcano plot”, “MA plot”, or “PyDESeq2”.

Key Features

  • End-to-end DESeq2-like workflow: normalization (size factors), dispersion estimation/shrinkage, LFC fitting, outlier handling.
  • Wald tests for differential expression with Benjamini–Hochberg FDR (
    padj
    ).
  • Design formulas in Wilkinson/R-style notation (single-factor and multi-factor).
  • Contrast-based comparisons:
    [variable, test_group, reference_group]
    .
  • Optional Cook’s distance outlier filtering and refitting.
  • Optional LFC shrinkage (apeGLM-style) for visualization/ranking.
  • Works naturally with pandas and can interoperate with AnnData.

Dependencies

Minimum environment (as documented in the source material):

  • Python 3.10–3.11
  • pydeseq2
    (install via pip/uv)
  • pandas
    >= 1.4.3
  • numpy
    >= 1.23.0
  • scipy
    >= 1.11.0
  • scikit-learn
    >= 1.1.1
  • anndata
    >= 0.8.0 (optional, for AnnData I/O)

Optional plotting:

  • matplotlib
    (recommended)
  • seaborn
    (optional)

Installation:

uv pip install pydeseq2

Example Usage

The following script is a complete, runnable example for a standard treated-vs-control analysis.

import pandas as pd
import numpy as np

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# -----------------------------
# 1) Load inputs
# -----------------------------
# counts.csv is commonly stored as genes x samples; transpose to samples x genes.
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)

# Ensure sample alignment
common = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common]
metadata = metadata.loc[common]

# -----------------------------
# 2) Basic filtering
# -----------------------------
# Remove genes with very low total counts
min_total_counts = 10
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= min_total_counts]
counts_df = counts_df[genes_to_keep]

# Drop samples with missing condition
metadata = metadata.dropna(subset=["condition"])
counts_df = counts_df.loc[metadata.index]

# -----------------------------
# 3) Fit DESeq2 model
# -----------------------------
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~ condition",
    refit_cooks=True,
    n_cpus=1,
)
dds.deseq2()

# -----------------------------
# 4) Wald test with contrast
# -----------------------------
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],
    alpha=0.05,
    cooks_filter=True,
    independent_filter=True,
)
ds.summary()

# -----------------------------
# 5) Results + optional shrinkage
# -----------------------------
res = ds.results_df.copy()
sig = res[res["padj"] < 0.05].sort_values("padj")
print(f"Significant genes (padj < 0.05): {len(sig)}")

# Optional: shrink LFC for visualization/ranking (p-values do not change)
ds.lfc_shrink()
res_shrunk = ds.results_df.copy()

# Export
res.to_csv("deseq2_results.csv")
res_shrunk.to_csv("deseq2_results_shrunk_lfc.csv")
sig.to_csv("significant_genes.csv")

# -----------------------------
# 6) Minimal volcano plot (optional)
# -----------------------------
try:
    import matplotlib.pyplot as plt

    plot_df = res.copy()
    plot_df["neglog10_padj"] = -np.log10(plot_df["padj"].clip(lower=1e-300))
    is_sig = plot_df["padj"] < 0.05

    plt.figure(figsize=(9, 5))
    plt.scatter(
        plot_df.loc[~is_sig, "log2FoldChange"],
        plot_df.loc[~is_sig, "neglog10_padj"],
        s=10,
        alpha=0.3,
        c="gray",
        label="Not significant",
    )
    plt.scatter(
        plot_df.loc[is_sig, "log2FoldChange"],
        plot_df.loc[is_sig, "neglog10_padj"],
        s=10,
        alpha=0.6,
        c="red",
        label="padj < 0.05",
    )
    plt.axhline(-np.log10(0.05), linestyle="--", color="blue", alpha=0.5)
    plt.xlabel("Log2 Fold Change")
    plt.ylabel("-Log10(adjusted p-value)")
    plt.title("Volcano Plot")
    plt.legend()
    plt.tight_layout()
    plt.savefig("volcano_plot.png", dpi=300)
except ImportError:
    pass

Implementation Details

Inputs and orientation

  • Counts matrix must be samples × genes with non-negative integer counts.
  • Many files are stored as genes × samples; transpose with
    .T
    after loading.

Design formula (Wilkinson/R-style)

  • Use strings like:
    • ~ condition
      (single factor)
    • ~ batch + condition
      (batch-adjusted)
    • ~ age + condition
      (continuous covariate)
    • ~ group + condition + group:condition
      (interaction)
  • Put adjustment variables first (e.g.,
    ~ batch + condition
    ) so the primary effect is interpreted cleanly.

What
dds.deseq2()
does (high level)

The fitting pipeline typically includes:

  1. Size factor estimation (library-size normalization)
  2. Gene-wise dispersion estimation
  3. Dispersion trend fitting and prior estimation
  4. MAP dispersion shrinkage
  5. Log2 fold change fitting under the specified design
  6. Cook’s distance outlier detection
  7. Optional refitting after outlier handling (
    refit_cooks=True
    )

Statistical testing and multiple testing correction

  • DeseqStats(...).summary()
    runs Wald tests for the requested coefficient/contrast.
  • Output columns commonly include:
    • baseMean
      : mean normalized expression
    • log2FoldChange
      ,
      lfcSE
      ,
      stat
    • pvalue
      : raw p-value
    • padj
      : Benjamini–Hochberg FDR adjusted p-value
  • Use
    padj < alpha
    (commonly 0.05) for significance.

Contrast specification

  • Format:
    contrast=["variable", "test_group", "reference_group"]
  • Example:
    ["condition", "treated", "control"]
    tests treated relative to control.

LFC shrinkage (optional)

  • ds.lfc_shrink()
    applies shrinkage to log2FoldChange for more stable ranking/plots.
  • Shrinkage is intended for visualization and prioritization; statistical significance is still based on the (unshrunken) Wald test p-values.

Notes on bundled references/scripts

If your repository includes them, use:

  • references/api_reference.md
    for parameter/object details.
  • references/workflow_guide.md
    for extended workflows and troubleshooting.
  • scripts/run_deseq2_analysis.py
    for a CLI-style batch workflow (counts/metadata/design/contrast/output, optional plots).