Medical-research-skills pydeseq2

Differential gene expression analysis for bulk RNA-seq count matrices using a DESeq2-like workflow in Python; use when you need Wald tests, FDR correction, and optional LFC shrinkage for condition/batch/covariate designs.

install

source · Clone the upstream repo

git clone https://github.com/aipoch/medical-research-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/pydeseq2" ~/.claude/skills/aipoch-medical-research-skills-pydeseq2 && rm -rf "$T"

manifest: scientific-skills/Data Analysis/pydeseq2/SKILL.md

source content

Source: https://github.com/aipoch/medical-research-skills

When to Use

Use this skill when you need to run DESeq2-style differential expression in Python, especially in these scenarios:

Case vs control bulk RNA-seq from a raw integer count matrix (e.g., treated vs control).
Multi-factor designs to adjust for batch effects or covariates (e.g.,
```
~ batch + condition
```
,
```
~ age + condition
```
).
DESeq2 migration when converting an R DESeq2 workflow into a Python pipeline.
Pipeline integration where results must stay in Python objects (pandas/AnnData) for downstream QC, plots, or reporting.
Requests mentioning “DESeq2”, “differential expression”, “Wald test”, “FDR/padj”, “volcano plot”, “MA plot”, or “PyDESeq2”.

Key Features

End-to-end DESeq2-like workflow: normalization (size factors), dispersion estimation/shrinkage, LFC fitting, outlier handling.
Wald tests for differential expression with Benjamini–Hochberg FDR (
```
padj
```
).
Design formulas in Wilkinson/R-style notation (single-factor and multi-factor).
Contrast-based comparisons:
```
[variable, test_group, reference_group]
```
.
Optional Cook’s distance outlier filtering and refitting.
Optional LFC shrinkage (apeGLM-style) for visualization/ranking.
Works naturally with pandas and can interoperate with AnnData.

Dependencies

Minimum environment (as documented in the source material):

Python 3.10–3.11
```
pydeseq2
```
(install via pip/uv)
```
pandas
```
>= 1.4.3
```
numpy
```
>= 1.23.0
```
scipy
```
>= 1.11.0
```
scikit-learn
```
>= 1.1.1
```
anndata
```
>= 0.8.0 (optional, for AnnData I/O)

Optional plotting:

```
matplotlib
```
(recommended)
```
seaborn
```
(optional)

Installation:

uv pip install pydeseq2

Example Usage

The following script is a complete, runnable example for a standard treated-vs-control analysis.

import pandas as pd
import numpy as np

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# -----------------------------
# 1) Load inputs
# -----------------------------
# counts.csv is commonly stored as genes x samples; transpose to samples x genes.
counts_df = pd.read_csv("counts.csv", index_col=0).T
metadata = pd.read_csv("metadata.csv", index_col=0)

# Ensure sample alignment
common = counts_df.index.intersection(metadata.index)
counts_df = counts_df.loc[common]
metadata = metadata.loc[common]

# -----------------------------
# 2) Basic filtering
# -----------------------------
# Remove genes with very low total counts
min_total_counts = 10
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= min_total_counts]
counts_df = counts_df[genes_to_keep]

# Drop samples with missing condition
metadata = metadata.dropna(subset=["condition"])
counts_df = counts_df.loc[metadata.index]

# -----------------------------
# 3) Fit DESeq2 model
# -----------------------------
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~ condition",
    refit_cooks=True,
    n_cpus=1,
)
dds.deseq2()

# -----------------------------
# 4) Wald test with contrast
# -----------------------------
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],
    alpha=0.05,
    cooks_filter=True,
    independent_filter=True,
)
ds.summary()

# -----------------------------
# 5) Results + optional shrinkage
# -----------------------------
res = ds.results_df.copy()
sig = res[res["padj"] < 0.05].sort_values("padj")
print(f"Significant genes (padj < 0.05): {len(sig)}")

# Optional: shrink LFC for visualization/ranking (p-values do not change)
ds.lfc_shrink()
res_shrunk = ds.results_df.copy()

# Export
res.to_csv("deseq2_results.csv")
res_shrunk.to_csv("deseq2_results_shrunk_lfc.csv")
sig.to_csv("significant_genes.csv")

# -----------------------------
# 6) Minimal volcano plot (optional)
# -----------------------------
try:
    import matplotlib.pyplot as plt

    plot_df = res.copy()
    plot_df["neglog10_padj"] = -np.log10(plot_df["padj"].clip(lower=1e-300))
    is_sig = plot_df["padj"] < 0.05

    plt.figure(figsize=(9, 5))
    plt.scatter(
        plot_df.loc[~is_sig, "log2FoldChange"],
        plot_df.loc[~is_sig, "neglog10_padj"],
        s=10,
        alpha=0.3,
        c="gray",
        label="Not significant",
    )
    plt.scatter(
        plot_df.loc[is_sig, "log2FoldChange"],
        plot_df.loc[is_sig, "neglog10_padj"],
        s=10,
        alpha=0.6,
        c="red",
        label="padj < 0.05",
    )
    plt.axhline(-np.log10(0.05), linestyle="--", color="blue", alpha=0.5)
    plt.xlabel("Log2 Fold Change")
    plt.ylabel("-Log10(adjusted p-value)")
    plt.title("Volcano Plot")
    plt.legend()
    plt.tight_layout()
    plt.savefig("volcano_plot.png", dpi=300)
except ImportError:
    pass

Implementation Details

Inputs and orientation

Counts matrix must be samples × genes with non-negative integer counts.
Many files are stored as genes × samples; transpose with
```
.T
```
after loading.

Design formula (Wilkinson/R-style)

Use strings like:

```
~ condition
```
(single factor)
```
~ batch + condition
```
(batch-adjusted)
```
~ age + condition
```
(continuous covariate)
```
~ group + condition + group:condition
```
(interaction)

Put adjustment variables first (e.g.,
```
~ batch + condition
```
) so the primary effect is interpreted cleanly.

What

dds.deseq2()

does (high level)

The fitting pipeline typically includes:

Size factor estimation (library-size normalization)
Gene-wise dispersion estimation
Dispersion trend fitting and prior estimation
MAP dispersion shrinkage
Log2 fold change fitting under the specified design
Cook’s distance outlier detection
Optional refitting after outlier handling (
```
refit_cooks=True
```
)

Statistical testing and multiple testing correction

```
DeseqStats(...).summary()
```
runs Wald tests for the requested coefficient/contrast.
Output columns commonly include:
- ```
baseMean
```
  : mean normalized expression
- ```
log2FoldChange
```
  ,
```
lfcSE
```
  ,
```
stat
```
- ```
pvalue
```
  : raw p-value
- ```
padj
```
  : Benjamini–Hochberg FDR adjusted p-value
Use
```
padj < alpha
```
(commonly 0.05) for significance.

Contrast specification

Format:

contrast=["variable", "test_group", "reference_group"]

Example:
```
["condition", "treated", "control"]
```
tests treated relative to control.

LFC shrinkage (optional)

```
ds.lfc_shrink()
```
applies shrinkage to log2FoldChange for more stable ranking/plots.
Shrinkage is intended for visualization and prioritization; statistical significance is still based on the (unshrunken) Wald test p-values.

Notes on bundled references/scripts

If your repository includes them, use:

```
references/api_reference.md
```
for parameter/object details.
```
references/workflow_guide.md
```
for extended workflows and troubleshooting.
```
scripts/run_deseq2_analysis.py
```
for a CLI-style batch workflow (counts/metadata/design/contrast/output, optional plots).