BioSkills bio-chipseq-differential-binding

Identifies differentially bound ChIP-seq regions between conditions using DiffBind (from BAMs), DESeq2, or PyDESeq2 (from count matrices). Handles normalization, statistical testing, and fold-change estimation with ChIP-seq-specific considerations. Use when comparing ChIP-seq binding between experimental conditions.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/chip-seq/differential-binding" ~/.claude/skills/gptomics-bioskills-bio-chipseq-differential-binding && rm -rf "$T"
manifest: chip-seq/differential-binding/SKILL.md
source content

Version Compatibility

Reference examples tested with: DiffBind 3.12+, DESeq2 1.42+, edgeR 4.0+, PyDESeq2 0.4+

Before using code patterns, verify installed versions match. If versions differ:

  • R:
    packageVersion('<pkg>')
    then
    ?function_name
    to verify parameters
  • Python:
    pip show pydeseq2
    then
    help(module.function)
    to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Differential Binding Analysis

"Compare ChIP-seq binding between conditions" → Identify genomic regions with statistically significant differences in transcription factor or histone mark occupancy between experimental groups.

  • R (from BAMs):
    DiffBind::dba()
    dba.count()
    dba.normalize()
    dba.analyze()
  • R (from count matrix):
    DESeq2::DESeqDataSetFromMatrix()
    DESeq()
    results()
  • Python (from count matrix):
    pydeseq2.DeseqDataSet()
    .deseq2()
    DeseqStats()

Choosing an Approach

ScenarioRecommended
Have BAM + peak filesDiffBind (handles consensus peaks, counting, normalization)
Have a count matrix (peaks x samples)DESeq2 directly (R) or PyDESeq2 (Python)
Need background normalizationDiffBind with
background=TRUE
or csaw
Suspect global binding changesCustom size factors or spike-in normalization
Pre-defined regions (promoters, enhancers)DESeq2/PyDESeq2 on counted regions

DiffBind wraps DESeq2/edgeR internally. When a count matrix already exists, DiffBind adds no statistical value — its main contributions (consensus peak generation, summit re-centering, read counting) are already done.

ChIP-seq Normalization

Normalization is the single most consequential analytical decision for differential binding. The choice of reference reads matters far more than algorithm (RLE vs TMM produce essentially identical results on the same reference reads).

Reference Reads

ReferenceHow appliedAssumptionRisk
Reads in peaksDESeq2 default
estimateSizeFactors()
on count matrix
Most peaks NOT differentially boundFails with global changes; can reverse conclusions
Full libraryTotal aligned reads from BAMSequencing depth is main technical variableConservative; DiffBind 3.0+ default
Background bins10-15 kb genomic bins via csawBackground is stable across conditionsRobust to composition bias; requires BAMs
Spike-inExogenous chromatin (e.g., Drosophila)Spike-in ratio is constantGold standard for global changes; requires spike-in protocol

When Default DESeq2 Normalization Works

Running

estimateSizeFactors()
on a ChIP-seq count matrix computes RLE normalization on reads-in-peaks. This works when:

  • Most peaks are expected to be stable (typical TF ChIP-seq with localized changes)
  • Changes are roughly symmetric (similar numbers of gained and lost peaks)
  • Differential peaks are a minority of the total peak set

When It Fails

  • Global binding changes: EZH2 inhibitor reducing H3K27me3, BET inhibitor reducing BRD4, knockdown of the ChIPped factor
  • Asymmetric changes: predominantly gained or lost, not balanced
  • Small peak sets where differential peaks are a large fraction of total

Diagnostic: if the MA plot loess curve deviates substantially from y=0, normalization is not centering correctly. Verify biological plausibility of the gain/loss ratio.

Custom Size Factors

When default normalization is inappropriate:

# Full library size normalization (when total aligned reads are known)
library_sizes <- c(15e6, 14e6, 16e6, 15e6, 13e6, 14e6)
names(library_sizes) <- colnames(counts(dds))
sizeFactors(dds) <- library_sizes / mean(library_sizes)

# Use known stable peaks as normalization reference
stable_idx <- which(rownames(dds) %in% known_stable_peaks)
dds <- estimateSizeFactors(dds, controlGenes = stable_idx)

From Count Matrix (R - DESeq2)

Goal: Identify differentially bound peaks from a pre-computed count matrix using DESeq2's negative binomial framework.

Approach: Load counts into DESeqDataSet, apply minimal ChIP-seq-appropriate filtering, run the DESeq2 pipeline, extract results at the desired significance threshold.

library(DESeq2)

counts <- read.delim('counts.tsv', row.names = 1, check.names = FALSE)
coldata <- data.frame(
    condition = factor(c(rep('treated', 3), rep('control', 3))),
    row.names = colnames(counts)
)

dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition)

# ChIP-seq peaks are already enriched regions; filter less aggressively than RNA-seq
keep <- rowSums(counts(dds)) >= 1
dds <- dds[keep,]

dds$condition <- relevel(dds$condition, ref = 'control')
dds <- DESeq(dds)

# alpha matches intended significance threshold (optimizes independent filtering)
res <- results(dds, alpha = 0.05)

Pre-filtering for ChIP-seq

Unlike RNA-seq where thousands of genes have zero expression, ChIP-seq peaks were called because they have signal. Aggressive filtering can remove truly differential peaks — a peak present in treatment but absent in control will have near-zero counts in control samples.

Feature set sizeRecommended filter
< 500 peaks
rowSums(counts(dds)) >= 1
(remove only all-zero rows)
500-5,000 peaks
rowSums(counts(dds)) >= 5
> 5,000 peaks
rowSums(counts(dds)) >= 10
or standard RNA-seq thresholds

The Alpha Parameter

The

alpha
argument in
results()
controls independent filtering optimization. Setting it to match the intended significance threshold (e.g.,
alpha = 0.05
when filtering at padj < 0.05) maximizes discoveries at that cutoff. Using the default
alpha = 0.1
while filtering at padj < 0.05 is slightly suboptimal but the effect is usually small.

Log Fold Change Shrinkage

LFC shrinkage (apeglm, ashr) improves fold-change estimates for ranking and visualization but does NOT change p-values or padj. Always use padj from

results()
for significance calls.

library(apeglm)
resLFC <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')

From Count Matrix (Python - PyDESeq2)

Goal: Run differential binding analysis in Python using PyDESeq2's implementation of the DESeq2 negative binomial framework.

Approach: Load count data into a DeseqDataSet (samples as rows, features as columns), run the pipeline, extract results with BH-adjusted p-values.

import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
from pydeseq2.default_inference import DefaultInference

counts = pd.read_csv('counts.tsv', sep='\t', index_col=0)
metadata = pd.DataFrame(
    {'condition': ['treated'] * 3 + ['control'] * 3},
    index=counts.columns
)

inference = DefaultInference(n_cpus=4)
dds = DeseqDataSet(counts=counts.T, metadata=metadata, design='~ condition',
                   refit_cooks=True, inference=inference)
dds.deseq2()

ds = DeseqStats(dds, contrast=['condition', 'treated', 'control'], inference=inference)
ds.summary()
results = ds.results_df

PyDESeq2 expects counts with samples as rows and features as columns — transpose a peaks-by-samples matrix with

.T
. The
contrast
argument takes
['variable', 'numerator', 'denominator']
, so positive log2FC means higher in the numerator condition.

DiffBind Workflow (From BAMs)

Goal: Run the complete DiffBind pipeline from BAM and peak files to differential binding results.

Approach: Create a sample sheet linking BAMs, peaks, and metadata; load into DiffBind; count reads in consensus peaks; normalize; run statistical testing.

Sample Sheet

samples <- data.frame(
    SampleID = c('ctrl_1', 'ctrl_2', 'treat_1', 'treat_2'),
    Tissue = c('cell', 'cell', 'cell', 'cell'),
    Factor = c('H3K4me3', 'H3K4me3', 'H3K4me3', 'H3K4me3'),
    Condition = c('control', 'control', 'treatment', 'treatment'),
    Replicate = c(1, 2, 1, 2),
    bamReads = c('ctrl1.bam', 'ctrl2.bam', 'treat1.bam', 'treat2.bam'),
    Peaks = c('ctrl1_peaks.narrowPeak', 'ctrl2_peaks.narrowPeak',
              'treat1_peaks.narrowPeak', 'treat2_peaks.narrowPeak'),
    PeakCaller = c('macs', 'macs', 'macs', 'macs')
)
write.csv(samples, 'samples.csv', row.names = FALSE)

Load, Count, Normalize, Analyze

library(DiffBind)

dba_obj <- dba(sampleSheet = 'samples.csv')
dba_obj <- dba.count(dba_obj, summits = 250, minOverlap = 2)
dba_obj <- dba.normalize(dba_obj)
dba_obj <- dba.contrast(dba_obj, design = '~ Condition')
dba_obj <- dba.analyze(dba_obj, method = DBA_DESEQ2)

DiffBind Normalization Options

# Default: full library size (conservative, recommended starting point)
dba_obj <- dba.normalize(dba_obj)

# Background normalization (robust to composition bias; uses csaw bins)
dba_obj <- dba.normalize(dba_obj, background = TRUE)

# Inspect applied normalization
dba.normalize(dba_obj, bRetrieve = TRUE)
MethodWhen to use
DBA_NORM_LIB
+ full library (default)
Most analyses; conservative baseline
DBA_NORM_RLE
+ reads in peaks
Most peaks expected stable; typical TF ChIP
Background (
background=TRUE
)
Suspected composition bias or global changes
Spike-in (
spikein=TRUE
)
ChIP-Rx experiments with exogenous reference
Loess offsets (
offsets=TRUE
)
Abundance-dependent efficiency biases (use cautiously — can over-normalize)

Results and Export

db_all <- dba.report(dba_obj, th = 1)
results_df <- as.data.frame(db_all)
write.csv(results_df, 'differential_binding.csv', row.names = FALSE)

# Significant peaks at FDR < 0.05
db_sig <- dba.report(dba_obj, th = 0.05)

# Export to BED
library(rtracklayer)
export(db_sig, 'diff_peaks.bed', format = 'BED')

Visualization

dba.plotPCA(dba_obj, DBA_CONDITION, label = DBA_ID)
dba.plotMA(dba_obj)
dba.plotVolcano(dba_obj)
dba.plotHeatmap(dba_obj, contrast = 1, correlations = FALSE)

Multi-Factor Design

# DiffBind with blocking factor
dba_obj <- dba.contrast(dba_obj, design = '~ Batch + Condition')
dba_obj <- dba.analyze(dba_obj)

# DESeq2 directly with batch correction
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                               design = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds, name = 'condition_treated_vs_control')

DiffBind 3.0+ Notes

Key defaults changed in DiffBind 3.0:

  • summits=200
    recenters peaks (was
    FALSE
    ); set
    summits=FALSE
    for broad histone marks
  • dba.normalize()
    is now required before analysis
  • Blacklist filtering applied by default
  • Full library size normalization replaces reads-in-peaks (more conservative)
  • Use design formulas instead of
    group1
    /
    group2
    for contrasts

Sample Sheet Columns

ColumnRequiredDescription
SampleIDYesUnique identifier
ConditionYesExperimental condition
ReplicateYesReplicate number
bamReadsYesPath to BAM file
PeaksYesPath to peak file
PeakCallerYesmacs, bed, narrow
bamControlNoPath to input BAM
TissueNoTissue/cell type
FactorNoChIP target

Related Skills

  • peak-calling - Generate input peak files
  • peak-annotation - Annotate differential peaks to genes
  • differential-expression/deseq2-basics - DESeq2 fundamentals for count-based testing
  • differential-expression/edger-basics - Alternative statistical framework
  • pathway-analysis/go-enrichment - Functional enrichment of peak-associated genes