Encode-toolkit pipeline-chipseq

Execute ENCODE ChIP-seq processing pipeline from FASTQ to peaks and signal tracks. Child of pipeline-guide. Provides stage-by-stage Nextflow execution with Docker containers and cloud deployment. Use when users need to process ChIP-seq data following ENCODE standards, run peak calling with MACS2, perform IDR analysis, or generate signal tracks. Trigger on: ChIP-seq pipeline, run ChIP-seq, process ChIP-seq, MACS2 peak calling, IDR analysis, ChIP-seq FASTQ processing.

install
source · Clone the upstream repo
git clone https://github.com/ammawla/encode-toolkit
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pipeline-chipseq" ~/.claude/skills/ammawla-encode-toolkit-pipeline-chipseq-e8968e && rm -rf "$T"
manifest: skills/pipeline-chipseq/SKILL.md
source content

ENCODE ChIP-seq Pipeline

When to Use

  • User wants to run a ChIP-seq processing pipeline from FASTQ to peaks and signal tracks
  • User asks about "ChIP-seq pipeline", "MACS2", "peak calling", "BWA alignment for ChIP", or "IDR"
  • User needs to process histone or TF ChIP-seq data following ENCODE standards
  • Example queries: "process my ChIP-seq FASTQs", "run the ENCODE ChIP-seq pipeline", "call peaks from ChIP-seq with MACS2 and IDR"

Execute the ENCODE ChIP-seq processing pipeline from raw FASTQ files through peak calling, IDR analysis, and signal track generation. This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform analysis standards.

Overview

The ENCODE ChIP-seq pipeline processes chromatin immunoprecipitation sequencing data through a series of well-defined stages: quality control, adapter trimming, alignment to a reference genome, filtering and deduplication, peak calling with MACS2, replicate consistency analysis via IDR, and signal track generation. Each stage is parameterized according to ENCODE standards and produces QC metrics for comprehensive quality assessment.

This pipeline handles both transcription factor (TF) ChIP-seq and histone modification ChIP-seq, automatically selecting narrow or broad peak calling modes as appropriate.

Key Literature

ReferenceJournalYearDOIRelevance
Landt et al. "ChIP-seq guidelines and practices"Genome Research201210.1101/gr.136184.111ENCODE ChIP-seq standards (~4,000 citations)
ENCODE Project Consortium "Expanded encyclopaedias"Nature202010.1038/s41586-020-2493-4ENCODE Phase 3 standards
Zhang et al. "Model-based Analysis of ChIP-Seq (MACS)"Genome Biology200810.1186/gb-2008-9-9-r137Peak caller (~7,000 citations)
Li et al. "Measuring reproducibility (IDR)"Annals of Applied Statistics201110.1214/11-AOAS466Replicate consistency (~1,500 citations)
Amemiya et al. "ENCODE Blacklist"Scientific Reports201910.1038/s41598-019-45839-zArtifact regions (~1,372 citations)
Ramachandran et al. "phantompeakqualtools"2013NSC/RSC strand correlation metrics

Pipeline Stages

FASTQ ──> FastQC / Trim Galore ──> BWA-MEM ──> Samtools Filter ──> Picard MarkDup
  │                                                                       │
  │           ┌───────────────────────────────────────────────────────────┘
  │           v
  │     Blacklist Filter ──> MACS2 Peak Calling ──> IDR Analysis
  │                                │                     │
  │                                v                     v
  │                         Signal Tracks          QC Report (MultiQC)
  │                          (bigWig)
  v
 Raw QC Report

Stage Summary

StageToolInputOutputReference
1. QC & TrimmingFastQC, Trim GaloreRaw FASTQTrimmed FASTQreferences/01-qc-trimming.md
2. AlignmentBWA-MEMTrimmed FASTQSorted BAMreferences/02-alignment.md
3. FilteringPicard, Samtools, bedtoolsSorted BAMFiltered BAMreferences/03-filtering.md
4. Peak Calling & IDRMACS2, IDRFiltered BAMPeaks (narrowPeak/broadPeak)references/04-analysis.md
5. QC & Signaldeeptools, phantompeakqualtoolsFiltered BAM, PeaksbigWig, QC reportreferences/05-qc-metrics.md

Input Requirements

Required Files

  • Treatment FASTQ: ChIP sample reads (single-end or paired-end, gzipped)
  • Control FASTQ: Input/IgG control reads (matching single-end or paired-end)
  • Reference genome: BWA-indexed genome (GRCh38 for human, mm10 for mouse)

Sample Sheet Format

sample_id,treatment_r1,treatment_r2,control_r1,control_r2,target,peak_type
SAMPLE1,chip_R1.fq.gz,chip_R2.fq.gz,input_R1.fq.gz,input_R2.fq.gz,H3K27ac,narrow
SAMPLE2,chip_R1.fq.gz,chip_R2.fq.gz,input_R1.fq.gz,input_R2.fq.gz,H3K27me3,broad

Narrow vs Broad Peak Mode Decision

Peak TypeTargetsMACS2 Mode
NarrowH3K4me3, H3K4me1, H3K27ac, H3K9ac, all TFs, CTCF
--qvalue 0.05
(default)
BroadH3K27me3, H3K36me3, H3K9me3, H3K79me2
--broad --broad-cutoff 0.1

QC Thresholds

These thresholds follow ENCODE standards established by Landt et al. 2012 and the ENCODE DCC quality metrics documentation.

MetricThresholdCategorySource
Total sequenced reads≥20M (TF), ≥45M (histone)Read depthLandt 2012
Mapping rate>80%AlignmentENCODE
NRF (non-redundant fraction)≥0.8Library complexityENCODE
PBC1 (PCR bottleneck coeff 1)≥0.8Library complexityENCODE
PBC2 (PCR bottleneck coeff 2)≥3Library complexityENCODE
NSC (normalized strand coeff)>1.05Enrichmentphantompeakqualtools
RSC (relative strand corr)>0.8Enrichmentphantompeakqualtools
FRiP (fraction reads in peaks)≥1%Peak qualityLandt 2012
IDR optimal peaks>20,000 (TF)ReproducibilityENCODE
Duplication rate<30%Library complexityENCODE
Mitochondrial fraction<5%Sample qualityENCODE

Interpreting QC: Traffic Light System

ColorMeaningAction
GreenAll metrics passProceed to analysis
Yellow1-2 metrics marginalReview library prep, may be usable
RedMultiple failuresDo not use; re-do experiment

Important: No single metric is sufficient. Interpret QC collectively. A sample with borderline NRF but excellent FRiP may still be usable.

Execution

Quick Start (Local Docker)

nextflow run scripts/main.nf \
  -profile local \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --control 'fastq/input_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --peak_type narrow \
  --outdir results/

SLURM HPC

nextflow run scripts/main.nf \
  -profile slurm \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --control 'fastq/input_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --peak_type narrow \
  --outdir results/

Google Cloud

nextflow run scripts/main.nf \
  -profile gcp \
  --reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
  --control 'gs://bucket/fastq/input_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 'gs://bucket/results/'

AWS Batch

nextflow run scripts/main.nf \
  -profile aws \
  --reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
  --control 's3://bucket/fastq/input_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 's3://bucket/results/'

Cloud Cost Estimates

PlatformInstanceCost/SampleTime/SampleNotes
GCPn1-standard-8~$2-52-4 hoursPreemptible recommended
AWSm5.2xlarge~$2-52-4 hoursSpot instances recommended
Local8 cores, 32GB$03-6 hoursDocker required
SLURM8 cores, 32GBVaries2-4 hoursSingularity recommended

Output Directory Structure

results/
  fastqc/                   # Raw and trimmed QC reports
  trimmed/                  # Trimmed FASTQ files
  aligned/                  # Sorted BAM files
  filtered/                 # Filtered, deduplicated BAM
  peaks/
    narrow/                 # narrowPeak files (TF, active histone marks)
    broad/                  # broadPeak files (repressive marks)
    idr/                    # IDR-filtered reproducible peaks
  signal/
    fold_change/            # Fold change over control (bigWig)
    pvalue/                 # Signal p-value tracks (bigWig)
  qc/
    phantompeakqualtools/   # NSC/RSC strand correlation
    multiqc/                # Aggregated QC report
  logs/                     # Nextflow execution logs

Common Pitfalls

1. Missing Input Control

ChIP-seq requires a matched input (or IgG) control for accurate peak calling. Without it, MACS2 will call peaks against a uniform background model, leading to high false positive rates. Always include a control sample from the same cell type and batch.

2. Narrow vs Broad Peak Mode Mismatch

Using narrow peak calling for broad marks (H3K27me3, H3K36me3) fragments the signal into many small peaks instead of capturing the broad domains. Use

--broad
for these marks. Conversely, broad mode on TF ChIP-seq over-merges distinct binding sites.

3. Adapter Contamination

Short insert libraries may have significant adapter read-through. Always run Trim Galore before alignment. Check FastQC adapter content plots: >5% adapter suggests a problem.

4. PCR Bottleneck

Low-input ChIP-seq libraries may have high duplication rates (>30%). This reduces the effective read depth and inflates apparent signal. Monitor NRF and PBC metrics. If NRF <0.8, consider re-doing library preparation with more input material.

5. Blacklist Region Artifacts

Repetitive and high-signal artifact regions inflate peak counts and FRiP. Always filter against the ENCODE blacklist (Amemiya et al. 2019). The hg38-blacklist.v2.bed contains ~900 regions covering ~40 Mb that should be excluded from all analyses.

Pipeline Scripts

FileDescriptionLines
scripts/main.nf
Nextflow DSL2 pipeline~120
scripts/nextflow.config
Execution profiles (local/slurm/gcp/aws)~60
scripts/Dockerfile
Multi-stage Docker build with all tools~30

ENCODE Data Integration

After running this pipeline on your own data, compare results with ENCODE:

# Find matching ENCODE experiments
encode_search_experiments(
    assay_title="Histone ChIP-seq",
    target="H3K27ac",
    organ="pancreas",
    biosample_type="tissue"
)

# Download ENCODE peaks for comparison
encode_batch_download(
    download_dir="/data/encode_reference/",
    output_type="IDR thresholded peaks",
    target="H3K27ac",
    organ="pancreas",
    assembly="GRCh38"
)

Pitfalls & Edge Cases

  • Input control is mandatory: ChIP-seq without an input/IgG control produces unreliable peaks. MACS2
    -c
    flag requires a matched control BAM. Never skip this — enrichment without background correction is meaningless.
  • Broad vs narrow peak mode: H3K27me3, H3K36me3, H3K9me3 require
    --broad
    flag in MACS2. Using narrow mode on broad marks fragments them into thousands of small, biologically meaningless peaks.
  • Duplicate marking ≠ duplicate removal: Mark duplicates with Picard but do NOT remove them before IDR. IDR needs duplicates to estimate signal reproducibility. Remove only after IDR filtering.
  • Cross-correlation QC can mislead: NSC/RSC values depend on fragment length distribution. Deeply sequenced libraries can have high NSC but poor enrichment. Always check FRiP alongside NSC/RSC.
  • IDR requires biological replicates: IDR is designed for true biological replicates, not technical replicates or pseudoreplicates. Using pseudoreplicates gives artificially high concordance and inflated peak counts.
  • Blacklist filtering order matters: Apply ENCODE Blacklist v2 AFTER peak calling, not before alignment. Removing blacklist reads before alignment can distort fragment size estimation and normalization.

Walkthrough: Processing ENCODE H3K27ac ChIP-seq from FASTQ to Peaks

Goal: Process raw H3K27ac ChIP-seq FASTQ files through the ENCODE uniform processing pipeline to generate IDR-thresholded peak calls and signal tracks. Context: The ENCODE ChIP-seq pipeline standardizes processing with BWA-MEM alignment, MACS2 peak calling, and IDR reproducibility analysis.

Step 1: Find the experiment and download FASTQs

encode_get_experiment(accession="ENCSR000AKA")

Expected output:

{
  "accession": "ENCSR000AKA",
  "assay_title": "Histone ChIP-seq",
  "target": "H3K27ac",
  "biosample_summary": "GM12878",
  "replicates": 2,
  "status": "released"
}

Step 2: List FASTQ files

encode_list_files(accession="ENCSR000AKA", file_format="fastq")

Expected output:

{
  "files": [
    {"accession": "ENCFF001FQ1", "output_type": "reads", "paired_end": "1", "biological_replicates": [1], "file_size_mb": 2400},
    {"accession": "ENCFF002FQ2", "output_type": "reads", "paired_end": "2", "biological_replicates": [1], "file_size_mb": 2500},
    {"accession": "ENCFF003FQ3", "output_type": "reads", "paired_end": "1", "biological_replicates": [2], "file_size_mb": 2200},
    {"accession": "ENCFF004FQ4", "output_type": "reads", "paired_end": "2", "biological_replicates": [2], "file_size_mb": 2300}
  ]
}

Interpretation: 2 biological replicates, each paired-end. Both replicates are needed for IDR analysis.

Step 3: Download FASTQs

encode_download_files(accessions=["ENCFF001FQ1", "ENCFF002FQ2", "ENCFF003FQ3", "ENCFF004FQ4"], download_dir="/data/chipseq/fastq")

Step 4: Run the ChIP-seq pipeline

nextflow run pipeline-chipseq/main.nf \
  --fastq_r1 ENCFF001FQ1.fastq.gz,ENCFF003FQ3.fastq.gz \
  --fastq_r2 ENCFF002FQ2.fastq.gz,ENCFF004FQ4.fastq.gz \
  --genome GRCh38 \
  --target H3K27ac \
  --broad_peak false \
  --blacklist encode_blacklist_v2.bed \
  -profile docker

Step 5: Validate output quality

Key QC metrics to check:

MetricThresholdPurpose
FRiP>= 1%Fraction of reads in peaks
NSC> 1.05Normalized strand coefficient
RSC> 0.8Relative strand coefficient
NRF>= 0.8Non-redundant fraction

Step 6: Log provenance

encode_log_derived_file(
  source_accessions=["ENCFF001FQ1", "ENCFF002FQ2", "ENCFF003FQ3", "ENCFF004FQ4"],
  derived_file="/data/chipseq/peaks/idr_peaks.narrowPeak",
  description="IDR-thresholded H3K27ac peaks from ENCODE pipeline",
  tool="ENCODE ChIP-seq pipeline v2.0 (BWA 0.7.17, MACS2 2.2.9.1, IDR 2.0.3)"
)

Integration with downstream skills

  • IDR peaks feed into -> peak-annotation for gene assignment
  • Signal tracks (bigWig) feed into -> visualization-workflow for genome browser display
  • Peak coordinates feed into -> histone-aggregation for cross-experiment union merge
  • QC metrics evaluated by -> quality-assessment against ENCODE standards
  • Pipeline provenance logged by -> data-provenance

Code Examples

1. Find ChIP-seq data to process

encode_search_experiments(
  assay_title="Histone ChIP-seq",
  organ="liver",
  target="H3K4me3"
)

Expected output:

{
  "total": 15,
  "experiments": [
    {
      "accession": "ENCSR456LIV",
      "target": "H3K4me3-human",
      "biosample_summary": "liver tissue male adult (54 years)",
      "status": "released"
    }
  ]
}

2. Download FASTQ files for pipeline input

encode_download_files(
  accessions=["ENCSR456LIV"],
  file_format="fastq",
  download_dir="/data/chipseq/liver_h3k4me3"
)

Expected output:

{
  "downloaded": 4,
  "total_size_mb": 12450.5,
  "files": [
    {"accession": "ENCFF001REP1", "md5_verified": true, "paired_end": "1"},
    {"accession": "ENCFF002REP1", "md5_verified": true, "paired_end": "2"}
  ]
}

Integration

This skill produces...Feed into...Purpose
IDR-thresholded peaks (narrowPeak)peak-annotationAssign peaks to nearest genes
IDR-thresholded peakshistone-aggregationCross-experiment union merge for histone marks
Signal tracks (bigWig)visualization-workflowGenome browser visualization
Peak coordinates (BED)motif-analysisDe novo motif discovery in peak regions
Filtered peaksregulatory-elementsClassify as enhancers, promoters, insulators
QC metricsquality-assessmentValidate against ENCODE ChIP-seq standards
Pipeline run logsdata-provenanceRecord tool versions and parameters
Peak filesvariant-annotationIdentify variants in ChIP-seq peaks

Related Skills

  • pipeline-guide (parent): General pipeline selection and resource assessment
  • histone-aggregation: Merge peaks across samples/replicates after peak calling
  • quality-assessment: Deep-dive QC analysis beyond basic metrics
  • regulatory-elements: Annotate peaks with regulatory element classifications
  • peak-annotation: Annotate peaks with gene associations
  • compare-biosamples: Compare ChIP-seq profiles across cell types
  • publication-trust: Verify literature claims backing analytical decisions

Presenting Results

When reporting ChIP-seq pipeline results:

  • Pipeline status: Report completion status for each stage (QC, alignment, filtering, peak calling, IDR, signal generation) with pass/fail indicators
  • Key QC metrics: Present FRiP (>=1%), NSC (>1.05), RSC (>0.8), NRF (>=0.8), duplication rate, and mapping rate in a summary table with ENCODE thresholds for comparison
  • Peak counts: Report IDR optimal peak count, conservative peak count, and pseudoreplicated peak count. Note narrow vs broad mode used
  • Signal tracks: Provide paths to fold-change-over-control bigWig and p-value bigWig files for genome browser visualization
  • Traffic light summary: Use green/yellow/red to indicate overall sample quality based on collective QC assessment
  • Output paths: List the key output directories (peaks/idr/, signal/fold_change/, qc/multiqc/)
  • Next steps: Suggest
    quality-assessment
    for deeper QC evaluation, or
    visualization-workflow
    for genome browser session generation

For the request: "$ARGUMENTS"