Encode-toolkit pipeline-cutandrun

Execute the CUT&RUN processing pipeline from FASTQ to peaks and signal tracks. Child of pipeline-guide. Provides Nextflow execution with Docker and cloud deployment. Use when processing CUT&RUN or CUT&Tag data, which are lower-background alternatives to ChIP-seq. Trigger on: CUT&RUN pipeline, CUT&Tag, SEACR, Henikoff, targeted chromatin, pA-MNase, process CUT&RUN.

install
source · Clone the upstream repo
git clone https://github.com/ammawla/encode-toolkit
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pipeline-cutandrun" ~/.claude/skills/ammawla-encode-toolkit-pipeline-cutandrun-6a9d8d && rm -rf "$T"
manifest: skills/pipeline-cutandrun/SKILL.md

ENCODE CUT&RUN Pipeline: FASTQ to Peaks and Signal Tracks

When to Use

  • User wants to run a CUT&RUN or CUT&Tag processing pipeline from FASTQ to peaks
  • User asks about "CUT&RUN pipeline", "CUT&Tag", "SEACR", "spike-in normalization", or "targeted chromatin"
  • User needs to process CUT&RUN/CUT&Tag data with spike-in calibration and SEACR peak calling
  • Example queries: "process my CUT&RUN FASTQs", "run SEACR on CUT&Tag data", "normalize CUT&RUN with spike-in controls"

Execute the CUT&RUN/CUT&Tag processing pipeline for targeted chromatin profiling, producing peak calls with SEACR and spike-in normalized signal tracks.

Pipeline Overview

FASTQ -> Trim -> Bowtie2 align (genome) -> Filter/dedup -> SEACR peaks
                     |                          |              |
              Bowtie2 align (spike-in)   Spike-in normalize  Signal tracks
                     |
              Scale factor calculation

ENCODE Repository

  • GitHub: ENCODE-DCC/cutandrun-pipeline
  • Container: encodedcc/cutandrun-pipeline
  • This skill: Nextflow DSL2 reimplementation for portability

Core Tools and Versions

| Tool | Version | Purpose | Citation |
|---|---|---|---|
| Bowtie2 | 2.5.3 | Alignment (genome + spike-in) | Langmead & Salzberg 2012 |
| SEACR | 1.3 | Peak calling (CUT&RUN-specific) | Meers et al. 2019 |
| MACS2 | 2.2.9.1 | Alternative peak caller | Zhang et al. 2008 |
| Picard | 3.1.1 | Duplicate marking | Broad Institute |
| samtools | 1.19 | BAM operations | Li et al. 2009 |
| bedtools | 2.31.0 | Genomic arithmetic | Quinlan & Hall 2010 |
| deepTools | 3.5.4 | Signal track generation | Ramirez et al. 2016 |
| FastQC | 0.12.1 | Read quality | Andrews (Babraham) |
| MultiQC | 1.21 | Aggregated QC | Ewels et al. 2016 |

Key Literature

  1. Skene & Henikoff 2017 - "An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites" (eLife, ~1,500 citations) DOI: 10.7554/eLife.21856

  2. Meers et al. 2019 - "Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling" (Epigenetics & Chromatin, ~800 citations) DOI: 10.1186/s13072-019-0287-4

  3. Kaya-Okur et al. 2019 - "CUT&Tag for efficient epigenomic profiling of small samples and single cells" (Nature Communications, ~1,200 citations) DOI: 10.1038/s41467-019-09982-5

  4. Nordin et al. 2023 - "The CUT&RUN suspect list of problematic regions" (Genome Biology) DOI: 10.1186/s13059-023-02960-3

  5. Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z

Execution

Quick Start (Local)

nextflow run main.nf \
    -profile local \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bowtie2_index '/ref/bowtie2_index/genome' \
    --spikein_index '/ref/bowtie2_ecoli/ecoli' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --blacklist '/ref/hg38-blacklist.v2.bed' \
    --outdir results/ \
    -resume

SLURM HPC

nextflow run main.nf \
    -profile slurm \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bowtie2_index '/ref/bowtie2_index/genome' \
    --spikein_index '/ref/bowtie2_ecoli/ecoli' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --blacklist '/ref/hg38-blacklist.v2.bed' \
    --outdir results/ \
    -resume

Cloud (GCP / AWS)

nextflow run main.nf \
    -profile gcp \
    --reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
    --bowtie2_index 'gs://bucket/ref/bowtie2_index/genome' \
    --spikein_index 'gs://bucket/ref/bowtie2_ecoli/ecoli' \
    --chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
    --blacklist 'gs://bucket/ref/hg38-blacklist.v2.bed' \
    --outdir 'gs://bucket/results/' \
    -resume

Resource Requirements

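| Step | CPUs | RAM | Time (per sample) |
|---|---|---|---|
| Bowtie2 align (genome) | 8 | 8 GB | 30-60 min |
| Bowtie2 align (spike-in) | 4 | 4 GB | 10-20 min |
| Filter/dedup | 4 | 8 GB | 15-30 min |
| SEACR peaks | 2 | 4 GB | 10-20 min |
| Signal tracks | 4 | 8 GB | 15-30 min |
| Total | 8 | 8 GB | 1.5-3 hours |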

Pipeline Parameters

| Parameter | Default | Description |
|---|---|---|
| --reads | required | Glob pattern to paired FASTQ files |
| --bowtie2_index | required | Bowtie2 genome index prefix |
| --spikein_index | required | Bowtie2 E. coli spike-in index prefix |
| --chrom_sizes | required | Chromosome sizes file |
| --blacklist | required | ENCODE blacklist BED file |
| --outdir | ./results | Output directory |
| --seacr_mode | stringent | SEACR mode: stringent or relaxed |
| --seacr_norm | norm | SEACR normalization: norm or non |
| --control | null | IgG control BAM (if available) |
| --peak_caller | seacr | Peak caller: seacr or macs2 or both |
| --skip_spikein | false | Skip spike-in normalization |
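
As a rough illustration of how the parameters above combine into one invocation, the following Python sketch checks the required flags and the enumerated options before assembling the command line. The helper name `build_command`, the `REQUIRED` and `CHOICES` constants, and the choice to pass booleans as lowercase strings are assumptions made for this example; they are not part of the pipeline.

```python
import shlex

# Flags marked "required" in the parameter table.
REQUIRED = ("reads", "bowtie2_index", "spikein_index", "chrom_sizes", "blacklist")
# Allowed values for the enumerated options in the parameter table.
CHOICES = {
    "seacr_mode": {"stringent", "relaxed"},
    "seacr_norm": {"norm", "non"},
    "peak_caller": {"seacr", "macs2", "both"},
}


def build_command(params, profile="local", script="main.nf"):
    """Return the nextflow argument list for a dict of pipeline parameters.

    Hypothetical helper: it only validates and formats, it does not run anything.
    """
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    for name, allowed in CHOICES.items():
        if name in params and params[name] not in allowed:
            raise ValueError(f"--{name} must be one of {sorted(allowed)}")
    cmd = ["nextflow", "run", script, "-profile", profile]
    for name, value in params.items():
        if isinstance(value, bool):
            # Assumption: booleans are passed as lowercase strings.
            value = "true" if value else "false"
        cmd += [f"--{name}", str(value)]
    cmd.append("-resume")
    return cmd


params = {
    "reads": "/data/fastq/*_R{1,2}.fastq.gz",
    "bowtie2_index": "/ref/bowtie2_index/genome",
    "spikein_index": "/ref/bowtie2_ecoli/ecoli",
    "chrom_sizes": "/ref/hg38.chrom.sizes",
    "blacklist": "/ref/hg38-blacklist.v2.bed",
    "seacr_mode": "relaxed",
}
cmd = build_command(params)
print(shlex.join(cmd))  # shell-quoted, so the glob is not expanded by the shell
```

Returning a list rather than a string keeps the glob pattern intact if the command is later handed to `subprocess.run` without a shell.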

Output Files

results/
  fastqc/                           # Raw read quality
  alignment/
    {sample}.filtered.bam           # Filtered, deduplicated BAM
    {sample}.filtered.bam.bai
  spikein/
    {sample}.spikein_counts.txt     # Spike-in read counts
    {sample}.scale_factor.txt       # Computed scale factor
  peaks/
    {sample}.seacr.stringent.bed    # SEACR stringent peaks
    {sample}.seacr.relaxed.bed      # SEACR relaxed peaks
    {sample}.macs2_peaks.narrowPeak # MACS2 peaks (if requested)
  signal/
    {sample}.normalized.bw          # Spike-in normalized signal
    {sample}.fragments.bed          # Fragment BED file
  qc/
    {sample}.flagstat.txt
    {sample}.fragment_sizes.txt
    {sample}.frip.txt
  multiqc/
    multiqc_report.html

QC Thresholds

| Metric | Pass | Warning | Fail |
|---|---|---|---|
| Mapping rate (genome) | >80% | 60-80% | <60% |
| Spike-in reads | 1-10% of total | 0.1-1% or 10-30% | <0.1% or >30% |
| Duplication rate | <20% | 20-40% | >40% |
| FRiP (peaks) | >10% | 5-10% | <5% |
| Peak count | >5,000 | 1,000-5,000 | <1,000 |
| Fragment size | Nucleosomal pattern | Irregular | No pattern |
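
The numeric rows of the threshold table can be applied mechanically. The sketch below is one possible way to do that in Python; the function names, the metric keys, and the treatment of values that fall exactly on a boundary (assigned to the warning band, or to pass for the 1% and 10% spike-in limits) are assumptions for this example. The fragment-size row is qualitative and is left out.

```python
def higher_is_better(value, warn_floor, pass_floor):
    """pass above pass_floor, warning within [warn_floor, pass_floor], else fail."""
    if value > pass_floor:
        return "pass"
    return "warning" if value >= warn_floor else "fail"


def lower_is_better(value, pass_ceiling, warn_ceiling):
    """pass below pass_ceiling, warning within [pass_ceiling, warn_ceiling], else fail."""
    if value < pass_ceiling:
        return "pass"
    return "warning" if value <= warn_ceiling else "fail"


def spikein_status(pct):
    """Spike-in fraction is two-sided: too few and too many reads are both flagged."""
    if 1 <= pct <= 10:
        return "pass"
    return "warning" if 0.1 <= pct <= 30 else "fail"


def qc_report(metrics):
    """Map a dict of metrics (hypothetical key names) to pass/warning/fail labels."""
    return {
        "mapping_rate": higher_is_better(metrics["mapping_rate_pct"], 60, 80),
        "spikein": spikein_status(metrics["spikein_pct"]),
        "duplication": lower_is_better(metrics["duplication_pct"], 20, 40),
        "frip": higher_is_better(metrics["frip_pct"], 5, 10),
        "peak_count": higher_is_better(metrics["peak_count"], 1000, 5000),
    }


# Hypothetical metrics for one sample.
report = qc_report({
    "mapping_rate_pct": 92.5,
    "spikein_pct": 0.4,
    "duplication_pct": 18.0,
    "frip_pct": 7.5,
    "peak_count": 800,
})
print(report)
```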

Fragment Size Distribution

CUT&RUN produces a characteristic nucleosomal ladder:

  • <120 bp: Sub-nucleosomal (TF binding)
  • ~150 bp: Mononucleosomal (histone marks)
  • ~300 bp: Dinucleosomal
  • Absence of nucleosomal pattern suggests protocol issues
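
To check which fragment class dominates, the size classes listed above can be turned into bins. This is a minimal Python sketch; only the 120 bp cut-off comes from the list, while the 250 bp and 450 bp boundaries separating mono-, di-, and larger fragments are assumptions chosen for illustration, as are the function names.

```python
from collections import Counter


def fragment_class(length_bp):
    """Assign a fragment length to a nucleosomal class.

    The 250 and 450 bp boundaries are illustrative assumptions.
    """
    if length_bp < 120:
        return "sub-nucleosomal"
    if length_bp < 250:
        return "mononucleosomal"
    if length_bp < 450:
        return "dinucleosomal"
    return "larger"


def class_fractions(lengths):
    """Return the fraction of fragments in each class."""
    if not lengths:
        raise ValueError("no fragment lengths given")
    counts = Counter(fragment_class(n) for n in lengths)
    return {name: count / len(lengths) for name, count in counts.items()}


# Hypothetical fragment lengths in bp.
fractions = class_fractions([80, 95, 150, 160, 310])
dominant = max(fractions, key=fractions.get)
print(fractions)
```

A sample whose fractions are spread evenly across all bins, with no clear sub-nucleosomal or mononucleosomal majority, would be a candidate for the "Irregular" or "No pattern" QC outcomes.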

Spike-in Normalization

Spike-in normalization is CRITICAL for CUT&RUN quantitative comparison.

How It Works

  1. E. coli DNA is carried over from pA-MNase/pA-Tn5 production
  2. Each sample has a different amount of spike-in reads
  3. Samples with more target cleavage have fewer spike-in reads (proportionally)
  4. Scale factor = 1 / (spike-in reads / minimum spike-in reads across samples)

Scale Factor Calculation

Sample A: 200,000 spike-in reads -> scale = 100,000 / 200,000 = 0.5
Sample B: 400,000 spike-in reads -> scale = 100,000 / 400,000 = 0.25
Sample C: 100,000 spike-in reads -> scale = 1.0 (minimum)

Higher spike-in counts = less target enrichment = lower scale factor.
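
The scale factor formula from "How It Works" reduces to the minimum spike-in count divided by each sample's count. A minimal Python sketch, with a hypothetical function name and made-up read counts:

```python
def spikein_scale_factors(spikein_reads):
    """Return {sample: minimum_count / count}, i.e. 1 / (count / minimum_count)."""
    if not spikein_reads:
        raise ValueError("no samples given")
    if any(count <= 0 for count in spikein_reads.values()):
        raise ValueError("every sample needs at least one spike-in read")
    minimum = min(spikein_reads.values())
    return {sample: minimum / count for sample, count in spikein_reads.items()}


# Hypothetical spike-in read counts per sample.
counts = {"s1": 50_000, "s2": 100_000, "s3": 200_000}
factors = spikein_scale_factors(counts)
print(factors)  # {'s1': 1.0, 's2': 0.5, 's3': 0.25}
```

The sample with the fewest spike-in reads always receives a factor of 1.0 and every other sample is scaled down relative to it. Rejecting zero counts up front avoids a division by zero for samples where spike-in alignment failed.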

SEACR vs MACS2

| Feature | SEACR | MACS2 |
|---|---|---|
| Designed for | CUT&RUN/CUT&Tag | ChIP-seq |
| Background model | Sparse enrichment | Dynamic Poisson |
| Control required | Optional (IgG) | Recommended |
| Low background | Handles well | May overcall |
| Stringent mode | Very conservative | Via q-value |
| ENCODE recommendation | Primary for CUT&RUN | Alternative |

SEACR is specifically designed for the sparse, low-background signal profile of CUT&RUN data. MACS2 may overcall peaks due to the low background.

Critical Pitfalls

Spike-in Calibration is CRITICAL

Without spike-in normalization, quantitative comparisons between samples are unreliable. The amount of pA-MNase (or pA-Tn5) varies between experiments, and spike-in reads provide the internal calibration standard.

IgG Control vs No-Antibody Control

  • IgG control: Non-specific antibody, captures background binding
  • No-antibody: No antibody, captures MNase accessibility background
  • IgG is preferred but not always available
  • SEACR can work without a control: given a numeric threshold instead (for example 0.01), it keeps the top 1% of regions by total signal

SEACR Stringent vs Relaxed Mode

  • Stringent: Returns only the most enriched peaks (fewer, higher confidence)
  • Relaxed: Returns a broader set including weaker peaks
  • For initial analysis, use stringent mode
  • For comprehensive catalogs, use relaxed mode with downstream filtering

CUT&RUN Suspect List (Nordin 2023)

In addition to the ENCODE blacklist, filter CUT&RUN peaks against the suspect list (Nordin et al. 2023), which identifies regions with artifactual signal specific to CUT&RUN/CUT&Tag protocols:

# Download and decompress the suspect list
wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/CUTandRUN.suspectlist.hg38.bed.gz
gunzip CUTandRUN.suspectlist.hg38.bed.gz

# Keep only peaks that overlap neither the blacklist nor the suspect list (-v)
bedtools intersect \
    -a peaks.bed \
    -b hg38-blacklist.v2.bed CUTandRUN.suspectlist.hg38.bed \
    -v \
    > peaks_filtered.bed
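
For readers unfamiliar with the `-v` flag, the Python sketch below mimics what the filter does on a handful of intervals: a peak is kept only if it overlaps no excluded region. It assumes BED-style half-open coordinates, so intervals that merely touch do not count as overlapping. The function names and coordinates are made up for the example, and the quadratic scan is meant for illustration, not for genome-scale files.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_peaks(peaks, excluded):
    """Keep peaks that overlap none of the excluded regions."""
    return [peak for peak in peaks if not any(overlaps(peak, region) for region in excluded)]


# Hypothetical peaks and excluded regions (blacklist plus suspect list combined).
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
excluded = [("chr1", 150, 300), ("chr2", 200, 400)]

kept = filter_peaks(peaks, excluded)
print(kept)  # [('chr1', 500, 600), ('chr2', 100, 200)]
```

The chr2 peak survives because it ends exactly where the excluded region starts, which is not an overlap under half-open coordinates.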

CUT&RUN vs CUT&Tag

Both protocols are supported by this pipeline. Differences:

  • CUT&RUN: Uses pA-MNase, E. coli spike-in from MNase production
  • CUT&Tag: Uses pA-Tn5, E. coli spike-in from Tn5 production
  • CUT&Tag has higher background from Tn5 insertion preference
  • CUT&Tag may work better for histone marks; CUT&RUN for TFs

Provenance Integration

After pipeline completion, log all outputs:

encode_log_derived_file(
    file_path="/results/peaks/sample1.seacr.stringent.bed",
    source_accessions=["ENCSR...", "ENCFF..."],
    description="CUT&RUN peaks from ENCODE CUT&RUN pipeline",
    file_type="CUT&RUN_peaks",
    tool_used="Bowtie2 2.5.3 + SEACR 1.3",
    parameters="stringent mode, spike-in normalized, blacklist + suspect list filtered"
)

Reference Files

Detailed step-by-step documentation is provided in the references/ directory:

  1. 01-qc-trimming.md -- Read QC and adapter trimming for CUT&RUN
  2. 02-bowtie2-alignment.md -- Bowtie2 alignment to genome and spike-in
  3. 03-filtering-spikein.md -- Filtering, dedup, and spike-in normalization
  4. 04-seacr-peaks.md -- SEACR peak calling and MACS2 alternative
  5. 05-qc-metrics.md -- Fragment sizes, FRiP, spike-in QC

Walkthrough: Processing ENCODE CUT&RUN from FASTQ to Peaks

Goal: Process CUT&RUN/CUT&Tag FASTQ files through the ENCODE-compatible pipeline to generate peak calls with spike-in normalization.

Context: CUT&RUN uses targeted MNase digestion (lower background than ChIP-seq) but requires different peak calling (SEACR instead of MACS2) and spike-in normalization for quantitative comparisons.

Step 1: Find CUT&RUN experiment

encode_search_experiments(assay_title="CUT&RUN", organism="Homo sapiens")

Expected output:

{
  "total": 35,
  "results": [
    {"accession": "ENCSR900CUR", "assay_title": "CUT&RUN", "target": "H3K27me3", "biosample_summary": "K562", "status": "released"}
  ]
}

Step 2: List FASTQ files

encode_list_files(accession="ENCSR900CUR", file_format="fastq")

Expected output:

{
  "files": [
    {"accession": "ENCFF900CR1", "output_type": "reads", "paired_end": "1", "file_size_mb": 800},
    {"accession": "ENCFF901CR2", "output_type": "reads", "paired_end": "2", "file_size_mb": 850}
  ]
}

Interpretation: CUT&RUN FASTQ files are typically smaller than ChIP-seq files (~800 MB vs ~2.5 GB) because the lower background means fewer reads are needed.

Step 3: Run the CUT&RUN pipeline

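# Symlink the ENCODE files to the *_R{1,2} naming used by the --reads glob
ln -s ENCFF900CR1.fastq.gz K562_H3K27me3_R1.fastq.gz
ln -s ENCFF901CR2.fastq.gz K562_H3K27me3_R2.fastq.gz

nextflow run pipeline-cutandrun/main.nf \
  -profile docker \
  --reads 'K562_H3K27me3_R{1,2}.fastq.gz' \
  --bowtie2_index '/ref/bowtie2_index/genome' \
  --spikein_index '/ref/bowtie2_ecoli/ecoli' \
  --chrom_sizes '/ref/hg38.chrom.sizes' \
  --blacklist '/ref/hg38-blacklist.v2.bed' \
  --peak_caller seacr \
  --outdir results/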

Key pipeline steps:

  1. Adapter trimming (Trim Galore)
  2. Bowtie2 alignment (very-sensitive-local)
  3. Spike-in alignment (E. coli or Drosophila)
  4. Spike-in normalization (scale factor)
  5. SEACR peak calling (stringent mode)
  6. Signal track generation with spike-in scaling

Step 4: Validate output quality

| Metric | Threshold | Purpose |
|---|---|---|
| Spike-in alignment | 1-10% of reads | Normalization calibration |
| Fragment size | < 150 bp majority | CUT&RUN characteristic |
| FRiP (SEACR) | > 10% (5-10% is a warning) | Higher than ChIP-seq due to lower background |
| Duplicate rate | < 20% | Library complexity |
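
FRiP in the table above is the fraction of reads (here, fragments) that overlap at least one called peak. The Python sketch below shows the calculation on a toy example; the function names and coordinates are hypothetical, intervals are assumed to be BED-style half-open, and the nested scan is for illustration only. For real data the same count would come from an interval tool run on the fragment BED and peak files.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def frip(fragments, peaks):
    """Fraction of fragments overlapping at least one peak."""
    if not fragments:
        raise ValueError("no fragments given")
    in_peaks = sum(
        1 for fragment in fragments
        if any(overlaps(fragment, peak) for peak in peaks)
    )
    return in_peaks / len(fragments)


# Hypothetical fragments and SEACR peaks.
fragments = [
    ("chr1", 100, 250),
    ("chr1", 1000, 1150),
    ("chr1", 5000, 5150),
    ("chr2", 300, 450),
]
peaks = [("chr1", 200, 1100), ("chr2", 400, 900)]

score = frip(fragments, peaks)
print(f"FRiP = {score:.0%}")  # FRiP = 75%
```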

Key difference from ChIP-seq: CUT&RUN has inherently lower background, so ChIP-seq peak callers such as MACS2 may overcall peaks. Use SEACR (Meers et al. 2019) as the primary caller.

Step 5: Compare with ChIP-seq for the same target

encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27me3", organism="Homo sapiens")

Interpretation: CUT&RUN typically identifies fewer but higher-confidence peaks than ChIP-seq. Concordant peaks between both methods are the highest confidence.

Integration with downstream skills

  • SEACR peaks feed into -> histone-aggregation for cross-experiment comparison
  • Spike-in normalized signals feed into -> visualization-workflow
  • Peak regions feed into -> regulatory-elements for chromatin state classification
  • QC uses different thresholds than ChIP-seq -> quality-assessment (see suspect list)
  • Pipeline provenance logged by -> data-provenance

Code Examples

1. Survey CUT&RUN/CUT&Tag availability

encode_get_facets(assay_title="CUT&RUN", facet_field="target.label", organism="Homo sapiens")

Expected output:

{
  "facets": {
    "target.label": {"H3K27me3": 15, "H3K4me3": 12, "H3K27ac": 8, "CTCF": 5}
  }
}

2. Find matching ChIP-seq for comparison

encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27me3", organism="Homo sapiens")

Expected output:

{
  "total": 5,
  "results": [
    {"accession": "ENCSR000CHI", "assay_title": "Histone ChIP-seq", "target": "H3K27me3", "biosample_summary": "K562"}
  ]
}

3. Track CUT&RUN experiments

encode_track_experiment(accession="ENCSR900CUR", notes="K562 H3K27me3 CUT&RUN - SEACR peaks for comparison with ChIP-seq")

Expected output:

{
  "status": "tracked",
  "accession": "ENCSR900CUR",
  "notes": "K562 H3K27me3 CUT&RUN - SEACR peaks for comparison with ChIP-seq"
}

Integration

| This skill produces... | Feed into... | Purpose |
|---|---|---|
| SEACR peaks | histone-aggregation | Cross-experiment comparison (note: different caller than ChIP-seq) |
| Spike-in normalized signal | visualization-workflow | Quantitatively comparable browser tracks |
| Peak regions | regulatory-elements | Chromatin state classification |
| CUT&RUN-specific QC | quality-assessment | Validate with CUT&RUN-appropriate thresholds |
| Peak coordinates | motif-analysis | TF motif discovery at CUT&RUN peaks |
| Pipeline parameters | data-provenance | Record SEACR/spike-in normalization details |
| Peak files | variant-annotation | Identify variants in CUT&RUN peaks |
| Comparison with ChIP-seq | compare-biosamples | Cross-assay concordance analysis |

Related Skills

  • pipeline-guide -- Parent skill with compute resource assessment and cloud setup
  • histone-aggregation -- Aggregate histone mark data across samples
  • quality-assessment -- Evaluate pipeline output quality metrics
  • data-provenance -- Track all pipeline inputs, outputs, and parameters
  • download-encode -- Download ENCODE CUT&RUN FASTQ files for pipeline input
  • publication-trust -- Verify literature claims backing analytical decisions

Presenting Results

When reporting CUT&RUN pipeline results:

  • SEACR peak counts: Report peak counts for both stringent and relaxed modes. If MACS2 was also run, include those counts for comparison
  • Spike-in normalization factor: Report the computed scale factor per sample and the spike-in read fraction (ideal 1-10% of total reads). Explain that higher spike-in counts indicate less target enrichment
  • FRiP: Report the fraction of reads in peaks (>10% pass, 5-10% warning, <5% fail). Note that CUT&RUN FRiP thresholds differ from ChIP-seq
  • Signal track paths: Provide paths to spike-in normalized bigWig files for genome browser visualization
  • Fragment size distribution: Confirm the expected nucleosomal ladder pattern and note the dominant fragment class (sub-nucleosomal for TFs, mononucleosomal for histone marks)
  • Key QC metrics: Present mapping rate (>80%), duplication rate (<20%), and spike-in calibration status in a summary table
  • Suspect list filtering: Note whether peaks were filtered against both the ENCODE blacklist and the CUT&RUN suspect list (Nordin 2023)
  • Next steps: Suggest peak-annotation for gene association of peaks, or visualization-workflow for genome browser session generation

For the request: "$ARGUMENTS"