Encode-toolkit pipeline-cutandrun
Execute CUT&RUN processing pipeline from FASTQ to peaks and signal tracks. Child of pipeline-guide. Provides Nextflow execution with Docker and cloud deployment. Use when processing CUT&RUN or CUT&Tag data, an alternative to ChIP-seq with lower background. Trigger on: CUT&RUN pipeline, CUT&Tag, SEACR, Henikoff, targeted chromatin, pA-MNase, process CUT&RUN.
git clone https://github.com/ammawla/encode-toolkit
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pipeline-cutandrun" ~/.claude/skills/ammawla-encode-toolkit-pipeline-cutandrun-6a9d8d && rm -rf "$T"
skills/pipeline-cutandrun/SKILL.md
ENCODE CUT&RUN Pipeline: FASTQ to Peaks and Signal Tracks
When to Use
- User wants to run a CUT&RUN or CUT&Tag processing pipeline from FASTQ to peaks
- User asks about "CUT&RUN pipeline", "CUT&Tag", "SEACR", "spike-in normalization", or "targeted chromatin"
- User needs to process CUT&RUN/CUT&Tag data with spike-in calibration and SEACR peak calling
- Example queries: "process my CUT&RUN FASTQs", "run SEACR on CUT&Tag data", "normalize CUT&RUN with spike-in controls"
Execute the CUT&RUN/CUT&Tag processing pipeline for targeted chromatin profiling, producing peak calls with SEACR and spike-in normalized signal tracks.
Pipeline Overview
FASTQ -> Trim -> Bowtie2 align (genome) -> Filter/dedup -> SEACR peaks
          |                                     |               |
Bowtie2 align (spike-in)               Spike-in normalize  Signal tracks
          |
Scale factor calculation
ENCODE Repository
- GitHub: ENCODE-DCC/cutandrun-pipeline
- Container: encodedcc/cutandrun-pipeline
- This skill: Nextflow DSL2 reimplementation for portability
Core Tools and Versions
| Tool | Version | Purpose | Citation |
|---|---|---|---|
| Bowtie2 | 2.5.3 | Alignment (genome + spike-in) | Langmead & Salzberg 2012 |
| SEACR | 1.3 | Peak calling (CUT&RUN-specific) | Meers et al. 2019 |
| MACS2 | 2.2.9.1 | Alternative peak caller | Zhang et al. 2008 |
| Picard | 3.1.1 | Duplicate marking | Broad Institute |
| samtools | 1.19 | BAM operations | Li et al. 2009 |
| bedtools | 2.31.0 | Genomic arithmetic | Quinlan & Hall 2010 |
| deepTools | 3.5.4 | Signal track generation | Ramirez et al. 2016 |
| FastQC | 0.12.1 | Read quality | Andrews (Babraham) |
| MultiQC | 1.21 | Aggregated QC | Ewels et al. 2016 |
Key Literature
- Skene & Henikoff 2017 - "An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites" (eLife, ~1,500 citations) DOI: 10.7554/eLife.21856
- Meers et al. 2019 - "Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling" (Epigenetics & Chromatin, ~800 citations) DOI: 10.1186/s13072-019-0287-4
- Kaya-Okur et al. 2019 - "CUT&Tag for efficient epigenomic profiling of small samples and single cells" (Nature Communications, ~1,200 citations) DOI: 10.1038/s41467-019-09982-5
- Nordin et al. 2023 - "The CUT&RUN suspect list of problematic regions" (Genome Biology) DOI: 10.1186/s13059-023-02960-3
- Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z
Execution
Quick Start (Local)
nextflow run main.nf \
  -profile local \
  --reads '/data/fastq/*_R{1,2}.fastq.gz' \
  --bowtie2_index '/ref/bowtie2_index/genome' \
  --spikein_index '/ref/bowtie2_ecoli/ecoli' \
  --chrom_sizes '/ref/hg38.chrom.sizes' \
  --blacklist '/ref/hg38-blacklist.v2.bed' \
  --outdir results/ \
  -resume
SLURM HPC
nextflow run main.nf \
  -profile slurm \
  --reads '/data/fastq/*_R{1,2}.fastq.gz' \
  --bowtie2_index '/ref/bowtie2_index/genome' \
  --spikein_index '/ref/bowtie2_ecoli/ecoli' \
  --chrom_sizes '/ref/hg38.chrom.sizes' \
  --blacklist '/ref/hg38-blacklist.v2.bed' \
  --outdir results/ \
  -resume
Cloud (GCP / AWS)
nextflow run main.nf \
  -profile gcp \
  --reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
  --bowtie2_index 'gs://bucket/ref/bowtie2_index/genome' \
  --spikein_index 'gs://bucket/ref/bowtie2_ecoli/ecoli' \
  --chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
  --blacklist 'gs://bucket/ref/hg38-blacklist.v2.bed' \
  --outdir 'gs://bucket/results/' \
  -resume
Resource Requirements
| Step | CPUs | RAM | Time (per sample) |
|---|---|---|---|
| Bowtie2 align (genome) | 8 | 8 GB | 30-60 min |
| Bowtie2 align (spike-in) | 4 | 4 GB | 10-20 min |
| Filter/dedup | 4 | 8 GB | 15-30 min |
| SEACR peaks | 2 | 4 GB | 10-20 min |
| Signal tracks | 4 | 8 GB | 15-30 min |
| Total | 8 | 8 GB | 1.5-3 hours |
Pipeline Parameters
| Parameter | Default | Description |
|---|---|---|
| --reads | required | Glob pattern to paired FASTQ files |
| --bowtie2_index | required | Bowtie2 genome index prefix |
| --spikein_index | required | Bowtie2 E. coli spike-in index prefix |
| --chrom_sizes | required | Chromosome sizes file |
| --blacklist | required | ENCODE blacklist BED file |
| --outdir | | Output directory |
| | | SEACR mode: stringent or relaxed |
| | | SEACR normalization setting |
| | | IgG control BAM (if available) |
| | | Peak caller to run |
| | | Skip spike-in normalization |
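When the pipeline is launched from a script, it can help to check the required parameters before handing off to Nextflow. The sketch below assembles the argument list using only the flags shown in the Quick Start command; the function name `build_nextflow_command` and the `REQUIRED` tuple are illustrative and not part of the pipeline.

```python
# Hypothetical helper: the required names mirror the flags used in Quick Start.
REQUIRED = ("reads", "bowtie2_index", "spikein_index", "chrom_sizes", "blacklist")


def build_nextflow_command(params, profile="local", script="main.nf"):
    """Return an argv list for `nextflow run`, or raise if a required value is missing."""
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    argv = ["nextflow", "run", script, "-profile", profile]
    for name, value in params.items():
        # Pipeline parameters take a double dash; Nextflow's own options take one.
        argv += [f"--{name}", str(value)]
    argv.append("-resume")
    return argv


if __name__ == "__main__":
    example = {
        "reads": "/data/fastq/*_R{1,2}.fastq.gz",
        "bowtie2_index": "/ref/bowtie2_index/genome",
        "spikein_index": "/ref/bowtie2_ecoli/ecoli",
        "chrom_sizes": "/ref/hg38.chrom.sizes",
        "blacklist": "/ref/hg38-blacklist.v2.bed",
        "outdir": "results/",
    }
    print(" ".join(build_nextflow_command(example)))
```

Passing the returned list to `subprocess.run` without `shell=True` keeps the `*_R{1,2}` glob unexpanded, so Nextflow receives the pattern as written.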
Output Files
results/
  fastqc/                                 # Raw read quality
  alignment/
    {sample}.filtered.bam                 # Filtered, deduplicated BAM
    {sample}.filtered.bam.bai
  spikein/
    {sample}.spikein_counts.txt           # Spike-in read counts
    {sample}.scale_factor.txt             # Computed scale factor
  peaks/
    {sample}.seacr.stringent.bed          # SEACR stringent peaks
    {sample}.seacr.relaxed.bed            # SEACR relaxed peaks
    {sample}.macs2_peaks.narrowPeak       # MACS2 peaks (if requested)
  signal/
    {sample}.normalized.bw                # Spike-in normalized signal
    {sample}.fragments.bed                # Fragment BED file
  qc/
    {sample}.flagstat.txt
    {sample}.fragment_sizes.txt
    {sample}.frip.txt
  multiqc/
    multiqc_report.html
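A quick completeness check after a run can catch silently failed steps. The following is a minimal sketch that looks for the per-sample files named in the output listing above; the directory nesting is inferred from that listing, the optional MACS2 file is left out, and `missing_outputs` is a hypothetical helper name.

```python
from pathlib import Path

# Per-sample files from the output listing; nesting under each directory is assumed.
EXPECTED_TEMPLATES = (
    "alignment/{sample}.filtered.bam",
    "alignment/{sample}.filtered.bam.bai",
    "spikein/{sample}.spikein_counts.txt",
    "spikein/{sample}.scale_factor.txt",
    "peaks/{sample}.seacr.stringent.bed",
    "peaks/{sample}.seacr.relaxed.bed",
    "signal/{sample}.normalized.bw",
    "signal/{sample}.fragments.bed",
    "qc/{sample}.flagstat.txt",
    "qc/{sample}.fragment_sizes.txt",
    "qc/{sample}.frip.txt",
)


def missing_outputs(outdir, sample):
    """Return the relative paths of expected per-sample files that do not exist."""
    root = Path(outdir)
    expected = [template.format(sample=sample) for template in EXPECTED_TEMPLATES]
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_outputs("results", "sample1"):
        print("missing:", rel)
```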
QC Thresholds
| Metric | Pass | Warning | Fail |
|---|---|---|---|
| Mapping rate (genome) | >80% | 60-80% | <60% |
| Spike-in reads | 1-10% of total | 0.1-1% or 10-30% | <0.1% or >30% |
| Duplication rate | <20% | 20-40% | >40% |
| FRiP (peaks) | >10% | 5-10% | <5% |
| Peak count | >5,000 | 1,000-5,000 | <1,000 |
| Fragment size | Nucleosomal pattern | Irregular | No pattern |
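The numeric rows of the table above can be applied programmatically. This is a sketch only: the metric keys and the function name `classify_qc` are made up for illustration, percentages are expected on a 0-100 scale, and the handling of values that sit exactly on a boundary is an assumption because the table does not specify it. Fragment size is qualitative and is not covered.

```python
def classify_qc(metric, value):
    """Return 'pass', 'warning' or 'fail' for one numeric QC metric.

    Boundary handling is an assumption: the table does not say which
    category a value exactly on a threshold belongs to.
    """
    if metric == "mapping_rate":
        return "pass" if value > 80 else "warning" if value >= 60 else "fail"
    if metric == "spikein_pct":
        if 1 <= value <= 10:
            return "pass"
        if 0.1 <= value < 1 or 10 < value <= 30:
            return "warning"
        return "fail"
    if metric == "duplication_rate":
        return "pass" if value < 20 else "warning" if value <= 40 else "fail"
    if metric == "frip_pct":
        return "pass" if value > 10 else "warning" if value >= 5 else "fail"
    if metric == "peak_count":
        return "pass" if value > 5000 else "warning" if value >= 1000 else "fail"
    raise KeyError(f"unknown metric: {metric}")


if __name__ == "__main__":
    sample_qc = {"mapping_rate": 91.2, "spikein_pct": 0.4, "duplication_rate": 12.0,
                 "frip_pct": 7.5, "peak_count": 12000}
    for name, value in sample_qc.items():
        print(f"{name}\t{value}\t{classify_qc(name, value)}")
```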
Fragment Size Distribution
CUT&RUN produces a characteristic nucleosomal ladder:
- <120 bp: Sub-nucleosomal (TF binding)
- ~150 bp: Mononucleosomal (histone marks)
- ~300 bp: Dinucleosomal
- Absence of nucleosomal pattern suggests protocol issues
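To see which fragment class dominates a library, fragment lengths can be binned along the lines of the list above. In this sketch only the 120 bp cutoff comes from the text; the 225 bp and 375 bp cutoffs are assumed midpoints between ~150, ~300 and a hypothetical ~450 bp trinucleosomal size, and should be adjusted to the data.

```python
from collections import Counter


def fragment_class(length):
    """Assign a fragment length (bp) to a nucleosomal class.

    The 120 bp cutoff follows the list above; 225 and 375 are assumed midpoints.
    """
    if length < 120:
        return "sub-nucleosomal"
    if length < 225:
        return "mononucleosomal"
    if length < 375:
        return "dinucleosomal"
    return "larger"


def class_fractions(lengths):
    """Return the fraction of fragments falling in each class."""
    counts = Counter(fragment_class(n) for n in lengths)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else {}


if __name__ == "__main__":
    print(class_fractions([80, 95, 150, 160, 148, 310]))
```

A library dominated by the sub-nucleosomal class is what one would expect for a transcription factor target, while a histone mark should be mostly mononucleosomal.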
Spike-in Normalization
Spike-in normalization is CRITICAL for CUT&RUN quantitative comparison.
How It Works
- E. coli DNA is carried over from pA-MNase/pA-Tn5 production
- Each sample has a different amount of spike-in reads
- Samples with more target cleavage have fewer spike-in reads (proportionally)
- Scale factor = 1 / (spike-in reads / minimum spike-in reads across samples)
Scale Factor Calculation
Sample A: 200,000 spike-in reads -> scale = 0.5
Sample B: 400,000 spike-in reads -> scale = 0.25
Sample C: 100,000 spike-in reads -> scale = 1.0 (minimum)
Higher spike-in counts = less target enrichment = lower scale factor.
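The formula above reduces to `minimum spike-in reads / sample spike-in reads`. A minimal sketch of that calculation follows; the function name and the sample names and counts in the usage example are hypothetical, and the pipeline's own implementation may differ.

```python
def spikein_scale_factors(spikein_reads):
    """Map each sample to (minimum spike-in count) / (its own spike-in count).

    The sample with the fewest spike-in reads gets 1.0; every other sample is
    scaled down in proportion to its extra spike-in reads.
    """
    if not spikein_reads:
        raise ValueError("no samples given")
    if any(count <= 0 for count in spikein_reads.values()):
        raise ValueError("every sample needs a positive spike-in read count")
    reference = min(spikein_reads.values())
    return {sample: reference / count for sample, count in spikein_reads.items()}


if __name__ == "__main__":
    # Hypothetical counts for three samples.
    print(spikein_scale_factors({"s1": 50_000, "s2": 100_000, "s3": 200_000}))
    # -> {'s1': 1.0, 's2': 0.5, 's3': 0.25}
```

Samples with zero spike-in reads are rejected rather than given an infinite factor, since they cannot be calibrated this way.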
SEACR vs MACS2
| Feature | SEACR | MACS2 |
|---|---|---|
| Designed for | CUT&RUN/CUT&Tag | ChIP-seq |
| Background model | Sparse enrichment | Dynamic Poisson |
| Control required | Optional (IgG) | Recommended |
| Low background | Handles well | May overcall |
| Stringent mode | Very conservative | Via q-value |
| ENCODE recommendation | Primary for CUT&RUN | Alternative |
SEACR is specifically designed for the sparse, low-background signal profile of CUT&RUN data. MACS2 may overcall peaks due to the low background.
Critical Pitfalls
Spike-in Calibration is CRITICAL
Without spike-in normalization, quantitative comparisons between samples are unreliable. The amount of pA-MNase (or pA-Tn5) varies between experiments, and spike-in reads provide the internal calibration standard.
IgG Control vs No-Antibody Control
- IgG control: Non-specific antibody, captures background binding
- No-antibody: No antibody, captures MNase accessibility background
- IgG is preferred but not always available
- SEACR can work without control (uses top 1% of signal as threshold)
SEACR Stringent vs Relaxed Mode
- Stringent: Returns only the most enriched peaks (fewer, higher confidence)
- Relaxed: Returns a broader set including weaker peaks
- For initial analysis, use stringent mode
- For comprehensive catalogs, use relaxed mode with downstream filtering
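For readers calling SEACR directly rather than through the pipeline, the mode is the fourth positional argument of the SEACR shell script. The sketch below builds that command line following the argument order in the SEACR README (bedgraph, control bedgraph or numeric threshold, `norm`/`non`, `relaxed`/`stringent`, output prefix); the script location and the helper name `seacr_command` are assumptions about a local install, and how the pipeline itself invokes SEACR is not shown in this document.

```python
def seacr_command(bedgraph, control, mode="stringent", norm="norm",
                  prefix="sample", script="SEACR_1.3.sh"):
    """Build the argv for one SEACR call.

    `control` is either an IgG bedgraph path or a numeric threshold such as
    0.01 (top 1% of regions) when no control is available.
    """
    if mode not in ("stringent", "relaxed"):
        raise ValueError("mode must be 'stringent' or 'relaxed'")
    if norm not in ("norm", "non"):
        raise ValueError("norm must be 'norm' or 'non'")
    return ["bash", script, bedgraph, str(control), norm, mode, prefix]


if __name__ == "__main__":
    # With an IgG control
    print(seacr_command("target.bedgraph", "igg.bedgraph", prefix="sample1.seacr"))
    # Without a control: numeric threshold, no normalization
    print(seacr_command("target.bedgraph", 0.01, mode="relaxed", norm="non",
                        prefix="sample1.seacr"))
```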
CUT&RUN Suspect List (Nordin 2023)
In addition to the ENCODE blacklist, filter CUT&RUN peaks against the suspect list (Nordin et al. 2023), which identifies regions with artifactual signal specific to CUT&RUN/CUT&Tag protocols:
# Download suspect list
wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/CUTandRUN.suspectlist.hg38.bed.gz

# Filter peaks (bedtools reads the gzipped BED directly)
bedtools intersect \
  -a peaks.bed \
  -b hg38-blacklist.v2.bed CUTandRUN.suspectlist.hg38.bed.gz \
  -v \
  > peaks_filtered.bed
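The effect of `bedtools intersect -v` is to keep only peaks that overlap nothing in the exclusion lists. A pure-Python sketch of that logic, for illustration on small inputs, is shown here; the function names are hypothetical, intervals are `(chrom, start, end)` tuples in BED's 0-based half-open convention, and bedtools remains the practical choice for real files.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_peaks(peaks, excluded):
    """Keep peaks that overlap no excluded region (same idea as `intersect -v`)."""
    return [peak for peak in peaks if not any(overlaps(peak, region) for region in excluded)]


if __name__ == "__main__":
    peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    # Blacklist and suspect-list regions would be concatenated into one list.
    excluded = [("chr1", 150, 160), ("chr2", 200, 300)]
    print(filter_peaks(peaks, excluded))
    # -> [('chr1', 500, 600), ('chr2', 100, 200)]
```

The chr2 peak survives because it ends exactly where the excluded region starts, and half-open intervals that only touch do not overlap.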
CUT&RUN vs CUT&Tag
Both protocols are supported by this pipeline. Differences:
- CUT&RUN: Uses pA-MNase, E. coli spike-in from MNase production
- CUT&Tag: Uses pA-Tn5, E. coli spike-in from Tn5 production
- CUT&Tag has higher background from Tn5 insertion preference
- CUT&Tag may work better for histone marks; CUT&RUN for TFs
Provenance Integration
After pipeline completion, log all outputs:
encode_log_derived_file(
    file_path="/results/peaks/sample1.seacr.stringent.bed",
    source_accessions=["ENCSR...", "ENCFF..."],
    description="CUT&RUN peaks from ENCODE CUT&RUN pipeline",
    file_type="CUT&RUN_peaks",
    tool_used="Bowtie2 2.5.3 + SEACR 1.3",
    parameters="stringent mode, spike-in normalized, blacklist + suspect list filtered"
)
Reference Files
Detailed step-by-step documentation is provided in the references/ directory:
- 01-qc-trimming.md -- Read QC and adapter trimming for CUT&RUN
- 02-bowtie2-alignment.md -- Bowtie2 alignment to genome and spike-in
- 03-filtering-spikein.md -- Filtering, dedup, and spike-in normalization
- 04-seacr-peaks.md -- SEACR peak calling and MACS2 alternative
- 05-qc-metrics.md -- Fragment sizes, FRiP, spike-in QC
Walkthrough: Processing ENCODE CUT&RUN from FASTQ to Peaks
Goal: Process CUT&RUN/CUT&Tag FASTQ files through the ENCODE-compatible pipeline to generate peak calls with spike-in normalization.
Context: CUT&RUN uses targeted MNase digestion (lower background than ChIP-seq) but requires a different peak caller (SEACR instead of MACS2) and spike-in normalization for quantitative comparisons.
Step 1: Find CUT&RUN experiment
encode_search_experiments(assay_title="CUT&RUN", organism="Homo sapiens")
Expected output:
{
  "total": 35,
  "results": [
    {"accession": "ENCSR900CUR", "assay_title": "CUT&RUN", "target": "H3K27me3", "biosample_summary": "K562", "status": "released"}
  ]
}
Step 2: List FASTQ files
encode_list_files(accession="ENCSR900CUR", file_format="fastq")
Expected output:
{
  "files": [
    {"accession": "ENCFF900CR1", "output_type": "reads", "paired_end": "1", "file_size_mb": 800},
    {"accession": "ENCFF901CR2", "output_type": "reads", "paired_end": "2", "file_size_mb": 850}
  ]
}
Interpretation: CUT&RUN FASTQ files are typically smaller than ChIP-seq files (~800MB vs ~2.5GB) because the lower background means fewer reads are needed.
Step 3: Run the CUT&RUN pipeline
nextflow run pipeline-cutandrun/main.nf \
  --fastq_r1 ENCFF900CR1.fastq.gz \
  --fastq_r2 ENCFF901CR2.fastq.gz \
  --genome GRCh38 \
  --spike_in_genome dm6 \
  --target H3K27me3 \
  --peak_caller seacr \
  -profile docker
Key pipeline steps:
- Adapter trimming (Trim Galore)
- Bowtie2 alignment (very-sensitive-local)
- Spike-in alignment (E. coli or Drosophila)
- Spike-in normalization (scale factor)
- SEACR peak calling (stringent mode)
- Signal track generation with spike-in scaling
Step 4: Validate output quality
| Metric | Threshold | Purpose |
|---|---|---|
| Spike-in alignment | 1-10% of reads | Normalization calibration |
| Fragment size | < 150bp majority | CUT&RUN characteristic |
| FRiP (SEACR) | > 10% (5-10% is a warning) | Higher than ChIP-seq due to lower background |
| Duplicate rate | < 20% | Library complexity |
Key difference from ChIP-seq: CUT&RUN has inherently lower background, so peak callers designed for ChIP-seq, such as MACS2, may overcall peaks. Use SEACR (Meers et al. 2019) instead.
Step 5: Compare with ChIP-seq for the same target
encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27me3", organism="Homo sapiens")
Interpretation: CUT&RUN typically identifies fewer but higher-confidence peaks than ChIP-seq. Concordant peaks between both methods are the highest confidence.
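Concordance between the two assays can be summarized as the fraction of CUT&RUN peaks that overlap at least one ChIP-seq peak. The sketch below computes that fraction for peaks held in memory as `(chrom, start, end)` tuples; the function name is hypothetical, any overlap of one base or more counts as concordant (an assumed criterion), and the coordinates in the example are invented.

```python
def concordant_fraction(query_peaks, reference_peaks):
    """Fraction of query peaks overlapping at least one reference peak.

    Intervals are (chrom, start, end), 0-based half-open as in BED.
    """
    if not query_peaks:
        return 0.0
    # Group reference intervals by chromosome so each query only scans its own.
    by_chrom = {}
    for chrom, start, end in reference_peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = 0
    for chrom, start, end in query_peaks:
        if any(start < ref_end and ref_start < end
               for ref_start, ref_end in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(query_peaks)


if __name__ == "__main__":
    cutrun = [("chr1", 100, 200), ("chr1", 1000, 1100), ("chr2", 50, 80), ("chr3", 10, 20)]
    chipseq = [("chr1", 150, 400), ("chr2", 0, 60)]
    print(concordant_fraction(cutrun, chipseq))  # -> 0.5
```

Swapping the arguments gives the complementary view, the fraction of ChIP-seq peaks recovered by CUT&RUN, which is usually lower given the smaller CUT&RUN peak set.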
Integration with downstream skills
- SEACR peaks feed into -> histone-aggregation for cross-experiment comparison
- Spike-in normalized signals feed into -> visualization-workflow
- Peak regions feed into -> regulatory-elements for chromatin state classification
- QC uses different thresholds than ChIP-seq -> quality-assessment (see suspect list)
- Pipeline provenance logged by -> data-provenance
Code Examples
1. Survey CUT&RUN/CUT&Tag availability
encode_get_facets(assay_title="CUT&RUN", facet_field="target.label", organism="Homo sapiens")
Expected output:
{
  "facets": {
    "target.label": {"H3K27me3": 15, "H3K4me3": 12, "H3K27ac": 8, "CTCF": 5}
  }
}
2. Find matching ChIP-seq for comparison
encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27me3", organism="Homo sapiens")
Expected output:
{
  "total": 5,
  "results": [
    {"accession": "ENCSR000CHI", "assay_title": "Histone ChIP-seq", "target": "H3K27me3", "biosample_summary": "K562"}
  ]
}
3. Track CUT&RUN experiments
encode_track_experiment(accession="ENCSR900CUR", notes="K562 H3K27me3 CUT&RUN - SEACR peaks for comparison with ChIP-seq")
Expected output:
{
  "status": "tracked",
  "accession": "ENCSR900CUR",
  "notes": "K562 H3K27me3 CUT&RUN - SEACR peaks for comparison with ChIP-seq"
}
Integration
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| SEACR peaks | histone-aggregation | Cross-experiment comparison (note: different caller than ChIP-seq) |
| Spike-in normalized signal | visualization-workflow | Quantitatively comparable browser tracks |
| Peak regions | regulatory-elements | Chromatin state classification |
| CUT&RUN-specific QC | quality-assessment | Validate with CUT&RUN-appropriate thresholds |
| Peak coordinates | motif-analysis | TF motif discovery at CUT&RUN peaks |
| Pipeline parameters | data-provenance | Record SEACR/spike-in normalization details |
| Peak files | variant-annotation | Identify variants in CUT&RUN peaks |
| Comparison with ChIP-seq | compare-biosamples | Cross-assay concordance analysis |
Related Skills
- pipeline-guide -- Parent skill with compute resource assessment and cloud setup
- histone-aggregation -- Aggregate histone mark data across samples
- quality-assessment -- Evaluate pipeline output quality metrics
- data-provenance -- Track all pipeline inputs, outputs, and parameters
- download-encode -- Download ENCODE CUT&RUN FASTQ files for pipeline input
- publication-trust -- Verify literature claims backing analytical decisions
Presenting Results
When reporting CUT&RUN pipeline results:
- SEACR peak counts: Report peak counts for both stringent and relaxed modes. If MACS2 was also run, include those counts for comparison
- Spike-in normalization factor: Report the computed scale factor per sample and the spike-in read fraction (ideal 1-10% of total reads). Explain that higher spike-in counts indicate less target enrichment
- FRiP: Report the fraction of reads in peaks (>10% pass, 5-10% warning, <5% fail). Note that CUT&RUN FRiP thresholds differ from ChIP-seq
- Signal track paths: Provide paths to spike-in normalized bigWig files for genome browser visualization
- Fragment size distribution: Confirm the expected nucleosomal ladder pattern and note the dominant fragment class (sub-nucleosomal for TFs, mononucleosomal for histone marks)
- Key QC metrics: Present mapping rate (>80%), duplication rate (<20%), and spike-in calibration status in a summary table
- Suspect list filtering: Note whether peaks were filtered against both the ENCODE blacklist and the CUT&RUN suspect list (Nordin 2023)
- Next steps: Suggest peak-annotation for gene association of peaks, or visualization-workflow for genome browser session generation