Encode-toolkit pipeline-atacseq

Execute ENCODE ATAC-seq processing pipeline from FASTQ to peaks and signal tracks. Child of pipeline-guide. Provides stage-by-stage Nextflow execution with Docker containers and cloud deployment. Handles Tn5 transposase offset correction, mitochondrial read removal, nucleosome-free fragment selection, and TSS enrichment scoring. Use when users need to process ATAC-seq data following ENCODE standards. Trigger on: ATAC-seq pipeline, run ATAC-seq, process ATAC-seq, chromatin accessibility, open chromatin, Tn5 shift, TSS enrichment.

install
source · Clone the upstream repo
git clone https://github.com/ammawla/encode-toolkit
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pipeline-atacseq" ~/.claude/skills/ammawla-encode-toolkit-pipeline-atacseq-a50740 && rm -rf "$T"
manifest: skills/pipeline-atacseq/SKILL.md

ENCODE ATAC-seq Pipeline

When to Use

  • User wants to run an ATAC-seq processing pipeline from FASTQ to peaks and signal tracks
  • User asks about "ATAC-seq pipeline", "Tn5 shift", "chromatin accessibility pipeline", or "Bowtie2 for ATAC"
  • User needs to process ATAC-seq data with proper Tn5 insertion site correction
  • Example queries: "process my ATAC-seq FASTQs", "run ENCODE ATAC-seq pipeline", "call accessibility peaks from ATAC-seq"

Execute the ENCODE ATAC-seq processing pipeline from raw FASTQ files through Tn5 offset correction, peak calling, IDR analysis, and signal track generation. This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform analysis standards.

Overview

ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) uses the Tn5 transposase to probe open chromatin regions. The ENCODE pipeline processes ATAC-seq data through quality control, alignment with Bowtie2, Tn5 insertion site correction (+4/-5 bp offset), mitochondrial read removal, nucleosome-free fragment selection, peak calling with MACS2, and IDR-based replicate consistency analysis.

Key differences from ChIP-seq: Bowtie2 aligner (optimized for short fragments), Tn5 transposase shift correction, aggressive mitochondrial read filtering (can be 30-80% of reads), nucleosomal fragment size distribution as a QC metric, and TSS enrichment score as the primary quality indicator.

Key Literature

| Reference | Journal | Year | DOI | Relevance |
| --- | --- | --- | --- | --- |
| Buenrostro et al. "Transposition of native chromatin (ATAC-seq)" | Nature Methods | 2013 | 10.1038/nmeth.2688 | Original ATAC-seq method (~5,000 citations) |
| Corces et al. "An improved ATAC-seq protocol" | Nature Methods | 2017 | 10.1038/nmeth.4396 | Omni-ATAC improvements (~2,500 citations) |
| ENCODE Project Consortium "Expanded encyclopaedias" | Nature | 2020 | 10.1038/s41586-020-2493-4 | ENCODE Phase 3 standards |
| Amemiya et al. "ENCODE Blacklist" | Scientific Reports | 2019 | 10.1038/s41598-019-45839-z | Artifact regions (~1,372 citations) |
| Langmead & Salzberg "Fast gapped-read alignment with Bowtie 2" | Nature Methods | 2012 | 10.1038/nmeth.1923 | Aligner (~30,000 citations) |
| Yan et al. "From reads to insight: ATAC-seq analysis" | Genome Biology | 2020 | 10.1186/s13059-020-1929-3 | Analysis best practices |

Pipeline Stages

FASTQ ──> FastQC / Trim Galore ──> Bowtie2 ──> Mito Removal + Tn5 Shift
  │                                                       │
  │           ┌──────────────────────────────────────────┘
  │           v
  │     Picard MarkDup ──> Blacklist Filter ──> Size Selection
  │                                                   │
  │                    ┌─────────────────┬────────────┘
  │                    v                 v
  │             NFR Fragments     Mono-Nucleosome
  │                    │
  │                    v
  │           MACS2 Peak Calling ──> IDR Analysis
  │                    │                    │
  │                    v                    v
  │             Signal Tracks         QC Report (MultiQC + ataqv)
  v
 Raw QC Report

Stage Summary

| Stage | Tool | Input | Output | Reference |
| --- | --- | --- | --- | --- |
| 1. QC & Trimming | FastQC, Trim Galore | Raw FASTQ | Trimmed FASTQ | references/01-qc-trimming.md |
| 2. Alignment | Bowtie2 | Trimmed FASTQ | Sorted BAM | references/02-alignment.md |
| 3. Tn5 Shift & Filtering | Samtools, bedtools, Picard | Sorted BAM | Shifted, filtered BAM | references/03-tn5-filtering.md |
| 4. Peak Calling & IDR | MACS2, IDR | Filtered BAM | Peaks (narrowPeak) | references/04-peak-calling.md |
| 5. QC & Signal | deeptools, ataqv, MultiQC | Filtered BAM, Peaks | bigWig, QC report | references/05-qc-metrics.md |

Input Requirements

Required Files

  • ATAC-seq FASTQ: Paired-end reads (strongly recommended; single-end supported)
  • Reference genome: Bowtie2-indexed genome (GRCh38 for human, mm10 for mouse)

Sample Sheet Format

sample_id,read1,read2,replicate
SAMPLE1_rep1,atac_rep1_R1.fq.gz,atac_rep1_R2.fq.gz,1
SAMPLE1_rep2,atac_rep2_R1.fq.gz,atac_rep2_R2.fq.gz,2

No input control needed: Unlike ChIP-seq, ATAC-seq does not require a separate input or IgG control. MACS2 calls peaks against a local background model.
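A sample sheet in this format can be checked before launching the pipeline. The following is a minimal sketch, not part of the toolkit: the function name `parse_sample_sheet` is hypothetical, and treating an empty `read2` field as single-end is an assumption about how the pipeline might interpret the sheet.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "read1", "read2", "replicate")


def parse_sample_sheet(text):
    """Parse sample-sheet CSV text into a list of dicts, validating each row."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not row["sample_id"] or not row["read1"]:
            raise ValueError(f"line {line_no}: sample_id and read1 are required")
        if row["sample_id"] in seen:
            raise ValueError(f"line {line_no}: duplicate sample_id {row['sample_id']}")
        seen.add(row["sample_id"])
        rows.append({
            "sample_id": row["sample_id"],
            "read1": row["read1"],
            "read2": row["read2"] or None,  # assumption: empty read2 means single-end
            "replicate": int(row["replicate"]),
        })
    return rows


sheet = """sample_id,read1,read2,replicate
SAMPLE1_rep1,rep1_R1.fq.gz,rep1_R2.fq.gz,1
SAMPLE1_rep2,rep2_R1.fq.gz,rep2_R2.fq.gz,2
"""
samples = parse_sample_sheet(sheet)
print([s["replicate"] for s in samples])
```

Rejecting duplicate sample IDs up front avoids two rows silently writing to the same output names.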

Tn5 Transposase Offset Correction

The Tn5 transposase inserts sequencing adapters with a 9-bp duplication. To center reads on the actual cut site:

  • Forward strand (+): shift +4 bp
  • Reverse strand (-): shift -5 bp

This correction is essential for accurate footprinting and motif analysis.
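As a minimal sketch of the offset rule above (function names are hypothetical, and intervals are assumed to be BED-style start/end pairs), reads on the forward strand move 4 bp to the right and reads on the reverse strand move 5 bp to the left. For a paired-end fragment, the left end comes from the forward-strand mate and the right end from the reverse-strand mate.

```python
PLUS_SHIFT = 4    # forward-strand offset in bp
MINUS_SHIFT = -5  # reverse-strand offset in bp


def shift_read(start, end, strand):
    """Offset a read interval by +4 bp (forward strand) or -5 bp (reverse strand)."""
    if strand == "+":
        offset = PLUS_SHIFT
    elif strand == "-":
        offset = MINUS_SHIFT
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return start + offset, end + offset


def shift_fragment(start, end):
    """Offset both ends of a paired-end fragment toward the Tn5 cut sites."""
    return start + PLUS_SHIFT, end + MINUS_SHIFT


print(shift_read(100, 150, "+"))
print(shift_read(100, 150, "-"))
print(shift_fragment(100, 250))
```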

Fragment Size Distribution

ATAC-seq produces a characteristic nucleosomal ladder pattern:

| Fragment Class | Size Range | Biological Meaning |
| --- | --- | --- |
| Nucleosome-free (NFR) | <150 bp | Open chromatin / TF binding |
| Mono-nucleosome | 150-300 bp | Single nucleosome wrapping |
| Di-nucleosome | 300-500 bp | Two nucleosomes |
| Tri-nucleosome | 500-700 bp | Three nucleosomes |

For peak calling, use nucleosome-free reads (<150 bp) only.
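The size classes can be expressed as a small lookup. This is a sketch only: the ranges in the table share their boundary values (300 and 500 bp), so the code assumes half-open bins in which a boundary length belongs to the larger class, and the function names are hypothetical.

```python
def classify_fragment(length):
    """Map a fragment length in bp to its nucleosomal class."""
    if length <= 0:
        raise ValueError("fragment length must be positive")
    if length < 150:
        return "NFR"
    if length < 300:
        return "mono-nucleosome"
    if length < 500:
        return "di-nucleosome"
    if length <= 700:
        return "tri-nucleosome"
    return "other"


def nfr_fraction(lengths):
    """Fraction of fragments shorter than 150 bp."""
    if not lengths:
        raise ValueError("no fragments given")
    return sum(1 for n in lengths if n < 150) / len(lengths)


print(classify_fragment(120), classify_fragment(210))
print(nfr_fraction([80, 120, 200, 400]))
```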

QC Thresholds

| Metric | Threshold | Category | Source |
| --- | --- | --- | --- |
| Total sequenced reads | >=50M (recommended) | Read depth | ENCODE |
| Mapping rate | >80% | Alignment | ENCODE |
| Mitochondrial fraction | <20% (ideal <5%) | Sample quality | ENCODE |
| NRF (non-redundant fraction) | >=0.8 | Library complexity | ENCODE |
| PBC1 | >=0.8 | Library complexity | ENCODE |
| TSS enrichment score | >=5 | Signal quality | ENCODE standard |
| FRiP | >=0.3 | Peak quality | ENCODE |
| NFR fraction | >0.4 of fragments <150bp | Fragment distribution | Buenrostro 2013 |
| IDR optimal peaks | >50,000 | Reproducibility | ENCODE |
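The two library-complexity metrics can be illustrated with a short sketch. It assumes the usual definitions (NRF is distinct read positions divided by total reads; PBC1 is positions seen exactly once divided by distinct positions) and represents each read as a hypothetical `(chrom, position, strand)` tuple. The pipeline's own implementation may differ in detail.

```python
from collections import Counter


def library_complexity(positions):
    """Return NRF and PBC1 for an iterable of hashable read positions."""
    counts = Counter(positions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads given")
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {"NRF": distinct / total, "PBC1": singletons / distinct}


reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"),  # one duplicated position
    ("chr1", 200, "+"), ("chr1", 300, "-"), ("chr2", 50, "+"),
]
metrics = library_complexity(reads)
passed = {name: value >= 0.8 for name, value in metrics.items()}
print(metrics, passed)
```

In this toy input, five reads fall on four distinct positions, three of which are seen once, so NRF meets the 0.8 threshold while PBC1 does not.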

TSS Enrichment Score

The TSS enrichment score measures the fold enrichment of ATAC-seq signal at transcription start sites compared to flanking regions. It is the single most informative QC metric for ATAC-seq:

| Score | Quality | Interpretation |
| --- | --- | --- |
| >=7 | Excellent | High signal-to-noise |
| 5-7 | Good | Acceptable for most analyses |
| 3-5 | Marginal | Review other metrics carefully |
| <3 | Poor | Likely failed; consider re-doing |
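A rough sketch of how such a score could be derived and tiered follows. It makes two assumptions: the input is a per-base aggregate signal profile over a window centred on annotated TSSs, and the score is the centre value divided by the mean of the outer flanks. Production tools may smooth the profile or take its maximum instead, so treat the arithmetic as illustrative. Function names and the 100 bp flank are hypothetical.

```python
def tss_enrichment(profile, flank=100):
    """Signal at the window centre divided by mean signal in the two outer flanks."""
    if len(profile) <= 2 * flank:
        raise ValueError("profile is too short for the requested flank size")
    background = (sum(profile[:flank]) + sum(profile[-flank:])) / (2 * flank)
    if background <= 0:
        raise ValueError("flanking background must be positive")
    return profile[len(profile) // 2] / background


def quality_tier(score):
    """Map a TSS enrichment score to the tiers in the table above."""
    if score >= 7:
        return "Excellent"
    if score >= 5:
        return "Good"
    if score >= 3:
        return "Marginal"
    return "Poor"


profile = [1.0] * 4001  # flat background over a +/-2000 bp window
profile[2000] = 8.0     # enriched signal at the TSS itself
score = tss_enrichment(profile)
print(score, quality_tier(score))
```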

Execution

Quick Start (Local Docker)

nextflow run scripts/main.nf \
  -profile local \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir results/

SLURM HPC

nextflow run scripts/main.nf \
  -profile slurm \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir results/

Google Cloud

nextflow run scripts/main.nf \
  -profile gcp \
  --reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 'gs://bucket/results/'

AWS Batch

nextflow run scripts/main.nf \
  -profile aws \
  --reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 's3://bucket/results/'

Cloud Cost Estimates

| Platform | Instance | Cost/Sample | Time/Sample | Notes |
| --- | --- | --- | --- | --- |
| GCP | n1-standard-8 | ~$2-4 | 2-3 hours | Preemptible recommended |
| AWS | m5.2xlarge | ~$2-4 | 2-3 hours | Spot instances recommended |
| Local | 8 cores, 32GB | $0 | 3-5 hours | Docker required |
| SLURM | 8 cores, 32GB | Varies | 2-3 hours | Singularity recommended |

Output Directory Structure

results/
  fastqc/                   # Raw and trimmed QC reports
  trimmed/                  # Trimmed FASTQ files
  aligned/                  # Sorted BAM files (pre-filtering)
  filtered/
    shifted/                # Tn5-corrected BAM files
    nfr/                    # Nucleosome-free fragments (<150 bp)
    mononuc/                # Mono-nucleosome fragments (150-300 bp)
  peaks/
    narrow/                 # MACS2 narrowPeak files
    idr/                    # IDR-filtered reproducible peaks
  signal/                   # bigWig signal tracks
  qc/
    tss_enrichment/         # TSS enrichment scores and plots
    fragment_size/          # Fragment size distribution plots
    ataqv/                  # Comprehensive ATAC-seq QC (ataqv)
    multiqc/                # Aggregated QC report
  logs/                     # Nextflow execution logs
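After a run, the layout above can be checked programmatically. The sketch below is a hypothetical helper, not part of the pipeline: it lists which of the expected subdirectories are absent under a given output directory. The demonstration builds a temporary tree that deliberately omits logs/.

```python
import tempfile
from pathlib import Path

# Leaf directories taken from the layout above.
EXPECTED_DIRS = [
    "fastqc", "trimmed", "aligned",
    "filtered/shifted", "filtered/nfr", "filtered/mononuc",
    "peaks/narrow", "peaks/idr", "signal",
    "qc/tss_enrichment", "qc/fragment_size", "qc/ataqv", "qc/multiqc",
    "logs",
]


def missing_outputs(outdir):
    """Return the expected subdirectories that do not exist under outdir."""
    root = Path(outdir)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


with tempfile.TemporaryDirectory() as tmp:
    for d in EXPECTED_DIRS[:-1]:  # create everything except logs/
        (Path(tmp) / d).mkdir(parents=True)
    missing = missing_outputs(tmp)
print(missing)
```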

Common Pitfalls

1. High Mitochondrial Read Fraction

Mitochondrial DNA is not packaged into nucleosomes, so it is highly accessible to Tn5 and often captures 30-80% of reads. This is the most common ATAC-seq quality issue. Filter chrM reads before analysis. If more than 50% of reads are mitochondrial, consider optimizing the cell lysis step.
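One way to measure the mitochondrial fraction is from the per-chromosome mapped-read counts that `samtools idxstats` prints (tab-separated: reference name, length, mapped reads, unmapped reads). The sketch below parses that text. The helper name is hypothetical, and the counts in the example are made up for illustration.

```python
def mito_fraction(idxstats_text, mito_name="chrM"):
    """Fraction of mapped reads assigned to the mitochondrial contig."""
    total = mito = 0
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":  # summary row for reads with no reference
            continue
        total += int(mapped)
        if name == mito_name:
            mito += int(mapped)
    if total == 0:
        raise ValueError("no mapped reads found")
    return mito / total


# Illustrative counts only, in idxstats column order.
example = (
    "chr1\t248956422\t600000\t0\n"
    "chr2\t242193529\t300000\t0\n"
    "chrM\t16569\t100000\t0\n"
    "*\t0\t0\t5000\n"
)
fraction = mito_fraction(example)
print(f"{fraction:.1%}")
```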

2. Missing Tn5 Shift Correction

Without the +4/-5 bp offset correction, inferred cut-site positions are off by 4-5 bp. This matters for footprinting and motif analysis but has minimal effect on peak calling. Always apply the shift for publication-quality results.

3. Using BWA Instead of Bowtie2

Bowtie2 handles the short fragments from ATAC-seq (especially NFR <150 bp) better than BWA-MEM. Use Bowtie2 with --very-sensitive for optimal ATAC-seq alignment.

4. Not Separating Nucleosomal Fractions

Peak calling on all fragments mixes nucleosome-free signal (TF binding) with nucleosomal signal. Always size-select NFR (<150 bp) for peak calling.

5. Ignoring TSS Enrichment

TSS enrichment is the most informative single metric for ATAC-seq quality. A score below 5 misses the ENCODE threshold for GRCh38 and should prompt a careful review of the other metrics; a score below 3 likely indicates a failed experiment.

Pipeline Scripts

| File | Description | Lines |
| --- | --- | --- |
| scripts/main.nf | Nextflow DSL2 pipeline | ~120 |
| scripts/nextflow.config | Execution profiles (local/slurm/gcp/aws) | ~60 |
| scripts/Dockerfile | Multi-stage Docker build with all tools | ~30 |

ENCODE Data Integration

After running on your own data, compare with ENCODE reference:

# Find matching ENCODE ATAC-seq experiments
encode_search_experiments(
    assay_title="ATAC-seq",
    organ="pancreas",
    biosample_type="tissue"
)

# Download ENCODE peaks for comparison
encode_batch_download(
    download_dir="/data/encode_reference/",
    output_type="IDR thresholded peaks",
    assay_title="ATAC-seq",
    organ="pancreas",
    assembly="GRCh38"
)
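Once both peak sets are on disk, a simple comparison is the fraction of your peaks that overlap any ENCODE reference peak. The sketch below assumes peaks have already been loaded as `(chrom, start, end)` tuples with half-open coordinates, as in BED. The function name is hypothetical, and the nested scan is quadratic, so a tool such as bedtools is the better choice for full peak sets.

```python
def fraction_overlapping(query, reference):
    """Fraction of query peaks that overlap at least one reference peak."""
    if not query:
        raise ValueError("query peak list is empty")
    by_chrom = {}
    for chrom, start, end in reference:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = 0
    for chrom, start, end in query:
        # Half-open intervals overlap when each starts before the other ends.
        if any(s < end and start < e for s, e in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(query)


my_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
encode_peaks = [("chr1", 150, 250), ("chr2", 300, 400)]
overlap = fraction_overlapping(my_peaks, encode_peaks)
print(round(overlap, 3))
```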

Pitfalls & Edge Cases

  • Tn5 shift is critical: ATAC-seq reads must be shifted +4/-5 bp to center on the Tn5 insertion site. Without this correction, footprinting analysis will be offset by 4-5 bp and motif enrichment will be degraded.
  • Mitochondrial reads dominate: Expect 30-80% mitochondrial reads in ATAC-seq. Filter chrM reads after alignment and before peak calling. A very high chrM fraction (>80%) indicates dead or dying cells or poor nuclei isolation.
  • Fragment size distribution is diagnostic: A nucleosomal ladder (sub-nucleosomal <150bp, mono-nucleosomal ~200bp, di-nucleosomal ~400bp) confirms successful transposition. Absence of the ladder suggests incomplete or failed transposition.
  • TSS enrichment threshold: ENCODE requires TSS enrichment ≥5 (GRCh38), ≥6 (hg19), or ≥10 (mm10) for ATAC-seq (ENCODE data standards). Values below the threshold for the assembly indicate poor signal-to-noise. This is the single most informative QC metric for ATAC-seq.
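  • Peak caller choice matters: MACS2 with --nomodel, a negative --shift, and an --extsize of twice the shift magnitude is standard for ATAC-seq. --shift -100 --extsize 200 is a common choice; the walkthrough below uses --shift -75 --extsize 150. Do NOT use the ChIP-seq default MACS2 settings, which assume sonicated fragment distributions.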
  • Paired-end vs single-end: ATAC-seq should be paired-end whenever possible so that fragment sizes can be measured. Single-end ATAC-seq cannot distinguish nucleosome-free from nucleosomal fragments, so size selection is unavailable.
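The MACS2 settings in the peak-caller bullet pair --nomodel with a shift of minus half the extension size, which centres each extended read on its 5' end. The sketch below builds such a command line. The helper name and file names are hypothetical, the genome size flag `-g hs` assumes human data, and the pipeline's actual invocation may carry additional options.

```python
import shlex


def macs2_atac_args(treatment, name, extsize=150, genome="hs", fmt="BAM"):
    """Build a macs2 callpeak command with shift = -(extsize / 2)."""
    shift = -(extsize // 2)
    return [
        "macs2", "callpeak",
        "-t", treatment,   # filtered, Tn5-shifted alignments
        "-f", fmt,
        "-g", genome,      # effective genome size shortcut (hs = human)
        "-n", name,        # output file prefix
        "--nomodel",       # skip the ChIP-seq fragment-size model
        "--shift", str(shift),
        "--extsize", str(extsize),
    ]


args = macs2_atac_args("sample1.nfr.bam", "sample1")
print(shlex.join(args))
```

Passing `extsize=200` yields `--shift -100 --extsize 200`, the other common setting.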

Walkthrough: Processing ENCODE ATAC-seq from FASTQ to Accessible Chromatin Peaks

Goal: Process raw ATAC-seq FASTQ files through the ENCODE pipeline to generate nucleosome-free region peaks and signal tracks for chromatin accessibility analysis. Context: ATAC-seq requires Tn5 transposase insertion site correction (+4/-5 bp shift) and nucleosomal fragment size filtering, handled by the ENCODE ATAC-seq pipeline.

Step 1: Find ATAC-seq experiment

encode_get_experiment(accession="ENCSR637ENO")

Expected output:

{
  "accession": "ENCSR637ENO",
  "assay_title": "ATAC-seq",
  "biosample_summary": "GM12878",
  "replicates": 2,
  "status": "released"
}

Step 2: List FASTQ files

encode_list_files(accession="ENCSR637ENO", file_format="fastq")

Expected output:

{
  "files": [
    {"accession": "ENCFF100ATQ", "output_type": "reads", "paired_end": "1", "biological_replicates": [1], "file_size_mb": 1800},
    {"accession": "ENCFF101ATQ", "output_type": "reads", "paired_end": "2", "biological_replicates": [1], "file_size_mb": 1900}
  ]
}
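Before launching the pipeline, the file listing needs to be grouped into read 1 / read 2 pairs per replicate. The sketch below works on a response shaped like the example output above. The function name is hypothetical, and it assumes one file per mate per replicate, so experiments with several sequencing runs per replicate would need extra handling.

```python
def pair_fastqs(files):
    """Group FASTQ records into (read1, read2) accessions keyed by replicate."""
    pairs = {}
    for record in files:
        for rep in record["biological_replicates"]:
            pairs.setdefault(rep, {})[record["paired_end"]] = record["accession"]
    return {rep: (mates.get("1"), mates.get("2")) for rep, mates in sorted(pairs.items())}


response = {
    "files": [
        {"accession": "ENCFF100ATQ", "output_type": "reads", "paired_end": "1",
         "biological_replicates": [1], "file_size_mb": 1800},
        {"accession": "ENCFF101ATQ", "output_type": "reads", "paired_end": "2",
         "biological_replicates": [1], "file_size_mb": 1900},
    ]
}
paired = pair_fastqs(response["files"])
print(paired)
```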

Step 3: Run the ATAC-seq pipeline

nextflow run pipeline-atacseq/main.nf \
  --fastq_r1 ENCFF100ATQ.fastq.gz \
  --fastq_r2 ENCFF101ATQ.fastq.gz \
  --genome GRCh38 \
  --blacklist encode_blacklist_v2.bed \
  --mitochondrial_chr chrM \
  -profile local

Key pipeline steps:

  1. Adapter trimming (Trim Galore, a wrapper around cutadapt)
  2. Alignment (Bowtie2, very-sensitive mode)
  3. Tn5 shift correction (+4/-5 bp)
  4. Mitochondrial read removal
  5. Nucleosome-free fragment selection (<150 bp)
  6. Peak calling (MACS2, --nomodel --shift -75 --extsize 150)

Step 4: Validate output quality

| Metric | Threshold | Purpose |
| --- | --- | --- |
| TSS enrichment | >= 5 (GRCh38), >= 6 (hg19), >= 10 (mm10) | Signal enrichment at transcription start sites |
| Fragment size distribution | Nucleosomal ladder | ~200bp, ~400bp, ~600bp periodicity |
| Mitochondrial reads | < 20% | Excessive = failed library |
| FRiP | >= 0.2 (>= 0.3 preferred, as in QC Thresholds) | Fraction of reads in peaks |
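The numeric checks in this step can be automated. The sketch below applies the assembly-specific TSS enrichment minimum together with the mitochondrial ceiling and a FRiP floor of 0.2. The metric key names and the `validate` function are hypothetical, and the fragment-size ladder is left out because it is usually judged from a plot.

```python
# Minimum TSS enrichment per assembly, as listed in the table above.
TSS_MINIMUM = {"GRCh38": 5, "hg19": 6, "mm10": 10}


def validate(metrics, assembly):
    """Return the names of metrics that miss their walkthrough threshold."""
    if assembly not in TSS_MINIMUM:
        raise ValueError(f"no TSS threshold known for {assembly}")
    failures = []
    if metrics["tss_enrichment"] < TSS_MINIMUM[assembly]:
        failures.append("tss_enrichment")
    if metrics["mito_fraction"] >= 0.20:
        failures.append("mito_fraction")
    if metrics["frip"] < 0.2:
        failures.append("frip")
    return failures


# Illustrative values, not real measurements.
metrics = {"tss_enrichment": 7.5, "mito_fraction": 0.08, "frip": 0.25}
print(validate(metrics, "GRCh38"))
print(validate(metrics, "mm10"))
```

The same numbers pass for GRCh38 but miss the stricter mm10 TSS threshold, which is why the assembly has to be part of the check.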

Step 5: Track and log provenance

encode_track_experiment(accession="ENCSR637ENO", notes="GM12878 ATAC-seq processed through ENCODE pipeline")

Integration with downstream skills

  • Accessible chromatin peaks feed into -> accessibility-aggregation for cross-experiment union merge
  • Peak regions feed into -> motif-analysis for TF motif enrichment
  • Signal tracks feed into -> visualization-workflow for browser display
  • Peaks feed into -> regulatory-elements for cCRE classification
  • QC metrics validated by -> quality-assessment

Code Examples

1. Find ATAC-seq data for processing

encode_search_experiments(
  assay_title="ATAC-seq",
  organ="pancreas"
)

Expected output:

{
  "total": 8,
  "experiments": [
    {
      "accession": "ENCSR789PAN",
      "assay_title": "ATAC-seq",
      "biosample_summary": "pancreas tissue male adult (44 years)",
      "status": "released"
    }
  ]
}

2. Check file details before download

encode_list_files(
  accession="ENCSR789PAN",
  file_format="fastq"
)

Expected output:

{
  "total": 4,
  "files": [
    {
      "accession": "ENCFF100ATQ",
      "file_format": "fastq",
      "read_length": 50,
      "paired_end": "1",
      "file_size_mb": 3200.1
    }
  ]
}

Integration

| This skill produces... | Feed into... | Purpose |
| --- | --- | --- |
| Accessible chromatin peaks | accessibility-aggregation | Cross-experiment union merge |
| Peak regions (BED) | motif-analysis | TF motif enrichment in open chromatin |
| Signal tracks (bigWig) | visualization-workflow | Genome browser accessibility display |
| Nucleosome-free peaks | regulatory-elements | Classify accessible regions as enhancers/promoters |
| Peak coordinates | variant-annotation | Identify variants in accessible chromatin |
| TSS enrichment scores | quality-assessment | Validate against ENCODE ATAC-seq standards |
| Pipeline parameters | data-provenance | Record Tn5 shift, fragment filters, tool versions |
| Peak files | jaspar-motifs | Scan accessible regions for known TF motifs |

Related Skills

  • pipeline-guide (parent): General pipeline selection and resource assessment
  • accessibility-aggregation: Merge ATAC-seq peaks across samples
  • quality-assessment: Deep-dive QC analysis beyond basic metrics
  • regulatory-elements: Annotate peaks with regulatory element classifications
  • compare-biosamples: Compare accessibility profiles across cell types
  • pipeline-chipseq: Sibling pipeline for ChIP-seq data
  • publication-trust: Verify literature claims backing analytical decisions

Presenting Results

When reporting ATAC-seq pipeline results:

  • TSS enrichment score: Report the TSS enrichment score prominently -- this is the single most informative ATAC-seq QC metric. Include the quality tier (Excellent >=7, Good 5-7, Marginal 3-5, Poor <3)
  • Fragment size distribution: Report NFR fraction (% fragments <150 bp) and confirm the characteristic nucleosomal ladder pattern (NFR, mono-, di-, tri-nucleosome peaks)
  • Peak counts: Report IDR optimal peak count and total MACS2 peaks before IDR filtering
  • NFR/mono-nucleosome ratio: Present the ratio of nucleosome-free to mono-nucleosomal fragments as a library quality indicator
  • Mitochondrial fraction: Report % mitochondrial reads removed (ideal <5%, acceptable <20%)
  • Key QC metrics: Present mapping rate, FRiP (>=0.3 for ATAC-seq), NRF, and duplication rate in a summary table
  • Output paths: List key outputs (peaks/idr/, signal/, qc/tss_enrichment/, qc/fragment_size/)
  • Next steps: Suggest motif-analysis for TF footprinting and de novo motif discovery, or visualization-workflow for genome browser session generation

For the request: "$ARGUMENTS"