Encode-toolkit pipeline-atacseq
Execute ENCODE ATAC-seq processing pipeline from FASTQ to peaks and signal tracks. Child of pipeline-guide. Provides stage-by-stage Nextflow execution with Docker containers and cloud deployment. Handles Tn5 transposase offset correction, mitochondrial read removal, nucleosome-free fragment selection, and TSS enrichment scoring. Use when users need to process ATAC-seq data following ENCODE standards. Trigger on: ATAC-seq pipeline, run ATAC-seq, process ATAC-seq, chromatin accessibility, open chromatin, Tn5 shift, TSS enrichment.
git clone https://github.com/ammawla/encode-toolkit
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pipeline-atacseq" ~/.claude/skills/ammawla-encode-toolkit-pipeline-atacseq-a50740 && rm -rf "$T"
skills/pipeline-atacseq/SKILL.md
ENCODE ATAC-seq Pipeline
When to Use
- User wants to run an ATAC-seq processing pipeline from FASTQ to peaks and signal tracks
- User asks about "ATAC-seq pipeline", "Tn5 shift", "chromatin accessibility pipeline", or "Bowtie2 for ATAC"
- User needs to process ATAC-seq data with proper Tn5 insertion site correction
- Example queries: "process my ATAC-seq FASTQs", "run ENCODE ATAC-seq pipeline", "call accessibility peaks from ATAC-seq"
Execute the ENCODE ATAC-seq processing pipeline from raw FASTQ files through Tn5 offset correction, peak calling, IDR analysis, and signal track generation. This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform analysis standards.
Overview
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) uses the Tn5 transposase to probe open chromatin regions. The ENCODE pipeline processes ATAC-seq data through quality control, alignment with Bowtie2, Tn5 insertion site correction (+4/-5 bp offset), mitochondrial read removal, nucleosome-free fragment selection, peak calling with MACS2, and IDR-based replicate consistency analysis.
Key differences from ChIP-seq: Bowtie2 aligner (optimized for short fragments), Tn5 transposase shift correction, aggressive mitochondrial read filtering (can be 30-80% of reads), nucleosomal fragment size distribution as a QC metric, and TSS enrichment score as the primary quality indicator.
Key Literature
| Reference | Journal | Year | DOI | Relevance |
|---|---|---|---|---|
| Buenrostro et al. "Transposition of native chromatin (ATAC-seq)" | Nature Methods | 2013 | 10.1038/nmeth.2688 | Original ATAC-seq method (~5,000 citations) |
| Corces et al. "An improved ATAC-seq protocol" | Nature Methods | 2017 | 10.1038/nmeth.4396 | Omni-ATAC improvements (~2,500 citations) |
| ENCODE Project Consortium "Expanded encyclopaedias" | Nature | 2020 | 10.1038/s41586-020-2493-4 | ENCODE Phase 3 standards |
| Amemiya et al. "ENCODE Blacklist" | Scientific Reports | 2019 | 10.1038/s41598-019-45839-z | Artifact regions (~1,372 citations) |
| Langmead & Salzberg "Fast gapped-read alignment with Bowtie 2" | Nature Methods | 2012 | 10.1038/nmeth.1923 | Aligner (~30,000 citations) |
| Yan et al. "From reads to insight: ATAC-seq analysis" | Genome Biology | 2020 | 10.1186/s13059-020-1929-3 | Analysis best practices |
Pipeline Stages
    FASTQ ──> FastQC / Trim Galore ──> Bowtie2 ──> Mito Removal + Tn5 Shift
                │                                             │
                │     ┌───────────────────────────────────────┘
                │     v
                │   Picard MarkDup ──> Blacklist Filter ──> Size Selection
                │                                                 │
                │         ┌───────────────────┬───────────────────┘
                │         v                   v
                │   NFR Fragments      Mono-Nucleosome
                │         │
                │         v
                │   MACS2 Peak Calling ──> IDR Analysis
                │         │                     │
                │         v                     v
                │   Signal Tracks      QC Report (MultiQC + ataqv)
                v
          Raw QC Report
Stage Summary
| Stage | Tool | Input | Output | Reference |
|---|---|---|---|---|
| 1. QC & Trimming | FastQC, Trim Galore | Raw FASTQ | Trimmed FASTQ | references/01-qc-trimming.md |
| 2. Alignment | Bowtie2 | Trimmed FASTQ | Sorted BAM | references/02-alignment.md |
| 3. Tn5 Shift & Filtering | Samtools, bedtools, Picard | Sorted BAM | Shifted, filtered BAM | references/03-tn5-filtering.md |
| 4. Peak Calling & IDR | MACS2, IDR | Filtered BAM | Peaks (narrowPeak) | references/04-peak-calling.md |
| 5. QC & Signal | deeptools, ataqv, MultiQC | Filtered BAM, Peaks | bigWig, QC report | references/05-qc-metrics.md |
Input Requirements
Required Files
- ATAC-seq FASTQ: Paired-end reads (strongly recommended; single-end supported)
- Reference genome: Bowtie2-indexed genome (GRCh38 for human, mm10 for mouse)
Sample Sheet Format
    sample_id,read1,read2,replicate
    SAMPLE1_rep1,atac_R1.fq.gz,atac_R2.fq.gz,1
    SAMPLE1_rep2,atac_R1.fq.gz,atac_R2.fq.gz,2
No input control needed: Unlike ChIP-seq, ATAC-seq does not require a separate input or IgG control. MACS2 calls peaks against a local background model.
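A sample sheet in this layout can be checked before the pipeline is launched. The sketch below is illustrative and not part of the toolkit: `parse_sample_sheet` is a hypothetical helper, the file names in the example are placeholders, and treating an empty `read2` field as single-end is an assumption.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "read1", "read2", "replicate")


def parse_sample_sheet(text):
    """Parse sample-sheet CSV text into a list of row dicts.

    Hypothetical helper: raises ValueError on missing columns,
    duplicate sample IDs, or a non-integer replicate number.
    """
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")

    rows, seen = [], set()
    for row in reader:
        sample_id = (row["sample_id"] or "").strip()
        if sample_id in seen:
            raise ValueError(f"duplicate sample_id: {sample_id}")
        seen.add(sample_id)
        rows.append({
            "sample_id": sample_id,
            "read1": (row["read1"] or "").strip(),
            # Assumption: an empty read2 field means a single-end library.
            "read2": (row["read2"] or "").strip() or None,
            "replicate": int(row["replicate"]),
        })
    return rows


sheet = """sample_id,read1,read2,replicate
SAMPLE1_rep1,rep1_R1.fq.gz,rep1_R2.fq.gz,1
SAMPLE1_rep2,rep2_R1.fq.gz,rep2_R2.fq.gz,2
"""
samples = parse_sample_sheet(sheet)
print([s["sample_id"] for s in samples])
```

Catching a malformed sheet at this point is cheaper than discovering it after alignment has already run for one replicate.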
Tn5 Transposase Offset Correction
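The Tn5 transposase inserts sequencing adapters as a dimer, leaving a 9-bp target-site duplication, so the 5' ends of the reads sit a few base pairs away from the center of the insertion event. To center reads on the actual cut site: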
- Forward strand (+): shift +4 bp
- Reverse strand (-): shift -5 bp
This correction is essential for accurate footprinting and motif analysis.
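As a minimal sketch of the offsets above, the function below shifts a single read interval. It assumes 0-based, half-open coordinates and moves the whole interval so that read length is preserved; some tools adjust only the 5' end instead. The function names are hypothetical.

```python
def tn5_shift(start, end, strand, plus_offset=4, minus_offset=5):
    """Shift a 0-based, half-open read interval toward the Tn5 cut site.

    Reads on the + strand move +4 bp; reads on the - strand move -5 bp.
    """
    if strand == "+":
        return start + plus_offset, end + plus_offset
    if strand == "-":
        # Clamp at 0 so a read near the chromosome start stays valid.
        new_start = max(0, start - minus_offset)
        return new_start, max(new_start, end - minus_offset)
    raise ValueError(f"unknown strand: {strand!r}")


def cut_site(start, end, strand):
    """Return the 0-based position of the corrected 5' end of the read."""
    new_start, new_end = tn5_shift(start, end, strand)
    return new_start if strand == "+" else new_end - 1


print(tn5_shift(100, 150, "+"))
print(tn5_shift(100, 150, "-"))
```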
Fragment Size Distribution
ATAC-seq produces a characteristic nucleosomal ladder pattern:
| Fragment Class | Size Range | Biological Meaning |
|---|---|---|
| Nucleosome-free (NFR) | <150 bp | Open chromatin / TF binding |
| Mono-nucleosome | 150-300 bp | Single nucleosome wrapping |
| Di-nucleosome | 300-500 bp | Two nucleosomes |
| Tri-nucleosome | 500-700 bp | Three nucleosomes |
For peak calling, use nucleosome-free reads (<150 bp) only.
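The size classes in the table can be expressed as a small lookup. This is a sketch only: treating each range as half-open (so a 150 bp fragment counts as mono-nucleosomal) is an assumption, because the table does not say which class owns the boundary values, and the names are hypothetical.

```python
from collections import Counter

# Exclusive upper bounds for each class, taken from the table above.
FRAGMENT_CLASSES = [
    (150, "nucleosome_free"),
    (300, "mono_nucleosome"),
    (500, "di_nucleosome"),
    (700, "tri_nucleosome"),
]


def classify_fragment(length):
    """Return the size class for a fragment length in bp."""
    if length <= 0:
        raise ValueError("fragment length must be positive")
    for upper, name in FRAGMENT_CLASSES:
        if length < upper:
            return name
    return "other"


def nfr_fraction(lengths):
    """Fraction of fragments that fall in the nucleosome-free class."""
    counts = Counter(classify_fragment(n) for n in lengths)
    total = sum(counts.values())
    return counts["nucleosome_free"] / total if total else 0.0


print(nfr_fraction([80, 120, 200, 400]))
```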
QC Thresholds
| Metric | Threshold | Category | Source |
|---|---|---|---|
| Total sequenced reads | >=50M (recommended) | Read depth | ENCODE |
| Mapping rate | >80% | Alignment | ENCODE |
| Mitochondrial fraction | <20% (ideal <5%) | Sample quality | ENCODE |
| NRF (non-redundant fraction) | >=0.8 | Library complexity | ENCODE |
| PBC1 | >=0.8 | Library complexity | ENCODE |
| TSS enrichment score | >=5 | Signal quality | ENCODE standard |
| FRiP | >=0.3 | Peak quality | ENCODE |
| NFR fraction | >0.4 of fragments <150bp | Fragment distribution | Buenrostro 2013 |
| IDR optimal peaks | >50,000 | Reproducibility | ENCODE |
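The two library-complexity metrics in the table follow the usual ENCODE definitions: NRF is the number of distinct read positions divided by the total number of reads, and PBC1 is the number of positions covered by exactly one read divided by the number of distinct positions. The sketch below computes both from a list of positions; the function name and the `(chrom, start, strand)` key are illustrative assumptions.

```python
from collections import Counter


def library_complexity(positions):
    """Compute NRF and PBC1 from read alignment positions.

    Each position is any hashable key, e.g. (chrom, start, strand).
    NRF  = distinct positions / total reads
    PBC1 = positions seen exactly once / distinct positions
    """
    counts = Counter(positions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    distinct = len(counts)
    singletons = sum(1 for n in counts.values() if n == 1)
    return {"NRF": distinct / total, "PBC1": singletons / distinct}


reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # duplicate of the first read
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
    ("chr2", 90, "-"),
]
metrics = library_complexity(reads)
print(metrics)
```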
TSS Enrichment Score
The TSS enrichment score measures the fold enrichment of ATAC-seq signal at transcription start sites compared to flanking regions. It is the single most informative QC metric for ATAC-seq:
| Score | Quality | Interpretation |
|---|---|---|
| >=7 | Excellent | High signal-to-noise |
| 5-7 | Good | Acceptable for most analyses |
| 3-5 | Marginal | Review other metrics carefully |
| <3 | Poor | Likely failed; consider re-doing |
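One common way to compute the score is to aggregate read depth in a window centered on every TSS, normalize by the mean depth in the outer flanks, and read off the value at the center. The sketch below assumes that formulation with 100 bp flanks; the function names and the toy profile are illustrative, and the tiers are the ones in the table above.

```python
def tss_enrichment(profile, flank=100):
    """Fold enrichment at the window center over the mean flank depth.

    `profile` is a list of aggregate per-base read depths across all
    TSSs, centered on the TSS (e.g. 4,000 values for +/-2,000 bp).
    """
    if len(profile) <= 2 * flank:
        raise ValueError("profile must be longer than both flanks combined")
    flanks = profile[:flank] + profile[-flank:]
    background = sum(flanks) / len(flanks)
    if background == 0:
        raise ValueError("flank depth is zero; cannot normalize")
    return profile[len(profile) // 2] / background


def quality_tier(score):
    """Map a TSS enrichment score to the tiers in the table above."""
    if score >= 7:
        return "Excellent"
    if score >= 5:
        return "Good"
    if score >= 3:
        return "Marginal"
    return "Poor"


# Toy profile: flat background of 10 with a depth of 60 at the center.
profile = [10.0] * 4000
profile[2000] = 60.0
score = tss_enrichment(profile)
print(score, quality_tier(score))
```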
Execution
Quick Start (Local Docker)
    nextflow run scripts/main.nf \
      -profile local \
      --reads 'fastq/*_R{1,2}.fq.gz' \
      --genome GRCh38 \
      --outdir results/
SLURM HPC
    nextflow run scripts/main.nf \
      -profile slurm \
      --reads 'fastq/*_R{1,2}.fq.gz' \
      --genome GRCh38 \
      --outdir results/
Google Cloud
    nextflow run scripts/main.nf \
      -profile gcp \
      --reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
      --genome GRCh38 \
      --outdir 'gs://bucket/results/'
AWS Batch
    nextflow run scripts/main.nf \
      -profile aws \
      --reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
      --genome GRCh38 \
      --outdir 's3://bucket/results/'
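The four invocations differ only in the profile name and the input and output locations, so a wrapper script can assemble them from one function. This is a sketch under that assumption; `build_command` is a hypothetical helper, and the flags are exactly the ones shown in the commands above.

```python
import shlex

PROFILES = ("local", "slurm", "gcp", "aws")


def build_command(profile, reads, outdir, genome="GRCh38", script="scripts/main.nf"):
    """Return the nextflow invocation as an argument list."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    return [
        "nextflow", "run", script,
        "-profile", profile,
        "--reads", reads,
        "--genome", genome,
        "--outdir", outdir,
    ]


cmd = build_command("aws", "s3://bucket/fastq/*_R{1,2}.fq.gz", "s3://bucket/results/")
# shlex.join quotes the glob so the printed line is safe to paste into a shell.
print(shlex.join(cmd))
```

Passing the list straight to `subprocess.run(cmd, check=True)` avoids shell expansion of the read glob, which Nextflow is meant to resolve itself.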
Cloud Cost Estimates
| Platform | Instance | Cost/Sample | Time/Sample | Notes |
|---|---|---|---|---|
| GCP | n1-standard-8 | ~$2-4 | 2-3 hours | Preemptible recommended |
| AWS | m5.2xlarge | ~$2-4 | 2-3 hours | Spot instances recommended |
| Local | 8 cores, 32GB | $0 | 3-5 hours | Docker required |
| SLURM | 8 cores, 32GB | Varies | 2-3 hours | Singularity recommended |
Output Directory Structure
    results/
      fastqc/              # Raw and trimmed QC reports
      trimmed/             # Trimmed FASTQ files
      aligned/             # Sorted BAM files (pre-filtering)
      filtered/
        shifted/           # Tn5-corrected BAM files
        nfr/               # Nucleosome-free fragments (<150 bp)
        mononuc/           # Mono-nucleosome fragments (150-300 bp)
      peaks/
        narrow/            # MACS2 narrowPeak files
        idr/               # IDR-filtered reproducible peaks
      signal/              # bigWig signal tracks
      qc/
        tss_enrichment/    # TSS enrichment scores and plots
        fragment_size/     # Fragment size distribution plots
        ataqv/             # Comprehensive ATAC-seq QC (ataqv)
        multiqc/           # Aggregated QC report
      logs/                # Nextflow execution logs
Common Pitfalls
1. High Mitochondrial Read Fraction
Mitochondrial DNA is not packaged into nucleosomes, so it is highly accessible to Tn5 and often captures 30-80% of reads. This is the most common ATAC-seq quality issue. Filter chrM reads before downstream analysis. If more than 50% of reads are mitochondrial, consider optimizing the cell lysis step.
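The mitochondrial fraction can be estimated from `samtools idxstats` output, which lists one tab-separated line per contig (name, length, mapped reads, unmapped reads). The sketch below parses that text; the function names and the example counts are illustrative, and the cut-offs in `mito_status` are the ones from the QC threshold table (<5% ideal, <20% acceptable).

```python
def mito_fraction(idxstats_text, mito_names=("chrM", "MT")):
    """Fraction of mapped reads that align to the mitochondrial contig."""
    mapped_total = 0
    mapped_mito = 0
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        mapped_total += int(mapped)
        if name in mito_names:
            mapped_mito += int(mapped)
    if mapped_total == 0:
        raise ValueError("no mapped reads")
    return mapped_mito / mapped_total


def mito_status(fraction):
    """Label a mitochondrial fraction using the QC thresholds."""
    if fraction < 0.05:
        return "ideal"
    if fraction < 0.20:
        return "acceptable"
    return "high"


# Made-up counts for illustration only.
example = "chr1\t248956422\t6000000\t0\nchrM\t16569\t4000000\t0\n*\t0\t0\t250000\n"
fraction = mito_fraction(example)
print(fraction, mito_status(fraction))
```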
2. Missing Tn5 Shift Correction
Without the +4/-5 bp offset correction, cut-site positions are shifted by ~4.5 bp. This matters for footprinting and motif analysis but has minimal effect on peak calling. Always apply the shift for publication-quality results.
3. Using BWA Instead of Bowtie2
Bowtie2 handles the short fragments from ATAC-seq (especially NFR <150bp) better than BWA-MEM. Use Bowtie2 with --very-sensitive for optimal ATAC-seq alignment.
4. Not Separating Nucleosomal Fractions
Peak calling on all fragments mixes nucleosome-free signal (TF binding) with nucleosomal signal. Always size-select NFR (<150 bp) for peak calling.
5. Ignoring TSS Enrichment
TSS enrichment is the most informative single metric for ATAC-seq quality. A score below 5 misses the ENCODE threshold for GRCh38 and calls for careful review of the other metrics; a score below 3 usually indicates a failed experiment.
Pipeline Scripts
| Description | Lines |
|---|---|
| Nextflow DSL2 pipeline | ~120 |
| Execution profiles (local/slurm/gcp/aws) | ~60 |
| Multi-stage Docker build with all tools | ~30 |
ENCODE Data Integration
After running on your own data, compare with ENCODE reference:
    # Find matching ENCODE ATAC-seq experiments
    encode_search_experiments(
        assay_title="ATAC-seq",
        organ="pancreas",
        biosample_type="tissue"
    )

    # Download ENCODE peaks for comparison
    encode_batch_download(
        download_dir="/data/encode_reference/",
        output_type="IDR thresholded peaks",
        assay_title="ATAC-seq",
        organ="pancreas",
        assembly="GRCh38"
    )
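Once reference peaks are downloaded, a simple first comparison is the fraction of your peaks that overlap any ENCODE peak. The sketch below shows the idea on in-memory intervals; the function names and coordinates are illustrative, and for real peak sets a sorted sweep, an interval tree, or `bedtools intersect` is the better choice.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_overlapping(query, reference):
    """Fraction of query peaks that overlap at least one reference peak.

    Quadratic scan; adequate only for a small illustration.
    """
    if not query:
        return 0.0
    hits = sum(1 for q in query if any(overlaps(q, r) for r in reference))
    return hits / len(query)


# Made-up coordinates for illustration only.
mine = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150), ("chr3", 10, 60)]
encode = [("chr1", 250, 400), ("chr2", 0, 60), ("chr3", 60, 90)]
print(fraction_overlapping(mine, encode))
```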
Pitfalls & Edge Cases
- Tn5 shift is critical: ATAC-seq reads must be shifted +4/-5 bp to center on the Tn5 insertion site. Without this correction, footprints are offset by 4-5 bp and motif enrichment is degraded.
- Mitochondrial reads dominate: Expect 30-80% mitochondrial reads in ATAC-seq. Filter chrM reads after alignment and before peak calling. A very high chrM fraction (>80%) indicates dead or dying cells or poor nuclei isolation.
- Fragment size distribution is diagnostic: A nucleosomal ladder (sub-nucleosomal <150bp, mono-nucleosomal ~200bp, di-nucleosomal ~400bp) confirms successful transposition. Absence of the ladder suggests incomplete or failed transposition.
- TSS enrichment threshold: ENCODE requires TSS enrichment ≥5 (GRCh38), ≥6 (hg19), or ≥10 (mm10) for ATAC-seq (ENCODE data standards). Values below 4 indicate poor signal-to-noise. This is the single most informative QC metric for ATAC-seq.
- Peak caller choice matters: MACS2 with --nomodel --shift -100 --extsize 200 is standard for ATAC-seq. Do NOT use the ChIP-seq default MACS2 settings; they assume sonicated fragment distributions.
- Paired-end vs single-end: ATAC-seq should always be paired-end to capture fragment sizes. Single-end ATAC-seq cannot distinguish nucleosome-free from nucleosomal fragments.
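Both MACS2 parameter pairs that appear in this document (-100/200 here and -75/150 in the walkthrough) set the shift to minus half of the extension size, which centers each extended read on its 5' end, the Tn5 cut site. The sketch below derives the shift from the extension size and assembles a `macs2 callpeak` argument list. The helper name, the file and sample names, and the choice of `-f BAM` and `-g hs` are assumptions for illustration, not settings confirmed by the pipeline.

```python
def macs2_atac_args(treatment, name, extsize=200, genome_size="hs"):
    """Build a MACS2 callpeak argument list for ATAC-seq reads.

    The shift is minus half of extsize, so each extended read is
    centered on its 5' end.
    """
    if extsize <= 0 or extsize % 2:
        raise ValueError("extsize must be a positive even number")
    shift = -(extsize // 2)
    return [
        "macs2", "callpeak",
        "-t", treatment,
        "-f", "BAM",
        "-g", genome_size,
        "-n", name,
        "--nomodel",
        "--shift", str(shift),
        "--extsize", str(extsize),
    ]


args = macs2_atac_args("nfr.bam", "sample1")
print(" ".join(args))
```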
Walkthrough: Processing ENCODE ATAC-seq from FASTQ to Accessible Chromatin Peaks
Goal: Process raw ATAC-seq FASTQ files through the ENCODE pipeline to generate nucleosome-free region peaks and signal tracks for chromatin accessibility analysis.
Context: ATAC-seq requires Tn5 transposase insertion site correction (+4/-5 bp shift) and nucleosomal fragment size filtering, handled by the ENCODE ATAC-seq pipeline.
Step 1: Find ATAC-seq experiment
encode_get_experiment(accession="ENCSR637ENO")
Expected output:
    {
      "accession": "ENCSR637ENO",
      "assay_title": "ATAC-seq",
      "biosample_summary": "GM12878",
      "replicates": 2,
      "status": "released"
    }
Step 2: List FASTQ files
encode_list_files(accession="ENCSR637ENO", file_format="fastq")
Expected output:
    {
      "files": [
        {"accession": "ENCFF100ATQ", "output_type": "reads", "paired_end": "1", "biological_replicates": [1], "file_size_mb": 1800},
        {"accession": "ENCFF101ATQ", "output_type": "reads", "paired_end": "2", "biological_replicates": [1], "file_size_mb": 1900}
      ]
    }
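Before launching the pipeline, the file listing has to be turned into R1/R2 pairs per replicate. The sketch below does this for a response shaped like the expected output above. `pair_fastqs` is a hypothetical helper, and it assumes one file per read end per replicate; libraries sequenced across several lanes would need the files grouped further.

```python
import json
from collections import defaultdict


def pair_fastqs(listing):
    """Group FASTQ records into (R1, R2) accessions per biological replicate."""
    by_rep = defaultdict(dict)
    for record in listing["files"]:
        for rep in record["biological_replicates"]:
            by_rep[rep][record["paired_end"]] = record["accession"]

    pairs = {}
    for rep, ends in sorted(by_rep.items()):
        if "1" not in ends or "2" not in ends:
            raise ValueError(f"replicate {rep} is missing a mate file")
        pairs[rep] = (ends["1"], ends["2"])
    return pairs


response = json.loads("""
{"files": [
  {"accession": "ENCFF100ATQ", "paired_end": "1", "biological_replicates": [1]},
  {"accession": "ENCFF101ATQ", "paired_end": "2", "biological_replicates": [1]}
]}
""")
pairs = pair_fastqs(response)
print(pairs)
```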
Step 3: Run the ATAC-seq pipeline
    nextflow run pipeline-atacseq/main.nf \
      --fastq_r1 ENCFF100ATQ.fastq.gz \
      --fastq_r2 ENCFF101ATQ.fastq.gz \
      --genome GRCh38 \
      --blacklist encode_blacklist_v2.bed \
      --mitochondrial_chr chrM \
      -profile docker
Key pipeline steps:
- Adapter trimming (Trim Galore, a wrapper around cutadapt)
- Alignment (Bowtie2, very-sensitive mode)
- Tn5 shift correction (+4/-5 bp)
- Mitochondrial read removal
- Nucleosome-free fragment selection (<150 bp)
- Peak calling (MACS2, --nomodel --shift -75 --extsize 150)
Step 4: Validate output quality
| Metric | Threshold | Purpose |
|---|---|---|
| TSS enrichment | >= 5 (GRCh38), >= 6 (hg19), >= 10 (mm10) | Signal enrichment at transcription start sites |
| Fragment size distribution | Nucleosomal ladder | ~200bp, ~400bp, ~600bp periodicity |
| Mitochondrial reads | < 20% | Excessive = failed library |
| FRiP | >= 0.2 (>= 0.3 preferred) | Fraction of reads in peaks |
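The numeric thresholds in this table can be checked programmatically once the pipeline reports its metrics. The sketch below is one way to do that; the metric key names and the `validate_qc` helper are assumptions, and the fragment size ladder is left out because it needs visual or distribution-based inspection.

```python
# Minimum TSS enrichment per assembly, from the table above.
TSS_MINIMUM = {"GRCh38": 5, "hg19": 6, "mm10": 10}


def validate_qc(metrics, assembly="GRCh38"):
    """Return a dict of check name -> bool for the walkthrough thresholds."""
    if assembly not in TSS_MINIMUM:
        raise ValueError(f"no TSS threshold for assembly: {assembly}")
    return {
        "tss_enrichment": metrics["tss_enrichment"] >= TSS_MINIMUM[assembly],
        "mito_fraction": metrics["mito_fraction"] < 0.20,
        "frip": metrics["frip"] >= 0.2,
    }


# Made-up metric values for illustration only.
metrics = {"tss_enrichment": 8.2, "mito_fraction": 0.12, "frip": 0.31}
human = validate_qc(metrics, "GRCh38")
mouse = validate_qc(metrics, "mm10")
print(human)
print(mouse)
```

The same score passes for GRCh38 and fails for mm10, which is why the assembly has to be an explicit input rather than a fixed constant.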
Step 5: Track and log provenance
encode_track_experiment(accession="ENCSR637ENO", notes="GM12878 ATAC-seq processed through ENCODE pipeline")
Integration with downstream skills
- Accessible chromatin peaks -> accessibility-aggregation for cross-experiment union merge
- Peak regions -> motif-analysis for TF motif enrichment
- Signal tracks -> visualization-workflow for browser display
- Peaks -> regulatory-elements for cCRE classification
- QC metrics -> quality-assessment for validation
Code Examples
1. Find ATAC-seq data for processing
    encode_search_experiments(
        assay_title="ATAC-seq",
        organ="pancreas"
    )
Expected output:
    {
      "total": 8,
      "experiments": [
        {
          "accession": "ENCSR789PAN",
          "assay_title": "ATAC-seq",
          "biosample_summary": "pancreas tissue male adult (44 years)",
          "status": "released"
        }
      ]
    }
2. Check file details before download
    encode_list_files(
        accession="ENCSR789PAN",
        file_format="fastq"
    )
Expected output:
    {
      "total": 4,
      "files": [
        {
          "accession": "ENCFF100ATQ",
          "file_format": "fastq",
          "read_length": 50,
          "paired_end": "1",
          "file_size_mb": 3200.1
        }
      ]
    }
Integration
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Accessible chromatin peaks | accessibility-aggregation | Cross-experiment union merge |
| Peak regions (BED) | motif-analysis | TF motif enrichment in open chromatin |
| Signal tracks (bigWig) | visualization-workflow | Genome browser accessibility display |
| Nucleosome-free peaks | regulatory-elements | Classify accessible regions as enhancers/promoters |
| Peak coordinates | variant-annotation | Identify variants in accessible chromatin |
| TSS enrichment scores | quality-assessment | Validate against ENCODE ATAC-seq standards |
| Pipeline parameters | data-provenance | Record Tn5 shift, fragment filters, tool versions |
| Peak files | jaspar-motifs | Scan accessible regions for known TF motifs |
Related Skills
- pipeline-guide (parent): General pipeline selection and resource assessment
- accessibility-aggregation: Merge ATAC-seq peaks across samples
- quality-assessment: Deep-dive QC analysis beyond basic metrics
- regulatory-elements: Annotate peaks with regulatory element classifications
- compare-biosamples: Compare accessibility profiles across cell types
- pipeline-chipseq: Sibling pipeline for ChIP-seq data
- publication-trust: Verify literature claims backing analytical decisions
Presenting Results
When reporting ATAC-seq pipeline results:
- TSS enrichment score: Report the TSS enrichment score prominently -- this is the single most informative ATAC-seq QC metric. Include the quality tier (Excellent >=7, Good 5-7, Marginal 3-5, Poor <3)
- Fragment size distribution: Report NFR fraction (% fragments <150 bp) and confirm the characteristic nucleosomal ladder pattern (NFR, mono-, di-, tri-nucleosome peaks)
- Peak counts: Report IDR optimal peak count and total MACS2 peaks before IDR filtering
- NFR/mono-nucleosome ratio: Present the ratio of nucleosome-free to mono-nucleosomal fragments as a library quality indicator
- Mitochondrial fraction: Report % mitochondrial reads removed (ideal <5%, acceptable <20%)
- Key QC metrics: Present mapping rate, FRiP (>=0.3 for ATAC-seq), NRF, and duplication rate in a summary table
- Output paths: List key outputs (peaks/idr/, signal/, qc/tss_enrichment/, qc/fragment_size/)
- Next steps: Suggest motif-analysis for TF footprinting and de novo motif discovery, or visualization-workflow for genome browser session generation