Encode-toolkit pipeline-dnaseseq

Execute ENCODE DNase-seq pipeline from FASTQ to hotspots and footprints. Child of pipeline-guide. Provides Nextflow execution with Docker and cloud deployment. Use when processing DNase-seq data, calling DNase hypersensitive sites, performing footprinting analysis. Trigger on: DNase-seq pipeline, DNase hypersensitive, DHS, Hotspot2, footprinting, DNase I, chromatin accessibility DNase.

install

source · Clone the upstream repo

git clone https://github.com/ammawla/encode-toolkit

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pipeline-dnaseseq" ~/.claude/skills/ammawla-encode-toolkit-pipeline-dnaseseq-c2e00f && rm -rf "$T"

manifest: skills/pipeline-dnaseseq/SKILL.md

source content

ENCODE DNase-seq Pipeline: FASTQ to Hotspots and Footprints

When to Use

User wants to run a DNase-seq processing pipeline from FASTQ to hotspots and footprints
User asks about "DNase-seq pipeline", "DNase hypersensitive sites", "Hotspot2", "footprinting", or "DHS"
User needs to process DNase-seq data for chromatin accessibility and TF footprint analysis
Example queries: "process my DNase-seq FASTQs", "call DNase hypersensitive sites", "run footprinting analysis on DNase-seq"

Execute the ENCODE DNase-seq pipeline for chromatin accessibility profiling, producing DNase hypersensitive sites (DHSs) via Hotspot2 and transcription factor footprints.

Pipeline Overview

FASTQ -> Trim -> BWA-MEM align -> Filter/dedup -> Hotspot2 -> DHS peaks
                                       |                        |
                                    Signal track         Footprinting (HINT)

ENCODE Repository

GitHub:
```
ENCODE-DCC/dnase-seq-pipeline
```
Container:
```
encodedcc/dnase-seq-pipeline
```
WDL: Available for Cromwell execution
This skill: Nextflow DSL2 reimplementation for portability

Core Tools and Versions

Tool	Version	Purpose	Citation
BWA-MEM	0.7.17	Alignment	Li & Durbin 2009
samtools	1.19	BAM operations	Li et al. 2009
Picard	3.1.1	Duplicate marking	Broad Institute
Hotspot2	2.3.1	DHS calling (ENCODE standard)	John et al. 2011
bedtools	2.31.0	Genomic arithmetic	Quinlan & Hall 2010
HINT-ATAC	0.13.2	TF footprinting	Li et al. 2019
FastQC	0.12.1	Read quality	Andrews (Babraham)
MultiQC	1.21	Aggregated QC	Ewels et al. 2016

Key Literature

John et al. 2011 - "Chromatin accessibility pre-determines glucocorticoid receptor binding patterns" (Nature Genetics, ~600 citations) DOI: 10.1038/ng.759
Thurman et al. 2012 - "The accessible chromatin landscape of the human genome" (Nature, ~3,000 citations) DOI: 10.1038/nature11232
Vierstra et al. 2020 - "Global reference mapping of human transcription factor footprints" (Nature, ~600 citations) DOI: 10.1038/s41586-020-2528-x
Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z
Li et al. 2019 - "Identification of transcription factor binding sites using ATAC-seq" (Genome Biology) -- HINT-ATAC footprinting DOI: 10.1186/s13059-019-1642-2

Execution

Quick Start (Local)

nextflow run main.nf \
    -profile local \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bwa_index '/ref/bwa_index/genome.fa' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --hotspot_index '/ref/hotspot2_index/' \
    --blacklist '/ref/hg38-blacklist.v2.bed' \
    --outdir results/ \
    -resume

SLURM HPC

nextflow run main.nf \
    -profile slurm \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bwa_index '/ref/bwa_index/genome.fa' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --hotspot_index '/ref/hotspot2_index/' \
    --blacklist '/ref/hg38-blacklist.v2.bed' \
    --outdir results/ \
    -resume

Cloud (GCP / AWS)

nextflow run main.nf \
    -profile gcp \
    --reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
    --bwa_index 'gs://bucket/ref/genome.fa' \
    --chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
    --hotspot_index 'gs://bucket/ref/hotspot2_index/' \
    --blacklist 'gs://bucket/ref/hg38-blacklist.v2.bed' \
    --outdir 'gs://bucket/results/' \
    -resume

Resource Requirements

Step	CPUs	RAM	Time (per sample)
BWA-MEM align	8	16 GB	1-2 hours
Filter/dedup	4	8 GB	30-60 min
Hotspot2	4	8 GB	30-60 min
Signal generation	2	4 GB	15-30 min
Footprinting	4	8 GB	1-2 hours
Total	8	16 GB	3-6 hours

Pipeline Parameters

Parameter	Default	Description
`--reads`	required	Glob pattern to paired FASTQ files
`--bwa_index`	required	Path to BWA genome index (.fa)
`--chrom_sizes`	required	Chromosome sizes file
`--hotspot_index`	required	Hotspot2 mappability index directory
`--blacklist`	required	ENCODE blacklist BED file
`--outdir`	`./results`	Output directory
`--fdr`	`0.05`	Hotspot2 FDR threshold
`--skip_footprint`	`false`	Skip footprinting analysis
`--motif_db`	`null`	JASPAR motif database for footprinting

Output Files

results/
  fastqc/                          # Raw read quality
  alignment/
    {sample}.filtered.bam          # Filtered, deduplicated BAM
    {sample}.filtered.bam.bai
  hotspots/
    {sample}.hotspots.fdr0.05.bed  # DHS peaks (primary output)
    {sample}.peaks.narrowPeak      # narrowPeak format
    {sample}.density.bw            # Signal track (bigWig)
    {sample}.allcalls.bed          # All hotspot calls (unfiltered)
    {sample}.SPOT.txt              # SPOT score
  footprints/
    {sample}.footprints.bed        # TF footprints
    {sample}.footprint_scores.txt  # Per-motif footprint scores
  qc/
    {sample}.flagstat.txt
    {sample}.insert_sizes.txt
  multiqc/
    multiqc_report.html

QC Thresholds (ENCODE Standards)

Metric	Pass	Warning	Fail
SPOT score (Signal Portion of Tags)	>0.4	0.2-0.4	<0.2
Hotspot count	>50,000	20,000-50,000	<20,000
Mapping rate	>80%	60-80%	<60%
Duplication rate	<30%	30-50%	>50%
NRF (Non-Redundant Fraction)	>0.8	0.7-0.8	<0.7
PBC1 (PCR Bottleneck Coefficient 1)	>0.9	0.7-0.9	<0.7
Insert size peak	50-150 bp	Variable	Abnormal

SPOT Score

The SPOT score (Signal Portion of Tags) is the fraction of reads falling within hotspots. It is the DNase-seq equivalent of FRiP for ChIP-seq.

Higher SPOT = more enrichment in accessible regions = better library quality.

Hotspot2 vs MACS2

IMPORTANT: ENCODE uses Hotspot2 for DNase-seq, NOT MACS2.

Feature	Hotspot2	MACS2
Designed for	DNase-seq	ChIP-seq
Background model	Local tag density + mappability	Dynamic Poisson
ENCODE standard	Yes (DNase-seq)	Yes (ChIP-seq/ATAC-seq)
Mappability correction	Built-in	Not available
Output	Hotspots + peaks	Peaks only

Hotspot2 accounts for mappability variation across the genome, which is critical for DNase-seq because DNase I cuts accessible chromatin regardless of whether it is uniquely mappable.

Critical Pitfalls

DNase-seq vs ATAC-seq

These are different assays measuring the same biology (chromatin accessibility):

DNase-seq: Uses DNase I enzyme, requires more input material
ATAC-seq: Uses Tn5 transposase, works on fewer cells
Analysis pipelines differ: Hotspot2 for DNase-seq, MACS2 for ATAC-seq
Data are largely concordant but not identical

Fragment Size Distribution

DNase-seq produces a characteristic fragment size distribution:

Peak at ~50-100 bp (sub-nucleosomal fragments at DHS)
Secondary peak at ~150-200 bp (mononucleosomal fragments)
Long tail of larger fragments
If distribution is abnormal, check library preparation protocol

Mappability Index

Hotspot2 requires a pre-computed mappability index. These are read-length and genome-build specific:

hg38 / 36 bp: Use ENCODE-provided index
hg38 / 76 bp: Use ENCODE-provided index
hg38 / 150 bp: May need to generate custom index
Wrong mappability index = incorrect peak calls

Blacklist Filtering

Always filter peaks against the ENCODE blacklist (Amemiya et al. 2019):

bedtools intersect -a hotspots.bed -b hg38-blacklist.v2.bed -v > hotspots_filtered.bed

Blacklist regions produce artifactual signal in accessibility assays.

Footprinting Analysis

Transcription factor footprinting detects bound TFs from DNase-seq signal:

HINT-ATAC Footprinting

rgt-hint footprinting \
    --atac-seq \
    --paired-end \
    --organism hg38 \
    --output-location footprints/ \
    sample.filtered.bam \
    hotspots.narrowPeak

Interpretation

Footprints are depressions in the DNase signal where a bound TF protects DNA
Requires deep sequencing (>100M reads) for reliable footprints
Sensitivity varies by TF: pioneer factors have shallow footprints
Vierstra et al. 2020 provides a global reference map for comparison

Provenance Integration

After pipeline completion, log all outputs:

encode_log_derived_file(
    file_path="/results/hotspots/sample1.hotspots.fdr0.05.bed",
    source_accessions=["ENCSR...", "ENCFF..."],
    description="DNase hypersensitive sites from ENCODE DNase-seq pipeline",
    file_type="DHS_peaks",
    tool_used="BWA 0.7.17 + Hotspot2 2.3.1",
    parameters="FDR 0.05, blacklist filtered, ENCODE hg38 mappability index"
)

Reference Files

Detailed step-by-step documentation is provided in the

references/

directory:

```
01-qc-trimming.md
```
-- Read QC and adapter trimming
```
02-alignment.md
```
-- BWA-MEM alignment for DNase-seq
```
03-filtering.md
```
-- BAM filtering, deduplication, blacklist removal
```
04-hotspot-calling.md
```
-- Hotspot2 DHS detection and signal generation
```
05-footprinting.md
```
-- TF footprint detection with HINT-ATAC

Walkthrough: Processing ENCODE DNase-seq from FASTQ to Hypersensitive Sites

Goal: Process raw DNase-seq FASTQ files through the ENCODE pipeline to generate DNase I hypersensitive site (DHS) peak calls. Context: DNase-seq identifies open chromatin via DNase I enzyme digestion. The pipeline uses BWA alignment and Hotspot2 for DHS identification.

Step 1: Find DNase-seq experiment

encode_search_experiments(assay_title="DNase-seq", biosample_term_name="K562", organism="Homo sapiens")

Expected output:

{
  "total": 8,
  "results": [
    {"accession": "ENCSR000DNS", "assay_title": "DNase-seq", "biosample_summary": "K562", "status": "released"}
  ]
}

Step 2: List and download FASTQ files

encode_list_files(accession="ENCSR000DNS", file_format="fastq")

Step 3: Run the DNase-seq pipeline

nextflow run pipeline-dnaseseq/main.nf \
  --fastq_r1 ENCFF700DN1.fastq.gz \
  --genome GRCh38 \
  --blacklist encode_blacklist_v2.bed \
  -profile docker

Key pipeline steps:

Quality trimming
BWA-MEM alignment
Duplicate removal
Hotspot2 DHS calling
Signal track generation
Footprint analysis (HINT-ATAC)

Step 4: Validate output quality

Metric	Threshold	Purpose
SPOT score	> 0.4	Signal portion of tags
Hotspot count	> 100,000	Sensitivity
Duplicate rate	< 30%	Library complexity

Step 5: Compare with ATAC-seq

encode_search_experiments(assay_title="ATAC-seq", biosample_term_name="K562", organism="Homo sapiens")

Interpretation: DNase-seq and ATAC-seq both measure accessibility but with different biases. Compare peaks from both assays -- concordant peaks are high confidence.

Integration with downstream skills

DHS peaks feed into -> accessibility-aggregation alongside ATAC-seq peaks
Footprint data feeds into -> motif-analysis for TF binding prediction
Signal tracks feed into -> visualization-workflow
Peaks integrate with -> regulatory-elements for cCRE classification

Code Examples

1. Survey DNase-seq availability

encode_get_facets(assay_title="DNase-seq", facet_field="organ", organism="Homo sapiens")

Expected output:

{
  "facets": {
    "organ": {"blood": 45, "brain": 30, "liver": 20, "heart": 15, "lung": 12}
  }
}

2. Check for existing DHS peaks

encode_list_files(accession="ENCSR000DNS", file_format="bed", output_type="peaks", assembly="GRCh38")

Expected output:

{
  "files": [
    {"accession": "ENCFF800DHS", "output_type": "peaks", "file_format": "bed narrowPeak", "file_size_mb": 1.5}
  ]
}

3. Track DNase-seq experiments

encode_track_experiment(accession="ENCSR000DNS", notes="K562 DNase-seq for accessibility comparison with ATAC-seq")

Expected output:

{
  "status": "tracked",
  "accession": "ENCSR000DNS",
  "notes": "K562 DNase-seq for accessibility comparison with ATAC-seq"
}

Integration

This skill produces...	Feed into...	Purpose
DHS peaks (narrowPeak)	accessibility-aggregation	Union merge with ATAC-seq peaks
TF footprints	motif-analysis	Validate motif predictions with footprint evidence
Signal tracks (bigWig)	visualization-workflow	Genome browser display
Accessible regions	regulatory-elements	cCRE classification
DHS coordinates	variant-annotation	Annotate variants in hypersensitive sites
QC metrics	quality-assessment	Validate SPOT score and sensitivity
Pipeline parameters	data-provenance	Record BWA/Hotspot2 versions
DHS peak regions	jaspar-motifs	Scan accessible sites for known TF motifs

Related Skills

```
pipeline-guide
```
-- Parent skill with compute resource assessment and cloud setup
```
accessibility-aggregation
```
-- Aggregate DHS data across samples/tissues
```
quality-assessment
```
-- Evaluate pipeline output quality metrics
```
data-provenance
```
-- Track all pipeline inputs, outputs, and parameters
```
download-encode
```
-- Download ENCODE DNase-seq FASTQ files for pipeline input
```
publication-trust
```
-- Verify literature claims backing analytical decisions

Presenting Results

When reporting DNase-seq pipeline results:

Hotspot counts: Report total Hotspot2 DHS calls at the specified FDR threshold and the number remaining after blacklist filtering
Signal-to-noise (SPOT score): Report the SPOT score prominently (>0.4 pass, 0.2-0.4 warning, <0.2 fail). This is the DNase-seq equivalent of FRiP
Footprint depth: If footprinting was performed, report the number of TF footprints detected and note the sequencing depth (>100M reads recommended for reliable footprints)
FRiP equivalent: Report the fraction of reads in hotspots as a complementary enrichment metric
Key QC metrics: Present mapping rate (>80%), duplication rate (<30%), NRF (>0.8), PBC1 (>0.9), and insert size peak in a summary table
Output paths: Provide paths to hotspot BED files, narrowPeak files, signal bigWig tracks, and footprint results
Mappability note: Confirm which Hotspot2 mappability index was used and that it matches the read length
Next steps: Suggest
```
motif-analysis
```
for TF motif enrichment in DHS peaks, or
```
accessibility-aggregation
```
for merging DHS data across samples

Encode-toolkit pipeline-dnaseseq

ENCODE DNase-seq Pipeline: FASTQ to Hotspots and Footprints

When to Use

Pipeline Overview

ENCODE Repository

Core Tools and Versions

Key Literature

Execution

Quick Start (Local)

SLURM HPC

Cloud (GCP / AWS)

Resource Requirements

Pipeline Parameters

Output Files

QC Thresholds (ENCODE Standards)

SPOT Score

Hotspot2 vs MACS2

Critical Pitfalls

DNase-seq vs ATAC-seq

Fragment Size Distribution

Mappability Index

Blacklist Filtering

Footprinting Analysis

HINT-ATAC Footprinting

Interpretation

Provenance Integration

Reference Files

Walkthrough: Processing ENCODE DNase-seq from FASTQ to Hypersensitive Sites

Step 1: Find DNase-seq experiment

Step 2: List and download FASTQ files

Step 3: Run the DNase-seq pipeline

Step 4: Validate output quality

Step 5: Compare with ATAC-seq

Integration with downstream skills

Code Examples

1. Survey DNase-seq availability

2. Check for existing DHS peaks

3. Track DNase-seq experiments

Integration

Related Skills

Presenting Results

For the request: "$ARGUMENTS"