Encode-toolkit accessibility-aggregation

Build comprehensive chromatin accessibility maps by aggregating ATAC-seq and DNase-seq narrowPeak data across multiple ENCODE experiments, donors, and labs. Use when the user wants to answer "where is chromatin accessible in my tissue?" by combining peak calls into a union peak set. Handles cross-lab variation, ATAC vs DNase platform differences, and ENCODE blocklist filtering.

install

source · Clone the upstream repo

git clone https://github.com/ammawla/encode-toolkit

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/accessibility-aggregation" ~/.claude/skills/ammawla-encode-toolkit-accessibility-aggregation-ce003b && rm -rf "$T"

manifest: skills/accessibility-aggregation/SKILL.md

source content

When to Use

User wants to combine ATAC-seq or DNase-seq peaks across multiple experiments for a tissue
User asks "where is chromatin accessible in my tissue?" or "build an open chromatin map"
User needs to merge accessibility data from different labs, donors, or platforms (ATAC vs DNase)
User wants a comprehensive set of open chromatin regions for regulatory element discovery
Example queries: "aggregate ATAC-seq peaks for pancreas", "combine DNase-seq across donors", "find all accessible regions in liver"

Aggregate Chromatin Accessibility Peaks Across Studies

Build a comprehensive map of open chromatin for a tissue/cell type by merging ATAC-seq and/or DNase-seq narrowPeak files from multiple ENCODE experiments.

Scientific Rationale

The question: "Where is chromatin accessible in my tissue?"

Like histone marks, chromatin accessibility is a detection question. An open chromatin region detected in one donor but not another is still a real accessible site — individual variation, sequencing depth, and technical factors explain absence. We want the union of all detections.

ATAC-seq vs DNase-seq

Both measure open chromatin but with different biases:

Property	ATAC-seq	DNase-seq
Method	Tn5 transposase insertion	DNase I hypersensitivity
Input required	~50K cells	~1M cells
Resolution	High	High
GC bias	Moderate (Tn5 preference)	Low
Mitochondrial reads	High (filter needed)	None
ENCODE availability	Newer experiments	Extensive historical catalog
Comparability	Generally comparable at open regions

Literature Support

Corces et al. 2017 (Nature Methods, 733 citations): Established that ATAC-seq and DNase-seq identify largely overlapping accessible regions, with ATAC capturing ~75% of DNase sites. Both are valid for union maps.
ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Comprehensive set of problematic genomic regions to filter. Essential for all functional genomics analyses. DOI
F-Seq2 (Zhao & Boyle 2020, NAR Genomics): Improved peak caller for DNase-seq and ATAC-seq with proper test statistics for IDR compatibility.
ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Integrated accessibility data with histone marks across tissues for chromatin state annotation.

Recommendation: If combining ATAC-seq and DNase-seq peaks, treat them as equivalent signal sources for accessibility. The union is appropriate because both detect the same biological signal (open chromatin) through different enzymatic mechanisms.

Step 1: Find All Available Accessibility Data

# ATAC-seq
encode_search_experiments(
    assay_title="ATAC-seq",
    organ="pancreas",
    biosample_type="tissue",
    limit=100
)

# DNase-seq
encode_search_experiments(
    assay_title="DNase-seq",
    organ="pancreas",
    biosample_type="tissue",
    limit=100
)

Present a summary to the user:

Total ATAC-seq experiments
Total DNase-seq experiments
Labs represented
Whether to use one or both assay types

Combining ATAC + DNase?

Ask the user:

Same assay only (purest comparison, no cross-platform effects)
Both assays combined (maximum coverage, slight platform variation)

For a comprehensive accessibility catalog, combining both is scientifically justified.

Step 2: Quality-Gate Each Experiment

encode_get_experiment(accession="ENCSR...")

ATAC-seq Quality Checks

Audit status: no ERROR flags
Has IDR thresholded peaks
Low mitochondrial read fraction (ENCODE pipeline removes these)
Good TSS enrichment score
Nucleosome-free fragment enrichment visible

DNase-seq Quality Checks

Audit status: no ERROR flags
Has Hotspot2 peaks or IDR thresholded peaks
Adequate sequencing depth (20M+ mapped reads)
Signal-to-noise ratio

Track all included experiments:

encode_track_experiment(accession="ENCSR...")

Step 3: Download Peak Files

For each experiment:

# ATAC-seq — IDR thresholded peaks
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="IDR thresholded peaks",
    assembly="GRCh38"
)

# DNase-seq — may use different output types
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="peaks",
    assembly="GRCh38"
)

Prefer

preferred_default=True

files.

encode_download_files(
    file_accessions=["ENCFF...", ...],
    download_dir="/path/to/data/accessibility",
    organize_by="flat"
)

Step 4: Per-Sample Noise Filtering

4a. ENCODE Blocklist Filtering (Amemiya et al. 2019)

# Download from: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
bedtools intersect -a sample.narrowPeak -b hg38-blacklist.v2.bed -v > sample.filtered.narrowPeak

4b. SignalValue Filtering (Perna et al. 2024)

Same logic as histone aggregation — filter per-sample to top 75% by signalValue (column 7):

# Per-sample: remove bottom 25% by signalValue (true distribution quantile)
TOTAL=$(wc -l < sample.filtered.narrowPeak)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k7,7n sample.filtered.narrowPeak | awk -v line="$LINE_25" 'NR==line{print $7}')
awk -v t="$THRESHOLD" '$7 >= t' sample.filtered.narrowPeak > sample.qfiltered.narrowPeak

4c. ATAC-specific: Remove Sub-nucleosomal Artifacts (optional)

For ATAC-seq, very narrow peaks (<50bp) can be Tn5 insertion artifacts:

awk '($3-$2) >= 50' sample.qfiltered.narrowPeak > sample.clean.narrowPeak

Step 5: Union Merge

Accessibility peaks are narrow/point-source (like H3K4me3). Use default merge (overlap only, no gap tolerance).

CRITICAL: Tag peaks by sample before concatenation to count unique SAMPLES, not overlapping peaks:

# Tag each sample's peaks with a unique sample ID
awk -v sid="atac_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' atac_sample1.qfiltered.narrowPeak > atac_s1.tagged.bed
awk -v sid="dnase_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' dnase_sample1.qfiltered.narrowPeak > dnase_s1.tagged.bed
# ... repeat for all samples

# Concatenate all tagged peaks (ATAC + DNase combined or separate)
cat *.tagged.bed > all_accessibility.bed

# Sort
bedtools sort -i all_accessibility.bed > all_accessibility.sorted.bed

# Union merge — count UNIQUE SAMPLES (not peaks)
bedtools merge \
    -i all_accessibility.sorted.bed \
    -c 4,7,9 \
    -o count_distinct,max,max \
    > union_accessible_regions.bed
# Columns: chr, start, end, n_unique_samples, max_signalValue, max_qValue

If Tracking Assay Source

To annotate whether peaks came from ATAC, DNase, or both:

# Add assay tag to each peak before concatenation
awk '{print $0"\tATAC"}' atac_peaks.bed > tagged.bed
awk '{print $0"\tDNase"}' dnase_peaks.bed >> tagged.bed

# After merge, use bedtools multiIntersect to track sources
bedtools multiIntersect \
    -i atac_sample1.bed atac_sample2.bed dnase_sample1.bed ... \
    -header \
    -names ATAC_1 ATAC_2 DNase_1 ... \
    > multi_intersect.bed

Step 6: Confidence Annotation

Same logic as histone aggregation. Given N total samples:

Confidence	Criteria	Interpretation
High	≥50% of samples	Constitutive accessible region
Supported	2+ samples	Likely real, some variation
Singleton	1 sample only	Keep — may be individual-specific or condition-specific

awk -v N=8 '{
    if ($4 >= N*0.5) conf="HIGH";
    else if ($4 >= 2) conf="SUPPORTED";
    else conf="SINGLETON";
    print $0"\t"conf"\t"$4"/"N
}' union_accessible_regions.bed > union_accessible_regions.annotated.bed

Step 7: Log Provenance

encode_log_derived_file(
    file_path="/path/to/union_accessible_regions.annotated.bed",
    source_accessions=["ENCSR...", "ENCSR...", ...],
    description="Union chromatin accessibility peaks (ATAC-seq + DNase-seq) across N pancreas samples",
    file_type="aggregated_accessibility",
    tool_used="bedtools merge v2.31.0",
    parameters="blocklist filtered, signalValue >= 25th pctl per sample, ATAC min width 50bp, bedtools merge -d 0"
)

Step 8: Summary Statistics

Report to the user:

Total input experiments: N (ATAC: X, DNase: Y)
Experiments passing QC: M
Total peaks before merge: X
Union peaks after merge: Y
High-confidence regions: Z (≥50% support)
Supported regions: W (2+ support)
Singleton regions: V (1 sample only)
Genome coverage: bp covered / total genome
Overlap between ATAC-only and DNase-only peaks (if both assays used)

Pitfalls Specific to Accessibility Data

ATAC mitochondrial reads: ENCODE pipeline removes these, but verify in QC metrics. High mitochondrial fraction indicates poor nuclear chromatin enrichment.
Tn5 sequence bias: ATAC-seq Tn5 has mild sequence preference. For union maps this is acceptable — bias affects peak intensity, not presence.
DNase hypersensitivity saturation: Deeply sequenced DNase-seq detects more sites. Shallowly sequenced samples contribute fewer peaks but are not wrong — they just miss weaker sites.
Promoter enrichment: Both assays are enriched at promoters. When comparing accessibility across tissues, note that promoter accessibility is largely constitutive while enhancer accessibility is tissue-specific.
Cell-type heterogeneity in tissue samples: Bulk ATAC/DNase from tissue captures accessibility across ALL cell types. A peak may represent a minor cell population. This is correct for a tissue-level map but important to note.
Do NOT mix assemblies: All files must be GRCh38 or all hg19. Use
```
encode_compare_experiments
```
to verify.
Peak summits lost after merge: NarrowPeak column 10 (summit offset) is discarded by
```
bedtools merge
```
. If you need summits for motif analysis, extract them before merging and map back afterward.
CUT&RUN/CUT&Tag accessibility data: If ENCODE adds CUT&RUN-based accessibility data in the future, apply the CUT&RUN suspect list (Nordin et al. 2023, Genome Biology) in addition to the ENCODE blacklist.

Walkthrough: Building a Pan-Donor Accessibility Map for Brain Cortex

Goal: Merge ATAC-seq peaks from 4 brain cortex experiments into a union accessibility map. Context: User needs comprehensive open chromatin regions for regulatory element discovery.

Step 1: Search for brain ATAC-seq experiments

encode_search_experiments(
  assay_title="ATAC-seq",
  organ="brain"
)

Expected output:

{
  "total": 24,
  "experiments": [
    {"accession": "ENCSR001BRN", "biosample_summary": "brain cortex tissue male adult (53 years)"},
    {"accession": "ENCSR002BRN", "biosample_summary": "brain cortex tissue female adult (49 years)"}
  ]
}

Step 2: Download narrowPeak files

encode_search_files(
  assay_title="ATAC-seq",
  organ="brain",
  file_format="bed",
  output_type="IDR thresholded peaks",
  assembly="GRCh38"
)

Expected output:

{
  "total": 8,
  "files": [{"accession": "ENCFF001ATQ", "output_type": "IDR thresholded peaks", "assembly": "GRCh38"}]
}

Step 3: Merge into union peak set

cat *.narrowPeak | sort -k1,1 -k2,2n | bedtools merge -i - -c 4,5 -o count,mean > union_atac_brain.bed

Interpretation: Union peaks represent all genomic positions where chromatin is accessible in brain cortex. Peaks found in all 4 donors are constitutive regulatory elements.

Code Examples

1. Find accessibility data for aggregation

encode_get_facets(organ="pancreas", assay_title="ATAC-seq")

Expected output:

{
  "facets": {"biosample_term_name": {"pancreas": 4, "pancreatic islet": 3}, "status": {"released": 6}}
}

Integration

This skill produces...	Feed into...	Using tool/skill
Union open chromatin map (BED)	Enhancer identification	regulatory-elements skill
Accessible regions for motif analysis	TF motif discovery	motif-analysis skill
Tissue accessibility catalog	Cross-tissue comparison	compare-biosamples skill
Open chromatin at variant sites	Variant functional annotation	variant-annotation skill
Accessible peak coordinates	Visualization signal anchors	visualization-workflow skill

Related Skills

histone-aggregation: Same union approach for histone ChIP-seq narrowPeak data
methylation-aggregation: Different approach (averaging) for continuous methylation signal; HMRs + accessibility peaks mark active regulatory elements
hic-aggregation: Union approach for BEDPE chromatin loops; loops often anchor at accessible regions
regulatory-elements: Use union accessibility maps to define active regulatory elements with histone mark combinations
motif-analysis: Find enriched TF motifs in accessible regions using HOMER and MEME
pipeline-atacseq: Process raw ATAC-seq data through the full ENCODE-aligned pipeline
batch-analysis: Batch processing workflows for systematic accessibility aggregation
publication-trust: Verify literature claims backing analytical decisions

Presenting Results

Present merged accessibility regions as: chr | start | end | assay_type | sample_count. Show ATAC vs DNase contribution. Suggest: "Would you like to run motif analysis on these accessible regions?"

Encode-toolkit accessibility-aggregation

When to Use

Aggregate Chromatin Accessibility Peaks Across Studies

Scientific Rationale

ATAC-seq vs DNase-seq

Literature Support

Step 1: Find All Available Accessibility Data

Combining ATAC + DNase?

Step 2: Quality-Gate Each Experiment

ATAC-seq Quality Checks

DNase-seq Quality Checks

Step 3: Download Peak Files

Step 4: Per-Sample Noise Filtering

4a. ENCODE Blocklist Filtering (Amemiya et al. 2019)

4b. SignalValue Filtering (Perna et al. 2024)

4c. ATAC-specific: Remove Sub-nucleosomal Artifacts (optional)

Step 5: Union Merge

If Tracking Assay Source

Step 6: Confidence Annotation

Step 7: Log Provenance

Step 8: Summary Statistics

Pitfalls Specific to Accessibility Data

Walkthrough: Building a Pan-Donor Accessibility Map for Brain Cortex

Step 1: Search for brain ATAC-seq experiments

Step 2: Download narrowPeak files

Step 3: Merge into union peak set

Code Examples

1. Find accessibility data for aggregation

Integration

Related Skills

Presenting Results

For the request: "$ARGUMENTS"