Encode-toolkit accessibility-aggregation
Build comprehensive chromatin accessibility maps by aggregating ATAC-seq and DNase-seq narrowPeak data across multiple ENCODE experiments, donors, and labs. Use when the user wants to answer "where is chromatin accessible in my tissue?" by combining peak calls into a union peak set. Handles cross-lab variation, ATAC vs DNase platform differences, and ENCODE blocklist filtering.
git clone https://github.com/ammawla/encode-toolkit
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/accessibility-aggregation" ~/.claude/skills/ammawla-encode-toolkit-accessibility-aggregation-ce003b && rm -rf "$T"
skills/accessibility-aggregation/SKILL.mdWhen to Use
- User wants to combine ATAC-seq or DNase-seq peaks across multiple experiments for a tissue
- User asks "where is chromatin accessible in my tissue?" or "build an open chromatin map"
- User needs to merge accessibility data from different labs, donors, or platforms (ATAC vs DNase)
- User wants a comprehensive set of open chromatin regions for regulatory element discovery
- Example queries: "aggregate ATAC-seq peaks for pancreas", "combine DNase-seq across donors", "find all accessible regions in liver"
Aggregate Chromatin Accessibility Peaks Across Studies
Build a comprehensive map of open chromatin for a tissue/cell type by merging ATAC-seq and/or DNase-seq narrowPeak files from multiple ENCODE experiments.
Scientific Rationale
The question: "Where is chromatin accessible in my tissue?"
Like histone marks, chromatin accessibility is a detection question. An open chromatin region detected in one donor but not another is still a real accessible site — individual variation, sequencing depth, and technical factors explain absence. We want the union of all detections.
ATAC-seq vs DNase-seq
Both measure open chromatin but with different biases:
| Property | ATAC-seq | DNase-seq |
|---|---|---|
| Method | Tn5 transposase insertion | DNase I hypersensitivity |
| Input required | ~50K cells | ~1M cells |
| Resolution | High | High |
| GC bias | Moderate (Tn5 preference) | Low |
| Mitochondrial reads | High (filter needed) | None |
| ENCODE availability | Newer experiments | Extensive historical catalog |
| Comparability | Generally comparable at open regions |
Literature Support
- Corces et al. 2017 (Nature Methods, 733 citations): Established that ATAC-seq and DNase-seq identify largely overlapping accessible regions, with ATAC capturing ~75% of DNase sites. Both are valid for union maps.
- ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Comprehensive set of problematic genomic regions to filter. Essential for all functional genomics analyses. DOI
- F-Seq2 (Zhao & Boyle 2020, NAR Genomics): Improved peak caller for DNase-seq and ATAC-seq with proper test statistics for IDR compatibility.
- ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Integrated accessibility data with histone marks across tissues for chromatin state annotation.
Recommendation: If combining ATAC-seq and DNase-seq peaks, treat them as equivalent signal sources for accessibility. The union is appropriate because both detect the same biological signal (open chromatin) through different enzymatic mechanisms.
Step 1: Find All Available Accessibility Data
# ATAC-seq encode_search_experiments( assay_title="ATAC-seq", organ="pancreas", biosample_type="tissue", limit=100 ) # DNase-seq encode_search_experiments( assay_title="DNase-seq", organ="pancreas", biosample_type="tissue", limit=100 )
Present a summary to the user:
- Total ATAC-seq experiments
- Total DNase-seq experiments
- Labs represented
- Whether to use one or both assay types
Combining ATAC + DNase?
Ask the user:
- Same assay only (purest comparison, no cross-platform effects)
- Both assays combined (maximum coverage, slight platform variation)
For a comprehensive accessibility catalog, combining both is scientifically justified.
Step 2: Quality-Gate Each Experiment
encode_get_experiment(accession="ENCSR...")
ATAC-seq Quality Checks
- Audit status: no ERROR flags
- Has IDR thresholded peaks
- Low mitochondrial read fraction (ENCODE pipeline removes these)
- Good TSS enrichment score
- Nucleosome-free fragment enrichment visible
DNase-seq Quality Checks
- Audit status: no ERROR flags
- Has Hotspot2 peaks or IDR thresholded peaks
- Adequate sequencing depth (20M+ mapped reads)
- Signal-to-noise ratio
Track all included experiments:
encode_track_experiment(accession="ENCSR...")
Step 3: Download Peak Files
For each experiment:
# ATAC-seq — IDR thresholded peaks encode_list_files( experiment_accession="ENCSR...", file_format="bed", output_type="IDR thresholded peaks", assembly="GRCh38" ) # DNase-seq — may use different output types encode_list_files( experiment_accession="ENCSR...", file_format="bed", output_type="peaks", assembly="GRCh38" )
Prefer
preferred_default=True files.
encode_download_files( file_accessions=["ENCFF...", ...], download_dir="/path/to/data/accessibility", organize_by="flat" )
Step 4: Per-Sample Noise Filtering
4a. ENCODE Blocklist Filtering (Amemiya et al. 2019)
# Download from: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz bedtools intersect -a sample.narrowPeak -b hg38-blacklist.v2.bed -v > sample.filtered.narrowPeak
4b. SignalValue Filtering (Perna et al. 2024)
Same logic as histone aggregation — filter per-sample to top 75% by signalValue (column 7):
# Per-sample: remove bottom 25% by signalValue (true distribution quantile) TOTAL=$(wc -l < sample.filtered.narrowPeak) LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}') THRESHOLD=$(sort -k7,7n sample.filtered.narrowPeak | awk -v line="$LINE_25" 'NR==line{print $7}') awk -v t="$THRESHOLD" '$7 >= t' sample.filtered.narrowPeak > sample.qfiltered.narrowPeak
4c. ATAC-specific: Remove Sub-nucleosomal Artifacts (optional)
For ATAC-seq, very narrow peaks (<50bp) can be Tn5 insertion artifacts:
awk '($3-$2) >= 50' sample.qfiltered.narrowPeak > sample.clean.narrowPeak
Step 5: Union Merge
Accessibility peaks are narrow/point-source (like H3K4me3). Use default merge (overlap only, no gap tolerance).
CRITICAL: Tag peaks by sample before concatenation to count unique SAMPLES, not overlapping peaks:
# Tag each sample's peaks with a unique sample ID awk -v sid="atac_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' atac_sample1.qfiltered.narrowPeak > atac_s1.tagged.bed awk -v sid="dnase_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' dnase_sample1.qfiltered.narrowPeak > dnase_s1.tagged.bed # ... repeat for all samples # Concatenate all tagged peaks (ATAC + DNase combined or separate) cat *.tagged.bed > all_accessibility.bed # Sort bedtools sort -i all_accessibility.bed > all_accessibility.sorted.bed # Union merge — count UNIQUE SAMPLES (not peaks) bedtools merge \ -i all_accessibility.sorted.bed \ -c 4,7,9 \ -o count_distinct,max,max \ > union_accessible_regions.bed # Columns: chr, start, end, n_unique_samples, max_signalValue, max_qValue
If Tracking Assay Source
To annotate whether peaks came from ATAC, DNase, or both:
# Add assay tag to each peak before concatenation awk '{print $0"\tATAC"}' atac_peaks.bed > tagged.bed awk '{print $0"\tDNase"}' dnase_peaks.bed >> tagged.bed # After merge, use bedtools multiIntersect to track sources bedtools multiIntersect \ -i atac_sample1.bed atac_sample2.bed dnase_sample1.bed ... \ -header \ -names ATAC_1 ATAC_2 DNase_1 ... \ > multi_intersect.bed
Step 6: Confidence Annotation
Same logic as histone aggregation. Given N total samples:
| Confidence | Criteria | Interpretation |
|---|---|---|
| High | ≥50% of samples | Constitutive accessible region |
| Supported | 2+ samples | Likely real, some variation |
| Singleton | 1 sample only | Keep — may be individual-specific or condition-specific |
awk -v N=8 '{ if ($4 >= N*0.5) conf="HIGH"; else if ($4 >= 2) conf="SUPPORTED"; else conf="SINGLETON"; print $0"\t"conf"\t"$4"/"N }' union_accessible_regions.bed > union_accessible_regions.annotated.bed
Step 7: Log Provenance
encode_log_derived_file( file_path="/path/to/union_accessible_regions.annotated.bed", source_accessions=["ENCSR...", "ENCSR...", ...], description="Union chromatin accessibility peaks (ATAC-seq + DNase-seq) across N pancreas samples", file_type="aggregated_accessibility", tool_used="bedtools merge v2.31.0", parameters="blocklist filtered, signalValue >= 25th pctl per sample, ATAC min width 50bp, bedtools merge -d 0" )
Step 8: Summary Statistics
Report to the user:
- Total input experiments: N (ATAC: X, DNase: Y)
- Experiments passing QC: M
- Total peaks before merge: X
- Union peaks after merge: Y
- High-confidence regions: Z (≥50% support)
- Supported regions: W (2+ support)
- Singleton regions: V (1 sample only)
- Genome coverage: bp covered / total genome
- Overlap between ATAC-only and DNase-only peaks (if both assays used)
Pitfalls Specific to Accessibility Data
-
ATAC mitochondrial reads: ENCODE pipeline removes these, but verify in QC metrics. High mitochondrial fraction indicates poor nuclear chromatin enrichment.
-
Tn5 sequence bias: ATAC-seq Tn5 has mild sequence preference. For union maps this is acceptable — bias affects peak intensity, not presence.
-
DNase hypersensitivity saturation: Deeply sequenced DNase-seq detects more sites. Shallowly sequenced samples contribute fewer peaks but are not wrong — they just miss weaker sites.
-
Promoter enrichment: Both assays are enriched at promoters. When comparing accessibility across tissues, note that promoter accessibility is largely constitutive while enhancer accessibility is tissue-specific.
-
Cell-type heterogeneity in tissue samples: Bulk ATAC/DNase from tissue captures accessibility across ALL cell types. A peak may represent a minor cell population. This is correct for a tissue-level map but important to note.
-
Do NOT mix assemblies: All files must be GRCh38 or all hg19. Use
to verify.encode_compare_experiments -
Peak summits lost after merge: NarrowPeak column 10 (summit offset) is discarded by
. If you need summits for motif analysis, extract them before merging and map back afterward.bedtools merge -
CUT&RUN/CUT&Tag accessibility data: If ENCODE adds CUT&RUN-based accessibility data in the future, apply the CUT&RUN suspect list (Nordin et al. 2023, Genome Biology) in addition to the ENCODE blacklist.
Walkthrough: Building a Pan-Donor Accessibility Map for Brain Cortex
Goal: Merge ATAC-seq peaks from 4 brain cortex experiments into a union accessibility map. Context: User needs comprehensive open chromatin regions for regulatory element discovery.
Step 1: Search for brain ATAC-seq experiments
encode_search_experiments( assay_title="ATAC-seq", organ="brain" )
Expected output:
{ "total": 24, "experiments": [ {"accession": "ENCSR001BRN", "biosample_summary": "brain cortex tissue male adult (53 years)"}, {"accession": "ENCSR002BRN", "biosample_summary": "brain cortex tissue female adult (49 years)"} ] }
Step 2: Download narrowPeak files
encode_search_files( assay_title="ATAC-seq", organ="brain", file_format="bed", output_type="IDR thresholded peaks", assembly="GRCh38" )
Expected output:
{ "total": 8, "files": [{"accession": "ENCFF001ATQ", "output_type": "IDR thresholded peaks", "assembly": "GRCh38"}] }
Step 3: Merge into union peak set
cat *.narrowPeak | sort -k1,1 -k2,2n | bedtools merge -i - -c 4,5 -o count,mean > union_atac_brain.bed
Interpretation: Union peaks represent all genomic positions where chromatin is accessible in brain cortex. Peaks found in all 4 donors are constitutive regulatory elements.
Code Examples
1. Find accessibility data for aggregation
encode_get_facets(organ="pancreas", assay_title="ATAC-seq")
Expected output:
{ "facets": {"biosample_term_name": {"pancreas": 4, "pancreatic islet": 3}, "status": {"released": 6}} }
Integration
| This skill produces... | Feed into... | Using tool/skill |
|---|---|---|
| Union open chromatin map (BED) | Enhancer identification | regulatory-elements skill |
| Accessible regions for motif analysis | TF motif discovery | motif-analysis skill |
| Tissue accessibility catalog | Cross-tissue comparison | compare-biosamples skill |
| Open chromatin at variant sites | Variant functional annotation | variant-annotation skill |
| Accessible peak coordinates | Visualization signal anchors | visualization-workflow skill |
Related Skills
- histone-aggregation: Same union approach for histone ChIP-seq narrowPeak data
- methylation-aggregation: Different approach (averaging) for continuous methylation signal; HMRs + accessibility peaks mark active regulatory elements
- hic-aggregation: Union approach for BEDPE chromatin loops; loops often anchor at accessible regions
- regulatory-elements: Use union accessibility maps to define active regulatory elements with histone mark combinations
- motif-analysis: Find enriched TF motifs in accessible regions using HOMER and MEME
- pipeline-atacseq: Process raw ATAC-seq data through the full ENCODE-aligned pipeline
- batch-analysis: Batch processing workflows for systematic accessibility aggregation
- publication-trust: Verify literature claims backing analytical decisions
Presenting Results
- Present merged accessibility regions as: chr | start | end | assay_type | sample_count. Show ATAC vs DNase contribution. Suggest: "Would you like to run motif analysis on these accessible regions?"