Encode-toolkit hic-aggregation
Build comprehensive chromatin contact maps by aggregating Hi-C loop calls (BEDPE) across multiple ENCODE experiments, donors, and labs. Use when the user wants to answer "what regions are in 3D contact in my tissue?" by creating a union catalog of chromatin loops. Handles resolution-aware anchor matching, cross-lab variation, and Hi-C-specific quality metrics.
git clone https://github.com/ammawla/encode-toolkit
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugin/skills/hic-aggregation" ~/.claude/skills/ammawla-encode-toolkit-hic-aggregation && rm -rf "$T"
plugin/skills/hic-aggregation/SKILL.mdAggregate Hi-C Chromatin Contacts Across Studies
When to Use
- User wants to build a comprehensive catalog of chromatin loops from multiple Hi-C experiments
- User asks "what regions are in 3D contact in my tissue?" or "aggregate loop calls across donors"
- User needs a union catalog of BEDPE loops with resolution-aware anchor matching
- User wants to identify high-confidence loops supported by multiple experiments
- Example queries: "aggregate Hi-C loops for K562", "combine chromatin contacts across labs", "find consensus TAD boundaries in liver"
Build a comprehensive catalog of chromatin loops for a tissue/cell type by merging BEDPE loop calls from multiple ENCODE Hi-C experiments.
Scientific Rationale
The question: "What regions are in 3D physical contact in my tissue?"
Like histone marks and accessibility, chromatin loops are a detection question. If a loop between Region A and Region B is detected in one donor but not another, the contact is still real — individual variation, sequencing depth, and computational resolution explain absence. We want the union of all detected contacts.
Key Concepts
Hi-C data measures pairwise chromatin interactions genome-wide. After processing:
- Contact matrix (
file): Genome-wide interaction frequencies at multiple resolutions.hic - Loop calls (BEDPE): Statistically significant point interactions (loops) identified by algorithms like HICCUPS or Juicer
- TAD boundaries: Topologically associating domain boundaries
- Compartments: A/B compartment assignments
BEDPE format (Paired-End BED):
chr1 start1 end1 chr2 start2 end2 name score strand1 strand2
Each row represents a contact between two genomic anchor regions.
Literature Support
- Loop Catalog (Reyna et al. 2025, Nucleic Acids Research): Created a union catalog of 4.19M unique loops across 1,089 Hi-C datasets. Demonstrated that union approach captures tissue-specific and constitutive loops. Used resolution-aware merging at 5kb, 10kb, and 25kb bins.
- AQuA Tools (Chakraborty et al. 2025): Toolkit for BEDPE intersection, union, and annotation. Handles paired-region arithmetic.
- mariner (Flores et al. 2024, Bioinformatics): R/Bioconductor package for BEDPE manipulation including merging loops across experiments with configurable anchor tolerance.
- ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Integrated Hi-C data across tissues to define regulatory loops connecting enhancers to promoters.
- ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Problematic genomic regions to filter from loop anchors. DOI
- Mustache (Roayaei Ardakany et al. 2020, Genome Biology, 165 citations): Multi-scale loop caller that recovers more validated loops than HICCUPS. Different callers produce discordant loop sets.
- Wolff et al. 2022 (GigaScience): Benchmark showing loop callers intersect by ~50% at most — critical context for why union approach is necessary.
Step 1: Find All Available Hi-C Data
encode_search_experiments( assay_title="Hi-C", organ="pancreas", # user's tissue of interest biosample_type="tissue", limit=100 )
Present a summary to the user:
- Total Hi-C experiments
- Labs represented
- Unique donors/biosamples
- Resolution(s) available (check experiment metadata)
Use
encode_get_facets to check availability:
encode_get_facets(assay_title="Hi-C", organ="pancreas")
Note: Hi-C data is computationally expensive to produce, so there are typically fewer experiments per tissue than ChIP-seq or ATAC-seq. Even 2-3 experiments can be valuable for union catalogs.
Step 2: Quality-Gate Each Experiment
encode_get_experiment(accession="ENCSR...")
Hi-C Quality Checks
- Audit status: no ERROR flags
- Sequencing depth: 400M+ valid read pairs for loop calling (ENCODE standard)
- Cis/trans ratio: >60% cis contacts expected (low cis suggests noisy library)
- Hi-C-specific QC: Library complexity, PCR duplicate rate
- Has loop calls (BEDPE output) — not all Hi-C experiments have called loops
- Resolution: at least 5-10kb resolution for loop detection
Include if:
- Has BEDPE loop calls at consistent resolution
- Passes ENCODE audit (no ERROR flags)
- Adequate sequencing depth for loop resolution
Exclude if:
- ERROR audit flags
- Only contact matrices without loop calls
- Very low sequencing depth (<200M valid pairs — insufficient for loop calling)
Track all included experiments:
encode_track_experiment(accession="ENCSR...")
Step 3: Download Loop Call Files
For each experiment, get BEDPE loop calls:
# Search for loop/interaction files encode_list_files( experiment_accession="ENCSR...", file_format="bedpe", assembly="GRCh38" ) # Also check for BED-formatted loop files encode_list_files( experiment_accession="ENCSR...", output_type="chromatin interactions", assembly="GRCh38" ) # Or contact domains encode_list_files( experiment_accession="ENCSR...", output_type="contact domains", assembly="GRCh38" )
File selection priority:
- Chromatin interactions (loop calls from HICCUPS or similar)
- Contact domains (TADs — different analysis, handle separately)
- Replicated loops (if available)
Prefer
preferred_default=True files when available.
encode_download_files( file_accessions=["ENCFF...", ...], download_dir="/path/to/data/hic_loops", organize_by="flat" )
Step 4: Understanding Hi-C Resolution and Anchors
Critical: Resolution-Aware Processing
Hi-C loop anchors are binned regions, not precise positions. The resolution determines anchor size:
| Resolution | Anchor Width | Best For | Typical Loop Count |
|---|---|---|---|
| 5 kb | 5,000 bp | Fine-scale promoter-enhancer loops | More loops |
| 10 kb | 10,000 bp | Standard analysis | Moderate |
| 25 kb | 25,000 bp | Large-scale domain contacts | Fewer loops |
All loops being merged must be at the same resolution, or anchors must be harmonized to a common resolution.
Harmonizing Resolution
If experiments have loops called at different resolutions:
# Expand 5kb anchors to 10kb resolution awk -v res=10000 'BEGIN{OFS="\t"} { # Bin anchor 1 bin1_start = int($2/res) * res bin1_end = bin1_start + res # Bin anchor 2 bin2_start = int($5/res) * res bin2_end = bin2_start + res print $1, bin1_start, bin1_end, $4, bin2_start, bin2_end, $7, $8, $9, $10 }' fine_res_loops.bedpe > harmonized_loops.bedpe
Step 5: Per-Sample Filtering
5a. ENCODE Blocklist Filtering (Amemiya et al. 2019)
Remove loops with anchors in artifact-prone regions (download from https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz):
# Filter loops where EITHER anchor overlaps a blocklist region # First, extract anchor 1 and anchor 2 as separate BED files awk 'BEGIN{OFS="\t"} {print $1,$2,$3,NR}' sample.bedpe > anchors1.bed awk 'BEGIN{OFS="\t"} {print $4,$5,$6,NR}' sample.bedpe > anchors2.bed # Find anchor rows NOT in blocklist bedtools intersect -a anchors1.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_1.txt bedtools intersect -a anchors2.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_2.txt # Keep only rows where BOTH anchors pass comm -12 <(sort clean_rows_1.txt) <(sort clean_rows_2.txt) > clean_rows.txt awk 'NR==FNR{a[$1];next} FNR in a' clean_rows.txt sample.bedpe > sample.filtered.bedpe
5b. Score Filtering
Filter by interaction score/significance:
# If BEDPE has a score column (col 8), filter to significant interactions # Keep top 75% by score (true distribution quantile, not range-based) TOTAL=$(wc -l < sample.filtered.bedpe) LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}') THRESHOLD=$(sort -k8,8n sample.filtered.bedpe | awk -v line="$LINE_25" 'NR==line{print $8}') awk -v t="$THRESHOLD" '$8 >= t' sample.filtered.bedpe > sample.qfiltered.bedpe
5c. Remove Self-Ligation Artifacts
Loops where both anchors are very close are likely artifacts:
# Remove loops where anchors are on same chromosome and < 20kb apart awk '{ if ($1 != $4) print $0; # inter-chromosomal: keep (rare but real) else if (($5 - $3) >= 20000) print $0; # > 20kb apart: keep }' sample.qfiltered.bedpe > sample.clean.bedpe
Step 6: Union Merge of Loops
The Paired-Region Matching Problem
Unlike peaks (single regions), loops are pairs of regions. Two loops match if both anchors overlap:
Loop 1: [anchor1A]--------[anchor1B] Loop 2: [anchor2A]------[anchor2B]
These should merge if anchor1A overlaps anchor2A AND anchor1B overlaps anchor2B.
Method A: bedtools pairToPair (Recommended for simple union)
# Concatenate all filtered loops cat sample1.clean.bedpe sample2.clean.bedpe ... > all_loops.bedpe # Sort by anchor 1 coordinates sort -k1,1 -k2,2n -k4,4 -k5,5n all_loops.bedpe > all_loops.sorted.bedpe # Use a custom merge approach: # 1. Bin anchors to resolution, creating a loop ID # 2. Group by loop ID # 3. Count support awk -v res=10000 'BEGIN{OFS="\t"} { # Create binned anchor coordinates as loop identifier a1_bin = $1 ":" int($2/res)*res a2_bin = $4 ":" int($5/res)*res # Canonical order (smaller coordinate first) to handle orientation if (a1_bin < a2_bin) loop_id = a1_bin "-" a2_bin else loop_id = a2_bin "-" a1_bin print loop_id, $0 }' all_loops.sorted.bedpe | \ sort -k1,1 | \ awk 'BEGIN{OFS="\t"} { if ($1 != prev_id) { if (NR > 1) print chr1, start1, end1, chr2, start2, end2, count, max_score prev_id = $1 chr1=$2; start1=$3; end1=$4; chr2=$5; start2=$6; end2=$7 count = 1; max_score = $9 } else { count++ if ($9 > max_score) max_score = $9 # Expand anchors to encompass all overlapping calls if ($3 < start1) start1 = $3 if ($4 > end1) end1 = $4 if ($6 < start2) start2 = $6 if ($7 > end2) end2 = $7 } } END { print chr1, start1, end1, chr2, start2, end2, count, max_score }' > union_loops.bedpe
Method B: Resolution-Binned Approach (Loop Catalog method)
Following the Loop Catalog (Reyna et al. 2025) approach:
# Bin all loop anchors to a fixed resolution awk -v res=10000 'BEGIN{OFS="\t"} { a1_start = int($2/res) * res a1_end = a1_start + res a2_start = int($5/res) * res a2_end = a2_start + res # Canonical ordering if ($1 < $4 || ($1 == $4 && a1_start <= a2_start)) print $1, a1_start, a1_end, $4, a2_start, a2_end else print $4, a2_start, a2_end, $1, a1_start, a1_end }' all_loops.sorted.bedpe | \ sort -u | \ sort -k1,1 -k2,2n -k4,4 -k5,5n | \ uniq -c | \ awk 'BEGIN{OFS="\t"} {print $2,$3,$4,$5,$6,$7,$1}' > union_loops_binned.bedpe # Columns: chr1, start1, end1, chr2, start2, end2, n_supporting_samples
Method C: Using Specialized Tools
mariner (R/Bioconductor):
library(mariner) # Read BEDPE files as GInteractions loops <- lapply(bedpe_files, read.table) # Convert to GInteractions and merge gi <- as_ginteractions(loops) merged <- mergePairs(gi, radius = 10000) # 10kb tolerance
AQuA Tools (Python):
# BEDPE union with anchor overlap tolerance aqua bedpe-union -i sample1.bedpe sample2.bedpe -o union.bedpe --slop 5000
Step 7: Confidence Annotation
Given N total experiments:
| Confidence | Criteria | Interpretation |
|---|---|---|
| High | Detected in >=50% of samples | Constitutive loop, present across individuals |
| Supported | Detected in 2+ samples | Likely real, some variation |
| Singleton | Detected in 1 sample only | May be individual-specific or depth-dependent |
awk -v N=4 '{ if ($7 >= N*0.5) conf="HIGH"; else if ($7 >= 2) conf="SUPPORTED"; else conf="SINGLETON"; print $0"\t"conf"\t"$7"/"N }' union_loops_binned.bedpe > union_loops.annotated.bedpe
Context for singletons: Hi-C loop detection is very sensitive to sequencing depth. Many singletons may simply be under-powered in other samples rather than biologically absent. The Loop Catalog found that a union approach captures ~3x more loops than any individual experiment.
Step 8: Separate Analysis for TADs and Compartments
TAD boundaries and A/B compartments require different aggregation than loops:
TAD Boundaries
TAD boundaries are single genomic positions. Aggregate like narrow peaks:
# Extract TAD boundary BED from contact domain files # Each boundary is a narrow region cat tad_boundaries_sample*.bed | \ bedtools sort -i - | \ bedtools merge -i - -d 40000 -c 1 -o count > union_tad_boundaries.bed # 40kb gap tolerance because TAD boundaries are resolution-dependent
A/B Compartments
Compartment calls (eigenvector sign at each bin) should be aggregated by majority vote:
# For each resolution bin, assign A or B based on majority of samples # This is more complex and typically done in R/Python
Step 9: Log Provenance
encode_log_derived_file( file_path="/path/to/union_loops.annotated.bedpe", source_accessions=["ENCSR...", "ENCSR...", ...], description="Union chromatin loops across N pancreas Hi-C experiments", file_type="aggregated_loops", tool_used="bedtools + custom merge at 10kb resolution", parameters="blocklist filtered, score >= 25th pctl, self-ligation >= 20kb removed, 10kb resolution binning" )
Step 10: Summary Statistics
Report to the user:
- Total input experiments: N
- Experiments passing QC: M
- Resolution used: Xkb
- Total loops before merge: X
- Union loops after merge: Y
- High-confidence loops: Z (≥50% support)
- Supported loops: W (2+ support)
- Singleton loops: V (1 sample only)
- Distance distribution: median and range of loop sizes (anchor-to-anchor)
- Inter-chromosomal loops: count (expect very few)
Pitfalls Specific to Hi-C Data
-
Resolution mismatch: Loop calls at 5kb vs 25kb resolution will have very different anchor sizes. Always harmonize to a common resolution before merging.
-
Sequencing depth sensitivity: Loop calling requires deep sequencing (400M+ valid pairs). Shallowly sequenced experiments will call far fewer loops — this is under-detection, not absence.
-
Algorithm differences are LARGE: Wolff et al. 2022 (GigaScience) found that HICCUPS, Mustache, Fit-Hi-C, and HiCExplorer loop callers intersect by ~50% at most. Mustache tends to recover more validated loops (Roayaei Ardakany et al. 2020). If mixing callers, note this in provenance — and this discordance is itself a reason to prefer the union approach.
-
Orientation matters: BEDPE anchors should be canonically ordered (anchor1 < anchor2 by genomic coordinate) before merging to avoid duplicate counting.
-
Inter-chromosomal contacts: These are rare but real. Handle separately — they cannot be distance-filtered.
-
Distance distribution: Most loops are 100kb-2Mb. Very short-range contacts (<20kb) are often noise from undigested chromatin. Very long-range (>10Mb) are rare.
-
Do NOT mix assemblies: All files must be GRCh38 or all hg19. Hi-C resolution binning makes liftOver of loops particularly error-prone.
-
TADs vs loops: These are different features. TADs are domains (regions), loops are point contacts (pairs). Do not mix them in the same union.
-
Micro-C as complement: Micro-C achieves higher resolution than Hi-C and can detect sub-TAD loops. Treat Micro-C loops as compatible with Hi-C loops in a union (Mustache works on both).
Walkthrough: Building a Cross-Tissue Loop Catalog for the MYC Locus
Goal: Aggregate Hi-C chromatin loops across tissues to identify conserved and tissue-specific 3D contacts at the MYC gene locus. Context: Cancer research — MYC is regulated by distal enhancers via chromatin looping.
Step 1: Find Hi-C experiments across tissues
encode_search_experiments(assay_title="Hi-C", organism="Homo sapiens", limit=50)
Expected output:
{ "total": 89, "results": [ {"accession": "ENCSR000AKA", "assay_title": "Hi-C", "biosample_summary": "GM12878", "status": "released"}, {"accession": "ENCSR489OCU", "assay_title": "Hi-C", "biosample_summary": "K562", "status": "released"}, {"accession": "ENCSR382RFU", "assay_title": "Hi-C", "biosample_summary": "liver", "status": "released"} ] }
Interpretation: 89 Hi-C experiments available. Select 5–10 spanning diverse tissue types for cross-tissue comparison.
Step 2: List loop files for each experiment
encode_list_files(accession="ENCSR000AKA", file_format="bedpe", assembly="GRCh38")
Expected output:
{ "files": [ {"accession": "ENCFF001ABC", "output_type": "contact domains", "file_format": "bedpe", "file_size_mb": 2.4}, {"accession": "ENCFF002DEF", "output_type": "chromatin interactions", "file_format": "bedpe", "file_size_mb": 1.8} ] }
Interpretation: Use "chromatin interactions" files for loop aggregation. Contact domains are TADs, not loops.
Step 3: Download loop files
encode_download_files(accessions=["ENCFF002DEF", "ENCFF003GHI", "ENCFF004JKL"], download_dir="/data/hic_loops")
Expected output:
{ "downloaded": 3, "total_size_mb": 5.6, "md5_verified": true, "files": ["/data/hic_loops/ENCFF002DEF.bedpe", "/data/hic_loops/ENCFF003GHI.bedpe", "/data/hic_loops/ENCFF004JKL.bedpe"] }
Step 4: Aggregate loops with resolution-aware anchor matching
Apply union merge across tissues:
- Expand loop anchors by ±resolution (e.g., ±5kb for 5kb resolution data)
- Merge overlapping anchors using bedtools pairToPair
- Assign tissue support counts to each union loop
- Filter: require ≥2 tissue support for conserved loops
Step 5: Filter to MYC locus
# MYC locus: chr8:127,700,000-128,000,000 awk '$1=="chr8" && $2>=127700000 && $3<=128000000' union_loops.bedpe > myc_loops.bedpe
Interpretation: Loops anchored at the MYC promoter connecting to distal enhancers. Conserved loops (≥3 tissues) likely represent fundamental regulatory architecture; tissue-specific loops may drive context-dependent MYC activation.
Integration with downstream skills
- Feed loop anchors into → peak-annotation for gene assignment at anchor regions
- Overlay with → histone-aggregation H3K27ac peaks to identify active enhancer-promoter loops
- Cross-reference loop-disrupting variants via → variant-annotation
- Visualize in → ucsc-browser as interaction tracks
Code Examples
1. Survey available Hi-C data by tissue
encode_get_facets(assay_title="Hi-C", facet_field="organ", organism="Homo sapiens")
Expected output:
{ "facets": { "organ": {"brain": 24, "heart": 12, "liver": 8, "lung": 6, "kidney": 4, "blood": 15} } }
2. Get details for a specific Hi-C experiment
encode_get_experiment(accession="ENCSR000AKA")
Expected output:
{ "accession": "ENCSR000AKA", "assay_title": "Hi-C", "biosample_summary": "GM12878", "replicates": 2, "status": "released", "lab": "/labs/erez-lieberman-aiden/", "audit": {"WARNING": 1, "ERROR": 0} }
3. Compare loop sets between two cell types
encode_compare_experiments(accession_1="ENCSR000AKA", accession_2="ENCSR489OCU")
Expected output:
{ "comparison": { "shared": {"assay": "Hi-C", "organism": "Homo sapiens", "assembly": "GRCh38"}, "differences": { "biosample": ["GM12878", "K562"], "lab": ["/labs/erez-lieberman-aiden/", "/labs/erez-lieberman-aiden/"] } } }
Integration
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Union loop catalog (BEDPE) | peak-annotation | Assign genes to loop anchors |
| Conserved loop coordinates | histone-aggregation | Overlay H3K27ac at anchors to find active enhancer-promoter loops |
| Tissue-specific loops | accessibility-aggregation | Check if loop anchors overlap open chromatin |
| Loop anchor BED intervals | variant-annotation | Find GWAS/clinical variants disrupting loop anchors |
| Loop anchor coordinates | liftover-coordinates | Convert hg19 loops to GRCh38 |
| Aggregated loop statistics | visualization-workflow | Generate loop frequency heatmaps |
| Loop-gene assignments | disease-research | Connect loop disruptions to disease phenotypes |
Related Skills
- histone-aggregation: Loop anchors often overlap with H3K27ac/H3K4me1 peaks — integrate with histone union sets to annotate loop function
- accessibility-aggregation: Loop anchors frequently coincide with accessible chromatin — validate loops by requiring anchor accessibility
- regulatory-elements: Use loops to connect distal enhancers (H3K27ac) to target promoters (H3K4me3)
- epigenome-profiling: Loops add 3D context to 1D chromatin state maps
- pipeline-hic: Process raw Hi-C data through the full ENCODE-aligned pipeline
- batch-analysis: Batch processing workflows for systematic Hi-C loop aggregation
- publication-trust: Verify literature claims backing analytical decisions
Presenting Results
- Present aggregated loops as: chr | anchor1_start | anchor1_end | anchor2_start | anchor2_end | sample_count | resolution. Show loop statistics. Suggest: "Would you like to check if any GWAS variants overlap loop anchors?"