Encode-toolkit hic-aggregation

Build comprehensive chromatin contact maps by aggregating Hi-C loop calls (BEDPE) across multiple ENCODE experiments, donors, and labs. Use when the user wants to answer "what regions are in 3D contact in my tissue?" by creating a union catalog of chromatin loops. Handles resolution-aware anchor matching, cross-lab variation, and Hi-C-specific quality metrics.

install

source · Clone the upstream repo

git clone https://github.com/ammawla/encode-toolkit

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugin/skills/hic-aggregation" ~/.claude/skills/ammawla-encode-toolkit-hic-aggregation && rm -rf "$T"

manifest: plugin/skills/hic-aggregation/SKILL.md

source content

Aggregate Hi-C Chromatin Contacts Across Studies

When to Use

User wants to build a comprehensive catalog of chromatin loops from multiple Hi-C experiments
User asks "what regions are in 3D contact in my tissue?" or "aggregate loop calls across donors"
User needs a union catalog of BEDPE loops with resolution-aware anchor matching
User wants to identify high-confidence loops supported by multiple experiments
Example queries: "aggregate Hi-C loops for K562", "combine chromatin contacts across labs", "find consensus TAD boundaries in liver"

Build a comprehensive catalog of chromatin loops for a tissue/cell type by merging BEDPE loop calls from multiple ENCODE Hi-C experiments.

Scientific Rationale

The question: "What regions are in 3D physical contact in my tissue?"

Like histone marks and accessibility, chromatin loops are a detection question. If a loop between Region A and Region B is detected in one donor but not another, the contact is still real — individual variation, sequencing depth, and computational resolution explain absence. We want the union of all detected contacts.

Key Concepts

Hi-C data measures pairwise chromatin interactions genome-wide. After processing:

Contact matrix (
```
.hic
```
file): Genome-wide interaction frequencies at multiple resolutions
Loop calls (BEDPE): Statistically significant point interactions (loops) identified by algorithms like HICCUPS or Juicer
TAD boundaries: Topologically associating domain boundaries
Compartments: A/B compartment assignments

BEDPE format (Paired-End BED):

chr1  start1  end1  chr2  start2  end2  name  score  strand1  strand2

Each row represents a contact between two genomic anchor regions.

Literature Support

Loop Catalog (Reyna et al. 2025, Nucleic Acids Research): Created a union catalog of 4.19M unique loops across 1,089 Hi-C datasets. Demonstrated that union approach captures tissue-specific and constitutive loops. Used resolution-aware merging at 5kb, 10kb, and 25kb bins.
AQuA Tools (Chakraborty et al. 2025): Toolkit for BEDPE intersection, union, and annotation. Handles paired-region arithmetic.
mariner (Flores et al. 2024, Bioinformatics): R/Bioconductor package for BEDPE manipulation including merging loops across experiments with configurable anchor tolerance.
ENCODE Phase 3 (Gorkin et al. 2020, Nature, 301 citations): Integrated Hi-C data across tissues to define regulatory loops connecting enhancers to promoters.
ENCODE Blacklist (Amemiya et al. 2019, Scientific Reports, 1,372 citations): Problematic genomic regions to filter from loop anchors. DOI
Mustache (Roayaei Ardakany et al. 2020, Genome Biology, 165 citations): Multi-scale loop caller that recovers more validated loops than HICCUPS. Different callers produce discordant loop sets.
Wolff et al. 2022 (GigaScience): Benchmark showing loop callers intersect by ~50% at most — critical context for why union approach is necessary.

Step 1: Find All Available Hi-C Data

encode_search_experiments(
    assay_title="Hi-C",
    organ="pancreas",           # user's tissue of interest
    biosample_type="tissue",
    limit=100
)

Present a summary to the user:

Total Hi-C experiments
Labs represented
Unique donors/biosamples
Resolution(s) available (check experiment metadata)

Use

encode_get_facets

to check availability:

encode_get_facets(assay_title="Hi-C", organ="pancreas")

Note: Hi-C data is computationally expensive to produce, so there are typically fewer experiments per tissue than ChIP-seq or ATAC-seq. Even 2-3 experiments can be valuable for union catalogs.

Step 2: Quality-Gate Each Experiment

encode_get_experiment(accession="ENCSR...")

Hi-C Quality Checks

Audit status: no ERROR flags
Sequencing depth: 400M+ valid read pairs for loop calling (ENCODE standard)
Cis/trans ratio: >60% cis contacts expected (low cis suggests noisy library)
Hi-C-specific QC: Library complexity, PCR duplicate rate
Has loop calls (BEDPE output) — not all Hi-C experiments have called loops
Resolution: at least 5-10kb resolution for loop detection

Include if:

Has BEDPE loop calls at consistent resolution
Passes ENCODE audit (no ERROR flags)
Adequate sequencing depth for loop resolution

Exclude if:

ERROR audit flags
Only contact matrices without loop calls
Very low sequencing depth (<200M valid pairs — insufficient for loop calling)

Track all included experiments:

encode_track_experiment(accession="ENCSR...")

Step 3: Download Loop Call Files

For each experiment, get BEDPE loop calls:

# Search for loop/interaction files
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bedpe",
    assembly="GRCh38"
)

# Also check for BED-formatted loop files
encode_list_files(
    experiment_accession="ENCSR...",
    output_type="chromatin interactions",
    assembly="GRCh38"
)

# Or contact domains
encode_list_files(
    experiment_accession="ENCSR...",
    output_type="contact domains",
    assembly="GRCh38"
)

File selection priority:

Chromatin interactions (loop calls from HICCUPS or similar)
Contact domains (TADs — different analysis, handle separately)
Replicated loops (if available)

Prefer

preferred_default=True

files when available.

encode_download_files(
    file_accessions=["ENCFF...", ...],
    download_dir="/path/to/data/hic_loops",
    organize_by="flat"
)

Step 4: Understanding Hi-C Resolution and Anchors

Critical: Resolution-Aware Processing

Hi-C loop anchors are binned regions, not precise positions. The resolution determines anchor size:

Resolution	Anchor Width	Best For	Typical Loop Count
5 kb	5,000 bp	Fine-scale promoter-enhancer loops	More loops
10 kb	10,000 bp	Standard analysis	Moderate
25 kb	25,000 bp	Large-scale domain contacts	Fewer loops

All loops being merged must be at the same resolution, or anchors must be harmonized to a common resolution.

Harmonizing Resolution

If experiments have loops called at different resolutions:

# Expand 5kb anchors to 10kb resolution
awk -v res=10000 'BEGIN{OFS="\t"} {
    # Bin anchor 1
    bin1_start = int($2/res) * res
    bin1_end = bin1_start + res
    # Bin anchor 2
    bin2_start = int($5/res) * res
    bin2_end = bin2_start + res
    print $1, bin1_start, bin1_end, $4, bin2_start, bin2_end, $7, $8, $9, $10
}' fine_res_loops.bedpe > harmonized_loops.bedpe

Step 5: Per-Sample Filtering

5a. ENCODE Blocklist Filtering (Amemiya et al. 2019)

Remove loops with anchors in artifact-prone regions (download from https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz):

# Filter loops where EITHER anchor overlaps a blocklist region
# First, extract anchor 1 and anchor 2 as separate BED files
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,NR}' sample.bedpe > anchors1.bed
awk 'BEGIN{OFS="\t"} {print $4,$5,$6,NR}' sample.bedpe > anchors2.bed

# Find anchor rows NOT in blocklist
bedtools intersect -a anchors1.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_1.txt
bedtools intersect -a anchors2.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_2.txt

# Keep only rows where BOTH anchors pass
comm -12 <(sort clean_rows_1.txt) <(sort clean_rows_2.txt) > clean_rows.txt
awk 'NR==FNR{a[$1];next} FNR in a' clean_rows.txt sample.bedpe > sample.filtered.bedpe

5b. Score Filtering

Filter by interaction score/significance:

# If BEDPE has a score column (col 8), filter to significant interactions
# Keep top 75% by score (true distribution quantile, not range-based)
TOTAL=$(wc -l < sample.filtered.bedpe)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k8,8n sample.filtered.bedpe | awk -v line="$LINE_25" 'NR==line{print $8}')
awk -v t="$THRESHOLD" '$8 >= t' sample.filtered.bedpe > sample.qfiltered.bedpe

5c. Remove Self-Ligation Artifacts

Loops where both anchors are very close are likely artifacts:

# Remove loops where anchors are on same chromosome and < 20kb apart
awk '{
    if ($1 != $4) print $0;  # inter-chromosomal: keep (rare but real)
    else if (($5 - $3) >= 20000) print $0;  # > 20kb apart: keep
}' sample.qfiltered.bedpe > sample.clean.bedpe

Step 6: Union Merge of Loops

The Paired-Region Matching Problem

Unlike peaks (single regions), loops are pairs of regions. Two loops match if both anchors overlap:

Loop 1:  [anchor1A]--------[anchor1B]
Loop 2:    [anchor2A]------[anchor2B]

These should merge if anchor1A overlaps anchor2A AND anchor1B overlaps anchor2B.

Method A: bedtools pairToPair (Recommended for simple union)

# Concatenate all filtered loops
cat sample1.clean.bedpe sample2.clean.bedpe ... > all_loops.bedpe

# Sort by anchor 1 coordinates
sort -k1,1 -k2,2n -k4,4 -k5,5n all_loops.bedpe > all_loops.sorted.bedpe

# Use a custom merge approach:
# 1. Bin anchors to resolution, creating a loop ID
# 2. Group by loop ID
# 3. Count support

awk -v res=10000 'BEGIN{OFS="\t"} {
    # Create binned anchor coordinates as loop identifier
    a1_bin = $1 ":" int($2/res)*res
    a2_bin = $4 ":" int($5/res)*res
    # Canonical order (smaller coordinate first) to handle orientation
    if (a1_bin < a2_bin) loop_id = a1_bin "-" a2_bin
    else loop_id = a2_bin "-" a1_bin
    print loop_id, $0
}' all_loops.sorted.bedpe | \
sort -k1,1 | \
awk 'BEGIN{OFS="\t"} {
    if ($1 != prev_id) {
        if (NR > 1) print chr1, start1, end1, chr2, start2, end2, count, max_score
        prev_id = $1
        chr1=$2; start1=$3; end1=$4; chr2=$5; start2=$6; end2=$7
        count = 1; max_score = $9
    } else {
        count++
        if ($9 > max_score) max_score = $9
        # Expand anchors to encompass all overlapping calls
        if ($3 < start1) start1 = $3
        if ($4 > end1) end1 = $4
        if ($6 < start2) start2 = $6
        if ($7 > end2) end2 = $7
    }
} END {
    print chr1, start1, end1, chr2, start2, end2, count, max_score
}' > union_loops.bedpe

Method B: Resolution-Binned Approach (Loop Catalog method)

Following the Loop Catalog (Reyna et al. 2025) approach:

# Bin all loop anchors to a fixed resolution
awk -v res=10000 'BEGIN{OFS="\t"} {
    a1_start = int($2/res) * res
    a1_end = a1_start + res
    a2_start = int($5/res) * res
    a2_end = a2_start + res
    # Canonical ordering
    if ($1 < $4 || ($1 == $4 && a1_start <= a2_start))
        print $1, a1_start, a1_end, $4, a2_start, a2_end
    else
        print $4, a2_start, a2_end, $1, a1_start, a1_end
}' all_loops.sorted.bedpe | \
sort -u | \
sort -k1,1 -k2,2n -k4,4 -k5,5n | \
uniq -c | \
awk 'BEGIN{OFS="\t"} {print $2,$3,$4,$5,$6,$7,$1}' > union_loops_binned.bedpe
# Columns: chr1, start1, end1, chr2, start2, end2, n_supporting_samples

Method C: Using Specialized Tools

mariner (R/Bioconductor):

library(mariner)
# Read BEDPE files as GInteractions
loops <- lapply(bedpe_files, read.table)
# Convert to GInteractions and merge
gi <- as_ginteractions(loops)
merged <- mergePairs(gi, radius = 10000)  # 10kb tolerance

AQuA Tools (Python):

# BEDPE union with anchor overlap tolerance
aqua bedpe-union -i sample1.bedpe sample2.bedpe -o union.bedpe --slop 5000

Step 7: Confidence Annotation

Given N total experiments:

Confidence	Criteria	Interpretation
High	Detected in >=50% of samples	Constitutive loop, present across individuals
Supported	Detected in 2+ samples	Likely real, some variation
Singleton	Detected in 1 sample only	May be individual-specific or depth-dependent

awk -v N=4 '{
    if ($7 >= N*0.5) conf="HIGH";
    else if ($7 >= 2) conf="SUPPORTED";
    else conf="SINGLETON";
    print $0"\t"conf"\t"$7"/"N
}' union_loops_binned.bedpe > union_loops.annotated.bedpe

Context for singletons: Hi-C loop detection is very sensitive to sequencing depth. Many singletons may simply be under-powered in other samples rather than biologically absent. The Loop Catalog found that a union approach captures ~3x more loops than any individual experiment.

Step 8: Separate Analysis for TADs and Compartments

TAD boundaries and A/B compartments require different aggregation than loops:

TAD Boundaries

TAD boundaries are single genomic positions. Aggregate like narrow peaks:

# Extract TAD boundary BED from contact domain files
# Each boundary is a narrow region
cat tad_boundaries_sample*.bed | \
bedtools sort -i - | \
bedtools merge -i - -d 40000 -c 1 -o count > union_tad_boundaries.bed
# 40kb gap tolerance because TAD boundaries are resolution-dependent

A/B Compartments

Compartment calls (eigenvector sign at each bin) should be aggregated by majority vote:

# For each resolution bin, assign A or B based on majority of samples
# This is more complex and typically done in R/Python

Step 9: Log Provenance

encode_log_derived_file(
    file_path="/path/to/union_loops.annotated.bedpe",
    source_accessions=["ENCSR...", "ENCSR...", ...],
    description="Union chromatin loops across N pancreas Hi-C experiments",
    file_type="aggregated_loops",
    tool_used="bedtools + custom merge at 10kb resolution",
    parameters="blocklist filtered, score >= 25th pctl, self-ligation >= 20kb removed, 10kb resolution binning"
)

Step 10: Summary Statistics

Report to the user:

Total input experiments: N
Experiments passing QC: M
Resolution used: Xkb
Total loops before merge: X
Union loops after merge: Y
High-confidence loops: Z (≥50% support)
Supported loops: W (2+ support)
Singleton loops: V (1 sample only)
Distance distribution: median and range of loop sizes (anchor-to-anchor)
Inter-chromosomal loops: count (expect very few)

Pitfalls Specific to Hi-C Data

Resolution mismatch: Loop calls at 5kb vs 25kb resolution will have very different anchor sizes. Always harmonize to a common resolution before merging.
Sequencing depth sensitivity: Loop calling requires deep sequencing (400M+ valid pairs). Shallowly sequenced experiments will call far fewer loops — this is under-detection, not absence.
Algorithm differences are LARGE: Wolff et al. 2022 (GigaScience) found that HICCUPS, Mustache, Fit-Hi-C, and HiCExplorer loop callers intersect by ~50% at most. Mustache tends to recover more validated loops (Roayaei Ardakany et al. 2020). If mixing callers, note this in provenance — and this discordance is itself a reason to prefer the union approach.
Orientation matters: BEDPE anchors should be canonically ordered (anchor1 < anchor2 by genomic coordinate) before merging to avoid duplicate counting.
Inter-chromosomal contacts: These are rare but real. Handle separately — they cannot be distance-filtered.
Distance distribution: Most loops are 100kb-2Mb. Very short-range contacts (<20kb) are often noise from undigested chromatin. Very long-range (>10Mb) are rare.
Do NOT mix assemblies: All files must be GRCh38 or all hg19. Hi-C resolution binning makes liftOver of loops particularly error-prone.
TADs vs loops: These are different features. TADs are domains (regions), loops are point contacts (pairs). Do not mix them in the same union.
Micro-C as complement: Micro-C achieves higher resolution than Hi-C and can detect sub-TAD loops. Treat Micro-C loops as compatible with Hi-C loops in a union (Mustache works on both).

Walkthrough: Building a Cross-Tissue Loop Catalog for the MYC Locus

Goal: Aggregate Hi-C chromatin loops across tissues to identify conserved and tissue-specific 3D contacts at the MYC gene locus. Context: Cancer research — MYC is regulated by distal enhancers via chromatin looping.

Step 1: Find Hi-C experiments across tissues

encode_search_experiments(assay_title="Hi-C", organism="Homo sapiens", limit=50)

Expected output:

{
  "total": 89,
  "results": [
    {"accession": "ENCSR000AKA", "assay_title": "Hi-C", "biosample_summary": "GM12878", "status": "released"},
    {"accession": "ENCSR489OCU", "assay_title": "Hi-C", "biosample_summary": "K562", "status": "released"},
    {"accession": "ENCSR382RFU", "assay_title": "Hi-C", "biosample_summary": "liver", "status": "released"}
  ]
}

Interpretation: 89 Hi-C experiments available. Select 5–10 spanning diverse tissue types for cross-tissue comparison.

Step 2: List loop files for each experiment

encode_list_files(accession="ENCSR000AKA", file_format="bedpe", assembly="GRCh38")

Expected output:

{
  "files": [
    {"accession": "ENCFF001ABC", "output_type": "contact domains", "file_format": "bedpe", "file_size_mb": 2.4},
    {"accession": "ENCFF002DEF", "output_type": "chromatin interactions", "file_format": "bedpe", "file_size_mb": 1.8}
  ]
}

Interpretation: Use "chromatin interactions" files for loop aggregation. Contact domains are TADs, not loops.

Step 3: Download loop files

encode_download_files(accessions=["ENCFF002DEF", "ENCFF003GHI", "ENCFF004JKL"], download_dir="/data/hic_loops")

Expected output:

{
  "downloaded": 3,
  "total_size_mb": 5.6,
  "md5_verified": true,
  "files": ["/data/hic_loops/ENCFF002DEF.bedpe", "/data/hic_loops/ENCFF003GHI.bedpe", "/data/hic_loops/ENCFF004JKL.bedpe"]
}

Step 4: Aggregate loops with resolution-aware anchor matching

Apply union merge across tissues:

Expand loop anchors by ±resolution (e.g., ±5kb for 5kb resolution data)
Merge overlapping anchors using bedtools pairToPair
Assign tissue support counts to each union loop
Filter: require ≥2 tissue support for conserved loops

Step 5: Filter to MYC locus

# MYC locus: chr8:127,700,000-128,000,000
awk '$1=="chr8" && $2>=127700000 && $3<=128000000' union_loops.bedpe > myc_loops.bedpe

Interpretation: Loops anchored at the MYC promoter connecting to distal enhancers. Conserved loops (≥3 tissues) likely represent fundamental regulatory architecture; tissue-specific loops may drive context-dependent MYC activation.

Integration with downstream skills

Feed loop anchors into → peak-annotation for gene assignment at anchor regions
Overlay with → histone-aggregation H3K27ac peaks to identify active enhancer-promoter loops
Cross-reference loop-disrupting variants via → variant-annotation
Visualize in → ucsc-browser as interaction tracks

Code Examples

1. Survey available Hi-C data by tissue

encode_get_facets(assay_title="Hi-C", facet_field="organ", organism="Homo sapiens")

Expected output:

{
  "facets": {
    "organ": {"brain": 24, "heart": 12, "liver": 8, "lung": 6, "kidney": 4, "blood": 15}
  }
}

2. Get details for a specific Hi-C experiment

encode_get_experiment(accession="ENCSR000AKA")

Expected output:

{
  "accession": "ENCSR000AKA",
  "assay_title": "Hi-C",
  "biosample_summary": "GM12878",
  "replicates": 2,
  "status": "released",
  "lab": "/labs/erez-lieberman-aiden/",
  "audit": {"WARNING": 1, "ERROR": 0}
}

3. Compare loop sets between two cell types

encode_compare_experiments(accession_1="ENCSR000AKA", accession_2="ENCSR489OCU")

Expected output:

{
  "comparison": {
    "shared": {"assay": "Hi-C", "organism": "Homo sapiens", "assembly": "GRCh38"},
    "differences": {
      "biosample": ["GM12878", "K562"],
      "lab": ["/labs/erez-lieberman-aiden/", "/labs/erez-lieberman-aiden/"]
    }
  }
}

Integration

This skill produces...	Feed into...	Purpose
Union loop catalog (BEDPE)	peak-annotation	Assign genes to loop anchors
Conserved loop coordinates	histone-aggregation	Overlay H3K27ac at anchors to find active enhancer-promoter loops
Tissue-specific loops	accessibility-aggregation	Check if loop anchors overlap open chromatin
Loop anchor BED intervals	variant-annotation	Find GWAS/clinical variants disrupting loop anchors
Loop anchor coordinates	liftover-coordinates	Convert hg19 loops to GRCh38
Aggregated loop statistics	visualization-workflow	Generate loop frequency heatmaps
Loop-gene assignments	disease-research	Connect loop disruptions to disease phenotypes

Related Skills

histone-aggregation: Loop anchors often overlap with H3K27ac/H3K4me1 peaks — integrate with histone union sets to annotate loop function
accessibility-aggregation: Loop anchors frequently coincide with accessible chromatin — validate loops by requiring anchor accessibility
regulatory-elements: Use loops to connect distal enhancers (H3K27ac) to target promoters (H3K4me3)
epigenome-profiling: Loops add 3D context to 1D chromatin state maps
pipeline-hic: Process raw Hi-C data through the full ENCODE-aligned pipeline
batch-analysis: Batch processing workflows for systematic Hi-C loop aggregation
publication-trust: Verify literature claims backing analytical decisions

Presenting Results

Present aggregated loops as: chr | anchor1_start | anchor1_end | anchor2_start | anchor2_end | sample_count | resolution. Show loop statistics. Suggest: "Would you like to check if any GWAS variants overlap loop anchors?"

Encode-toolkit hic-aggregation

Aggregate Hi-C Chromatin Contacts Across Studies

When to Use

Scientific Rationale

Key Concepts

Literature Support

Step 1: Find All Available Hi-C Data

Step 2: Quality-Gate Each Experiment

Hi-C Quality Checks

Include if:

Exclude if:

Step 3: Download Loop Call Files

Step 4: Understanding Hi-C Resolution and Anchors

Critical: Resolution-Aware Processing

Harmonizing Resolution

Step 5: Per-Sample Filtering

5a. ENCODE Blocklist Filtering (Amemiya et al. 2019)

5b. Score Filtering

5c. Remove Self-Ligation Artifacts

Step 6: Union Merge of Loops

The Paired-Region Matching Problem

Method A: bedtools pairToPair (Recommended for simple union)

Method B: Resolution-Binned Approach (Loop Catalog method)

Method C: Using Specialized Tools

Step 7: Confidence Annotation

Step 8: Separate Analysis for TADs and Compartments

TAD Boundaries

A/B Compartments

Step 9: Log Provenance

Step 10: Summary Statistics

Pitfalls Specific to Hi-C Data

Walkthrough: Building a Cross-Tissue Loop Catalog for the MYC Locus

Step 1: Find Hi-C experiments across tissues

Step 2: List loop files for each experiment

Step 3: Download loop files

Step 4: Aggregate loops with resolution-aware anchor matching

Step 5: Filter to MYC locus

Integration with downstream skills

Code Examples

1. Survey available Hi-C data by tissue

2. Get details for a specific Hi-C experiment

3. Compare loop sets between two cell types

Integration

Related Skills

Presenting Results

For the request: "$ARGUMENTS"