Encode-toolkit pipeline-hic
Execute ENCODE Hi-C pipeline from FASTQ to contact matrices and loop calls. Child of pipeline-guide. Provides Nextflow execution with Docker and cloud deployment. Use when processing Hi-C data, generating contact matrices, calling loops or TADs. Trigger on: Hi-C pipeline, chromatin conformation, contact matrix, loop calling, TAD detection, Juicer, HiCCUPS, 3D genome.
git clone https://github.com/ammawla/encode-toolkit
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugin/skills/pipeline-hic" ~/.claude/skills/ammawla-encode-toolkit-pipeline-hic && rm -rf "$T"
plugin/skills/pipeline-hic/SKILL.mdENCODE Hi-C Pipeline: FASTQ to Contact Matrices and Loops
When to Use
- User wants to run a Hi-C processing pipeline from FASTQ to contact matrices and loop calls
- User asks about "Hi-C pipeline", "contact matrix", "loop calling", "Juicer", "HiCCUPS", or "TAD detection"
- User needs to process Hi-C data for 3D genome structure analysis
- Example queries: "process my Hi-C FASTQs", "generate contact matrices from Hi-C", "call chromatin loops with HiCCUPS"
Execute the ENCODE Hi-C pipeline for chromatin conformation capture data, producing multi-resolution contact matrices, loop calls, and compartment annotations.
Pipeline Overview
FASTQ -> Trim -> BWA (per-mate) -> pairtools parse -> dedup -> .pairs | +------------+------------+ | | Juicer pre -> .hic cooler -> .mcool | | HiCCUPS loops Compartments
ENCODE Repository
- GitHub:
ENCODE-DCC/hic-pipeline - Container:
encodedcc/hic-pipeline - WDL: Available for Cromwell execution
- This skill: Nextflow DSL2 reimplementation for portability
Core Tools and Versions
| Tool | Version | Purpose | Citation |
|---|---|---|---|
| BWA-MEM | 0.7.17 | Alignment (per-mate) | Li & Durbin 2009 |
| pairtools | 1.0.3 | Pair classification, dedup | Open2C |
| Juicer tools | 2.20.00 | .hic generation, HiCCUPS | Durand et al. 2016 |
| cooler | 0.9.3 | .cool/.mcool generation | Abdennur & Mirny 2020 |
| samtools | 1.19 | BAM operations | Li et al. 2009 |
| FastQC | 0.12.1 | Read quality | Andrews (Babraham) |
| MultiQC | 1.21 | Aggregated QC | Ewels et al. 2016 |
Key Literature
-
Rao et al. 2014 - "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping" (Cell, ~5,000 citations) DOI: 10.1016/j.cell.2014.11.021
-
Lieberman-Aiden et al. 2009 - "Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome" (Science, ~6,000 citations) DOI: 10.1126/science.1181369
-
Durand et al. 2016 - "Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments" (Cell Systems, ~2,000 citations) DOI: 10.1016/j.cels.2016.07.002
-
Abdennur & Mirny 2020 - "Cooler: scalable storage for Hi-C data and other genomically labeled arrays" (Bioinformatics) DOI: 10.1093/bioinformatics/btz540
-
Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z
Execution
Quick Start (Local)
nextflow run main.nf \ -profile local \ --reads '/data/fastq/*_R{1,2}.fastq.gz' \ --bwa_index '/ref/bwa_index/genome.fa' \ --chrom_sizes '/ref/hg38.chrom.sizes' \ --restriction_site 'GATC' \ --outdir results/ \ -resume
SLURM HPC
nextflow run main.nf \ -profile slurm \ --reads '/data/fastq/*_R{1,2}.fastq.gz' \ --bwa_index '/ref/bwa_index/genome.fa' \ --chrom_sizes '/ref/hg38.chrom.sizes' \ --restriction_site 'GATC' \ --outdir results/ \ -resume
Cloud (GCP / AWS)
nextflow run main.nf \ -profile gcp \ --reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \ --bwa_index 'gs://bucket/ref/genome.fa' \ --chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \ --restriction_site 'GATC' \ --outdir 'gs://bucket/results/' \ -resume
Resource Requirements
| Step | CPUs | RAM | Time (2B contacts) |
|---|---|---|---|
| BWA alignment | 8 | 16 GB | 4-6 hours |
| pairtools parse | 4 | 8 GB | 2-3 hours |
| pairtools dedup | 4 | 16 GB | 1-2 hours |
| Juicer pre + hic | 4 | 64 GB | 2-4 hours |
| HiCCUPS | 4 | 16 GB (+ GPU optional) | 1-2 hours |
| Total | 8 | 64 GB | 8-16 hours |
Pipeline Parameters
| Parameter | Default | Description |
|---|---|---|
| required | Glob pattern to paired FASTQ files |
| required | Path to BWA genome index (.fa with .bwt etc.) |
| required | Chromosome sizes file |
| | Restriction enzyme site (GATC for MboI/DpnII) |
| | Output directory |
| | Matrix resolutions |
| | Minimum MAPQ for pair filtering |
| | Genome assembly name for .hic header |
Output Files
results/ fastqc/ # Raw read quality alignment/ {sample}.R1.bam # Per-mate alignments {sample}.R2.bam pairs/ {sample}.pairs.gz # Classified, deduplicated pairs {sample}.dedup_stats.txt # Duplication metrics {sample}.pair_stats.txt # Pair type classification matrices/ {sample}.hic # Juicer .hic file (primary output) {sample}.mcool # Cooler multi-resolution matrix loops/ {sample}.hiccups_loops.bedpe # Called loops (HiCCUPS) qc/ {sample}.contact_stats.txt # Contact statistics multiqc/ multiqc_report.html
.hic File Format
The .hic format (Juicer) stores multi-resolution contact matrices with normalization vectors. Can be visualized in Juicebox and loaded by
hic-straw in Python/R.
.mcool File Format
The .mcool format (cooler) is an HDF5-based multi-resolution contact matrix. Widely supported by
cooler, cooltools, HiGlass, and FAN-C.
QC Thresholds (ENCODE Standards)
| Metric | Pass | Warning | Fail |
|---|---|---|---|
| Valid pair fraction | >40% | 25-40% | <25% |
| Cis contacts (>20kb) | >40% | 25-40% | <25% |
| Cis/trans ratio | >1.5 | 1.0-1.5 | <1.0 |
| Library complexity (unique/total) | >0.7 | 0.5-0.7 | <0.5 |
| Contacts per resolution | See below | - | - |
Resolution vs Depth Requirements
| Resolution | Minimum Contacts Needed | Typical Depth |
|---|---|---|
| 1 kb | >2 billion | Very deep |
| 5 kb | >500 million | Deep |
| 10 kb | >200 million | Standard |
| 25 kb | >50 million | Moderate |
| 100 kb | >10 million | Low |
Pair Classification
pairtools classifies read pairs into categories:
| Category | Description | Use |
|---|---|---|
| UU | Both uniquely mapped | Valid contact |
| UR/RU | One unique, one rescued | Valid (rescued) |
| UX/XU | One unique, one unmapped | Not used |
| DD | Both duplicate | Removed |
| WW | Walk pair (same strand) | Indicates ligation artifact |
| NR | Null/rescue pair | Not used |
Only UU pairs (and optionally UR) are used for contact matrices.
Critical Pitfalls
Restriction Enzyme Choice
The restriction enzyme determines fragment size and resolution:
- MboI/DpnII (GATC): 4-cutter, ~256 bp average fragment -- higher resolution
- HindIII (AAGCTT): 6-cutter, ~4 kb average fragment -- lower resolution
- Arima (proprietary): Two enzymes, ~160 bp average -- highest resolution
- Always verify which enzyme was used before processing
Normalization Method
Different normalization methods yield different results:
- KR (Knight-Ruiz): Default in Juicer, balanced normalization
- ICE (Imakaev et al.): Used by cooler/cooltools, iterative correction
- VC (Vanilla Coverage): Simple coverage normalization
- ENCODE standard: KR normalization. Always document which was used.
Resolution Depends on Depth
Do not call features at resolutions unsupported by sequencing depth:
- Calling 1 kb loops from 100M contacts will produce noise
- Check the Juicer resolution QC to determine achievable resolution
- Loop calling (HiCCUPS) typically requires 5-10 kb resolution
Ligation Artifacts
Same-strand pairs (WW) indicate self-ligation or undigested fragments:
- High WW fraction (>30%) suggests poor digestion
- Monitor pair type distribution as a QC metric
- Re-ligation distance plots should show enrichment at restriction sites
Provenance Integration
After pipeline completion, log all outputs:
encode_log_derived_file( file_path="/results/matrices/sample1.hic", source_accessions=["ENCSR...", "ENCFF..."], description="Hi-C contact matrix from ENCODE Hi-C pipeline", file_type="hic", tool_used="BWA 0.7.17 + pairtools 1.0.3 + Juicer 2.20.00", parameters="MboI digestion, KR normalization, resolutions 1kb-1Mb" )
Reference Files
Detailed step-by-step documentation is provided in the
references/ directory:
-- Read QC and adapter trimming01-qc-trimming.md
-- BWA per-mate alignment strategy02-alignment.md
-- pairtools parse, sort, and dedup03-pair-processing.md
-- Juicer .hic and cooler .mcool generation04-matrix-generation.md
-- HiCCUPS loop detection and QC05-loop-calling.md
Walkthrough: Processing ENCODE Hi-C from FASTQ to Contact Maps and Loops
Goal: Process raw Hi-C FASTQ files through the ENCODE pipeline to generate contact matrices, TAD calls, and chromatin loop predictions. Context: Hi-C captures 3D chromatin organization. The pipeline uses BWA for chimeric read alignment, pairtools for pair processing, and Juicer/HiCCUPS for loop calling.
Step 1: Find Hi-C experiment
encode_get_experiment(accession="ENCSR000AKA")
Expected output:
{ "accession": "ENCSR000AKA", "assay_title": "Hi-C", "biosample_summary": "GM12878", "replicates": 2, "status": "released" }
Step 2: List FASTQ files
encode_list_files(accession="ENCSR000AKA", file_format="fastq")
Expected output:
{ "files": [ {"accession": "ENCFF500HI1", "output_type": "reads", "paired_end": "1", "file_size_mb": 35000}, {"accession": "ENCFF501HI2", "output_type": "reads", "paired_end": "2", "file_size_mb": 36000} ] }
Interpretation: Hi-C paired-end reads represent chimeric ligation junctions. Each read pair captures a 3D contact.
Step 3: Run the Hi-C pipeline
nextflow run pipeline-hic/main.nf \ --fastq_r1 ENCFF500HI1.fastq.gz \ --fastq_r2 ENCFF501HI2.fastq.gz \ --genome GRCh38 \ --restriction_enzyme DpnII \ --resolution 5000,10000,25000 \ -profile docker
Key pipeline steps:
- BWA-MEM alignment (chimeric read handling)
- pairtools parse (extract valid pairs)
- pairtools dedup (remove PCR duplicates)
- Contact matrix generation (.hic format)
- TAD calling (directionality index or insulation score)
- Loop calling (HiCCUPS at 5kb/10kb/25kb resolution)
Step 4: Validate output quality
| Metric | Threshold | Purpose |
|---|---|---|
| Cis/trans ratio | > 60% cis | Library quality |
| Long-range cis | > 40% of cis | Useful contacts |
| Valid pairs | > 50% of mapped | Ligation success |
| Duplicate rate | < 30% | Library complexity |
Step 5: Identify significant loops
Download loop calls for downstream analysis:
encode_list_files(accession="ENCSR000AKA", file_format="bedpe", assembly="GRCh38")
Integration with downstream skills
- Loop calls (BEDPE) feed into -> hic-aggregation for cross-tissue loop catalog
- TAD boundaries feed into -> regulatory-elements for domain-level regulation
- Loop anchors feed into -> peak-annotation for enhancer-promoter assignment
- Contact data integrates with -> visualization-workflow for 3D genome display
- Pipeline provenance logged by -> data-provenance
Code Examples
1. Find Hi-C data for 3D genome analysis
encode_search_experiments( assay_title="Hi-C", organ="heart" )
Expected output:
{ "total": 4, "experiments": [ { "accession": "ENCSR654HRT", "assay_title": "Hi-C", "biosample_summary": "heart left ventricle tissue male adult (51 years)", "status": "released" } ] }
2. Get experiment details for pipeline configuration
encode_get_experiment(accession="ENCSR654HRT")
Expected output:
{ "accession": "ENCSR654HRT", "assay_title": "Hi-C", "replicates": 2, "biosample_summary": "heart left ventricle tissue male adult (51 years)", "files_count": 18, "assembly": "GRCh38", "audit": {"ERROR": 0, "WARNING": 1} }
Integration
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Chromatin loops (BEDPE) | hic-aggregation | Cross-tissue loop catalog |
| TAD boundaries | regulatory-elements | Domain-level regulatory architecture |
| Loop anchors (BED) | peak-annotation | Assign genes to loop-connected enhancers |
| Contact matrices (.hic) | visualization-workflow | 3D genome visualization |
| Loop-disrupting coordinates | variant-annotation | Identify variants breaking chromatin contacts |
| QC metrics | quality-assessment | Validate Hi-C library quality |
| Pipeline parameters | data-provenance | Record BWA/pairtools/Juicer versions |
| Loop anchor regions | motif-analysis | Discover CTCF motifs at loop anchors |
Related Skills
-- Parent skill with compute resource assessment and cloud setuppipeline-guide
-- Aggregate Hi-C loops across samples/tissueshic-aggregation
-- Evaluate pipeline output quality metricsquality-assessment
-- Track all pipeline inputs, outputs, and parametersdata-provenance
-- Download ENCODE Hi-C FASTQ files for pipeline inputdownload-encode
-- Verify literature claims backing analytical decisionspublication-trust
Presenting Results
When reporting Hi-C pipeline results:
- Valid pair count: Report total valid pairs (UU + optionally UR) and the fraction of all read pairs that are valid contacts
- Cis/trans ratio: Report the cis/trans contact ratio (>1.5 pass) and long-range cis fraction (>20kb, >40% expected). These are the primary Hi-C quality indicators
- Contact matrix resolution: Report the achievable resolution based on sequencing depth (e.g., "500M valid pairs supports 5kb resolution") and list all resolutions generated
- Loop counts: Report HiCCUPS loop count at each resolution tested and note the resolution with the most loops called
- Matrix paths: Provide paths to the .hic file (Juicebox-compatible) and .mcool file (cooler/HiGlass-compatible)
- Key QC metrics: Present library complexity (unique/total >0.7), WW pair fraction (<30%), and pair classification distribution in a summary table
- Normalization: Note which normalization was applied (KR default for Juicer)
- Next steps: Suggest
for cross-sample loop catalogs, orhic-aggregation
for Juicebox/HiGlass session setupvisualization-workflow