Encode-toolkit download-encode

Download ENCODE genomics files (BED, FASTQ, BAM, bigWig, etc.) to the user's machine. Use when the user wants to download data files from ENCODE experiments.

Install

  • Source: clone the upstream repo

    git clone https://github.com/ammawla/encode-toolkit

  • Claude Code: install into ~/.claude/skills/

    T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugin/skills/download-encode" ~/.claude/skills/ammawla-encode-toolkit-download-encode && rm -rf "$T"

Manifest: plugin/skills/download-encode/SKILL.md

Download ENCODE Files

When to Use

  • User wants to download ENCODE data files to their local machine
  • User asks to "download", "get", or "fetch" ENCODE files
  • User needs specific file formats (BED, FASTQ, BAM, bigWig) from experiments
  • User wants to batch download files matching search criteria
  • User needs to verify file integrity after download (MD5 checksums)
  • User asks about organizing downloaded files by experiment or format

Help the user download ENCODE data files to their local machine.

Download Strategy

  1. Specific files by accession: Use encode_download_files with file accession IDs (e.g., "ENCFF635JIA").

  2. Batch download by criteria: Use encode_batch_download to search and download in one step.

    • Always start with dry_run=True (the default) to preview what will be downloaded
    • Show the user the file count, total size, and file list
    • Only proceed with dry_run=False after the user confirms

  3. Download organization options (organize_by):

    • "flat": All files in one directory
    • "experiment": Organized by experiment accession (recommended)
    • "format": Organized by file format
    • "experiment_format": Organized by experiment, then format

Important Notes

  • All downloads include MD5 verification by default (verify_md5=True)
  • Ask the user for a download directory if not specified
  • Warn about large downloads (>1GB total or >50 files)
  • Files already downloaded will be skipped (idempotent)
  • For restricted files, credentials must be configured first via encode_manage_credentials
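
The large-download warning rule above can be sketched as a small check on the dry-run totals. This is an illustrative helper, not part of the toolkit: the function name is hypothetical, and treating 1 GB as 10**9 bytes is an assumption.

```python
# Thresholds taken from the note above; 1 GB = 10**9 bytes is an assumption.
MAX_BYTES_WITHOUT_WARNING = 10**9
MAX_FILES_WITHOUT_WARNING = 50


def needs_size_warning(file_count: int, total_bytes: int) -> bool:
    """Return True when a planned download exceeds either threshold."""
    return (
        file_count > MAX_FILES_WITHOUT_WARNING
        or total_bytes > MAX_BYTES_WITHOUT_WARNING
    )


print(needs_size_warning(file_count=42, total_bytes=180 * 10**6))    # False
print(needs_size_warning(file_count=24, total_bytes=8_300_000_000))  # True
```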

Pitfalls & Edge Cases

  1. Disk space: BAM files can be 5-50GB each; FASTQ files 1-20GB. Before any batch download, warn the user about the estimated total size from the dry_run preview. A single ChIP-seq experiment can produce 10-30GB of raw data files.
  2. MD5 verification failures: If MD5 verification fails, the file may be corrupted or incompletely downloaded. Always re-download rather than skipping verification. Never set verify_md5=False unless the user explicitly requests it and understands the risk.
  3. Downloading too much data: Users often request BAM files when they only need peak calls or signal tracks. Suggest preferred_default=True to get ENCODE's recommended files, or filter by output_type (e.g., "IDR thresholded peaks", "fold change over control") to avoid downloading raw data unnecessarily.
  4. Restricted/unreleased data: Files with a status other than "released" may require ENCODE credentials. Use encode_manage_credentials(action="check") to verify that credentials are configured before attempting to download restricted data.
  5. Mixed assemblies in batch download: Always specify the assembly filter (e.g., "GRCh38") in batch downloads. Without it, you may download files aligned to different genome assemblies (hg19, GRCh38, mm10), which cannot be compared directly in downstream analysis.
  6. Timeout on large files: When downloading many files or very large files, encode_batch_download handles retries and concurrent downloads better than individual encode_download_files calls. The default limit of 100 files provides a safety cap.
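
As a guard against pitfall 5, a dry-run listing can be scanned for more than one assembly before anything is downloaded. The sketch below assumes each previewed file is a dict with an "assembly" key; that record shape and the placeholder accessions are assumptions, not the documented response format.

```python
from collections import Counter


def assembly_counts(files):
    """Count files per assembly in a preview listing.

    `files` is a list of dicts with an "assembly" key (assumed shape).
    """
    return Counter(f.get("assembly", "unknown") for f in files)


# Placeholder records standing in for a dry-run file list.
preview = [
    {"accession": "FILE_A", "assembly": "GRCh38"},
    {"accession": "FILE_B", "assembly": "hg19"},
    {"accession": "FILE_C", "assembly": "GRCh38"},
]

counts = assembly_counts(preview)
if len(counts) > 1:
    print("Mixed assemblies in preview:", dict(counts))
```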

File Type Guide

When users request "files" without specifying a type, use this priority to suggest the right output_type:

  • Peak analysis: output_type="IDR thresholded peaks" (most stringent, recommended for ChIP-seq/ATAC-seq)
  • Signal visualization: file_format="bigWig", output_type="fold change over control" (for genome browser tracks)
  • Gene expression: output_type="gene quantifications" (for RNA-seq TPM/FPKM tables)
  • Raw data reprocessing: file_format="fastq" (only when the user needs to run their own pipeline)
  • Quick defaults: preferred_default=True (ENCODE's recommended files for any experiment)
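
One way to apply this guide is a lookup from the user's goal to filter keywords. In the sketch below the filter values are copied from the list above, while the goal labels ("peaks", "signal", and so on) and the function name are hypothetical.

```python
# Filter values come from the guide above; the goal keys are hypothetical labels.
SUGGESTED_FILTERS = {
    "peaks": {"output_type": "IDR thresholded peaks"},
    "signal": {"file_format": "bigWig", "output_type": "fold change over control"},
    "expression": {"output_type": "gene quantifications"},
    "raw": {"file_format": "fastq"},
}
DEFAULT_FILTERS = {"preferred_default": True}


def suggest_filters(goal=None):
    """Return filter keywords for a goal, falling back to ENCODE's defaults."""
    # Copy so callers can add assembly or organ without mutating the table.
    return dict(SUGGESTED_FILTERS.get(goal, DEFAULT_FILTERS))


filters = suggest_filters("signal")
filters["assembly"] = "GRCh38"
print(filters)
```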

What to Download for Each Analysis

| Analysis Goal | File Format | Output Type | Why This File |
|---|---|---|---|
| Peak locations (ChIP/ATAC) | bed narrowPeak | IDR thresholded peaks | Gold-standard replicated peaks passing irreproducibility threshold |
| Broad domain marks (H3K27me3) | bed broadPeak | replicated peaks | Broad marks need broadPeak format, not narrowPeak |
| Signal visualization | bigWig | fold change over control | Normalized signal track for genome browser display |
| Signal statistics | bigWig | signal p-value | Statistical significance of signal over background |
| Raw data reprocessing | fastq | reads | Starting from scratch with your own pipeline |
| Alignment inspection | bam | alignments | Check read mapping quality, fragment sizes, duplicates |
| Browser-compatible peaks | bigBed | peaks | UCSC/IGV-compatible binary peak format |
| Gene expression levels | tsv | gene quantifications | TPM/FPKM tables for RNA-seq differential expression |
| Transcript isoforms | tsv | transcript quantifications | Isoform-level expression for splicing analysis |
| 3D genome contacts | hic | contact matrix | Hi-C interaction matrices for loop/TAD calling |
| Methylation levels | bed | methylation state at CpG | Per-CpG methylation fractions for WGBS |

Assay-Specific Recommendations

| Assay | Primary Download | Secondary Download |
|---|---|---|
| Histone ChIP-seq | IDR thresholded peaks (bed) | fold change over control (bigWig) |
| TF ChIP-seq | IDR thresholded peaks (bed) | fold change over control (bigWig) |
| ATAC-seq | IDR thresholded peaks (bed) | fold change over control (bigWig) |
| DNase-seq | peaks (bed) | signal of unique reads (bigWig) |
| RNA-seq | gene quantifications (tsv) | signal of unique reads (bigWig) |
| WGBS | methylation state at CpG (bed) | signal (bigWig) |
| Hi-C | contact matrix (hic) | contact domains (bed) |
| CUT&RUN | peaks (bed) | fold change over control (bigWig) |
| CUT&Tag | peaks (bed) | fold change over control (bigWig) |
| eCLIP | peaks (bed) | signal of unique reads (bigWig) |

File Selection Priority

When multiple files exist for the same experiment, choose files in this priority order:

  1. preferred_default=True: ENCODE curators mark recommended files. Always prefer these when available. Use encode_list_files(experiment_accession="ENCSR...", preferred_default=True) to find them.

  2. Peak file hierarchy (most to least stringent):

    • IDR thresholded peaks: replicated, irreproducibility-filtered (gold standard)
    • Optimal IDR thresholded peaks: the larger of the true-replicate and pooled-pseudoreplicate IDR peak sets
    • Conservative IDR thresholded peaks: peaks passing the IDR threshold between true replicates
    • Pseudoreplicated peaks: peaks from pooled pseudoreplicates
    • Replicated peaks: peaks found in both replicates (broad marks)
  3. Signal track hierarchy:

    • fold change over control — normalized signal, best for comparing across experiments
    • signal p-value — statistical significance of enrichment
    • signal of unique reads — uniquely mapped read signal
    • signal of all reads — includes multi-mapped reads (noisier)
  4. Assembly preference:

    • GRCh38 for human (current standard): use this by default
    • hg19 for human (legacy): only if collaborators require it
    • mm10 for mouse (current standard)
    • Never mix assemblies within an analysis
  5. Replicate preference:

    • Replicated files (combined replicates) over single-replicate files
    • Biological replicates over technical replicates
    • Isogenic replication over anisogenic
  6. Status preference:

    • released — fully validated, use these
    • archived — older versions, avoid unless specifically needed
    • revoked — quality issues found, never use
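
The priority order above can be expressed as a sort key. This is a minimal sketch under stated assumptions: file records are plain dicts with "preferred_default", "output_type", and "status" keys (an assumed shape, not the documented response), only the peak hierarchy is ranked, and revoked files are dropped outright.

```python
# Orders copied from the lists above, compared case-insensitively.
PEAK_ORDER = [name.lower() for name in (
    "IDR thresholded peaks",
    "Optimal IDR thresholded peaks",
    "Conservative IDR thresholded peaks",
    "Pseudoreplicated peaks",
    "Replicated peaks",
)]
STATUS_ORDER = ["released", "archived"]  # revoked files are excluded below


def _rank(value, order):
    """Position of value in order; unknown values sort last."""
    value = value.lower()
    return order.index(value) if value in order else len(order)


def rank_files(files):
    """Drop revoked files, then sort by the selection priority."""
    usable = [f for f in files if f.get("status") != "revoked"]
    return sorted(usable, key=lambda f: (
        not f.get("preferred_default", False),      # curated defaults first
        _rank(f.get("output_type", ""), PEAK_ORDER),
        _rank(f.get("status", ""), STATUS_ORDER),
    ))
```

Because Python's sort is stable, files that tie on all three criteria keep the order in which they were listed.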

Storage Estimates

Plan disk space before downloading. Use dry_run=True to get exact sizes for your query.

| File Type | Typical Size per File | 10 Experiments | 50 Experiments |
|---|---|---|---|
| BED peaks (narrowPeak) | 1-10 MB | 10-100 MB | 50-500 MB |
| BED peaks (broadPeak) | 5-50 MB | 50-500 MB | 250 MB - 2.5 GB |
| bigWig signal tracks | 200 MB - 2 GB | 2-20 GB | 10-100 GB |
| bigBed peaks | 1-20 MB | 10-200 MB | 50 MB - 1 GB |
| TSV quantifications | 5-50 MB | 50-500 MB | 250 MB - 2.5 GB |
| BAM alignments | 2-50 GB | 20-500 GB | 100 GB - 2.5 TB |
| FASTQ reads | 5-100 GB | 50 GB - 1 TB | 250 GB - 5 TB |
| HiC contact matrices | 500 MB - 5 GB | 5-50 GB | 25-250 GB |

Rule of thumb: Peak files and quantifications are small (MB). Signal tracks are medium (hundreds of MB to a few GB). Alignments and raw reads are large (GB to tens of GB each, sometimes more). Always preview with dry_run=True before committing to a large download.
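
Once the dry run reports a total size, it can be compared with the free space on the target volume. The sketch below uses the standard-library shutil.disk_usage; the function name and the 10% headroom are assumptions chosen for illustration.

```python
import shutil


def has_room(download_dir, required_bytes, headroom=0.10):
    """Return True if download_dir's volume can hold required_bytes.

    `headroom` reserves an extra fraction of space (assumed default: 10%).
    `download_dir` must already exist.
    """
    free = shutil.disk_usage(download_dir).free
    return required_bytes * (1 + headroom) <= free


# Example: check the current directory against an 8.3 GB preview total.
print(has_room(".", 8_300_000_000))
```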

Walkthrough: Downloading a Complete ChIP-seq Dataset

This walkthrough shows the full process for downloading H3K27ac ChIP-seq data from human pancreas tissue.

Step 1: Find the experiment

encode_search_experiments(
  assay_title="Histone ChIP-seq",
  target="H3K27ac",
  organ="pancreas",
  biosample_type="tissue",
  assembly="GRCh38"
)
  -> Returns experiments, e.g., ENCSR831JOY

Step 2: List available files

encode_list_files(
  experiment_accession="ENCSR831JOY",
  assembly="GRCh38"
)
  -> Returns all files: FASTQs, BAMs, peaks, signals
  -> Note the file accessions for the files you need

Step 3: Identify the right files

Filter to what you actually need — usually peaks + signal tracks:

encode_list_files(
  experiment_accession="ENCSR831JOY",
  assembly="GRCh38",
  preferred_default=True
)
  -> Returns ENCODE-recommended files only
  -> Typically: IDR peaks (bed) + fold change signal (bigWig)

Or be specific about output types:

encode_list_files(
  experiment_accession="ENCSR831JOY",
  output_type="IDR thresholded peaks",
  assembly="GRCh38"
)
  -> Returns only IDR peak files, e.g., ENCFF635JIA

Step 4: Download with MD5 verification

encode_download_files(
  file_accessions=["ENCFF635JIA", "ENCFF388RZD"],
  download_dir="/Users/you/data/encode/h3k27ac_pancreas",
  organize_by="experiment",
  verify_md5=True
)
  -> Downloads files with integrity verification
  -> Creates: download_dir/ENCSR831JOY/ENCFF635JIA.bed.gz
              download_dir/ENCSR831JOY/ENCFF388RZD.bigWig

Step 5: Verify the download results

Check the returned JSON for:

  • summary.successful: number of files downloaded
  • summary.failed: should be 0
  • summary.total_size_human: total size downloaded, in human-readable form
  • Each file's md5_verified: should be True for all files

If any file fails MD5 verification, re-download that specific file. Do not proceed with a corrupted file.
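
The download tools verify MD5 themselves, but a file can also be re-checked by hand against the checksum ENCODE publishes for it. The sketch below shows one way to do that with the standard library, reading in chunks so large BAM or FASTQ files are not loaded into memory. The function names are hypothetical, and this is not necessarily how the toolkit implements its own check.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute a file's MD5 hex digest, reading chunk_size bytes at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Compare a file's digest with the expected value, ignoring case and spaces."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A call such as md5_matches(local_path, published_md5) returns False for a truncated or corrupted file, in which case the file should be downloaded again.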

Step 6: Log provenance

Track the experiment and log where the data came from:

encode_track_experiment(
  accession="ENCSR831JOY",
  notes="H3K27ac ChIP-seq, pancreas tissue, downloaded for enhancer analysis"
)
  -> Stores experiment metadata, publications, and pipeline info locally

If you create derived files later (e.g., filtered peaks), log them too:

encode_log_derived_file(
  file_path="/Users/you/data/encode/h3k27ac_pancreas/filtered_peaks.bed",
  source_accessions=["ENCSR831JOY", "ENCFF635JIA"],
  description="H3K27ac peaks filtered against ENCODE Blacklist v2",
  file_type="filtered_peaks",
  tool_used="bedtools intersect v2.31.0",
  parameters="bedtools intersect -v -a ENCFF635JIA.bed.gz -b hg38-blacklist.v2.bed"
)

Walkthrough: Batch Download for Multi-Experiment Analysis

Use encode_batch_download when downloading data across multiple experiments, such as collecting all H3K4me3 peaks across many tissues.

Step 1: Preview with dry run

Always start with dry_run=True to see what will be downloaded:

encode_batch_download(
  download_dir="/Users/you/data/encode/h3k4me3_multi_tissue",
  output_type="IDR thresholded peaks",
  target="H3K4me3",
  assembly="GRCh38",
  assay_title="Histone ChIP-seq",
  biosample_type="tissue",
  organize_by="experiment",
  dry_run=True
)
  -> Preview: 42 files, 180MB total, from 42 experiments
  -> Shows file list with accessions, sizes, and experiment info

Step 2: Review and confirm

Present the dry run results to the user:

  • Total file count and size
  • Breakdown by experiment or tissue
  • Any unexpected files (wrong assembly, archived status)

If the count is too large, narrow with additional filters (e.g., add organ="pancreas").

Step 3: Execute the download

encode_batch_download(
  download_dir="/Users/you/data/encode/h3k4me3_multi_tissue",
  output_type="IDR thresholded peaks",
  target="H3K4me3",
  assembly="GRCh38",
  assay_title="Histone ChIP-seq",
  biosample_type="tissue",
  organize_by="experiment",
  dry_run=False
)
  -> Downloads all 42 files with MD5 verification
  -> Creates: download_dir/ENCSR.../ENCFF....bed.gz (one per experiment)

Step 4: Handle failed downloads

If some files fail:

  • Check the errors array in the response for specific failure reasons
  • Network timeouts: retry the failed accessions with encode_download_files
  • MD5 mismatches: re-download the specific files
  • 403/404 errors: the file may be restricted or withdrawn from ENCODE

For example, to retry the failed accessions:

encode_download_files(
  file_accessions=["ENCFF_FAILED_1", "ENCFF_FAILED_2"],
  download_dir="/Users/you/data/encode/h3k4me3_multi_tissue",
  organize_by="experiment",
  verify_md5=True
)
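
The triage rules in this step can be sketched as a helper that separates accessions worth retrying from those to report back to the user. The shape of each error record (a dict with "accession" and "error" keys) is an assumption; check the actual errors array before relying on it.

```python
def triage_errors(errors):
    """Split error records into (retry, report) accession lists.

    Assumes each record looks like {"accession": ..., "error": ...}.
    403/404 failures are reported; timeouts and MD5 mismatches are retried.
    """
    retry, report = [], []
    for err in errors:
        reason = str(err.get("error", "")).lower()
        if "403" in reason or "404" in reason:
            report.append(err["accession"])
        else:
            retry.append(err["accession"])
    return retry, report


# Placeholder error records for illustration.
example_errors = [
    {"accession": "FILE_A", "error": "Read timed out"},
    {"accession": "FILE_B", "error": "HTTP 404 Not Found"},
    {"accession": "FILE_C", "error": "MD5 mismatch"},
]
retry, report = triage_errors(example_errors)
print("retry:", retry)
print("report:", report)
```

The retry list can then be passed as file_accessions to encode_download_files.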

Organization strategies

Choose organize_by based on your analysis plan:

| Strategy | Directory Structure | Best For |
|---|---|---|
| "experiment" | download_dir/ENCSR.../files | Comparing files within experiments |
| "format" | download_dir/bed/files, download_dir/bigWig/files | Running format-specific pipelines |
| "experiment_format" | download_dir/ENCSR.../bed/files | Large multi-format downloads |
| "flat" | download_dir/files | Small downloads, quick access |
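
To predict where a file will land under each strategy, the directory structures in the table can be modelled as below. This is a sketch of the layouts as described here, not the toolkit's own code; the function name is hypothetical.

```python
from pathlib import PurePosixPath


def target_dir(download_dir, organize_by, experiment, file_format):
    """Return the directory a file would be placed in for a given strategy."""
    base = PurePosixPath(download_dir)
    layouts = {
        "flat": base,
        "experiment": base / experiment,
        "format": base / file_format,
        "experiment_format": base / experiment / file_format,
    }
    if organize_by not in layouts:
        raise ValueError(f"unknown organize_by value: {organize_by!r}")
    return layouts[organize_by]


print(target_dir("/data/encode", "experiment_format", "ENCSR831JOY", "bed"))
# /data/encode/ENCSR831JOY/bed
```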

Gotchas

  1. Assembly mismatch: Always specify assembly="GRCh38" for human or assembly="mm10" for mouse. Omitting this in batch downloads can produce a mix of GRCh38 and hg19 files that cannot be compared. There is no automated liftover in the download tools; you must use the liftover-coordinates skill separately if you need to convert between assemblies.

  2. File status matters: Only status="released" files are fully validated by ENCODE. Archived files may have been superseded by newer processing. Revoked files had quality issues discovered after release. Always check file status before using data in analysis.

  3. MD5 verification is not optional: Corrupted files produce silent errors in downstream analysis, such as wrong peak counts, shifted signal tracks, and truncated alignments. The few extra seconds spent on MD5 verification prevent hours of debugging. Only disable it with verify_md5=False if you are re-downloading a file you already verified.

  4. Streaming for large files: BAM and FASTQ files are downloaded with streaming to avoid loading entire files into memory. The encode_batch_download tool handles this automatically. If a download is interrupted, re-running the same command will skip already-completed files (idempotent).

  5. The 100-file safety limit: encode_batch_download defaults to limit=100 to prevent accidentally downloading thousands of files. If your query returns more than 100 files, narrow your filters or run multiple targeted batches. You can increase the limit explicitly if you have confirmed the download is intentional.

  6. preferred_default may return nothing: Not all experiments have files marked as preferred_default=True. If this filter returns empty results, fall back to filtering by specific output_type and assembly instead.

  7. Credential requirements: Files with status "in progress" or "submitted" require ENCODE DCC credentials. Use encode_manage_credentials(action="check") before attempting restricted downloads. Contact the ENCODE DCC for access to unreleased data.

  8. Duplicate files across experiments: When downloading the same file type across many experiments, some control files (e.g., input ChIP-seq) may be shared between experiments. The download tool skips already-existing files, so shared controls will not be downloaded twice.

Code Examples

1. Smart download: "Download IDR thresholded peaks for H3K4me3 ChIP-seq in GRCh38"

Step 1: Preview with dry run
  encode_batch_download(
    download_dir="/Users/you/data/encode",
    output_type="IDR thresholded peaks",
    target="H3K4me3",
    assembly="GRCh38",
    assay_title="Histone ChIP-seq",
    dry_run=True
  )
  -> Shows: 18 files, 45MB total (peak files are small)
  -> Present file list to user for confirmation

Step 2: Confirm and download
  encode_batch_download(
    download_dir="/Users/you/data/encode",
    output_type="IDR thresholded peaks",
    target="H3K4me3",
    assembly="GRCh38",
    assay_title="Histone ChIP-seq",
    dry_run=False
  )
  -> Downloads with MD5 verification, skips already-downloaded files

2. Batch download with preview: "Download all ATAC-seq bigWig signal tracks for pancreas"

Step 1: Dry run to see what's available
  encode_batch_download(
    download_dir="/Users/you/data/encode/atac_pancreas",
    file_format="bigWig",
    assay_title="ATAC-seq",
    organ="pancreas",
    assembly="GRCh38",
    organize_by="experiment",
    dry_run=True
  )
  -> Review: 24 files, 8.3GB total, from 6 experiments

Step 2: Download after user confirms
  (same call with dry_run=False)

3. Organized download: "Download files organized by experiment and format"

Step 1: Download specific files with organization
  encode_download_files(
    file_accessions=["ENCFF635JIA", "ENCFF388RZD", "ENCFF901ABC"],
    download_dir="/Users/you/data/encode",
    organize_by="experiment_format"
  )
  -> Creates: download_dir/ENCSR.../bed/file.bed.gz
             download_dir/ENCSR.../bigWig/file.bigWig

4. ENCODE-recommended defaults: "Just get the recommended files for this experiment"

encode_batch_download(
  download_dir="/Users/you/data/encode/defaults",
  preferred_default=True,
  assembly="GRCh38",
  assay_title="Histone ChIP-seq",
  target="H3K27me3",
  organ="liver",
  organize_by="experiment",
  dry_run=True
)
  -> Returns only ENCODE-curated default files
  -> Typically the most useful subset for standard analyses

5. RNA-seq expression data: "Download gene quantification tables"

encode_batch_download(
  download_dir="/Users/you/data/encode/rnaseq_brain",
  output_type="gene quantifications",
  assay_title="total RNA-seq",
  organ="brain",
  assembly="GRCh38",
  organize_by="experiment",
  dry_run=True
)
  -> Preview: TSV files with TPM/FPKM values, typically 5-20MB each

Integration

| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Downloaded FASTQ files | pipeline-chipseq through pipeline-cutandrun | Raw data for pipeline processing |
| Downloaded BED peak files | peak-annotation | Peak files for gene assignment |
| Downloaded bigWig signals | visualization-workflow | Signal tracks for genome browser |
| MD5-verified files | data-provenance | Verified file acquisition for audit trail |
| Downloaded BED files | histone-aggregation | Peak files for cross-experiment merge |
| Downloaded methylation files | methylation-aggregation | CpG methylation data for aggregation |
| File download metadata | track-experiments | Record which files were downloaded |
| Downloaded reference data | bioinformatics-installer | Reference genomes and annotations |

Presenting Results

When presenting download results to the user:

  • Show a download summary table: filename | size | format | MD5 status | path
  • For dry_run=True, present what WOULD be downloaded with total size estimate and file count
  • Report any failures separately with error messages
  • After successful downloads, suggest next steps:
    • "Would you like to log these as tracked experiments?" (use encode_track_experiment)
    • "Would you like to log any derived files for provenance?" (use encode_log_derived_file)
  • For large batch downloads, summarize by experiment and format rather than listing every file

Key Literature

  • ENCODE Phase 3: ENCODE Project Consortium 2020 (Nature, ~2,000 citations) DOI: 10.1038/s41586-020-2493-4 — Source catalog for all downloadable genomic data.
  • FAIR Principles: Wilkinson et al. 2016 (Scientific Data, ~5,000 citations) DOI: 10.1038/sdata.2016.18 — Findable, Accessible, Interoperable, Reusable data principles that ENCODE's download infrastructure supports.
  • IDR Framework: Li et al. 2011 (Annals of Applied Statistics) DOI: 10.1214/11-AOAS466 — Irreproducible Discovery Rate method used for peak thresholding in ENCODE.
  • ENCODE Blacklist: Amemiya et al. 2019 (Scientific Reports, ~1,400 citations) DOI: 10.1038/s41598-019-45839-z — Regions to exclude from downloaded peak files before analysis.

Related Skills

| Skill | When to Use Instead/Additionally |
|---|---|
| search-encode | Finding experiments and files before downloading |
| track-experiments | Tracking downloaded experiments locally |
| data-provenance | Logging derived files created from downloaded data |
| quality-assessment | Evaluating experiment quality before downloading |
| publication-trust | Evaluating the provenance and trustworthiness of linked publications |
| liftover-coordinates | Converting between genome assemblies if you downloaded hg19 data |
| batch-analysis | Running analyses across multiple downloaded experiments |

For the request: "$ARGUMENTS"