Encode-toolkit bioinformatics-installer

Install bioinformatics tools for ENCODE data analysis. Covers CLI tools (BWA, STAR, samtools, MACS2), R/Bioconductor packages (DESeq2, Seurat, ChIPseeker), Python packages (Scanpy, deeptools), and Nextflow pipeline infrastructure. Generates conda environments, R install scripts, and Python requirements. Use when the user needs to set up a bioinformatics workstation, install tools for a specific assay, create reproducible environments, or troubleshoot dependency issues. Trigger on: install tools, set up environment, conda create, bioinformatics setup, install R packages, install Bioconductor, install pipeline tools.

install

source · Clone the upstream repo

git clone https://github.com/ammawla/encode-toolkit

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bioinformatics-installer" ~/.claude/skills/ammawla-encode-toolkit-bioinformatics-installer-2150c3 && rm -rf "$T"

manifest: skills/bioinformatics-installer/SKILL.md

source content

Bioinformatics Installer for ENCODE Data Analysis

Install all bioinformatics tools needed for ENCODE data analysis, organized by assay type. This skill provides ready-to-use conda environment definitions, R/Bioconductor install scripts, Python package lists, and Nextflow pipeline infrastructure setup. Every environment is version-pinned for reproducibility and tested against ENCODE uniform processing standards.

When to Use

User wants to install bioinformatics tools needed for ENCODE data analysis
User asks about "install tools", "conda environment", "setup bioinformatics", or "install HOMER/MACS2/deeptools"
User needs pre-configured conda environments for specific assay pipelines (ChIP-seq, ATAC-seq, RNA-seq, etc.)
User wants to install R/Bioconductor packages (DESeq2, Seurat, ChIPseeker) or Python packages (Scanpy, pysam)
Example queries: "install tools for ChIP-seq analysis", "set up a conda environment for ATAC-seq", "install deeptools and bedtools"

Overview

ENCODE data analysis requires a broad ecosystem of tools spanning command-line aligners, peak callers, signal processors, statistical analysis frameworks in R, Python visualization and single-cell packages, and workflow engines. Setting up these tools correctly — with compatible versions, proper channel priorities, and no dependency conflicts — is a significant barrier for new users and a reproducibility concern for experienced analysts.

This skill solves that by providing:

7 assay-specific conda environments with pinned tool versions matching ENCODE pipeline standards
R/Bioconductor install script covering 50+ packages across 8 categories
Python install script for single-cell, Hi-C, and genomics packages
Nextflow + container setup for pipeline execution on local, HPC, and cloud platforms

All environments use the same channel priority (conda-forge > bioconda > defaults) and are tested for cross-platform compatibility on Linux x86_64 and macOS (Intel + Apple Silicon where possible).

Quick Start

Install a complete environment for any assay type with a single command:

# ChIP-seq (histone or TF)
conda env create -f skills/bioinformatics-installer/environments/chipseq-env.yml

# ATAC-seq
conda env create -f skills/bioinformatics-installer/environments/atacseq-env.yml

# RNA-seq
conda env create -f skills/bioinformatics-installer/environments/rnaseq-env.yml

# Hi-C
conda env create -f skills/bioinformatics-installer/environments/hic-env.yml

# Whole-Genome Bisulfite Sequencing (WGBS)
conda env create -f skills/bioinformatics-installer/environments/wgbs-env.yml

# DNase-seq
conda env create -f skills/bioinformatics-installer/environments/dnaseseq-env.yml

# CUT&RUN / CUT&Tag
conda env create -f skills/bioinformatics-installer/environments/cutandrun-env.yml

Using mamba for faster solves (recommended):

mamba env create -f skills/bioinformatics-installer/environments/chipseq-env.yml

Install R and Python packages:

# All R/Bioconductor packages
Rscript skills/bioinformatics-installer/scripts/install-r-packages.R --all

# All Python packages
bash skills/bioinformatics-installer/scripts/install-python-packages.sh --all

# Nextflow + Docker
bash skills/bioinformatics-installer/scripts/install-nextflow.sh --docker

Per-Assay Environments

ChIP-seq Environment (

encode-chipseq

)

For histone modification and transcription factor ChIP-seq processing following ENCODE uniform pipeline standards (Landt et al. 2012, ENCODE Consortium 2020).

Tool	Version	Purpose
BWA-MEM	0.7.17	Read alignment to reference genome (Li & Durbin 2009)
samtools	1.19	BAM manipulation, sorting, indexing, flagstat (Li et al. 2009)
MACS2	2.2.9.1	Peak calling for narrow (TF) and broad (histone) marks (Zhang et al. 2008)
Picard	3.1.1	Duplicate marking and library complexity metrics (Broad Institute)
phantompeakqualtools	1.2.2	Strand cross-correlation (NSC/RSC) quality metrics (Kharchenko et al. 2008)
IDR	2.0.3	Irreproducible Discovery Rate for replicate consistency (Li et al. 2011)
deeptools	3.5.5	Signal normalization (bamCoverage), fingerprint, correlation (Ramirez et al. 2016)
bedtools	2.31.0	Interval operations, blacklist filtering (Quinlan & Hall 2010)
FastQC	0.12.1	Raw read quality assessment (Andrews 2010)
Trim Galore	0.6.10	Adapter and quality trimming via Cutadapt (Krueger 2012)
MultiQC	1.21	Aggregate QC report across all pipeline stages (Ewels et al. 2016)
bedGraphToBigWig	—	Convert bedGraph signal to bigWig for genome browser viewing (Kent et al. 2010)

Memory: BWA index for GRCh38 requires ~5.5 GB RAM. Peak calling with MACS2 typically requires 4-8 GB. phantompeakqualtools loads full BAM into memory.

Environment file:

environments/chipseq-env.yml

ATAC-seq Environment (

encode-atacseq

)

For chromatin accessibility profiling via ATAC-seq following ENCODE standards (Buenrostro et al. 2013, Corces et al. 2017).

Tool	Version	Purpose
Bowtie2	2.5.3	Alignment (preferred over BWA for ATAC-seq short fragments) (Langmead & Salzberg 2012)
MACS2	2.2.9.1	Peak calling with --nomodel --shift -100 --extsize 200 for ATAC (Zhang et al. 2008)
samtools	1.19	BAM manipulation, mitochondrial read filtering
Picard	3.1.1	Duplicate marking, insert size metrics
deeptools	3.5.5	alignmentSieve (Tn5 offset), bamCoverage (signal tracks), plotFingerprint
bedtools	2.31.0	Blacklist filtering, interval operations
FastQC	0.12.1	Raw read quality and adapter content assessment
Trim Galore	0.6.10	Adapter trimming (Nextera adapters for ATAC-seq)
MultiQC	1.21	Aggregate QC reporting

Key ATAC-seq parameters: Tn5 transposase introduces a +4/-5 bp offset that must be corrected. Fragment size distribution should show nucleosomal ladder (sub-nucleosomal, mono-, di-, tri-). TSS enrichment score should be >= 5 (GRCh38), >= 6 (hg19), or >= 10 (mm10) for high-quality data (ENCODE data standards).

Environment file:

environments/atacseq-env.yml

RNA-seq Environment (

encode-rnaseq

)

For gene expression quantification following ENCODE RNA-seq standards (Conesa et al. 2016, ENCODE Consortium 2020).

Tool	Version	Purpose
STAR	2.7.11b	Splice-aware alignment with 2-pass mapping (Dobin et al. 2013)
RSEM	1.3.3	Gene/transcript quantification with expectation-maximization (Li & Dewey 2011)
Kallisto	0.50.1	Pseudoalignment-based transcript quantification (Bray et al. 2016)
Salmon	1.10.3	Quasi-mapping transcript quantification with GC bias correction (Patro et al. 2017)
featureCounts (subread)	2.0.6	Gene-level read counting for count-based DE methods (Liao et al. 2014)
samtools	1.19	BAM handling, flagstat, idxstats
FastQC	0.12.1	Read quality assessment
Trim Galore	0.6.10	Adapter and quality trimming
MultiQC	1.21	Aggregate QC report
RSeQC	5.0.3	RNA-seq-specific QC: gene body coverage, read distribution, inner distance (Wang et al. 2012)

Memory: STAR genome generation requires 32+ GB RAM for human genome. STAR alignment requires ~30 GB RAM. Kallisto and Salmon are memory-efficient alternatives (~4 GB).

Environment file:

environments/rnaseq-env.yml

Hi-C Environment (

encode-hic

)

For chromatin conformation capture processing following ENCODE Hi-C standards (Yardimci et al. 2019, Rao et al. 2014).

Tool	Version	Purpose
BWA-MEM	0.7.17	Chimeric read alignment (each mate aligned independently)
pairtools	1.0.3	Parse, sort, deduplicate, filter contact pairs (Open2C)
cooler	0.9.3	Multi-resolution contact matrix storage and balancing (Abdennur & Mirny 2020)
Juicer	2.20.00	Contact matrix generation and HiCCUPS loop calling (Durand et al. 2016)
samtools	1.19	BAM handling for chimeric alignment parsing
bedtools	2.31.0	Restriction fragment and TAD boundary operations
FastQC	0.12.1	Read quality assessment
Trim Galore	0.6.10	Adapter trimming
MultiQC	1.21	Aggregate QC reporting

Key Hi-C parameters: Cis/trans ratio > 60%, long-range cis contacts (> 20 kb) > 40%. Resolution depends on sequencing depth: ~1 billion valid pairs for 5 kb resolution on human.

Note: Juicer requires Java 11+. Install via

conda install -c bioconda juicer_tools

or download the

.jar

directly from the Aiden Lab GitHub.

Environment file:

environments/hic-env.yml

WGBS Environment (

encode-wgbs

)

For whole-genome bisulfite sequencing (DNA methylation) following ENCODE standards (Foox et al. 2021, Schultz et al. 2015).

Tool	Version	Purpose
Bismark	0.24.2	Bisulfite-aware alignment and methylation extraction (Krueger & Andrews 2011)
MethylDackel	0.6.1	Fast methylation extraction from bisulfite BAMs (Ryan 2023)
samtools	1.19	BAM manipulation, merge, index
bedtools	2.31.0	Interval operations for DMR analysis
FastQC	0.12.1	Read quality assessment (note: bisulfite libraries have biased base composition)
Trim Galore	0.6.10	Adapter trimming with --rrbs or default mode
MultiQC	1.21	Aggregate QC reporting with Bismark module
tabix	1.19	Index methylation BED files for random access
bgzip	1.19	Block-gzip compression for indexed access

Key WGBS parameters: Bisulfite conversion rate ≥ 98% (check unmethylated spike-in lambda DNA). CpG coverage >= 10x for reliable DMR calling. M-bias plots should be checked for end-repair artifacts.

Environment file:

environments/wgbs-env.yml

DNase-seq Environment (

encode-dnaseseq

)

For DNase I hypersensitive site mapping following ENCODE standards (Thurman et al. 2012, ENCODE Consortium 2020).

Tool	Version	Purpose
BWA-MEM	0.7.17	Read alignment to reference genome
Hotspot2	2.3.1	DNase-seq hotspot detection (John et al. 2011)
HINT-ATAC	0.13.2	TF footprinting from DNase-seq data (Li et al. 2019)
F-Seq2	2.0.3	Feature density estimation for peak calling (Boyle et al. 2008, Zhao et al. 2020)
samtools	1.19	BAM handling and filtering
bedtools	2.31.0	Interval operations, blacklist filtering
FastQC	0.12.1	Read quality assessment
Trim Galore	0.6.10	Adapter trimming
MultiQC	1.21	Aggregate QC reporting

Environment file:

environments/dnaseseq-env.yml

CUT&RUN / CUT&Tag Environment (

encode-cutandrun

)

For antibody-targeted chromatin profiling via CUT&RUN (Skene & Henikoff 2017) and CUT&Tag (Kaya-Okur et al. 2019).

Tool	Version	Purpose
Bowtie2	2.5.3	Alignment (recommended for shorter CUT&RUN/Tag fragments)
SEACR	1.3	Sparse Enrichment Analysis for CUT&RUN (Meers et al. 2019)
MACS2	2.2.9.1	Alternative peak calling with adjusted parameters
samtools	1.19	BAM handling, spike-in alignment filtering
Picard	3.1.1	Duplicate marking (low duplication expected for CUT&RUN/Tag)
deeptools	3.5.5	Signal tracks, heatmaps, spike-in normalization
bedtools	2.31.0	Interval operations, suspect list filtering
FastQC	0.12.1	Read quality assessment
Trim Galore	0.6.10	Adapter trimming
MultiQC	1.21	Aggregate QC reporting

Key CUT&RUN/Tag notes: These assays have inherently lower background than ChIP-seq. Do NOT apply ChIP-seq quality thresholds — use CUT&RUN-specific metrics (Nordin et al. 2023). Apply the CUT&RUN suspect list instead of the standard ENCODE blacklist. Spike-in normalization (E. coli DNA for CUT&RUN, carry-over for CUT&Tag) is strongly recommended for quantitative comparisons.

Environment file:

environments/cutandrun-env.yml

R/Bioconductor Packages

Install all R packages needed for ENCODE downstream analysis. The install script at

scripts/install-r-packages.R

handles BiocManager setup, version locking, and category-based installation.

Core Genomic Infrastructure

These packages provide the foundation for all genomic data manipulation in R:

Package	Purpose
GenomicRanges	Interval arithmetic on genomic coordinates (Lawrence et al. 2013)
GenomicFeatures	Gene model and transcript annotation handling
rtracklayer	Import/export BED, bigWig, GFF, narrowPeak, broadPeak
IRanges	Integer range operations (underlying GenomicRanges)
GenomeInfoDb	Chromosome naming conventions (UCSC vs Ensembl vs NCBI)
BiocGenerics	Common S4 generics across Bioconductor
S4Vectors	S4 class infrastructure for Bioconductor objects
AnnotationDbi	Unified interface to annotation databases
biomaRt	Ensembl BioMart query interface for gene annotation (Durinck et al. 2009)

Differential Analysis

Package	Purpose
DESeq2	Differential gene expression with shrinkage estimators (Love et al. 2014)
edgeR	Differential expression using empirical Bayes (Robinson et al. 2010)
limma	Linear models for microarray and RNA-seq data (Ritchie et al. 2015)
DiffBind	Differential binding analysis for ChIP-seq/ATAC-seq peaks (Stark & Brown 2011)
ChIPQC	ChIP-seq quality control in R (Carroll et al. 2014)
chromVAR	Chromatin accessibility variation across single cells (Schep et al. 2017)

Annotation and Pathway Analysis

Package	Purpose
ChIPseeker	Peak annotation and visualization (Yu et al. 2015)
annotatr	Annotate genomic regions with CpG islands, genes, enhancers (Cavalcante & Sartor 2017)
clusterProfiler	Gene ontology and KEGG pathway enrichment (Yu et al. 2012)
org.Hs.eg.db	Human gene annotation database
org.Mm.eg.db	Mouse gene annotation database
TxDb.Hsapiens.UCSC.hg38.knownGene	Human transcript models (GRCh38)
TxDb.Mmusculus.UCSC.mm10.knownGene	Mouse transcript models (mm10)

Single-Cell Analysis

Package	Purpose
Seurat	Comprehensive single-cell RNA-seq analysis (Hao et al. 2021)
Signac	Single-cell chromatin accessibility (ATAC-seq) analysis (Stuart et al. 2021)
SingleCellExperiment	Core Bioconductor container for single-cell data
scater	Single-cell QC, normalization, visualization (McCarthy et al. 2017)
scran	Single-cell normalization and feature selection (Lun et al. 2016)

Bulk-to-Single-Cell Deconvolution

Package	Purpose
BayesPrism	Bayesian deconvolution with scRNA-seq reference (Chu et al. 2022)
InstaPrism	Fast approximation of BayesPrism for large datasets (Wang et al. 2024)
MuSiC_deconv	Multi-Subject Single Cell deconvolution (Wang et al. 2019)
DWLS	Dampened Weighted Least Squares deconvolution (Tsoucas et al. 2019)
BisqueRNA	Reference-based and marker-based deconvolution (Jew et al. 2020)

DNA Methylation Analysis

Package	Purpose
DMRcate	Differentially methylated region detection (Peters et al. 2021)
bsseq	Bisulfite sequencing data handling and smoothing (Hansen et al. 2012)
methylKit	Methylation analysis from bisulfite sequencing (Akalin et al. 2012)

Visualization

Package	Purpose
ComplexHeatmap	Publication-quality heatmaps with annotations (Gu et al. 2016)
EnhancedVolcano	Volcano plots for differential expression (Blighe et al. 2018)
Gviz	Genome browser-style track visualization (Hahne & Ivanek 2016)
ggplot2	Grammar of graphics for all custom plots (Wickham 2016)

Statistics and Batch Correction

Package	Purpose
sva (ComBat)	Surrogate variable analysis and batch correction (Leek et al. 2012)
WGCNA	Weighted Gene Co-expression Network Analysis (Langfelder & Horvath 2008)
ReactomePA	Reactome pathway analysis (Yu & He 2016)

Install script:

scripts/install-r-packages.R

# Install all categories
Rscript scripts/install-r-packages.R --all

# Install only specific categories
Rscript scripts/install-r-packages.R --chipseq      # DiffBind, ChIPQC, ChIPseeker
Rscript scripts/install-r-packages.R --rnaseq       # DESeq2, edgeR, limma
Rscript scripts/install-r-packages.R --singlecell   # Seurat, Signac, scater, scran
Rscript scripts/install-r-packages.R --methylation   # DMRcate, bsseq, methylKit
Rscript scripts/install-r-packages.R --deconvolution # BayesPrism, InstaPrism, MuSiC_deconv, DWLS, BisqueRNA

Python Packages

Install Python packages for single-cell analysis, Hi-C processing, signal visualization, and genomic data manipulation.

Core Single-Cell Stack

Package	Purpose
scanpy	Single-cell RNA-seq analysis framework (Wolf et al. 2018)
anndata	Annotated data matrix for single-cell (Virshup et al. 2021)
scvi-tools	Deep generative models for single-cell (Gayoso et al. 2022)
numpy	Numerical computing
pandas	Data manipulation and tabular operations
scipy	Scientific computing (sparse matrices, statistics)
matplotlib	Plotting foundation
seaborn	Statistical visualization

Genomics and Signal Processing

Package	Purpose
deeptools	Signal tracks, heatmaps, correlation (also CLI; Ramirez et al. 2016)
pyBigWig	Read/write bigWig signal files (Ryan 2023)
pysam	Python interface to samtools/htslib (Li et al. 2009)
pybedtools	Python interface to bedtools (Dale et al. 2011)

Hi-C Analysis

Package	Purpose
cooler	Multi-resolution contact matrices (Abdennur & Mirny 2020)
cooltools	Analysis toolkit for cooler data: TADs, compartments, insulation
hic-straw	Read .hic files from Juicer/Juicebox (Durand et al. 2016)
pyGenomeTracks	Genome browser visualization including Hi-C tracks

Single-Cell QC and Integration

Package	Purpose
scrublet	Doublet detection for scRNA-seq (Wolock et al. 2019)
CellBender	Remove ambient RNA contamination (Fleming et al. 2023)
harmony-pytorch	Batch integration via Harmony in PyTorch (Korsunsky et al. 2019)
scanorama	Panoramic stitching of scRNA-seq datasets (Hie et al. 2019)
bbknn	Batch-balanced KNN graph construction (Polanski et al. 2020)

Install script:

scripts/install-python-packages.sh

# Install all Python packages
bash scripts/install-python-packages.sh --all

# Install only specific categories
bash scripts/install-python-packages.sh --singlecell  # scanpy, scvi-tools, harmony
bash scripts/install-python-packages.sh --hic          # cooler, cooltools, hic-straw
bash scripts/install-python-packages.sh --deeptools    # deeptools, pyBigWig, pysam

Nextflow and Container Setup

ENCODE pipeline execution requires Nextflow DSL2 and a container runtime (Docker or Singularity).

Nextflow Installation

# Install Nextflow (requires Java 11+)
curl -s https://get.nextflow.io | bash
mv nextflow /usr/local/bin/

# Verify
nextflow -version

Docker (recommended for local/cloud)

# macOS
brew install --cask docker

# Linux (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

# Add current user to docker group (Linux)
sudo usermod -aG docker $USER

Singularity (for HPC clusters)

# Most HPC clusters have Singularity pre-installed
# Check with: module load singularity && singularity version

# If not available, install via conda:
conda install -c conda-forge singularity

Nextflow Configuration Profiles

The pipeline skills (pipeline-chipseq, pipeline-atacseq, etc.) include

nextflow.config

files with profiles for local, SLURM, GCP, and AWS execution. Select the appropriate profile:

# Local with Docker
nextflow run main.nf -profile local

# HPC with Singularity
nextflow run main.nf -profile slurm

# Google Cloud
nextflow run main.nf -profile gcp

# AWS Batch
nextflow run main.nf -profile aws

Install script:

scripts/install-nextflow.sh

Motif Analysis Tools

For transcription factor binding motif discovery and scanning.

Tool	Version	Type	Purpose
HOMER	4.11	CLI	De novo and known motif discovery, annotation (Heinz et al. 2010)
MEME Suite	5.5.5	CLI	MEME, DREME, STREME de novo discovery; FIMO scanning; AME enrichment (Bailey et al. 2015)
FIMO	5.5.5	CLI (part of MEME Suite)	Motif occurrence scanning across sequences
TFBSTools	R	R/Bioconductor	JASPAR motif handling, PFM/PWM conversion, motif scanning in R (Tan & Lenhard 2016)

HOMER Installation

# Download and configure HOMER
mkdir -p ~/software/homer
cd ~/software/homer
wget http://homer.ucsd.edu/homer/configureHomer.pl
perl configureHomer.pl -install homer
perl configureHomer.pl -install hg38   # Human genome
perl configureHomer.pl -install mm10   # Mouse genome

# Add to PATH
export PATH=$PATH:~/software/homer/bin

MEME Suite Installation

# Via conda (recommended)
conda install -c bioconda meme

# Or from source
wget https://meme-suite.org/meme/meme-software/5.5.5/meme-5.5.5.tar.gz
tar xzf meme-5.5.5.tar.gz
cd meme-5.5.5
./configure --prefix=$HOME/software/meme --enable-build-libxml2 --enable-build-libxslt
make && make install

Walkthrough: Setting Up a Complete ENCODE Analysis Environment

Goal: Install all bioinformatics tools needed to process ENCODE data, from raw FASTQ files through peak calling, annotation, and visualization, using Conda environments. Context: ENCODE analysis requires dozens of specialized tools. This skill automates installation with pre-configured Conda environments for each pipeline stage.

Step 1: Determine required tools by experiment type

encode_get_experiment(accession="ENCSR000AKA")

Expected output:

{
  "accession": "ENCSR000AKA",
  "assay_title": "Histone ChIP-seq",
  "target": "H3K27ac"
}

Interpretation: Histone ChIP-seq requires: BWA-MEM (alignment), SAMtools (BAM processing), MACS2 (peak calling), IDR (reproducibility), bedtools (interval operations), deepTools (signal visualization).

Step 2: Install the ChIP-seq Conda environment

# Using the pre-configured environment YAML
conda env create -f skills/bioinformatics-installer/scripts/chipseq-env.yml
conda activate encode-chipseq

The YAML includes:

name: encode-chipseq
channels: [bioconda, conda-forge, defaults]
dependencies:
  - bwa=0.7.17
  - samtools=1.17
  - macs2=2.2.9.1
  - idr=2.0.3
  - bedtools=2.31.0
  - deeptools=3.5.4
  - picard=3.1.1
  - fastqc=0.12.1
  - multiqc=1.17

Step 3: Install additional tools for downstream analysis

For peak annotation and motif analysis:

conda env create -f skills/bioinformatics-installer/scripts/annotation-env.yml
conda activate encode-annotation
# Includes: HOMER, GREAT, bedtools, R/Bioconductor (ChIPseeker, clusterProfiler)

Step 4: Verify installation

# Quick verification of key tools
bwa 2>&1 | head -3
samtools --version | head -1
macs2 --version
bedtools --version

Step 5: Download reference data for ENCODE analysis

encode_download_files(accessions=["ENCFF001ABC"], download_dir="/data/references")

Reference files needed:

GRCh38 genome FASTA
ENCODE blacklist v2 (Amemiya et al. 2019)
Gene annotation GTF (GENCODE v36)

Integration with downstream skills

Installed tools are used by → pipeline-chipseq through pipeline-cutandrun for processing
Reference data feeds into → download-encode for FASTQ retrieval
Environment setup enables → quality-assessment tool execution
Installed annotation tools support → peak-annotation and motif-analysis

Code Examples

1. Find experiments to identify required tools

encode_search_experiments(
  assay_title="ATAC-seq",
  organ="pancreas"
)

Expected output:

{
  "total": 8,
  "experiments": [
    {
      "accession": "ENCSR799GHJ",
      "assay_title": "ATAC-seq",
      "biosample_summary": "pancreatic islet tissue male adult (44 years)",
      "status": "released"
    }
  ]
}

Install decision: ATAC-seq requires the

atacseq-env.yml

conda environment (Bowtie2 + MACS2 + deeptools + samtools + bedtools).

2. Get file info to understand format requirements

encode_get_file_info(accession="ENCFF001ABC")

Expected output:

{
  "accession": "ENCFF001ABC",
  "file_format": "fastq",
  "file_size_mb": 4521.3,
  "read_length": 100,
  "paired_end": true,
  "platform": "Illumina NovaSeq 6000"
}

Install decision: Paired-end FASTQ needs Bowtie2 (not BWA for ATAC-seq), Picard for duplicate marking, and samtools for BAM processing.

Pitfalls & Edge Cases

Conda solver conflicts: Large conda environments with many packages can take hours to solve. Use mamba instead of conda for faster dependency resolution, or install in smaller focused environments.
R/Bioconductor version mismatch: R packages from CRAN and Bioconductor must match the R version. Installing Bioconductor 3.18 packages with R 4.4 will fail silently or produce errors. Use BiocManager::install() to ensure version compatibility.
Python 2 vs Python 3: Some legacy bioinformatics tools (MACS 1.x, old HOMER) require Python 2. Never install Python 2 tools in the same environment as Python 3 tools — use separate conda environments.
ARM Mac (M1/M2/M3) compatibility: Many bioinformatics tools lack native ARM builds. Use
```
CONDA_SUBDIR=osx-64
```
or Rosetta 2 emulation for x86_64 packages. Some tools (samtools, BWA) have ARM-native builds.
Nextflow requires Java 11+: Nextflow will not run on Java 8. Check
```
java -version
```
before running pipelines. Install with
```
curl -s https://get.nextflow.io | bash
```
for correct Java bundling.
Docker vs Singularity on HPC: Most HPC clusters do not allow Docker (requires root). Use Singularity instead. Nextflow supports both via
```
-profile singularity
```
or
```
-profile docker
```
.

Literature Foundation

#	Reference	Key Contribution
1	Li & Durbin 2009, Bioinformatics, DOI:10.1093/bioinformatics/btp324 (~30,000 cit)	BWA aligner
2	Langmead & Salzberg 2012, Nat Methods, DOI:10.1038/nmeth.1923 (~25,000 cit)	Bowtie2 aligner
3	Li et al. 2009, Bioinformatics, DOI:10.1093/bioinformatics/btp352 (~20,000 cit)	SAMtools/BAM format
4	Zhang et al. 2008, Genome Biol, DOI:10.1186/gb-2008-9-9-r137 (~7,000 cit)	MACS2 peak caller
5	Dobin et al. 2013, Bioinformatics, DOI:10.1093/bioinformatics/bts635 (~15,000 cit)	STAR RNA-seq aligner
6	Love et al. 2014, Genome Biol, DOI:10.1186/s13059-014-0550-8 (~30,000 cit)	DESeq2
7	Ramirez et al. 2016, Nucleic Acids Res, DOI:10.1093/nar/gkw257 (~3,000 cit)	deeptools
8	Wolf et al. 2018, Genome Biol, DOI:10.1186/s13059-017-1382-0 (~5,000 cit)	Scanpy
9	Hao et al. 2021, Cell, DOI:10.1016/j.cell.2021.04.048 (~8,000 cit)	Seurat v4
10	Quinlan & Hall 2010, Bioinformatics, DOI:10.1093/bioinformatics/btq033 (~10,000 cit)	bedtools
11	Ewels et al. 2016, Bioinformatics, DOI:10.1093/bioinformatics/btw354 (~3,000 cit)	MultiQC
12	Krueger & Andrews 2011, Bioinformatics, DOI:10.1093/bioinformatics/btr167 (~5,000 cit)	Bismark
13	Heinz et al. 2010, Molecular Cell, DOI:10.1016/j.molcel.2010.05.004 (~7,000 cit)	HOMER motif analysis
14	Bailey et al. 2015, Nucleic Acids Res, DOI:10.1093/nar/gkv416 (~3,000 cit)	MEME Suite
15	Meers et al. 2019, Epigenetics Chromatin, DOI:10.1186/s13072-019-0287-4 (~800 cit)	SEACR for CUT&RUN
16	Di Tommaso et al. 2017, Nat Biotechnol, DOI:10.1038/nbt.3820 (~2,500 cit)	Nextflow
17	Landt et al. 2012, Genome Res, DOI:10.1101/gr.136184.111 (~4,000 cit)	ENCODE ChIP-seq standards
18	ENCODE Consortium 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit)	ENCODE Phase 3
19	Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z (~1,372 cit)	ENCODE Blacklist v2

Integration

This skill produces...	Feed into...	Purpose
Conda environments	pipeline-chipseq through pipeline-cutandrun	Provide tool dependencies for all pipeline stages
Installed reference data	download-encode	Reference genomes and annotations for alignment
Tool version inventory	data-provenance	Record exact tool versions for reproducibility
QC tool installations	quality-assessment	Enable FastQC, MultiQC, and ENCODE QC metric tools
Annotation tool setup	peak-annotation	HOMER, ChIPseeker for peak-to-gene assignment
Motif scanning tools	jaspar-motifs	MEME Suite for motif scanning against JASPAR
Visualization tools	visualization-workflow	deepTools, IGV, R/ggplot2 for data visualization
Liftover utilities	liftover-coordinates	UCSC liftOver binary for assembly conversion

Related Skills

pipeline-guide: Parent skill for all pipeline execution; provides overview of available pipelines and tool selection guidance
pipeline-chipseq: Uses the ChIP-seq conda environment tools for FASTQ-to-peaks processing
pipeline-atacseq: Uses the ATAC-seq conda environment tools for accessibility analysis
pipeline-rnaseq: Uses the RNA-seq conda environment for expression quantification
pipeline-wgbs: Uses the WGBS conda environment for methylation analysis
pipeline-hic: Uses the Hi-C conda environment for contact matrix generation
pipeline-dnaseseq: Uses the DNase-seq conda environment for hotspot detection
pipeline-cutandrun: Uses the CUT&RUN conda environment for CUT&RUN/CUT&Tag processing
quality-assessment: Quality metrics require properly installed tools to compute
setup: Initial ENCODE Toolkit server setup (MCP connection, not bioinformatics tools)
motif-analysis: Requires HOMER and MEME Suite from this installer
visualization-workflow: Uses deeptools, pyGenomeTracks, and R visualization packages from this installer
single-cell-encode: Uses Seurat, Signac, Scanpy from this installer
publication-trust: Assess scientific integrity of publications before relying on their methods or findings

Presenting Results

Present installed tools as a checklist table: tool | version | status (installed/failed/skipped). Group by assay environment. Suggest: "Would you like to verify the installation by running a quick test on sample ENCODE data?"
If any installation fails, provide the exact error and a targeted fix. Common fixes: update conda, set channel priority, install system dependencies.

Encode-toolkit bioinformatics-installer

Bioinformatics Installer for ENCODE Data Analysis

When to Use

Overview

Quick Start

Per-Assay Environments

ChIP-seq Environment (encode-chipseq)

ATAC-seq Environment (encode-atacseq)

RNA-seq Environment (encode-rnaseq)

Hi-C Environment (encode-hic)

WGBS Environment (encode-wgbs)

DNase-seq Environment (encode-dnaseseq)

CUT&RUN / CUT&Tag Environment (encode-cutandrun)

R/Bioconductor Packages

Core Genomic Infrastructure

Differential Analysis

Annotation and Pathway Analysis

Single-Cell Analysis

Bulk-to-Single-Cell Deconvolution

DNA Methylation Analysis

Visualization

Statistics and Batch Correction

Python Packages

Core Single-Cell Stack

Genomics and Signal Processing

Hi-C Analysis

Single-Cell QC and Integration

Nextflow and Container Setup

Nextflow Installation

Docker (recommended for local/cloud)

Singularity (for HPC clusters)

Nextflow Configuration Profiles

Motif Analysis Tools

HOMER Installation

MEME Suite Installation

Walkthrough: Setting Up a Complete ENCODE Analysis Environment

Step 1: Determine required tools by experiment type

Step 2: Install the ChIP-seq Conda environment

Step 3: Install additional tools for downstream analysis

Step 4: Verify installation

Step 5: Download reference data for ENCODE analysis

Integration with downstream skills

Code Examples

1. Find experiments to identify required tools

2. Get file info to understand format requirements

Pitfalls & Edge Cases

Literature Foundation

Integration

Related Skills

Presenting Results

For the request: "$ARGUMENTS"

ChIP-seq Environment (
`encode-chipseq`
)

ATAC-seq Environment (
`encode-atacseq`
)

RNA-seq Environment (
`encode-rnaseq`
)

Hi-C Environment (
`encode-hic`
)

WGBS Environment (
`encode-wgbs`
)

DNase-seq Environment (
`encode-dnaseseq`
)

CUT&RUN / CUT&Tag Environment (
`encode-cutandrun`
)