Encode-toolkit bioinformatics-installer

Install bioinformatics tools for ENCODE data analysis. Covers CLI tools (BWA, STAR, samtools, MACS2), R/Bioconductor packages (DESeq2, Seurat, ChIPseeker), Python packages (Scanpy, deeptools), and Nextflow pipeline infrastructure. Generates conda environments, R install scripts, and Python requirements. Use when the user needs to set up a bioinformatics workstation, install tools for a specific assay, create reproducible environments, or troubleshoot dependency issues. Trigger on: install tools, set up environment, conda create, bioinformatics setup, install R packages, install Bioconductor, install pipeline tools.

install
source · Clone the upstream repo
git clone https://github.com/ammawla/encode-toolkit
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bioinformatics-installer" ~/.claude/skills/ammawla-encode-toolkit-bioinformatics-installer-2150c3 && rm -rf "$T"
manifest: skills/bioinformatics-installer/SKILL.md
source content

Bioinformatics Installer for ENCODE Data Analysis

Install all bioinformatics tools needed for ENCODE data analysis, organized by assay type. This skill provides ready-to-use conda environment definitions, R/Bioconductor install scripts, Python package lists, and Nextflow pipeline infrastructure setup. Every environment is version-pinned for reproducibility and tested against ENCODE uniform processing standards.

When to Use

  • User wants to install bioinformatics tools needed for ENCODE data analysis
  • User asks about "install tools", "conda environment", "setup bioinformatics", or "install HOMER/MACS2/deeptools"
  • User needs pre-configured conda environments for specific assay pipelines (ChIP-seq, ATAC-seq, RNA-seq, etc.)
  • User wants to install R/Bioconductor packages (DESeq2, Seurat, ChIPseeker) or Python packages (Scanpy, pysam)
  • Example queries: "install tools for ChIP-seq analysis", "set up a conda environment for ATAC-seq", "install deeptools and bedtools"

Overview

ENCODE data analysis requires a broad ecosystem of tools spanning command-line aligners, peak callers, signal processors, statistical analysis frameworks in R, Python visualization and single-cell packages, and workflow engines. Setting up these tools correctly — with compatible versions, proper channel priorities, and no dependency conflicts — is a significant barrier for new users and a reproducibility concern for experienced analysts.

This skill solves that by providing:

  • 7 assay-specific conda environments with pinned tool versions matching ENCODE pipeline standards
  • R/Bioconductor install script covering 50+ packages across 8 categories
  • Python install script for single-cell, Hi-C, and genomics packages
  • Nextflow + container setup for pipeline execution on local, HPC, and cloud platforms

All environments use the same channel priority (conda-forge > bioconda > defaults) and are tested for cross-platform compatibility on Linux x86_64 and macOS (Intel + Apple Silicon where possible).

Quick Start

Install a complete environment for any assay type with a single command:

# ChIP-seq (histone or TF)
conda env create -f skills/bioinformatics-installer/environments/chipseq-env.yml

# ATAC-seq
conda env create -f skills/bioinformatics-installer/environments/atacseq-env.yml

# RNA-seq
conda env create -f skills/bioinformatics-installer/environments/rnaseq-env.yml

# Hi-C
conda env create -f skills/bioinformatics-installer/environments/hic-env.yml

# Whole-Genome Bisulfite Sequencing (WGBS)
conda env create -f skills/bioinformatics-installer/environments/wgbs-env.yml

# DNase-seq
conda env create -f skills/bioinformatics-installer/environments/dnaseseq-env.yml

# CUT&RUN / CUT&Tag
conda env create -f skills/bioinformatics-installer/environments/cutandrun-env.yml

Using mamba for faster solves (recommended):

mamba env create -f skills/bioinformatics-installer/environments/chipseq-env.yml

Install R and Python packages:

# All R/Bioconductor packages
Rscript skills/bioinformatics-installer/scripts/install-r-packages.R --all

# All Python packages
bash skills/bioinformatics-installer/scripts/install-python-packages.sh --all

# Nextflow + Docker
bash skills/bioinformatics-installer/scripts/install-nextflow.sh --docker

Per-Assay Environments

ChIP-seq Environment (
encode-chipseq
)

For histone modification and transcription factor ChIP-seq processing following ENCODE uniform pipeline standards (Landt et al. 2012, ENCODE Consortium 2020).

ToolVersionPurpose
BWA-MEM0.7.17Read alignment to reference genome (Li & Durbin 2009)
samtools1.19BAM manipulation, sorting, indexing, flagstat (Li et al. 2009)
MACS22.2.9.1Peak calling for narrow (TF) and broad (histone) marks (Zhang et al. 2008)
Picard3.1.1Duplicate marking and library complexity metrics (Broad Institute)
phantompeakqualtools1.2.2Strand cross-correlation (NSC/RSC) quality metrics (Kharchenko et al. 2008)
IDR2.0.3Irreproducible Discovery Rate for replicate consistency (Li et al. 2011)
deeptools3.5.5Signal normalization (bamCoverage), fingerprint, correlation (Ramirez et al. 2016)
bedtools2.31.0Interval operations, blacklist filtering (Quinlan & Hall 2010)
FastQC0.12.1Raw read quality assessment (Andrews 2010)
Trim Galore0.6.10Adapter and quality trimming via Cutadapt (Krueger 2012)
MultiQC1.21Aggregate QC report across all pipeline stages (Ewels et al. 2016)
bedGraphToBigWigConvert bedGraph signal to bigWig for genome browser viewing (Kent et al. 2010)

Memory: BWA index for GRCh38 requires ~5.5 GB RAM. Peak calling with MACS2 typically requires 4-8 GB. phantompeakqualtools loads full BAM into memory.

Environment file:

environments/chipseq-env.yml


ATAC-seq Environment (
encode-atacseq
)

For chromatin accessibility profiling via ATAC-seq following ENCODE standards (Buenrostro et al. 2013, Corces et al. 2017).

ToolVersionPurpose
Bowtie22.5.3Alignment (preferred over BWA for ATAC-seq short fragments) (Langmead & Salzberg 2012)
MACS22.2.9.1Peak calling with --nomodel --shift -100 --extsize 200 for ATAC (Zhang et al. 2008)
samtools1.19BAM manipulation, mitochondrial read filtering
Picard3.1.1Duplicate marking, insert size metrics
deeptools3.5.5alignmentSieve (Tn5 offset), bamCoverage (signal tracks), plotFingerprint
bedtools2.31.0Blacklist filtering, interval operations
FastQC0.12.1Raw read quality and adapter content assessment
Trim Galore0.6.10Adapter trimming (Nextera adapters for ATAC-seq)
MultiQC1.21Aggregate QC reporting

Key ATAC-seq parameters: Tn5 transposase introduces a +4/-5 bp offset that must be corrected. Fragment size distribution should show nucleosomal ladder (sub-nucleosomal, mono-, di-, tri-). TSS enrichment score should be >= 5 (GRCh38), >= 6 (hg19), or >= 10 (mm10) for high-quality data (ENCODE data standards).

Environment file:

environments/atacseq-env.yml


RNA-seq Environment (
encode-rnaseq
)

For gene expression quantification following ENCODE RNA-seq standards (Conesa et al. 2016, ENCODE Consortium 2020).

ToolVersionPurpose
STAR2.7.11bSplice-aware alignment with 2-pass mapping (Dobin et al. 2013)
RSEM1.3.3Gene/transcript quantification with expectation-maximization (Li & Dewey 2011)
Kallisto0.50.1Pseudoalignment-based transcript quantification (Bray et al. 2016)
Salmon1.10.3Quasi-mapping transcript quantification with GC bias correction (Patro et al. 2017)
featureCounts (subread)2.0.6Gene-level read counting for count-based DE methods (Liao et al. 2014)
samtools1.19BAM handling, flagstat, idxstats
FastQC0.12.1Read quality assessment
Trim Galore0.6.10Adapter and quality trimming
MultiQC1.21Aggregate QC report
RSeQC5.0.3RNA-seq-specific QC: gene body coverage, read distribution, inner distance (Wang et al. 2012)

Memory: STAR genome generation requires 32+ GB RAM for human genome. STAR alignment requires ~30 GB RAM. Kallisto and Salmon are memory-efficient alternatives (~4 GB).

Environment file:

environments/rnaseq-env.yml


Hi-C Environment (
encode-hic
)

For chromatin conformation capture processing following ENCODE Hi-C standards (Yardimci et al. 2019, Rao et al. 2014).

ToolVersionPurpose
BWA-MEM0.7.17Chimeric read alignment (each mate aligned independently)
pairtools1.0.3Parse, sort, deduplicate, filter contact pairs (Open2C)
cooler0.9.3Multi-resolution contact matrix storage and balancing (Abdennur & Mirny 2020)
Juicer2.20.00Contact matrix generation and HiCCUPS loop calling (Durand et al. 2016)
samtools1.19BAM handling for chimeric alignment parsing
bedtools2.31.0Restriction fragment and TAD boundary operations
FastQC0.12.1Read quality assessment
Trim Galore0.6.10Adapter trimming
MultiQC1.21Aggregate QC reporting

Key Hi-C parameters: Cis/trans ratio > 60%, long-range cis contacts (> 20 kb) > 40%. Resolution depends on sequencing depth: ~1 billion valid pairs for 5 kb resolution on human.

Note: Juicer requires Java 11+. Install via

conda install -c bioconda juicer_tools
or download the
.jar
directly from the Aiden Lab GitHub.

Environment file:

environments/hic-env.yml


WGBS Environment (
encode-wgbs
)

For whole-genome bisulfite sequencing (DNA methylation) following ENCODE standards (Foox et al. 2021, Schultz et al. 2015).

ToolVersionPurpose
Bismark0.24.2Bisulfite-aware alignment and methylation extraction (Krueger & Andrews 2011)
MethylDackel0.6.1Fast methylation extraction from bisulfite BAMs (Ryan 2023)
samtools1.19BAM manipulation, merge, index
bedtools2.31.0Interval operations for DMR analysis
FastQC0.12.1Read quality assessment (note: bisulfite libraries have biased base composition)
Trim Galore0.6.10Adapter trimming with --rrbs or default mode
MultiQC1.21Aggregate QC reporting with Bismark module
tabix1.19Index methylation BED files for random access
bgzip1.19Block-gzip compression for indexed access

Key WGBS parameters: Bisulfite conversion rate ≥ 98% (check unmethylated spike-in lambda DNA). CpG coverage >= 10x for reliable DMR calling. M-bias plots should be checked for end-repair artifacts.

Environment file:

environments/wgbs-env.yml


DNase-seq Environment (
encode-dnaseseq
)

For DNase I hypersensitive site mapping following ENCODE standards (Thurman et al. 2012, ENCODE Consortium 2020).

ToolVersionPurpose
BWA-MEM0.7.17Read alignment to reference genome
Hotspot22.3.1DNase-seq hotspot detection (John et al. 2011)
HINT-ATAC0.13.2TF footprinting from DNase-seq data (Li et al. 2019)
F-Seq22.0.3Feature density estimation for peak calling (Boyle et al. 2008, Zhao et al. 2020)
samtools1.19BAM handling and filtering
bedtools2.31.0Interval operations, blacklist filtering
FastQC0.12.1Read quality assessment
Trim Galore0.6.10Adapter trimming
MultiQC1.21Aggregate QC reporting

Environment file:

environments/dnaseseq-env.yml


CUT&RUN / CUT&Tag Environment (
encode-cutandrun
)

For antibody-targeted chromatin profiling via CUT&RUN (Skene & Henikoff 2017) and CUT&Tag (Kaya-Okur et al. 2019).

ToolVersionPurpose
Bowtie22.5.3Alignment (recommended for shorter CUT&RUN/Tag fragments)
SEACR1.3Sparse Enrichment Analysis for CUT&RUN (Meers et al. 2019)
MACS22.2.9.1Alternative peak calling with adjusted parameters
samtools1.19BAM handling, spike-in alignment filtering
Picard3.1.1Duplicate marking (low duplication expected for CUT&RUN/Tag)
deeptools3.5.5Signal tracks, heatmaps, spike-in normalization
bedtools2.31.0Interval operations, suspect list filtering
FastQC0.12.1Read quality assessment
Trim Galore0.6.10Adapter trimming
MultiQC1.21Aggregate QC reporting

Key CUT&RUN/Tag notes: These assays have inherently lower background than ChIP-seq. Do NOT apply ChIP-seq quality thresholds — use CUT&RUN-specific metrics (Nordin et al. 2023). Apply the CUT&RUN suspect list instead of the standard ENCODE blacklist. Spike-in normalization (E. coli DNA for CUT&RUN, carry-over for CUT&Tag) is strongly recommended for quantitative comparisons.

Environment file:

environments/cutandrun-env.yml

R/Bioconductor Packages

Install all R packages needed for ENCODE downstream analysis. The install script at

scripts/install-r-packages.R
handles BiocManager setup, version locking, and category-based installation.

Core Genomic Infrastructure

These packages provide the foundation for all genomic data manipulation in R:

PackagePurpose
GenomicRangesInterval arithmetic on genomic coordinates (Lawrence et al. 2013)
GenomicFeaturesGene model and transcript annotation handling
rtracklayerImport/export BED, bigWig, GFF, narrowPeak, broadPeak
IRangesInteger range operations (underlying GenomicRanges)
GenomeInfoDbChromosome naming conventions (UCSC vs Ensembl vs NCBI)
BiocGenericsCommon S4 generics across Bioconductor
S4VectorsS4 class infrastructure for Bioconductor objects
AnnotationDbiUnified interface to annotation databases
biomaRtEnsembl BioMart query interface for gene annotation (Durinck et al. 2009)

Differential Analysis

PackagePurpose
DESeq2Differential gene expression with shrinkage estimators (Love et al. 2014)
edgeRDifferential expression using empirical Bayes (Robinson et al. 2010)
limmaLinear models for microarray and RNA-seq data (Ritchie et al. 2015)
DiffBindDifferential binding analysis for ChIP-seq/ATAC-seq peaks (Stark & Brown 2011)
ChIPQCChIP-seq quality control in R (Carroll et al. 2014)
chromVARChromatin accessibility variation across single cells (Schep et al. 2017)

Annotation and Pathway Analysis

PackagePurpose
ChIPseekerPeak annotation and visualization (Yu et al. 2015)
annotatrAnnotate genomic regions with CpG islands, genes, enhancers (Cavalcante & Sartor 2017)
clusterProfilerGene ontology and KEGG pathway enrichment (Yu et al. 2012)
org.Hs.eg.dbHuman gene annotation database
org.Mm.eg.dbMouse gene annotation database
TxDb.Hsapiens.UCSC.hg38.knownGeneHuman transcript models (GRCh38)
TxDb.Mmusculus.UCSC.mm10.knownGeneMouse transcript models (mm10)

Single-Cell Analysis

PackagePurpose
SeuratComprehensive single-cell RNA-seq analysis (Hao et al. 2021)
SignacSingle-cell chromatin accessibility (ATAC-seq) analysis (Stuart et al. 2021)
SingleCellExperimentCore Bioconductor container for single-cell data
scaterSingle-cell QC, normalization, visualization (McCarthy et al. 2017)
scranSingle-cell normalization and feature selection (Lun et al. 2016)

Bulk-to-Single-Cell Deconvolution

PackagePurpose
BayesPrismBayesian deconvolution with scRNA-seq reference (Chu et al. 2022)
InstaPrismFast approximation of BayesPrism for large datasets (Wang et al. 2024)
MuSiC_deconvMulti-Subject Single Cell deconvolution (Wang et al. 2019)
DWLSDampened Weighted Least Squares deconvolution (Tsoucas et al. 2019)
BisqueRNAReference-based and marker-based deconvolution (Jew et al. 2020)

DNA Methylation Analysis

PackagePurpose
DMRcateDifferentially methylated region detection (Peters et al. 2021)
bsseqBisulfite sequencing data handling and smoothing (Hansen et al. 2012)
methylKitMethylation analysis from bisulfite sequencing (Akalin et al. 2012)

Visualization

PackagePurpose
ComplexHeatmapPublication-quality heatmaps with annotations (Gu et al. 2016)
EnhancedVolcanoVolcano plots for differential expression (Blighe et al. 2018)
GvizGenome browser-style track visualization (Hahne & Ivanek 2016)
ggplot2Grammar of graphics for all custom plots (Wickham 2016)

Statistics and Batch Correction

PackagePurpose
sva (ComBat)Surrogate variable analysis and batch correction (Leek et al. 2012)
WGCNAWeighted Gene Co-expression Network Analysis (Langfelder & Horvath 2008)
ReactomePAReactome pathway analysis (Yu & He 2016)

Install script:

scripts/install-r-packages.R

# Install all categories
Rscript scripts/install-r-packages.R --all

# Install only specific categories
Rscript scripts/install-r-packages.R --chipseq      # DiffBind, ChIPQC, ChIPseeker
Rscript scripts/install-r-packages.R --rnaseq       # DESeq2, edgeR, limma
Rscript scripts/install-r-packages.R --singlecell   # Seurat, Signac, scater, scran
Rscript scripts/install-r-packages.R --methylation   # DMRcate, bsseq, methylKit
Rscript scripts/install-r-packages.R --deconvolution # BayesPrism, InstaPrism, MuSiC_deconv, DWLS, BisqueRNA

Python Packages

Install Python packages for single-cell analysis, Hi-C processing, signal visualization, and genomic data manipulation.

Core Single-Cell Stack

PackagePurpose
scanpySingle-cell RNA-seq analysis framework (Wolf et al. 2018)
anndataAnnotated data matrix for single-cell (Virshup et al. 2021)
scvi-toolsDeep generative models for single-cell (Gayoso et al. 2022)
numpyNumerical computing
pandasData manipulation and tabular operations
scipyScientific computing (sparse matrices, statistics)
matplotlibPlotting foundation
seabornStatistical visualization

Genomics and Signal Processing

PackagePurpose
deeptoolsSignal tracks, heatmaps, correlation (also CLI; Ramirez et al. 2016)
pyBigWigRead/write bigWig signal files (Ryan 2023)
pysamPython interface to samtools/htslib (Li et al. 2009)
pybedtoolsPython interface to bedtools (Dale et al. 2011)

Hi-C Analysis

PackagePurpose
coolerMulti-resolution contact matrices (Abdennur & Mirny 2020)
cooltoolsAnalysis toolkit for cooler data: TADs, compartments, insulation
hic-strawRead .hic files from Juicer/Juicebox (Durand et al. 2016)
pyGenomeTracksGenome browser visualization including Hi-C tracks

Single-Cell QC and Integration

PackagePurpose
scrubletDoublet detection for scRNA-seq (Wolock et al. 2019)
CellBenderRemove ambient RNA contamination (Fleming et al. 2023)
harmony-pytorchBatch integration via Harmony in PyTorch (Korsunsky et al. 2019)
scanoramaPanoramic stitching of scRNA-seq datasets (Hie et al. 2019)
bbknnBatch-balanced KNN graph construction (Polanski et al. 2020)

Install script:

scripts/install-python-packages.sh

# Install all Python packages
bash scripts/install-python-packages.sh --all

# Install only specific categories
bash scripts/install-python-packages.sh --singlecell  # scanpy, scvi-tools, harmony
bash scripts/install-python-packages.sh --hic          # cooler, cooltools, hic-straw
bash scripts/install-python-packages.sh --deeptools    # deeptools, pyBigWig, pysam

Nextflow and Container Setup

ENCODE pipeline execution requires Nextflow DSL2 and a container runtime (Docker or Singularity).

Nextflow Installation

# Install Nextflow (requires Java 11+)
curl -s https://get.nextflow.io | bash
mv nextflow /usr/local/bin/

# Verify
nextflow -version

Docker (recommended for local/cloud)

# macOS
brew install --cask docker

# Linux (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

# Add current user to docker group (Linux)
sudo usermod -aG docker $USER

Singularity (for HPC clusters)

# Most HPC clusters have Singularity pre-installed
# Check with: module load singularity && singularity version

# If not available, install via conda:
conda install -c conda-forge singularity

Nextflow Configuration Profiles

The pipeline skills (pipeline-chipseq, pipeline-atacseq, etc.) include

nextflow.config
files with profiles for local, SLURM, GCP, and AWS execution. Select the appropriate profile:

# Local with Docker
nextflow run main.nf -profile local

# HPC with Singularity
nextflow run main.nf -profile slurm

# Google Cloud
nextflow run main.nf -profile gcp

# AWS Batch
nextflow run main.nf -profile aws

Install script:

scripts/install-nextflow.sh

Motif Analysis Tools

For transcription factor binding motif discovery and scanning.

ToolVersionTypePurpose
HOMER4.11CLIDe novo and known motif discovery, annotation (Heinz et al. 2010)
MEME Suite5.5.5CLIMEME, DREME, STREME de novo discovery; FIMO scanning; AME enrichment (Bailey et al. 2015)
FIMO5.5.5CLI (part of MEME Suite)Motif occurrence scanning across sequences
TFBSToolsRR/BioconductorJASPAR motif handling, PFM/PWM conversion, motif scanning in R (Tan & Lenhard 2016)

HOMER Installation

# Download and configure HOMER
mkdir -p ~/software/homer
cd ~/software/homer
wget http://homer.ucsd.edu/homer/configureHomer.pl
perl configureHomer.pl -install homer
perl configureHomer.pl -install hg38   # Human genome
perl configureHomer.pl -install mm10   # Mouse genome

# Add to PATH
export PATH=$PATH:~/software/homer/bin

MEME Suite Installation

# Via conda (recommended)
conda install -c bioconda meme

# Or from source
wget https://meme-suite.org/meme/meme-software/5.5.5/meme-5.5.5.tar.gz
tar xzf meme-5.5.5.tar.gz
cd meme-5.5.5
./configure --prefix=$HOME/software/meme --enable-build-libxml2 --enable-build-libxslt
make && make install

Walkthrough: Setting Up a Complete ENCODE Analysis Environment

Goal: Install all bioinformatics tools needed to process ENCODE data, from raw FASTQ files through peak calling, annotation, and visualization, using Conda environments. Context: ENCODE analysis requires dozens of specialized tools. This skill automates installation with pre-configured Conda environments for each pipeline stage.

Step 1: Determine required tools by experiment type

encode_get_experiment(accession="ENCSR000AKA")

Expected output:

{
  "accession": "ENCSR000AKA",
  "assay_title": "Histone ChIP-seq",
  "target": "H3K27ac"
}

Interpretation: Histone ChIP-seq requires: BWA-MEM (alignment), SAMtools (BAM processing), MACS2 (peak calling), IDR (reproducibility), bedtools (interval operations), deepTools (signal visualization).

Step 2: Install the ChIP-seq Conda environment

# Using the pre-configured environment YAML
conda env create -f skills/bioinformatics-installer/scripts/chipseq-env.yml
conda activate encode-chipseq

The YAML includes:

name: encode-chipseq
channels: [bioconda, conda-forge, defaults]
dependencies:
  - bwa=0.7.17
  - samtools=1.17
  - macs2=2.2.9.1
  - idr=2.0.3
  - bedtools=2.31.0
  - deeptools=3.5.4
  - picard=3.1.1
  - fastqc=0.12.1
  - multiqc=1.17

Step 3: Install additional tools for downstream analysis

For peak annotation and motif analysis:

conda env create -f skills/bioinformatics-installer/scripts/annotation-env.yml
conda activate encode-annotation
# Includes: HOMER, GREAT, bedtools, R/Bioconductor (ChIPseeker, clusterProfiler)

Step 4: Verify installation

# Quick verification of key tools
bwa 2>&1 | head -3
samtools --version | head -1
macs2 --version
bedtools --version

Step 5: Download reference data for ENCODE analysis

encode_download_files(accessions=["ENCFF001ABC"], download_dir="/data/references")

Reference files needed:

  • GRCh38 genome FASTA
  • ENCODE blacklist v2 (Amemiya et al. 2019)
  • Gene annotation GTF (GENCODE v36)

Integration with downstream skills

  • Installed tools are used by → pipeline-chipseq through pipeline-cutandrun for processing
  • Reference data feeds into → download-encode for FASTQ retrieval
  • Environment setup enables → quality-assessment tool execution
  • Installed annotation tools support → peak-annotation and motif-analysis

Code Examples

1. Find experiments to identify required tools

encode_search_experiments(
  assay_title="ATAC-seq",
  organ="pancreas"
)

Expected output:

{
  "total": 8,
  "experiments": [
    {
      "accession": "ENCSR799GHJ",
      "assay_title": "ATAC-seq",
      "biosample_summary": "pancreatic islet tissue male adult (44 years)",
      "status": "released"
    }
  ]
}

Install decision: ATAC-seq requires the

atacseq-env.yml
conda environment (Bowtie2 + MACS2 + deeptools + samtools + bedtools).

2. Get file info to understand format requirements

encode_get_file_info(accession="ENCFF001ABC")

Expected output:

{
  "accession": "ENCFF001ABC",
  "file_format": "fastq",
  "file_size_mb": 4521.3,
  "read_length": 100,
  "paired_end": true,
  "platform": "Illumina NovaSeq 6000"
}

Install decision: Paired-end FASTQ needs Bowtie2 (not BWA for ATAC-seq), Picard for duplicate marking, and samtools for BAM processing.

Pitfalls & Edge Cases

  • Conda solver conflicts: Large conda environments with many packages can take hours to solve. Use mamba instead of conda for faster dependency resolution, or install in smaller focused environments.
  • R/Bioconductor version mismatch: R packages from CRAN and Bioconductor must match the R version. Installing Bioconductor 3.18 packages with R 4.4 will fail silently or produce errors. Use BiocManager::install() to ensure version compatibility.
  • Python 2 vs Python 3: Some legacy bioinformatics tools (MACS 1.x, old HOMER) require Python 2. Never install Python 2 tools in the same environment as Python 3 tools — use separate conda environments.
  • ARM Mac (M1/M2/M3) compatibility: Many bioinformatics tools lack native ARM builds. Use
    CONDA_SUBDIR=osx-64
    or Rosetta 2 emulation for x86_64 packages. Some tools (samtools, BWA) have ARM-native builds.
  • Nextflow requires Java 11+: Nextflow will not run on Java 8. Check
    java -version
    before running pipelines. Install with
    curl -s https://get.nextflow.io | bash
    for correct Java bundling.
  • Docker vs Singularity on HPC: Most HPC clusters do not allow Docker (requires root). Use Singularity instead. Nextflow supports both via
    -profile singularity
    or
    -profile docker
    .

Literature Foundation

#ReferenceKey Contribution
1Li & Durbin 2009, Bioinformatics, DOI:10.1093/bioinformatics/btp324 (~30,000 cit)BWA aligner
2Langmead & Salzberg 2012, Nat Methods, DOI:10.1038/nmeth.1923 (~25,000 cit)Bowtie2 aligner
3Li et al. 2009, Bioinformatics, DOI:10.1093/bioinformatics/btp352 (~20,000 cit)SAMtools/BAM format
4Zhang et al. 2008, Genome Biol, DOI:10.1186/gb-2008-9-9-r137 (~7,000 cit)MACS2 peak caller
5Dobin et al. 2013, Bioinformatics, DOI:10.1093/bioinformatics/bts635 (~15,000 cit)STAR RNA-seq aligner
6Love et al. 2014, Genome Biol, DOI:10.1186/s13059-014-0550-8 (~30,000 cit)DESeq2
7Ramirez et al. 2016, Nucleic Acids Res, DOI:10.1093/nar/gkw257 (~3,000 cit)deeptools
8Wolf et al. 2018, Genome Biol, DOI:10.1186/s13059-017-1382-0 (~5,000 cit)Scanpy
9Hao et al. 2021, Cell, DOI:10.1016/j.cell.2021.04.048 (~8,000 cit)Seurat v4
10Quinlan & Hall 2010, Bioinformatics, DOI:10.1093/bioinformatics/btq033 (~10,000 cit)bedtools
11Ewels et al. 2016, Bioinformatics, DOI:10.1093/bioinformatics/btw354 (~3,000 cit)MultiQC
12Krueger & Andrews 2011, Bioinformatics, DOI:10.1093/bioinformatics/btr167 (~5,000 cit)Bismark
13Heinz et al. 2010, Molecular Cell, DOI:10.1016/j.molcel.2010.05.004 (~7,000 cit)HOMER motif analysis
14Bailey et al. 2015, Nucleic Acids Res, DOI:10.1093/nar/gkv416 (~3,000 cit)MEME Suite
15Meers et al. 2019, Epigenetics Chromatin, DOI:10.1186/s13072-019-0287-4 (~800 cit)SEACR for CUT&RUN
16Di Tommaso et al. 2017, Nat Biotechnol, DOI:10.1038/nbt.3820 (~2,500 cit)Nextflow
17Landt et al. 2012, Genome Res, DOI:10.1101/gr.136184.111 (~4,000 cit)ENCODE ChIP-seq standards
18ENCODE Consortium 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit)ENCODE Phase 3
19Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z (~1,372 cit)ENCODE Blacklist v2

Integration

This skill produces...Feed into...Purpose
Conda environmentspipeline-chipseq through pipeline-cutandrunProvide tool dependencies for all pipeline stages
Installed reference datadownload-encodeReference genomes and annotations for alignment
Tool version inventorydata-provenanceRecord exact tool versions for reproducibility
QC tool installationsquality-assessmentEnable FastQC, MultiQC, and ENCODE QC metric tools
Annotation tool setuppeak-annotationHOMER, ChIPseeker for peak-to-gene assignment
Motif scanning toolsjaspar-motifsMEME Suite for motif scanning against JASPAR
Visualization toolsvisualization-workflowdeepTools, IGV, R/ggplot2 for data visualization
Liftover utilitiesliftover-coordinatesUCSC liftOver binary for assembly conversion

Related Skills

  • pipeline-guide: Parent skill for all pipeline execution; provides overview of available pipelines and tool selection guidance
  • pipeline-chipseq: Uses the ChIP-seq conda environment tools for FASTQ-to-peaks processing
  • pipeline-atacseq: Uses the ATAC-seq conda environment tools for accessibility analysis
  • pipeline-rnaseq: Uses the RNA-seq conda environment for expression quantification
  • pipeline-wgbs: Uses the WGBS conda environment for methylation analysis
  • pipeline-hic: Uses the Hi-C conda environment for contact matrix generation
  • pipeline-dnaseseq: Uses the DNase-seq conda environment for hotspot detection
  • pipeline-cutandrun: Uses the CUT&RUN conda environment for CUT&RUN/CUT&Tag processing
  • quality-assessment: Quality metrics require properly installed tools to compute
  • setup: Initial ENCODE Toolkit server setup (MCP connection, not bioinformatics tools)
  • motif-analysis: Requires HOMER and MEME Suite from this installer
  • visualization-workflow: Uses deeptools, pyGenomeTracks, and R visualization packages from this installer
  • single-cell-encode: Uses Seurat, Signac, Scanpy from this installer
  • publication-trust: Assess scientific integrity of publications before relying on their methods or findings

Presenting Results

  • Present installed tools as a checklist table: tool | version | status (installed/failed/skipped). Group by assay environment. Suggest: "Would you like to verify the installation by running a quick test on sample ENCODE data?"
  • If any installation fails, provide the exact error and a targeted fix. Common fixes: update conda, set channel priority, install system dependencies.

For the request: "$ARGUMENTS"