Encode-toolkit pipeline-rnaseq

Execute ENCODE RNA-seq pipeline from FASTQ to gene quantification and signal tracks. Child of pipeline-guide. Provides Nextflow execution with Docker and cloud deployment. Use when processing RNA-seq data with STAR alignment, RSEM/Kallisto quantification, or generating expression matrices. Trigger on: RNA-seq pipeline, gene expression, STAR alignment, RSEM quantification, transcript quantification, TPM, FPKM, RNA processing, run RNA-seq.

install
source · Clone the upstream repo
git clone https://github.com/ammawla/encode-toolkit
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pipeline-rnaseq" ~/.claude/skills/ammawla-encode-toolkit-pipeline-rnaseq-270e34 && rm -rf "$T"
manifest: skills/pipeline-rnaseq/SKILL.md
source content

ENCODE RNA-seq Pipeline

When to Use

  • User wants to run an RNA-seq processing pipeline from FASTQ to gene quantification
  • User asks about "RNA-seq pipeline", "STAR alignment", "RSEM", "gene expression quantification", or "Kallisto"
  • User needs to process bulk RNA-seq data with ENCODE-standard 2-pass STAR alignment
  • Example queries: "process my RNA-seq FASTQs", "quantify gene expression from RNA-seq", "run STAR and RSEM on my data"

Execute the ENCODE RNA-seq processing pipeline from raw FASTQ files through splice-aware alignment, gene/transcript quantification, and strand-specific signal track generation. This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform analysis standards.

Overview

RNA-seq measures transcriptome-wide gene expression by sequencing cDNA derived from cellular RNA. The ENCODE pipeline processes RNA-seq data through quality control, splice-aware alignment with STAR (2-pass mode), gene and transcript quantification with RSEM, optional fast pseudoalignment with Kallisto, and generation of strand-specific signal tracks as bigWig files.

Key design decisions: STAR 2-pass mode for maximum splice junction sensitivity, RSEM for accurate gene/transcript/isoform quantification including multi-mapped reads, stranded library protocol (dUTP/rf-stranded) as the ENCODE standard, and paired-end sequencing with a minimum of 30 million uniquely mapped reads per replicate.

Key Literature

ReferenceJournalYearDOIRelevance
Dobin et al. "STAR: ultrafast universal RNA-seq aligner"Bioinformatics201310.1093/bioinformatics/bts635Splice-aware aligner (~12,000 citations)
Li & Dewey "RSEM: accurate transcript quantification from RNA-Seq data"BMC Bioinformatics201110.1186/1471-2105-12-323Gene/transcript quantification (~6,000 citations)
Bray et al. "Near-optimal probabilistic RNA-seq quantification"Nature Biotechnology201610.1038/nbt.3519Fast pseudoalignment (~4,000 citations)
Wang et al. "RSeQC: quality control of RNA-seq experiments"Bioinformatics201210.1093/bioinformatics/bts356RNA-seq QC suite (~3,500 citations)
ENCODE Project Consortium "Expanded encyclopaedias"Nature202010.1038/s41586-020-2493-4ENCODE Phase 3 standards
Frankish et al. "GENCODE 2021"Nucleic Acids Research202110.1093/nar/gkaa1087Gene annotation reference

Pipeline Stages

FASTQ ──> FastQC / Trim Galore ──> STAR (2-pass) ──> Genome BAM + Transcriptome BAM
  |                                                        |              |
  |                  ┌─────────────────────────────────────┘              |
  |                  v                                                    v
  |         Signal Track Generation                              RSEM Quantification
  |          (strand-specific bigWig)                        (gene + transcript + isoform)
  |                  |                                                    |
  |                  v                                                    v
  |          Plus strand bigWig                                genes.results (TPM/FPKM)
  |          Minus strand bigWig                               isoforms.results
  |                                                                       |
  |         ┌───────────────────────────────────────────────────────────┘
  |         v
  |   Kallisto (optional fast pseudoalignment)
  |         |
  |         v
  └──> RSeQC + MultiQC ──> Aggregated QC Report

Stage Summary

StageToolInputOutputReference
1. QC & TrimmingFastQC, Trim GaloreRaw FASTQTrimmed FASTQreferences/01-qc-trimming.md
2. AlignmentSTAR (2-pass)Trimmed FASTQGenome BAM + Transcriptome BAMreferences/02-star-alignment.md
3. QuantificationRSEM, KallistoTranscriptome BAM / FASTQGene/transcript counts, TPM, FPKMreferences/03-quantification.md
4. Signal TracksbedGraphToBigWigSTAR bedGraphStrand-specific bigWigreferences/04-signal-tracks.md
5. QC MetricsRSeQC, MultiQCBAM, countsStrandedness, coverage, saturationreferences/05-qc-metrics.md

Input Requirements

Required Files

  • RNA-seq FASTQ: Paired-end reads (ENCODE standard; single-end supported)
  • Reference genome: STAR-indexed genome with gene annotation (GRCh38 + GENCODE for human)
  • Gene annotation: GENCODE GTF (v38+ for human, vM27+ for mouse)

Sample Sheet Format

sample_id,read1,read2,replicate,strandedness
SAMPLE1_rep1,rna_R1.fq.gz,rna_R2.fq.gz,1,reverse
SAMPLE1_rep2,rna_R1.fq.gz,rna_R2.fq.gz,2,reverse

Strandedness: ENCODE uses dUTP-based stranded libraries. The resulting reads are

reverse
stranded (read 2 matches the sense strand). If unknown, the pipeline will auto-detect strandedness using RSeQC
infer_experiment.py
.

Library Strandedness

ProtocolStrandednessRSEM flagKallisto flagCommon Usage
dUTP (ENCODE standard)Reverse
--strandedness reverse
--rf-stranded
Most ENCODE RNA-seq
SMARTer / SMART-Seq2Unstranded
--strandedness none
(default)Single-cell, low-input
Illumina TruSeq StrandedReverse
--strandedness reverse
--rf-stranded
Standard bulk RNA-seq
Directional ligationForward
--strandedness forward
--fr-stranded
Some legacy protocols

QC Thresholds

MetricThresholdCategorySource
Total sequenced reads>=30M PE readsRead depthENCODE
Uniquely mapped reads>=70% of totalAlignmentENCODE
Multi-mapped reads<10%AlignmentENCODE
rRNA rate<10%Sample qualityENCODE
Strandedness agreement>90%Library prepRSeQC
Exonic rate>60%Mapping qualityRSeQC
Gene body coverageRelatively uniform (5'/3' bias <1.5)RNA integrityRSeQC
Duplication rate<60%Library complexityPicard
Detected genes (TPM>1)>12,000 (human)SensitivityENCODE
SaturationApproaching plateau at sequencing depthDepth sufficiencyRSeQC

Read Depth Guidelines

ApplicationMinimum Reads (PE)RecommendedNotes
Gene-level expression20M30MENCODE minimum
Transcript-level expression40M60MIsoform resolution requires more depth
Differential expression20M per sample30M per sample3+ biological replicates per condition
Novel junction discovery60M100M+STAR 2-pass mode benefits from depth
Fusion detection50M80M+Chimeric reads are rare

Execution

Quick Start (Local Docker)

nextflow run scripts/main.nf \
  -profile local \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir results/

SLURM HPC

nextflow run scripts/main.nf \
  -profile slurm \
  --reads 'fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir results/

Google Cloud

nextflow run scripts/main.nf \
  -profile gcp \
  --reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 'gs://bucket/results/'

AWS Batch

nextflow run scripts/main.nf \
  -profile aws \
  --reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
  --genome GRCh38 \
  --outdir 's3://bucket/results/'

Cloud Cost Estimates

PlatformInstanceCost/SampleTime/SampleNotes
GCPn1-highmem-8~$3-62-4 hoursSTAR index loading dominates; preemptible recommended
AWSr5.2xlarge~$3-62-4 hoursr-series for STAR memory; spot recommended
Local8 cores, 32GB$03-6 hoursDocker required; STAR needs 32GB+ RAM
SLURM8 cores, 32GBVaries2-4 hoursSingularity recommended

Memory note: STAR genome index loading requires ~32GB RAM for human. Machines with <32GB will fail at the alignment step. Use

--limitGenomeGenerateRAM
to cap index size for constrained environments, but this reduces mapping sensitivity.

Output Directory Structure

results/
  fastqc/                   # Raw and trimmed QC reports
  trimmed/                  # Trimmed FASTQ files
  star/
    genome_bam/             # Genome-aligned BAM files
    transcriptome_bam/      # Transcriptome BAM files (for RSEM)
    logs/                   # STAR alignment logs (Log.final.out)
    junctions/              # Splice junction files (SJ.out.tab)
  rsem/
    genes/                  # genes.results (gene_id, TPM, FPKM, expected_count)
    isoforms/               # isoforms.results (transcript_id, TPM, FPKM)
  kallisto/                 # Kallisto quantification output (optional)
    abundance.tsv           # transcript TPMs and counts
  signal/
    plus/                   # Plus-strand bigWig signal tracks
    minus/                  # Minus-strand bigWig signal tracks
  qc/
    rseqc/                  # RSeQC output (strandedness, coverage, distribution)
    multiqc/                # Aggregated QC report
  logs/                     # Nextflow execution logs

Common Pitfalls

1. Insufficient Memory for STAR

STAR loads the entire genome index into shared memory (~32GB for human). This is the most common failure mode. Always allocate at least 32GB RAM for the alignment step. On shared HPC systems, check memory limits per job.

2. Wrong Strandedness Setting

Using incorrect strandedness results in near-zero gene counts. If you see uniformly low counts, check strandedness with

infer_experiment.py
from RSeQC. ENCODE dUTP libraries are
reverse
stranded.

3. Using FPKM for Cross-Sample Comparison

FPKM values are not comparable across samples because they depend on total library composition. Use TPM (comparable across samples) or raw counts with DESeq2/edgeR normalization for differential expression.

4. Ignoring Multi-Mapped Reads

RSEM uses an expectation-maximization algorithm to probabilistically assign multi-mapped reads. This is critical for gene families and repetitive elements. Do not pre-filter multi-mappers before RSEM quantification.

5. Skipping rRNA Assessment

High rRNA contamination (>10%) indicates failed rRNA depletion. This reduces effective sequencing depth and inflates apparent duplication rates. Always check rRNA rate before downstream analysis.

6. Not Using 2-Pass Mode for Novel Junctions

STAR 1-pass mode only uses annotated splice junctions. 2-pass mode first discovers novel junctions then re-maps, critical for non-model organisms or samples with extensive alternative splicing.

Pipeline Scripts

FileDescriptionLines
scripts/main.nf
Nextflow DSL2 pipeline~130
scripts/nextflow.config
Execution profiles (local/slurm/gcp/aws)~60
scripts/Dockerfile
Docker build with STAR, RSEM, Kallisto, RSeQC~35

ENCODE Data Integration

After running on your own data, compare with ENCODE reference:

# Find matching ENCODE RNA-seq experiments
encode_search_experiments(
    assay_title="total RNA-seq",
    organ="pancreas",
    biosample_type="tissue"
)

# Download ENCODE gene quantifications for comparison
encode_batch_download(
    download_dir="/data/encode_reference/",
    output_type="gene quantifications",
    assay_title="total RNA-seq",
    organ="pancreas",
    assembly="GRCh38"
)

# Download ENCODE signal tracks for browser visualization
encode_search_files(
    file_format="bigWig",
    assay_title="total RNA-seq",
    organ="pancreas",
    output_type="signal of unique reads"
)

Pitfalls & Edge Cases

  • Strandedness must match library prep: STAR requires correct
    --outSAMstrandField
    and RSEM needs
    --strandedness
    . Using wrong strand settings can halve gene counts or assign reads to antisense genes.
  • rRNA contamination: rRNA >10% wastes sequencing depth. Ribosomal depletion libraries should have <5%. Poly-A selection libraries should have <1%. Check with FastQC or Picard CollectRnaSeqMetrics.
  • STAR 2-pass mode is required: The first pass discovers novel splice junctions; the second pass uses them. Single-pass STAR misses tissue-specific or rare splicing events, reducing sensitivity for differential exon usage.
  • Gene-level vs transcript-level quantification: RSEM provides transcript-level estimates but gene-level aggregation is more robust for differential expression. Transcript-level analysis requires many more replicates (≥6).
  • TPM normalization is not for cross-sample comparison: TPM normalizes within a sample but is NOT appropriate for comparing expression across conditions. Use DESeq2 size factors or TMM normalization for differential expression.
  • Batch effects in multi-lab data: RNA-seq is highly sensitive to library prep method, sequencer, and lab. Always check for batch effects with PCA before combining datasets from different sources.

Walkthrough: Processing ENCODE RNA-seq from FASTQ to Gene Quantification

Goal: Process raw RNA-seq FASTQ files through the ENCODE pipeline to generate gene expression quantifications (TPM/FPKM) and signal tracks. Context: The ENCODE RNA-seq pipeline uses STAR 2-pass alignment and RSEM quantification, producing both gene-level and transcript-level expression estimates.

Step 1: Find RNA-seq experiment

encode_get_experiment(accession="ENCSR000CPR")

Expected output:

{
  "accession": "ENCSR000CPR",
  "assay_title": "RNA-seq",
  "biosample_summary": "K562",
  "replicates": 2,
  "status": "released"
}

Step 2: List FASTQ files

encode_list_files(accession="ENCSR000CPR", file_format="fastq")

Expected output:

{
  "files": [
    {"accession": "ENCFF200RN1", "output_type": "reads", "paired_end": "1", "biological_replicates": [1], "file_size_mb": 3200},
    {"accession": "ENCFF201RN2", "output_type": "reads", "paired_end": "2", "biological_replicates": [1], "file_size_mb": 3300}
  ]
}

Step 3: Run the RNA-seq pipeline

nextflow run pipeline-rnaseq/main.nf \
  --fastq_r1 ENCFF200RN1.fastq.gz \
  --fastq_r2 ENCFF201RN2.fastq.gz \
  --genome GRCh38 \
  --annotation gencode.v36.annotation.gtf \
  --strandedness reverse \
  -profile docker

Key pipeline steps:

  1. Quality trimming (fastp)
  2. STAR 2-pass alignment (splice-aware)
  3. RSEM gene quantification (TPM, FPKM, expected counts)
  4. Kallisto transcript quantification (optional)
  5. Signal track generation (bigWig)
  6. QC metrics (RSeQC, MultiQC)

Step 4: Validate output quality

MetricThresholdPurpose
Mapping rate>= 70%Alignment success
rRNA rate< 10%rRNA depletion efficiency
Replicate correlation>= 0.9Biological consistency
Exonic rate> 60%Expected for mRNA-seq

Step 5: Use expression data with ENCODE epigenomic data

Compare gene expression with enhancer marks:

encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27ac", organism="Homo sapiens")

Interpretation: Genes with high TPM AND nearby H3K27ac peaks have validated enhancer-gene connections. Low expression despite nearby enhancer marks suggests poised or tissue-specific regulation.

Integration with downstream skills

  • Gene quantifications feed into -> peak-annotation for expression-validated peak targets
  • Expression data connects to -> gtex-expression for tissue comparison
  • Processed data feeds into -> compare-biosamples for differential expression analysis
  • Pipeline provenance logged by -> data-provenance

Code Examples

1. Find RNA-seq experiments for a tissue

encode_search_experiments(assay_title="total RNA-seq", organ="liver", organism="Homo sapiens")

Expected output:

{
  "total": 35,
  "results": [
    {"accession": "ENCSR300RNA", "assay_title": "RNA-seq", "biosample_summary": "liver", "status": "released"}
  ]
}

2. Check for existing gene quantifications

encode_list_files(accession="ENCSR300RNA", file_format="tsv", output_type="gene quantifications", assembly="GRCh38")

Expected output:

{
  "files": [
    {"accession": "ENCFF400GEQ", "output_type": "gene quantifications", "file_format": "tsv", "file_size_mb": 5.2}
  ]
}

3. Download expression data

encode_download_files(accessions=["ENCFF400GEQ"], download_dir="/data/rnaseq/quantification")

Expected output:

{
  "downloaded": 1,
  "md5_verified": true,
  "files": ["/data/rnaseq/quantification/ENCFF400GEQ.tsv"]
}

Integration

This skill produces...Feed into...Purpose
Gene expression (TPM/FPKM)peak-annotationValidate enhancer targets with expression data
Expression matrixgtex-expressionCompare cell-line vs. tissue expression
Differential expression resultscompare-biosamplesIdentify tissue-specific gene regulation
Signal tracks (bigWig)visualization-workflowDisplay expression signal in genome browser
Expression quantificationsdisease-researchConnect gene expression to disease phenotypes
Pipeline run parametersdata-provenanceRecord STAR/RSEM versions and settings
QC metricsquality-assessmentValidate against ENCODE RNA-seq standards

Related Skills

  • pipeline-guide (parent): General pipeline selection and resource assessment
  • quality-assessment: Deep-dive QC analysis beyond basic metrics
  • integrative-analysis: Combine RNA-seq with ChIP-seq/ATAC-seq for regulatory inference
  • compare-biosamples: Compare expression profiles across cell types
  • single-cell-encode: For scRNA-seq data processing (different pipeline)
  • pipeline-chipseq: Sibling pipeline for ChIP-seq data
  • pipeline-atacseq: Sibling pipeline for ATAC-seq data
  • publication-trust: Verify literature claims backing analytical decisions

Presenting Results

When reporting RNA-seq pipeline results:

  • Mapping rate: Report STAR uniquely mapped rate (>70% expected), multi-mapped rate (<10%), and unmapped rate from Log.final.out
  • Gene/transcript counts: Report number of detected genes (TPM>1, expect >12,000 for human) and detected transcripts
  • rRNA rate: Report rRNA contamination percentage (<10% expected). Flag if rRNA depletion appears to have failed
  • Quantification paths: Provide paths to RSEM genes.results (TPM, FPKM, expected_count) and Kallisto abundance.tsv files
  • Strandedness: Confirm the detected strandedness from RSeQC infer_experiment.py matches the expected library protocol
  • Key QC metrics: Present exonic rate (>60%), gene body coverage uniformity, duplication rate, and saturation curve status in a summary table
  • Signal tracks: Provide paths to strand-specific bigWig files (plus/ and minus/)
  • Next steps: Suggest
    integrative-analysis
    to combine RNA-seq with ChIP-seq/ATAC-seq for regulatory inference, or
    compare-biosamples
    for cross-tissue expression comparison

For the request: "$ARGUMENTS"