Encode-toolkit pipeline-wgbs
Execute ENCODE Whole Genome Bisulfite Sequencing (WGBS) pipeline from FASTQ to methylation calls. Child of pipeline-guide. Provides Nextflow execution with Docker and cloud deployment. Use when processing WGBS/bisulfite-seq data, calling methylation levels, generating bedMethyl files. Trigger on: WGBS pipeline, bisulfite sequencing, methylation calling, DNA methylation pipeline, bismark, bwa-meth, bedMethyl.
git clone https://github.com/ammawla/encode-toolkit
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/pipeline-wgbs" ~/.claude/skills/ammawla-encode-toolkit-pipeline-wgbs-244fce && rm -rf "$T"
skills/pipeline-wgbs/SKILL.mdENCODE WGBS Pipeline: FASTQ to Methylation Calls
When to Use
- User wants to run a WGBS/bisulfite sequencing pipeline from FASTQ to methylation calls
- User asks about "WGBS pipeline", "bisulfite sequencing", "methylation calling", "Bismark", or "bedMethyl"
- User needs to process whole-genome bisulfite sequencing data following ENCODE standards
- Example queries: "process my WGBS FASTQs", "call methylation levels from bisulfite-seq", "run Bismark on my WGBS data"
Execute the ENCODE DNA methylation pipeline for Whole Genome Bisulfite Sequencing data, producing per-CpG methylation levels in bedMethyl format.
Pipeline Overview
FASTQ -> Trim adapters -> Bismark align -> Deduplicate -> MethylDackel extract -> bedMethyl | | | | | | QC Trim Galore Bismark/bwa-meth Picard Per-CpG calls Final output
ENCODE Repository
- GitHub:
ENCODE-DCC/dna-me-pipeline - Container:
encodedcc/dna-me-pipeline - WDL: Available for Cromwell execution
- This skill: Nextflow DSL2 reimplementation for portability
Core Tools and Versions
| Tool | Version | Purpose | Citation |
|---|---|---|---|
| Trim Galore | 0.6.10 | Adapter + quality trimming (bisulfite-aware) | Krueger (Babraham) |
| Bismark | 0.24.2 | Bisulfite-aware alignment + methylation | Krueger & Andrews 2011 |
| bwa-meth | 0.2.7 | Alternative bisulfite aligner (faster) | Pedersen 2014 |
| MethylDackel | 0.6.1 | Methylation extraction from BAM | Ryan (GitHub) |
| Picard | 3.1.1 | Duplicate marking | Broad Institute |
| samtools | 1.19 | BAM operations | Li et al. 2009 |
| FastQC | 0.12.1 | Read quality assessment | Andrews (Babraham) |
| MultiQC | 1.21 | Aggregated QC reporting | Ewels et al. 2016 |
Key Literature
-
Krueger & Andrews 2011 - "Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications" (Bioinformatics, ~4,000 citations) DOI: 10.1093/bioinformatics/btr167
-
Lister et al. 2009 - "Human DNA methylomes at base resolution show widespread epigenomic differences" (Nature, ~5,000 citations) DOI: 10.1038/nature08514
-
Schultz et al. 2015 - "Human body epigenome maps reveal noncanonical DNA methylation variation" (Nature, ~1,500 citations) DOI: 10.1038/nature14248
-
Pedersen et al. 2014 - "Fast and accurate alignment of long bisulfite-seq reads" arXiv:1401.1129 (bwa-meth)
-
Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z
Execution
Quick Start (Local)
nextflow run main.nf \ -profile local \ --reads '/data/fastq/*_R{1,2}.fastq.gz' \ --genome_dir '/ref/bismark_index' \ --outdir results/ \ -resume
SLURM HPC
nextflow run main.nf \ -profile slurm \ --reads '/data/fastq/*_R{1,2}.fastq.gz' \ --genome_dir '/ref/bismark_index' \ --outdir results/ \ -resume
Cloud (GCP / AWS)
nextflow run main.nf \ -profile gcp \ --reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \ --genome_dir 'gs://bucket/ref/bismark_index' \ --outdir 'gs://bucket/results/' \ -resume
Resource Requirements
| Step | CPUs | RAM | Time (30x human) |
|---|---|---|---|
| Trim Galore | 4 | 4 GB | 1-2 hours |
| Bismark align | 8 | 48 GB | 8-16 hours |
| Deduplication | 2 | 16 GB | 1-2 hours |
| MethylDackel | 4 | 8 GB | 1-2 hours |
| Total | 8 | 48 GB | 12-24 hours |
Pipeline Parameters
| Parameter | Default | Description |
|---|---|---|
| required | Glob pattern to paired FASTQ files |
| required | Path to Bismark genome index directory |
| | Output directory |
| | Aligner: or |
| | Minimum coverage for CpG reporting |
| | Remove overlapping PE reads (avoid double-counting) |
| | Lambda genome index for conversion rate QC |
| | Skip deduplication (for RRBS data) |
Output Files
results/ fastqc/ # Raw read quality trim_galore/ # Trimmed reads + reports bismark/ alignments/ # Sorted, deduplicated BAMs dedup_reports/ # Duplication metrics methylation/ # bedMethyl files (primary output) {sample}.CpG.bedMethyl.gz {sample}.CHG.bedMethyl.gz # Non-CpG contexts {sample}.CHH.bedMethyl.gz conversion_rate/ # Lambda/pUC19 conversion QC coverage/ {sample}.coverage_stats.txt multiqc/ multiqc_report.html
bedMethyl Format
The primary output is per-CpG methylation in bedMethyl format:
chr1 10468 10470 . 1000 + 10468 10470 0,0,0 12 83.3
Columns: chr, start, end, name, score, strand, thickStart, thickEnd, color, coverage, methylation_percentage
QC Thresholds (ENCODE Standards)
| Metric | Pass | Warning | Fail |
|---|---|---|---|
| Bisulfite conversion rate | ≥98% | 95-98% | <95% |
| CpG coverage (genome-wide) | >10x | 5-10x | <5x |
| Mapping rate | >70% | 50-70% | <50% |
| Duplication rate | <30% | 30-50% | >50% |
| CpG sites covered (>=5x) | >80% | 60-80% | <60% |
| Lambda spike-in conversion | ≥98% | 95-98% | <95% |
Critical Pitfalls
RRBS vs WGBS
RRBS (Reduced Representation) uses MspI digestion and covers ~10% of CpGs. WGBS covers the full genome. These are DIFFERENT protocols:
- RRBS: Skip deduplication (
), different trimming--skip_dedup true - WGBS: Full dedup required, standard Trim Galore settings
- Never mix RRBS and WGBS data in the same analysis
Strand-Specific vs Merged CpG
Bismark reports methylation per strand by default. For most analyses, merge complementary CpG strands:
- Forward C at position N and reverse G at position N+1 are the same CpG
- MethylDackel
handles this automatically--mergeContext - Always specify
unless you need strand-specific data--mergeContext
Incomplete Bisulfite Conversion
Conversion artifacts produce false methylation calls:
- Always include lambda phage or pUC19 spike-in DNA
- Unmethylated spike-in should show ≥98% conversion
- If conversion <98%, the library has systematic artifacts -- do NOT proceed
M-bias Plots
MethylDackel generates M-bias plots showing methylation level by read position:
- End-repair artifacts cause elevated methylation at read ends
- Use
and--OT
flags to trim affected positions--OB - ENCODE typically trims 10 bp from 5' of read 2
Low Coverage Regions
Regions with <5x coverage have unreliable methylation estimates:
- Filter bedMethyl to
(default)--min_coverage 5 - For differential methylation analysis, consider
--min_coverage 10 - Report the fraction of CpGs meeting coverage threshold
Provenance Integration
After pipeline completion, log all outputs:
# Log derived bedMethyl files encode_log_derived_file( file_path="/results/bismark/methylation/sample1.CpG.bedMethyl.gz", source_accessions=["ENCSR...", "ENCFF..."], description="CpG methylation calls from ENCODE WGBS pipeline", file_type="bedMethyl", tool_used="Bismark 0.24.2 + MethylDackel 0.6.1", parameters="bismark --genome /ref -1 R1.fq.gz -2 R2.fq.gz; MethylDackel extract --mergeContext --minDepth 5" )
Reference Files
Detailed step-by-step documentation is provided in the
references/ directory:
-- Bisulfite-specific adapter trimming with Trim Galore01-qc-trimming.md
-- Bismark alignment and bwa-meth alternative02-bismark-alignment.md
-- Deduplication and BAM filtering03-dedup-filtering.md
-- MethylDackel extraction and bedMethyl generation04-methylation-calling.md
-- Conversion rate QC, coverage stats, M-bias05-qc-metrics.md
Walkthrough: Processing ENCODE WGBS from FASTQ to Methylation Calls
Goal: Process whole-genome bisulfite sequencing FASTQ files through the ENCODE pipeline to generate per-CpG methylation calls for epigenomic analysis. Context: WGBS requires bisulfite-aware alignment (Bismark) and per-CpG methylation extraction (MethylDackel), with ≥98% bisulfite conversion required.
Step 1: Find WGBS experiment
encode_get_experiment(accession="ENCSR765JPC")
Expected output:
{ "accession": "ENCSR765JPC", "assay_title": "WGBS", "biosample_summary": "liver", "replicates": 2, "status": "released" }
Step 2: List FASTQ files
encode_list_files(accession="ENCSR765JPC", file_format="fastq")
Expected output:
{ "files": [ {"accession": "ENCFF300BS1", "output_type": "reads", "paired_end": "1", "file_size_mb": 45000}, {"accession": "ENCFF301BS2", "output_type": "reads", "paired_end": "2", "file_size_mb": 46000} ] }
Interpretation: WGBS files are very large (~45GB per read file). Ensure adequate storage (>500GB for processing).
Step 3: Run the WGBS pipeline
nextflow run pipeline-wgbs/main.nf \ --fastq_r1 ENCFF300BS1.fastq.gz \ --fastq_r2 ENCFF301BS2.fastq.gz \ --genome GRCh38 \ --bismark_index /ref/bismark_index/ \ -profile docker
Key pipeline steps:
- Adapter/quality trimming (Trim Galore)
- Bisulfite-aware alignment (Bismark)
- Deduplication (Bismark deduplicate)
- Methylation extraction (MethylDackel)
- CpG coverage and beta-value calculation
- Bisulfite conversion rate check (≥98% required)
Step 4: Validate output quality
| Metric | Threshold | Purpose |
|---|---|---|
| Bisulfite conversion | >= 98% | Library quality |
| CpG coverage | >= 10x for DMR calling | Statistical power |
| Mapping rate | > 40% (BS-aware) | Alignment success |
| Duplication rate | < 30% | Library complexity |
Step 5: Identify differentially methylated regions
Feed per-CpG methylation into -> methylation-aggregation for cross-tissue comparison and HMR/UMR/PMD identification.
Integration with downstream skills
- Per-CpG methylation files feed into -> methylation-aggregation for cross-tissue atlas
- DMRs feed into -> peak-annotation for nearest gene assignment
- Methylation at regulatory elements connects to -> regulatory-elements
- CpG variant methylation integrates with -> variant-annotation
- Pipeline provenance logged by -> data-provenance
Code Examples
1. Find WGBS data for methylation analysis
encode_search_experiments( assay_title="WGBS", organ="brain" )
Expected output:
{ "total": 6, "experiments": [ { "accession": "ENCSR321BRN", "assay_title": "WGBS", "biosample_summary": "brain tissue female adult (53 years)", "status": "released" } ] }
2. Download processed methylation files
encode_search_files( assay_title="WGBS", organ="brain", file_format="bed", output_type="methylation state at CpG", assembly="GRCh38" )
Expected output:
{ "total": 3, "files": [ { "accession": "ENCFF567MET", "output_type": "methylation state at CpG", "assembly": "GRCh38", "file_size_mb": 845.2 } ] }
Integration
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Per-CpG methylation (bedMethyl) | methylation-aggregation | Cross-tissue methylation atlas |
| Differentially methylated regions | peak-annotation | Assign DMRs to nearest genes |
| Methylation at regulatory sites | regulatory-elements | Correlate methylation with cCRE activity |
| CpG methylation near variants | variant-annotation | Annotate variants affecting CpG methylation |
| Signal tracks (bigWig) | visualization-workflow | Display methylation signal in genome browser |
| QC metrics | quality-assessment | Validate conversion rate and coverage |
| Pipeline parameters | data-provenance | Record Bismark/MethylDackel versions |
| Methylation at promoters | gtex-expression | Correlate promoter methylation with gene expression |
Related Skills
-- Parent skill with compute resource assessment and cloud setuppipeline-guide
-- Aggregate methylation data across samples/tissuesmethylation-aggregation
-- Evaluate pipeline output quality metricsquality-assessment
-- Track all pipeline inputs, outputs, and parametersdata-provenance
-- Download ENCODE WGBS FASTQ files for pipeline inputdownload-encode
-- Verify literature claims backing analytical decisionspublication-trust
Presenting Results
When reporting WGBS pipeline results:
- Bisulfite conversion rate: Report the lambda/pUC19 spike-in conversion rate prominently (≥98% pass, 95-98% warning, <95% fail). This is the most critical QC metric for WGBS
- CpG coverage depth: Report genome-wide mean CpG coverage and fraction of CpGs meeting the minimum depth threshold (>=5x for reporting, >=10x for DMR analysis)
- Global methylation level: Report the genome-wide average CpG methylation percentage and note any non-CpG (CHG/CHH) methylation if relevant to the tissue
- bedMethyl output paths: Provide paths to the primary CpG bedMethyl files and any CHG/CHH context files generated
- M-bias assessment: Note whether M-bias trimming was applied and which read positions were excluded
- Key QC metrics: Present mapping rate (>70%), duplication rate (<30%), and coverage statistics in a summary table
- Output summary: Report total CpG sites called, sites passing coverage filter, and estimated data completeness
- Next steps: Suggest
for cross-sample averaging and HMR/UMR/PMD identificationmethylation-aggregation