Encode-toolkit data-provenance

Track exact provenance for every operation on ENCODE data — tool versions, reference files, scripts, parameters, and timestamps — to enable publication-ready methods writing. Use when the user processes ENCODE files, runs any bioinformatics tool, creates filtered/merged datasets, runs pipelines, performs liftover, uses R/Python/Bash for analysis, or needs to document their analysis chain for reproducibility and publication. Also use when the user says "write me methods" to auto-generate methods sections from the provenance log. This skill implements comprehensive provenance documentation: every tool, every version, every reference file, every parameter, every accession — no shortcuts. Use this skill for ANY processing step, ANY file transformation, ANY analysis operation on ENCODE data.

install
source · Clone the upstream repo
git clone https://github.com/ammawla/encode-toolkit
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugin/skills/data-provenance" ~/.claude/skills/ammawla-encode-toolkit-data-provenance && rm -rf "$T"
manifest: plugin/skills/data-provenance/SKILL.md
source content

Exact Provenance Tracking and Methods Writing

When to Use

  • User wants to track the full analysis chain from ENCODE download through processing to publication figure
  • User asks about "provenance", "reproducibility", "methods section", or "analysis log"
  • User needs to log derived files with their processing parameters for audit trail
  • User wants to auto-generate publication-ready methods text from logged analysis steps
  • Example queries: "log my peak calling step", "generate a methods section from my analysis", "show the provenance chain for this figure"

Track every operation on ENCODE data with exact tool versions, reference files, scripts, parameters, and timestamps to enable publication-ready methods sections.

Scientific Rationale

The question: "What exactly was done to this data, and can someone else reproduce it identically?"

Reproducibility is the foundation of science. Yet the "Methods" sections of most genomics papers are vague — "reads were aligned with STAR" tells you nothing about which STAR version, which genome index, which parameters, or which annotation version was used. The difference between GENCODE v38 and v39 gene annotations can change thousands of gene assignments.

Comprehensive Provenance Standard

This skill implements a documentation standard where every operation records:

  1. Tool: Exact name and version (e.g., bedtools v2.31.0, not just "bedtools")
  2. Reference files: Exact source, version, and download URL (e.g., "GRCh38.p14 chromosome sizes from UCSC, downloaded 2024-01-15")
  3. Parameters: Complete command or function call, not just key parameters
  4. Input: Exact file paths, accessions, and checksums
  5. Output: File path, description, and checksum
  6. Script: If a custom script was used, store the script alongside the output
  7. Timestamp: When the operation was performed
  8. Environment: R version, Python version, package versions, OS

This creates a complete audit trail such that a methods section can be auto-generated with zero ambiguity.

Why This Level of Detail Matters

Consider a simple liftover operation. A vague log says "coordinates were lifted from hg19 to hg38." A comprehensive provenance log says:

"Genomic coordinates were lifted from GRCh37/hg19 to GRCh38/hg38 using UCSC liftOver (v377, Kent et al. 2002, PMID: 12045153). The chain file hg19ToHg38.over.chain.gz was obtained from UCSC Genome Browser (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/, accessed 2024-01-15, MD5: abc123...). Of 45,231 input regions, 44,892 (99.25%) were successfully converted; 339 regions (0.75%) failed to map and were excluded. Unmapped regions were logged to unmapped.bed."

The second version can be reproduced exactly. The first cannot.

Step 1: Initialize Experiment Log

At the start of any analysis session, create an experiment log:

Log Structure

project_dir/
├── experiment_log.json          # Machine-readable provenance log
├── scripts/                     # All scripts used in this analysis
│   ├── 001_download.sh
│   ├── 002_filter_peaks.sh
│   └── 003_merge_samples.R
├── reference_files/             # Reference files used (or symlinks)
│   ├── GRCh38.chrom.sizes
│   └── gencode.v44.annotation.gtf
├── data/                        # ENCODE downloads
│   └── (organized by experiment)
├── derived/                     # All derived files
│   ├── filtered_peaks/
│   └── merged_results/
└── methods/                     # Auto-generated methods text
    └── methods_draft.md

Experiment Log Format (experiment_log.json)

{
  "project": "H3K27ac analysis in human pancreas",
  "created": "2024-01-15T10:30:00Z",
  "analyst": "Dr. A. Mawla",
  "organism": "Homo sapiens",
  "assembly": "GRCh38",
  "gene_annotation": "GENCODE v44",
  "operations": [],
  "encode_experiments": [],
  "software_environment": {},
  "reference_files": []
}
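The log above can be created with a short helper; this is a sketch, not part of the toolkit itself (the function name and defaults are illustrative), but the field names mirror the JSON schema exactly:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def init_experiment_log(project_dir, project, analyst, organism="Homo sapiens",
                        assembly="GRCh38", gene_annotation="GENCODE v44"):
    """Create experiment_log.json with the schema shown above, if absent."""
    log_path = Path(project_dir) / "experiment_log.json"
    if log_path.exists():
        return json.loads(log_path.read_text())
    log = {
        "project": project,
        "created": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "analyst": analyst,
        "organism": organism,
        "assembly": assembly,
        "gene_annotation": gene_annotation,
        "operations": [],
        "encode_experiments": [],
        "software_environment": {},
        "reference_files": [],
    }
    log_path.write_text(json.dumps(log, indent=2))
    return log
```

Calling it a second time returns the existing log rather than overwriting it, so an analysis session can be resumed safely.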

Track Source ENCODE Experiments

encode_track_experiment(accession="ENCSR...", notes="Experiment log entry")

For each experiment, record in the log:

| Field | Example | Source |
|---|---|---|
| Accession | ENCSR133RZO | ENCODE portal |
| Assay | Histone ChIP-seq | encode_get_experiment |
| Target | H3K27ac | encode_get_experiment |
| Biosample | pancreas tissue | encode_get_experiment |
| Lab | Bing Ren, UCSD | encode_get_experiment |
| Replicates | 2 biological | encode_get_experiment |
| Sequencer | Illumina HiSeq 4000 | encode_get_experiment |
| Read length | 76bp PE | encode_get_experiment |
| Read count | 42.3M per rep | File metadata |
| Library | TruSeq ChIP | encode_get_experiment |
| Batch/date | 2019-06-15 | encode_get_experiment |

Step 2: Log Every Operation

Operation Log Entry Format

Every processing step creates a log entry with these fields:

{
  "operation_id": "op_003",
  "timestamp": "2024-01-15T14:22:00Z",
  "description": "Filter H3K27ac peaks by signalValue",
  "category": "filtering",
  "tool": {
    "name": "bedtools",
    "version": "2.31.0",
    "citation": "Quinlan & Hall 2010, Bioinformatics, DOI:10.1093/bioinformatics/btq033"
  },
  "command": "awk '$7 >= 4.5' ENCFF123ABC.bed | bedtools intersect -a stdin -b blacklist.bed -v > filtered_peaks.bed",
  "script_path": "scripts/002_filter_peaks.sh",
  "inputs": [
    {
      "file": "ENCFF123ABC.bed",
      "accession": "ENCFF123ABC",
      "type": "IDR thresholded peaks",
      "md5": "abc123..."
    }
  ],
  "reference_files": [
    {
      "file": "hg38-blacklist.v2.bed.gz",
      "source": "ENCODE Blacklist v2 (Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z)",
      "url": "https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz",
      "md5": "def456..."
    }
  ],
  "parameters": {
    "signalValue_threshold": 4.5,
    "blacklist_filter": "exclude overlapping regions"
  },
  "outputs": [
    {
      "file": "derived/filtered_peaks/H3K27ac_pancreas_filtered.bed",
      "description": "H3K27ac peaks in pancreas, signalValue >= 4.5, blacklist-filtered",
      "regions_count": 34521,
      "md5": "ghi789..."
    }
  ],
  "statistics": {
    "input_regions": 45231,
    "output_regions": 34521,
    "filtered_out": 10710,
    "filter_rate": "23.7%"
  }
}
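Appending an entry in this format can be automated; the following is a minimal sketch (the helper name is illustrative) that assigns sequential operation IDs and timestamps automatically:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_operation(log_path, description, category, tool, command, **fields):
    """Append an operation entry (schema as above) to experiment_log.json,
    assigning the next sequential operation_id and a UTC timestamp."""
    path = Path(log_path)
    log = json.loads(path.read_text())
    op = {
        "operation_id": f"op_{len(log['operations']) + 1:03d}",
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "description": description,
        "category": category,
        "tool": tool,
        "command": command,
        **fields,  # inputs, outputs, reference_files, parameters, statistics, ...
    }
    log["operations"].append(op)
    path.write_text(json.dumps(log, indent=2))
    return op["operation_id"]
```

Passing the remaining schema fields as keyword arguments keeps the helper small while preserving the full entry format.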

Common Operations to Log

Downloading ENCODE Files

encode_download_files(
    file_accessions=["ENCFF..."],
    download_dir="/path/to/data/",
    organize_by="experiment",
    verify_md5=True
)

Log: file accession, download URL, MD5 verification result, file size, download timestamp.
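Checksums for these log entries can be computed locally before comparing against the MD5 published on the ENCODE portal; a small stdlib helper suffices (function name illustrative):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, streamed in 1 MiB chunks so large BAM/FASTQ
    files never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned hex string goes directly into the `md5` fields of the input/output entries shown above.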

Genome Coordinate Liftover

Log: liftOver version, chain file (source URL, date accessed), input count, output count, unmapped count, unmapped file location.
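The mapped/unmapped statistics can be derived from simple line counts of the output and unmapped BED files; a sketch of the bookkeeping (helper name illustrative, counts taken from the liftover example earlier in this document):

```python
def liftover_stats(n_input, n_mapped):
    """Summarize a liftOver run for the provenance log: counts plus
    percentages, formatted the way the methods text reports them."""
    n_unmapped = n_input - n_mapped
    pct = lambda n: f"{100 * n / n_input:.2f}%"
    return {
        "input_regions": n_input,
        "mapped_regions": n_mapped,
        "unmapped_regions": n_unmapped,
        "mapped_pct": pct(n_mapped),
        "unmapped_pct": pct(n_unmapped),
    }
```

With the example counts above, this reproduces "44,892 (99.25%) were successfully converted; 339 regions (0.75%) failed to map".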

Peak Filtering

Log: filter criteria (signalValue threshold, p-value cutoff), blacklist used and version, input/output region counts, what was removed.

Merging/Union Operations

Log: merge tool + version, merge distance parameter, input files (all accessions), sample tagging method, output count, overlap statistics.

R/Bioconductor Analysis

{
  "tool": {
    "name": "DESeq2",
    "version": "1.42.0",
    "r_version": "4.3.2",
    "bioconductor_version": "3.18",
    "citation": "Love et al. 2014, Genome Biology, DOI:10.1186/s13059-014-0550-8"
  },
  "command": "DESeq2::results(dds, contrast=c('condition','treated','control'), alpha=0.05)",
  "parameters": {
    "design_formula": "~ batch + condition",
    "contrast": ["condition", "treated", "control"],
    "alpha": 0.05,
    "lfcThreshold": 0
  }
}

Python Analysis

{
  "tool": {
    "name": "scanpy",
    "version": "1.9.6",
    "python_version": "3.11.5",
    "anndata_version": "0.10.3",
    "citation": "Wolf et al. 2018, Genome Biology, DOI:10.1186/s13059-017-1382-0"
  }
}

Step 3: Record Software Environment

At the start of each analysis, capture the full environment:

R Environment

sessionInfo()
# Or more detailed:
devtools::session_info()

Log: R version, platform, attached packages with versions, loaded namespaces.

Python Environment

# pkg_resources is deprecated; importlib.metadata is the stdlib equivalent (Python 3.8+)
from importlib import metadata
{dist.metadata["Name"]: dist.version for dist in metadata.distributions()}

Log: Python version, all installed packages with versions, virtual environment path.
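A fuller snapshot, suitable for the `software_environment` block of the log, can be captured with the standard library alone (function name illustrative):

```python
import platform
import sys
from importlib import metadata

def capture_python_environment():
    """Snapshot interpreter, platform, and installed package versions
    for the software_environment section of experiment_log.json."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": {
            dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()
        },
    }
```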

Command-Line Tools

For each tool used, record the version:

bedtools --version        # bedtools v2.31.0
samtools --version        # samtools 1.19
STAR --version            # 2.7.11a
macs2 --version           # macs2 2.2.9.1
liftOver                  # Kent tools (note: no --version flag; record binary date)
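Where versions must be captured programmatically rather than copied by hand, a small wrapper can record whatever the tool prints for its version flag; this is a sketch (assumes the tool is on PATH and that a missing version flag is handled by falling back to stderr):

```python
import shutil
import subprocess

def tool_version(cmd, flag="--version"):
    """First line a tool prints for its version flag, or None if the
    tool is not installed. Some tools print versions to stderr."""
    if shutil.which(cmd) is None:
        return None
    out = subprocess.run([cmd, flag], capture_output=True, text=True)
    text = (out.stdout or out.stderr).strip()
    return text.splitlines()[0] if text else None
```

For tools like liftOver with no version flag, record the binary's modification date and download source instead, as noted above.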

System Information

uname -a                  # OS and kernel
nproc                     # CPU cores
free -h                   # Memory (Linux)
sysctl -n hw.memsize      # Memory (macOS)
nvidia-smi                # GPU info (if applicable)

Step 4: Store Scripts

Every custom script used in the analysis should be stored in the scripts/ directory with sequential numbering:

Naming Convention

scripts/
├── 001_download_encode_data.sh
├── 002_filter_peaks.sh
├── 003_merge_samples.R
├── 004_chromhmm_segmentation.sh
├── 005_differential_analysis.R
└── 006_visualization.py

Script Header Template

Every stored script should include a header:

#!/bin/bash
# Script: 002_filter_peaks.sh
# Project: H3K27ac analysis in human pancreas
# Date: 2024-01-15
# Author: Generated by ENCODE Connector
# Description: Filter H3K27ac peaks by signalValue and remove blacklisted regions
# Dependencies: bedtools v2.31.0, awk (GNU Awk 5.2.1)
# Input: ENCFF123ABC.bed (IDR thresholded peaks, GRCh38)
# Output: derived/filtered_peaks/H3K27ac_pancreas_filtered.bed
# Reference: hg38-blacklist.v2.bed.gz (Amemiya et al. 2019)

Step 5: Log Derived Files to ENCODE Tracker

After each operation, register the derived file:

encode_log_derived_file(
    file_path="/path/to/output.bed",
    source_accessions=["ENCSR...", "ENCFF..."],
    description="H3K27ac peaks in pancreas, signalValue >= 4.5, blacklist-filtered",
    file_type="filtered_peaks",
    tool_used="bedtools v2.31.0 + awk",
    parameters="awk '$7 >= 4.5' | bedtools intersect -v blacklist"
)

Verify the provenance chain:

encode_get_provenance(file_path="/path/to/output.bed")

Step 6: Version Control and Experiment Branching

When the User Runs Multiple Versions

If the user tries different parameters or approaches:

  1. Log EACH version as a separate operation with unique operation_id
  2. Record what was different between versions
  3. Ask the user which version to use going forward
  4. Mark the chosen version as "selected" and others as "alternative"

Example Version Log

{
  "operation_id": "op_003a",
  "description": "Filter peaks - signalValue >= 4.5",
  "status": "alternative",
  "note": "Less stringent threshold, more peaks retained"
},
{
  "operation_id": "op_003b",
  "description": "Filter peaks - signalValue >= 7.0",
  "status": "selected",
  "note": "More stringent, user chose this for final analysis"
}

Step 7: Auto-Generate Methods Sections

When the user requests methods writing, read the experiment log and generate publication-ready text.

Methods Template Structure

Data Acquisition

[Assay] data for [biosample] were obtained from the ENCODE Project (ENCODE Project Consortium 2020) via the ENCODE portal (https://www.encodeproject.org). [N] experiments were included (accessions: [list]). All experiments used [sequencer] with [read length] [SE/PE] reads, generating [N]M reads per replicate across [N] biological replicates. Data were processed by the ENCODE Uniform Processing Pipeline (version [X]).

File Selection

[Output type] files aligned to [assembly] were selected for downstream analysis. Files were selected using ENCODE's preferred default designation. IDR thresholded peaks (Li et al. 2011) were used for [ChIP-seq/ATAC-seq] to ensure replicate concordance.

Quality Assessment

Experiments were assessed for quality using ENCODE audit flags. Experiments with ERROR-level audits were excluded. ChIP-seq quality was evaluated using FRiP (≥[X]%), NSC (>[X]), RSC (>[X]), and NRF (≥[X]) metrics.

Processing Steps

For each operation in the log, generate a sentence:

[Description]. [Tool] (version [X]; [citation]) was used with the following parameters: [parameters]. [Reference files] were obtained from [source] (version [X], accessed [date]). Of [N] input [regions/reads], [N] ([%]) passed filtering.
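Rendering that sentence from an operation log entry (format shown in Step 2) is mechanical; a sketch of the generator, with the function name illustrative and the region-count clause emitted only when statistics are present:

```python
def methods_sentence(op):
    """Render one operation log entry as a methods-style sentence,
    following the template above."""
    tool = op["tool"]
    stats = op.get("statistics", {})
    parts = [
        f"{op['description']}.",
        f"{tool['name']} (version {tool['version']}; {tool['citation']}) "
        f"was used with the command: {op['command']}.",
    ]
    if "input_regions" in stats and "output_regions" in stats:
        n_in, n_out = stats["input_regions"], stats["output_regions"]
        parts.append(
            f"Of {n_in:,} input regions, {n_out:,} "
            f"({100 * n_out / n_in:.1f}%) passed filtering."
        )
    return " ".join(parts)
```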

Data Availability

All source data are available from the ENCODE portal under accessions [list]. Derived files, analysis scripts, and the complete provenance log are available at [repository URL]. Software versions: [list all tools and versions used].

Scientific Documentation Standards

Methods sections MUST follow these principles for every computational step:

Precision over approximation

  • Always use exact counts: "1,245 genes" not "~1,200 genes"; "42.3M reads" not "millions of reads"
  • Every number that can be exact, should be exact

Complete tool attribution

  • Every bioinformatics tool gets three things: name, version, and citation — no exceptions
  • Don't just say "reads were aligned with STAR" — say which version, which index, which parameters

Full reference specification

  • Always state BOTH genome build AND annotation version: "GRCh38.p14 with GENCODE v44" not just "hg38"
  • Annotation versions matter: GENCODE v38 and v44 define different gene sets

Experimental context

  • Name the sequencing platform: "Illumina HiSeq 4000" or "NovaSeq 6000"
  • Report read characteristics: length, SE/PE, read count per sample, fragment size if relevant
  • Report replicate details: number of biological replicates, sex breakdown (e.g., "3M/2F"), pooling strategy

Show your filtering work

  • Every filtering step reports input count, pass count, fail count, and percentage: "Of 67,412 regions, 66,894 (99.2%) passed; 518 excluded"
  • The reader should never wonder how much data was lost at any step
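This reporting style can be generated consistently from the counts alone; a one-function sketch (name illustrative) matching the wording above:

```python
def filter_report(n_input, n_passed, unit="regions"):
    """Format a filtering step the way the standard above requires:
    input count, pass count, percentage, and exclusions."""
    n_failed = n_input - n_passed
    pct = 100 * n_passed / n_input
    return (f"Of {n_input:,} {unit}, {n_passed:,} ({pct:.1f}%) passed; "
            f"{n_failed:,} excluded")
```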

Statistical rigor

  • Name the specific test AND the multiple testing correction: "Benjamini-Hochberg FDR < 0.05" not just "adjusted p-value"
  • Name the software used for statistics with its version

Data accessibility

  • Provide GEO/ENCODE accessions for both data deposition AND any external data reused
  • Link to scripts, provenance logs, and derived files

Orthogonal validation

  • Never rely on a single enrichment or pathway method — use 2+ complementary approaches so results don't depend on one algorithm's biases

Citation Format for Tools

When generating methods, include proper citations:

| Tool | Citation |
|---|---|
| bedtools | Quinlan & Hall 2010, Bioinformatics |
| samtools | Li et al. 2009, Bioinformatics |
| STAR | Dobin et al. 2013, Bioinformatics |
| featureCounts | Liao et al. 2014, Bioinformatics |
| edgeR | Robinson et al. 2010, Bioinformatics |
| MACS2 | Zhang et al. 2008, Genome Biology |
| DESeq2 | Love et al. 2014, Genome Biology |
| Seurat | Stuart et al. 2019, Cell |
| SCTransform | Hafemeister & Satija 2019, Genome Biology |
| CellRanger | 10x Genomics (cite version used) |
| Scanpy | Wolf et al. 2018, Genome Biology |
| ChromHMM | Ernst & Kellis 2012, Nature Methods |
| liftOver | Kent et al. 2002, Genome Research |
| HOMER | Heinz et al. 2010, Molecular Cell |
| deepTools | Ramirez et al. 2016, Nucleic Acids Research |
| Harmony | Korsunsky et al. 2019, Nature Methods |
| IDR | Li et al. 2011, Annals of Applied Statistics |
| WGCNA | Langfelder & Horvath 2008, BMC Bioinformatics |
| CibersortX | Newman et al. 2019, Nature Biotechnology |
| GSEA | Subramanian et al. 2005, PNAS |
| Gviz | Hahne & Ivanek 2016, Methods in Molecular Biology |
| GraphPad Prism | GraphPad Software (cite version) |
| DAVID | Huang et al. 2009, Nature Protocols |
| Enrichr | Kuleshov et al. 2016, Nucleic Acids Research |
| DEGAS | Li et al. 2022, Genome Biology |
| RRHO | Plaisier et al. 2010, Nucleic Acids Research |

Step 8: Supplementary Data Tables

Generate supplementary tables following the scientific documentation standards above:

Table S1: ENCODE Experiments Used

| Accession | Assay | Target | Biosample | Lab | Replicates | Sequencer | Read Length | Read Count | Library |
|---|---|---|---|---|---|---|---|---|---|
| ENCSR... | Histone ChIP-seq | H3K27ac | pancreas | Ren | 2 bio | HiSeq 4000 | 76bp PE | 42.3M | TruSeq |

Table S2: Files Selected

| File Accession | Experiment | Format | Output Type | Assembly | Pipeline | Size | MD5 |
|---|---|---|---|---|---|---|---|
| ENCFF... | ENCSR... | bed narrowPeak | IDR thresholded peaks | GRCh38 | ENCODE v2.1 | 1.2MB | abc... |

Table S3: Processing Steps

| Step | Description | Tool | Version | Input | Output | Parameters | Reference Files |
|---|---|---|---|---|---|---|---|
| 1 | Peak filtering | bedtools | 2.31.0 | ENCFF... | filtered.bed | signalValue≥4.5 | blacklist v2 |

Table S4: Software Environment

| Software | Version | Citation |
|---|---|---|
| R | 4.3.2 | R Core Team 2023 |
| Bioconductor | 3.18 | Huber et al. 2015 |
| DESeq2 | 1.42.0 | Love et al. 2014 |

Export using:

encode_export_data(format="csv")  # For Table S1
encode_get_citations(export_format="bibtex")  # For bibliography

Pitfalls and Edge Cases

Tool Version Drift

  • Tool versions change over time; bedtools v2.30 may produce different results than v2.31
  • ALWAYS record versions at time of use, not at time of writing
  • If re-running an analysis months later, verify tool versions match the log

Reference File Versioning

  • Genome annotations (GENCODE) release new versions regularly
  • Chromosome size files differ between assemblies and even between UCSC/Ensembl conventions
  • Blacklists have versions (v1 vs v2) that exclude different regions
  • ALWAYS record the exact version and download URL/date

Incomplete Provenance

  • If a step was performed interactively (e.g., manual filtering in IGV), log it anyway with a note
  • "No provenance" is worse than "approximate provenance"
  • If a tool doesn't report its version, record the binary date and download source

Multi-User Environments

  • If multiple people work on the same project, log WHO performed each operation
  • Use consistent file paths or relative paths in the log
  • Store scripts in version control (git) alongside the provenance log

Containerization for Exact Reproduction

  • For maximum reproducibility, consider Docker/Singularity containers
  • Record the container image and tag alongside the tool version
  • Nextflow/Snakemake workflows can encode the full environment

Walkthrough: Building a Complete Provenance Trail for an ENCODE Analysis

Goal: Document every step of an ENCODE analysis pipeline with full provenance — from raw data acquisition through processing, analysis, and derived outputs — enabling reproducibility and publication-ready methods. Context: Reproducibility requires knowing exactly what data, tools, parameters, and versions produced each result. This skill automates provenance tracking.

Step 1: Log data acquisition

encode_track_experiment(accession="ENCSR000AKA", notes="H3K27ac ChIP-seq in GM12878 for enhancer analysis")

Expected output:

{
  "status": "tracked",
  "accession": "ENCSR000AKA",
  "notes": "H3K27ac ChIP-seq in GM12878 for enhancer analysis",
  "tracked_at": "2025-03-08T10:00:00Z"
}

Step 2: Log file downloads with MD5 verification

encode_download_files(accessions=["ENCFF001ABC"], download_dir="/data/chipseq")

Expected output:

{
  "downloaded": 1,
  "md5_verified": true,
  "files": ["/data/chipseq/ENCFF001ABC.bed.gz"]
}

Step 3: Log derived analysis outputs

encode_log_derived_file(
  source_accessions=["ENCFF001ABC", "ENCFF002DEF"],
  derived_file="/data/analysis/gm12878_enhancers_filtered.bed",
  description="Filtered H3K27ac peaks: removed blacklist regions, merged within 500bp, filtered signalValue > 5",
  tool="bedtools v2.31.0",
  parameters="intersect -v (blacklist), merge -d 500, filter signalValue > 5"
)

Expected output:

{
  "status": "logged",
  "derived_file": "/data/analysis/gm12878_enhancers_filtered.bed",
  "source_count": 2,
  "logged_at": "2025-03-08T11:00:00Z"
}

Step 4: View the complete provenance chain

encode_get_provenance(file_path="/data/analysis/gm12878_enhancers_filtered.bed")

Expected output:

{
  "file": "/data/analysis/gm12878_enhancers_filtered.bed",
  "description": "Filtered H3K27ac peaks",
  "tool": "bedtools v2.31.0",
  "sources": [
    {"accession": "ENCFF001ABC", "type": "encode_file"},
    {"accession": "ENCFF002DEF", "type": "encode_file"}
  ],
  "logged_at": "2025-03-08T11:00:00Z"
}

Step 5: Generate provenance summary for publication

encode_get_tracking_summary()

Interpretation: The complete provenance chain enables automatic generation of methods sections: "H3K27ac ChIP-seq peaks (ENCFF001ABC) were filtered using ENCODE blacklist v2 (Amemiya et al. 2019) with bedtools v2.31.0..."

Integration with downstream skills

  • Provenance records from all skills feed into this skill for centralized tracking
  • Provenance data supports → scientific-writing methods section generation
  • File lineage connects to → cite-encode for proper ENCODE data attribution
  • Pipeline provenance from → pipeline-chipseq through pipeline-cutandrun records processing steps

Code Examples

1. Track an experiment for provenance

encode_track_experiment(accession="ENCSR000AKA", notes="GM12878 H3K27ac for enhancer catalog")

Expected output:

{
  "status": "tracked",
  "accession": "ENCSR000AKA",
  "notes": "GM12878 H3K27ac for enhancer catalog"
}

2. Log a derived analysis file

encode_log_derived_file(
  source_accessions=["ENCFF001ABC"],
  derived_file="/data/peaks_filtered.bed",
  description="Blacklist-filtered peaks",
  tool="bedtools v2.31.0"
)

Expected output:

{
  "status": "logged",
  "derived_file": "/data/peaks_filtered.bed",
  "source_count": 1
}

3. View provenance chain

encode_get_provenance(file_path="/data/peaks_filtered.bed")

Expected output:

{
  "file": "/data/peaks_filtered.bed",
  "tool": "bedtools v2.31.0",
  "sources": [{"accession": "ENCFF001ABC", "type": "encode_file"}]
}

Integration

| This skill produces... | Feed into... | Using tool/skill |
|---|---|---|
| Provenance chain (accession → derived files) | Methods section generation | scientific-writing skill |
| Logged analysis steps with parameters | Reproducibility audit | publication-trust skill |
| MD5-verified file records | Data availability statement | cite-encode skill |
| Sequential script numbering | Pipeline documentation | pipeline-guide skill |
| Complete tool + version records | Tool citation list | cite-encode → BibTeX export |

Related Skills

  • pipeline-guide — ENCODE pipeline execution and monitoring
  • cite-encode — Generating citations and bibliography for ENCODE data
  • quality-assessment — Evaluating quality of ENCODE experiments
  • multi-omics-integration — Multi-omics workflows that generate provenance
  • histone-aggregation — Aggregation workflows that produce derived files
  • accessibility-aggregation — ATAC-seq aggregation with provenance
  • geo-connector — Log cross-references between ENCODE and GEO datasets
  • cross-reference — Link experiments to PubMed, DOI, GEO, NCT IDs
  • publication-trust — Verify literature claims backing analytical decisions

Presenting Results

  • Present provenance chain as: derived_file -> source_files -> ENCODE accessions, showing tools and parameters used. Include timestamps. Suggest: "Would you like to export this chain for your methods section?"

For the request: "$ARGUMENTS"