Encode-toolkit quality-assessment

Evaluate ENCODE experiment quality using standard metrics and audit flags. Use when the user asks about data quality, wants to filter for high-quality experiments, needs to interpret quality metrics (FRiP, NSC, RSC, NRF, IDR, TSS enrichment, fragment size), wants to understand ENCODE audit warnings, needs to compare quality across experiments, or is deciding whether data is usable for their analysis. Also use when the user mentions QC, quality control, or data filtering.

install

  • source · Clone the upstream repo:
    git clone https://github.com/ammawla/encode-toolkit
  • Claude Code · Install into ~/.claude/skills/:
    T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugin/skills/quality-assessment" ~/.claude/skills/ammawla-encode-toolkit-quality-assessment && rm -rf "$T"

manifest: plugin/skills/quality-assessment/SKILL.md
source content

Assess ENCODE Data Quality

When to Use

  • User asks about data quality, QC metrics, or whether an experiment is reliable
  • User wants to filter experiments by quality (FRiP, NSC, RSC, NRF, IDR, TSS enrichment)
  • User asks "is this experiment good enough?" or "should I use this data?"
  • User needs to interpret ENCODE audit flags (ERROR, NOT_COMPLIANT, WARNING)
  • User wants to compare quality across multiple experiments
  • User is selecting high-quality experiments for a meta-analysis or aggregation

Help the user evaluate whether ENCODE experiments meet quality standards for their analysis. Quality assessment is not a single-metric exercise — it requires integrating multiple orthogonal measures in the context of the specific assay, biological system, and analytical goals.

Literature Foundation

| # | Reference | Key Contribution |
|---|-----------|------------------|
| 1 | Landt et al. 2012, Genome Res, DOI:10.1101/gr.136184.111 (~3,500 cit) | ENCODE/modENCODE ChIP-seq guidelines; defined NSC, RSC, NRF, FRiP thresholds |
| 2 | ENCODE Project Consortium 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit) | ENCODE Phase 3; expanded quality standards to new assays, defined cCRE registry |
| 3 | Buenrostro et al. 2013, Nat Methods, DOI:10.1038/nmeth.2688 (~7,000 cit) | Introduced ATAC-seq; established fragment size and TSS enrichment as key QC |
| 4 | Ou et al. 2018, BMC Genomics, DOI:10.1186/s12864-018-4559-3 | ATACseqQC R package; systematic quality metrics for ATAC-seq |
| 5 | Conesa et al. 2016, Genome Biol, DOI:10.1186/s13059-016-0881-8 (~2,363 cit) | RNA-seq best practices survey; defined mapping rate, rRNA, gene body coverage |
| 6 | Foox et al. 2021, Genome Biol, DOI:10.1186/s13059-021-02529-2 | SEQC2 EpiQC consortium; multi-platform WGBS benchmarking |
| 7 | Yardimci et al. 2019, Genome Biol, DOI:10.1186/s13059-019-1658-7 | Hi-C quality measures; cis/trans ratio, distance-dependent decay, resolution |
| 8 | Skene & Henikoff 2017, eLife, DOI:10.7554/eLife.21856 (~1,800 cit) | CUT&RUN method; established spike-in normalization and low-background QC |
| 9 | Kaya-Okur et al. 2019, Nat Commun, DOI:10.1038/s41467-019-09982-5 (~1,200 cit) | CUT&Tag method; tagmentation-based profiling with distinct QC profile |
| 10 | Li et al. 2011, Ann Appl Stat, DOI:10.1214/11-AOAS466 (~1,500 cit) | Irreproducible Discovery Rate (IDR); principled replicate concordance |
| 11 | Hitz et al. 2023, Nucleic Acids Res, DOI:10.1093/nar/gkad243 | ENCODE uniform processing pipelines; standardized QC across all assays |
| 12 | Nordin et al. 2023, Genome Biol, DOI:10.1186/s13059-023-03027-3 | CUT&RUN suspect list; identified artifact-prone regions specific to CUT&RUN/CUT&Tag |
| 13 | Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z (~1,372 cit) | ENCODE Blacklist v2; artifact regions to exclude from all analyses |

Step 1: Retrieve Experiment Details and Audit Status

Use `encode_get_experiment` with the accession to get full metadata including:

  • Audit status (ERROR, NOT_COMPLIANT, WARNING, INTERNAL_ACTION)
  • Replicate information (biological and technical replicates)
  • Pipeline and analysis details (which ENCODE uniform pipeline was used)
  • Quality metrics embedded in file objects
encode_get_experiment(accession="ENCSR...")

For batch assessment across multiple experiments:

encode_search_experiments(assay_title="...", organ="...", limit=50)
# Then iterate through results checking audit flags
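The iteration step above can be sketched as a screening loop. This is a hypothetical sketch: the result shape (a dict with an `"experiments"` list, each entry carrying `"accession"` and an `"audit"` count dict) is an assumption modeled on the example outputs later in this document, not a documented toolkit contract.

```python
def screen_batch(results: dict) -> list:
    """Keep accessions with no ERROR or NOT_COMPLIANT audit flags."""
    clean = []
    for exp in results.get("experiments", []):
        audit = exp.get("audit", {})
        # ERROR and NOT_COMPLIANT are the two severities that warrant exclusion
        if audit.get("ERROR", 0) == 0 and audit.get("NOT_COMPLIANT", 0) == 0:
            clean.append(exp["accession"])
    return clean
```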

Step 2: Interpret ENCODE Audit Flags

ENCODE audits are generated by automated validators during the ENCODE uniform processing pipeline (Hitz et al. 2023). They flag experiments by severity:

| Level | Meaning | Action |
|---|---|---|
| ERROR | Critical issues — data may be unreliable | Avoid using unless no alternative exists. Document thoroughly if used. |
| NOT_COMPLIANT | Does not meet current ENCODE standards | Usable with caveats. Check which specific standard is violated. |
| WARNING | Minor issues detected | Generally safe. Document the specific warning. |
| INTERNAL_ACTION | DCC processing notes | Usually not a concern for external users. |

Common audit categories and what they mean:

| Audit Category | What It Checks |
|---|---|
| replicate concordance | IDR or correlation between biological replicates |
| library complexity | NRF, PBC1, PBC2 — whether library is saturated |
| read depth | Whether minimum depth thresholds are met |
| control quality | Whether input/IgG control is adequate |
| mapping quality | Alignment rate and uniquely mapped fraction |
| peak calling | Whether peaks were called successfully, FRiP |
| antibody validation | Whether antibody meets ENCODE standards |

Present every audit flag to the user and explain each one. A single ERROR audit does not automatically disqualify an experiment — context matters.
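The severity table above can be encoded as a minimal triage rule. This is an illustrative sketch, not toolkit API; the audit dict shape mirrors the example outputs in the walkthrough below.

```python
def triage_audits(audit: dict) -> str:
    """Map ENCODE audit severity counts to a usage recommendation."""
    if audit.get("ERROR", 0) > 0:
        return "avoid unless no alternative; document thoroughly"
    if audit.get("NOT_COMPLIANT", 0) > 0:
        return "usable with caveats; check which standard is violated"
    if audit.get("WARNING", 0) > 0:
        return "generally safe; document the warning"
    return "no audit concerns"
```

Remember that this is only a first pass: a single ERROR still deserves human review in context.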

Step 3: Evaluate ChIP-seq Quality (Landt et al. 2012)

The ENCODE ChIP-seq guidelines (Landt et al. 2012) established the foundational metrics still used today. These were developed from analysis of hundreds of ChIP-seq experiments and reflect empirically-derived thresholds.

Core Metrics

| Metric | Threshold | Concern | What It Measures | Why It Matters |
|---|---|---|---|---|
| FRiP | ≥1% (TF), ≥5% (histone) | Below threshold | Fraction of reads in peaks | Signal enrichment. Very low FRiP means most reads are background. TF ChIP typically has lower FRiP than broad histone marks. |
| NSC | >1.05 | ≤1.05 | Normalized strand cross-correlation | Signal-to-noise ratio. Computed from strand shift analysis. Values near 1.0 indicate no enrichment. |
| RSC | >0.8 | ≤0.8 | Relative strand cross-correlation | Signal relative to phantom peak. More robust than NSC for shallow libraries. |
| NRF | ≥0.8 | <0.8 | Non-redundant fraction (unique/total) | Library complexity. Low NRF = excessive PCR duplication = wasted sequencing. |
| PBC1 | ≥0.8 | <0.5 | PCR bottleneck coefficient 1 | N1/Nd: fraction of locations with exactly 1 read. More sensitive than NRF at high depth. |
| PBC2 | ≥3 | <1 | PCR bottleneck coefficient 2 | N1/N2: ratio of 1-read to 2-read locations. <1 indicates severe bottleneck. |

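The complexity metrics above reduce to simple ratios over alignment counts. A sketch of that arithmetic (Landt et al. 2012 definitions), where `n_total` is total reads, `n_distinct` is distinct genomic locations, and `n1`/`n2` are locations hit by exactly one or two reads:

```python
def library_complexity(n_total: int, n_distinct: int, n1: int, n2: int) -> dict:
    """Compute NRF, PBC1, PBC2 from read/location counts."""
    return {
        "NRF": n_distinct / n_total,               # non-redundant fraction
        "PBC1": n1 / n_distinct,                   # PCR bottleneck coefficient 1
        "PBC2": n1 / n2 if n2 else float("inf"),   # PCR bottleneck coefficient 2
    }
```

A library with NRF and PBC1 near 0.9 and PBC2 above 3 passes all three complexity thresholds.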
Read Depth Requirements

| Target Type | Minimum per Replicate | Recommended | Notes |
|---|---|---|---|
| Transcription factor | 10M uniquely mapped | 20M | Narrow peaks, need depth for detection |
| Broad histone mark (H3K27me3, H3K9me3, H3K36me3) | 20M uniquely mapped | 45M | Broad domains require more reads |
| Narrow histone mark (H3K4me3, H3K27ac) | 20M uniquely mapped | 20M | Sharp peaks, similar to TF |
| Input/IgG control | 10M uniquely mapped | Match IP depth | Should match or exceed IP library depth |

IDR Analysis (Li et al. 2011)

The Irreproducible Discovery Rate provides principled assessment of replicate concordance:

| IDR Comparison | Expected | Concern | Interpretation |
|---|---|---|---|
| Nt (true replicates) | ≥50% of Np | <50% Np | Low concordance between biological replicates |
| Np (pooled pseudoreplicates) | Reference set | | Represents total discoverable peaks |
| Self-consistency (Ns) | ≥50% of Np | <50% Np | Individual replicate quality |
| Rescue ratio (Np/max(Nt,Ns)) | <2 | >2 | High ratio = one replicate much weaker |

Key insight: IDR thresholded peaks represent peaks passing replicate concordance analysis. Pseudoreplicated peaks = single-replicate fallback (lower confidence). Optimal IDR peaks from pooled data = most complete peak set.
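The concordance checks in the table above can be sketched directly from the three peak counts, using this document's definitions of Nt, Ns, and Np:

```python
def idr_checks(n_t: int, n_s: int, n_p: int) -> dict:
    """Flag replicate concordance problems from IDR peak counts."""
    return {
        "replicate_concordance_ok": n_t >= 0.5 * n_p,  # Nt should reach 50% of Np
        "self_consistency_ok": n_s >= 0.5 * n_p,       # Ns should reach 50% of Np
        "rescue_ratio": n_p / max(n_t, n_s),           # concern if > 2
    }
```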

Antibody Validation

ENCODE requires characterization for every antibody:

  • Primary: IP followed by mass spectrometry or immunoprecipitation-western
  • Secondary: At least one of: knockdown/knockout, motif enrichment, genomic annotation enrichment
  • Check the `antibody_lot_reviews` field in experiment metadata

Step 4: Evaluate ATAC-seq Quality (Buenrostro et al. 2013; Ou et al. 2018)

ATAC-seq has a distinct quality profile driven by the transposase insertion mechanism.

Core Metrics

| Metric | Good | Concern | What It Measures |
|---|---|---|---|
| TSS enrichment | ≥5 GRCh38 / ≥6 hg19 / ≥10 mm10 (ENCODE data standards) | <4 | Signal enrichment at transcription start sites. The single most informative ATAC-seq QC metric. |
| FRiP | ≥20% | <10% | Higher expected FRiP than ChIP-seq because accessible chromatin = true signal |
| Fragment size distribution | Clear nucleosomal ladder | Monotonic decay | Should show peaks at <150bp (NFR), ~200bp (mono-nuc), ~400bp (di-nuc), ~600bp (tri-nuc) |
| NFR ratio | >2× mono-nucleosomal | <1× | Ratio of sub-nucleosomal to mono-nucleosomal fragments |
| Mitochondrial reads | <20% (after filtering) | >50% | Mitochondrial DNA is highly accessible; excessive = poor nuclear enrichment |
| Duplicate rate | <30% | >50% | PCR duplication. Omni-ATAC protocol reduces this. |
| NRF | ≥0.7 | <0.5 | Library complexity, same concept as ChIP-seq |

Read Depth Requirements

| Sample Type | Minimum | Recommended |
|---|---|---|
| Bulk ATAC-seq | 25M uniquely mapped (post-dedup, post-mito filter) | 50M |
| Single-cell ATAC-seq | 25K unique fragments per cell | 50K per cell |

Fragment Size Interpretation (Buenrostro et al. 2013)

The fragment size distribution is the signature QC plot for ATAC-seq:

  • Sub-nucleosomal (<150bp): Nucleosome-free regions — these are the "open chromatin" signal
  • Mono-nucleosomal (~200bp): Single nucleosome wrapped
  • Di-nucleosomal (~400bp): Two nucleosomes
  • Tri-nucleosomal (~600bp): Three nucleosomes

A clean ATAC-seq library shows a clear nucleosomal ladder. Monotonic decay (no peaks) suggests either dead cells, over-transposition, or excessive DNA damage.
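The binning above can be sketched as a small classifier plus the NFR ratio from the metrics table. Bin edges are illustrative midpoints between the nominal fragment classes, not an established standard:

```python
def classify_fragment(length: int) -> str:
    """Assign a fragment length to a nucleosomal class."""
    if length < 150:
        return "NFR"     # nucleosome-free region signal
    if length < 300:
        return "mono"    # mono-nucleosomal (~200bp)
    if length < 500:
        return "di"      # di-nucleosomal (~400bp)
    return "tri+"        # tri-nucleosomal and larger

def nfr_ratio(lengths: list) -> float:
    """Ratio of sub-nucleosomal to mono-nucleosomal fragments (good: >2)."""
    counts = {"NFR": 0, "mono": 0, "di": 0, "tri+": 0}
    for length in lengths:
        counts[classify_fragment(length)] += 1
    return counts["NFR"] / counts["mono"] if counts["mono"] else float("inf")
```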

Step 5: Evaluate RNA-seq Quality (Conesa et al. 2016)

Core Metrics

| Metric | Good | Concern | What It Measures |
|---|---|---|---|
| Mapping rate | 70-90% uniquely mapped | <70% | Alignment success. Low rate = contamination, adapter issues, or wrong reference |
| rRNA contamination | <10% | >20% | Ribosomal RNA depletion efficiency. High = failed ribo-depletion |
| Gene body coverage | Uniform 5'→3' | Strong 3' bias | Even coverage across gene bodies. 3' bias = degraded RNA or poly-A capture bias |
| Duplication rate | <50% | >70% | PCR amplification artifacts |
| Replicate correlation | Spearman ≥0.9 (same condition) | <0.8 | Concordance between replicates |
| Exonic reads | >60% of mapped | <40% | Reads mapping to annotated exons vs intergenic |
| Intergenic reads | <10% | >20% | Reads mapping between genes — may indicate genomic DNA contamination |

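The "Concern" column above can be applied programmatically. A hypothetical screen; the metric names and dict shape are illustrative, with rates expressed as fractions in 0-1:

```python
# (threshold, direction): "min" means values below threshold are concerning,
# "max" means values above threshold are concerning.
RNA_CONCERNS = {
    "unique_mapping_rate": (0.70, "min"),
    "rrna_fraction":       (0.20, "max"),
    "duplication_rate":    (0.70, "max"),
    "replicate_spearman":  (0.80, "min"),
    "exonic_fraction":     (0.40, "min"),
    "intergenic_fraction": (0.20, "max"),
}

def rnaseq_flags(metrics: dict) -> list:
    """Return the names of metrics in the concern zone."""
    flagged = []
    for name, (cut, kind) in RNA_CONCERNS.items():
        if name not in metrics:
            continue
        value = metrics[name]
        if (kind == "min" and value < cut) or (kind == "max" and value > cut):
            flagged.append(name)
    return flagged
```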
Read Depth Requirements (Conesa et al. 2016)

| Application | Minimum | Recommended | Notes |
|---|---|---|---|
| Gene-level quantification | 10M mapped | 30M mapped | Standard bulk RNA-seq |
| Transcript-level quantification | 30M mapped | 60M mapped | Isoform detection requires more depth |
| Differential expression | 10M per sample, ≥3 bio reps | 20M per sample | Statistical power depends more on replicates than depth |
| Rare transcript detection | 50M+ mapped | 100M mapped | Long tail of expression distribution |
| Total RNA-seq | 50M+ mapped | 100M mapped | Includes non-coding RNA, intergenic transcripts |

Strand Specificity

ENCODE RNA-seq data may be stranded or unstranded:

  • Stranded: Can distinguish sense vs antisense transcription. Required for accurate quantification of overlapping genes.
  • Unstranded: Cannot resolve strand of origin. Check `run_type` and `library_strand_specificity` in metadata.

Step 6: Evaluate WGBS Quality (ENCODE data standards)

Core Metrics

| Metric | Good | Concern | What It Measures |
|---|---|---|---|
| Bisulfite conversion rate | ≥98% | <98% | Efficiency of C→U conversion of unmethylated cytosines. Measured from spike-in controls (lambda phage DNA). |
| CpG coverage | >80% of CpGs at ≥1× | <50% | Fraction of CpG sites covered by at least one read |
| Mean CpG coverage | ≥10× for DMR analysis | <5× | Average sequencing depth at CpG sites. 10× needed for reliable methylation calls. |
| Mapping rate | >60% unique | <40% | Lower than standard WGS due to reduced complexity after bisulfite conversion |
| Duplication rate | <30% | >50% | PCR duplicates |
| CpG methylation distribution | Bimodal (near 0% and near 100%) | Unimodal | Healthy cells show bimodal: most CpGs are either fully methylated or unmethylated |
| Lambda/pUC19 conversion | ≥98% conversion rate | <98% | Spike-in controls for bisulfite conversion efficiency |

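The conversion-rate arithmetic behind the spike-in row above is a single ratio: lambda phage DNA is unmethylated, so every cytosine position should read as T after conversion. A minimal sketch over base calls at lambda cytosine positions:

```python
def bisulfite_conversion_rate(c_calls: int, t_calls: int) -> float:
    """Fraction of lambda cytosine positions read as T (i.e., converted)."""
    return t_calls / (c_calls + t_calls)
```

A result below 0.98 indicates incomplete conversion, which inflates apparent methylation genome-wide.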
Platform Considerations (Foox et al. 2021)

The SEQC2 EpiQC benchmark found significant platform effects:

  • Different sequencing platforms (Illumina HiSeq, NovaSeq, MGI) can produce systematically different methylation calls
  • Cross-platform comparisons require careful normalization
  • RRBS (Reduced Representation Bisulfite Sequencing) covers only ~10% of CpGs — do NOT combine RRBS with WGBS directly

Coverage Requirements

ApplicationMinimum CpG CoverageRecommended
Methylation landscape
Differentially methylated regions5× per sample10× per sample
Allele-specific methylation15×30×
Single CpG resolution10×30×

Step 7: Evaluate Hi-C Quality (Yardimci et al. 2019)

Core Metrics

| Metric | Good | Concern | What It Measures |
|---|---|---|---|
| Cis/trans ratio | >60% cis | <40% cis | Fraction of contacts within same chromosome. Low cis = random ligation = poor quality |
| Long-range cis (>20kb) | >40% of cis | <15% | True 3D interactions vs random proximity. Short-range contacts are noise-enriched |
| Unique valid pairs | >50% of total | <25% | Pairs surviving all filters (mapping, dedup, chimera removal) |
| Duplicate rate | <40% | >60% | PCR duplicates in Hi-C are especially problematic because they inflate contact frequencies |
| Contact distance decay | Smooth P(s)∝s^-1 curve | Irregular/plateau | Expected power-law decay with genomic distance |

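The cis/trans arithmetic from the table above, sketched over valid-pair counts after filtering (function and argument names are illustrative):

```python
def hic_contact_metrics(cis_pairs: int, trans_pairs: int, cis_over_20kb: int) -> dict:
    """Compute cis fraction and long-range cis fraction from valid pairs."""
    total = cis_pairs + trans_pairs
    return {
        "cis_fraction": cis_pairs / total,                      # concern below 0.40
        "long_range_cis_fraction": cis_over_20kb / cis_pairs,   # concern below 0.15
    }
```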
Read Depth and Resolution

| Resolution Target | Minimum Valid Pairs | Recommended |
|---|---|---|
| Compartment-level (100kb) | 50M | 100M |
| TAD-level (40kb) | 200M | 500M |
| Loop-level (5-10kb) | 500M | 1B+ |
| Sub-TAD (1kb) | 2B+ | 5B+ |

Note: Hi-C resolution is not just about read depth — it also depends on restriction enzyme site density, ligation efficiency, and fragment size distribution. In situ Hi-C (Rao et al. 2014) generally produces cleaner data than dilution Hi-C.

Step 8: Evaluate CUT&RUN / CUT&Tag Quality (Skene & Henikoff 2017; Kaya-Okur et al. 2019)

These newer profiling methods have distinct quality profiles from ChIP-seq.

Key Differences from ChIP-seq

| Feature | ChIP-seq | CUT&RUN / CUT&Tag |
|---|---|---|
| Background | High (requires input control) | Low (targeted cleavage) |
| Required depth | 10-45M | 3-8M sufficient |
| FRiP | >1-5% | >20% typical |
| Input control | Required | IgG control recommended but lower priority |
| Fragment size | Size-selected ~200-600bp | Variable; CUT&RUN releases <120bp fragments |
| Spike-in | Not standard | Recommended (E. coli carry-over or added spike-in) |

CUT&RUN Quality Metrics

| Metric | Good | Concern |
|---|---|---|
| FRiP | >20% | <5% |
| Fragment size | Peak at <120bp (released fragments) | Only large fragments |
| Read depth | 3-8M unique mapped | <1M |
| Spike-in ratio | Consistent across conditions | >5× variation |
| Duplicate rate | <30% | >60% |

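Spike-in normalization, as recommended in the tables above, scales each sample by the inverse of its spike-in read count so spike-in signal becomes constant across conditions. A minimal sketch; the target constant is arbitrary and the dict shape is illustrative:

```python
def spikein_scale_factors(spikein_reads: dict, target: int = 10_000) -> dict:
    """Per-sample scale factor: target count divided by observed spike-in reads."""
    return {sample: target / n for sample, n in spikein_reads.items()}
```

After scaling, signal tracks become quantitatively comparable across conditions, which is the point of spike-in QC: inconsistent ratios (>5× variation) mean the normalization itself is unreliable.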
CUT&RUN Suspect List (Nordin et al. 2023)

CUT&RUN and CUT&Tag generate artifacts at specific genomic regions (distinct from the ENCODE Blacklist). These are regions with apparent enrichment that is not target-specific; exclude them before interpreting FRiP or peak-based metrics.

Step 9: Evaluate Single-Cell Quality (scRNA-seq and scATAC-seq)

ENCODE includes scRNA-seq and scATAC-seq experiments (primarily 10X Chromium platform). Single-cell data has distinct quality metrics from bulk assays, focused on per-cell quality rather than per-experiment signal-to-noise.

scRNA-seq Quality Metrics

| Metric | Acceptable Range | Red Flag | Notes |
|---|---|---|---|
| Genes per cell (median) | 1,500–4,000 (10X) / 4,000–8,000 (Smart-seq2) | <500 | Tissue-dependent; immune cells typically lower than epithelial |
| UMIs per cell (median) | 3,000–15,000 (10X) | <1,000 | N/A for Smart-seq2 (no UMIs) |
| Mitochondrial % (median) | <10–15% | >25% | High mito% indicates cell stress or lysis; tissue-dependent thresholds |
| Doublet rate (estimated) | 2–8% (10X, cell-count dependent) / <2% (plate-based) | >10% | Increases with cell loading density; use Scrublet or DoubletFinder |
| Mapping rate | >80% | <60% | Low mapping suggests contamination or mismapping |
| Sequencing saturation | >40% | <20% | Low saturation may miss rare transcripts |
| Cell count vs expected | Within 50–150% of expected | <30% or >200% | Very low = failed capture; very high = doublets or debris |

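The red-flag column above can be applied as a per-cell filter. An illustrative sketch only; real thresholds should be tissue- and platform-adjusted rather than global, as discussed in the pitfalls section of this step:

```python
def cell_passes_qc(genes: int, umis: int, mito_frac: float) -> bool:
    """Reject cells hitting any red-flag threshold from the table above."""
    return genes >= 500 and umis >= 1_000 and mito_frac <= 0.25
```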
scATAC-seq Quality Metrics

| Metric | Acceptable Range | Red Flag | Notes |
|---|---|---|---|
| Unique fragments per cell | >3,000 | <1,000 | Sparse data below threshold makes peak calling unreliable |
| TSS enrichment per cell | >5 | <2 | Low TSS enrichment indicates poor enrichment of accessible chromatin |
| Fraction in peaks (FRiP) | >20% | <10% | Measures signal-to-noise at single-cell level |
| Fraction of mitochondrial reads | <5% | >10% | Dead/dying cells captured |
| Duplicate rate | <40% | >60% | High duplication indicates low library complexity |

Single-Cell-Specific Quality Pitfalls

  • Ambient RNA contamination (scRNA-seq): Cell-free RNA from lysed cells during droplet capture inflates apparent expression of highly-expressed genes across all cells. Use CellBender (best-in-class), SoupX, or DecontX to estimate and remove ambient contamination BEFORE downstream analysis.
  • Barcode multiplets (scATAC-seq): Multiple cells per droplet inflate fragment counts and blur cell-type signals. ArchR and SnapATAC2 include doublet detection modules.
  • Cell-type composition bias: Quality metrics vary by cell type. A "low-quality" cell may be a small immune cell, not a damaged cell. Apply adaptive QC thresholds per cluster (e.g., miQC) rather than global cutoffs.
  • Batch effects across donors: For ENCODE tissue scRNA-seq from multiple donors, batch correction (Harmony, scVI) is typically needed before integration. The single-cell-encode skill covers integration workflows.

Where to Find Single-Cell Quality in ENCODE

Single-cell experiments in ENCODE include cell-level quality summaries in their metadata. Check:

encode_get_experiment(accession="ENCSR...")

Look for `audit` flags — ENCODE applies automated QC checks including minimum cell counts, minimum genes per cell, and maximum doublet rates. Also check the `replicates` section for library preparation details.

Step 10: Assess Replication

ENCODE requires minimum 2 independent biological replicates for released data.

Replicate Types and Their Meaning

| Replicate Type | Definition | Use Case |
|---|---|---|
| Biological | Independent biological samples | Gold standard — captures biological variation |
| Technical | Same sample, different library prep | Assesses technical reproducibility |
| Isogenic | Same genotype, different growth/collection | Common for cell lines (e.g., K562, GM12878) |
| Anisogenic | Different genotypes/donors | Common for tissue samples |

Concordance Assessment

Use `encode_list_files` to check for replicated peak files:

| File Output Type | What It Means | Confidence |
|---|---|---|
| IDR thresholded peaks | Passed replicate concordance analysis (Li et al. 2011) | Highest |
| Optimal IDR peaks | Peaks from pooled data, thresholded by IDR | Complete set |
| Conservative IDR peaks | Stricter IDR threshold | Most conservative |
| Pseudoreplicated peaks | IDR on pseudoreplicates from pooled data | Single-replicate fallback |
| Replicated peaks | Found in multiple replicates (non-IDR method) | Moderate |

Cross-Replicate Correlation

For quantitative data (RNA-seq, signal tracks):

  • Spearman correlation ≥0.9: Excellent concordance
  • 0.8-0.9: Acceptable, check for outliers
  • <0.8: Investigate — batch effect, sample mix-up, or biological variation
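Spearman correlation is rank-based, so it is robust to the heavy-tailed expression values typical of signal tracks. A self-contained sketch (in practice `scipy.stats.spearmanr` does this; the pure-Python version below just makes the rank-then-Pearson logic explicit):

```python
def spearman(x: list, y: list) -> float:
    """Spearman rank correlation: Pearson correlation of average ranks."""
    def rank(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # group ties together and assign them their average rank
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        return ranks

    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```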

Step 11: Apply the ENCODE Blacklist (Amemiya et al. 2019)

Before interpreting any peak-based quality metric, confirm that the ENCODE Blacklist has been applied:

  • Blacklist v2 regions: High-signal artifacts (satellite repeats, centromeric regions, high-copy sequences)
  • Available at: https://github.com/Boyle-Lab/Blacklist/
  • hg38: `hg38-blacklist.v2.bed.gz` (910 regions)
  • mm10: `mm10-blacklist.v2.bed.gz`

Failure to remove blacklisted regions will inflate FRiP, create false peaks, and confound enrichment analyses. If analyzing CUT&RUN/CUT&Tag, also apply the CUT&RUN suspect list (Nordin et al. 2023).
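Blacklist removal is a simple interval-overlap filter. A sketch using half-open `(chrom, start, end)` tuples, which is how BED coordinates are conventionally interpreted:

```python
def remove_blacklisted(peaks: list, blacklist: list) -> list:
    """Drop any peak that overlaps a blacklist interval on the same chromosome."""
    def overlaps(a, b):
        # same chromosome, and the half-open intervals intersect
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    return [p for p in peaks if not any(overlaps(p, bl) for bl in blacklist)]
```

In practice this is usually done with `bedtools intersect -v` against the downloaded blacklist BED; the sketch just shows the logic.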

Step 12: File Quality Tiers and Selection

When listing files with `encode_list_files`, use quality-informed selection:

# Get preferred default files (ENCODE's recommendation)
encode_list_files(experiment_accession="ENCSR...", preferred_default=True)

# Get IDR thresholded peaks (gold standard for ChIP-seq)
encode_list_files(experiment_accession="ENCSR...", output_type="IDR thresholded peaks", assembly="GRCh38")

# Get signal tracks for visualization
encode_list_files(experiment_accession="ENCSR...", output_type="fold change over control", assembly="GRCh38")

File Quality Hierarchy

| Priority | File Type | When to Use |
|---|---|---|
| 1 | preferred_default=True | ENCODE's recommended files — start here |
| 2 | IDR thresholded peaks | Gold standard for ChIP-seq peak calls |
| 3 | Fold change over control | Normalized signal for visualization and quantitative comparison |
| 4 | Signal of unique reads | Clean signal tracks (unnormalized) |
| 5 | Pseudoreplicated peaks | Fallback when IDR fails or only 1 replicate available |
| 6 | Unfiltered alignments | Only for custom re-analysis |

Step 13: Summarize Quality Verdict

Provide a structured quality assessment:

Quality Tiers

| Tier | Criteria | Recommendation |
|---|---|---|
| High quality | No ERROR/NOT_COMPLIANT audits, all metrics above thresholds, ≥2 biological replicates, IDR peaks available | Use confidently. Ideal for primary analysis. |
| Usable with caveats | WARNING-level audits or borderline metrics (within 20% of threshold), good replication | Usable. Document specific limitations in methods. |
| Use with caution | NOT_COMPLIANT flags, one metric below threshold, or single replicate | Use only if no better alternative. Document all issues. Flag in results. |
| Not recommended | ERROR flags, multiple metrics below threshold, poor replication, no IDR peaks | Avoid. Seek alternative experiments or datasets. |

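The tier table above can be encoded as a decision rule. This is one illustrative mapping of the criteria to code (the inputs summarize an experiment's audits, metric failures, and replication), not an official ENCODE algorithm:

```python
def quality_tier(n_error: int, n_not_compliant: int, n_warning: int,
                 n_metrics_below: int, n_bio_reps: int,
                 has_idr_peaks: bool) -> str:
    """Assign a quality tier from audit counts, failed metrics, and replication."""
    if n_error > 0 or n_metrics_below >= 2:
        return "not recommended"
    if n_not_compliant > 0 or n_metrics_below == 1 or n_bio_reps < 2:
        return "use with caution"
    if n_warning > 0 or not has_idr_peaks:
        return "usable with caveats"
    return "high quality"
```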
Quality Summary Template

For each experiment assessed, provide:

  1. Accession: ENCSR...
  2. Assay: type and target
  3. Audit status: list all flags with explanations
  4. Key metrics: table with values and pass/fail
  5. Replication: number and type of replicates, concordance
  6. Pipeline: which ENCODE uniform pipeline version was used
  7. Verdict: tier assignment with justification
  8. Caveats: any specific limitations to note

Pitfalls and Common Mistakes

  1. Single-metric decisions: No single metric captures quality. FRiP alone can be misleading — some TF ChIP-seq with biological signal has low FRiP due to focal binding patterns. Always evaluate collectively.

  2. Comparing across assays: Do NOT compare ChIP-seq metrics to ATAC-seq metrics to CUT&RUN metrics. Each assay has its own quality profile and thresholds.

  3. Ignoring batch effects: Experiments from different labs, dates, or platforms may have systematic quality differences. When combining data, check for batch-correlated quality variation.

  4. Assembly mismatch: Quality metrics computed on different assemblies (hg19 vs GRCh38) may differ slightly. Always verify the assembly of quality metrics matches your analysis assembly.

  5. Antibody lot variation: The same antibody target can show different enrichment across lots. Check `antibody_lot_reviews` in ENCODE metadata.

  6. Read depth ≠ quality: A deeply sequenced bad library is still a bad library. Check NRF/PBC first — if complexity is exhausted, more sequencing wastes resources.

  7. Control quality matters: An IP library is only as good as its control. Poor input/IgG control undermines all downstream peak-based metrics.

  8. Newer assays, different rules: CUT&RUN and CUT&Tag have inherently different quality profiles from ChIP-seq. Applying ChIP-seq thresholds to CUT&RUN will incorrectly flag high-quality data.

Walkthrough: Quality Assessment of ENCODE ChIP-seq Before Analysis

Goal: Evaluate the quality of ENCODE ChIP-seq experiments against ENCODE consortium standards before including them in downstream analysis. Context: Not all ENCODE experiments meet the highest quality standards. Quality assessment prevents garbage-in-garbage-out in aggregation and integration analyses.

Step 1: Get experiment details and audit status

encode_get_experiment(accession="ENCSR000AKA")

Expected output:

{
  "accession": "ENCSR000AKA",
  "assay_title": "Histone ChIP-seq",
  "target": "H3K27ac",
  "biosample_summary": "GM12878",
  "replicates": 2,
  "status": "released",
  "audit": {"ERROR": 0, "NOT_COMPLIANT": 0, "WARNING": 1}
}

Interpretation: 0 ERRORs and 0 NOT_COMPLIANT = experiment meets ENCODE standards. 1 WARNING is acceptable.

Step 2: Check file-level quality

encode_list_files(accession="ENCSR000AKA", file_format="bed", output_type="IDR thresholded peaks", assembly="GRCh38")

Step 3: Review quality metrics

Key ChIP-seq quality thresholds (Landt et al. 2012):

| Metric | Threshold | Meaning |
|---|---|---|
| FRiP | >= 1% | Signal enrichment over background |
| NSC | > 1.05 | Strand cross-correlation signal |
| RSC | > 0.8 | Relative strand correlation |
| NRF | >= 0.8 | Library complexity |
| IDR | < 0.05 | Reproducibility between replicates |

Step 4: Track quality-verified experiments

encode_track_experiment(accession="ENCSR000AKA", notes="QC PASSED: FRiP=3.2%, NSC=1.12, RSC=0.95, 0 audit errors")

Integration with downstream skills

  • Quality-filtered experiments feed into histone-aggregation and other aggregation skills
  • QC metrics inform pipeline-chipseq parameter tuning
  • Audit status guides search-encode experiment selection
  • QC documentation supports data-provenance and scientific-writing

Code Examples

1. Check experiment audit status

encode_get_experiment(accession="ENCSR000AKA")

Expected output:

{
  "accession": "ENCSR000AKA",
  "audit": {"ERROR": 0, "NOT_COMPLIANT": 0, "WARNING": 1}
}

2. List files to check quality metrics

encode_list_files(accession="ENCSR000AKA", file_format="bed", assembly="GRCh38")

Expected output:

{
  "files": [
    {"accession": "ENCFF001ABC", "output_type": "IDR thresholded peaks", "file_size_mb": 1.1}
  ]
}

3. Get detailed file info for QC

encode_get_file_info(accession="ENCFF001ABC")

Expected output:

{
  "accession": "ENCFF001ABC",
  "file_format": "bed narrowPeak",
  "output_type": "IDR thresholded peaks",
  "assembly": "GRCh38",
  "quality_metrics": {"frip": 0.032, "nsc": 1.12, "rsc": 0.95}
}

Integration

| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Quality-verified experiments | histone-aggregation | Only aggregate high-quality data |
| QC pass/fail decisions | search-encode | Filter search results by quality |
| Quality metric reports | data-provenance | Document QC criteria used |
| Audit interpretation | pipeline-guide | Guide reprocessing decisions |
| QC documentation | scientific-writing | Methods section QC reporting |
| Quality thresholds | publication-trust | Verify QC threshold citations |
| Validated experiment lists | batch-analysis | Process only quality-approved experiments |
| QC-filtered peaks | regulatory-elements | High-confidence regulatory element maps |

Related Skills

  • publication-trust: Assess scientific integrity of publications before relying on their methods or findings — complements experiment quality with publication quality
  • histone-aggregation: Uses quality filtering before aggregating peaks
  • accessibility-aggregation: ATAC-seq quality directly impacts aggregation
  • data-provenance: Log quality decisions and thresholds used
  • compare-biosamples: Quality must be comparable across samples being compared
  • integrative-analysis: All data sources need quality assessment before integration
  • single-cell-encode: Quality metrics for scRNA-seq and scATAC-seq (genes/cell, fragments/cell, TSS enrichment)
  • epigenome-profiling: Quality assessment is a prerequisite for epigenomic profile assembly
  • variant-annotation: Quality of ENCODE experiments determines reliability of variant annotation
  • pipeline-guide: Pipeline version affects quality metric computation
  • batch-analysis: Batch QC screening across multiple experiments for systematic quality filtering

Presenting Results

  • Present QC metrics as a traffic-light table: metric | value | threshold | status (PASS/WARN/FAIL). Always include the ENCODE audit level. Suggest: "Would you like to filter to only experiments meeting all QC thresholds?"

For the request: "$ARGUMENTS"