Encode-toolkit quality-assessment

Evaluate ENCODE experiment quality using standard metrics and audit flags. Use when the user asks about data quality, wants to filter for high-quality experiments, needs to interpret quality metrics (FRiP, NSC, RSC, NRF, IDR, TSS enrichment, fragment size), wants to understand ENCODE audit warnings, needs to compare quality across experiments, or is deciding whether data is usable for their analysis. Also use when the user mentions QC, quality control, or data filtering.

install

  • source · Clone the upstream repo:
    git clone https://github.com/ammawla/encode-toolkit
  • Claude Code · Install into ~/.claude/skills/:
    T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugin/skills/quality-assessment" ~/.claude/skills/ammawla-encode-toolkit-quality-assessment && rm -rf "$T"

manifest: plugin/skills/quality-assessment/SKILL.md
source content

Assess ENCODE Data Quality

When to Use

  • User asks about data quality, QC metrics, or whether an experiment is reliable
  • User wants to filter experiments by quality (FRiP, NSC, RSC, NRF, IDR, TSS enrichment)
  • User asks "is this experiment good enough?" or "should I use this data?"
  • User needs to interpret ENCODE audit flags (ERROR, NOT_COMPLIANT, WARNING)
  • User wants to compare quality across multiple experiments
  • User is selecting high-quality experiments for a meta-analysis or aggregation

Help the user evaluate whether ENCODE experiments meet quality standards for their analysis. Quality assessment is not a single-metric exercise — it requires integrating multiple orthogonal measures in the context of the specific assay, biological system, and analytical goals.

Literature Foundation

| # | Reference | Key Contribution |
|---|-----------|------------------|
| 1 | Landt et al. 2012, Genome Res, DOI:10.1101/gr.136184.111 (~3,500 cit) | ENCODE/modENCODE ChIP-seq guidelines; defined NSC, RSC, NRF, FRiP thresholds |
| 2 | ENCODE Project Consortium 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit) | ENCODE Phase 3; expanded quality standards to new assays, defined cCRE registry |
| 3 | Buenrostro et al. 2013, Nat Methods, DOI:10.1038/nmeth.2688 (~7,000 cit) | Introduced ATAC-seq; established fragment size and TSS enrichment as key QC |
| 4 | Ou et al. 2018, BMC Genomics, DOI:10.1186/s12864-018-4559-3 | ATACseqQC R package; systematic quality metrics for ATAC-seq |
| 5 | Conesa et al. 2016, Genome Biol, DOI:10.1186/s13059-016-0881-8 (~2,363 cit) | RNA-seq best practices survey; defined mapping rate, rRNA, gene body coverage |
| 6 | Foox et al. 2021, Genome Biol, DOI:10.1186/s13059-021-02529-2 | SEQC2 EpiQC consortium; multi-platform WGBS benchmarking |
| 7 | Yardimci et al. 2019, Genome Biol, DOI:10.1186/s13059-019-1658-7 | Hi-C quality measures; cis/trans ratio, distance-dependent decay, resolution |
| 8 | Skene & Henikoff 2017, eLife, DOI:10.7554/eLife.21856 (~1,800 cit) | CUT&RUN method; established spike-in normalization and low-background QC |
| 9 | Kaya-Okur et al. 2019, Nat Commun, DOI:10.1038/s41467-019-09982-5 (~1,200 cit) | CUT&Tag method; tagmentation-based profiling with distinct QC profile |
| 10 | Li et al. 2011, Ann Appl Stat, DOI:10.1214/11-AOAS466 (~1,500 cit) | Irreproducible Discovery Rate (IDR); principled replicate concordance |
| 11 | Hitz et al. 2023, Nucleic Acids Res, DOI:10.1093/nar/gkad243 | ENCODE uniform processing pipelines; standardized QC across all assays |
| 12 | Nordin et al. 2023, Genome Biol, DOI:10.1186/s13059-023-03027-3 | CUT&RUN suspect list; identified artifact-prone regions specific to CUT&RUN/CUT&Tag |
| 13 | Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z (~1,372 cit) | ENCODE Blacklist v2; artifact regions to exclude from all analyses |

Step 1: Retrieve Experiment Details and Audit Status

Use `encode_get_experiment` with the accession to get full metadata including:

  • Audit status (ERROR, NOT_COMPLIANT, WARNING, INTERNAL_ACTION)
  • Replicate information (biological and technical replicates)
  • Pipeline and analysis details (which ENCODE uniform pipeline was used)
  • Quality metrics embedded in file objects
encode_get_experiment(accession="ENCSR...")

For batch assessment across multiple experiments:

encode_search_experiments(assay_title="...", organ="...", limit=50)
# Then iterate through results checking audit flags
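The iteration step above can be sketched as a screening loop. This is a hypothetical sketch: the result shape (a dict with an `"experiments"` list, each entry carrying `"accession"` and an `"audit"` count dict) is an assumption modeled on the example outputs later in this document, not a documented toolkit contract.

```python
def screen_batch(results: dict) -> list:
    """Keep accessions with no ERROR or NOT_COMPLIANT audit flags."""
    clean = []
    for exp in results.get("experiments", []):
        audit = exp.get("audit", {})
        # ERROR and NOT_COMPLIANT are the two severities that warrant exclusion
        if audit.get("ERROR", 0) == 0 and audit.get("NOT_COMPLIANT", 0) == 0:
            clean.append(exp["accession"])
    return clean
```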

Step 2: Interpret ENCODE Audit Flags

ENCODE audits are generated by automated validators during the ENCODE uniform processing pipeline (Hitz et al. 2023). They flag experiments by severity:

| Level | Meaning | Action |
|---|---|---|
| ERROR | Critical issues — data may be unreliable | Avoid using unless no alternative exists. Document thoroughly if used. |
| NOT_COMPLIANT | Does not meet current ENCODE standards | Usable with caveats. Check which specific standard is violated. |
| WARNING | Minor issues detected | Generally safe. Document the specific warning. |
| INTERNAL_ACTION | DCC processing notes | Usually not a concern for external users. |

Common audit categories and what they mean:

| Audit Category | What It Checks |
|---|---|
| replicate concordance | IDR or correlation between biological replicates |
| library complexity | NRF, PBC1, PBC2 — whether library is saturated |
| read depth | Whether minimum depth thresholds are met |
| control quality | Whether input/IgG control is adequate |
| mapping quality | Alignment rate and uniquely mapped fraction |
| peak calling | Whether peaks were called successfully, FRiP |
| antibody validation | Whether antibody meets ENCODE standards |

Present every audit flag to the user and explain each one. A single ERROR audit does not automatically disqualify an experiment — context matters.
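The severity table above can be encoded as a minimal triage rule. This is an illustrative sketch, not toolkit API; the audit dict shape mirrors the example outputs in the walkthrough below.

```python
def triage_audits(audit: dict) -> str:
    """Map ENCODE audit severity counts to a usage recommendation."""
    if audit.get("ERROR", 0) > 0:
        return "avoid unless no alternative; document thoroughly"
    if audit.get("NOT_COMPLIANT", 0) > 0:
        return "usable with caveats; check which standard is violated"
    if audit.get("WARNING", 0) > 0:
        return "generally safe; document the warning"
    return "no audit concerns"
```

Remember that this is only a first pass: a single ERROR still deserves human review in context.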

Step 3: Evaluate ChIP-seq Quality (Landt et al. 2012)

The ENCODE ChIP-seq guidelines (Landt et al. 2012) established the foundational metrics still used today. These were developed from analysis of hundreds of ChIP-seq experiments and reflect empirically-derived thresholds.

Core Metrics

| Metric | Threshold | Concern | What It Measures | Why It Matters |
|---|---|---|---|---|
| FRiP | ≥1% (TF), ≥5% (histone) | Below threshold | Fraction of reads in peaks | Signal enrichment. Very low FRiP means most reads are background. TF ChIP typically has lower FRiP than broad histone marks. |
| NSC | >1.05 | ≤1.05 | Normalized strand cross-correlation | Signal-to-noise ratio. Computed from strand shift analysis. Values near 1.0 indicate no enrichment. |
| RSC | >0.8 | ≤0.8 | Relative strand cross-correlation | Signal relative to phantom peak. More robust than NSC for shallow libraries. |
| NRF | ≥0.8 | <0.8 | Non-redundant fraction (unique/total) | Library complexity. Low NRF = excessive PCR duplication = wasted sequencing. |
| PBC1 | ≥0.8 | <0.5 | PCR bottleneck coefficient 1 | N1/Nd: fraction of locations with exactly 1 read. More sensitive than NRF at high depth. |
| PBC2 | ≥3 | <1 | PCR bottleneck coefficient 2 | N1/N2: ratio of 1-read to 2-read locations. <1 indicates severe bottleneck. |

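The complexity metrics above reduce to simple ratios over alignment counts. A sketch of that arithmetic (Landt et al. 2012 definitions), where `n_total` is total reads, `n_distinct` is distinct genomic locations, and `n1`/`n2` are locations hit by exactly one or two reads:

```python
def library_complexity(n_total: int, n_distinct: int, n1: int, n2: int) -> dict:
    """Compute NRF, PBC1, PBC2 from read/location counts."""
    return {
        "NRF": n_distinct / n_total,               # non-redundant fraction
        "PBC1": n1 / n_distinct,                   # PCR bottleneck coefficient 1
        "PBC2": n1 / n2 if n2 else float("inf"),   # PCR bottleneck coefficient 2
    }
```

A library with NRF and PBC1 near 0.9 and PBC2 above 3 passes all three complexity thresholds.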
Read Depth Requirements

| Target Type | Minimum per Replicate | Recommended | Notes |
|---|---|---|---|
| Transcription factor | 10M uniquely mapped | 20M | Narrow peaks, need depth for detection |
| Broad histone mark (H3K27me3, H3K9me3, H3K36me3) | 20M uniquely mapped | 45M | Broad domains require more reads |
| Narrow histone mark (H3K4me3, H3K27ac) | 20M uniquely mapped | 20M | Sharp peaks, similar to TF |
| Input/IgG control | 10M uniquely mapped | Match IP depth | Should match or exceed IP library depth |

IDR Analysis (Li et al. 2011)

The Irreproducible Discovery Rate provides principled assessment of replicate concordance:

| IDR Comparison | Expected | Concern | Interpretation |
|---|---|---|---|
| Nt (true replicates) | ≥50% of Np | <50% Np | Low concordance between biological replicates |
| Np (pooled pseudoreplicates) | Reference set | | Represents total discoverable peaks |
| Self-consistency (Ns) | ≥50% of Np | <50% Np | Individual replicate quality |
| Rescue ratio (Np/max(Nt,Ns)) | <2 | >2 | High ratio = one replicate much weaker |

Key insight: IDR thresholded peaks represent peaks passing replicate concordance analysis. Pseudoreplicated peaks = single-replicate fallback (lower confidence). Optimal IDR peaks from pooled data = most complete peak set.
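The concordance checks in the table above can be sketched directly from the three peak counts, using this document's definitions of Nt, Ns, and Np:

```python
def idr_checks(n_t: int, n_s: int, n_p: int) -> dict:
    """Flag replicate concordance problems from IDR peak counts."""
    return {
        "replicate_concordance_ok": n_t >= 0.5 * n_p,  # Nt should reach 50% of Np
        "self_consistency_ok": n_s >= 0.5 * n_p,       # Ns should reach 50% of Np
        "rescue_ratio": n_p / max(n_t, n_s),           # concern if > 2
    }
```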

Antibody Validation

ENCODE requires characterization for every antibody:

  • Primary: IP followed by mass spectrometry or immunoprecipitation-western
  • Secondary: At least one of: knockdown/knockout, motif enrichment, genomic annotation enrichment
  • Check the `antibody_lot_reviews` field in experiment metadata

Step 4: Evaluate ATAC-seq Quality (Buenrostro et al. 2013; Ou et al. 2018)

ATAC-seq has a distinct quality profile driven by the transposase insertion mechanism.

Core Metrics

| Metric | Good | Concern | What It Measures |
|---|---|---|---|
| TSS enrichment | ≥5 GRCh38 / ≥6 hg19 / ≥10 mm10 (ENCODE data standards) | <4 | Signal enrichment at transcription start sites. The single most informative ATAC-seq QC metric. |
| FRiP | ≥20% | <10% | Higher expected FRiP than ChIP-seq because accessible chromatin = true signal |
| Fragment size distribution | Clear nucleosomal ladder | Monotonic decay | Should show peaks at <150bp (NFR), ~200bp (mono-nuc), ~400bp (di-nuc), ~600bp (tri-nuc) |
| NFR ratio | >2× mono-nucleosomal | <1× | Ratio of sub-nucleosomal to mono-nucleosomal fragments |
| Mitochondrial reads | <20% (after filtering) | >50% | Mitochondrial DNA is highly accessible; excessive = poor nuclear enrichment |
| Duplicate rate | <30% | >50% | PCR duplication. Omni-ATAC protocol reduces this. |
| NRF | ≥0.7 | <0.5 | Library complexity, same concept as ChIP-seq |

Read Depth Requirements

| Sample Type | Minimum | Recommended |
|---|---|---|
| Bulk ATAC-seq | 25M uniquely mapped (post-dedup, post-mito filter) | 50M |
| Single-cell ATAC-seq | 25K unique fragments per cell | 50K per cell |

Fragment Size Interpretation (Buenrostro et al. 2013)

The fragment size distribution is the signature QC plot for ATAC-seq:

  • Sub-nucleosomal (<150bp): Nucleosome-free regions — these are the "open chromatin" signal
  • Mono-nucleosomal (~200bp): Single nucleosome wrapped
  • Di-nucleosomal (~400bp): Two nucleosomes
  • Tri-nucleosomal (~600bp): Three nucleosomes

A clean ATAC-seq library shows a clear nucleosomal ladder. Monotonic decay (no peaks) suggests either dead cells, over-transposition, or excessive DNA damage.
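The binning above can be sketched as a small classifier plus the NFR ratio from the metrics table. Bin edges are illustrative midpoints between the nominal fragment classes, not an established standard:

```python
def classify_fragment(length: int) -> str:
    """Assign a fragment length to a nucleosomal class."""
    if length < 150:
        return "NFR"     # nucleosome-free region signal
    if length < 300:
        return "mono"    # mono-nucleosomal (~200bp)
    if length < 500:
        return "di"      # di-nucleosomal (~400bp)
    return "tri+"        # tri-nucleosomal and larger

def nfr_ratio(lengths: list) -> float:
    """Ratio of sub-nucleosomal to mono-nucleosomal fragments (good: >2)."""
    counts = {"NFR": 0, "mono": 0, "di": 0, "tri+": 0}
    for length in lengths:
        counts[classify_fragment(length)] += 1
    return counts["NFR"] / counts["mono"] if counts["mono"] else float("inf")
```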

Step 5: Evaluate RNA-seq Quality (Conesa et al. 2016)

Core Metrics

| Metric | Good | Concern | What It Measures |
|---|---|---|---|
| Mapping rate | 70-90% uniquely mapped | <70% | Alignment success. Low rate = contamination, adapter issues, or wrong reference |
| rRNA contamination | <10% | >20% | Ribosomal RNA depletion efficiency. High = failed ribo-depletion |
| Gene body coverage | Uniform 5'→3' | Strong 3' bias | Even coverage across gene bodies. 3' bias = degraded RNA or poly-A capture bias |
| Duplication rate | <50% | >70% | PCR amplification artifacts |
| Replicate correlation | Spearman ≥0.9 (same condition) | <0.8 | Concordance between replicates |
| Exonic reads | >60% of mapped | <40% | Reads mapping to annotated exons vs intergenic |
| Intergenic reads | <10% | >20% | Reads mapping between genes — may indicate genomic DNA contamination |

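The "Concern" column above can be applied programmatically. A hypothetical screen; the metric names and dict shape are illustrative, with rates expressed as fractions in 0-1:

```python
# (threshold, direction): "min" means values below threshold are concerning,
# "max" means values above threshold are concerning.
RNA_CONCERNS = {
    "unique_mapping_rate": (0.70, "min"),
    "rrna_fraction":       (0.20, "max"),
    "duplication_rate":    (0.70, "max"),
    "replicate_spearman":  (0.80, "min"),
    "exonic_fraction":     (0.40, "min"),
    "intergenic_fraction": (0.20, "max"),
}

def rnaseq_flags(metrics: dict) -> list:
    """Return the names of metrics in the concern zone."""
    flagged = []
    for name, (cut, kind) in RNA_CONCERNS.items():
        if name not in metrics:
            continue
        value = metrics[name]
        if (kind == "min" and value < cut) or (kind == "max" and value > cut):
            flagged.append(name)
    return flagged
```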
Read Depth Requirements (Conesa et al. 2016)

| Application | Minimum | Recommended | Notes |
|---|---|---|---|
| Gene-level quantification | 10M mapped | 30M mapped | Standard bulk RNA-seq |
| Transcript-level quantification | 30M mapped | 60M mapped | Isoform detection requires more depth |
| Differential expression | 10M per sample, ≥3 bio reps | 20M per sample | Statistical power depends more on replicates than depth |
| Rare transcript detection | 50M+ mapped | 100M mapped | Long tail of expression distribution |
| Total RNA-seq | 50M+ mapped | 100M mapped | Includes non-coding RNA, intergenic transcripts |

Strand Specificity

ENCODE RNA-seq data may be stranded or unstranded:

  • Stranded: Can distinguish sense vs antisense transcription. Required for accurate quantification of overlapping genes.
  • Unstranded: Cannot resolve strand of origin. Check `run_type` and `library_strand_specificity` in metadata.

Step 6: Evaluate WGBS Quality (ENCODE data standards)

Core Metrics

| Metric | Good | Concern | What It Measures |
|---|---|---|---|
| Bisulfite conversion rate | ≥98% | <98% | Efficiency of C→U conversion of unmethylated cytosines. Measured from spike-in controls (lambda phage DNA). |
| CpG coverage | >80% of CpGs at ≥1× | <50% | Fraction of CpG sites covered by at least one read |
| Mean CpG coverage | ≥10× for DMR analysis | <5× | Average sequencing depth at CpG sites. 10× needed for reliable methylation calls. |
| Mapping rate | >60% unique | <40% | Lower than standard WGS due to reduced complexity after bisulfite conversion |
| Duplication rate | <30% | >50% | PCR duplicates |
| CpG methylation distribution | Bimodal (near 0% and near 100%) | Unimodal | Healthy cells show bimodal: most CpGs are either fully methylated or unmethylated |
| Lambda/pUC19 conversion | ≥98% conversion rate | <98% | Spike-in controls for bisulfite conversion efficiency |

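The conversion-rate arithmetic behind the spike-in row above is a single ratio: lambda phage DNA is unmethylated, so every cytosine position should read as T after conversion. A minimal sketch over base calls at lambda cytosine positions:

```python
def bisulfite_conversion_rate(c_calls: int, t_calls: int) -> float:
    """Fraction of lambda cytosine positions read as T (i.e., converted)."""
    return t_calls / (c_calls + t_calls)
```

A result below 0.98 indicates incomplete conversion, which inflates apparent methylation genome-wide.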
Platform Considerations (Foox et al. 2021)

The SEQC2 EpiQC benchmark found significant platform effects:

  • Different sequencing platforms (Illumina HiSeq, NovaSeq, MGI) can produce systematically different methylation calls
  • Cross-platform comparisons require careful normalization
  • RRBS (Reduced Representation Bisulfite Sequencing) covers only ~10% of CpGs — do NOT combine RRBS with WGBS directly

Coverage Requirements

ApplicationMinimum CpG CoverageRecommended
Methylation landscape
Differentially methylated regions5× per sample10× per sample
Allele-specific methylation15×30×
Single CpG resolution10×30×

Step 7: Evaluate Hi-C Quality (Yardimci et al. 2019)

Core Metrics

| Metric | Good | Concern | What It Measures |
|---|---|---|---|
| Cis/trans ratio | >60% cis | <40% cis | Fraction of contacts within same chromosome. Low cis = random ligation = poor quality |
| Long-range cis (>20kb) | >40% of cis | <15% | True 3D interactions vs random proximity. Short-range contacts are noise-enriched |
| Unique valid pairs | >50% of total | <25% | Pairs surviving all filters (mapping, dedup, chimera removal) |
| Duplicate rate | <40% | >60% | PCR duplicates in Hi-C are especially problematic because they inflate contact frequencies |
| Contact distance decay | Smooth P(s)∝s^-1 curve | Irregular/plateau | Expected power-law decay with genomic distance |

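The cis/trans arithmetic from the table above, sketched over valid-pair counts after filtering (function and argument names are illustrative):

```python
def hic_contact_metrics(cis_pairs: int, trans_pairs: int, cis_over_20kb: int) -> dict:
    """Compute cis fraction and long-range cis fraction from valid pairs."""
    total = cis_pairs + trans_pairs
    return {
        "cis_fraction": cis_pairs / total,                      # concern below 0.40
        "long_range_cis_fraction": cis_over_20kb / cis_pairs,   # concern below 0.15
    }
```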
Read Depth and Resolution

| Resolution Target | Minimum Valid Pairs | Recommended |
|---|---|---|
| Compartment-level (100kb) | 50M | 100M |
| TAD-level (40kb) | 200M | 500M |
| Loop-level (5-10kb) | 500M | 1B+ |
| Sub-TAD (1kb) | 2B+ | 5B+ |

Note: Hi-C resolution is not just about read depth — it also depends on restriction enzyme site density, ligation efficiency, and fragment size distribution. In situ Hi-C (Rao et al. 2014) generally produces cleaner data than dilution Hi-C.

Step 8: Evaluate CUT&RUN / CUT&Tag Quality (Skene & Henikoff 2017; Kaya-Okur et al. 2019)

These newer profiling methods have distinct quality profiles from ChIP-seq.

Key Differences from ChIP-seq

| Feature | ChIP-seq | CUT&RUN / CUT&Tag |
|---|---|---|
| Background | High (requires input control) | Low (targeted cleavage) |
| Required depth | 10-45M | 3-8M sufficient |
| FRiP | >1-5% | >20% typical |
| Input control | Required | IgG control recommended but lower priority |
| Fragment size | Size-selected ~200-600bp | Variable; CUT&RUN releases <120bp fragments |
| Spike-in | Not standard | Recommended (E. coli carry-over or added spike-in) |

CUT&RUN Quality Metrics

| Metric | Good | Concern |
|---|---|---|
| FRiP | >20% | <5% |
| Fragment size | Peak at <120bp (released fragments) | Only large fragments |
| Read depth | 3-8M unique mapped | <1M |
| Spike-in ratio | Consistent across conditions | >5× variation |
| Duplicate rate | <30% | >60% |

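Spike-in normalization, as recommended in the tables above, scales each sample by the inverse of its spike-in read count so spike-in signal becomes constant across conditions. A minimal sketch; the target constant is arbitrary and the dict shape is illustrative:

```python
def spikein_scale_factors(spikein_reads: dict, target: int = 10_000) -> dict:
    """Per-sample scale factor: target count divided by observed spike-in reads."""
    return {sample: target / n for sample, n in spikein_reads.items()}
```

After scaling, signal tracks become quantitatively comparable across conditions, which is the point of spike-in QC: inconsistent ratios (>5× variation) mean the normalization itself is unreliable.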
CUT&RUN Suspect List (Nordin et al. 2023)

CUT&RUN and CUT&Tag generate artifacts at specific genomic regions (distinct from the ENCODE Blacklist). These are regions with apparent enrichment that is not target-specific; exclude them before interpreting FRiP or peak-based metrics.

Step 9: Evaluate Single-Cell Quality (scRNA-seq and scATAC-seq)

ENCODE includes scRNA-seq and scATAC-seq experiments (primarily 10X Chromium platform). Single-cell data has distinct quality metrics from bulk assays, focused on per-cell quality rather than per-experiment signal-to-noise.

scRNA-seq Quality Metrics

| Metric | Acceptable Range | Red Flag | Notes |
|---|---|---|---|
| Genes per cell (median) | 1,500–4,000 (10X) / 4,000–8,000 (Smart-seq2) | <500 | Tissue-dependent; immune cells typically lower than epithelial |
| UMIs per cell (median) | 3,000–15,000 (10X) | <1,000 | N/A for Smart-seq2 (no UMIs) |
| Mitochondrial % (median) | <10–15% | >25% | High mito% indicates cell stress or lysis; tissue-dependent thresholds |
| Doublet rate (estimated) | 2–8% (10X, cell-count dependent) / <2% (plate-based) | >10% | Increases with cell loading density; use Scrublet or DoubletFinder |
| Mapping rate | >80% | <60% | Low mapping suggests contamination or mismapping |
| Sequencing saturation | >40% | <20% | Low saturation may miss rare transcripts |
| Cell count vs expected | Within 50–150% of expected | <30% or >200% | Very low = failed capture; very high = doublets or debris |

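The red-flag column above can be applied as a per-cell filter. An illustrative sketch only; real thresholds should be tissue- and platform-adjusted rather than global, as discussed in the pitfalls section of this step:

```python
def cell_passes_qc(genes: int, umis: int, mito_frac: float) -> bool:
    """Reject cells hitting any red-flag threshold from the table above."""
    return genes >= 500 and umis >= 1_000 and mito_frac <= 0.25
```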
scATAC-seq Quality Metrics

| Metric | Acceptable Range | Red Flag | Notes |
|---|---|---|---|
| Unique fragments per cell | >3,000 | <1,000 | Sparse data below threshold makes peak calling unreliable |
| TSS enrichment per cell | >5 | <2 | Low TSS enrichment indicates poor enrichment of accessible chromatin |
| Fraction in peaks (FRiP) | >20% | <10% | Measures signal-to-noise at single-cell level |
| Fraction of mitochondrial reads | <5% | >10% | Dead/dying cells captured |
| Duplicate rate | <40% | >60% | High duplication indicates low library complexity |

Single-Cell-Specific Quality Pitfalls

  • Ambient RNA contamination (scRNA-seq): Cell-free RNA from lysed cells during droplet capture inflates apparent expression of highly-expressed genes across all cells. Use CellBender (best-in-class), SoupX, or DecontX to estimate and remove ambient contamination BEFORE downstream analysis.
  • Barcode multiplets (scATAC-seq): Multiple cells per droplet inflate fragment counts and blur cell-type signals. ArchR and SnapATAC2 include doublet detection modules.
  • Cell-type composition bias: Quality metrics vary by cell type. A "low-quality" cell may be a small immune cell, not a damaged cell. Apply adaptive QC thresholds per cluster (e.g., miQC) rather than global cutoffs.
  • Batch effects across donors: For ENCODE tissue scRNA-seq from multiple donors, batch correction (Harmony, scVI) is typically needed before integration. The single-cell-encode skill covers integration workflows.

Where to Find Single-Cell Quality in ENCODE

Single-cell experiments in ENCODE include cell-level quality summaries in their metadata. Check:

encode_get_experiment(accession="ENCSR...")

Look for `audit` flags — ENCODE applies automated QC checks including minimum cell counts, minimum genes per cell, and maximum doublet rates. Also check the `replicates` section for library preparation details.

Step 10: Assess Replication

ENCODE requires minimum 2 independent biological replicates for released data.

Replicate Types and Their Meaning

| Replicate Type | Definition | Use Case |
|---|---|---|
| Biological | Independent biological samples | Gold standard — captures biological variation |
| Technical | Same sample, different library prep | Assesses technical reproducibility |
| Isogenic | Same genotype, different growth/collection | Common for cell lines (e.g., K562, GM12878) |
| Anisogenic | Different genotypes/donors | Common for tissue samples |

Concordance Assessment

Use `encode_list_files` to check for replicated peak files:

| File Output Type | What It Means | Confidence |
|---|---|---|
| IDR thresholded peaks | Passed replicate concordance analysis (Li et al. 2011) | Highest |
| Optimal IDR peaks | Peaks from pooled data, thresholded by IDR | Complete set |
| Conservative IDR peaks | Stricter IDR threshold | Most conservative |
| Pseudoreplicated peaks | IDR on pseudoreplicates from pooled data | Single-replicate fallback |
| Replicated peaks | Found in multiple replicates (non-IDR method) | Moderate |

Cross-Replicate Correlation

For quantitative data (RNA-seq, signal tracks):

  • Spearman correlation ≥0.9: Excellent concordance
  • 0.8-0.9: Acceptable, check for outliers
  • <0.8: Investigate — batch effect, sample mix-up, or biological variation
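Spearman correlation is rank-based, so it is robust to the heavy-tailed expression values typical of signal tracks. A self-contained sketch (in practice `scipy.stats.spearmanr` does this; the pure-Python version below just makes the rank-then-Pearson logic explicit):

```python
def spearman(x: list, y: list) -> float:
    """Spearman rank correlation: Pearson correlation of average ranks."""
    def rank(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # group ties together and assign them their average rank
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        return ranks

    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```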

Step 11: Apply the ENCODE Blacklist (Amemiya et al. 2019)

Before interpreting any peak-based quality metric, confirm that the ENCODE Blacklist has been applied:

  • Blacklist v2 regions: High-signal artifacts (satellite repeats, centromeric regions, high-copy sequences)
  • Available at: https://github.com/Boyle-Lab/Blacklist/
  • hg38: `hg38-blacklist.v2.bed.gz` (910 regions)
  • mm10: `mm10-blacklist.v2.bed.gz`

Failure to remove blacklisted regions will inflate FRiP, create false peaks, and confound enrichment analyses. If analyzing CUT&RUN/CUT&Tag, also apply the CUT&RUN suspect list (Nordin et al. 2023).
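Blacklist removal is a simple interval-overlap filter. A sketch using half-open `(chrom, start, end)` tuples, which is how BED coordinates are conventionally interpreted:

```python
def remove_blacklisted(peaks: list, blacklist: list) -> list:
    """Drop any peak that overlaps a blacklist interval on the same chromosome."""
    def overlaps(a, b):
        # same chromosome, and the half-open intervals intersect
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    return [p for p in peaks if not any(overlaps(p, bl) for bl in blacklist)]
```

In practice this is usually done with `bedtools intersect -v` against the downloaded blacklist BED; the sketch just shows the logic.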

Step 12: File Quality Tiers and Selection

When listing files with `encode_list_files`, use quality-informed selection:

# Get preferred default files (ENCODE's recommendation)
encode_list_files(experiment_accession="ENCSR...", preferred_default=True)

# Get IDR thresholded peaks (gold standard for ChIP-seq)
encode_list_files(experiment_accession="ENCSR...", output_type="IDR thresholded peaks", assembly="GRCh38")

# Get signal tracks for visualization
encode_list_files(experiment_accession="ENCSR...", output_type="fold change over control", assembly="GRCh38")

File Quality Hierarchy

| Priority | File Type | When to Use |
|---|---|---|
| 1 | preferred_default=True | ENCODE's recommended files — start here |
| 2 | IDR thresholded peaks | Gold standard for ChIP-seq peak calls |
| 3 | Fold change over control | Normalized signal for visualization and quantitative comparison |
| 4 | Signal of unique reads | Clean signal tracks (unnormalized) |
| 5 | Pseudoreplicated peaks | Fallback when IDR fails or only 1 replicate available |
| 6 | Unfiltered alignments | Only for custom re-analysis |

Step 13: Summarize Quality Verdict

Provide a structured quality assessment:

Quality Tiers

| Tier | Criteria | Recommendation |
|---|---|---|
| High quality | No ERROR/NOT_COMPLIANT audits, all metrics above thresholds, ≥2 biological replicates, IDR peaks available | Use confidently. Ideal for primary analysis. |
| Usable with caveats | WARNING-level audits or borderline metrics (within 20% of threshold), good replication | Usable. Document specific limitations in methods. |
| Use with caution | NOT_COMPLIANT flags, one metric below threshold, or single replicate | Use only if no better alternative. Document all issues. Flag in results. |
| Not recommended | ERROR flags, multiple metrics below threshold, poor replication, no IDR peaks | Avoid. Seek alternative experiments or datasets. |

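The tier table above can be encoded as a decision rule. This is one illustrative mapping of the criteria to code (the inputs summarize an experiment's audits, metric failures, and replication), not an official ENCODE algorithm:

```python
def quality_tier(n_error: int, n_not_compliant: int, n_warning: int,
                 n_metrics_below: int, n_bio_reps: int,
                 has_idr_peaks: bool) -> str:
    """Assign a quality tier from audit counts, failed metrics, and replication."""
    if n_error > 0 or n_metrics_below >= 2:
        return "not recommended"
    if n_not_compliant > 0 or n_metrics_below == 1 or n_bio_reps < 2:
        return "use with caution"
    if n_warning > 0 or not has_idr_peaks:
        return "usable with caveats"
    return "high quality"
```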
Quality Summary Template

For each experiment assessed, provide:

  1. Accession: ENCSR...
  2. Assay: type and target
  3. Audit status: list all flags with explanations
  4. Key metrics: table with values and pass/fail
  5. Replication: number and type of replicates, concordance
  6. Pipeline: which ENCODE uniform pipeline version was used
  7. Verdict: tier assignment with justification
  8. Caveats: any specific limitations to note

Pitfalls and Common Mistakes

  1. Single-metric decisions: No single metric captures quality. FRiP alone can be misleading — some TF ChIP-seq with biological signal has low FRiP due to focal binding patterns. Always evaluate collectively.

  2. Comparing across assays: Do NOT compare ChIP-seq metrics to ATAC-seq metrics to CUT&RUN metrics. Each assay has its own quality profile and thresholds.

  3. Ignoring batch effects: Experiments from different labs, dates, or platforms may have systematic quality differences. When combining data, check for batch-correlated quality variation.

  4. Assembly mismatch: Quality metrics computed on different assemblies (hg19 vs GRCh38) may differ slightly. Always verify the assembly of quality metrics matches your analysis assembly.

  5. Antibody lot variation: The same antibody target can show different enrichment across lots. Check `antibody_lot_reviews` in ENCODE metadata.

  6. Read depth ≠ quality: A deeply sequenced bad library is still a bad library. Check NRF/PBC first — if complexity is exhausted, more sequencing wastes resources.

  7. Control quality matters: An IP library is only as good as its control. Poor input/IgG control undermines all downstream peak-based metrics.

  8. Newer assays, different rules: CUT&RUN and CUT&Tag have inherently different quality profiles from ChIP-seq. Applying ChIP-seq thresholds to CUT&RUN will incorrectly flag high-quality data.

Walkthrough: Quality Assessment of ENCODE ChIP-seq Before Analysis

Goal: Evaluate the quality of ENCODE ChIP-seq experiments against ENCODE consortium standards before including them in downstream analysis. Context: Not all ENCODE experiments meet the highest quality standards. Quality assessment prevents garbage-in-garbage-out in aggregation and integration analyses.

Step 1: Get experiment details and audit status

encode_get_experiment(accession="ENCSR000AKA")

Expected output:

{
  "accession": "ENCSR000AKA",
  "assay_title": "Histone ChIP-seq",
  "target": "H3K27ac",
  "biosample_summary": "GM12878",
  "replicates": 2,
  "status": "released",
  "audit": {"ERROR": 0, "NOT_COMPLIANT": 0, "WARNING": 1}
}

Interpretation: 0 ERRORs and 0 NOT_COMPLIANT = experiment meets ENCODE standards. 1 WARNING is acceptable.

Step 2: Check file-level quality

encode_list_files(accession="ENCSR000AKA", file_format="bed", output_type="IDR thresholded peaks", assembly="GRCh38")

Step 3: Review quality metrics

Key ChIP-seq quality thresholds (Landt et al. 2012):

| Metric | Threshold | Meaning |
|---|---|---|
| FRiP | >= 1% | Signal enrichment over background |
| NSC | > 1.05 | Strand cross-correlation signal |
| RSC | > 0.8 | Relative strand correlation |
| NRF | >= 0.8 | Library complexity |
| IDR | < 0.05 | Reproducibility between replicates |

Step 4: Track quality-verified experiments

encode_track_experiment(accession="ENCSR000AKA", notes="QC PASSED: FRiP=3.2%, NSC=1.12, RSC=0.95, 0 audit errors")

Integration with downstream skills

  • Quality-filtered experiments feed into histone-aggregation and other aggregation skills
  • QC metrics inform pipeline-chipseq parameter tuning
  • Audit status guides search-encode experiment selection
  • QC documentation supports data-provenance and scientific-writing

Code Examples

1. Check experiment audit status

encode_get_experiment(accession="ENCSR000AKA")

Expected output:

{
  "accession": "ENCSR000AKA",
  "audit": {"ERROR": 0, "NOT_COMPLIANT": 0, "WARNING": 1}
}

2. List files to check quality metrics

encode_list_files(accession="ENCSR000AKA", file_format="bed", assembly="GRCh38")

Expected output:

{
  "files": [
    {"accession": "ENCFF001ABC", "output_type": "IDR thresholded peaks", "file_size_mb": 1.1}
  ]
}

3. Get detailed file info for QC

encode_get_file_info(accession="ENCFF001ABC")

Expected output:

{
  "accession": "ENCFF001ABC",
  "file_format": "bed narrowPeak",
  "output_type": "IDR thresholded peaks",
  "assembly": "GRCh38",
  "quality_metrics": {"frip": 0.032, "nsc": 1.12, "rsc": 0.95}
}

Integration

| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Quality-verified experiments | histone-aggregation | Only aggregate high-quality data |
| QC pass/fail decisions | search-encode | Filter search results by quality |
| Quality metric reports | data-provenance | Document QC criteria used |
| Audit interpretation | pipeline-guide | Guide reprocessing decisions |
| QC documentation | scientific-writing | Methods section QC reporting |
| Quality thresholds | publication-trust | Verify QC threshold citations |
| Validated experiment lists | batch-analysis | Process only quality-approved experiments |
| QC-filtered peaks | regulatory-elements | High-confidence regulatory element maps |

Related Skills

  • publication-trust: Assess scientific integrity of publications before relying on their methods or findings — complements experiment quality with publication quality
  • histone-aggregation: Uses quality filtering before aggregating peaks
  • accessibility-aggregation: ATAC-seq quality directly impacts aggregation
  • data-provenance: Log quality decisions and thresholds used
  • compare-biosamples: Quality must be comparable across samples being compared
  • integrative-analysis: All data sources need quality assessment before integration
  • single-cell-encode: Quality metrics for scRNA-seq and scATAC-seq (genes/cell, fragments/cell, TSS enrichment)
  • epigenome-profiling: Quality assessment is a prerequisite for epigenomic profile assembly
  • variant-annotation: Quality of ENCODE experiments determines reliability of variant annotation
  • pipeline-guide: Pipeline version affects quality metric computation
  • batch-analysis: Batch QC screening across multiple experiments for systematic quality filtering

Presenting Results

  • Present QC metrics as a traffic-light table: metric | value | threshold | status (PASS/WARN/FAIL). Always include the ENCODE audit level. Suggest: "Would you like to filter to only experiments meeting all QC thresholds?"

For the request: "$ARGUMENTS"