Encode-toolkit jaspar-motifs

Guide for using JASPAR transcription factor binding profiles with ENCODE ChIP-seq data. Use when users need to find TF binding motifs in ENCODE peaks, validate ChIP-seq targets with known motifs, or scan regulatory regions for TF binding potential. Trigger on: JASPAR, motif database, binding profile, PWM, position weight matrix, TF motif, motif enrichment, motif scanning, binding site prediction.

install
source · Clone the upstream repo
git clone https://github.com/ammawla/encode-toolkit
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ammawla/encode-toolkit "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jaspar-motifs" ~/.claude/skills/ammawla-encode-toolkit-jaspar-motifs-00675d && rm -rf "$T"
manifest: skills/jaspar-motifs/SKILL.md
source content

Using JASPAR Transcription Factor Binding Profiles with ENCODE ChIP-seq Data

Integrate JASPAR position weight matrices (PWMs) with ENCODE ChIP-seq peaks to validate TF binding targets, discover co-binding partners, and scan regulatory elements for TF binding potential.

Scientific Rationale

The question: "Does the expected TF binding motif appear in my ENCODE ChIP-seq peaks, and what other TF motifs are enriched?"

ENCODE TF ChIP-seq experiments identify where a transcription factor binds in the genome, but the peak coordinates alone do not confirm direct DNA binding or reveal the binding sequence specificity. JASPAR provides curated position weight matrices (PWMs) — mathematical representations of TF binding preferences — that enable two critical analyses:

  1. Target validation: If CTCF ChIP-seq peaks are enriched for the CTCF motif (JASPAR MA0139.1), the experiment worked correctly. If they are NOT enriched, something may be wrong with the antibody, crosslinking, or peak calling.

  2. Co-factor discovery: Motif enrichment analysis in ChIP-seq peaks often reveals motifs for co-binding TFs that were not the ChIP target, uncovering regulatory complexes.

What JASPAR Provides

  • 900+ curated TF binding profiles across 7 taxonomic groups
  • Position Frequency Matrices (PFMs), Position Weight Matrices (PWMs), and sequence logos
  • Multiple profile versions reflecting binding mode diversity
  • Quality scores (based on validation evidence)
  • Taxonomic classification and TF structural class annotation
  • REST API for programmatic access

The ENCODE-JASPAR Synergy

ENCODE providesJASPAR providesTogether
Where a TF binds (peak coordinates)How a TF recognizes DNA (binding motif)Validated binding sites with sequence specificity
TF binding in specific tissuesUniversal binding preferencesTissue-specific motif usage
Co-occupancy data (multiple ChIP-seq)Co-factor motif profilesRegulatory complex architecture
Chromatin context (accessibility, marks)Motif sequence requirementsContext-dependent binding rules

Key Literature

  • Castro-Mondragon et al. 2022 "JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles" (Nucleic Acids Research, ~1,400 citations). The current JASPAR release with 1,956 profiles across 7 taxonomic groups, including unvalidated (UNVALIDATED collection) profiles. Introduced TFBSTools integration and improved REST API. DOI: 10.1093/nar/gkab1113
  • Sandelin et al. 2004 "JASPAR: an open-access database for eukaryotic transcription factor binding profiles" (Nucleic Acids Research, ~2,000 citations). The founding JASPAR publication establishing the curated, open-access model for TF binding profiles. DOI: 10.1093/nar/gkh012
  • Grant et al. 2011 "FIMO: scanning for occurrences of a given motif" (Bioinformatics, ~2,500 citations). FIMO (Find Individual Motif Occurrences) — the standard tool for scanning sequences with PWMs. Part of the MEME Suite. DOI: 10.1093/bioinformatics/btr064
  • Heinz et al. 2010 "Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities" (Molecular Cell, ~5,000 citations). Introduced HOMER motif analysis — the most widely used tool for de novo and known motif enrichment in ChIP-seq peaks. DOI: 10.1016/j.molcel.2010.05.004
  • ENCODE Project Consortium 2020 (Nature, ~1,656 citations). The TF ChIP-seq experiments that JASPAR motifs validate and enrich. DOI: 10.1038/s41586-020-2493-4

When to Use This Skill

ScenarioHow JASPAR Helps
Validating ENCODE TF ChIP-seqCheck if target TF motif is enriched in peaks
Finding co-binding TFsScan peaks for additional enriched motifs
Interpreting ENCODE enhancersIdentify which TFs can bind enhancer sequences
Variant in TF binding siteCheck if variant disrupts a JASPAR motif
Comparing TF binding across tissuesDetermine if same motif is used in different contexts
Planning CRISPR validationIdentify core motif bases to mutate

JASPAR REST API Reference

Base URL:

https://jaspar.genereg.net/api/v1/

No authentication required. Responses are JSON.

Key Endpoints

EndpointPurposeKey Parameters
/matrix/
List/search all profiles
name
,
collection
,
tax_group
,
tf_class
/matrix/{id}/
Get specific profileMatrix ID (e.g., MA0139.1)
/matrix/{id}/?format=pfm
Get PFM (counts)
/matrix/{id}/?format=pwm
Get PWM (log-odds)
/matrix/{id}/?format=jaspar
Get JASPAR format
/matrix/{id}/?format=meme
Get MEME formatReady for FIMO scanning
/taxon/
List taxonomic groups
/tfclass/
List TF structural classes

Common Matrix IDs for ENCODE TFs

TFJASPAR IDClassNotes
CTCFMA0139.1C2H2 zinc fingerMost common ENCODE TF ChIP-seq target
TP53 (p53)MA0106.3p53 familyTumor suppressor
SP1MA0079.5C2H2 zinc fingerGC-rich promoter binding
FOXA1MA0148.4ForkheadPioneer factor
FOXA2MA0047.3ForkheadLiver, pancreas
HNF4AMA0114.4Nuclear receptorHepatocyte-enriched
NRF1MA0506.2bZIPMitochondrial regulation
REST (NRSF)MA0138.2C2H2 zinc fingerNeuronal gene repressor
MYCMA0147.3bHLHOncogene, E-box binding
JUN (AP-1)MA0488.1bZIPImmediate early response
GATA4MA0482.2GATACardiac, endoderm
PAX6MA0069.1Paired boxEye, brain development

Step 1: Retrieve ENCODE ChIP-seq Peaks

# Find TF ChIP-seq experiments
encode_search_experiments(
    assay_title="TF ChIP-seq",
    target="CTCF",
    organ="pancreas",
    biosample_type="tissue"
)

# Get IDR thresholded peaks (highest confidence)
encode_list_files(
    experiment_accession="ENCSR...",
    file_format="bed",
    output_type="IDR thresholded peaks",
    assembly="GRCh38",
    preferred_default=True
)

Track the experiment:

encode_track_experiment(accession="ENCSR...", notes="CTCF ChIP-seq for motif analysis")

Step 2: Get JASPAR Motif Profiles

Query by TF Name

import requests

def get_jaspar_matrix(tf_name, tax_group="vertebrates", collection="CORE"):
    """Get JASPAR matrix for a TF."""
    url = "https://jaspar.genereg.net/api/v1/matrix/"
    params = {
        "name": tf_name,
        "tax_group": tax_group,
        "collection": collection,
        "format": "json"
    }
    response = requests.get(url, params=params)
    results = response.json()["results"]
    if results:
        # Return the highest-version profile
        return sorted(results, key=lambda x: x["version"], reverse=True)[0]
    return None

ctcf_profile = get_jaspar_matrix("CTCF")
print(f"ID: {ctcf_profile['matrix_id']}, Version: {ctcf_profile['version']}")

Get the Position Frequency Matrix (PFM)

def get_pfm(matrix_id):
    """Get Position Frequency Matrix from JASPAR."""
    url = f"https://jaspar.genereg.net/api/v1/matrix/{matrix_id}/"
    params = {"format": "json"}
    response = requests.get(url, params=params)
    data = response.json()
    pfm = data["pfm"]
    # pfm is a dict with keys A, C, G, T, each a list of counts per position
    return pfm

pfm = get_pfm("MA0139.1")
print(f"Motif length: {len(pfm['A'])} bp")
for base in ["A", "C", "G", "T"]:
    print(f"{base}: {pfm[base]}")

Get MEME Format (for FIMO Scanning)

def get_meme_format(matrix_id):
    """Get motif in MEME format for use with FIMO."""
    url = f"https://jaspar.genereg.net/api/v1/matrix/{matrix_id}/"
    params = {"format": "meme"}
    response = requests.get(url, params=params)
    return response.text

meme_motif = get_meme_format("MA0139.1")
# Save to file for FIMO input
with open("ctcf_motif.meme", "w") as f:
    f.write(meme_motif)

Step 3: Extract Peak Sequences

Before scanning for motifs, extract the DNA sequences underlying ENCODE peaks.

Using bedtools getfasta

# Download reference genome (if not available)
# GRCh38: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

# Extract sequences from ENCODE peak regions
# Use summit +/- 100bp for narrow peaks (better motif enrichment)
awk 'BEGIN{OFS="\t"} {mid=int(($2+$3)/2); print $1, mid-100, mid+100, $4}' \
    encode_peaks.bed > peak_summits_200bp.bed

bedtools getfasta \
    -fi hg38.fa \
    -bed peak_summits_200bp.bed \
    -fo peak_sequences.fa

Summit-Centered Extraction

For motif analysis, center on peak summits (column 10 in narrowPeak format):

# narrowPeak summit is relative to peak start (column 10)
awk 'BEGIN{OFS="\t"} {summit=$2+$10; print $1, summit-100, summit+100, $4}' \
    encode_narrowpeak.bed > summit_regions.bed

Step 4: Motif Scanning and Enrichment

Option A: FIMO (Individual Motif Occurrences)

FIMO scans sequences for individual occurrences of a given motif.

# Scan peak sequences for CTCF motif
fimo --thresh 1e-4 \
     --oc fimo_output/ \
     ctcf_motif.meme \
     peak_sequences.fa

# Output: fimo_output/fimo.tsv with columns:
# motif_id, motif_alt_id, sequence_name, start, stop, strand, score, p-value, q-value, matched_sequence

Option B: HOMER (Known and De Novo Motif Enrichment)

HOMER findMotifsGenome.pl is the most popular tool for ChIP-seq motif analysis.

# Known motif enrichment
findMotifsGenome.pl \
    encode_peaks.bed \
    hg38 \
    homer_output/ \
    -size 200 \
    -mask \
    -p 4

# Output includes:
# knownResults.html — enrichment of known motifs (including JASPAR)
# homerResults.html — de novo discovered motifs

Option C: MEME-ChIP (Comprehensive Suite)

# Full motif analysis pipeline
meme-chip \
    -oc meme_output/ \
    -db JASPAR2022_CORE_vertebrates.meme \
    peak_sequences.fa

Interpreting Enrichment Results

Enrichment ResultInterpretation
Target TF motif ranked #1 with p < 1e-100Excellent — ChIP-seq validated
Target TF motif ranked #1 but p only 1e-5Moderate — may indicate indirect binding or weak motif
Target TF motif NOT in top 10Concerning — may indicate antibody cross-reactivity, indirect binding, or wrong motif version
Unexpected TF motif ranked #1May indicate co-factor or pioneer factor binding
Multiple TF motifs highly enrichedRegulatory hub — multiple TFs co-bind

Step 5: Variant Impact on TF Motifs

When a variant falls within an ENCODE TF ChIP-seq peak, check whether it disrupts the underlying motif.

Motif Disruption Analysis

import numpy as np

def score_sequence(sequence, pwm):
    """Score a sequence against a PWM (log-odds)."""
    base_to_idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    score = 0
    for i, base in enumerate(sequence.upper()):
        if base in base_to_idx:
            score += pwm[base][i]
    return score

def variant_motif_impact(ref_seq, alt_seq, pwm):
    """Calculate motif score change from variant."""
    ref_score = score_sequence(ref_seq, pwm)
    alt_score = score_sequence(alt_seq, pwm)
    delta = alt_score - ref_score
    return {
        "ref_score": ref_score,
        "alt_score": alt_score,
        "delta_score": delta,
        "disrupted": delta < -2  # Threshold: >2 log-odds decrease
    }

Using motifbreakR (R Package)

library(motifbreakR)
library(BSgenome.Hsapiens.UCSC.hg38)

# Define variant
variant <- snps.from.rsid(
    rsid = "rs7903146",
    dbSNP = SNPlocs.Hsapiens.dbSNP155.GRCh38
)

# Scan against JASPAR motifs
results <- motifbreakR(
    snpList = variant,
    filterp = TRUE,
    pwmList = MotifDb,
    threshold = 1e-4,
    method = "log",
    bkg = c(A=0.25, C=0.25, G=0.25, T=0.25),
    show.neutral = FALSE
)

Step 6: Multi-TF Co-Binding Analysis

ENCODE often has ChIP-seq for multiple TFs in the same biosample. JASPAR motifs can reveal co-binding logic.

Workflow

  1. Get all TF ChIP-seq peaks for a biosample:
encode_search_experiments(
    assay_title="TF ChIP-seq",
    biosample_term_name="K562"
)
  1. Find peaks shared by multiple TFs (co-occupied regions)
  2. Scan co-occupied peaks for JASPAR motifs to identify the binding grammar
  3. Grammar example: GATA motif + TAL1 motif within 50bp = erythroid enhancer

Motif Spacing Analysis

TF cooperativity often requires specific motif spacing:

# Find GATA and TAL1 motifs in co-occupied peaks
fimo --thresh 1e-4 gata_motif.meme cooccupied_peaks.fa > gata_hits.tsv
fimo --thresh 1e-4 tal1_motif.meme cooccupied_peaks.fa > tal1_hits.tsv

# Analyze spacing between GATA and TAL1 motifs in same peaks
# Characteristic spacing indicates cooperative binding

Step 7: Present Results

Target Validation Table

TF ChIP-seqTarget MotifJASPAR IDEnrichment p-value% Peaks with MotifValidation
CTCFCTCFMA0139.11e-245678%Strong
FOXA2FOXA2MA0047.31e-34545%Strong
HNF4AHNF4AMA0114.41e-18952%Strong
EP300No specific motifExpected (coactivator)

Co-Factor Discovery Table

TF ChIP-seq TargetUnexpected Enriched MotifJASPAR IDp-valueInterpretation
HNF4A in liverFOXA2MA0047.31e-78Known co-binding at liver enhancers
GATA1 in K562TAL1MA0140.21e-234Erythroid TF complex
TP53SP1MA0079.51e-12Co-regulation at GC-rich promoters

Pitfalls & Edge Cases

  • Motif similarity causes false matches: Many TF families share similar motifs (e.g., all bHLH factors bind E-boxes). A JASPAR motif match does not prove which specific family member is binding. Combine with ChIP-seq evidence.
  • PWM score thresholds are arbitrary: JASPAR motifs use position weight matrices but the "significant" score threshold varies by motif length and information content. Use relative scores (>80% of max) rather than absolute cutoffs.
  • Redundant motifs in JASPAR: JASPAR contains multiple profiles for the same TF from different studies. These profiles may have different quality. Prefer the most recent version and check the data source (ChIP-seq > SELEX > PBM).
  • GC content bias in background model: Motif enrichment depends heavily on the background model. Genomic sequence, shuffled peaks, and matched random regions give very different p-values. Always use matched GC-content background.
  • Motif presence ≠ binding: A genomic sequence matching a JASPAR PWM does not mean the TF actually binds there. Chromatin accessibility, co-factor availability, and DNA methylation all modulate binding. Validate with ChIP-seq or ATAC-seq footprinting.
  • Species-specific motifs: JASPAR motifs are often derived from one species. A mouse-derived motif may not perfectly represent the human TF binding preference. Check the taxon field when using motifs across species.

Walkthrough: Identifying Transcription Factor Binding Motifs in ENCODE Enhancers

Goal: Scan ENCODE-defined enhancer peaks for enriched transcription factor binding motifs using the JASPAR database to predict which TFs regulate enhancer activity. Context: ENCODE H3K27ac peaks mark active enhancers, but don't reveal which TFs bind there. JASPAR motif scanning predicts TF occupancy.

Step 1: Find H3K27ac enhancer experiments

encode_search_experiments(assay_title="Histone ChIP-seq", organ="liver", target="H3K27ac", organism="Homo sapiens")

Expected output:

{
  "total": 6,
  "results": [
    {"accession": "ENCSR100LIV", "assay_title": "Histone ChIP-seq", "target": "H3K27ac", "biosample_summary": "liver"}
  ]
}

Step 2: Download enhancer peak files

encode_list_files(accession="ENCSR100LIV", file_format="bed", output_type="IDR thresholded peaks", assembly="GRCh38")

Expected output:

{
  "files": [
    {"accession": "ENCFF150ENH", "output_type": "IDR thresholded peaks", "file_format": "bed narrowPeak", "file_size_mb": 1.1}
  ]
}

Step 3: Query JASPAR for liver-relevant TF motifs

Using JASPAR REST API (via skill guidance):

GET https://jaspar.elixir.no/api/v2/matrix/?tax_id=9606&collection=CORE&profile_class=liver

Key liver TFs and their JASPAR matrix IDs:

  • HNF4A (MA0114.4) — master hepatocyte TF
  • CEBPA (MA0102.4) — liver differentiation
  • FOXA2 (MA0047.3) — pioneer factor for liver enhancers

Step 4: Scan enhancer peaks for motif enrichment

# Extract sequences under peaks
bedtools getfasta -fi GRCh38.fa -bed ENCFF150ENH.bed -fo enhancer_seqs.fa

# Scan with MEME/FIMO using JASPAR motifs
fimo --thresh 1e-4 JASPAR_liver_motifs.meme enhancer_seqs.fa > motif_hits.tsv

Interpretation: If HNF4A motifs are enriched 5× over background in liver enhancers, this confirms HNF4A as a key driver. Unexpected motifs (e.g., TP53) may suggest stress response enhancers.

Step 5: Validate with ENCODE TF ChIP-seq

encode_search_experiments(assay_title="TF ChIP-seq", organ="liver", target="HNF4A", organism="Homo sapiens")

Expected output:

{
  "total": 3,
  "results": [
    {"accession": "ENCSR200HNF", "assay_title": "TF ChIP-seq", "target": "HNF4A", "biosample_summary": "liver"}
  ]
}

Interpretation: If JASPAR-predicted HNF4A motif sites overlap actual HNF4A ChIP-seq peaks → validated motif prediction.

Integration with downstream skills

  • Enhancer peaks from → peak-annotation provide regions for motif scanning
  • Motif-predicted TFs validated against → motif-analysis de novo motif discovery
  • TF binding predictions inform → regulatory-elements enhancer classification
  • Predicted TF-enhancer links feed into → disease-research for TF-disease connections

Code Examples

1. Find TF ChIP-seq to validate motif predictions

encode_get_facets(assay_title="TF ChIP-seq", facet_field="target.label", organ="liver", organism="Homo sapiens")

Expected output:

{
  "facets": {
    "target.label": {"HNF4A": 3, "CEBPA": 2, "FOXA2": 2, "RXRA": 2, "TP53": 1}
  }
}

2. Compare predicted vs. observed TF binding

encode_compare_experiments(accession_1="ENCSR100LIV", accession_2="ENCSR200HNF")

Expected output:

{
  "comparison": {
    "shared": {"organ": "liver", "organism": "Homo sapiens", "assembly": "GRCh38"},
    "differences": {
      "assay": ["Histone ChIP-seq", "TF ChIP-seq"],
      "target": ["H3K27ac", "HNF4A"]
    }
  }
}

3. Track motif analysis experiments

encode_track_experiment(accession="ENCSR100LIV", notes="Liver H3K27ac for JASPAR motif scanning - HNF4A/CEBPA/FOXA2")

Expected output:

{
  "status": "tracked",
  "accession": "ENCSR100LIV",
  "notes": "Liver H3K27ac for JASPAR motif scanning - HNF4A/CEBPA/FOXA2"
}

Integration

This skill produces...Feed into...Purpose
TF motif enrichment scoresmotif-analysisCompare JASPAR database motifs with de novo discovered motifs
Predicted TF binding sitespeak-annotationAnnotate enhancer peaks with predicted TF regulators
TF-enhancer regulatory linksregulatory-elementsClassify enhancers by predicted TF driver identity
Motif-disrupting variant positionsvariant-annotationIdentify SNPs that alter TF binding motifs
Tissue-specific TF motif profilescompare-biosamplesCompare TF regulatory programs between tissues
TF binding predictionsdisease-researchConnect TF motif disruption to disease mechanisms
Motif scanning resultsvisualization-workflowGenerate motif logos and enrichment heatmaps
Validated TF-target pairsgtex-expressionCheck TF expression in tissue via GTEx

Presenting Results

When reporting JASPAR motif scanning results:

  • Motif hit table: Present a table with columns: TF_name, matrix_id (e.g., MA0139.1), p-value, score, genomic_location (chr:start-end), strand, and matched_sequence
  • Always report: JASPAR version (e.g., JASPAR 2024), scanning tool and version (FIMO, MOODS, or PWMScan), p-value threshold used, and genome assembly scanned
  • Key fields to include: Total regions scanned, number of motif hits passing threshold, number of unique TFs with hits, and whether repeat masking was applied
  • Context to provide: Emphasize that motif presence does not guarantee TF binding in vivo -- chromatin accessibility (ATAC-seq/DNase-seq) and actual TF occupancy (ChIP-seq) from ENCODE provide essential validation layers
  • Enrichment context: If testing for enrichment over background, report the background model (genome-wide, GC-matched, shuffled), fold enrichment, and statistical test used
  • Next steps: Suggest
    motif-analysis
    for de novo motif discovery with HOMER/MEME, or
    regulatory-elements
    to characterize the ENCODE regulatory elements containing the motif hits

Related Skills

  • regulatory-elements
    — Characterizing ENCODE regulatory elements that motifs help annotate
  • epigenome-profiling
    — Building tissue epigenomic profiles including TF binding landscapes
  • variant-annotation
    — Assessing whether variants disrupt TF binding motifs in ENCODE peaks
  • compare-biosamples
    — Comparing TF binding and motif usage across ENCODE biosamples
  • quality-assessment
    — Using motif enrichment as a quality control metric for TF ChIP-seq
  • gwas-catalog
    — GWAS variants that disrupt TF motifs in ENCODE peaks
  • publication-trust
    — Verify literature claims backing analytical decisions

For the request: "$ARGUMENTS"