SciAgent-Skills homer-motif-analysis
De novo motif discovery and known TF motif enrichment in ChIP-seq and ATAC-seq peak sets using HOMER. findMotifsGenome.pl searches for over-represented sequence patterns against a background; annotatePeaks.pl assigns genomic context (TSS distance, gene, repeat). Use after MACS3 peak calling to identify which transcription factors are enriched in your peaks, annotate peaks with nearest genes, and validate ChIP-seq quality by checking the target TF's own motif.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/homer-motif-analysis" ~/.claude/skills/jaechang-hits-sciagent-skills-homer-motif-analysis && rm -rf "$T"
skills/genomics-bioinformatics/homer-motif-analysis/SKILL.mdHOMER — Motif Analysis and Peak Annotation
Overview
HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of Perl/C++ tools for analyzing genomic regulatory elements. Its two primary commands are
findMotifsGenome.pl, which performs de novo motif discovery and known motif enrichment against JASPAR/HOMER databases, and annotatePeaks.pl, which maps each peak to the nearest gene, distance to TSS, and genomic feature class (promoter, intron, intergenic, repeat). HOMER takes BED-format peak files from MACS3 or similar peak callers and a reference genome assembly as input, and outputs HTML/text reports ranking enriched motifs by p-value and fold enrichment over a matched background.
When to Use
- Identifying which transcription factors are bound in a ChIP-seq peak set by enriching their known motifs from JASPAR or the HOMER motif library
- Discovering novel sequence motifs de novo in open chromatin regions from ATAC-seq without prior knowledge of the binding TF
- Comparing motif landscapes between two conditions (e.g., treated vs. untreated peak sets) by running HOMER with one set as target and the other as background
- Annotating genomic peaks with nearest genes and distance to TSS for downstream functional analysis or integration with DESeq2 results
- Validating ChIP-seq experiment quality: a successful pull-down should show the target TF's canonical motif as the top hit
- Use
first to generate the peak BED files that serve as input to HOMERmacs3-peak-calling - Use
to cross-reference HOMER-discovered motifs with JASPAR IDs and additional TF metadatajaspar-database - Use
(web or local) when you need a more probabilistic ZOOPS/TCM model or the MEME Suite ecosystemMEME-CHIP - Use
(part of MEME Suite) as a faster alternative for known motif scanning without de novo discoveryAME
Prerequisites
- Software: HOMER (Perl + compiled binaries), conda or manual install
- Genomes: must download genome sequence via
after HOMER installinstallGenome.pl - Input: BED file of peaks (at minimum: chr, start, end columns); ideally summit-centered peaks from MACS3
- Python packages (for parsing/visualization):
,pandas
,matplotlibseaborn
Check before installing: The tool may already be available in the current environment (e.g., inside a
/pixienv). Runcondafirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viacommand -v findMotifsGenome.plrather than barepixi run findMotifsGenome.pl.findMotifsGenome.pl
# Install HOMER via conda (recommended — handles Perl dependencies) conda install -c bioconda homer # Verify installation findMotifsGenome.pl 2>&1 | head -3 # Usage: findMotifsGenome.pl <peak/BED file> <genome> <output directory> [options] annotatePeaks.pl 2>&1 | head -3 # Usage: annotatePeaks.pl <peak/BED file> <genome> [options] # Install reference genomes (downloads 2-way masker + sequence; ~3–10 GB each) installGenome.pl hg38 installGenome.pl mm10 # Install Python parsing dependencies pip install pandas matplotlib seaborn
Quick Start
# Run de novo + known motif enrichment on TF ChIP-seq peaks (hg38, 200 bp window) findMotifsGenome.pl peaks/tf_chip_summits.bed hg38 motif_output/ \ -size 200 -mask -p 4 # Annotate peaks with nearest genes and genomic features annotatePeaks.pl peaks/tf_chip_peaks.narrowPeak hg38 > annotated_peaks.txt echo "Top known motif:" head -2 motif_output/knownResults.txt | tail -1 | cut -f1-4 echo "Annotated peaks: $(wc -l < annotated_peaks.txt) lines"
Workflow
Step 1: Installation and Genome Setup
Install HOMER and download the reference genome sequence required for motif analysis.
# Activate conda environment (or use existing env) conda create -n homer_env -c bioconda homer python=3.10 -y conda activate homer_env # List available genomes installGenome.pl list # Install human (hg38) and mouse (mm10) genomes # Downloads masked genome sequence and annotation files installGenome.pl hg38 # Output: Installing hg38... Done. (3-5 min, ~3 GB) installGenome.pl mm10 # Output: Installing mm10... Done. (3-5 min, ~2.5 GB) # Verify genome is installed ls ~/.homer/data/genomes/hg38/ # genome.fa chrom.sizes ... # Check HOMER motif database ls ~/.homer/data/knownTFs/ # vertebrates.motifs jaspar.motifs ...
Step 2: Prepare Input Peak File
Prepare a summit-centered BED file from MACS3 output for optimal motif resolution.
# Option A: Use MACS3 summit file directly (already 1 bp summit positions) # Expand summits to ±100 bp (200 bp total) centered on summit awk 'BEGIN{OFS="\t"} { start = ($2 - 100 < 0) ? 0 : $2 - 100; print $1, start, $2 + 100, $4, $5 }' peaks/tf_chip_summits.bed > peaks/tf_chip_200bp.bed echo "Summit-centered peaks: $(wc -l < peaks/tf_chip_200bp.bed)" # Summit-centered peaks: 12453 # Option B: Use narrowPeak file directly (HOMER accepts multi-column BED) # HOMER uses columns 1-3 (chr, start, end) and centers internally with -size cp peaks/tf_chip_peaks.narrowPeak peaks/input_peaks.bed # Option C: Prepare a custom background region file (matched GC content) # HOMER auto-generates background if not provided, but explicit background # is recommended when comparing two peak sets # Use the control peak set or random genomic regions as background: bedtools shuffle -i peaks/tf_chip_peaks.narrowPeak \ -g ~/.homer/data/genomes/hg38/chrom.sizes \ -excl peaks/tf_chip_peaks.narrowPeak > peaks/background_regions.bed echo "Background regions: $(wc -l < peaks/background_regions.bed)" # Background regions: 12453
Step 3: De Novo Motif Discovery
Run
findMotifsGenome.pl for de novo motif discovery and known motif enrichment simultaneously.
mkdir -p motif_output/ # Full run: de novo + known motif enrichment # -size 200: use 200 bp window centered on peak midpoint # -mask: mask repetitive elements (recommended for clean motifs) # -p 4: use 4 CPU threads # -S 25: find top 25 de novo motifs (default) findMotifsGenome.pl peaks/tf_chip_200bp.bed hg38 motif_output/ \ -size 200 \ -mask \ -p 4 \ -S 25 # Check progress output: # Reading genome sizes for hg38 ... # Scanning for motifs... # Optimizing 25 motifs... # Done! Output in motif_output/ echo "Known results: $(wc -l < motif_output/knownResults.txt) motifs" echo "De novo motifs: $(ls motif_output/homerResults/*.motif 2>/dev/null | wc -l) motifs" # Known results: 392 motifs # De novo motifs: 25 motifs # For mouse peaks (mm10) # findMotifsGenome.pl peaks/atac_peaks.bed mm10 motif_output_mm10/ \ # -size 200 -mask -p 4
Step 4: Known Motif Enrichment Only
Scan peaks for occurrences of a specific known motif or skip de novo discovery for speed.
# Skip de novo discovery (faster when you only need known motifs) findMotifsGenome.pl peaks/tf_chip_200bp.bed hg38 motif_output_known/ \ -size 200 \ -mask \ -p 4 \ -nomotif echo "Known motif results: $(wc -l < motif_output_known/knownResults.txt)" # Known motif results: 392 # Find occurrences of a specific motif across peaks (outputs peak-level annotation) # Extract the motif matrix file for the TF of interest from homerResults/ findMotifsGenome.pl peaks/tf_chip_200bp.bed hg38 motif_scan_out/ \ -size 200 \ -mask \ -find motif_output/homerResults/motif1.motif \ > peaks_with_motif1.txt echo "Peaks containing motif1: $(wc -l < peaks_with_motif1.txt)" # Peaks containing motif1: 8941 # Custom background: compare treated vs. control peak sets findMotifsGenome.pl peaks/treated_peaks.bed hg38 motif_treated_vs_ctrl/ \ -size 200 \ -mask \ -p 4 \ -bg peaks/control_peaks.bed
Step 5: Peak Annotation
Use
annotatePeaks.pl to assign each peak to a genomic feature and nearest gene.
# Annotate peaks with nearest gene and TSS distance # Outputs a tab-delimited file with genomic context for each peak annotatePeaks.pl peaks/tf_chip_peaks.narrowPeak hg38 \ > annotated_peaks.txt echo "Annotated peaks: $(($(wc -l < annotated_peaks.txt) - 1)) peaks" # Annotated peaks: 12453 peaks # Preview column headers and first peak head -2 annotated_peaks.txt | cut -f1-10 # Annotate ATAC-seq peaks (same command, different input) annotatePeaks.pl peaks/atac_sample_peaks.narrowPeak hg38 \ > annotated_atac.txt # Generate TSS-distance histogram (for tag density plots) # annotatePeaks.pl can compute read density around peaks with -d flag # annotatePeaks.pl tss hg38 -size 4000 -hist 10 \ # -d chip_tagdir/ > tss_histogram.txt
Step 6: Parse HOMER Results with Python
Read
knownResults.txt and de novo motif files into pandas for downstream analysis.
import pandas as pd import subprocess import io # --- Parse known motif enrichment results --- # knownResults.txt columns: # Motif Name | Consensus | P-value | Log P-value | q-value | # Target Seqs w/ motif | % Target | # Bg Seqs w/ motif | % Bg known_cols = [ "motif_name", "consensus", "pvalue", "log_pvalue", "qvalue", "n_target_seqs", "pct_target", "n_bg_seqs", "pct_bg" ] known = pd.read_csv( "motif_output/knownResults.txt", sep="\t", header=0, names=known_cols ) # Convert string percentages to floats known["pct_target"] = known["pct_target"].str.rstrip("%").astype(float) known["pct_bg"] = known["pct_bg"].str.rstrip("%").astype(float) known["fold_enrichment"] = known["pct_target"] / known["pct_bg"].replace(0, 0.001) known["-log10_pvalue"] = -known["log_pvalue"] / 2.303 # log10 from natural log print(f"Total known motifs tested: {len(known)}") print(f"Significant (p < 1e-5): {(known['pvalue'].astype(float) < 1e-5).sum()}") print("\nTop 5 enriched motifs:") print( known[["motif_name", "pct_target", "pct_bg", "fold_enrichment", "-log10_pvalue"]] .head(5) .to_string(index=False) ) # Total known motifs tested: 391 # Significant (p < 1e-5): 47 # Top 5 enriched motifs: # motif_name pct_target pct_bg fold_enrichment -log10_pvalue # CTCF(Zf)/GM12878-CTCF-ChIP-Seq... 78.21 8.32 9.40 312.4 # ZNF143(Zf)/... 42.11 5.10 8.26 188.7 # ...
Step 7: Parse Peak Annotations and Visualize
Read annotated peak output, summarize genomic feature distribution, and plot motif enrichments.
import pandas as pd import matplotlib.pyplot as plt import matplotlib.patches as mpatches import seaborn as sns # --- Parse annotated peaks --- # annotatePeaks.pl output header: PeakID chr start end strand score ... # Key columns: "Annotation", "Distance to TSS", "Nearest RefSeq", "Gene Name" annot = pd.read_csv("annotated_peaks.txt", sep="\t", header=0, low_memory=False) annot.columns = annot.columns.str.strip() # Simplify annotation categories def simplify_annotation(ann): if pd.isna(ann): return "Other" ann = str(ann) if "promoter" in ann.lower(): return "Promoter (<2 kb)" elif "exon" in ann.lower(): return "Exon" elif "intron" in ann.lower(): return "Intron" elif "tts" in ann.lower() or "downstream" in ann.lower(): return "Downstream" elif "intergenic" in ann.lower(): return "Intergenic" else: return "Other" annot["simple_ann"] = annot["Annotation"].apply(simplify_annotation) ann_counts = annot["simple_ann"].value_counts() print("Peak annotation summary:") print(ann_counts.to_string()) # Peak annotation summary: # Intron 5832 # Intergenic 3102 # Promoter (<2 kb) 2218 # Exon 891 # Downstream 287 # Other 123 # --- Plot 1: Annotation pie chart --- fig, axes = plt.subplots(1, 2, figsize=(14, 5)) colors = ["#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00", "#A65628"] axes[0].pie( ann_counts.values, labels=ann_counts.index, colors=colors[:len(ann_counts)], autopct="%1.1f%%", startangle=140, textprops={"fontsize": 9} ) axes[0].set_title("Peak Genomic Annotation\n(n=12,453 peaks)", fontsize=11) # --- Plot 2: Top known motif enrichments --- known = pd.read_csv("motif_output/knownResults.txt", sep="\t", header=0) known.columns = [ "motif_name", "consensus", "pvalue", "log_pvalue", "qvalue", "n_target_seqs", "pct_target", "n_bg_seqs", "pct_bg" ] known["pct_target"] = known["pct_target"].str.rstrip("%").astype(float) known["pct_bg"] = known["pct_bg"].str.rstrip("%").astype(float) known["-log10_p"] = (-known["log_pvalue"].astype(float)) / 2.303 known["short_name"] = known["motif_name"].str.split("/").str[0] top_motifs = known.head(15).copy() sns.barplot( data=top_motifs, x="-log10_p", y="short_name", palette="Blues_r", ax=axes[1] ) axes[1].set_xlabel("-log10(p-value)", fontsize=10) axes[1].set_ylabel("Motif", fontsize=10) axes[1].set_title("Top 15 Enriched Known Motifs\n(HOMER/JASPAR database)", fontsize=11) axes[1].axvline(x=5, color="red", linestyle="--", linewidth=0.8, label="p=1e-5") axes[1].legend(fontsize=8) plt.tight_layout() plt.savefig("homer_summary.png", dpi=150, bbox_inches="tight") plt.close() print("Saved: homer_summary.png")
Key Parameters
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
| | 50–1000 | Window size (bp) centered on peak midpoint for motif search; 200 for TF ChIP-seq, 150 for ATAC-seq |
| off | flag | Mask repetitive elements (N-mask); strongly recommended to reduce false positives |
| | 1–64 | Number of parallel CPU threads; use 4–8 for typical datasets |
| | 1–100 | Number of de novo motifs to find; 25 is usually sufficient |
| off | flag | Skip de novo motif discovery; only perform known motif enrichment (10× faster) |
| auto-generated | BED file path | Custom background region file; auto-background is GC-matched if omitted |
| — | file path | Scan peaks for occurrences of a specific motif matrix; outputs per-peak annotation |
| | 0–4 | Maximum mismatches allowed when matching known motifs |
| | comma-separated integers | Motif lengths to try for de novo search; adding finds shorter motifs |
| required | , , , , , etc. | Reference genome assembly; must be installed with |
Common Recipes
Recipe 1: Python subprocess Wrapper for findMotifsGenome.pl
Run HOMER from a Python script and capture completion status.
import subprocess import sys from pathlib import Path def run_homer_motifs( peaks_bed: str, genome: str, output_dir: str, size: int = 200, mask: bool = True, cpus: int = 4, n_motifs: int = 25, denovo: bool = True, bg_bed: str = None ) -> int: """Run findMotifsGenome.pl and return exit code.""" Path(output_dir).mkdir(parents=True, exist_ok=True) cmd = [ "findMotifsGenome.pl", peaks_bed, genome, output_dir, "-size", str(size), "-p", str(cpus), "-S", str(n_motifs), ] if mask: cmd.append("-mask") if not denovo: cmd.append("-nomotif") if bg_bed: cmd.extend(["-bg", bg_bed]) print(f"Running: {' '.join(cmd)}") result = subprocess.run(cmd, capture_output=True, text=True) if result.returncode != 0: print(f"HOMER stderr:\n{result.stderr[-2000:]}", file=sys.stderr) return result.returncode print(f"Done. Output in {output_dir}/") return 0 # Usage exit_code = run_homer_motifs( peaks_bed="peaks/tf_chip_200bp.bed", genome="hg38", output_dir="motif_output/", size=200, mask=True, cpus=4, n_motifs=25 ) print(f"Exit code: {exit_code}") # Running: findMotifsGenome.pl peaks/tf_chip_200bp.bed hg38 motif_output/ -size 200 -p 4 -S 25 -mask # Done. Output in motif_output/ # Exit code: 0
Recipe 2: Batch Motif Analysis Across Multiple Peak Sets
Run HOMER on multiple ChIP-seq samples in a loop.
#!/bin/bash # Batch motif analysis for several ChIP-seq experiments (same genome) GENOME="hg38" PEAK_DIR="peaks" MOTIF_DIR="motif_results" mkdir -p "$MOTIF_DIR" SAMPLES=(CTCF H3K4me3 FOXA2 RUNX1) for sample in "${SAMPLES[@]}"; do echo "=== Processing $sample ===" peak_file="${PEAK_DIR}/${sample}_summits.bed" if [ ! -f "$peak_file" ]; then echo " Skipping $sample: $peak_file not found" continue fi # Center on summit ±100 bp awk 'BEGIN{OFS="\t"} {s=$2-100<0?0:$2-100; print $1,s,$2+100,$4,$5}' \ "$peak_file" > "${PEAK_DIR}/${sample}_200bp.bed" findMotifsGenome.pl "${PEAK_DIR}/${sample}_200bp.bed" "$GENOME" \ "${MOTIF_DIR}/${sample}/" \ -size 200 -mask -p 4 -nomotif \ 2> "${MOTIF_DIR}/${sample}.log" n=$(wc -l < "${MOTIF_DIR}/${sample}/knownResults.txt") echo " $sample: $n motifs tested. Top hit: $(sed -n '2p' "${MOTIF_DIR}/${sample}/knownResults.txt" | cut -f1)" done # === Processing CTCF === # CTCF: 392 motifs tested. Top hit: CTCF(Zf)/GM12878-CTCF-ChIP-Seq(GSE32465)/Homer # === Processing FOXA2 === # FOXA2: 392 motifs tested. Top hit: Foxa2(Forkhead)/Liver-Foxa2-ChIP-Seq(GSE25694)/Homer
Recipe 3: Parse De Novo Motif Matrices from homerResults/
Load de novo motif PWM matrices for downstream comparison or plotting.
import os import re import pandas as pd def parse_homer_motif(motif_file: str) -> dict: """Parse a HOMER .motif file into name, log_odds_threshold, and PWM.""" with open(motif_file) as f: header = f.readline().strip() # >motif_name\tlog_odds\tlog_p-value\t0\tnucs rows = [] for line in f: line = line.strip() if line: rows.append([float(x) for x in line.split("\t")]) parts = header.lstrip(">").split("\t") name = parts[0] log_odds = float(parts[1]) if len(parts) > 1 else 0.0 log_p = float(parts[2]) if len(parts) > 2 else 0.0 pwm = pd.DataFrame(rows, columns=["A", "C", "G", "T"]) return {"name": name, "log_odds": log_odds, "log_p": log_p, "pwm": pwm} # Load all de novo motifs motif_dir = "motif_output/homerResults/" motifs = [] for fn in sorted(os.listdir(motif_dir)): if fn.endswith(".motif"): motif = parse_homer_motif(os.path.join(motif_dir, fn)) motifs.append(motif) print(f"{fn}: {motif['name']} ({len(motif['pwm'])} positions, log_p={motif['log_p']:.1f})") print(f"\nLoaded {len(motifs)} de novo motifs") # motif1.motif: CTCF-motif (19 positions, log_p=-8234.1) # motif2.motif: CTCFL-motif (17 positions, log_p=-3421.7) # ... # Loaded 25 de novo motifs
Recipe 4: Annotate Peaks and Join with Differential Expression Results
Combine peak annotations with RNA-seq DE results to find regulated genes near peaks.
import pandas as pd # Load annotated peaks annot = pd.read_csv("annotated_peaks.txt", sep="\t", header=0, low_memory=False) annot.columns = annot.columns.str.strip() # Key columns from HOMER annotation peak_genes = annot[["PeakID (cmd=annotatePeaks.pl peaks.bed hg38)", "Chr", "Start", "End", "Annotation", "Distance to TSS", "Nearest RefSeq", "Gene Name"]].copy() peak_genes.columns = ["peak_id", "chr", "start", "end", "annotation", "tss_dist", "refseq", "gene_name"] # Filter promoter-proximal peaks (within 2 kb of TSS) promoter_peaks = peak_genes[peak_genes["tss_dist"].abs() < 2000].copy() print(f"Promoter-proximal peaks (|TSS| < 2kb): {len(promoter_peaks)}") # Promoter-proximal peaks (|TSS| < 2kb): 2218 # Load DESeq2 results (gene_name, log2FC, padj) de_results = pd.read_csv("deseq2_results.csv") de_sig = de_results[de_results["padj"] < 0.05].copy() # Merge: find DE genes with a nearby peak merged = promoter_peaks.merge(de_sig, on="gene_name", how="inner") print(f"DE genes with promoter-proximal peak: {len(merged)}") print(merged[["gene_name", "tss_dist", "annotation", "log2FC", "padj"]].head()) # DE genes with promoter-proximal peak: 347
Expected Outputs
| Output | Format | Description |
|---|---|---|
| TSV | All known motifs tested: name, p-value, q-value, % target, % background; primary result file |
| HTML | Interactive HTML report with motif logos, statistics, and links |
| Directory | De novo motifs: (PWM matrix), , |
| HTML | Interactive HTML report for de novo motifs |
| TSV | One row per peak: chr, start, end, annotation, TSS distance, nearest gene, RefSeq ID |
| TSV | Per-peak motif occurrence scores (from mode) |
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Genome not installed via | Run ; confirm exists |
| HOMER binaries not in | Add HOMER bin directory to PATH: (or activate conda env) |
| Poor de novo motif quality (GC-rich artifacts) | Unmasked repetitive elements | Always use ; also try increasing to 300 or reducing to 150 |
| No significant known motifs (all p > 0.01) | Too few peaks or wrong peak size | Ensure ≥500 peaks; check matches expected binding footprint (150–200 for TFs) |
| Conda HOMER conflicts (Perl module errors) | HOMER's Perl scripts conflict with conda environment Perl | Install HOMER in a dedicated conda env: |
gives all "Intergenic" | Genome annotation not installed | HOMER annotation requires which downloads GTF; check contains files |
| Very slow run time (>2 hr) | Single-threaded or large peak set | Add ; reduce to 10 for speed; use for known-only runs |
custom background gives poor motifs | Background regions have very different GC content | Use a GC-matched background; run with to generate random same-size regions |
References
- HOMER official documentation — comprehensive guide to all HOMER commands, parameters, and output formats
- Heinz S et al. (2010) "Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities." Molecular Cell 38(4):576–589. DOI:10.1016/j.molcel.2010.05.004
- HOMER GitHub: samtools/homer — source code and issue tracker
- JASPAR 2024 database — curated TF binding motif database used by HOMER for known motif enrichment