LLMs-Universal-Life-Science-and-Clinical-Skills- motif-analysis


install

source · Clone the upstream repo

git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Epigenomics/chip-seq/motif-analysis" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-motif-analysis && rm -rf "$T"

manifest: Skills/Epigenomics/chip-seq/motif-analysis/SKILL.md

source content

name: bio-chipseq-motif-analysis
description: De novo motif discovery and known motif enrichment analysis using HOMER and MEME-ChIP. Identify transcription factor binding motifs in ChIP-seq, ATAC-seq, or other genomic peak data. Use when finding enriched DNA motifs in peak sequences.
tool_type: cli
primary_tool: HOMER
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
  - read_file
  - run_shell_command

Motif Analysis

Identify DNA sequence motifs enriched in ChIP-seq or ATAC-seq peaks to discover transcription factor binding sites.

Tool Comparison

Tool	Strengths	Use Case
HOMER	Fast, comprehensive, built-in databases	General motif analysis
MEME-ChIP	Multiple algorithms, web interface	Publication-quality
MEME	De novo discovery only	Simple discovery
FIMO	Known motif scanning	Genome-wide scanning

HOMER

Installation

conda install -c bioconda homer

# Configure genome (required once)
perl /path/to/homer/configureHomer.pl -install hg38
perl /path/to/homer/configureHomer.pl -install mm10

De Novo Motif Discovery

# Basic motif finding
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200

# With background regions
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200 -bg background.bed

# Specify motif lengths to search
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200 -len 8,10,12

Key Options

Option	Description
`-size <#>`	Width of the region analyzed around each peak center, in bp (default 200)
`-size given`	Use actual peak sizes
`-bg <file>`	Background regions (BED)
`-len <#,#,...>`	Motif lengths to search
`-mask`	Mask repeats
`-p <#>`	Number of CPUs
`-S <#>`	Number of motifs to find (default 25)
`-mis <#>`	Mismatches allowed (default 2)
`-noweight`	Don't adjust for GC content
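The options above can be assembled programmatically when the same analysis is run over many peak files. The following is a minimal sketch, not part of HOMER: `build_homer_command` and its default values are hypothetical, and only flags listed in the table are used.

```python
import shlex

def build_homer_command(peaks, genome, outdir, size=200, lengths=(8, 10, 12),
                        background=None, mask=True, cpus=4, n_motifs=25):
    """Assemble a findMotifsGenome.pl argument list (hypothetical helper)."""
    cmd = [
        "findMotifsGenome.pl", peaks, genome, outdir,
        "-size", str(size),                          # integer width or the literal "given"
        "-len", ",".join(str(n) for n in lengths),   # motif lengths to search
        "-p", str(cpus),                             # number of CPUs
        "-S", str(n_motifs),                         # number of motifs to find
    ]
    if background:
        cmd += ["-bg", background]                   # custom background regions (BED)
    if mask:
        cmd.append("-mask")                          # mask repeats
    return cmd

cmd = build_homer_command("peaks.bed", "hg38", "output_dir/",
                          background="background.bed")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`; keeping it as a list rather than a string avoids shell quoting problems with unusual file names.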

Output Files

output_dir/
├── homerResults.html      # Main results page
├── knownResults.html      # Known motif enrichment
├── homerMotifs.all.motifs # All discovered motifs
├── knownResults.txt       # Known motif statistics
└── motif1.motif           # Individual motif files
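The `.motif` files listed above are plain text: each motif starts with a `>` header line (consensus, name, detection threshold, then further statistics, tab-separated), followed by one row of A/C/G/T probabilities per position. A minimal parsing sketch, assuming that layout; `parse_homer_motifs` is a hypothetical helper and the example matrix is made up for illustration.

```python
def parse_homer_motifs(text):
    """Parse HOMER motif text into a list of dicts, one per motif."""
    motifs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: consensus, name, log-odds threshold, then extra statistics.
            fields = line[1:].split("\t")
            motifs.append({
                "consensus": fields[0],
                "name": fields[1] if len(fields) > 1 else "",
                "threshold": float(fields[2]) if len(fields) > 2 else None,
                "matrix": [],
            })
        elif motifs:
            # Matrix row: probabilities for A, C, G, T at one position.
            motifs[-1]["matrix"].append([float(x) for x in line.split()])
    return motifs

# Made-up motif for illustration only.
example = (
    ">CACGTG\texample-Ebox\t6.0\n"
    "0.1\t0.7\t0.1\t0.1\n"
    "0.7\t0.1\t0.1\t0.1\n"
    "0.1\t0.7\t0.1\t0.1\n"
    "0.1\t0.1\t0.7\t0.1\n"
    "0.1\t0.1\t0.1\t0.7\n"
    "0.1\t0.1\t0.7\t0.1\n"
)
parsed = parse_homer_motifs(example)
print(parsed[0]["name"], len(parsed[0]["matrix"]))
```

For real output, read the file first, for example `parse_homer_motifs(open("output_dir/homerMotifs.all.motifs").read())`.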

Known Motif Enrichment Only

# Skip de novo, only check known motifs
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200 -nomotif

Scan for Specific Motifs

# Find instances of motif in peaks
annotatePeaks.pl peaks.bed hg38 -m motif.motif > annotated.txt

# Scan genome for motif occurrences (-bed writes BED-formatted output)
scanMotifGenomeWide.pl motif.motif hg38 -bed > motif_sites.bed

Motif Comparison

# Compare discovered motifs to HOMER's known motif database (the default comparison set)
compareMotifs.pl motifs.motif output_dir/

Create Custom Motif

# From consensus sequence (arguments: consensus, mismatches allowed, motif name)
seq2profile.pl CACGTG 0 MYC > MYC.motif

# From aligned sequences
cat aligned_seqs.txt | alignAndConvert.pl - > custom.motif

MEME Suite

Installation

conda install -c bioconda meme

Extract Sequences from Peaks

# Get FASTA sequences under peaks
bedtools getfasta -fi genome.fa -bed peaks.bed -fo peaks.fa

# Center peaks and resize to 200 bp (100 bp either side of the peak midpoint)
awk 'BEGIN{OFS="\t"} {c=int(($2+$3)/2); s=c-100; if (s<0) s=0; print $1,s,c+100}' peaks.bed | \
    bedtools getfasta -fi genome.fa -bed - -fo peaks_centered.fa
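Centering and resizing peaks can also be done in Python before sequence extraction. The sketch below assumes plain tab-separated BED input; `center_and_resize` and its `chrom_sizes` argument (a dict of chromosome lengths) are hypothetical names, not part of bedtools.

```python
def center_and_resize(bed_lines, width=200, chrom_sizes=None):
    """Yield (chrom, start, end) windows of `width` bp centered on each interval."""
    half = width // 2
    for line in bed_lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        center = (start + end) // 2
        new_start = max(0, center - half)            # never go below coordinate 0
        new_end = center + half
        if chrom_sizes and chrom in chrom_sizes:
            new_end = min(new_end, chrom_sizes[chrom])  # clip at the chromosome end
        yield chrom, new_start, new_end

windows = list(center_and_resize(["chr1\t1000\t1500\n", "chr1\t10\t50\n"]))
for chrom, start, end in windows:
    print(chrom, start, end, sep="\t")
```

Windows clipped at a chromosome edge come out shorter than `width`; whether to keep or drop them depends on the downstream tool.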

MEME (De Novo Discovery)

# Basic de novo discovery
meme peaks.fa -dna -oc meme_output -mod zoops -nmotifs 10 -minw 6 -maxw 20

# With Markov background
fasta-get-markov peaks.fa > background.model
meme peaks.fa -dna -oc meme_output -bfile background.model -mod zoops -nmotifs 10

MEME Options

Option	Description
`-mod zoops`	Zero or one per sequence (default for ChIP)
`-mod oops`	Exactly one per sequence
`-mod anr`	Any number of repeats
`-nmotifs <#>`	Number of motifs to find
`-minw <#>`	Minimum motif width
`-maxw <#>`	Maximum motif width
`-revcomp`	Search both strands
`-bfile <file>`	Background model file

MEME-ChIP (Comprehensive Pipeline)

# All-in-one ChIP-seq motif analysis
meme-chip -oc meme_chip_output -db motif_database.meme peaks.fa

MEME-ChIP runs:

MEME - De novo discovery on the central region of each sequence
DREME - Short motif discovery (recent MEME Suite releases run STREME for this step instead)
CentriMo - Central enrichment analysis
TOMTOM - Compare to known motifs
FIMO - Find motif instances

DREME (Short Motifs)

# Find short enriched motifs
dreme -oc dreme_output -p peaks.fa -n background.fa

CentriMo (Central Enrichment)

# Test for central enrichment of known motifs
centrimo -oc centrimo_output peaks.fa motif_database.meme

TOMTOM (Motif Comparison)

# Compare discovered motifs to database
tomtom -oc tomtom_output discovered.meme database.meme

FIMO (Motif Scanning)

# Scan sequences for motif matches
fimo --oc fimo_output motif.meme sequences.fa

# Scan genome
fimo --oc fimo_output --max-stored-scores 1000000 motif.meme genome.fa
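FIMO writes its matches to `fimo.tsv` in the output directory, with a header row and trailing comment lines that start with `#`. A minimal sketch for filtering hits by q-value using only the standard library; `read_fimo_hits` and the 0.05 cutoff are assumptions for illustration, and the example rows are made up.

```python
import csv
import io

def read_fimo_hits(handle, max_q=0.05):
    """Yield hits from a fimo.tsv handle whose q-value is at most max_q."""
    # fimo.tsv ends with '#' comment lines; skip those and blank lines.
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        q_value = row.get("q-value") or ""
        if q_value and float(q_value) <= max_q:
            yield {
                "motif": row["motif_id"],
                "sequence": row["sequence_name"],
                "start": int(row["start"]),
                "stop": int(row["stop"]),
                "strand": row["strand"],
                "q_value": float(q_value),
            }

# Made-up rows for illustration only.
example = (
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n"
    "MA0000.1\tEXAMPLE\tchr1:100-300\t42\t47\t+\t12.5\t1.2e-05\t0.01\tCACGTG\n"
    "MA0000.1\tEXAMPLE\tchr1:500-700\t10\t15\t-\t6.1\t8.0e-04\t0.40\tCACGTA\n"
    "\n"
    "# FIMO (Find Individual Motif Occurrences)\n"
)
hits = list(read_fimo_hits(io.StringIO(example)))
print(hits)
```

With real output, pass an open file instead: `read_fimo_hits(open("fimo_output/fimo.tsv"))`.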

Motif Databases

HOMER Built-in

# List available motif sets
ls /path/to/homer/data/knownTFs/

# Vertebrate known motifs, given as the full path to the motif file
findMotifsGenome.pl peaks.bed hg38 output/ -mknown /path/to/homer/data/knownTFs/vertebrates/known.motifs

JASPAR

# Download JASPAR motifs
wget https://jaspar.genereg.net/download/data/2024/CORE/JASPAR2024_CORE_vertebrates_non-redundant_pfms_meme.txt

# Use with MEME suite
meme-chip -db JASPAR2024_CORE_vertebrates_non-redundant_pfms_meme.txt peaks.fa

HOCOMOCO

# Download HOCOMOCO
wget https://hocomoco11.autosome.org/final_bundle/hocomoco11/core/HUMAN/mono/HOCOMOCOv11_core_HUMAN_mono_meme_format.meme

# Use with MEME suite
tomtom discovered.meme HOCOMOCOv11_core_HUMAN_mono_meme_format.meme
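The database files above are in MEME motif format. To compare a custom probability matrix against them, it can be written out in the minimal MEME format first. This is a sketch under stated assumptions: `to_minimal_meme` is a hypothetical helper, the uniform background and `nsites=20` are placeholder values, and rows are renormalized to sum to 1.

```python
def to_minimal_meme(name, matrix, nsites=20):
    """Return minimal MEME-format text for one motif (rows are A, C, G, T)."""
    lines = [
        "MEME version 4",
        "",
        "ALPHABET= ACGT",
        "",
        "strands: + -",
        "",
        "Background letter frequencies",
        "A 0.25 C 0.25 G 0.25 T 0.25",   # uniform background (assumption)
        "",
        f"MOTIF {name}",
        f"letter-probability matrix: alength= 4 w= {len(matrix)} nsites= {nsites} E= 0",
    ]
    for row in matrix:
        total = sum(row)
        if len(row) != 4 or total <= 0:
            raise ValueError("each row needs four non-negative values with a positive sum")
        # Renormalize so every row sums to 1.
        lines.append(" ".join(f"{value / total:.6f}" for value in row))
    return "\n".join(lines) + "\n"

meme_text = to_minimal_meme("example_Ebox", [[0.1, 0.7, 0.1, 0.1],
                                             [0.7, 0.1, 0.1, 0.1]])
print(meme_text)
```

The resulting text can be saved to a file and used as the query in a command such as `tomtom custom.meme HOCOMOCOv11_core_HUMAN_mono_meme_format.meme`.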

Python: Parse HOMER Results

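import pandas as pd

def parse_homer_known(results_file):
    '''Parse HOMER knownResults.txt, most significant motifs first.'''
    df = pd.read_csv(results_file, sep='\t')
    # HOMER's header embeds sequence counts, so replace it with stable names.
    df.columns = ['Motif', 'Consensus', 'P-value', 'Log P-value',
                  'q-value', 'Targets', 'Target%', 'Background', 'Background%']
    # Percent columns are strings such as "12.34%"; convert them to floats.
    for col in ('Target%', 'Background%'):
        df[col] = df[col].astype(str).str.rstrip('%').astype(float)
    # Very small P-values underflow to 0.0 as floats, so rank by the
    # natural-log P-value instead (more negative = more significant).
    return df.sort_values('Log P-value')

known = parse_homer_known('output_dir/knownResults.txt')
print(known[['Motif', 'P-value', 'Log P-value', 'Target%']].head(20))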
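The Target% and Background% columns are most informative when read together, since a motif present in many targets may be just as common in the background. A small sketch of a fold-enrichment calculation on those percent values; `fold_enrichment` is a hypothetical helper, and the example percentages are made up.

```python
def fold_enrichment(target_pct, background_pct):
    """Return target/background ratio from percent values such as '35.00%' or 35.0."""
    def to_float(value):
        # Accept both raw strings from knownResults.txt and already-parsed numbers.
        return float(str(value).rstrip("%"))

    target = to_float(target_pct)
    background = to_float(background_pct)
    if background == 0:
        return float("inf")   # motif absent from the background set
    return target / background

print(fold_enrichment("35.00%", "8.00%"))
```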

Python: Parse MEME Results

from Bio import motifs

def parse_meme_file(meme_file):
    '''Parse MEME output with Biopython and return the record of motifs.'''
    with open(meme_file) as f:
        record = motifs.parse(f, 'meme')
    return record

# Recent Biopython releases read MEME's XML output; older releases read meme.txt.
record = parse_meme_file('meme_output/meme.xml')
for m in record:
    print(f'{m.name}: {m.consensus}')
    print(m.counts)

Complete Workflows

ChIP-seq Motif Analysis

#!/bin/bash
set -euo pipefail

PEAKS=$1   # narrowPeak or BED file
GENOME=$2  # hg38, mm10, etc.; expects ${GENOME}.fa and ${GENOME}.chrom.sizes in the working directory
OUTDIR=$3

mkdir -p "$OUTDIR"

# HOMER analysis
echo "Running HOMER..."
findMotifsGenome.pl "$PEAKS" "$GENOME" "${OUTDIR}/homer" \
    -size 200 -p 8 -mask

# Extract 200 bp windows centered on each peak for MEME
# (slop -b 0 adds no padding; it is used here to clip windows at chromosome ends)
echo "Extracting sequences..."
awk 'BEGIN{OFS="\t"} {center=int(($2+$3)/2); start=center-100; if (start<0) start=0; print $1,start,center+100}' "$PEAKS" | \
    bedtools slop -i - -g "${GENOME}.chrom.sizes" -b 0 | \
    bedtools getfasta -fi "${GENOME}.fa" -bed - -fo "${OUTDIR}/peaks.fa"

# MEME-ChIP analysis
echo "Running MEME-ChIP..."
meme-chip -oc "${OUTDIR}/meme_chip" \
    -db /path/to/JASPAR.meme \
    "${OUTDIR}/peaks.fa"

echo "Done. Results in ${OUTDIR}/"

ATAC-seq Footprint Motifs

# Analyze motifs in footprint regions
findMotifsGenome.pl footprints.bed hg38 footprint_motifs/ \
    -size given -mask -p 8

# Compare to accessible regions background (separate output directory so the first run is not overwritten)
findMotifsGenome.pl footprints.bed hg38 footprint_motifs_vs_accessible/ \
    -size given -bg accessible_peaks.bed -mask -p 8

Visualization

HOMER Logo

# Generate sequence logo
motif2Logo.pl motif.motif > logo.eps

Plot with Python

import logomaker
import pandas as pd
import matplotlib.pyplot as plt

def plot_motif(pwm_file):
    '''Plot sequence logo from HOMER PWM.'''
    pwm = pd.read_csv(pwm_file, sep='\t', skiprows=1, header=None)
    pwm.columns = ['A', 'C', 'G', 'T']
    logo = logomaker.Logo(pwm, shade_below=0.5, fade_below=0.5)
    plt.show()

Quality Metrics

Metric	Good	Concerning
P-value	< 1e-10	> 1e-5
Target %	> 20%	< 5%
Background %	< Target/2	Similar to Target
Bit score	> 10	< 5
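The thresholds in this table can be applied as a quick triage step over many results. The sketch below is one possible reading of the table, not an established standard: `rate_motif` is a hypothetical helper, the "borderline" label for values between the two columns is an addition, and the bit-score row and the "similar to target" background check are left out because the table does not define them precisely.

```python
def rate_motif(p_value, target_pct, background_pct):
    """Label a motif 'good', 'concerning', or 'borderline' using the table's cutoffs."""
    good = (
        p_value < 1e-10
        and target_pct > 20
        and background_pct < target_pct / 2
    )
    concerning = p_value > 1e-5 or target_pct < 5
    if good:
        return "good"
    if concerning:
        return "concerning"
    return "borderline"   # between the two columns of the table (added label)

# Made-up values for illustration only.
print(rate_motif(1e-50, 35.0, 8.0))
print(rate_motif(1e-3, 30.0, 10.0))
print(rate_motif(1e-8, 12.0, 5.0))
```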

Common Issues

No Significant Motifs

Check peak quality (too few peaks?)
Try different peak sizes (`-size`)
Ensure genome build matches
Check for repeat masking issues

Too Many Motifs

Increase significance threshold
Use `-S` to limit number of motifs
Filter by target percentage

Wrong Background

Use matched GC content background
Consider using input/control peaks
Try shuffled sequences
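For the shuffled-sequence option, a simple background can be generated by permuting the letters of each peak sequence. A minimal sketch, assuming sequences are already loaded as (name, sequence) pairs; `shuffled_records` is a hypothetical helper. This shuffle keeps each sequence's length and base composition but not its dinucleotide frequencies, so it is a rough control only.

```python
import random

def shuffle_sequence(seq, rng):
    """Return a random permutation of the letters in seq."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def shuffled_records(records, seed=0):
    """Yield (name, shuffled sequence) pairs; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    for name, seq in records:
        yield f"{name}_shuf", shuffle_sequence(seq, rng)

# Made-up sequence for illustration only.
records = [("peak1", "ACGTACGTGG")]
background = list(shuffled_records(records, seed=1))
for name, seq in background:
    print(f">{name}\n{seq}")
```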

Related Skills

peak-calling - Generate input peaks
peak-annotation - Annotate peaks with genes
atac-seq/footprinting - TF footprint analysis
genome-intervals - BED file operations