BioSkills bio-chipseq-peak-annotation

Annotate ChIP-seq peaks to genomic features and nearest genes. Classify peaks as promoter, exon, intron, or intergenic using ChIPseeker (R), HOMER annotatePeaks.pl (CLI), or Python (pandas/pyranges). Supports pre-built annotation databases and custom GTF files. Handles promoter definition, feature priority, category collapsing, and signed distance-to-TSS. Use when assigning genomic context to ChIP-seq peaks or linking peaks to target genes.

install

source · Clone the upstream repo

git clone https://github.com/GPTomics/bioSkills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/chip-seq/peak-annotation" ~/.claude/skills/gptomics-bioskills-bio-chipseq-peak-annotation && rm -rf "$T"

manifest: chip-seq/peak-annotation/SKILL.md

source content

Version Compatibility

Reference examples tested with: ChIPseeker 1.38+, GenomicFeatures 1.54+, rtracklayer 1.62+, HOMER 4.11+, pyranges 0.0.129+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

R:
```
packageVersion('<pkg>')
```
then
```
?function_name
```
to verify parameters
Python:
```
pip show <package>
```
then
```
help(module.function)
```
to check signatures
CLI:
```
annotatePeaks.pl
```
(HOMER prints version on run)

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Peak Annotation

"Annotate my peaks to genes and genomic features" -> Assign each peak to a genomic feature category (promoter, exon, intron, intergenic), find the nearest gene, and calculate signed distance to TSS.

ChIPseeker::annotatePeak(peaks, TxDb=txdb)

CLI:

annotatePeaks.pl peaks.bed hg38 -gtf annotation.gtf

Python: Parse GTF, compute intervals, classify by overlap

Choosing an Annotation Approach

Context	Recommended	Why
Standard genome, pre-built annotations available	ChIPseeker with TxDb package	Simplest setup; automatic gene symbol mapping via annoDb
Custom or project-specific GTF provided	ChIPseeker + makeTxDbFromGFF, HOMER -gtf, or Python	All three handle custom annotations; choose based on pipeline
HOMER already in pipeline	HOMER annotatePeaks.pl	Reuses tag directory; combined annotation + motif workflow
Fine-grained control over classification logic	Python	Full control over priority rules, distance calculation, output format
Quick annotation with standard categories	HOMER annotatePeaks.pl	Single command; no R environment required

Critical: Always use the same annotation source for peak annotation as was used for the analysis. Mixing databases (e.g., UCSC knownGene TxDb with a GENCODE GTF alignment) produces gene name mismatches and incorrect feature assignments. When a specific GTF is provided, use it directly rather than a pre-built TxDb package.

Coordinate Systems and TSS

BED files use 0-based half-open coordinates

[start, end)

. GTF files use 1-based closed coordinates

[start, end]

. Mixing these without conversion shifts annotations by one base.

Peak center (from BED):

(start + end) // 2

TSS from GTF:

Plus-strand genes: TSS =
```
start
```
(1-based) -> 0-based:
```
start - 1
```
Minus-strand genes: TSS =
```
end
```
(1-based) -> 0-based:
```
end
```

Signed distance: Negative = upstream of TSS, positive = downstream.

Plus-strand:
```
distance = peak_center - tss
```
Minus-strand:
```
distance = -(peak_center - tss)
```

Gene Assignment Conventions

Peak annotation involves two decisions: (1) which gene to assign, and (2) what genomic feature the peak overlaps. These can come from different genes, creating a decoupling problem that affects downstream interpretation.

Nearest-TSS vs Host-Gene Assignment

Approach	Gene assigned from	Feature assigned from	Tools
Nearest-TSS	Gene with closest TSS	Physical overlap at peak center	ChIPseeker default ( `overlap='TSS'` ), HOMER
Host-gene priority	Gene whose body contains the peak	Same gene's features	ChIPseeker `overlap='all'`

Nearest-TSS (default): HOMER and ChIPseeker both use a two-step process by default. Step 1 finds the gene with the nearest TSS. Step 2 independently determines the genomic feature at the peak center. A peak inside gene A's intron but near gene B's TSS reports: nearest_gene=B, feature=intron -- but the intron belongs to gene A, not gene B.

Host-gene priority: ChIPseeker's

overlap='all'

parameter changes this. If a peak overlaps any part of a gene body, that gene is reported as nearest, coupling gene and feature. Peaks with no gene body overlap fall back to nearest TSS.

Choosing a Convention

Context	Convention	Rationale
Distal TF binding (enhancers)	Nearest-TSS	Enhancers often regulate the nearest gene, not the gene they sit in
Histone marks in gene bodies (H3K36me3, H3K27me3)	Host-gene	Mark typically reflects host gene's transcriptional state
Promoter-associated marks (H3K4me3, H3K27ac)	Either	Most peaks are at promoters where both conventions agree
Custom annotation against a specific GTF	Host-gene	Consistent gene-feature coupling avoids misleading annotations
Reproducing published HOMER results	Nearest-TSS	Matches HOMER's default behavior

When a task requests "nearest gene," clarify whether it means nearest by TSS distance or the gene whose feature the peak physically overlaps. For most annotation tasks where gene and feature should be consistent, use the host-gene convention.

Annotation with ChIPseeker (R)

Pre-built TxDb (Standard Genomes)

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
peaks <- readPeakFile('peaks.narrowPeak')
peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-3000, 3000), annoDb = 'org.Hs.eg.db')
anno_df <- as.data.frame(peak_anno)

Custom GTF Annotations

Goal: Annotate peaks using a project-specific GTF rather than a pre-built annotation package.

Approach: Build a TxDb from the GTF with

makeTxDbFromGFF()

, annotate peaks against it, then map gene symbols from the original GTF since custom TxDb objects lack the ID mappings that

annoDb

requires.

library(ChIPseeker)
library(GenomicFeatures)
library(rtracklayer)

txdb <- makeTxDbFromGFF('genes.gtf.gz', format = 'gtf')
peaks <- readPeakFile('peaks.bed')
peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-2000, 2000), overlap = 'all')
anno_df <- as.data.frame(peak_anno)

# Map gene symbols from GTF (annoDb does not work with custom TxDb)
gtf <- import('genes.gtf.gz')
gene_map <- unique(data.frame(
    gene_id = sub('\\..*', '', gtf$gene_id),
    symbol = gtf$gene_name, stringsAsFactors = FALSE))
gene_map <- gene_map[!is.na(gene_map$symbol), ]
anno_df$geneId_base <- sub('\\..*', '', anno_df$geneId)
anno_df$SYMBOL <- gene_map$symbol[match(anno_df$geneId_base, gene_map$gene_id)]

The version-suffix stripping (

sub('\\..*', '', ...)

) handles GENCODE gene IDs like

ENSG00000142192.25

where the TxDb may store the full ID but the GTF attribute has it without the version.

Custom Promoter Definition

The

tssRegion

parameter defines the promoter window around each TSS:

Window	Use case
c(-1000, 1000)	Strict core promoter
c(-2000, 2000)	Common custom definition
c(-3000, 3000)	ChIPseeker default; broader capture
c(-2000, 500)	Asymmetric; emphasizes upstream regulatory elements

Match this to analysis requirements. Many studies define specific windows (e.g., 2kb symmetric) -- always check.

Annotation Priority

ChIPseeker resolves overlapping features using

genomicAnnotationPriority

Default:

Promoter > 5'UTR > 3'UTR > Exon > Intron > Downstream > Intergenic

A peak overlapping both a promoter of gene A and an intron of gene B receives "Promoter". To customize:

peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-2000, 2000),
    genomicAnnotationPriority = c('Promoter', '5UTR', '3UTR', 'Exon', 'Intron',
                                   'Downstream', 'Intergenic'))

Collapse Annotation Categories

ChIPseeker returns detailed subcategories. To collapse to four standard categories:

ChIPseeker Output	Collapsed
Promoter (<=1kb), Promoter (1-2kb), Promoter (2-3kb)	promoter
5' UTR, 3' UTR, 1st Exon, Other Exon	exon
1st Intron, Other Intron	intron
Downstream (<=300), Downstream (<=1kb), Distal Intergenic	intergenic

collapse_annotation <- function(ann) {
    ifelse(grepl('Promoter', ann), 'promoter',
    ifelse(grepl("5' UTR|3' UTR|Exon", ann), 'exon',
    ifelse(grepl('Intron', ann), 'intron', 'intergenic')))
}
anno_df$feature <- collapse_annotation(anno_df$annotation)

To suppress subcategories at the source:

options(ChIPseeker.ignore_1st_exon = TRUE)
options(ChIPseeker.ignore_1st_intron = TRUE)
options(ChIPseeker.ignore_promoter_subcategory = TRUE)

Export Results

output <- data.frame(chr = anno_df$seqnames, start = anno_df$start, end = anno_df$end,
    nearest_gene = anno_df$SYMBOL, distance_to_tss = anno_df$distanceToTSS,
    feature = anno_df$feature)
write.table(output, 'annotations.tsv', sep = '\t', row.names = FALSE, quote = FALSE)

Annotation with HOMER (CLI)

Standard Genome

annotatePeaks.pl peaks.bed hg38 > annotated.txt

Custom GTF

# With installed genome
annotatePeaks.pl peaks.bed hg38 -gtf genes.gtf > annotated.txt

# Without installed genome (annotation from GTF only)
annotatePeaks.pl peaks.bed none -gtf genes.gtf > annotated.txt

For gzipped GTFs, decompress first:

gunzip -k genes.gtf.gz

Annotation Statistics

annotatePeaks.pl peaks.bed hg38 -gtf genes.gtf -annStats ann_stats.txt > annotated.txt

Parse HOMER Output

HOMER uses a two-part annotation process: (1) find the nearest TSS to assign a gene, and (2) independently classify the genomic feature at the peak center. The gene and annotation can come from different genes -- a peak in gene A's intron near gene B's TSS reports gene B with an intron annotation. This matches ChIPseeker's default

overlap='TSS'

behavior.

HOMER produces 19 tab-delimited columns. Key columns for annotation:

Column	Name	Content
8	Annotation	Feature category (promoter-TSS, exon, intron, Intergenic)
10	Distance to TSS	Signed distance (negative = upstream)
16	Gene Name	Gene symbol

awk -F'\t' 'NR>1 {print $2, $3, $4, $16, $10, $8}' OFS='\t' annotated.txt > summary.tsv

HOMER Category Mapping

HOMER Category	Collapsed
promoter-TSS	promoter
5' UTR, 3' UTR, exon, non-coding	exon
intron	intron
Intergenic, TTS	intergenic

Limitation: HOMER defines promoter as -1kb to +100bp from TSS; this window is not configurable via flags. For custom promoter windows, reclassify using the Distance to TSS column:

awk -F'\t' 'NR>1 {
    dist = ($10 < 0) ? -$10 : $10
    feat = (dist <= 2000) ? "promoter" : $8
    print $2, $3, $4, $16, $10, feat
}' OFS='\t' annotated.txt > reclassified.tsv

Annotation with Python

Parse GTF and Extract Gene Models

Goal: Build gene, exon, and TSS tables from a GTF file for peak annotation.

Approach: Parse GTF line by line, extract gene and exon features, compute TSS positions from strand and coordinates, converting from 1-based GTF to 0-based BED coordinates.

import gzip, pandas as pd

def parse_gtf(gtf_path):
    records = []
    opener = gzip.open if gtf_path.endswith('.gz') else open
    with opener(gtf_path, 'rt') as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.strip().split('\t')
            attrs = {}
            for item in fields[8].strip().rstrip(';').split(';'):
                item = item.strip()
                if ' ' in item:
                    key, val = item.split(' ', 1)
                    attrs[key] = val.strip('"')
            records.append({'chrom': fields[0], 'feature': fields[2],
                            'start': int(fields[3]) - 1, 'end': int(fields[4]),
                            'strand': fields[6], **attrs})
    return pd.DataFrame(records)

gtf = parse_gtf('genes.gtf.gz')
genes = gtf[gtf['feature'] == 'gene'].copy()
genes['tss'] = genes.apply(lambda r: r['start'] if r['strand'] == '+' else r['end'], axis=1)
exons = gtf[gtf['feature'] == 'exon']

Annotate Peaks with Host-Gene Convention

Goal: For each peak, classify its genomic feature and assign it to the appropriate gene with consistent gene-feature coupling.

Approach: Check features in priority order (promoter > exon > intron > intergenic). For promoter, find the nearest TSS within the window. For exon or intron, assign the host gene whose body contains the peak. For intergenic, fall back to nearest TSS. Compute strand-aware signed distance relative to the assigned gene's TSS.

peaks = pd.read_csv('peaks.bed', sep='\t', header=None,
                     names=['chr', 'start', 'end', 'peak_id', 'score'])
peaks['center'] = (peaks['start'] + peaks['end']) // 2
promoter_window = 2000  # bp from TSS; match to analysis requirements

results = []
for _, peak in peaks.iterrows():
    chrom_genes = genes[genes['chrom'] == peak['chr']]
    chrom_exons = exons[exons['chrom'] == peak['chr']]
    abs_dists = (chrom_genes['tss'] - peak['center']).abs()
    nearest_tss_gene = chrom_genes.loc[abs_dists.idxmin()]

    # Feature classification with host-gene coupling: promoter > exon > intron > intergenic
    if abs_dists.min() <= promoter_window:
        feature, assigned = 'promoter', nearest_tss_gene
    else:
        exon_hits = chrom_exons[(chrom_exons['start'] <= peak['center']) & (peak['center'] < chrom_exons['end'])]
        gene_hits = chrom_genes[(chrom_genes['start'] <= peak['center']) & (peak['center'] < chrom_genes['end'])]
        if len(exon_hits) > 0:
            host_gene_name = exon_hits.iloc[0].get('gene_name', '')
            host = chrom_genes[chrom_genes['gene_name'] == host_gene_name]
            feature, assigned = 'exon', host.iloc[0] if len(host) > 0 else nearest_tss_gene
        elif len(gene_hits) > 0:
            closest_host = gene_hits.loc[(gene_hits['tss'] - peak['center']).abs().idxmin()]
            feature, assigned = 'intron', closest_host
        else:
            feature, assigned = 'intergenic', nearest_tss_gene

    raw_dist = peak['center'] - assigned['tss']
    signed_dist = -raw_dist if assigned['strand'] == '-' else raw_dist

    results.append({'peak_id': peak['peak_id'], 'chr': peak['chr'], 'start': peak['start'],
                    'end': peak['end'], 'nearest_gene': assigned['gene_name'],
                    'distance_to_tss': int(signed_dist), 'feature': feature})

result_df = pd.DataFrame(results)
result_df.to_csv('annotations.tsv', sep='\t', index=False)

When multiple genes overlap the peak center (common on opposite strands), the host gene with the closest TSS is selected as a tiebreaker.

Alternative: pyranges

For projects with pyranges installed, GTF parsing is simpler:

import pyranges as pr

gtf = pr.read_gtf('genes.gtf.gz')  # auto-converts to 0-based half-open
genes = gtf[gtf.Feature == 'gene']
peaks = pr.read_bed('peaks.bed')
nearest = peaks.nearest(genes)  # adds Distance column

Feature classification and signed distance still require manual logic on the result DataFrame.

Visualization

plotAnnoPie(peak_anno)
plotAnnoBar(peak_anno)
plotDistToTSS(peak_anno, title = 'Distribution of peaks relative to TSS')

Compare Multiple Peak Sets

peak_files <- list(H3K4me3 = 'h3k4me3_peaks.bed', H3K27ac = 'h3k27ac_peaks.bed')
anno_list <- lapply(peak_files, function(f) annotatePeak(readPeakFile(f), TxDb = txdb))
plotAnnoBar(anno_list)
plotDistToTSS(anno_list)

Key Parameters

Parameter (ChIPseeker)	Default	Description
tssRegion	c(-3000, 3000)	Promoter window around TSS
level	"transcript"	"transcript" or "gene"; gene-level merges all isoforms
genomicAnnotationPriority	Promoter > ... > Intergenic	Feature priority for overlapping annotations
overlap	"TSS"	"TSS": gene = nearest TSS (gene and feature can be from different genes). "all": gene = host gene if peak overlaps any gene body (coupled annotation). Use "all" when gene-feature consistency matters
sameStrand	FALSE	Restrict to same-strand genes only
addFlankGeneInfo	FALSE	Include neighboring gene information

Related Skills

peak-calling - Generate peak files with MACS3 or HOMER
motif-analysis - De novo and known motif enrichment in peak regions
differential-binding - Compare peaks between conditions
chipseq-visualization - Signal tracks, heatmaps, profile plots
genome-intervals/gtf-gff-handling - Parse and convert GTF/GFF annotation files
genome-intervals/proximity-operations - bedtools closest and window operations
pathway-analysis/go-enrichment - Functional enrichment of peak-associated genes