BioSkills bio-chipseq-peak-annotation

Annotate ChIP-seq peaks to genomic features and nearest genes. Classify peaks as promoter, exon, intron, or intergenic using ChIPseeker (R), HOMER annotatePeaks.pl (CLI), or Python (pandas/pyranges). Supports pre-built annotation databases and custom GTF files. Handles promoter definition, feature priority, category collapsing, and signed distance-to-TSS. Use when assigning genomic context to ChIP-seq peaks or linking peaks to target genes.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/chip-seq/peak-annotation" ~/.claude/skills/gptomics-bioskills-bio-chipseq-peak-annotation && rm -rf "$T"
manifest: chip-seq/peak-annotation/SKILL.md
source content

Version Compatibility

Reference examples tested with: ChIPseeker 1.38+, GenomicFeatures 1.54+, rtracklayer 1.62+, HOMER 4.11+, pyranges 0.0.129+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

  • R:
    packageVersion('<pkg>')
    then
    ?function_name
    to verify parameters
  • Python:
    pip show <package>
    then
    help(module.function)
    to check signatures
  • CLI:
    annotatePeaks.pl
    (HOMER prints version on run)

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Peak Annotation

"Annotate my peaks to genes and genomic features" -> Assign each peak to a genomic feature category (promoter, exon, intron, intergenic), find the nearest gene, and calculate signed distance to TSS.

  • R:
    ChIPseeker::annotatePeak(peaks, TxDb=txdb)
  • CLI:
    annotatePeaks.pl peaks.bed hg38 -gtf annotation.gtf
  • Python: Parse GTF, compute intervals, classify by overlap

Choosing an Annotation Approach

ContextRecommendedWhy
Standard genome, pre-built annotations availableChIPseeker with TxDb packageSimplest setup; automatic gene symbol mapping via annoDb
Custom or project-specific GTF providedChIPseeker + makeTxDbFromGFF, HOMER -gtf, or PythonAll three handle custom annotations; choose based on pipeline
HOMER already in pipelineHOMER annotatePeaks.plReuses tag directory; combined annotation + motif workflow
Fine-grained control over classification logicPythonFull control over priority rules, distance calculation, output format
Quick annotation with standard categoriesHOMER annotatePeaks.plSingle command; no R environment required

Critical: Always use the same annotation source for peak annotation as was used for the analysis. Mixing databases (e.g., UCSC knownGene TxDb with a GENCODE GTF alignment) produces gene name mismatches and incorrect feature assignments. When a specific GTF is provided, use it directly rather than a pre-built TxDb package.

Coordinate Systems and TSS

BED files use 0-based half-open coordinates

[start, end)
. GTF files use 1-based closed coordinates
[start, end]
. Mixing these without conversion shifts annotations by one base.

Peak center (from BED):

(start + end) // 2

TSS from GTF:

  • Plus-strand genes: TSS =
    start
    (1-based) -> 0-based:
    start - 1
  • Minus-strand genes: TSS =
    end
    (1-based) -> 0-based:
    end

Signed distance: Negative = upstream of TSS, positive = downstream.

  • Plus-strand:
    distance = peak_center - tss
  • Minus-strand:
    distance = -(peak_center - tss)

Gene Assignment Conventions

Peak annotation involves two decisions: (1) which gene to assign, and (2) what genomic feature the peak overlaps. These can come from different genes, creating a decoupling problem that affects downstream interpretation.

Nearest-TSS vs Host-Gene Assignment

ApproachGene assigned fromFeature assigned fromTools
Nearest-TSSGene with closest TSSPhysical overlap at peak centerChIPseeker default (
overlap='TSS'
), HOMER
Host-gene priorityGene whose body contains the peakSame gene's featuresChIPseeker
overlap='all'

Nearest-TSS (default): HOMER and ChIPseeker both use a two-step process by default. Step 1 finds the gene with the nearest TSS. Step 2 independently determines the genomic feature at the peak center. A peak inside gene A's intron but near gene B's TSS reports: nearest_gene=B, feature=intron -- but the intron belongs to gene A, not gene B.

Host-gene priority: ChIPseeker's

overlap='all'
parameter changes this. If a peak overlaps any part of a gene body, that gene is reported as nearest, coupling gene and feature. Peaks with no gene body overlap fall back to nearest TSS.

Choosing a Convention

ContextConventionRationale
Distal TF binding (enhancers)Nearest-TSSEnhancers often regulate the nearest gene, not the gene they sit in
Histone marks in gene bodies (H3K36me3, H3K27me3)Host-geneMark typically reflects host gene's transcriptional state
Promoter-associated marks (H3K4me3, H3K27ac)EitherMost peaks are at promoters where both conventions agree
Custom annotation against a specific GTFHost-geneConsistent gene-feature coupling avoids misleading annotations
Reproducing published HOMER resultsNearest-TSSMatches HOMER's default behavior

When a task requests "nearest gene," clarify whether it means nearest by TSS distance or the gene whose feature the peak physically overlaps. For most annotation tasks where gene and feature should be consistent, use the host-gene convention.

Annotation with ChIPseeker (R)

Pre-built TxDb (Standard Genomes)

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
peaks <- readPeakFile('peaks.narrowPeak')
peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-3000, 3000), annoDb = 'org.Hs.eg.db')
anno_df <- as.data.frame(peak_anno)

Custom GTF Annotations

Goal: Annotate peaks using a project-specific GTF rather than a pre-built annotation package.

Approach: Build a TxDb from the GTF with

makeTxDbFromGFF()
, annotate peaks against it, then map gene symbols from the original GTF since custom TxDb objects lack the ID mappings that
annoDb
requires.

library(ChIPseeker)
library(GenomicFeatures)
library(rtracklayer)

txdb <- makeTxDbFromGFF('genes.gtf.gz', format = 'gtf')
peaks <- readPeakFile('peaks.bed')
peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-2000, 2000), overlap = 'all')
anno_df <- as.data.frame(peak_anno)

# Map gene symbols from GTF (annoDb does not work with custom TxDb)
gtf <- import('genes.gtf.gz')
gene_map <- unique(data.frame(
    gene_id = sub('\\..*', '', gtf$gene_id),
    symbol = gtf$gene_name, stringsAsFactors = FALSE))
gene_map <- gene_map[!is.na(gene_map$symbol), ]
anno_df$geneId_base <- sub('\\..*', '', anno_df$geneId)
anno_df$SYMBOL <- gene_map$symbol[match(anno_df$geneId_base, gene_map$gene_id)]

The version-suffix stripping (

sub('\\..*', '', ...)
) handles GENCODE gene IDs like
ENSG00000142192.25
where the TxDb may store the full ID but the GTF attribute has it without the version.

Custom Promoter Definition

The

tssRegion
parameter defines the promoter window around each TSS:

WindowUse case
c(-1000, 1000)Strict core promoter
c(-2000, 2000)Common custom definition
c(-3000, 3000)ChIPseeker default; broader capture
c(-2000, 500)Asymmetric; emphasizes upstream regulatory elements

Match this to analysis requirements. Many studies define specific windows (e.g., 2kb symmetric) -- always check.

Annotation Priority

ChIPseeker resolves overlapping features using

genomicAnnotationPriority
:

Default:

Promoter > 5'UTR > 3'UTR > Exon > Intron > Downstream > Intergenic

A peak overlapping both a promoter of gene A and an intron of gene B receives "Promoter". To customize:

peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-2000, 2000),
    genomicAnnotationPriority = c('Promoter', '5UTR', '3UTR', 'Exon', 'Intron',
                                   'Downstream', 'Intergenic'))

Collapse Annotation Categories

ChIPseeker returns detailed subcategories. To collapse to four standard categories:

ChIPseeker OutputCollapsed
Promoter (<=1kb), Promoter (1-2kb), Promoter (2-3kb)promoter
5' UTR, 3' UTR, 1st Exon, Other Exonexon
1st Intron, Other Intronintron
Downstream (<=300), Downstream (<=1kb), Distal Intergenicintergenic
collapse_annotation <- function(ann) {
    ifelse(grepl('Promoter', ann), 'promoter',
    ifelse(grepl("5' UTR|3' UTR|Exon", ann), 'exon',
    ifelse(grepl('Intron', ann), 'intron', 'intergenic')))
}
anno_df$feature <- collapse_annotation(anno_df$annotation)

To suppress subcategories at the source:

options(ChIPseeker.ignore_1st_exon = TRUE)
options(ChIPseeker.ignore_1st_intron = TRUE)
options(ChIPseeker.ignore_promoter_subcategory = TRUE)

Export Results

output <- data.frame(chr = anno_df$seqnames, start = anno_df$start, end = anno_df$end,
    nearest_gene = anno_df$SYMBOL, distance_to_tss = anno_df$distanceToTSS,
    feature = anno_df$feature)
write.table(output, 'annotations.tsv', sep = '\t', row.names = FALSE, quote = FALSE)

Annotation with HOMER (CLI)

Standard Genome

annotatePeaks.pl peaks.bed hg38 > annotated.txt

Custom GTF

# With installed genome
annotatePeaks.pl peaks.bed hg38 -gtf genes.gtf > annotated.txt

# Without installed genome (annotation from GTF only)
annotatePeaks.pl peaks.bed none -gtf genes.gtf > annotated.txt

For gzipped GTFs, decompress first:

gunzip -k genes.gtf.gz

Annotation Statistics

annotatePeaks.pl peaks.bed hg38 -gtf genes.gtf -annStats ann_stats.txt > annotated.txt

Parse HOMER Output

HOMER uses a two-part annotation process: (1) find the nearest TSS to assign a gene, and (2) independently classify the genomic feature at the peak center. The gene and annotation can come from different genes -- a peak in gene A's intron near gene B's TSS reports gene B with an intron annotation. This matches ChIPseeker's default

overlap='TSS'
behavior.

HOMER produces 19 tab-delimited columns. Key columns for annotation:

ColumnNameContent
8AnnotationFeature category (promoter-TSS, exon, intron, Intergenic)
10Distance to TSSSigned distance (negative = upstream)
16Gene NameGene symbol
awk -F'\t' 'NR>1 {print $2, $3, $4, $16, $10, $8}' OFS='\t' annotated.txt > summary.tsv

HOMER Category Mapping

HOMER CategoryCollapsed
promoter-TSSpromoter
5' UTR, 3' UTR, exon, non-codingexon
intronintron
Intergenic, TTSintergenic

Limitation: HOMER defines promoter as -1kb to +100bp from TSS; this window is not configurable via flags. For custom promoter windows, reclassify using the Distance to TSS column:

awk -F'\t' 'NR>1 {
    dist = ($10 < 0) ? -$10 : $10
    feat = (dist <= 2000) ? "promoter" : $8
    print $2, $3, $4, $16, $10, feat
}' OFS='\t' annotated.txt > reclassified.tsv

Annotation with Python

Parse GTF and Extract Gene Models

Goal: Build gene, exon, and TSS tables from a GTF file for peak annotation.

Approach: Parse GTF line by line, extract gene and exon features, compute TSS positions from strand and coordinates, converting from 1-based GTF to 0-based BED coordinates.

import gzip, pandas as pd

def parse_gtf(gtf_path):
    records = []
    opener = gzip.open if gtf_path.endswith('.gz') else open
    with opener(gtf_path, 'rt') as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.strip().split('\t')
            attrs = {}
            for item in fields[8].strip().rstrip(';').split(';'):
                item = item.strip()
                if ' ' in item:
                    key, val = item.split(' ', 1)
                    attrs[key] = val.strip('"')
            records.append({'chrom': fields[0], 'feature': fields[2],
                            'start': int(fields[3]) - 1, 'end': int(fields[4]),
                            'strand': fields[6], **attrs})
    return pd.DataFrame(records)

gtf = parse_gtf('genes.gtf.gz')
genes = gtf[gtf['feature'] == 'gene'].copy()
genes['tss'] = genes.apply(lambda r: r['start'] if r['strand'] == '+' else r['end'], axis=1)
exons = gtf[gtf['feature'] == 'exon']

Annotate Peaks with Host-Gene Convention

Goal: For each peak, classify its genomic feature and assign it to the appropriate gene with consistent gene-feature coupling.

Approach: Check features in priority order (promoter > exon > intron > intergenic). For promoter, find the nearest TSS within the window. For exon or intron, assign the host gene whose body contains the peak. For intergenic, fall back to nearest TSS. Compute strand-aware signed distance relative to the assigned gene's TSS.

peaks = pd.read_csv('peaks.bed', sep='\t', header=None,
                     names=['chr', 'start', 'end', 'peak_id', 'score'])
peaks['center'] = (peaks['start'] + peaks['end']) // 2
promoter_window = 2000  # bp from TSS; match to analysis requirements

results = []
for _, peak in peaks.iterrows():
    chrom_genes = genes[genes['chrom'] == peak['chr']]
    chrom_exons = exons[exons['chrom'] == peak['chr']]
    abs_dists = (chrom_genes['tss'] - peak['center']).abs()
    nearest_tss_gene = chrom_genes.loc[abs_dists.idxmin()]

    # Feature classification with host-gene coupling: promoter > exon > intron > intergenic
    if abs_dists.min() <= promoter_window:
        feature, assigned = 'promoter', nearest_tss_gene
    else:
        exon_hits = chrom_exons[(chrom_exons['start'] <= peak['center']) & (peak['center'] < chrom_exons['end'])]
        gene_hits = chrom_genes[(chrom_genes['start'] <= peak['center']) & (peak['center'] < chrom_genes['end'])]
        if len(exon_hits) > 0:
            host_gene_name = exon_hits.iloc[0].get('gene_name', '')
            host = chrom_genes[chrom_genes['gene_name'] == host_gene_name]
            feature, assigned = 'exon', host.iloc[0] if len(host) > 0 else nearest_tss_gene
        elif len(gene_hits) > 0:
            closest_host = gene_hits.loc[(gene_hits['tss'] - peak['center']).abs().idxmin()]
            feature, assigned = 'intron', closest_host
        else:
            feature, assigned = 'intergenic', nearest_tss_gene

    raw_dist = peak['center'] - assigned['tss']
    signed_dist = -raw_dist if assigned['strand'] == '-' else raw_dist

    results.append({'peak_id': peak['peak_id'], 'chr': peak['chr'], 'start': peak['start'],
                    'end': peak['end'], 'nearest_gene': assigned['gene_name'],
                    'distance_to_tss': int(signed_dist), 'feature': feature})

result_df = pd.DataFrame(results)
result_df.to_csv('annotations.tsv', sep='\t', index=False)

When multiple genes overlap the peak center (common on opposite strands), the host gene with the closest TSS is selected as a tiebreaker.

Alternative: pyranges

For projects with pyranges installed, GTF parsing is simpler:

import pyranges as pr

gtf = pr.read_gtf('genes.gtf.gz')  # auto-converts to 0-based half-open
genes = gtf[gtf.Feature == 'gene']
peaks = pr.read_bed('peaks.bed')
nearest = peaks.nearest(genes)  # adds Distance column

Feature classification and signed distance still require manual logic on the result DataFrame.

Visualization

plotAnnoPie(peak_anno)
plotAnnoBar(peak_anno)
plotDistToTSS(peak_anno, title = 'Distribution of peaks relative to TSS')

Compare Multiple Peak Sets

peak_files <- list(H3K4me3 = 'h3k4me3_peaks.bed', H3K27ac = 'h3k27ac_peaks.bed')
anno_list <- lapply(peak_files, function(f) annotatePeak(readPeakFile(f), TxDb = txdb))
plotAnnoBar(anno_list)
plotDistToTSS(anno_list)

Key Parameters

Parameter (ChIPseeker)DefaultDescription
tssRegionc(-3000, 3000)Promoter window around TSS
level"transcript""transcript" or "gene"; gene-level merges all isoforms
genomicAnnotationPriorityPromoter > ... > IntergenicFeature priority for overlapping annotations
overlap"TSS""TSS": gene = nearest TSS (gene and feature can be from different genes). "all": gene = host gene if peak overlaps any gene body (coupled annotation). Use "all" when gene-feature consistency matters
sameStrandFALSERestrict to same-strand genes only
addFlankGeneInfoFALSEInclude neighboring gene information

Related Skills

  • peak-calling - Generate peak files with MACS3 or HOMER
  • motif-analysis - De novo and known motif enrichment in peak regions
  • differential-binding - Compare peaks between conditions
  • chipseq-visualization - Signal tracks, heatmaps, profile plots
  • genome-intervals/gtf-gff-handling - Parse and convert GTF/GFF annotation files
  • genome-intervals/proximity-operations - bedtools closest and window operations
  • pathway-analysis/go-enrichment - Functional enrichment of peak-associated genes