BioSkills bio-chipseq-peak-annotation
Annotate ChIP-seq peaks to genomic features and nearest genes. Classify peaks as promoter, exon, intron, or intergenic using ChIPseeker (R), HOMER annotatePeaks.pl (CLI), or Python (pandas/pyranges). Supports pre-built annotation databases and custom GTF files. Handles promoter definition, feature priority, category collapsing, and signed distance-to-TSS. Use when assigning genomic context to ChIP-seq peaks or linking peaks to target genes.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/chip-seq/peak-annotation" ~/.claude/skills/gptomics-bioskills-bio-chipseq-peak-annotation && rm -rf "$T"
chip-seq/peak-annotation/SKILL.mdVersion Compatibility
Reference examples tested with: ChIPseeker 1.38+, GenomicFeatures 1.54+, rtracklayer 1.62+, HOMER 4.11+, pyranges 0.0.129+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- R:
thenpackageVersion('<pkg>')
to verify parameters?function_name - Python:
thenpip show <package>
to check signatureshelp(module.function) - CLI:
(HOMER prints version on run)annotatePeaks.pl
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Peak Annotation
"Annotate my peaks to genes and genomic features" -> Assign each peak to a genomic feature category (promoter, exon, intron, intergenic), find the nearest gene, and calculate signed distance to TSS.
- R:
ChIPseeker::annotatePeak(peaks, TxDb=txdb) - CLI:
annotatePeaks.pl peaks.bed hg38 -gtf annotation.gtf - Python: Parse GTF, compute intervals, classify by overlap
Choosing an Annotation Approach
| Context | Recommended | Why |
|---|---|---|
| Standard genome, pre-built annotations available | ChIPseeker with TxDb package | Simplest setup; automatic gene symbol mapping via annoDb |
| Custom or project-specific GTF provided | ChIPseeker + makeTxDbFromGFF, HOMER -gtf, or Python | All three handle custom annotations; choose based on pipeline |
| HOMER already in pipeline | HOMER annotatePeaks.pl | Reuses tag directory; combined annotation + motif workflow |
| Fine-grained control over classification logic | Python | Full control over priority rules, distance calculation, output format |
| Quick annotation with standard categories | HOMER annotatePeaks.pl | Single command; no R environment required |
Critical: Always use the same annotation source for peak annotation as was used for the analysis. Mixing databases (e.g., UCSC knownGene TxDb with a GENCODE GTF alignment) produces gene name mismatches and incorrect feature assignments. When a specific GTF is provided, use it directly rather than a pre-built TxDb package.
Coordinate Systems and TSS
BED files use 0-based half-open coordinates
[start, end). GTF files use 1-based closed coordinates [start, end]. Mixing these without conversion shifts annotations by one base.
Peak center (from BED):
(start + end) // 2
TSS from GTF:
- Plus-strand genes: TSS =
(1-based) -> 0-based:startstart - 1 - Minus-strand genes: TSS =
(1-based) -> 0-based:endend
Signed distance: Negative = upstream of TSS, positive = downstream.
- Plus-strand:
distance = peak_center - tss - Minus-strand:
distance = -(peak_center - tss)
Gene Assignment Conventions
Peak annotation involves two decisions: (1) which gene to assign, and (2) what genomic feature the peak overlaps. These can come from different genes, creating a decoupling problem that affects downstream interpretation.
Nearest-TSS vs Host-Gene Assignment
| Approach | Gene assigned from | Feature assigned from | Tools |
|---|---|---|---|
| Nearest-TSS | Gene with closest TSS | Physical overlap at peak center | ChIPseeker default (), HOMER |
| Host-gene priority | Gene whose body contains the peak | Same gene's features | ChIPseeker |
Nearest-TSS (default): HOMER and ChIPseeker both use a two-step process by default. Step 1 finds the gene with the nearest TSS. Step 2 independently determines the genomic feature at the peak center. A peak inside gene A's intron but near gene B's TSS reports: nearest_gene=B, feature=intron -- but the intron belongs to gene A, not gene B.
Host-gene priority: ChIPseeker's
overlap='all' parameter changes this. If a peak overlaps any part of a gene body, that gene is reported as nearest, coupling gene and feature. Peaks with no gene body overlap fall back to nearest TSS.
Choosing a Convention
| Context | Convention | Rationale |
|---|---|---|
| Distal TF binding (enhancers) | Nearest-TSS | Enhancers often regulate the nearest gene, not the gene they sit in |
| Histone marks in gene bodies (H3K36me3, H3K27me3) | Host-gene | Mark typically reflects host gene's transcriptional state |
| Promoter-associated marks (H3K4me3, H3K27ac) | Either | Most peaks are at promoters where both conventions agree |
| Custom annotation against a specific GTF | Host-gene | Consistent gene-feature coupling avoids misleading annotations |
| Reproducing published HOMER results | Nearest-TSS | Matches HOMER's default behavior |
When a task requests "nearest gene," clarify whether it means nearest by TSS distance or the gene whose feature the peak physically overlaps. For most annotation tasks where gene and feature should be consistent, use the host-gene convention.
Annotation with ChIPseeker (R)
Pre-built TxDb (Standard Genomes)
library(ChIPseeker) library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene peaks <- readPeakFile('peaks.narrowPeak') peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-3000, 3000), annoDb = 'org.Hs.eg.db') anno_df <- as.data.frame(peak_anno)
Custom GTF Annotations
Goal: Annotate peaks using a project-specific GTF rather than a pre-built annotation package.
Approach: Build a TxDb from the GTF with
makeTxDbFromGFF(), annotate peaks against it, then map gene symbols from the original GTF since custom TxDb objects lack the ID mappings that annoDb requires.
library(ChIPseeker) library(GenomicFeatures) library(rtracklayer) txdb <- makeTxDbFromGFF('genes.gtf.gz', format = 'gtf') peaks <- readPeakFile('peaks.bed') peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-2000, 2000), overlap = 'all') anno_df <- as.data.frame(peak_anno) # Map gene symbols from GTF (annoDb does not work with custom TxDb) gtf <- import('genes.gtf.gz') gene_map <- unique(data.frame( gene_id = sub('\\..*', '', gtf$gene_id), symbol = gtf$gene_name, stringsAsFactors = FALSE)) gene_map <- gene_map[!is.na(gene_map$symbol), ] anno_df$geneId_base <- sub('\\..*', '', anno_df$geneId) anno_df$SYMBOL <- gene_map$symbol[match(anno_df$geneId_base, gene_map$gene_id)]
The version-suffix stripping (
sub('\\..*', '', ...)) handles GENCODE gene IDs like ENSG00000142192.25 where the TxDb may store the full ID but the GTF attribute has it without the version.
Custom Promoter Definition
The
tssRegion parameter defines the promoter window around each TSS:
| Window | Use case |
|---|---|
| c(-1000, 1000) | Strict core promoter |
| c(-2000, 2000) | Common custom definition |
| c(-3000, 3000) | ChIPseeker default; broader capture |
| c(-2000, 500) | Asymmetric; emphasizes upstream regulatory elements |
Match this to analysis requirements. Many studies define specific windows (e.g., 2kb symmetric) -- always check.
Annotation Priority
ChIPseeker resolves overlapping features using
genomicAnnotationPriority:
Default:
Promoter > 5'UTR > 3'UTR > Exon > Intron > Downstream > Intergenic
A peak overlapping both a promoter of gene A and an intron of gene B receives "Promoter". To customize:
peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-2000, 2000), genomicAnnotationPriority = c('Promoter', '5UTR', '3UTR', 'Exon', 'Intron', 'Downstream', 'Intergenic'))
Collapse Annotation Categories
ChIPseeker returns detailed subcategories. To collapse to four standard categories:
| ChIPseeker Output | Collapsed |
|---|---|
| Promoter (<=1kb), Promoter (1-2kb), Promoter (2-3kb) | promoter |
| 5' UTR, 3' UTR, 1st Exon, Other Exon | exon |
| 1st Intron, Other Intron | intron |
| Downstream (<=300), Downstream (<=1kb), Distal Intergenic | intergenic |
collapse_annotation <- function(ann) { ifelse(grepl('Promoter', ann), 'promoter', ifelse(grepl("5' UTR|3' UTR|Exon", ann), 'exon', ifelse(grepl('Intron', ann), 'intron', 'intergenic'))) } anno_df$feature <- collapse_annotation(anno_df$annotation)
To suppress subcategories at the source:
options(ChIPseeker.ignore_1st_exon = TRUE) options(ChIPseeker.ignore_1st_intron = TRUE) options(ChIPseeker.ignore_promoter_subcategory = TRUE)
Export Results
output <- data.frame(chr = anno_df$seqnames, start = anno_df$start, end = anno_df$end, nearest_gene = anno_df$SYMBOL, distance_to_tss = anno_df$distanceToTSS, feature = anno_df$feature) write.table(output, 'annotations.tsv', sep = '\t', row.names = FALSE, quote = FALSE)
Annotation with HOMER (CLI)
Standard Genome
annotatePeaks.pl peaks.bed hg38 > annotated.txt
Custom GTF
# With installed genome annotatePeaks.pl peaks.bed hg38 -gtf genes.gtf > annotated.txt # Without installed genome (annotation from GTF only) annotatePeaks.pl peaks.bed none -gtf genes.gtf > annotated.txt
For gzipped GTFs, decompress first:
gunzip -k genes.gtf.gz
Annotation Statistics
annotatePeaks.pl peaks.bed hg38 -gtf genes.gtf -annStats ann_stats.txt > annotated.txt
Parse HOMER Output
HOMER uses a two-part annotation process: (1) find the nearest TSS to assign a gene, and (2) independently classify the genomic feature at the peak center. The gene and annotation can come from different genes -- a peak in gene A's intron near gene B's TSS reports gene B with an intron annotation. This matches ChIPseeker's default
overlap='TSS' behavior.
HOMER produces 19 tab-delimited columns. Key columns for annotation:
| Column | Name | Content |
|---|---|---|
| 8 | Annotation | Feature category (promoter-TSS, exon, intron, Intergenic) |
| 10 | Distance to TSS | Signed distance (negative = upstream) |
| 16 | Gene Name | Gene symbol |
awk -F'\t' 'NR>1 {print $2, $3, $4, $16, $10, $8}' OFS='\t' annotated.txt > summary.tsv
HOMER Category Mapping
| HOMER Category | Collapsed |
|---|---|
| promoter-TSS | promoter |
| 5' UTR, 3' UTR, exon, non-coding | exon |
| intron | intron |
| Intergenic, TTS | intergenic |
Limitation: HOMER defines promoter as -1kb to +100bp from TSS; this window is not configurable via flags. For custom promoter windows, reclassify using the Distance to TSS column:
awk -F'\t' 'NR>1 { dist = ($10 < 0) ? -$10 : $10 feat = (dist <= 2000) ? "promoter" : $8 print $2, $3, $4, $16, $10, feat }' OFS='\t' annotated.txt > reclassified.tsv
Annotation with Python
Parse GTF and Extract Gene Models
Goal: Build gene, exon, and TSS tables from a GTF file for peak annotation.
Approach: Parse GTF line by line, extract gene and exon features, compute TSS positions from strand and coordinates, converting from 1-based GTF to 0-based BED coordinates.
import gzip, pandas as pd def parse_gtf(gtf_path): records = [] opener = gzip.open if gtf_path.endswith('.gz') else open with opener(gtf_path, 'rt') as f: for line in f: if line.startswith('#'): continue fields = line.strip().split('\t') attrs = {} for item in fields[8].strip().rstrip(';').split(';'): item = item.strip() if ' ' in item: key, val = item.split(' ', 1) attrs[key] = val.strip('"') records.append({'chrom': fields[0], 'feature': fields[2], 'start': int(fields[3]) - 1, 'end': int(fields[4]), 'strand': fields[6], **attrs}) return pd.DataFrame(records) gtf = parse_gtf('genes.gtf.gz') genes = gtf[gtf['feature'] == 'gene'].copy() genes['tss'] = genes.apply(lambda r: r['start'] if r['strand'] == '+' else r['end'], axis=1) exons = gtf[gtf['feature'] == 'exon']
Annotate Peaks with Host-Gene Convention
Goal: For each peak, classify its genomic feature and assign it to the appropriate gene with consistent gene-feature coupling.
Approach: Check features in priority order (promoter > exon > intron > intergenic). For promoter, find the nearest TSS within the window. For exon or intron, assign the host gene whose body contains the peak. For intergenic, fall back to nearest TSS. Compute strand-aware signed distance relative to the assigned gene's TSS.
peaks = pd.read_csv('peaks.bed', sep='\t', header=None, names=['chr', 'start', 'end', 'peak_id', 'score']) peaks['center'] = (peaks['start'] + peaks['end']) // 2 promoter_window = 2000 # bp from TSS; match to analysis requirements results = [] for _, peak in peaks.iterrows(): chrom_genes = genes[genes['chrom'] == peak['chr']] chrom_exons = exons[exons['chrom'] == peak['chr']] abs_dists = (chrom_genes['tss'] - peak['center']).abs() nearest_tss_gene = chrom_genes.loc[abs_dists.idxmin()] # Feature classification with host-gene coupling: promoter > exon > intron > intergenic if abs_dists.min() <= promoter_window: feature, assigned = 'promoter', nearest_tss_gene else: exon_hits = chrom_exons[(chrom_exons['start'] <= peak['center']) & (peak['center'] < chrom_exons['end'])] gene_hits = chrom_genes[(chrom_genes['start'] <= peak['center']) & (peak['center'] < chrom_genes['end'])] if len(exon_hits) > 0: host_gene_name = exon_hits.iloc[0].get('gene_name', '') host = chrom_genes[chrom_genes['gene_name'] == host_gene_name] feature, assigned = 'exon', host.iloc[0] if len(host) > 0 else nearest_tss_gene elif len(gene_hits) > 0: closest_host = gene_hits.loc[(gene_hits['tss'] - peak['center']).abs().idxmin()] feature, assigned = 'intron', closest_host else: feature, assigned = 'intergenic', nearest_tss_gene raw_dist = peak['center'] - assigned['tss'] signed_dist = -raw_dist if assigned['strand'] == '-' else raw_dist results.append({'peak_id': peak['peak_id'], 'chr': peak['chr'], 'start': peak['start'], 'end': peak['end'], 'nearest_gene': assigned['gene_name'], 'distance_to_tss': int(signed_dist), 'feature': feature}) result_df = pd.DataFrame(results) result_df.to_csv('annotations.tsv', sep='\t', index=False)
When multiple genes overlap the peak center (common on opposite strands), the host gene with the closest TSS is selected as a tiebreaker.
Alternative: pyranges
For projects with pyranges installed, GTF parsing is simpler:
import pyranges as pr gtf = pr.read_gtf('genes.gtf.gz') # auto-converts to 0-based half-open genes = gtf[gtf.Feature == 'gene'] peaks = pr.read_bed('peaks.bed') nearest = peaks.nearest(genes) # adds Distance column
Feature classification and signed distance still require manual logic on the result DataFrame.
Visualization
plotAnnoPie(peak_anno) plotAnnoBar(peak_anno) plotDistToTSS(peak_anno, title = 'Distribution of peaks relative to TSS')
Compare Multiple Peak Sets
peak_files <- list(H3K4me3 = 'h3k4me3_peaks.bed', H3K27ac = 'h3k27ac_peaks.bed') anno_list <- lapply(peak_files, function(f) annotatePeak(readPeakFile(f), TxDb = txdb)) plotAnnoBar(anno_list) plotDistToTSS(anno_list)
Key Parameters
| Parameter (ChIPseeker) | Default | Description |
|---|---|---|
| tssRegion | c(-3000, 3000) | Promoter window around TSS |
| level | "transcript" | "transcript" or "gene"; gene-level merges all isoforms |
| genomicAnnotationPriority | Promoter > ... > Intergenic | Feature priority for overlapping annotations |
| overlap | "TSS" | "TSS": gene = nearest TSS (gene and feature can be from different genes). "all": gene = host gene if peak overlaps any gene body (coupled annotation). Use "all" when gene-feature consistency matters |
| sameStrand | FALSE | Restrict to same-strand genes only |
| addFlankGeneInfo | FALSE | Include neighboring gene information |
Related Skills
- peak-calling - Generate peak files with MACS3 or HOMER
- motif-analysis - De novo and known motif enrichment in peak regions
- differential-binding - Compare peaks between conditions
- chipseq-visualization - Signal tracks, heatmaps, profile plots
- genome-intervals/gtf-gff-handling - Parse and convert GTF/GFF annotation files
- genome-intervals/proximity-operations - bedtools closest and window operations
- pathway-analysis/go-enrichment - Functional enrichment of peak-associated genes