SciAgent-Skills bedtools-genomic-intervals

Toolkit for genomic interval operations on BED, BAM, GFF, VCF files. Find overlapping regions, merge adjacent intervals, calculate coverage depth, extract FASTA sequences, find nearest features, and manipulate interval coordinates. Essential for ChIP-seq peak annotation, target region filtering, and genome arithmetic. Use tabix instead for indexed single-region queries; use deeptools for normalized bigWig coverage.

install

source · Clone the upstream repo

git clone https://github.com/jaechang-hits/SciAgent-Skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/bedtools-genomic-intervals" ~/.claude/skills/jaechang-hits-sciagent-skills-bedtools-genomic-intervals && rm -rf "$T"

manifest: skills/genomics-bioinformatics/bedtools-genomic-intervals/SKILL.md

source content

bedtools — Genomic Interval Analysis Toolkit

Overview

bedtools is the standard toolkit for operating on genomic intervals in BED, BAM, GFF, and VCF formats. It solves the core problem of genome arithmetic: finding overlaps between feature sets, computing coverage, extracting sequences, merging adjacent regions, and annotating features with nearest neighbors. bedtools operates on sorted coordinate lists and runs at C speed, making it practical for whole-genome analyses.

When to Use

Intersecting ChIP-seq peaks with gene annotations to find promoter-overlapping peaks
Merging overlapping ATAC-seq peaks or called regions across replicates
Computing read coverage depth over target capture regions
Extracting FASTA sequences for motif discovery or primer design
Finding the nearest gene to each regulatory element or variant
Subtracting blacklist or repeat regions from peak calls
Expanding genomic intervals by fixed distance (promoter regions)
Use
```
tabix
```
instead for fast indexed queries of a single genomic region
For normalized coverage bigWig tracks, use
```
deeptools bamCoverage
```
instead
Use
```
mosdepth
```
instead for whole-genome per-base depth (10× faster)

Prerequisites

Python packages: None required (command-line only)
Input requirements: BED/BAM/GFF/VCF files; FASTA reference for
```
getfasta
```
; genome file (chromosome sizes) for
```
slop
```
/
```
flank
```
/
```
genomecov
```
Sorting: Most operations require coordinate-sorted input

Check before installing: The tool may already be available in the current environment (e.g., inside a
pixi
/
conda
env). Run
command -v bedtools
first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via
pixi run bedtools
rather than bare
bedtools
.

# Bioconda (recommended)
conda install -c bioconda bedtools

# Homebrew (macOS)
brew install bedtools

# Verify
bedtools --version
# bedtools v2.31.0

# Create genome file from FASTA index
samtools faidx reference.fa
cut -f1,2 reference.fa.fai > genome.txt  # chr → size table

Quick Start

# Find peaks overlapping genes, then merge overlapping peaks
bedtools intersect -a peaks.bed -b genes.bed -wa -wb > peaks_with_genes.bed
bedtools merge -i peaks.bed > merged_peaks.bed
bedtools coverage -a genes.bed -b reads.bam > gene_coverage.bed

Core API

Module 1: Interval Intersection and Overlap Analysis

Find regions that overlap between two feature sets.

# Basic intersection: output overlapping regions
bedtools intersect -a peaks.bed -b genes.bed

# Report original A and B features for each overlap
bedtools intersect -a peaks.bed -b genes.bed -wa -wb

# Count B overlaps per A feature (adds column)
bedtools intersect -a peaks.bed -b genes.bed -c
# Output: chr1  1000  2000  peak1  gene_count

# Peaks with ANY overlap (report each peak once)
bedtools intersect -a peaks.bed -b genes.bed -u

# Peaks with NO overlap in B (invert filter)
bedtools intersect -a peaks.bed -b blacklist.bed -v

# Require reciprocal 50% overlap both ways
bedtools intersect -a exp1.bed -b exp2.bed -f 0.5 -F 0.5 -r

# Same-strand intersections only
bedtools intersect -a peaks.bed -b genes.bed -s

# Multiple database files with overlap counts per file
bedtools intersect -a query.bed -b enhancers.bed promoters.bed \
    -names enh prom -C

# Memory-efficient mode for pre-sorted large files
bedtools intersect -a sorted_peaks.bed -b sorted_genes.bed -sorted

Module 2: Interval Merging and Arithmetic

Combine overlapping intervals and perform set operations.

# Merge overlapping and adjacent intervals
sort -k1,1 -k2,2n peaks.bed | bedtools merge -i stdin

# Merge intervals within 500 bp of each other
bedtools merge -i peaks.bed -d 500

# Merge and count original features
bedtools merge -i peaks.bed -c 1 -o count
# Output: chr1  1000  5000  3 (3 original peaks merged)

# Merge and collapse feature names
bedtools merge -i peaks.bed -c 4 -o collapse -delim ";"
# Output: chr1  1000  5000  peak1;peak2;peak3

# Subtract B from A (remove covered bases)
bedtools subtract -a peaks.bed -b blacklist.bed

# Remove entire A feature if ANY B overlap
bedtools subtract -a peaks.bed -b exclusion.bed -A

# Find genomic gaps (complement of covered regions)
bedtools complement -i merged.bed -g genome.txt

Module 3: Coverage Analysis

Calculate depth and breadth of read coverage over features.

# Coverage stats per feature (count, bases covered, % covered)
bedtools coverage -a target_genes.bed -b aligned.bam
# Output: chr  start  end  gene  n_overlapping_reads  bases_covered  feature_len  fraction_covered

# Per-base depth within each feature
bedtools coverage -a targets.bed -b aligned.bam -d
# Output: chr  start  end  name  position  depth

# Coverage histogram per feature
bedtools coverage -a features.bed -b aligned.bam -hist

# Genome-wide BEDGRAPH (coverage per bin)
bedtools genomecov -ibam aligned.bam -bg -o coverage.bedgraph

# Include zero-coverage regions (for whole-genome coverage)
bedtools genomecov -ibam aligned.bam -bga > full_coverage.bedgraph

# Per-base depth for whole genome
bedtools genomecov -ibam aligned.bam -d > depth.txt

# Scaled BEDGRAPH (RPM normalization: total=50M reads → scale=1/50)
bedtools genomecov -ibam aligned.bam -bg -scale 0.00000002 > rpm.bedgraph

# Strand-specific coverage tracks
bedtools genomecov -ibam rnaseq.bam -bg -strand + > forward.bedgraph
bedtools genomecov -ibam rnaseq.bam -bg -strand - > reverse.bedgraph

Module 4: Sequence Extraction and Nearest Feature

Extract genomic sequences and annotate features with neighbors.

# Extract FASTA sequences for each BED region
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fasta

# Strand-aware extraction (reverse complement - strand)
bedtools getfasta -fi genome.fa -bed regions.bed -s -fo stranded.fasta

# Custom FASTA headers (name + coords)
bedtools getfasta -fi genome.fa -bed peaks.bed -name -fo named.fasta

# Extract and concatenate exons (BED12 spliced transcripts)
bedtools getfasta -fi genome.fa -bed transcripts.bed12 -split -fo exons.fasta

# Find nearest gene to each peak (with distance)
bedtools closest -a peaks.bed -b genes.bed -d
# Output: peak fields... | gene fields... | distance_bp

# Nearest feature on same strand only
bedtools closest -a peaks.bed -b genes.bed -s -d

# Ignore overlapping features (find nearest non-overlapping)
bedtools closest -a peaks.bed -b genes.bed -io -d

# Multiple annotation databases
bedtools closest -a query.bed -b genes.bed enhancers.bed \
    -names genes enhancers -d

Module 5: Interval Manipulation

Expand, contract, and shift genomic intervals.

# Expand regions by 500 bp on each side
bedtools slop -i peaks.bed -g genome.txt -b 500

# Asymmetric: 2000 bp upstream, 500 bp downstream of TSS
bedtools slop -i tss.bed -g genome.txt -l 2000 -r 500

# Strand-aware expansion (upstream = 5' side)
bedtools slop -i genes.bed -g genome.txt -l 1000 -r 200 -s

# Create flanking regions (not overlapping the feature)
bedtools flank -i genes.bed -g genome.txt -b 1000
bedtools flank -i genes.bed -g genome.txt -l 2000 -r 0 -s  # upstream only

Key Concepts

Coordinate Systems

BED files use 0-based half-open intervals: start is 0-indexed (like Python), end is exclusive. A region chr1:1000-2000 in BED covers bases 1000–1999 (1000 bases).

chr1  1000  2000  peak1   ← covers positions 1000,1001,...,1999
# BED:  0-based start, exclusive end
# VCF:  1-based position (POS)
# GFF:  1-based start and end (both inclusive)

bedtools converts internally — input format is auto-detected. Problems arise when mixing tools with different conventions.

Sorting Requirements

Most bedtools operations require coordinate-sorted input. Pre-sort with:

sort -k1,1 -k2,2n input.bed > sorted.bed
# For large files, use -S 4G for 4 GB sort buffer
sort -k1,1 -k2,2n -S 4G --parallel=8 input.bed > sorted.bed

The

-sorted

flag in

bedtools intersect

uses a sweep algorithm that requires sorted input but uses O(1) memory instead of O(N).

Common Workflows

Workflow 1: ChIP-seq Peak Annotation

Goal: Annotate peaks with overlapping genes, distances to TSS, and filter blacklisted regions.

#!/bin/bash
PEAKS="peaks.bed"
GENES="refseq_genes.bed"
TSS="refseq_tss.bed"       # BED with TSS positions
BLACKLIST="encode_blacklist_hg38.bed"
GENOME="hg38.genome"

# 1. Remove blacklisted regions
bedtools subtract -a $PEAKS -b $BLACKLIST -A > peaks_clean.bed
echo "After blacklist filter: $(wc -l < peaks_clean.bed) peaks"

# 2. Annotate with overlapping gene (allow 2 kb from gene body)
bedtools slop -i $GENES -g $GENOME -b 2000 > genes_padded.bed
bedtools intersect -a peaks_clean.bed -b genes_padded.bed -wa -wb \
    > peaks_gene_overlap.bed

# 3. For non-overlapping peaks: find nearest gene
bedtools intersect -a peaks_clean.bed -b genes_padded.bed -v > peaks_distal.bed
bedtools closest -a peaks_distal.bed -b $TSS -d > peaks_distal_nearest.bed

echo "Promoter peaks: $(wc -l < peaks_gene_overlap.bed)"
echo "Distal peaks: $(wc -l < peaks_distal.bed)"

Workflow 2: Coverage Analysis for WES Target Regions

Goal: Calculate on-target read depth and coverage breadth for exome sequencing QC.

#!/bin/bash
BAM="sample.deduped.bam"
TARGETS="capture_targets.bed"

# Per-target coverage statistics
bedtools coverage -a $TARGETS -b $BAM > per_target_coverage.bed

# Summary: total targets, mean depth, % at ≥20×
awk 'BEGIN{n=0; depth=0; covered=0}
     {n++; depth+=$7; if($8>=20) covered++}
     END{printf "Targets: %d\nMean depth: %.1f×\n%% at 20×: %.1f%%\n",
         n, depth/n, covered/n*100}' per_target_coverage.bed

# Per-base depth for IGV visualization
bedtools coverage -a $TARGETS -b $BAM -d > per_base_depth.txt
echo "Per-base depth written to per_base_depth.txt"

Key Parameters

Parameter	Command	Default	Range/Options	Effect
`-f`	intersect, coverage	1e-9	0.0–1.0	Min fraction of A that must overlap
`-F`	intersect, coverage	1e-9	0.0–1.0	Min fraction of B that must overlap
`-r`	intersect	—	flag	Require reciprocal overlap ( `-f` AND `-F` )
`-s`	Most	—	flag	Strand-aware (same strand only)
`-v`	intersect	—	flag	Report features with NO overlap (invert)
`-c`	intersect	—	flag	Append overlap count per A feature
`-d`	merge	0	integer	Max gap to merge (bp)
`-bg`	genomecov	—	flag	BEDGRAPH output format
`-scale`	genomecov	1.0	float	Multiply coverage by constant (for RPM)
`-sorted`	intersect, closest	—	flag	Use sweep algorithm (sorted input required)
`-b`	slop	—	integer	Expand interval by N bp on both sides
`-D`	closest	—	ref/a/b	Report signed distance (upstream negative)

Best Practices

Always sort before bedtools: Most bedtools commands fail silently on unsorted input. Sort with
```
sort -k1,1 -k2,2n input.bed
```
before any bedtools operation.
Use
```
-sorted
```
for large files: For pre-sorted files,
```
-sorted
```
reduces memory from O(N) to O(1). Required when intersecting multi-gigabyte BED files.
Check chromosome naming consistency: The single most common failure — some tools use
```
chr1
```
, others use
```
1
```
. Verify with
```
cut -f1 file.bed | sort -u
```
before running intersections.
Apply blacklist early: Run
```
bedtools subtract -b blacklist.bed -A
```
before any peak analysis. ENCODE blacklists remove artifactual signal in repetitive/high-copy regions.
Use
```
-f 0.5 -r
```
for peak reproducibility: When intersecting peaks across replicates, require 50% reciprocal overlap to avoid spurious short overlaps at interval boundaries.
Validate BED format: Malformed BED (wrong column count, text in numeric columns) causes silent failures. Test with
```
bedtools merge -i file.bed 2>&1 | head -5
```
.

Common Recipes

Recipe: Count Feature Overlaps Across Annotation Categories

# Report how many peaks overlap each category (genes, promoters, enhancers)
for category in genes.bed promoters.bed enhancers.bed repeats.bed; do
    label=$(basename $category .bed)
    count=$(bedtools intersect -a peaks.bed -b $category -u | wc -l)
    total=$(wc -l < peaks.bed)
    echo "$label: $count/$total ($(echo "scale=1; $count*100/$total" | bc)%)"
done

Recipe: Create Promoter Regions from Gene Annotations

# Extract 2kb upstream of TSS for ChIP annotation
# For genes on + strand: TSS = start; on - strand: TSS = end
awk 'BEGIN{OFS="\t"} $6=="+" {print $1,$2,$2+1,$4,$5,$6}
                     $6=="-" {print $1,$3-1,$3,$4,$5,$6}' genes.bed > tss.bed

bedtools slop -i tss.bed -g genome.txt -l 2000 -r 500 -s > promoters.bed
echo "Created $(wc -l < promoters.bed) promoter regions"

Recipe: Calculate Intersection Statistics

# Jaccard similarity between two peak sets (0=no overlap, 1=identical)
bedtools sort -i set1.bed > s1.bed
bedtools sort -i set2.bed > s2.bed
bedtools jaccard -a s1.bed -b s2.bed
# Output: intersection  union  jaccard  n_intersections
#         423456        2345678  0.1804  892

Troubleshooting

Problem	Cause	Solution
Empty intersect output	Chromosome name mismatch (chr1 vs 1)	Check: `cut -f1 a.bed \| sort -u` vs `cut -f1 b.bed \| sort -u`
Memory error on large files	Not using `-sorted` flag	Pre-sort inputs and add `-sorted` to intersect/closest
`getfasta: sequence not found`	FASTA headers differ from BED chr names	Index FASTA: `samtools faidx genome.fa` ; match names exactly
Zero coverage everywhere	BAM not indexed or BED/BAM chr mismatch	Run `samtools index file.bam` ; verify chr naming
Merge doesn't merge expected features	Input not sorted by coordinate	Sort: `sort -k1,1 -k2,2n file.bed \| bedtools merge -i stdin`
`getfasta` produces wrong-strand sequence	Using `-s` without strand column in BED	Ensure BED col 6 has `+` / `-` ; add strand: `awk '{$6="+"; print}' OFS="\t"`
Off-by-one in coordinates	Mixing 0-based BED and 1-based VCF/GFF	Convert GFF to BED: subtract 1 from start
Slow on large genomes	Processing unsorted files	Sort both files; use `-sorted` ; pipe through sort without writing temp files

Related Skills

samtools-bam-processing — BAM sorting, filtering, and QC before passing to bedtools
deeptools-ngs-analysis — normalized coverage bigWig tracks and heatmaps from the BAM files bedtools processes
pysam-genomic-files — Python-native BAM/BED manipulation for custom logic beyond bedtools

References

bedtools documentation — command reference and examples
GitHub: arq5x/bedtools2 — source, releases, issue tracker
Quinlan & Hall (2010) "BEDTools: a flexible suite of utilities for comparing genomic features" — Bioinformatics 26(6)
ENCODE blacklist regions — curated problematic genomic regions