SciAgent-Skills bedtools-genomic-intervals

Toolkit for genomic interval operations on BED, BAM, GFF, VCF files. Find overlapping regions, merge adjacent intervals, calculate coverage depth, extract FASTA sequences, find nearest features, and manipulate interval coordinates. Essential for ChIP-seq peak annotation, target region filtering, and genome arithmetic. Use tabix instead for indexed single-region queries; use deeptools for normalized bigWig coverage.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/bedtools-genomic-intervals" ~/.claude/skills/jaechang-hits-sciagent-skills-bedtools-genomic-intervals && rm -rf "$T"
manifest: skills/genomics-bioinformatics/bedtools-genomic-intervals/SKILL.md
source content

bedtools — Genomic Interval Analysis Toolkit

Overview

bedtools is the standard toolkit for operating on genomic intervals in BED, BAM, GFF, and VCF formats. It solves the core problem of genome arithmetic: finding overlaps between feature sets, computing coverage, extracting sequences, merging adjacent regions, and annotating features with nearest neighbors. bedtools operates on sorted coordinate lists and runs at C speed, making it practical for whole-genome analyses.

When to Use

  • Intersecting ChIP-seq peaks with gene annotations to find promoter-overlapping peaks
  • Merging overlapping ATAC-seq peaks or called regions across replicates
  • Computing read coverage depth over target capture regions
  • Extracting FASTA sequences for motif discovery or primer design
  • Finding the nearest gene to each regulatory element or variant
  • Subtracting blacklist or repeat regions from peak calls
  • Expanding genomic intervals by fixed distance (promoter regions)
  • Use
    tabix
    instead for fast indexed queries of a single genomic region
  • For normalized coverage bigWig tracks, use
    deeptools bamCoverage
    instead
  • Use
    mosdepth
    instead for whole-genome per-base depth (10× faster)

Prerequisites

  • Python packages: None required (command-line only)
  • Input requirements: BED/BAM/GFF/VCF files; FASTA reference for
    getfasta
    ; genome file (chromosome sizes) for
    slop
    /
    flank
    /
    genomecov
  • Sorting: Most operations require coordinate-sorted input

Check before installing: The tool may already be available in the current environment (e.g., inside a

pixi
/
conda
env). Run
command -v bedtools
first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via
pixi run bedtools
rather than bare
bedtools
.

# Bioconda (recommended)
conda install -c bioconda bedtools

# Homebrew (macOS)
brew install bedtools

# Verify
bedtools --version
# bedtools v2.31.0

# Create genome file from FASTA index
samtools faidx reference.fa
cut -f1,2 reference.fa.fai > genome.txt  # chr → size table

Quick Start

# Find peaks overlapping genes, then merge overlapping peaks
bedtools intersect -a peaks.bed -b genes.bed -wa -wb > peaks_with_genes.bed
bedtools merge -i peaks.bed > merged_peaks.bed
bedtools coverage -a genes.bed -b reads.bam > gene_coverage.bed

Core API

Module 1: Interval Intersection and Overlap Analysis

Find regions that overlap between two feature sets.

# Basic intersection: output overlapping regions
bedtools intersect -a peaks.bed -b genes.bed

# Report original A and B features for each overlap
bedtools intersect -a peaks.bed -b genes.bed -wa -wb

# Count B overlaps per A feature (adds column)
bedtools intersect -a peaks.bed -b genes.bed -c
# Output: chr1  1000  2000  peak1  gene_count

# Peaks with ANY overlap (report each peak once)
bedtools intersect -a peaks.bed -b genes.bed -u

# Peaks with NO overlap in B (invert filter)
bedtools intersect -a peaks.bed -b blacklist.bed -v
# Require reciprocal 50% overlap both ways
bedtools intersect -a exp1.bed -b exp2.bed -f 0.5 -F 0.5 -r

# Same-strand intersections only
bedtools intersect -a peaks.bed -b genes.bed -s

# Multiple database files with overlap counts per file
bedtools intersect -a query.bed -b enhancers.bed promoters.bed \
    -names enh prom -C

# Memory-efficient mode for pre-sorted large files
bedtools intersect -a sorted_peaks.bed -b sorted_genes.bed -sorted

Module 2: Interval Merging and Arithmetic

Combine overlapping intervals and perform set operations.

# Merge overlapping and adjacent intervals
sort -k1,1 -k2,2n peaks.bed | bedtools merge -i stdin

# Merge intervals within 500 bp of each other
bedtools merge -i peaks.bed -d 500

# Merge and count original features
bedtools merge -i peaks.bed -c 1 -o count
# Output: chr1  1000  5000  3 (3 original peaks merged)

# Merge and collapse feature names
bedtools merge -i peaks.bed -c 4 -o collapse -delim ";"
# Output: chr1  1000  5000  peak1;peak2;peak3
# Subtract B from A (remove covered bases)
bedtools subtract -a peaks.bed -b blacklist.bed

# Remove entire A feature if ANY B overlap
bedtools subtract -a peaks.bed -b exclusion.bed -A

# Find genomic gaps (complement of covered regions)
bedtools complement -i merged.bed -g genome.txt

Module 3: Coverage Analysis

Calculate depth and breadth of read coverage over features.

# Coverage stats per feature (count, bases covered, % covered)
bedtools coverage -a target_genes.bed -b aligned.bam
# Output: chr  start  end  gene  n_overlapping_reads  bases_covered  feature_len  fraction_covered

# Per-base depth within each feature
bedtools coverage -a targets.bed -b aligned.bam -d
# Output: chr  start  end  name  position  depth

# Coverage histogram per feature
bedtools coverage -a features.bed -b aligned.bam -hist
# Genome-wide BEDGRAPH (coverage per bin)
bedtools genomecov -ibam aligned.bam -bg -o coverage.bedgraph

# Include zero-coverage regions (for whole-genome coverage)
bedtools genomecov -ibam aligned.bam -bga > full_coverage.bedgraph

# Per-base depth for whole genome
bedtools genomecov -ibam aligned.bam -d > depth.txt

# Scaled BEDGRAPH (RPM normalization: total=50M reads → scale=1/50)
bedtools genomecov -ibam aligned.bam -bg -scale 0.00000002 > rpm.bedgraph

# Strand-specific coverage tracks
bedtools genomecov -ibam rnaseq.bam -bg -strand + > forward.bedgraph
bedtools genomecov -ibam rnaseq.bam -bg -strand - > reverse.bedgraph

Module 4: Sequence Extraction and Nearest Feature

Extract genomic sequences and annotate features with neighbors.

# Extract FASTA sequences for each BED region
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fasta

# Strand-aware extraction (reverse complement - strand)
bedtools getfasta -fi genome.fa -bed regions.bed -s -fo stranded.fasta

# Custom FASTA headers (name + coords)
bedtools getfasta -fi genome.fa -bed peaks.bed -name -fo named.fasta

# Extract and concatenate exons (BED12 spliced transcripts)
bedtools getfasta -fi genome.fa -bed transcripts.bed12 -split -fo exons.fasta
# Find nearest gene to each peak (with distance)
bedtools closest -a peaks.bed -b genes.bed -d
# Output: peak fields... | gene fields... | distance_bp

# Nearest feature on same strand only
bedtools closest -a peaks.bed -b genes.bed -s -d

# Ignore overlapping features (find nearest non-overlapping)
bedtools closest -a peaks.bed -b genes.bed -io -d

# Multiple annotation databases
bedtools closest -a query.bed -b genes.bed enhancers.bed \
    -names genes enhancers -d

Module 5: Interval Manipulation

Expand, contract, and shift genomic intervals.

# Expand regions by 500 bp on each side
bedtools slop -i peaks.bed -g genome.txt -b 500

# Asymmetric: 2000 bp upstream, 500 bp downstream of TSS
bedtools slop -i tss.bed -g genome.txt -l 2000 -r 500

# Strand-aware expansion (upstream = 5' side)
bedtools slop -i genes.bed -g genome.txt -l 1000 -r 200 -s

# Create flanking regions (not overlapping the feature)
bedtools flank -i genes.bed -g genome.txt -b 1000
bedtools flank -i genes.bed -g genome.txt -l 2000 -r 0 -s  # upstream only

Key Concepts

Coordinate Systems

BED files use 0-based half-open intervals: start is 0-indexed (like Python), end is exclusive. A region chr1:1000-2000 in BED covers bases 1000–1999 (1000 bases).

chr1  1000  2000  peak1   ← covers positions 1000,1001,...,1999
# BED:  0-based start, exclusive end
# VCF:  1-based position (POS)
# GFF:  1-based start and end (both inclusive)

bedtools converts internally — input format is auto-detected. Problems arise when mixing tools with different conventions.

Sorting Requirements

Most bedtools operations require coordinate-sorted input. Pre-sort with:

sort -k1,1 -k2,2n input.bed > sorted.bed
# For large files, use -S 4G for 4 GB sort buffer
sort -k1,1 -k2,2n -S 4G --parallel=8 input.bed > sorted.bed

The

-sorted
flag in
bedtools intersect
uses a sweep algorithm that requires sorted input but uses O(1) memory instead of O(N).

Common Workflows

Workflow 1: ChIP-seq Peak Annotation

Goal: Annotate peaks with overlapping genes, distances to TSS, and filter blacklisted regions.

#!/bin/bash
PEAKS="peaks.bed"
GENES="refseq_genes.bed"
TSS="refseq_tss.bed"       # BED with TSS positions
BLACKLIST="encode_blacklist_hg38.bed"
GENOME="hg38.genome"

# 1. Remove blacklisted regions
bedtools subtract -a $PEAKS -b $BLACKLIST -A > peaks_clean.bed
echo "After blacklist filter: $(wc -l < peaks_clean.bed) peaks"

# 2. Annotate with overlapping gene (allow 2 kb from gene body)
bedtools slop -i $GENES -g $GENOME -b 2000 > genes_padded.bed
bedtools intersect -a peaks_clean.bed -b genes_padded.bed -wa -wb \
    > peaks_gene_overlap.bed

# 3. For non-overlapping peaks: find nearest gene
bedtools intersect -a peaks_clean.bed -b genes_padded.bed -v > peaks_distal.bed
bedtools closest -a peaks_distal.bed -b $TSS -d > peaks_distal_nearest.bed

echo "Promoter peaks: $(wc -l < peaks_gene_overlap.bed)"
echo "Distal peaks: $(wc -l < peaks_distal.bed)"

Workflow 2: Coverage Analysis for WES Target Regions

Goal: Calculate on-target read depth and coverage breadth for exome sequencing QC.

#!/bin/bash
BAM="sample.deduped.bam"
TARGETS="capture_targets.bed"

# Per-target coverage statistics
bedtools coverage -a $TARGETS -b $BAM > per_target_coverage.bed

# Summary: total targets, mean depth, % at ≥20×
awk 'BEGIN{n=0; depth=0; covered=0}
     {n++; depth+=$7; if($8>=20) covered++}
     END{printf "Targets: %d\nMean depth: %.1f×\n%% at 20×: %.1f%%\n",
         n, depth/n, covered/n*100}' per_target_coverage.bed

# Per-base depth for IGV visualization
bedtools coverage -a $TARGETS -b $BAM -d > per_base_depth.txt
echo "Per-base depth written to per_base_depth.txt"

Key Parameters

ParameterCommandDefaultRange/OptionsEffect
-f
intersect, coverage1e-90.0–1.0Min fraction of A that must overlap
-F
intersect, coverage1e-90.0–1.0Min fraction of B that must overlap
-r
intersectflagRequire reciprocal overlap (
-f
AND
-F
)
-s
MostflagStrand-aware (same strand only)
-v
intersectflagReport features with NO overlap (invert)
-c
intersectflagAppend overlap count per A feature
-d
merge0integerMax gap to merge (bp)
-bg
genomecovflagBEDGRAPH output format
-scale
genomecov1.0floatMultiply coverage by constant (for RPM)
-sorted
intersect, closestflagUse sweep algorithm (sorted input required)
-b
slopintegerExpand interval by N bp on both sides
-D
closestref/a/bReport signed distance (upstream negative)

Best Practices

  1. Always sort before bedtools: Most bedtools commands fail silently on unsorted input. Sort with

    sort -k1,1 -k2,2n input.bed
    before any bedtools operation.

  2. Use

    -sorted
    for large files: For pre-sorted files,
    -sorted
    reduces memory from O(N) to O(1). Required when intersecting multi-gigabyte BED files.

  3. Check chromosome naming consistency: The single most common failure — some tools use

    chr1
    , others use
    1
    . Verify with
    cut -f1 file.bed | sort -u
    before running intersections.

  4. Apply blacklist early: Run

    bedtools subtract -b blacklist.bed -A
    before any peak analysis. ENCODE blacklists remove artifactual signal in repetitive/high-copy regions.

  5. Use

    -f 0.5 -r
    for peak reproducibility: When intersecting peaks across replicates, require 50% reciprocal overlap to avoid spurious short overlaps at interval boundaries.

  6. Validate BED format: Malformed BED (wrong column count, text in numeric columns) causes silent failures. Test with

    bedtools merge -i file.bed 2>&1 | head -5
    .

Common Recipes

Recipe: Count Feature Overlaps Across Annotation Categories

# Report how many peaks overlap each category (genes, promoters, enhancers)
for category in genes.bed promoters.bed enhancers.bed repeats.bed; do
    label=$(basename $category .bed)
    count=$(bedtools intersect -a peaks.bed -b $category -u | wc -l)
    total=$(wc -l < peaks.bed)
    echo "$label: $count/$total ($(echo "scale=1; $count*100/$total" | bc)%)"
done

Recipe: Create Promoter Regions from Gene Annotations

# Extract 2kb upstream of TSS for ChIP annotation
# For genes on + strand: TSS = start; on - strand: TSS = end
awk 'BEGIN{OFS="\t"} $6=="+" {print $1,$2,$2+1,$4,$5,$6}
                     $6=="-" {print $1,$3-1,$3,$4,$5,$6}' genes.bed > tss.bed

bedtools slop -i tss.bed -g genome.txt -l 2000 -r 500 -s > promoters.bed
echo "Created $(wc -l < promoters.bed) promoter regions"

Recipe: Calculate Intersection Statistics

# Jaccard similarity between two peak sets (0=no overlap, 1=identical)
bedtools sort -i set1.bed > s1.bed
bedtools sort -i set2.bed > s2.bed
bedtools jaccard -a s1.bed -b s2.bed
# Output: intersection  union  jaccard  n_intersections
#         423456        2345678  0.1804  892

Troubleshooting

ProblemCauseSolution
Empty intersect outputChromosome name mismatch (chr1 vs 1)Check:
cut -f1 a.bed | sort -u
vs
cut -f1 b.bed | sort -u
Memory error on large filesNot using
-sorted
flag
Pre-sort inputs and add
-sorted
to intersect/closest
getfasta: sequence not found
FASTA headers differ from BED chr namesIndex FASTA:
samtools faidx genome.fa
; match names exactly
Zero coverage everywhereBAM not indexed or BED/BAM chr mismatchRun
samtools index file.bam
; verify chr naming
Merge doesn't merge expected featuresInput not sorted by coordinateSort:
sort -k1,1 -k2,2n file.bed | bedtools merge -i stdin
getfasta
produces wrong-strand sequence
Using
-s
without strand column in BED
Ensure BED col 6 has
+
/
-
; add strand:
awk '{$6="+"; print}' OFS="\t"
Off-by-one in coordinatesMixing 0-based BED and 1-based VCF/GFFConvert GFF to BED: subtract 1 from start
Slow on large genomesProcessing unsorted filesSort both files; use
-sorted
; pipe through sort without writing temp files

Related Skills

  • samtools-bam-processing — BAM sorting, filtering, and QC before passing to bedtools
  • deeptools-ngs-analysis — normalized coverage bigWig tracks and heatmaps from the BAM files bedtools processes
  • pysam-genomic-files — Python-native BAM/BED manipulation for custom logic beyond bedtools

References