BioSkills bio-genome-intervals-bed-file-basics
BED file format fundamentals, creation, validation, and basic operations. Covers BED3 through BED12 formats, coordinate systems, sorting, and format conversion using bedtools and pybedtools. Use when working with genomic coordinates or preparing interval files for downstream tools.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/genome-intervals/bed-file-basics" ~/.claude/skills/gptomics-bioskills-bio-genome-intervals-bed-file-basics && rm -rf "$T"
genome-intervals/bed-file-basics/SKILL.mdVersion Compatibility
Reference examples tested with: bcftools 1.19+, bedtools 2.31+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - CLI:
then<tool> --version
to confirm flags<tool> --help
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
BED File Basics
"Work with BED files" → Read, create, and manipulate genomic interval files in BED format (0-based, half-open coordinates).
- Python:
(pybedtools)pybedtools.BedTool('file.bed') - CLI:
commandsbedtools
BED (Browser Extensible Data) format stores genomic intervals. Uses 0-based, half-open coordinates.
BED Format Columns
BED3: chr start end BED4: chr start end name BED5: chr start end name score BED6: chr start end name score strand BED12: chr start end name score strand thickStart thickEnd rgb blockCount blockSizes blockStarts
Coordinate System
BED uses 0-based, half-open coordinates:
- Start: 0-based (first base is 0)
- End: exclusive (not included)
- Position 100-200 means bases at positions 100-199
Create BED Files
From Text (CLI)
# Create simple BED3 echo -e "chr1\t100\t200\nchr1\t300\t400" > regions.bed # Create BED6 with name and strand echo -e "chr1\t100\t200\tpeak1\t100\t+" > peaks.bed
From Python
import pybedtools # From string bed_string = '''chr1\t100\t200\tpeak1\t100\t+ chr1\t300\t400\tpeak2\t200\t-''' bed = pybedtools.BedTool(bed_string, from_string=True) # From list of tuples intervals = [ ('chr1', 100, 200, 'peak1', 100, '+'), ('chr1', 300, 400, 'peak2', 200, '-'), ] bed = pybedtools.BedTool(intervals) # From pandas DataFrame import pandas as pd df = pd.DataFrame({ 'chrom': ['chr1', 'chr1'], 'start': [100, 300], 'end': [200, 400], 'name': ['peak1', 'peak2'], 'score': [100, 200], 'strand': ['+', '-'] }) bed = pybedtools.BedTool.from_dataframe(df) # Save to file bed.saveas('output.bed')
Sort BED Files
CLI
# Sort by chromosome and position sort -k1,1 -k2,2n input.bed > sorted.bed # Using bedtools bedtools sort -i input.bed > sorted.bed # Sort by chromosome, start, then end sort -k1,1 -k2,2n -k3,3n input.bed > sorted.bed
Python
import pybedtools bed = pybedtools.BedTool('input.bed') sorted_bed = bed.sort() sorted_bed.saveas('sorted.bed')
Validate BED Files
Check Format
# Check column count awk -F'\t' '{print NF}' input.bed | sort -u # Check for invalid coordinates (start >= end) awk '$2 >= $3' input.bed # Check for negative coordinates awk '$2 < 0 || $3 < 0' input.bed # Validate with bedtools bedtools sort -i input.bed > /dev/null 2>&1 && echo "Valid" || echo "Invalid"
Python Validation
Goal: Programmatically validate a BED file for common format errors before using it in downstream tools.
Approach: Load the file with pybedtools, iterate through all intervals checking for negative coordinates and invalid ranges (start >= end), and return a pass/fail status with error description.
import pybedtools def validate_bed(filepath): try: bed = pybedtools.BedTool(filepath) for interval in bed: if interval.start < 0 or interval.end < 0: return False, 'Negative coordinates' if interval.start >= interval.end: return False, f'Invalid interval: {interval.start} >= {interval.end}' return True, 'Valid' except Exception as e: return False, str(e) valid, msg = validate_bed('input.bed') print(f'{msg}')
Read BED Files
Python
import pybedtools # Load BED file bed = pybedtools.BedTool('input.bed') # Iterate over intervals for interval in bed: print(f'{interval.chrom}:{interval.start}-{interval.end}') print(f'Name: {interval.name}, Score: {interval.score}, Strand: {interval.strand}') # Count intervals n_intervals = bed.count() # Convert to pandas DataFrame df = bed.to_dataframe() # Access specific columns df = bed.to_dataframe(names=['chrom', 'start', 'end', 'name', 'score', 'strand'])
Filter BED Files
By Chromosome
# Single chromosome grep "^chr1\t" input.bed > chr1.bed # Multiple chromosomes grep -E "^(chr1|chr2)\t" input.bed > chr1_2.bed # Exclude chromosome grep -v "^chrM\t" input.bed > no_chrM.bed
By Size
# Intervals >= 100bp awk '($3 - $2) >= 100' input.bed > large.bed # Intervals between 100-500bp awk '($3 - $2) >= 100 && ($3 - $2) <= 500' input.bed > medium.bed
Python Filtering
import pybedtools bed = pybedtools.BedTool('input.bed') # Filter by chromosome chr1 = bed.filter(lambda x: x.chrom == 'chr1') # Filter by size large = bed.filter(lambda x: len(x) >= 100) # Filter by strand plus_strand = bed.filter(lambda x: x.strand == '+') # Filter by score high_score = bed.filter(lambda x: float(x.score) >= 500) # Chain filters result = bed.filter(lambda x: x.chrom == 'chr1' and len(x) >= 100) result.saveas('filtered.bed')
Convert Formats
BED to Other Formats
# BED to GFF bedtools bed12togff -i input.bed > output.gff # BED to FASTA (extract sequences) bedtools getfasta -fi reference.fa -bed input.bed -fo output.fa # BED to FASTA with names bedtools getfasta -fi reference.fa -bed input.bed -name -fo output.fa
From VCF to BED
# Extract variant positions bcftools query -f '%CHROM\t%POS0\t%END\n' input.vcf > variants.bed # Or using awk (simpler for SNPs) grep -v "^#" input.vcf | awk '{print $1"\t"$2-1"\t"$2}' > snps.bed
From BAM to BED
# Convert alignments to BED bedtools bamtobed -i input.bam > alignments.bed # BED12 for spliced alignments bedtools bamtobed -i input.bam -split > spliced.bed
Interval Arithmetic Basics
import pybedtools interval = pybedtools.create_interval_from_list(['chr1', '100', '200', 'peak1', '0', '+']) # Access fields print(interval.chrom) # chr1 print(interval.start) # 100 (int) print(interval.end) # 200 (int) print(interval.name) # peak1 print(interval.score) # 0 print(interval.strand) # + # Get length print(len(interval)) # 100 # Get fields list print(interval.fields) # ['chr1', '100', '200', 'peak1', '0', '+']
Make Windows - Create Genomic Intervals
CLI
# Fixed-size windows across genome bedtools makewindows -g genome.txt -w 10000 > windows_10kb.bed # Windows with step size (sliding windows) bedtools makewindows -g genome.txt -w 10000 -s 5000 > sliding_10kb.bed # Fixed number of windows per chromosome bedtools makewindows -g genome.txt -n 100 > 100_windows_per_chr.bed # Windows within BED regions bedtools makewindows -b regions.bed -w 1000 > windows_in_regions.bed # Add window ID bedtools makewindows -g genome.txt -w 10000 -i winnum > numbered_windows.bed # Source chromosome in name bedtools makewindows -g genome.txt -w 10000 -i srcwinnum > windows_with_source.bed
Python
import pybedtools # From genome file windows = pybedtools.BedTool().window_maker(g='genome.txt', w=10000) # Sliding windows windows = pybedtools.BedTool().window_maker(g='genome.txt', w=10000, s=5000) # From BED regions bed = pybedtools.BedTool('regions.bed') windows = pybedtools.BedTool().window_maker(b=bed.fn, w=1000) windows.saveas('windows.bed')
Common BED Extensions
| Format | Description | Columns |
|---|---|---|
| narrowPeak | ENCODE narrow peaks | BED6 + signalValue, pValue, qValue, peak |
| broadPeak | ENCODE broad peaks | BED6 + signalValue, pValue, qValue |
| bedGraph | Signal track | chr, start, end, value |
| bedpe | Paired intervals | chr1, s1, e1, chr2, s2, e2, name, score, strand1, strand2 |
Cleanup Temp Files
import pybedtools # At end of script pybedtools.cleanup() # Or use context manager (auto-cleanup) pybedtools.set_tempdir('/tmp/pybedtools')
Related Skills
- interval-arithmetic - intersect, subtract, merge operations
- gtf-gff-handling - annotation file parsing
- coverage-analysis - depth calculations
- alignment-files/sam-bam-basics - BAM to BED conversion