SciAgent-Skills gtars

GTARS is a Rust-backed Python library for fast genomic token arithmetic and BED file processing. Perform high-performance BED file I/O, genomic interval set operations (intersect, merge, complement, subtract), tokenization of genomic regions against a universe, and universe construction. Use for preprocessing large BED file collections, building token vocabularies for ML pipelines, and computing interval statistics at scale.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/gtars" ~/.claude/skills/jaechang-hits-sciagent-skills-gtars && rm -rf "$T"
manifest: skills/genomics-bioinformatics/gtars/SKILL.md
source content

GTARS: Fast Genomic Token Arithmetic and BED File Processing

Overview

GTARS is a Python library with a Rust-backed core for high-performance genomic interval operations. It provides BED file I/O, set-theoretic interval operations (intersection, union, merge, complement, subtraction), genomic region tokenization against a reference universe, and utilities for building consensus universe BED files. GTARS is designed for workflows that efficiently process hundreds to thousands of BED files, serving as a preprocessing engine for ML pipelines (including geniml) and for general bioinformatics work.

When to Use

  • Read and write large BED files efficiently, leveraging Rust-backed parsing for speed over pure Python alternatives
  • Compute genomic interval intersections, merges, complements, or subtractions between BED file pairs or sets
  • Tokenize a collection of genomic regions against a fixed universe vocabulary for ML input preparation
  • Build consensus universe BED files from a collection of sample BED files
  • Count overlap statistics between two BED files without launching bedtools processes
  • Preprocess ATAC-seq, ChIP-seq, or ENCODE peak files before feeding into geniml or other ML tools
  • For full BED/BAM/SAM reading with CIGAR-level detail, use pysam-genomic-files instead

Prerequisites

  • Python packages: gtars, numpy (optional, for array conversion)
  • Data requirements: BED format files (3+ columns: chr, start, end); optionally a universe BED file for tokenization
  • Environment: Python 3.8+; Rust toolchain not required (pre-built wheels available on PyPI)
pip install gtars
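
To try Quick Start without real data, you can first write a small toy BED file. BED is tab-separated with at least chr, start, and end columns, and peaks.bed matches the filename used below; this sketch needs only the standard library.

# Write a small 3-column BED file (tab-separated: chr, start, end) for testing
intervals = [
    ("chr1", 1000, 2000),
    ("chr1", 5000, 5600),
    ("chr2", 300, 900),
]
with open("peaks.bed", "w") as fh:
    for chrom, start, end in intervals:
        fh.write(f"{chrom}\t{start}\t{end}\n")
print("Wrote peaks.bed")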

Quick Start

from gtars import GenomicIntervalSet

# Load a BED file and inspect
gis = GenomicIntervalSet("peaks.bed")
print(f"Loaded {len(gis)} intervals")
print(f"First interval: {gis[0]}")
# First interval: chr1:1000-2000

# Intersect two BED files
gis2 = GenomicIntervalSet("other_peaks.bed")
overlap = gis.intersect(gis2)
print(f"Overlapping intervals: {len(overlap)}")

Core API

Module 1: GenomicIntervalSet — BED File I/O

Primary data structure for loading, indexing, and writing genomic intervals.

from gtars import GenomicIntervalSet

# Load from file
gis = GenomicIntervalSet("peaks.bed")
print(f"Intervals loaded: {len(gis)}")
print(f"Chromosomes present: {gis.chromosomes}")

# Access by index
interval = gis[0]
print(f"chr={interval.chr}, start={interval.start}, end={interval.end}")

# Iterate all intervals
for iv in gis:
    width = iv.end - iv.start
    if width > 5000:
        print(f"Wide interval: {iv.chr}:{iv.start}-{iv.end} ({width} bp)")
        break

from gtars import GenomicIntervalSet, GenomicInterval

# Create from a list of GenomicInterval objects
intervals = [
    GenomicInterval("chr1", 100, 500),
    GenomicInterval("chr1", 600, 1200),
    GenomicInterval("chr2", 300, 800),
]
gis = GenomicIntervalSet(intervals)
print(f"Created GenomicIntervalSet with {len(gis)} intervals")

# Write to BED file
gis.to_bed("output_intervals.bed")
print("Saved output_intervals.bed")

Module 2: Interval Set Operations

Compute intersections, unions, merges, subtractions, and complements between BED sets.

from gtars import GenomicIntervalSet

peaks_a = GenomicIntervalSet("condition_A.bed")
peaks_b = GenomicIntervalSet("condition_B.bed")

# Intersection: intervals present in both sets
shared = peaks_a.intersect(peaks_b)
print(f"Shared intervals: {len(shared)}")

# Subtraction: intervals in A not overlapping B
a_only = peaks_a.subtract(peaks_b)
print(f"A-specific intervals: {len(a_only)}")

# Union: combine both sets, then merge overlapping intervals
combined = peaks_a.union(peaks_b)
print(f"Union intervals: {len(combined)}")

from gtars import GenomicIntervalSet

# Merge overlapping/adjacent intervals within a single set
gis = GenomicIntervalSet("fragmented_peaks.bed")
print(f"Before merge: {len(gis)} intervals")

merged = gis.merge()
print(f"After merge:  {len(merged)} intervals")

# Complement: genome-wide intervals NOT covered by peaks
# Requires chromosome sizes
chrom_sizes = {"chr1": 248956422, "chr2": 242193529, "chrX": 156040895}
complement = gis.complement(chrom_sizes)
print(f"Complement (uncovered) intervals: {len(complement)}")

Module 3: Tokenization

Convert genomic intervals to integer token IDs against a reference universe vocabulary.

from gtars import Tokenizer

# Initialize tokenizer with a universe BED file
tokenizer = Tokenizer("universe.bed")
print(f"Universe vocabulary size: {len(tokenizer)}")

# Tokenize a single BED file → list of token IDs
from gtars import GenomicIntervalSet
gis = GenomicIntervalSet("sample_peaks.bed")
tokens = tokenizer.tokenize(gis)
print(f"Token IDs: {tokens[:10]} ...")
print(f"Total tokens: {len(tokens)}")

from gtars import Tokenizer, GenomicIntervalSet

tokenizer = Tokenizer("universe.bed")

# Tokenize and convert to numpy array for ML
import numpy as np
gis = GenomicIntervalSet("sample_peaks.bed")
tokens = tokenizer.tokenize(gis)
token_array = np.array(tokens)
print(f"Token array shape: {token_array.shape}, dtype: {token_array.dtype}")

# Build a binary presence/absence vector over the full universe
vocab_size = len(tokenizer)
presence_vector = np.zeros(vocab_size, dtype=np.float32)
presence_vector[token_array] = 1.0
print(f"Presence vector shape: {presence_vector.shape}")
print(f"Fraction of universe covered: {presence_vector.mean():.4f}")

Module 4: Universe Building

Construct a consensus non-overlapping universe from a collection of BED files.

from gtars import UniverseBuilder

# Build universe from multiple BED files
bed_files = ["sample_1.bed", "sample_2.bed", "sample_3.bed", "sample_4.bed"]
builder = UniverseBuilder()
universe = builder.build(bed_files)

print(f"Universe regions: {len(universe)}")
universe.to_bed("consensus_universe.bed")
print("Saved consensus_universe.bed")

from gtars import UniverseBuilder

# Build with coverage threshold (region must appear in >= N% of samples)
bed_files = [f"chip_{i}.bed" for i in range(1, 21)]
builder = UniverseBuilder(fraction=0.25)   # present in >= 25% of samples
universe = builder.build(bed_files)
print(f"Universe (fraction>=0.25): {len(universe)} regions")

# Compare thresholds
for frac in [0.1, 0.25, 0.5]:
    b = UniverseBuilder(fraction=frac)
    u = b.build(bed_files)
    print(f"  fraction={frac}: {len(u)} regions")

Module 5: Interval Statistics and Utilities

Compute coverage, overlap counts, and basic statistics over BED files.

from gtars import GenomicIntervalSet

gis = GenomicIntervalSet("peaks.bed")

import numpy as np

# Basic statistics
widths = [iv.end - iv.start for iv in gis]
print(f"Interval count:       {len(gis)}")
print(f"Mean width (bp):      {np.mean(widths):.1f}")
print(f"Median width (bp):    {np.median(widths):.1f}")
print(f"Total coverage (bp):  {sum(widths):,}")

# Per-chromosome counts
from collections import Counter
chrom_counts = Counter(iv.chr for iv in gis)
for chrom, count in sorted(chrom_counts.items())[:5]:
    print(f"  {chrom}: {count} intervals")

from gtars import GenomicIntervalSet

# Overlap count matrix between two sets
peaks_a = GenomicIntervalSet("condition_A.bed")
peaks_b = GenomicIntervalSet("condition_B.bed")

# Count how many intervals in A overlap at least one interval in B
overlapping_a = peaks_a.intersect(peaks_b)
pct_overlap = len(overlapping_a) / len(peaks_a) * 100
print(f"A intervals overlapping B: {len(overlapping_a)}/{len(peaks_a)} ({pct_overlap:.1f}%)")

# Filter peaks by minimum width
min_width = 200
filtered = GenomicIntervalSet([iv for iv in peaks_a if (iv.end - iv.start) >= min_width])
print(f"Peaks >= {min_width} bp: {len(filtered)}/{len(peaks_a)}")
filtered.to_bed("filtered_peaks.bed")

Key Concepts

Tokenization and Universe Vocabulary

GTARS tokenization maps genomic regions to integer indices by finding the universe region that best overlaps each query interval. If a query interval does not overlap any universe region, it receives a special out-of-vocabulary (OOV) token. This vocabulary approach is analogous to NLP tokenization and enables treating genomic intervals as discrete tokens for transformer- and embedding-based models.

from gtars import Tokenizer, GenomicIntervalSet

tokenizer = Tokenizer("universe.bed")
# OOV token index is typically 0 or len(tokenizer)
print(f"OOV token ID: {tokenizer.unknown_token}")
gis = GenomicIntervalSet("test.bed")
tokens = tokenizer.tokenize(gis)
oov_count = sum(1 for t in tokens if t == tokenizer.unknown_token)
print(f"OOV rate: {oov_count}/{len(tokens)} ({oov_count/len(tokens)*100:.1f}%)")

Common Workflows

Workflow 1: BED File Preprocessing Pipeline

Goal: Load raw ATAC-seq peaks, filter by width, merge overlaps, and tokenize for ML.

from gtars import GenomicIntervalSet, GenomicInterval, Tokenizer
import numpy as np

# Step 1: Load peaks
peaks = GenomicIntervalSet("atac_peaks_raw.bed")
print(f"Raw peaks: {len(peaks)}")

# Step 2: Filter by minimum width (remove very short noise peaks)
min_width = 150
filtered_intervals = [iv for iv in peaks if (iv.end - iv.start) >= min_width]
peaks_filtered = GenomicIntervalSet(filtered_intervals)
print(f"After width filter (>={min_width} bp): {len(peaks_filtered)}")

# Step 3: Merge overlapping peaks
peaks_merged = peaks_filtered.merge()
print(f"After merge: {len(peaks_merged)}")
peaks_merged.to_bed("atac_peaks_processed.bed")

# Step 4: Tokenize against universe
tokenizer = Tokenizer("universe.bed")
tokens = tokenizer.tokenize(peaks_merged)
token_array = np.array(tokens)
print(f"Token array: {token_array.shape}, OOV rate: {(token_array == tokenizer.unknown_token).mean():.3f}")

# Step 5: Build presence vector for ML input
vocab_size = len(tokenizer)
presence = np.zeros(vocab_size, dtype=np.float32)
valid_tokens = token_array[token_array != tokenizer.unknown_token]
presence[valid_tokens] = 1.0
np.save("sample_presence_vector.npy", presence)
print(f"Saved presence vector: {presence.shape}, coverage={presence.mean():.4f}")

Workflow 2: Batch BED File Statistics

Goal: Compute overlap and coverage statistics across a collection of BED files.

from gtars import GenomicIntervalSet
from pathlib import Path
import pandas as pd
import numpy as np

bed_dir = Path("chip_peaks/")
bed_files = sorted(bed_dir.glob("*.bed"))
reference = GenomicIntervalSet("reference_regions.bed")

records = []
for bed_file in bed_files:
    gis = GenomicIntervalSet(str(bed_file))
    widths = [iv.end - iv.start for iv in gis]
    overlap = gis.intersect(reference)
    records.append({
        "sample":          bed_file.stem,
        "n_peaks":         len(gis),
        "mean_width_bp":   np.mean(widths),
        "total_coverage_bp": sum(widths),
        "overlap_with_ref": len(overlap),
        "pct_overlap":     len(overlap) / len(gis) * 100 if len(gis) > 0 else 0,
    })

df = pd.DataFrame(records)
print(df.to_string(index=False))
df.to_csv("bed_file_statistics.csv", index=False)
print(f"\nSaved bed_file_statistics.csv ({len(df)} samples)")

Workflow 3: Build Universe and Tokenize Corpus

Goal: From scratch — build a universe from a BED corpus and tokenize all files.

from gtars import UniverseBuilder, Tokenizer, GenomicIntervalSet
from pathlib import Path
import numpy as np

bed_dir = Path("atac_peaks/")
bed_files = [str(f) for f in sorted(bed_dir.glob("*.bed"))]
print(f"Building universe from {len(bed_files)} BED files...")

# Build universe
builder = UniverseBuilder(fraction=0.2)
universe = builder.build(bed_files)
universe.to_bed("corpus_universe.bed")
print(f"Universe: {len(universe)} regions → corpus_universe.bed")

# Tokenize all BED files
tokenizer = Tokenizer("corpus_universe.bed")
vocab_size = len(tokenizer)
print(f"Tokenizer vocab size: {vocab_size}")

presence_matrix = []
for f in bed_files:
    gis = GenomicIntervalSet(f)
    tokens = np.array(tokenizer.tokenize(gis))
    presence = np.zeros(vocab_size, dtype=np.float32)
    valid = tokens[tokens != tokenizer.unknown_token]
    presence[valid] = 1.0
    presence_matrix.append(presence)

X = np.stack(presence_matrix)
print(f"Presence matrix: {X.shape}  (samples × universe_regions)")
np.save("corpus_presence_matrix.npy", X)
print("Saved corpus_presence_matrix.npy")

Key Parameters

| Parameter | Module | Default | Range / Options | Effect |
| --- | --- | --- | --- | --- |
| fraction | UniverseBuilder | 0.5 | 0.0–1.0 | Minimum fraction of samples a region must appear in to enter the universe |
| merge_dist | merge() | 0 | ≥ 0 bp | Merge intervals within this distance; 0 merges only overlapping/touching intervals |
| min_overlap | intersect() | 1 bp | 1 to region width | Minimum overlap length required to count as an intersection |
| OOV token | Tokenizer | tokenizer.unknown_token | n/a | Token ID for regions not overlapping any universe region |
| Chromosome sizes | complement() | required | dict {chr: length} | Genome-wide sizes used to compute the uncovered complement intervals |
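
A minimal sketch exercising the tunable parameters above. It assumes merge_dist and min_overlap are accepted as keyword arguments exactly as named in the table; verify the names against your installed gtars version.

from gtars import GenomicIntervalSet

gis = GenomicIntervalSet("fragmented_peaks.bed")

# Merge intervals separated by up to 100 bp, not only overlapping/touching ones
# (merge_dist keyword assumed from the parameter table)
merged_loose = gis.merge(merge_dist=100)
print(f"merge(merge_dist=100): {len(merged_loose)} intervals")

other = GenomicIntervalSet("other_peaks.bed")
# Count an intersection only when at least 50 bp overlap
# (min_overlap keyword assumed from the parameter table)
strict = gis.intersect(other, min_overlap=50)
print(f"intersect(min_overlap=50): {len(strict)} intervals")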

Common Recipes

Recipe: Filter BED File by Chromosome List

When to use: Remove non-canonical chromosomes (patches, alternative contigs) before analysis.

from gtars import GenomicIntervalSet, GenomicInterval

gis = GenomicIntervalSet("raw_peaks.bed")
canonical_chroms = {f"chr{i}" for i in list(range(1, 23)) + ["X", "Y", "M"]}
filtered = GenomicIntervalSet([iv for iv in gis if iv.chr in canonical_chroms])
filtered.to_bed("canonical_peaks.bed")
print(f"Canonical peaks: {len(filtered)}/{len(gis)}")

Recipe: Compute Pairwise Jaccard Similarity Between BED Sets

When to use: Measure similarity between two peak sets without external tools.

from gtars import GenomicIntervalSet

def jaccard(a: GenomicIntervalSet, b: GenomicIntervalSet) -> float:
    # Interval-count Jaccard: |A intersect B| / |A union B| by interval count
    # (note: bedtools jaccard measures base-pair overlap instead)
    intersection = len(a.intersect(b))
    union = len(a.union(b))
    return intersection / union if union > 0 else 0.0

peaks_a = GenomicIntervalSet("sample_a.bed")
peaks_b = GenomicIntervalSet("sample_b.bed")
j = jaccard(peaks_a, peaks_b)
print(f"Jaccard similarity: {j:.4f}")

Expected Outputs

  • GenomicIntervalSet: iterable of GenomicInterval objects with .chr, .start, .end
  • BED files: written via .to_bed() as 3-column tab-separated (chr, start, end)
  • Token arrays: list[int] from tokenizer.tokenize(); convert with np.array(tokens)
  • Presence vectors: np.ndarray of shape (vocab_size,) built from token IDs
  • Universe BED: non-overlapping consensus regions written via universe.to_bed()
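
A quick sanity check of these output types, assuming peaks.bed and universe.bed exist on disk and the API behaves as in the examples above (unknown tokens are filtered before indexing, as in Workflow 1):

from gtars import GenomicIntervalSet, Tokenizer
import numpy as np

gis = GenomicIntervalSet("peaks.bed")
iv = gis[0]
print(type(iv).__name__, iv.chr, iv.start, iv.end)   # GenomicInterval fields

tokenizer = Tokenizer("universe.bed")
tokens = tokenizer.tokenize(gis)                     # list[int] token IDs
valid = np.array([t for t in tokens if t != tokenizer.unknown_token], dtype=int)

presence = np.zeros(len(tokenizer), dtype=np.float32)
presence[valid] = 1.0                                # (vocab_size,) presence vector
print(presence.shape, int(presence.sum()))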

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| ImportError: gtars | Package not installed | pip install gtars; check Python version ≥ 3.8 |
| All intervals report the OOV token | BED coordinate system mismatch | Verify both BED and universe use the same genome assembly and chr prefix convention |
| intersect returns an empty set | Non-overlapping chromosomes or shifted coordinates | Check that chromosome names match exactly (e.g., chr1 vs 1) |
| Very large universe (>5M regions) | Low fraction threshold with many short samples | Increase fraction to 0.3–0.5; pre-filter BED files to remove noise peaks |
| to_bed() produces unsorted output | Intervals added in arbitrary order | Sort first: sorted_gis = GenomicIntervalSet(sorted(gis, key=lambda r: (r.chr, r.start))) |
| Memory error on large BED collections | Holding all intervals in memory | Process files in batches (see the sketch below); use streaming/chunked loading if available |
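
For the memory issue in the last row, a simple batching pattern keeps only one file's intervals resident at a time. This running-total accumulator is a sketch, not a gtars API; the chip_peaks/ and reference_regions.bed paths are illustrative.

from gtars import GenomicIntervalSet
from pathlib import Path

reference = GenomicIntervalSet("reference_regions.bed")
total_peaks = 0
total_overlaps = 0

# Load, summarize, and release one BED file at a time
for bed_file in sorted(Path("chip_peaks/").glob("*.bed")):
    gis = GenomicIntervalSet(str(bed_file))
    total_peaks += len(gis)
    total_overlaps += len(gis.intersect(reference))
    del gis   # drop intervals before loading the next file

print(f"Peaks: {total_peaks}, overlapping reference: {total_overlaps}")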

Related Skills

  • geniml — uses GTARS-built universes and tokenizers for region2vec training and BEDSpace indexing
  • pysam-genomic-files — for BAM/SAM/VCF-level access with CIGAR and read-level detail
