Medical-research-skills gtars
A high-performance Rust toolkit (with Python bindings and a CLI) for genomic interval analysis; use it when you need fast overlap queries, coverage track generation, genomic tokenization for ML, reference sequence verification, or fragment processing.
install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/gtars" ~/.claude/skills/aipoch-medical-research-skills-gtars && rm -rf "$T"
manifest:
scientific-skills/Data Analysis/gtars/SKILL.md
source content
When to Use
- Overlap and set operations on genomic intervals (e.g., peak/promoter overlap, variant annotation, shared-feature detection).
- Coverage track generation from interval-like inputs (e.g., ATAC-seq/ChIP-seq/RNA-seq coverage for visualization in genome browsers).
- Machine-learning preprocessing where genomic regions must be converted into discrete tokens (e.g., Transformer-style models, geniml-style pipelines).
- Reference sequence management and verification (e.g., subsequence retrieval, digest calculation aligned with GA4GH refget concepts).
- Single-cell fragment workflows (e.g., splitting fragments by barcode/cluster, scoring fragments against reference region sets).
Key Features
- Rust performance with low overhead; designed for large genomic datasets.
- Python bindings for integration into analysis notebooks/pipelines.
- CLI tooling for batch processing and shell workflows.
- Fast overlap detection via IGD-style indexing and interval operations.
- Coverage track generation (WIG/BigWig workflows via the uniwig functionality).
- Genomic tokenizers for ML-ready representations of genomic regions.
- Reference sequence utilities (FASTA-backed stores, subsequence retrieval, digesting).
- Fragment processing and scoring for common single-cell genomics tasks.
Additional module-specific guidance may be available in:
references/overlap.md, references/coverage.md, references/tokenizers.md, references/refget.md, references/python-api.md, and references/cli.md.
Dependencies
- Python package: gtars (version not specified in the source document)
- Rust toolchain (for CLI install): cargo (version not specified)
- Rust crate: gtars = "0.1" (as shown in the example)
Example Usage
Python: overlap analysis workflow (runnable)
```python
import gtars

# Load two region sets
peaks = gtars.RegionSet.from_bed("chip_peaks.bed")
promoters = gtars.RegionSet.from_bed("promoters.bed")

# Find overlaps (peaks that overlap promoters)
overlapping_peaks = peaks.filter_overlapping(promoters)

# Export results
overlapping_peaks.to_bed("peaks_in_promoters.bed")
```
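Repeated overlap queries like the one above are typically served by a prebuilt index. The following toy pure-Python sketch illustrates the bin-based build-once/query-many pattern behind IGD-style indexing; the class and bin width are illustrative assumptions, not the gtars API:

```python
from collections import defaultdict

BIN = 1000  # fixed genome bin width; IGD-like indexes shard intervals into bins


class ToyIntervalIndex:
    """Toy bin-based interval index: build once, query many times."""

    def __init__(self, intervals):
        # intervals: iterable of (chrom, start, end), half-open coordinates
        self.bins = defaultdict(list)
        for chrom, start, end in intervals:
            # register the interval in every bin it spans
            for b in range(start // BIN, (end - 1) // BIN + 1):
                self.bins[(chrom, b)].append((start, end))

    def query(self, chrom, qstart, qend):
        hits = set()
        # only look at bins the query spans, then confirm true overlap
        for b in range(qstart // BIN, (qend - 1) // BIN + 1):
            for start, end in self.bins.get((chrom, b), []):
                if start < qend and qstart < end:
                    hits.add((chrom, start, end))
        return sorted(hits)


idx = ToyIntervalIndex([("chr1", 100, 500), ("chr1", 2000, 2500), ("chr2", 0, 50)])
print(idx.query("chr1", 400, 2100))
# -> [('chr1', 100, 500), ('chr1', 2000, 2500)]
```

The point of the pattern is that candidate intervals come only from the bins the query touches, so each query avoids scanning the whole interval set.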
CLI: generate coverage tracks (runnable)
```bash
# Generate WIG coverage at a given resolution
gtars uniwig generate --input atac_fragments.bed --output coverage.wig --resolution 10

# Generate BigWig coverage for genome browser visualization
gtars uniwig generate --input atac_fragments.bed --output coverage.bw --format bigwig
```
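Conceptually, fixed-resolution coverage is a binned count of overlapping intervals. A minimal pure-Python sketch of that idea (an illustration of the `--resolution` knob, not the gtars implementation):

```python
def binned_coverage(intervals, chrom_len, resolution):
    """intervals: list of (start, end) half-open coordinates on one chromosome.
    Returns the number of intervals overlapping each fixed-width bin."""
    nbins = (chrom_len + resolution - 1) // resolution
    bins = [0] * nbins
    for start, end in intervals:
        first = start // resolution
        last = (end - 1) // resolution  # last bin the interval touches
        for b in range(first, min(last, nbins - 1) + 1):
            bins[b] += 1
    return bins


cov = binned_coverage([(0, 25), (10, 30), (95, 100)], chrom_len=100, resolution=10)
print(cov)  # -> [1, 2, 2, 0, 0, 0, 0, 0, 0, 1]
```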
Python: ML tokenization (runnable)
```python
import gtars
from gtars.tokenizers import TreeTokenizer

# Load regions and build a tokenizer from BED
regions = gtars.RegionSet.from_bed("training_peaks.bed")
tokenizer = TreeTokenizer.from_bed_file("training_peaks.bed")

# Tokenize each region into a discrete representation
tokens = [tokenizer.tokenize(r.chromosome, r.start, r.end) for r in regions]
print(tokens[:5])
```
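To make the idea of universe-based tokenization concrete, here is a toy sketch: each region in a fixed "universe" (the BED-derived vocabulary) gets an integer token ID, and an input region maps to the ID of the first universe region it overlaps. The scheme and names are illustrative assumptions, not the gtars tokenizer:

```python
# Hypothetical universe of vocabulary regions (chrom, start, end), half-open
universe = [("chr1", 0, 1000), ("chr1", 1000, 2000), ("chr2", 0, 500)]
vocab = {region: idx for idx, region in enumerate(universe)}
UNK = len(universe)  # regions outside the universe map to an "unknown" token


def tokenize(chrom, start, end):
    """Map a query region to the token ID of the first overlapping universe region."""
    for (uc, us, ue), idx in vocab.items():
        if uc == chrom and us < end and start < ue:
            return idx
    return UNK


print(tokenize("chr1", 1500, 1600))  # -> 1 (overlaps the second universe region)
print(tokenize("chr3", 0, 100))      # -> 3 (no overlap, UNK)
```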
Implementation Details
- Interval overlap & indexing: Overlap queries are designed around an IGD-like index to accelerate repeated interval queries (build once, query many). Typical parameters are chromosome, start, end; results are overlapping intervals or derived set operations.
- Coverage generation (uniwig): Produces coverage tracks from interval/fragment input. Common knobs include output format (e.g., WIG vs BigWig) and resolution/binning for track granularity.
- Tokenization: Tokenizers (e.g., TreeTokenizer) map genomic coordinates to discrete tokens suitable for ML pipelines. Token vocabularies are commonly derived from a BED-defined training region universe.
- Reference sequence store: FASTA-backed reference access supports subsequence retrieval and digesting/verification workflows aligned with refget-style usage.
- Fragment workflows: Fragment splitting and scoring operate on fragment-like inputs (often TSV/BED-style) and can be used for barcode/cluster partitioning and enrichment-style scoring against reference region sets.
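The digest step in refget-style workflows can be illustrated with a trunc512-style digest (hex of the first 24 bytes of SHA-512 over the normalized sequence). This standalone sketch uses only the Python standard library and is an assumption about the scheme, not the gtars API:

```python
import hashlib


def trunc512_digest(sequence: str) -> str:
    """GA4GH refget-style digest: hex of the first 24 bytes of SHA-512
    over the normalized (uppercased, whitespace-free) sequence."""
    normalized = "".join(sequence.split()).upper().encode("ascii")
    return hashlib.sha512(normalized).digest()[:24].hex()


# Normalization makes the digest case- and whitespace-insensitive,
# so the same sequence always verifies to the same identifier.
print(trunc512_digest("ACGT"))
```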