Awesome-Agent-Skills-for-Empirical-Research · genotex-benchmark-guide
Benchmark for LLM agents on gene expression data analysis
Install
Source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/biomedical/genotex-benchmark-guide" \
       ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-genotex-benchmark \
  && rm -rf "$T"
manifest: skills/43-wentorai-research-plugins/skills/domains/biomedical/genotex-benchmark-guide/SKILL.md
GenoTEX Benchmark Guide
Overview
GenoTEX is a benchmark for evaluating LLM-based agents on gene expression data analysis tasks. It provides curated datasets from GEO (Gene Expression Omnibus) with ground-truth analysis pipelines, testing agents on data preprocessing, differential expression, enrichment analysis, and biological interpretation. GenoTEX was published at MLCB 2025 as an oral presentation.
Benchmark Structure
GenoTEX Benchmark
├── Data Collection
│   └── Curated GEO datasets with ground truth
├── Task Categories
│   ├── Data preprocessing (QC, normalization)
│   ├── Differential expression analysis
│   ├── Gene set enrichment analysis
│   ├── Clustering and classification
│   └── Biological interpretation
├── Evaluation
│   ├── Code correctness (executes without error)
│   ├── Statistical validity (appropriate tests)
│   ├── Result accuracy (vs ground truth)
│   └── Interpretation quality (biological insight)
└── Baselines
    ├── GPT-4 agent
    ├── Claude agent
    └── Domain-specific fine-tuned models
Usage
from genotex import GenoTEXBenchmark

bench = GenoTEXBenchmark()

# List available tasks
tasks = bench.list_tasks()
for task in tasks[:5]:
    print(f"Task: {task.id}")
    print(f"  Dataset: {task.geo_accession}")
    print(f"  Category: {task.category}")
    print(f"  Difficulty: {task.difficulty}")

# Get a specific task
task = bench.get_task("GSE12345_DEG")
print(f"Description: {task.description}")
print(f"Input files: {task.input_files}")
print(f"Expected output: {task.expected_output_type}")
Running Evaluations
# Evaluate an agent on GenoTEX
from genotex import evaluate_agent

results = evaluate_agent(
    agent_fn=my_agent_function,
    tasks="all",           # or specific task IDs
    timeout_per_task=300,  # seconds
)

print(f"Tasks completed: {results.completed}/{results.total}")
print(f"Code correctness: {results.code_correct_rate:.1%}")
print(f"Statistical validity: {results.stats_valid_rate:.1%}")
print(f"Result accuracy: {results.accuracy:.3f}")
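The snippet above assumes a user-supplied my_agent_function, which this excerpt does not define, and GenoTEX's exact callable contract is not documented here. A minimal sketch, assuming the harness passes a task object (with the description and input_files attributes seen in the Usage section) and expects a dict carrying the generated analysis code; the run_llm helper is hypothetical:

# Sketch of an agent callable for evaluate_agent. The (task) -> dict
# signature and the run_llm helper are assumptions for illustration,
# not the documented GenoTEX interface.
def run_llm(prompt: str) -> str:
    # Placeholder: call your LLM or agent framework of choice here.
    return "# generated analysis code"

def my_agent_function(task):
    prompt = (
        "You are a bioinformatics agent.\n"
        f"Task: {task.description}\n"
        f"Input files: {task.input_files}"
    )
    generated_code = run_llm(prompt)
    return {"code": generated_code}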
Task Examples
# Example: Differential Expression Analysis
task = {
    "id": "GSE12345_DEG",
    "description": "Identify differentially expressed genes "
                   "between treatment and control groups in "
                   "this RNA-seq dataset.",
    "input": "GSE12345_counts.csv",       # Raw count matrix
    "metadata": "GSE12345_metadata.csv",  # Sample info
    "expected": {
        "method": "DESeq2 or limma-voom",
        "output": "DEG table with log2FC, p-value, adj.p",
        "ground_truth": "GSE12345_deg_truth.csv",
    },
}

# Example: Gene Set Enrichment
task = {
    "id": "GSE12345_GSEA",
    "description": "Perform gene set enrichment analysis on "
                   "the DEGs and identify enriched pathways.",
    "input": "GSE12345_deg_results.csv",
    "expected": {
        "method": "fgsea, clusterProfiler, or enrichR",
        "output": "Enriched pathways with NES and FDR",
    },
}
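The DEG task expects DESeq2 or limma-voom. As a rough illustration of producing the expected output columns (log2FC, p-value, adj.p), here is a deliberately simplified Python sketch using a t-test on log2-CPM values with Benjamini-Hochberg correction; it is not a substitute for the negative-binomial models the task names, and the "group" column in the metadata file is an assumption:

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Simplified DEG sketch: t-test on log2-CPM, BH-adjusted. The real task
# expects DESeq2 or limma-voom; the "group" column name is an assumption.
counts = pd.read_csv("GSE12345_counts.csv", index_col=0)   # genes x samples
meta = pd.read_csv("GSE12345_metadata.csv", index_col=0)   # samples x fields

# Library-size normalization to counts-per-million, then log2
cpm = counts / counts.sum(axis=0) * 1e6
logcpm = np.log2(cpm + 1)

treat = logcpm.loc[:, meta["group"] == "treatment"]
ctrl = logcpm.loc[:, meta["group"] == "control"]

log2fc = treat.mean(axis=1) - ctrl.mean(axis=1)
_, pvals = ttest_ind(treat, ctrl, axis=1)
pvals = np.nan_to_num(pvals, nan=1.0)  # zero-variance genes get p = 1
_, adj_p, _, _ = multipletests(pvals, method="fdr_bh")

deg = pd.DataFrame({"log2FC": log2fc, "p.value": pvals, "adj.p": adj_p})
deg.sort_values("adj.p").to_csv("GSE12345_deg_results.csv")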
Use Cases
- Agent evaluation: Test bioinformatics agents on real tasks
- Method comparison: Compare LLM agents on genomics tasks
- Benchmark development: Extend with new GEO datasets (see the sketch after this list)
- Teaching: Standard tasks for bioinformatics education
- Tool development: Test new analysis pipelines
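For the benchmark-development use case above, new tasks start from a freshly downloaded GEO series. A minimal sketch using the GEOparse package; the accession is a placeholder, and since this excerpt does not cover how GenoTEX registers new tasks, only the download and metadata-export step is shown:

import GEOparse

# Fetch a GEO series; the accession below is a placeholder.
gse = GEOparse.get_GEO(geo="GSE12345", destdir="./geo_cache")

# Sample-level metadata, useful for building the task's metadata CSV
pheno = gse.phenotype_data
pheno.to_csv("GSE12345_metadata.csv")

print(f"Samples: {len(gse.gsms)}, platforms: {list(gse.gpls)}")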