BioClaw bio-dataset-search

bio-dataset-search

install
source · Clone the upstream repo
git clone https://github.com/Runchuan-BU/BioClaw
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Runchuan-BU/BioClaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/container/skills/bio-dataset-search" ~/.claude/skills/runchuan-bu-bioclaw-bio-dataset-search && rm -rf "$T"
manifest: container/skills/bio-dataset-search/SKILL.md
source content

bio-dataset-search

Step 3: Dataset search and task matching

Find suitable datasets for each task and map them to the task system defined in Step 2 of the manuscript pipeline.

Purpose

  1. Extract datasets from related papers when possible
  2. Search public repositories directly when needed
  3. Normalize dataset metadata into a common structure
  4. Match datasets to tasks in a defensible way

Input Format

topic: [research topic]
task_system: [task system from Step 2]
paper_count: [number of related papers]
existing_papers: [optional list of related papers]
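
A hypothetical filled-in input, mirroring the Usage example at the end of this page (values are illustrative only):

topic: spatial multi-omics integration
task_system: [task system from Step 2]
paper_count: 5
existing_papers: [the 5 related papers, e.g. titles or DOIs]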

Workflow

Step 3.1: Extract datasets from existing work

If paper_count >= 5, start from the strongest existing papers.

Read Methods / Data Availability sections and extract:

  • dataset name
  • data source
  • platform
  • modality
  • sample scale
  • download path
  • annotation availability
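
A minimal extraction sketch in Python, assuming the paper's Methods or Data Availability text is already available as a plain string (the helper and patterns below are illustrative, not part of the skill):

import re

# Accession patterns that commonly appear in Data Availability
# statements: GSE = GEO series, GSM = GEO sample,
# E-MTAB / E-GEOD = ArrayExpress experiments.
ACCESSION_PATTERNS = {
    "GEO series": re.compile(r"\bGSE\d{3,8}\b"),
    "GEO sample": re.compile(r"\bGSM\d{3,8}\b"),
    "ArrayExpress": re.compile(r"\bE-(?:MTAB|GEOD)-\d+\b"),
}

def extract_accessions(paper_text):
    """Return repository accessions mentioned in a paper's text."""
    return {src: sorted(set(p.findall(paper_text)))
            for src, p in ACCESSION_PATTERNS.items()}

text = "Raw counts are available from GEO under accession GSE123456."
print(extract_accessions(text))
# {'ArrayExpress': [], 'GEO sample': [], 'GEO series': ['GSE123456']}

The remaining fields (platform, modality, sample scale) usually have to be read out of the surrounding prose rather than matched mechanically.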

Step 3.2: Search datasets directly

If there is not enough prior work, search repositories such as:

  • GEO
  • ArrayExpress
  • project-specific public portals

Use keyword sets built from:

  • topic
  • modality
  • tissue / disease
  • benchmark intent
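
A direct-search sketch against the NCBI E-utilities esearch endpoint, which serves GEO DataSets queries (db=gds); the keyword values are illustrative and should be refined against real hits:

import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def search_geo(topic, modality, tissue, retmax=20):
    """Search GEO DataSets with a keyword set built from the
    topic, modality, and tissue / disease terms."""
    term = f"({topic}) AND ({modality}) AND ({tissue})"
    params = {"db": "gds", "term": term,
              "retmax": retmax, "retmode": "json"}
    resp = requests.get(ESEARCH, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

ids = search_geo("spatial transcriptomics", "sequencing", "brain")
print(f"{len(ids)} candidate GEO entries")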

Step 3.3: Normalize dataset metadata

For each dataset, record:

  • source
  • platform
  • species
  • tissue / disease
  • sample size
  • feature count
  • modalities
  • annotation quality
  • histology / region metadata
  • format
  • preprocessing needs
  • recommended task fit
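
One way to hold these fields in a uniform record, sketched as a Python dataclass (the class and field names are illustrative, not a contract):

from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """Normalized metadata for one candidate dataset."""
    name: str
    source: str                 # e.g. GEO, ArrayExpress, project portal
    platform: str
    species: str
    tissue_or_disease: str
    sample_size: int
    feature_count: int
    modalities: list[str] = field(default_factory=list)
    annotation_quality: str = "unknown"   # expert-curated / automated / none
    histology_metadata: bool = False
    file_format: str = ""
    preprocessing_needs: list[str] = field(default_factory=list)
    recommended_tasks: list[str] = field(default_factory=list)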

Step 3.4: Match datasets to tasks

A good match should satisfy:

  1. Every major task has at least one viable dataset
  2. Dataset structure matches the task's technical assumptions
  3. Download remains feasible (manageable size and access restrictions)
  4. Metadata quality is sufficient for evaluation
  5. Each important task has at least one backup dataset where feasible
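
A small validation sketch for criteria 1 and 5, assuming the mapping is a dict from task name to candidate dataset names (all names here are placeholders):

def check_coverage(mapping, important_tasks):
    """Flag tasks that violate the coverage criteria above."""
    problems = []
    for task, datasets in mapping.items():
        if not datasets:                 # criterion 1: no viable dataset
            problems.append(f"{task}: no viable dataset")
        elif task in important_tasks and len(datasets) < 2:
            problems.append(f"{task}: no backup dataset")   # criterion 5
    return problems

mapping = {"cell typing": ["GSE111", "GSE222"],
           "domain segmentation": ["GSE333"]}
print(check_coverage(mapping, {"cell typing", "domain segmentation"}))
# ['domain segmentation: no backup dataset']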

Output Format

# Dataset Catalog

## Data Sources
- Extracted from related papers:
- Direct repository search:
- Borrowed from adjacent domains:

## Dataset Entries

### Dataset 1: [name]
- Source:
- Platform:
- Species:
- Tissue / disease:
- Modalities:
- Sample scale:
- Annotation quality:
- Download URL:
- Format:
- Recommended tasks:
- Why it fits:

## Dataset-Task Mapping
| Task | Recommended dataset | Why it fits | Notes |
|------|---------------------|-------------|-------|
| ... | ... | ... | ... |

## Acquisition Notes
- GEO download hints
- Public portal download hints

## Preprocessing Recommendations
| Dataset | Preprocessing needs | Suggested skill / tool |
|---------|---------------------|------------------------|
| ... | ... | ... |

## Next Step
- Build the metric system in Step 4
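
One concrete GEO download hint for the Acquisition Notes above: NCBI groups GEO series under an FTP path in which the last three digits of the accession are masked with nnn. A sketch of building the supplementary-files URL (the accession is a placeholder):

def geo_suppl_url(accession):
    """Build the NCBI URL for a GEO series' supplementary files.
    Series directories mask the accession's last three digits
    with 'nnn', e.g. GSE123456 lives under GSE123nnn."""
    stub = accession[:-3] + "nnn"
    return f"https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{accession}/suppl/"

print(geo_suppl_url("GSE123456"))
# https://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/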

Usage

/bio-dataset-search "spatial multi-omics integration | paper_count: 5 | task_system: [task system from Step 2]"

Notes

  1. Prefer datasets already used in related work when possible.
  2. Verify links before committing them to the benchmark plan.
  3. Capture QC and annotation metadata whenever available.
  4. Match datasets to tasks based on actual experimental needs, not just popularity.