Awesome-Agent-Skills-for-Empirical-Research dataset-finder-guide

Search and download research datasets from Kaggle, HuggingFace, and repos

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/tools/scraping/dataset-finder-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-dataset-finder-gu && rm -rf "$T"
manifest: skills/43-wentorai-research-plugins/skills/tools/scraping/dataset-finder-guide/SKILL.md
source content

Dataset Finder Guide

Search, evaluate, and download research datasets from major repositories including Kaggle, Hugging Face, Google Dataset Search, Zenodo, UCI Machine Learning Repository, and domain-specific archives. This skill helps researchers locate the right data for their experiments efficiently.

Overview

Finding suitable datasets is often one of the most time-consuming phases of empirical research. Datasets are scattered across dozens of platforms, each with different APIs, licensing terms, download mechanisms, and metadata standards. A single research project might require datasets from Kaggle for benchmarking, Hugging Face for NLP tasks, Zenodo for supplementary materials from published papers, and government open data portals for demographic or economic variables.

This skill provides a unified approach to dataset discovery: formulating search queries, evaluating dataset quality and suitability, understanding licensing implications, and efficiently downloading and organizing data. It covers both general-purpose repositories and domain-specific archives that researchers in various fields need.

The emphasis is on reproducibility -- every dataset used in research should be citable, versioned, and documented. This skill includes patterns for recording dataset provenance, creating data cards, and managing dataset versions across experiments.

Dataset Repositories

General-Purpose Repositories

| Repository | Strengths | API | Citation Support |
|---|---|---|---|
| Kaggle | ML benchmarks, competitions, community kernels | REST + CLI | DOI via dataset cards |
| Hugging Face Datasets | NLP, CV, audio; streaming support | Python library | Built-in citation |
| Zenodo | Any research data, DOI minting, EU-funded | REST API | Automatic DOI |
| Google Dataset Search | Meta-search across repositories | Web only | Links to source |
| UCI ML Repository | Classic ML benchmarks | Direct download | BibTeX provided |
| Figshare | Figures, datasets, media, preprints | REST API | DOI per item |
| Dryad | Ecology, biology, environmental science | REST API | DOI per dataset |
| ICPSR | Social science survey data | Restricted API | Persistent IDs |
| Harvard Dataverse | Multi-discipline, institutional | REST API | DOI per dataset |

Domain-Specific Archives

| Domain | Repository | Notable Datasets |
|---|---|---|
| Genomics | NCBI GEO, ENA | Gene expression, sequencing data |
| Astronomy | NASA archives, SDSS | Sky surveys, spectral data |
| Economics | FRED, World Bank, IMF | Time series, macro indicators |
| Climate | NOAA, CMIP6 | Temperature, precipitation records |
| Linguistics | LDC, CLARIN | Corpora, treebanks |
| Medical | PhysioNet, MIMIC | Clinical records, ECG/EEG |
| Chemistry | PubChem, ChEMBL | Molecular structures, bioassays |
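Many of these archives expose plain REST endpoints, so a thin URL builder is often all the client code a project needs. As one concrete sketch, the World Bank API serves indicator time series as JSON without an API key; the country and indicator codes below are only examples:

```python
def worldbank_indicator_url(country: str, indicator: str, per_page: int = 100) -> str:
    """Build a World Bank API v2 URL for one indicator time series (JSON, no key needed)."""
    return (
        f"https://api.worldbank.org/v2/country/{country}"
        f"/indicator/{indicator}?format=json&per_page={per_page}"
    )

# Example: US nominal GDP. Fetch with any HTTP client, e.g.
#   requests.get(url, timeout=30).json()
# The second element of the returned array holds the observations.
url = worldbank_indicator_url("US", "NY.GDP.MKTP.CD")
print(url)
```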

Searching for Datasets

Kaggle CLI

# Install and configure
pip install kaggle
# Place kaggle.json in ~/.kaggle/

# Search datasets
kaggle datasets list -s "sentiment analysis" --sort-by votes
kaggle datasets list -s "medical imaging" --file-type csv --min-size 104857600  # min size is in bytes (100 MB)

# Get dataset details
kaggle datasets metadata stanford/imdb-review-dataset

# Download dataset
kaggle datasets download -d stanford/imdb-review-dataset -p ./data/
unzip ./data/imdb-review-dataset.zip -d ./data/imdb/

# Download competition data
kaggle competitions download -c titanic -p ./data/

Hugging Face Datasets

from datasets import load_dataset

# Search for datasets by task
from huggingface_hub import HfApi
api = HfApi()
datasets = api.list_datasets(
    search="scientific papers",
    sort="downloads",
    direction=-1,
    limit=20
)
for ds in datasets:
    print(f"{ds.id}: {ds.downloads} downloads")

# Load a dataset (streaming avoids downloading everything up front)
dataset = load_dataset("scientific_papers", "arxiv", streaming=True)

# Streaming splits are IterableDataset objects: they expose features
# but not a row count
print(dataset["train"].features)

# Row counts come from the builder metadata, without downloading data
from datasets import load_dataset_builder
builder = load_dataset_builder("scientific_papers", "arxiv")
print(f"Number of examples: {builder.info.splits['train'].num_examples}")

# Load specific split and subset
validation = load_dataset(
    "scientific_papers", "arxiv",
    split="validation[:1000]"
)

Google Dataset Search (Programmatic)

from urllib.parse import quote_plus

def search_google_datasets(query):
    """Build a Google Dataset Search URL for manual inspection.

    Google Dataset Search has no official API, so use the web
    interface or query the source repositories directly.
    """
    url = ("https://datasetsearch.research.google.com/search"
           f"?query={quote_plus(query)}")
    print(f"Search at: {url}")
    return url
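Because Google Dataset Search indexes schema.org `Dataset` markup, one workable alternative is to parse that markup directly from a dataset's landing page. A minimal stdlib sketch (the sample HTML is illustrative; real pages may embed lists or `@graph` structures that need extra handling):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect the text of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buf))
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

def extract_datasets(html: str) -> list:
    """Return schema.org records with @type == "Dataset" from a page."""
    parser = JSONLDExtractor()
    parser.feed(html)
    records = []
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        records.extend(d for d in items
                       if isinstance(d, dict) and d.get("@type") == "Dataset")
    return records

sample = (
    '<script type="application/ld+json">'
    '{"@type": "Dataset", "name": "City Air Quality 2023"}'
    "</script>"
)
print(extract_datasets(sample)[0]["name"])
```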

Zenodo API

import requests

def search_zenodo(query, resource_type="dataset", size=10):
    """Search Zenodo for research datasets."""
    url = "https://zenodo.org/api/records"
    params = {
        "q": query,
        "type": resource_type,
        "size": size,
        "sort": "mostrecent",
        "access_right": "open"
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    results = response.json()

    for hit in results.get("hits", {}).get("hits", []):
        meta = hit["metadata"]
        print(f"Title: {meta['title']}")
        print(f"DOI: {meta.get('doi', 'N/A')}")
        print(f"License: {meta.get('license', {}).get('id', 'N/A')}")
        print(f"Size: {sum(f['size'] for f in hit.get('files', []))/1e6:.1f} MB")
        print("---")

    return results
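Once a record is found, its open files can be pulled from the `files` array of the same API response. A sketch that works on the record dicts returned above (the field names follow Zenodo's current response shape; verify against a live response):

```python
def zenodo_file_links(record: dict) -> list:
    """Return (filename, download_url) pairs for one Zenodo record dict."""
    return [(f["key"], f["links"]["self"]) for f in record.get("files", [])]

# Usage with search_zenodo() results:
#   results = search_zenodo("river discharge")
#   for hit in results["hits"]["hits"]:
#       for name, url in zenodo_file_links(hit):
#           print(name, url)
```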

Dataset Evaluation Checklist

Before using a dataset in research, verify the following:

Quality Assessment

  • Completeness: What percentage of values are missing? Are missing values random or systematic?
  • Accuracy: Are values within expected ranges? Are there obvious errors?
  • Consistency: Are formats uniform (dates, categories, units)?
  • Timeliness: When was the data collected? Is it current enough for your research question?
  • Sample size: Is the dataset large enough for your intended analysis?
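The completeness and accuracy checks above can be automated before any manual inspection. A minimal, dependency-free sketch over rows loaded as dicts (e.g. via `csv.DictReader`); the column names and valid ranges are placeholders for your own data:

```python
import csv
import io

def missing_report(rows, columns):
    """Fraction of missing (None or empty-string) values per column."""
    total = len(rows)
    return {
        col: sum(1 for r in rows if r.get(col) in (None, "")) / total
        for col in columns
    }

def range_violations(rows, column, lo, hi):
    """Rows whose numeric value in `column` falls outside [lo, hi]."""
    bad = []
    for r in rows:
        try:
            value = float(r[column])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric values are reported separately
        if not lo <= value <= hi:
            bad.append(r)
    return bad

sample_csv = "age,income\n34,52000\n,41000\n210,39000\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(missing_report(rows, ["age", "income"]))
print(len(range_violations(rows, "age", 0, 120)))  # flags the age-210 row
```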

Licensing and Ethics

| License | Commercial Use | Modification | Attribution Required |
|---|---|---|---|
| CC0 | Yes | Yes | No |
| CC-BY 4.0 | Yes | Yes | Yes |
| CC-BY-SA 4.0 | Yes | Yes (share-alike) | Yes |
| CC-BY-NC 4.0 | No | Yes | Yes |
| ODC-ODbL | Yes | Yes (share-alike) | Yes |
| Custom/Restricted | Varies | Varies | Varies |

Reproducibility Documentation

## Data Card

**Dataset**: [Name]
**Source**: [URL]
**Version**: [Version/Date]
**DOI**: [DOI if available]
**License**: [License name]
**Downloaded**: [YYYY-MM-DD]
**Size**: [X rows, Y columns, Z MB]
**Description**: [Brief description]
**Preprocessing**: [Steps applied before use]
**Citation**: [BibTeX entry]
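The template fills in mechanically once the fields are captured as a dict, so the card can be generated by the download script itself. A hypothetical helper (the field set mirrors the template above; missing fields fall back to "N/A"):

```python
def render_data_card(fields: dict) -> str:
    """Render the data card template as markdown from a field dict."""
    keys = ["Dataset", "Source", "Version", "DOI", "License",
            "Downloaded", "Size", "Description", "Preprocessing", "Citation"]
    lines = ["## Data Card", ""]
    lines += [f"**{k}**: {fields.get(k, 'N/A')}" for k in keys]
    return "\n".join(lines)

card = render_data_card({
    "Dataset": "Example Corpus",
    "Source": "https://example.org/data",
    "License": "CC-BY 4.0",
    "Downloaded": "2024-01-15",
})
# Write it next to the raw data so provenance travels with the files:
#   Path("data/raw/README.md").write_text(card)
```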

Download and Organization

Project Data Structure

project/
  data/
    raw/              # Original downloaded data (never modify)
      dataset_v1.csv
      README.md       # Data card with provenance
    processed/        # Cleaned and transformed data
      train.csv
      test.csv
    external/         # Third-party reference data
  scripts/
    download_data.py  # Reproducible download script
    preprocess.py     # Data cleaning pipeline

Reproducible Download Script

"""download_data.py - Reproducible dataset download."""
import hashlib
from pathlib import Path
import requests

DATASETS = {
    "main_dataset": {
        "url": "https://zenodo.org/record/12345/files/data.csv",
        "sha256": "abc123...",
        "filename": "raw/main_dataset.csv"
    }
}

DATA_DIR = Path("data")

for name, info in DATASETS.items():
    path = DATA_DIR / info["filename"]
    if path.exists():
        print(f"Already downloaded: {name}")
        continue

    path.parent.mkdir(parents=True, exist_ok=True)
    print(f"Downloading {name}...")
    response = requests.get(info["url"], timeout=60)
    response.raise_for_status()

    # Verify integrity before writing, so a failed check never
    # leaves a corrupt file in data/raw/
    sha256 = hashlib.sha256(response.content).hexdigest()
    assert sha256 == info["sha256"], f"Checksum mismatch for {name}"

    path.write_bytes(response.content)
    print(f"Verified: {name}")
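For multi-gigabyte files, holding the whole response in memory is impractical; stream the download in chunks and hash incrementally instead. A sketch of the same verify-then-keep pattern (`download_streamed` and `hash_chunks` are hypothetical names, not part of the script above):

```python
import hashlib
from pathlib import Path
import requests

def hash_chunks(chunks) -> str:
    """SHA-256 over an iterable of byte chunks."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

def download_streamed(url: str, path: Path, expected_sha256: str,
                      chunk_size: int = 1 << 20) -> None:
    """Stream url to a temp file, verify the checksum, then rename."""
    tmp = path.with_suffix(path.suffix + ".part")
    h = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(tmp, "wb") as f:
            for chunk in resp.iter_content(chunk_size):
                h.update(chunk)
                f.write(chunk)
    if h.hexdigest() != expected_sha256:
        tmp.unlink()  # never leave a corrupt file behind
        raise ValueError(f"Checksum mismatch for {url}")
    tmp.rename(path)
```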
