hugging-face-dataset-viewer

Use this skill for Hugging Face Dataset Viewer API workflows that fetch subset/split metadata, paginate rows, search text, apply filters, list parquet file URLs, and read size or statistics.

Install

Source · Clone the upstream repo:

git clone https://github.com/tayyabexe/skills

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/tayyabexe/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/hugging-face-dataset-viewer" ~/.claude/skills/tayyabexe-skills-hugging-face-dataset-viewer && rm -rf "$T"

Manifest: skills/hugging-face-dataset-viewer/SKILL.md

Source content

Hugging Face Dataset Viewer

Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.

Core workflow

  1. Optionally validate dataset availability with /is-valid.
  2. Resolve config + split with /splits.
  3. Preview with /first-rows (steps 1-3 are sketched below).
  4. Paginate content with /rows using offset and length (max 100).
  5. Use /search for text matching and /filter for row predicates.
  6. Retrieve parquet links via /parquet and totals/metadata via /size and /statistics.
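
A minimal sketch of steps 1-3, assuming the stanfordnlp/imdb example dataset used in the pagination pattern below (its plain_text config and train split):

curl -s "https://datasets-server.huggingface.co/is-valid?dataset=stanfordnlp/imdb"
curl -s "https://datasets-server.huggingface.co/splits?dataset=stanfordnlp/imdb"
curl -s "https://datasets-server.huggingface.co/first-rows?dataset=stanfordnlp/imdb&config=plain_text&split=train"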

Defaults

  • Base URL: https://datasets-server.huggingface.co
  • Default API method: GET
  • Query params should be URL-encoded.
  • offset is 0-based.
  • length max is usually 100 for row-like endpoints.
  • Gated/private datasets require Authorization: Bearer <HF_TOKEN> (see the example below).
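
For a gated or private dataset, pass the token as a header (a sketch; <namespace/repo> is a placeholder and HF_TOKEN is assumed to be exported as in the upload flow later in this document):

curl -s -H "Authorization: Bearer $HF_TOKEN" "https://datasets-server.huggingface.co/is-valid?dataset=<namespace/repo>"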

Dataset Viewer

  • Validate dataset: /is-valid?dataset=<namespace/repo>
  • List subsets and splits: /splits?dataset=<namespace/repo>
  • Preview first rows: /first-rows?dataset=<namespace/repo>&config=<config>&split=<split>
  • Paginate rows: /rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>
  • Search text: /search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>
  • Filter with predicates: /filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>
  • List parquet shards: /parquet?dataset=<namespace/repo>
  • Get size totals: /size?dataset=<namespace/repo>
  • Get column statistics: /statistics?dataset=<namespace/repo>&config=<config>&split=<split> (size and statistics calls are sketched below)
  • Get Croissant metadata (if available): /croissant?dataset=<namespace/repo>
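
A sketch of the size and statistics calls for the same example dataset; the jq paths .size.dataset and .statistics[].column_name are assumptions about the response shape and should be checked against an actual response:

curl -s "https://datasets-server.huggingface.co/size?dataset=stanfordnlp/imdb" | jq '.size.dataset'
curl -s "https://datasets-server.huggingface.co/statistics?dataset=stanfordnlp/imdb&config=plain_text&split=train" | jq -r '.statistics[].column_name'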

Pagination pattern:

curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"

When pagination is partial, use response fields such as num_rows_total, num_rows_per_page, and partial to drive continuation logic.
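
A minimal continuation sketch, assuming jq is available and that the response exposes the num_rows_total field and a rows array as described above:

BASE="https://datasets-server.huggingface.co"
QS="dataset=stanfordnlp/imdb&config=plain_text&split=train"
# read the total once, then walk the split 100 rows per request
TOTAL=$(curl -s "$BASE/rows?$QS&offset=0&length=1" | jq -r '.num_rows_total')
for ((OFFSET=0; OFFSET<TOTAL; OFFSET+=100)); do
  curl -s "$BASE/rows?$QS&offset=$OFFSET&length=100" | jq -c '.rows[]'
done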

Search/filter notes:

  • /search matches string columns (full-text style behavior is internal to the API).
  • /filter requires predicate syntax in where and optional sort in orderby.
  • Keep filtering and searches read-only and side-effect free; both calls are sketched below.
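
A hedged sketch of both calls against the example dataset. Using curl -G with --data-urlencode handles the URL-encoding noted under Defaults; the where predicate shown (an equality on a hypothetical label column) only illustrates the idea of a row predicate and is not a verified query:

curl -s -G "https://datasets-server.huggingface.co/search" \
  --data-urlencode "dataset=stanfordnlp/imdb" --data-urlencode "config=plain_text" \
  --data-urlencode "split=train" --data-urlencode "query=great movie" \
  --data-urlencode "offset=0" --data-urlencode "length=10"

curl -s -G "https://datasets-server.huggingface.co/filter" \
  --data-urlencode "dataset=stanfordnlp/imdb" --data-urlencode "config=plain_text" \
  --data-urlencode "split=train" --data-urlencode 'where="label"=0' \
  --data-urlencode "offset=0" --data-urlencode "length=10"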

Querying Datasets

Use npx parquetlens with Hub parquet alias paths for SQL querying.

Parquet alias shape:

hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet

Derive <config>, <split>, and <shard> from Dataset Viewer /parquet:

curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'

Run SQL query:

npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"

SQL export

  • CSV:
    --sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"
  • JSON:
    --sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"
  • Parquet:
    --sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"

Creating and Uploading Datasets

Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):

  • Create dataset repo in browser: https://huggingface.co/new-dataset
  • Upload parquet files in the repo "Files and versions" page.
  • Verify shards appear in Dataset Viewer:
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"

Low dependency CLI flow (npx @huggingface/hub / hfjs):

  • Set auth token:
export HF_TOKEN=<your_hf_token>
  • Upload parquet folder to a dataset repo (auto-creates repo if missing):
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
  • Upload as private repo on creation:
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private

After upload, call /parquet to discover <config>/<split>/<shard> values for querying with @~parquet.
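
For example, a sketch that reuses the jq projection from the querying section to list the discovered values for the freshly uploaded repo:

curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>" \
  | jq -r '.parquet_files[] | "\(.config)/\(.split)/\(.filename)"'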