# awesome-skills-cn · huggingface-datasets

Use this skill for Hugging Face Dataset Viewer API workflows that fetch subset/split metadata, paginate rows, search text, apply filters, resolve parquet download URLs, and read size or statistics.
## Install

Source · Clone the upstream repo:

```shell
git clone https://github.com/lingxling/awesome-skills-cn
```

Claude Code · Install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/lingxling/awesome-skills-cn "$T" && mkdir -p ~/.claude/skills && cp -r "$T/huggingface-skills/skills/huggingface-datasets" ~/.claude/skills/lingxling-awesome-skills-cn-huggingface-datasets && rm -rf "$T"
```

Manifest: `huggingface-skills/skills/huggingface-datasets/SKILL.md`
## Hugging Face Dataset Viewer

Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.

### Core workflow

- Optionally validate dataset availability with `/is-valid`.
- Resolve `config` + `split` with `/splits`.
- Preview with `/first-rows`.
- Paginate content with `/rows` using `offset` and `length` (max 100).
- Use `/search` for text matching and `/filter` for row predicates.
- Retrieve parquet links via `/parquet` and totals/metadata via `/size` and `/statistics`.
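The steps above can be sketched as the request URLs they produce. The dataset, config, and split values below are illustrative examples, not required defaults; in practice, pipe each URL through `curl`:

```shell
#!/bin/sh
# Core workflow expressed as Dataset Viewer request URLs.
# stanfordnlp/imdb with plain_text/train is only an illustrative target.
BASE="https://datasets-server.huggingface.co"
DS="stanfordnlp/imdb"; CFG="plain_text"; SPL="train"

echo "$BASE/is-valid?dataset=$DS"                                        # 1. validate
echo "$BASE/splits?dataset=$DS"                                          # 2. resolve config/split
echo "$BASE/first-rows?dataset=$DS&config=$CFG&split=$SPL"               # 3. preview
echo "$BASE/rows?dataset=$DS&config=$CFG&split=$SPL&offset=0&length=100" # 4. paginate
echo "$BASE/parquet?dataset=$DS"                                         # 5. parquet links
```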
### Defaults

- Base URL: `https://datasets-server.huggingface.co`
- Default API method: `GET`
- Query params should be URL-encoded.
- `offset` is 0-based.
- `length` max is usually 100 for row-like endpoints.
- Gated/private datasets require `Authorization: Bearer <HF_TOKEN>`.
### Dataset Viewer endpoints

- `/is-valid?dataset=<namespace/repo>`: Validate dataset
- `/splits?dataset=<namespace/repo>`: List subsets and splits
- `/first-rows?dataset=<namespace/repo>&config=<config>&split=<split>`: Preview first rows
- `/rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>`: Paginate rows
- `/search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>`: Search text
- `/filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>`: Filter with predicates
- `/parquet?dataset=<namespace/repo>`: List parquet shards
- `/size?dataset=<namespace/repo>`: Get size totals
- `/statistics?dataset=<namespace/repo>&config=<config>&split=<split>`: Get column statistics
- `/croissant?dataset=<namespace/repo>`: Get Croissant metadata (if available)
Pagination pattern:

```shell
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
```

When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.
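A hedged sketch of that continuation logic, using jq on a captured response. The JSON payload here is a hand-written stand-in for a real `/rows` response, not actual API output:

```shell
#!/bin/sh
# Derive page offsets from num_rows_total; the sample payload is illustrative.
RESP='{"num_rows_total":250,"num_rows_per_page":100,"partial":false}'
TOTAL=$(printf '%s' "$RESP" | jq '.num_rows_total')
LENGTH=$(printf '%s' "$RESP" | jq '.num_rows_per_page')

OFFSET=0
while [ "$OFFSET" -lt "$TOTAL" ]; do
  # In real use: curl ".../rows?dataset=...&offset=$OFFSET&length=$LENGTH"
  echo "offset=$OFFSET&length=$LENGTH"
  OFFSET=$((OFFSET + LENGTH))
done
```

With the sample total of 250 rows this emits three pages, at offsets 0, 100, and 200.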
Search/filter notes:

- `/search` matches string columns (full-text style behavior is internal to the API).
- `/filter` requires predicate syntax in `where` and optional sort in `orderby`.
- Keep filtering and searches read-only and side-effect free.
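Because the `where` predicate contains quotes, spaces, and `=`, it must itself be URL-encoded before it goes into the query string. A sketch assuming a SQL-like predicate over an illustrative `label` column:

```shell
#!/bin/sh
# Percent-encode a filter predicate before placing it in the query string.
# The "label" column and the predicate syntax are illustrative assumptions.
WHERE='"label" = 0'
ENC=$(jq -rn --arg w "$WHERE" '$w|@uri')
echo "https://datasets-server.huggingface.co/filter?dataset=stanfordnlp/imdb&config=plain_text&split=train&where=$ENC&offset=0&length=10"
```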
## Querying Datasets

Use `npx parquetlens` with Hub parquet alias paths for SQL querying.

Parquet alias shape:

```
hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet
```

Derive `<config>`, `<split>`, and `<shard>` from Dataset Viewer `/parquet`:

```shell
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```
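The same jq mapping can be checked offline against a hand-written sample of a `/parquet` response. The `acme/demo` repo and filenames below are illustrative, not real API output:

```shell
#!/bin/sh
# Map a sample /parquet payload to hf:// alias paths with the same jq program.
SAMPLE='{"parquet_files":[
  {"dataset":"acme/demo","config":"default","split":"train","filename":"0000.parquet"},
  {"dataset":"acme/demo","config":"default","split":"test","filename":"0000.parquet"}]}'
printf '%s' "$SAMPLE" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```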
Run SQL query:

```shell
npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"
```
### SQL export

- CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
- JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
- Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`
## Creating and Uploading Datasets

Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):

- Create the dataset repo in the browser: https://huggingface.co/new-dataset
- Upload parquet files on the repo's "Files and versions" page.
- Verify shards appear in the Dataset Viewer:

```shell
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
```
Low-dependency CLI flow (`npx @huggingface/hub` / hfjs):

- Set auth token:

```shell
export HF_TOKEN=<your_hf_token>
```

- Upload a parquet folder to a dataset repo (auto-creates the repo if missing):

```shell
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
```

- Upload as a private repo on creation:

```shell
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
```

After upload, call `/parquet` to discover `<config>/<split>/<shard>` values for querying with `@~parquet`.
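A defensive sketch of that CLI flow: guard the upload behind an `HF_TOKEN` check so a missing token fails fast instead of mid-transfer. The `upload_guarded` helper is hypothetical, and the actual `npx` invocation is left commented since it needs a real repo and token:

```shell
#!/bin/sh
# Hedged sketch: refuse to upload when HF_TOKEN is unset.
upload_guarded() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN not set; skipping upload" >&2
    return 1
  fi
  # Real call (placeholders for repo and folder):
  # npx -y @huggingface/hub upload "datasets/<namespace>/<repo>" ./local/parquet-folder data
  echo "token present; upload would proceed"
}

unset HF_TOKEN
upload_guarded || echo "guard tripped as expected"
```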