# awesome-skills-cn · huggingface-datasets

Use this skill for Hugging Face Dataset Viewer API workflows that fetch subset/split metadata, paginate rows, search text, apply filters, resolve parquet download URLs, and read size or statistics.
## Install

Source · Clone the upstream repo:

```shell
git clone https://github.com/lingxling/awesome-skills-cn
```

Claude Code · Install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/lingxling/awesome-skills-cn "$T" && mkdir -p ~/.claude/skills && cp -r "$T/huggingface-skills/skills/huggingface-datasets" ~/.claude/skills/lingxling-awesome-skills-cn-huggingface-datasets && rm -rf "$T"
```

Manifest: `huggingface-skills/skills/huggingface-datasets/SKILL.md`
## Hugging Face Dataset Viewer

Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.

### Core workflow

- Optionally validate dataset availability with `/is-valid`.
- Resolve `config` + `split` with `/splits`.
- Preview with `/first-rows`.
- Paginate content with `/rows` using `offset` and `length` (max 100).
- Use `/search` for text matching and `/filter` for row predicates.
- Retrieve parquet links via `/parquet` and totals/metadata via `/size` and `/statistics`.
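The steps above can be sketched as the request URLs they produce. The dataset, config, and split values below are illustrative examples, not required defaults; in practice, pipe each URL through `curl`:

```shell
#!/bin/sh
# Core workflow expressed as Dataset Viewer request URLs.
# stanfordnlp/imdb with plain_text/train is only an illustrative target.
BASE="https://datasets-server.huggingface.co"
DS="stanfordnlp/imdb"; CFG="plain_text"; SPL="train"

echo "$BASE/is-valid?dataset=$DS"                                        # 1. validate
echo "$BASE/splits?dataset=$DS"                                          # 2. resolve config/split
echo "$BASE/first-rows?dataset=$DS&config=$CFG&split=$SPL"               # 3. preview
echo "$BASE/rows?dataset=$DS&config=$CFG&split=$SPL&offset=0&length=100" # 4. paginate
echo "$BASE/parquet?dataset=$DS"                                         # 5. parquet links
```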
### Defaults

- Base URL: `https://datasets-server.huggingface.co`
- Default API method: `GET`
- Query params should be URL-encoded.
- `offset` is 0-based.
- `length` max is usually 100 for row-like endpoints.
- Gated/private datasets require `Authorization: Bearer <HF_TOKEN>`.
### Dataset Viewer endpoints

- `/is-valid?dataset=<namespace/repo>`: Validate dataset
- `/splits?dataset=<namespace/repo>`: List subsets and splits
- `/first-rows?dataset=<namespace/repo>&config=<config>&split=<split>`: Preview first rows
- `/rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>`: Paginate rows
- `/search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>`: Search text
- `/filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>`: Filter with predicates
- `/parquet?dataset=<namespace/repo>`: List parquet shards
- `/size?dataset=<namespace/repo>`: Get size totals
- `/statistics?dataset=<namespace/repo>&config=<config>&split=<split>`: Get column statistics
- `/croissant?dataset=<namespace/repo>`: Get Croissant metadata (if available)
Pagination pattern:

```shell
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
```

When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.
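A hedged sketch of that continuation logic, using jq on a captured response. The JSON payload here is a hand-written stand-in for a real `/rows` response, not actual API output:

```shell
#!/bin/sh
# Derive page offsets from num_rows_total; the sample payload is illustrative.
RESP='{"num_rows_total":250,"num_rows_per_page":100,"partial":false}'
TOTAL=$(printf '%s' "$RESP" | jq '.num_rows_total')
LENGTH=$(printf '%s' "$RESP" | jq '.num_rows_per_page')

OFFSET=0
while [ "$OFFSET" -lt "$TOTAL" ]; do
  # In real use: curl ".../rows?dataset=...&offset=$OFFSET&length=$LENGTH"
  echo "offset=$OFFSET&length=$LENGTH"
  OFFSET=$((OFFSET + LENGTH))
done
```

With the sample total of 250 rows this emits three pages, at offsets 0, 100, and 200.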
Search/filter notes:

- `/search` matches string columns (full-text style behavior is internal to the API).
- `/filter` requires predicate syntax in `where` and optional sort in `orderby`.
- Keep filtering and searches read-only and side-effect free.
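Because the `where` predicate contains quotes, spaces, and `=`, it must itself be URL-encoded before it goes into the query string. A sketch assuming a SQL-like predicate over an illustrative `label` column:

```shell
#!/bin/sh
# Percent-encode a filter predicate before placing it in the query string.
# The "label" column and the predicate syntax are illustrative assumptions.
WHERE='"label" = 0'
ENC=$(jq -rn --arg w "$WHERE" '$w|@uri')
echo "https://datasets-server.huggingface.co/filter?dataset=stanfordnlp/imdb&config=plain_text&split=train&where=$ENC&offset=0&length=10"
```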
## Querying Datasets

Use `npx parquetlens` with Hub parquet alias paths for SQL querying.

Parquet alias shape:

```
hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet
```

Derive `<config>`, `<split>`, and `<shard>` from Dataset Viewer `/parquet`:

```shell
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```
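The same jq mapping can be checked offline against a hand-written sample of a `/parquet` response. The `acme/demo` repo and filenames below are illustrative, not real API output:

```shell
#!/bin/sh
# Map a sample /parquet payload to hf:// alias paths with the same jq program.
SAMPLE='{"parquet_files":[
  {"dataset":"acme/demo","config":"default","split":"train","filename":"0000.parquet"},
  {"dataset":"acme/demo","config":"default","split":"test","filename":"0000.parquet"}]}'
printf '%s' "$SAMPLE" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```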
Run SQL query:

```shell
npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"
```
### SQL export

- CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
- JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
- Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`
## Creating and Uploading Datasets

Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):

- Create the dataset repo in the browser: https://huggingface.co/new-dataset
- Upload parquet files on the repo's "Files and versions" page.
- Verify shards appear in the Dataset Viewer:

```shell
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
```
Low-dependency CLI flow (`npx @huggingface/hub` / hfjs):

- Set auth token:

```shell
export HF_TOKEN=<your_hf_token>
```

- Upload a parquet folder to a dataset repo (auto-creates the repo if missing):

```shell
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
```

- Upload as a private repo on creation:

```shell
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
```

After upload, call `/parquet` to discover `<config>/<split>/<shard>` values for querying with `@~parquet`.
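A defensive sketch of that CLI flow: guard the upload behind an `HF_TOKEN` check so a missing token fails fast instead of mid-transfer. The `upload_guarded` helper is hypothetical, and the actual `npx` invocation is left commented since it needs a real repo and token:

```shell
#!/bin/sh
# Hedged sketch: refuse to upload when HF_TOKEN is unset.
upload_guarded() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN not set; skipping upload" >&2
    return 1
  fi
  # Real call (placeholders for repo and folder):
  # npx -y @huggingface/hub upload "datasets/<namespace>/<repo>" ./local/parquet-folder data
  echo "token present; upload would proceed"
}

unset HF_TOKEN
upload_guarded || echo "guard tripped as expected"
```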