Awesome-omni-skills hugging-face-datasets
Overview: workflow skill. Use this skill when the user needs to create and manage datasets on the Hugging Face Hub. It supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. It is designed to work alongside the HF MCP server for comprehensive dataset workflows, and the operator should preserve the upstream workflow, copied support files, and provenance before merging or handing off.
```bash
git clone https://github.com/diegosouzapw/awesome-omni-skills

T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/hugging-face-datasets" ~/.claude/skills/diegosouzapw-awesome-omni-skills-hugging-face-datasets && rm -rf "$T"
```
skills/hugging-face-datasets/SKILL.md
Overview
This public intake copy packages plugins/antigravity-awesome-skills-claude/skills/hugging-face-datasets from https://github.com/sickn33/antigravity-awesome-skills into the native Omni Skills editorial shape without hiding its origin.
Use it when the operator needs the upstream workflow, support files, and repository context to stay intact while the public validator and private enhancer continue their normal downstream flow.
This intake keeps the copied upstream files intact and uses metadata.json plus ORIGIN.md as the provenance anchor for review.
This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.
Imported source sections that did not map cleanly to the public headings are still preserved below or in the support files. Notable imported sections: Integration with HF MCP Server, 1. Dataset Lifecycle Management, 2. SQL-Based Dataset Querying (NEW), 3. Multi-Format Dataset Support, 4. Quality Assurance Features, SQL Query Syntax.
When to Use This Skill
Use this section as the trigger filter. It should make the activation boundary explicit before the operator loads files, runs commands, or opens a pull request.
- You need to create, configure, or update datasets on the Hugging Face Hub.
- You want SQL-style querying, transformation, or export flows over Hub datasets.
- You are managing dataset content and metadata directly rather than only searching existing datasets.
- Use when the request clearly matches the imported source intent: Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP....
- Use when the operator should preserve upstream workflow detail instead of rewriting the process from scratch.
- Use when provenance needs to stay visible in the answer, PR, or review packet.
Operating Table
| Situation | Start here | Why it matters |
|---|---|---|
| First-time use | | Confirms repository, branch, commit, and imported path before touching the copied workflow |
| Provenance review | | Gives reviewers a plain-language audit trail for the imported source |
| Workflow execution | | Starts with the smallest copied file that materially changes execution |
| Supporting context | | Adds the next most relevant copied source file without loading the entire package |
| Handoff decision | | Helps the operator switch to a stronger native skill when the task drifts |
Workflow
This workflow is intentionally editorial and operational at the same time. It keeps the imported source useful to the operator while still satisfying the public intake standards that feed the downstream enhancer flow.
- Confirm the user goal, the scope of the imported workflow, and whether this skill is still the right router for the task.
- Read the overview and provenance files before loading any copied upstream support files.
- Load only the references, examples, prompts, or scripts that materially change the outcome for the current request.
- Execute the upstream workflow while keeping provenance and source boundaries explicit in the working notes.
- Validate the result against the upstream expectations and the evidence you can point to in the copied files.
- Escalate or hand off to a related skill when the work moves out of this imported workflow's center of gravity.
- Before merge or closure, record what was used, what changed, and what the reviewer still needs to verify.
Imported Workflow Notes
Imported: Integration with HF MCP Server
- Use HF MCP Server for: Dataset discovery, search, and metadata retrieval
- Use This Skill for: Dataset creation, content editing, SQL queries, data transformation, and structured data formatting
Version
2.1.0
Dependencies
This skill uses PEP 723 scripts with inline dependency management (a sketch of the header format appears after this list).
Scripts auto-install their requirements when run with: uv run scripts/script_name.py
- uv (Python package manager)
- Getting Started: See "Usage Instructions" below for PEP 723 usage
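For orientation, a PEP 723 inline-metadata header looks roughly like the sketch below. The package names listed here are placeholders, not a claim about what the shipped scripts actually declare; check the real header at the top of each script.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "duckdb",            # placeholder: see the actual script header for the real list
#     "huggingface_hub",   # placeholder
# ]
# ///

# With a header like this, `uv run scripts/script_name.py` resolves and installs
# the declared dependencies in an isolated environment before executing the script.
print("dependencies resolved by uv at run time")
```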
Core Capabilities
Examples
Example 1: Ask for the upstream workflow directly
Use @hugging-face-datasets to handle <task>. Start from the copied upstream workflow, load only the files that change the outcome, and keep provenance visible in the answer.
Explanation: This is the safest starting point when the operator needs the imported workflow, but not the entire repository.
Example 2: Ask for a provenance-grounded review
Review @hugging-face-datasets against metadata.json and ORIGIN.md, then explain which copied upstream files you would load first and why.
Explanation: Use this before review or troubleshooting when you need a precise, auditable explanation of origin and file selection.
Example 3: Narrow the copied support files before execution
Use @hugging-face-datasets for <task>. Load only the copied references, examples, or scripts that change the outcome, and name the files explicitly before proceeding.
Explanation: This keeps the skill aligned with progressive disclosure instead of loading the whole copied package by default.
Example 4: Build a reviewer packet
Review @hugging-face-datasets using the copied upstream files plus provenance, then summarize any gaps before merge.
Explanation: This is useful when the PR is waiting for human review and you want a repeatable audit packet.
Imported Usage Notes
Imported: Quick Start
```bash
# Query a dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"

# Get dataset schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Sample random rows
uv run scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5

# Count rows with filter
uv run scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
```
Imported: Python API Usage
```python
from sql_manager import HFDatasetSQL

sql = HFDatasetSQL()

# Query
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")

# Get schema
schema = sql.describe("cais/mmlu")

# Sample
samples = sql.sample("cais/mmlu", n=5, seed=42)

# Count
count = sql.count("cais/mmlu", where="subject='nutrition'")

# Histogram
dist = sql.histogram("cais/mmlu", "subject")

# Filter and transform
results = sql.filter_and_transform(
    "cais/mmlu",
    select="subject, COUNT(*) as cnt",
    group_by="subject",
    order_by="cnt DESC",
    limit=10
)

# Push to Hub
url = sql.push_to_hub(
    "cais/mmlu",
    "username/nutrition-subset",
    sql="SELECT * FROM data WHERE subject='nutrition'",
    private=True
)

# Export locally
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")

sql.close()
```
Imported: TRAINING EXAMPLE STRUCTURE
[Specify exact JSON schema for chat templating]
Imported: EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]
### Example Categories & Templates

The skill includes diverse training examples beyond just MCP usage.

**Available Example Sets:**

- `training_examples.json` - MCP tool usage examples (debugging, project setup, database analysis)
- `diverse_training_examples.json` - Broader scenarios including:
  - **Educational Chat** - Explaining programming concepts, tutorials
  - **Git Workflows** - Feature branches, version control guidance
  - **Code Analysis** - Performance optimization, architecture review
  - **Content Generation** - Professional writing, creative brainstorming
  - **Codebase Navigation** - Legacy code exploration, systematic analysis
  - **Conversational Support** - Problem-solving, technical discussions

**Using Different Example Sets:**

```bash
# Add MCP-focused examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/training_examples.json)"

# Add diverse conversational examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/diverse_training_examples.json)"

# Mix both for comprehensive training data
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
```
Commands Reference
List Available Templates:
```bash
uv run scripts/dataset_manager.py list_templates
```
Quick Setup (Recommended):
```bash
uv run scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
```
Manual Setup:
```bash
# Initialize repository
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"

# Add data with validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
```
View Dataset Statistics:
```bash
uv run scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
```
Error Handling
- Repository exists: Script will notify and continue with configuration
- Invalid JSON: Clear error message with parsing details (a local pre-validation sketch follows this list)
- Network issues: Automatic retry for transient failures
- Token permissions: Validation before operations begin
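If you want to catch malformed payloads before they ever reach add_rows, a minimal local check using only the standard library is sketched below. The file name is illustrative; substitute whatever you plan to pass via --rows_json.

```python
import json
import sys

# Hypothetical payload file; use the file you intend to upload.
path = "your_qa_data.json"

try:
    with open(path, encoding="utf-8") as fh:
        rows = json.load(fh)
except json.JSONDecodeError as exc:
    sys.exit(f"Invalid JSON in {path}: {exc}")

# add_rows expects an array of row objects, so check the overall shape too.
if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
    sys.exit(f"{path}: expected a JSON array of row objects")

print(f"{path}: {len(rows)} rows look structurally valid")
```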
Combined Workflow Examples
Imported: Example 1: Create Training Subset from Existing Dataset
```bash
# 1. Explore the source dataset
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"

# 2. Query and create subset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
  --push-to "username/mmlu-medical-subset" \
  --private
```
Imported: Example 2: Transform and Reshape Data
```bash
# Transform MMLU to QA format with correct answers extracted
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT question, choices[answer] as correct_answer, subject FROM data" \
  --push-to "username/mmlu-qa-format"
```
Imported: Example 3: Merge Multiple Dataset Splits
```bash
# Export multiple splits and combine
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --split "*" \
  --output "mmlu_all.parquet"
```
Imported: Example 4: Quality Filtering
```bash
# Filter for high-quality examples
uv run scripts/sql_manager.py query \
  --dataset "squad" \
  --sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
  --push-to "username/squad-filtered"
```
Imported: Example 5: Create Custom Training Dataset
```bash
# 1. Query source data
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
  --output "nutrition_source.jsonl" \
  --format jsonl

# 2. Process with your pipeline (add answers, format, etc.)

# 3. Push processed data
uv run scripts/dataset_manager.py init --repo_id "username/nutrition-training"
uv run scripts/dataset_manager.py add_rows \
  --repo_id "username/nutrition-training" \
  --template qa \
  --rows_json "$(cat processed_data.json)"
```
Best Practices
Treat the generated public skill as a reviewable packaging layer around the upstream repository. The goal is to keep provenance explicit and load only the copied source material that materially improves execution.
- Keep the imported skill grounded in the upstream repository; do not invent steps that the source material cannot support.
- Prefer the smallest useful set of support files so the workflow stays auditable and fast to review.
- Keep provenance, source commit, and imported file paths visible in notes and PR descriptions.
- Point directly at the copied upstream files that justify the workflow instead of relying on generic review boilerplate.
- Treat generated examples as scaffolding; adapt them to the concrete task before execution.
- Route to a stronger native skill when architecture, debugging, design, or security concerns become dominant.
Imported Operating Notes
Imported: QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]
Troubleshooting
Problem: The operator skipped the imported context and answered too generically
Symptoms: The result ignores the upstream workflow in plugins/antigravity-awesome-skills-claude/skills/hugging-face-datasets, fails to mention provenance, or does not use any copied source files at all.
Solution: Re-open metadata.json, ORIGIN.md, and the most relevant copied upstream files. Load only the files that materially change the answer, then restate the provenance before continuing.
Problem: The imported workflow feels incomplete during review
Symptoms: Reviewers can see the generated SKILL.md, but they cannot quickly tell which references, examples, or scripts matter for the current task.
Solution: Point at the exact copied references, examples, scripts, or assets that justify the path you took. If the gap is still real, record it in the PR instead of hiding it.
Problem: The task drifted into a different specialization
Symptoms: The imported skill starts in the right place, but the work turns into debugging, architecture, design, security, or release orchestration that a native skill handles better.
Solution: Use the related skills section to hand off deliberately. Keep the imported provenance visible so the next skill inherits the right context instead of starting blind.
Related Skills
- @github-issue-creator: Use when the work is better handled by that native specialization after this imported skill establishes context.
- @github-workflow-automation: Use when the work is better handled by that native specialization after this imported skill establishes context.
- @gitlab-automation: Use when the work is better handled by that native specialization after this imported skill establishes context.
- @gitlab-ci-patterns: Use when the work is better handled by that native specialization after this imported skill establishes context.
Additional Resources
Use this support matrix and the linked files below as the operator packet for this imported skill. They should reflect real copied source material, not generic scaffolding.
| Resource family | What it gives the reviewer | Example path |
|---|---|---|
| References | copied reference notes, guides, or background material from upstream | |
| Examples and prompts | worked examples or reusable prompts copied from upstream | |
| Scripts | upstream helper scripts that change execution or validation | |
| Routing notes | routing or delegation notes that are genuinely part of the imported package | |
| Assets | supporting assets or schemas copied from the source package | |
Imported Reference Notes
Imported: 1. Dataset Lifecycle Management
- Initialize: Create new dataset repositories with proper structure
- Configure: Store detailed configuration including system prompts and metadata
- Stream Updates: Add rows efficiently without downloading entire datasets
Imported: 2. SQL-Based Dataset Querying (NEW)
Query any Hugging Face dataset using DuckDB SQL via scripts/sql_manager.py:
- Direct Queries: Run SQL on datasets using the hf:// protocol
- Schema Discovery: Describe dataset structure and column types
- Data Sampling: Get random samples for exploration
- Aggregations: Count, histogram, unique values analysis
- Transformations: Filter, join, reshape data with SQL
- Export & Push: Save results locally or push to new Hub repos
Imported: 3. Multi-Format Dataset Support
Supports diverse dataset types through template system:
- Chat/Conversational: Chat templating, multi-turn dialogues, tool usage examples
- Text Classification: Sentiment analysis, intent detection, topic classification
- Question-Answering: Reading comprehension, factual QA, knowledge bases
- Text Completion: Language modeling, code completion, creative writing
- Tabular Data: Structured data for regression/classification tasks
- Custom Formats: Flexible schema definition for specialized needs
Imported: 4. Quality Assurance Features
- JSON Validation: Ensures data integrity during uploads
- Batch Processing: Efficient handling of large datasets (a manual chunking sketch follows this list)
- Error Recovery: Graceful handling of upload failures and conflicts
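The scripts handle batching internally, but if you need to split a very large payload yourself, a rough chunking approach is sketched below. The input file name and chunk size are assumptions for illustration, not documented defaults of dataset_manager.py.

```python
import json
import subprocess

# Hypothetical large payload; substitute your own file.
with open("large_dataset.json", encoding="utf-8") as fh:
    rows = json.load(fh)

chunk_size = 500  # assumed batch size; tune for your row size and network

for start in range(0, len(rows), chunk_size):
    chunk = rows[start:start + chunk_size]
    # Upload each chunk separately so a failure only affects one batch
    # instead of the whole dataset.
    subprocess.run(
        [
            "uv", "run", "scripts/dataset_manager.py", "add_rows",
            "--repo_id", "your-username/dataset-name",
            "--template", "qa",
            "--rows_json", json.dumps(chunk),
        ],
        check=True,
    )
```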
Usage Instructions
The skill includes two Python scripts that use PEP 723 inline dependency management:
All paths are relative to the directory containing this SKILL.md file. Scripts are run with:
uv run scripts/script_name.py [arguments]
- scripts/dataset_manager.py: Dataset creation and management
- scripts/sql_manager.py: SQL-based dataset querying and transformation
Prerequisites
- uv package manager installed
- HF_TOKEN environment variable set with a write-access token (a quick preflight check is sketched below)
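A minimal sketch, using only the standard library, for confirming the token is actually visible to the scripts before you run anything against the Hub:

```python
import os
import sys

token = os.environ.get("HF_TOKEN")
if not token:
    sys.exit("HF_TOKEN is not set; export a write-access token before running the scripts")

# Avoid printing the token itself; just confirm it is present.
print(f"HF_TOKEN detected (length {len(token)})")
```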
SQL Dataset Querying (sql_manager.py)
Query, transform, and push Hugging Face datasets using DuckDB SQL. The hf:// protocol provides direct access to any public dataset (or private with token).
Imported: SQL Query Syntax
Use data as the table name in your SQL; it gets replaced with the actual hf:// path:
```sql
-- Basic select
SELECT * FROM data LIMIT 10

-- Filtering
SELECT * FROM data WHERE subject='nutrition'

-- Aggregations
SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject ORDER BY cnt DESC

-- Column selection and transformation
SELECT question, choices[answer] AS correct_answer FROM data

-- Regex matching
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')

-- String functions
SELECT regexp_replace(question, '\n', '') AS cleaned FROM data
```
Imported: Common Operations
1. Explore Dataset Structure
```bash
# Get schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Get unique values in column
uv run scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"

# Get value distribution
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
```
2. Filter and Transform
```bash
# Complex filtering with SQL
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"

# Using transform command
uv run scripts/sql_manager.py transform \
  --dataset "cais/mmlu" \
  --select "subject, COUNT(*) as cnt" \
  --group-by "subject" \
  --order-by "cnt DESC" \
  --limit 10
```
3. Create Subsets and Push to Hub
```bash
# Query and push to new dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --push-to "username/mmlu-nutrition-subset" \
  --private

# Transform and push
uv run scripts/sql_manager.py transform \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --select "question, answers" \
  --where "LENGTH(question) > 50" \
  --push-to "username/duorc-long-questions"
```
4. Export to Local Files
```bash
# Export to Parquet
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --output "nutrition.parquet" \
  --format parquet

# Export to JSONL
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data LIMIT 100" \
  --output "sample.jsonl" \
  --format jsonl
```
5. Working with Dataset Configs/Splits
```bash
# Specify config (subset)
uv run scripts/sql_manager.py query \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --sql "SELECT * FROM data LIMIT 5"

# Specify split
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "test" \
  --sql "SELECT COUNT(*) FROM data"

# Query all splits
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "*" \
  --sql "SELECT * FROM data LIMIT 10"
```
6. Raw SQL with Full Paths
For complex queries or joining datasets:
```bash
uv run scripts/sql_manager.py raw --sql "
  SELECT a.*, b.*
  FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
  JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
    ON a.id = b.id
  LIMIT 100
"
```
Imported: HF Path Format
DuckDB uses the hf:// protocol to access datasets:
hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet
Examples:
- hf://datasets/cais/mmlu@~parquet/default/train/*.parquet
- hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet
The @~parquet revision provides auto-converted Parquet files for any dataset format.
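If the wrapper script is unavailable, the same hf:// paths can usually be queried straight from DuckDB's Python API. This is a minimal sketch under the assumption that the duckdb package is installed and the dataset is public; private datasets additionally need the Hub token made available to DuckDB, which sql_manager.py otherwise handles for you.

```python
import duckdb  # assumption: the duckdb package is installed separately

# The @~parquet revision exposes the Hub's auto-converted Parquet files,
# so the dataset can be scanned remotely without downloading it first.
path = "hf://datasets/cais/mmlu@~parquet/default/train/*.parquet"

rel = duckdb.sql(f"""
    SELECT subject, COUNT(*) AS cnt
    FROM '{path}'
    GROUP BY subject
    ORDER BY cnt DESC
    LIMIT 10
""")
print(rel)
```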
Imported: Useful DuckDB SQL Functions
```sql
-- String functions
LENGTH(column)                      -- String length
regexp_replace(col, '\n', '')       -- Regex replace
regexp_matches(col, 'pattern')      -- Regex match
LOWER(col), UPPER(col)              -- Case conversion

-- Array functions
choices[0]                          -- Array indexing (0-based)
array_length(choices)               -- Array length
unnest(choices)                     -- Expand array to rows

-- Aggregations
COUNT(*), SUM(col), AVG(col)
GROUP BY col HAVING condition

-- Sampling
USING SAMPLE 10                     -- Random sample
USING SAMPLE 10 (RESERVOIR, 42)     -- Reproducible sample

-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)
```
Dataset Creation (dataset_manager.py)
Recommended Workflow
1. Discovery (Use HF MCP Server):
```python
# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
```
2. Creation (Use This Skill):
```bash
# Initialize new dataset
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with detailed system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
```
3. Content Management (Use This Skill):
```bash
# Quick setup with any template
uv run scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"
```
Template-Based Data Structures
1. Chat Template (--template chat)
```json
{
  "messages": [
    {"role": "user", "content": "Natural user request"},
    {"role": "assistant", "content": "Response with tool usage"},
    {"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
  ],
  "scenario": "Description of use case",
  "complexity": "simple|intermediate|advanced"
}
```
2. Classification Template (--template classification)
```json
{
  "text": "Input text to be classified",
  "label": "classification_label",
  "confidence": 0.95,
  "metadata": {"domain": "technology", "language": "en"}
}
```
3. QA Template (--template qa)
```json
{
  "question": "What is the question being asked?",
  "answer": "The complete answer",
  "context": "Additional context if needed",
  "answer_type": "factual|explanatory|opinion",
  "difficulty": "easy|medium|hard"
}
```
4. Completion Template (--template completion)
```json
{
  "prompt": "The beginning text or context",
  "completion": "The expected continuation",
  "domain": "code|creative|technical|conversational",
  "style": "description of writing style"
}
```
5. Tabular Template (--template tabular)
```json
{
  "columns": [
    {"name": "feature1", "type": "numeric", "description": "First feature"},
    {"name": "target", "type": "categorical", "description": "Target variable"}
  ],
  "data": [
    {"feature1": 123, "target": "class_a"},
    {"feature1": 456, "target": "class_b"}
  ]
}
```
Advanced System Prompt Template
For high-quality training data generation:
You are an AI assistant expert at using MCP tools effectively.
Imported: MCP SERVER DEFINITIONS
[Define available servers and tools]
Imported: Limitations
- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.