claude-skill-registry · dataset-generator

Generate evaluation datasets with adjustable difficulty levels from PDF documents, for RAG system testing and benchmarking.

Install

Source · Clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/dataset-generator" ~/.claude/skills/majiayu000-claude-skill-registry-dataset-generator && rm -rf "$T"

Manifest: skills/data/dataset-generator/SKILL.md

Source content

Dataset Generator Skill

Generate high-quality benchmark evaluation datasets with adjustable difficulty levels from custom PDF documents. Perfect for testing RAG systems, knowledge graphs, and Q&A models.

Usage

Invoke this skill with:

/dataset-generator <pdf_directory> [output_file] [num_questions] [difficulty]

Arguments:

  • $1 (required) - Path to PDF directory containing source documents
  • $2 (optional) - Output JSON file path (default: benchmark_dataset.json)
  • $3 (optional) - Number of questions to generate (default: 20)
  • $4 (optional) - Difficulty level: easy, medium, hard, or mixed (default: mixed)
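
A minimal sketch of how the underlying script might resolve these positional arguments and their defaults (the variable names are illustrative assumptions, not the script's actual code):

import sys

# Positional arguments mirror the slash-command signature above.
# All names are illustrative assumptions about the underlying script.
pdf_directory = sys.argv[1]                                     # $1, required
output_file = sys.argv[2] if len(sys.argv) > 2 else "benchmark_dataset.json"
num_questions = int(sys.argv[3]) if len(sys.argv) > 3 else 20   # $3
difficulty = sys.argv[4] if len(sys.argv) > 4 else "mixed"      # $4

if difficulty not in {"easy", "medium", "hard", "mixed"}:
    raise SystemExit(f"unknown difficulty: {difficulty}")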

Examples

# Generate 20 mixed-difficulty questions
/dataset-generator ./pdfs

# Generate 30 hard questions
/dataset-generator ./pdfs hard_benchmark.json 30 hard

# Generate 15 easy questions for testing retrieval
/dataset-generator ./pdfs easy_test.json 15 easy

What This Skill Does

  1. Extract Content: Reads all PDFs from the specified directory using pypdf
  2. Analyze Topics: Uses Claude to identify key concepts, entities, dates, and relationships
  3. Generate Questions: Creates questions across 5 types:
    • Fact Retrieval: Direct facts extractable from single passages
    • Multi-hop Reasoning: Requires connecting 2-3 pieces of information
    • Comparative Analysis: Compare concepts, approaches, or entities
    • Contextual Summarization: Broad understanding across multiple sections
    • Creative Generation: Application/scenario-based questions
  4. Difficulty Calibration: Adjusts question complexity and required reasoning depth
  5. Format Output: Standard benchmark JSON format with:
    • Question and ground truth answer
    • Question type classification
    • Difficulty level
    • 2-5 supporting evidence passages
    • Evidence relationship explanations

Difficulty Levels

Easy (Single-hop, Direct)

  • Reasoning: Answerable from a single chunk/passage
  • Evidence: Direct quotes sufficient
  • Chunk Size: 300-500 chars
  • Examples:
    • "What is [Product/Service] described in the document?"
    • "Who is mentioned as the CEO in [Year]?"
    • "What is the duration/cost/size of [Feature]?"

Medium (Multi-hop, Inference)

  • Reasoning: Requires 2-3 pieces of information
  • Evidence: Light inference and connection needed
  • Chunk Size: 800-1000 chars
  • Examples:
    • "How does [Concept A] affect [Concept B]?"
    • "What are the requirements for [Process/System]?"

Hard (Synthesis, Cross-document)

  • Reasoning: Requires synthesizing information across multiple documents
  • Evidence: Implicit relationships, complex inference
  • Chunk Size: 1200-1500 chars
  • Examples:
    • "Compare [Company's] approach in [Document A] vs [Document B]"
    • "Summarize how [System] addresses [Challenge] across all documents"

Mixed (Balanced Distribution)

  • Distribution: 40% easy, 40% medium, 20% hard
  • Purpose: Comprehensive testing across difficulty spectrum
  • Chunk Size: Adaptive (1000 chars average)
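
For the mixed setting, the per-difficulty counts can be derived from the requested total along these lines (a sketch; the script's actual rounding may differ):

def mixed_distribution(num_questions: int) -> dict:
    # 40% easy, 40% medium, remainder hard, so the counts always sum exactly.
    easy = round(0.4 * num_questions)
    medium = round(0.4 * num_questions)
    return {"easy": easy, "medium": medium, "hard": num_questions - easy - medium}

print(mixed_distribution(20))  # {'easy': 8, 'medium': 8, 'hard': 4}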

Output Format

Standard evaluation JSON format:

[
  {
    "id": "unique-hash-id",
    "question": "What is the main product described?",
    "answer": "The main product is a cloud-based solution that provides...",
    "question_type": "Fact Retrieval",
    "difficulty": "easy",
    "evidence": [
      "The product is a cloud-based solution that provides enterprise-grade features...",
      "Key capabilities include real-time processing and analytics..."
    ],
    "evidence_relations": "Evidence 1 defines the product, evidence 2 details key capabilities."
  }
]
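
Generated files can be loaded and sanity-checked in a few lines of Python; the field names follow the example above, and the 2-5 evidence-passage bound is stated in the Notes section below:

import json

REQUIRED_FIELDS = {"id", "question", "answer", "question_type",
                   "difficulty", "evidence", "evidence_relations"}

with open("benchmark_dataset.json", encoding="utf-8") as f:
    dataset = json.load(f)

for record in dataset:
    missing = REQUIRED_FIELDS - record.keys()
    assert not missing, f"record {record.get('id')} is missing {missing}"
    assert 2 <= len(record["evidence"]) <= 5, "expected 2-5 evidence passages"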

Implementation Details

When invoked, execute the Python script generate_benchmark_with_difficulty.py, which:

  1. Load PDFs: Extract text from all PDFs in the directory
  2. Adaptive Chunking (sketched after this list):
    • Easy: 300-500 char chunks
    • Medium: 800-1000 char chunks
    • Hard: 1200-1500 char chunks with 25% overlap
  3. Topic Analysis: Use Claude to identify:
    • Key entities (companies, products, people, dates)
    • Main concepts and themes
    • Relationships and connections
  4. Question Generation (Claude-powered):
    • Generate questions matching difficulty requirements
    • Ensure diverse question types
    • Create comprehensive ground truth answers
    • Extract supporting evidence passages
  5. Validation:
    • Verify evidence supports answer
    • Check answer completeness
    • Validate JSON structure
  6. Output: Save to specified file with statistics
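
A minimal sketch of steps 1-2 (PDF loading with pypdf, then adaptive chunking); the chunk sizes are midpoints of the ranges above, and the overlap handling is an illustrative simplification:

from pathlib import Path
from pypdf import PdfReader

CHUNK_SIZES = {"easy": 400, "medium": 900, "hard": 1350}  # midpoints of the ranges

def load_pdfs(pdf_directory: str) -> str:
    # Step 1: concatenate the extracted text of every PDF in the directory.
    texts = []
    for pdf_path in sorted(Path(pdf_directory).glob("*.pdf")):
        reader = PdfReader(pdf_path)
        texts.append("\n".join(page.extract_text() or "" for page in reader.pages))
    return "\n\n".join(texts)

def chunk(text: str, difficulty: str) -> list:
    # Step 2: fixed-size windows; hard mode steps by 75% of the chunk size,
    # which yields the 25% overlap described above.
    size = CHUNK_SIZES.get(difficulty, 1000)  # "mixed" falls back to ~1000 chars
    step = int(size * 0.75) if difficulty == "hard" else size
    return [text[i:i + size] for i in range(0, len(text), step)]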

Statistics Reported

After generation, the script reports:

  • Total questions generated
  • Questions per type breakdown
  • Questions per difficulty
  • Average answer length
  • Average evidence passages per question
  • Processing time
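
All of these except processing time can be recomputed from the output file, e.g. (a sketch against the JSON format above):

import json
from collections import Counter

with open("benchmark_dataset.json", encoding="utf-8") as f:
    dataset = json.load(f)

print("total questions:", len(dataset))
print("by type:", Counter(r["question_type"] for r in dataset))
print("by difficulty:", Counter(r["difficulty"] for r in dataset))
print("avg answer length:", sum(len(r["answer"]) for r in dataset) / len(dataset))
print("avg evidence passages:", sum(len(r["evidence"]) for r in dataset) / len(dataset))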

Requirements

  • Python 3.8+
  • pypdf library (auto-installed if missing)
  • Anthropic API key (from environment)
  • PDF files in specified directory

Notes

  • For hard questions, ensures cross-document synthesis by analyzing multiple PDFs
  • For easy questions, uses direct extraction with minimal inference
  • Always includes 2-5 evidence passages per question
  • Validates that evidence actually supports the answer
  • Uses unique hash IDs for question tracking (one possible scheme is sketched below)
  • Compatible with Ragas and other evaluation frameworks
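
One plausible scheme for the unique hash IDs (an assumption for illustration; the actual script may derive IDs differently):

import hashlib

def question_id(question: str) -> str:
    # Hashing the question text gives a stable, collision-resistant ID.
    return hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]

print(question_id("What is the main product described?"))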