ClawForge langsmith-dataset
Use this skill for ANY question about creating test or evaluation datasets for LangChain agents. Covers generating datasets from traces (final_response, single_step, trajectory, RAG types), uploading to LangSmith, and managing evaluation data.
git clone https://github.com/jackjin1997/ClawForge
T=$(mktemp -d) && git clone --depth=1 https://github.com/jackjin1997/ClawForge "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/langsmith-dataset" ~/.claude/skills/jackjin1997-clawforge-langsmith-dataset && rm -rf "$T"
LangSmith Dataset
Auto-generate evaluation datasets from LangSmith traces for testing and validation.
Setup
Environment Variables
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here   # Required
LANGSMITH_PROJECT=your-project-name           # Optional: default project
LANGSMITH_WORKSPACE_ID=your-workspace-id      # Optional: for org-scoped keys
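A quick pre-flight check can catch a missing key before any script runs. The sketch below is not part of the skill: `missing_langsmith_vars` is a hypothetical helper, and the only thing it takes from this document is the three variable names and which of them is required.

```python
import os
from typing import Mapping

# Names and the required/optional split are taken from the list above.
REQUIRED_VARS = ("LANGSMITH_API_KEY",)
OPTIONAL_VARS = ("LANGSMITH_PROJECT", "LANGSMITH_WORKSPACE_ID")


def missing_langsmith_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_langsmith_vars()` with no arguments inspects the current process environment; passing a plain dict makes the check easy to test.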
Dependencies
pip install langsmith click rich python-dotenv
Usage
Navigate to skills/langsmith-dataset/scripts/ to run commands.
Scripts
- generate_datasets.py: Create evaluation datasets from traces
- query_datasets.py: View and inspect datasets
Common Flags
All dataset generation commands support:
- --root-run-name <name>: Filter traces by root run name (e.g., "LangGraph" for DeepAgents)
- --limit <n>: Number of traces to process (default: 30)
- --last-n-minutes <n>: Only recent traces
- --output <path>: Output file (.json or .csv)
- --upload <name>: Upload to LangSmith with this dataset name
- --replace: Overwrite existing file/dataset (will prompt for confirmation)
- --yes: Skip confirmation prompts (use with caution)
IMPORTANT - Safety Prompts:
- The script prompts for confirmation before deleting existing datasets with --replace
- ALWAYS respect these prompts: wait for user input before proceeding
- NEVER use the --yes flag unless the user explicitly requests it
- The --yes flag skips all safety prompts and should only be used in automated workflows when explicitly authorized by the user
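The interaction between --replace and --yes can be summarized as a small decision function. This is a sketch of the behavior described above, not the script's actual code: `may_overwrite` and its parameters are hypothetical, and refusing to overwrite when --replace is absent is an assumption.

```python
from typing import Callable


def may_overwrite(exists: bool, replace: bool, yes: bool,
                  confirm: Callable[[str], bool]) -> bool:
    """Decide whether an existing file or dataset may be overwritten."""
    if not exists:
        return True    # nothing to overwrite
    if not replace:
        return False   # assumption: without --replace the existing target is kept
    if yes:
        return True    # --yes skips the confirmation prompt
    return confirm("Target already exists. Overwrite?")
```

Passing the prompt in as `confirm` keeps the decision testable; in an interactive run it would be whatever function asks the user.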
Understanding Trace Hierarchy
Traces have depth levels based on parent-child relationships:
Depth 0: Root agent (e.g., "LangGraph")
├── Depth 1: Middleware/chains (model, tools, SummarizationMiddleware)
│   ├── Depth 2: Tool calls (sql_db_query, retriever, etc.)
│   └── Depth 2: LLM calls (ChatOpenAI, ChatAnthropic)
└── Depth 3+: Nested subagent calls
Use --root-run-name to target specific agent frameworks:
- DeepAgents: --root-run-name LangGraph
- Custom agents: Use your root node name
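Depth follows directly from the parent-child links: a run with no parent is depth 0, and every other run is one level below its parent. The sketch below illustrates that rule on a hypothetical trace shaped like the tree above. `compute_depths` is an illustrative helper, and run names stand in for run IDs for readability.

```python
from typing import Optional


def compute_depths(parents: dict[str, Optional[str]]) -> dict[str, int]:
    """Map each run id to its depth; a run with no parent is depth 0."""
    depths: dict[str, int] = {}

    def depth_of(run_id: str) -> int:
        # Memoized walk up the parent chain.
        if run_id not in depths:
            parent = parents[run_id]
            depths[run_id] = 0 if parent is None else depth_of(parent) + 1
        return depths[run_id]

    for run_id in parents:
        depth_of(run_id)
    return depths


# Hypothetical trace: each key is a run, each value is its parent (None for the root).
example = {
    "LangGraph": None,
    "model": "LangGraph",
    "tools": "LangGraph",
    "ChatOpenAI": "model",
    "sql_db_query": "tools",
}
print(compute_depths(example))
# {'LangGraph': 0, 'model': 1, 'tools': 1, 'ChatOpenAI': 2, 'sql_db_query': 2}
```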
Dataset Types
1. Final Response
Full conversation with expected output - tests complete agent behavior.
# Basic usage
python generate_datasets.py --type final_response \
  --project my-project \
  --root-run-name LangGraph \
  --limit 30 \
  --output /tmp/final_response.json

# With custom output fields
python generate_datasets.py --type final_response \
  --project my-project \
  --output-fields "answer,result" \
  --output /tmp/final.json

# Messages only (ignore output dict keys)
python generate_datasets.py --type final_response \
  --project my-project \
  --messages-only \
  --output /tmp/final.json
Structure:
{
  "trace_id": "...",
  "inputs": {"query": "What are the top 3 genres?"},
  "outputs": {
    "expected_response": "The top 3 genres based on the number of tracks are:\n\n1. Rock with 1,297 tracks\n2. Latin with 579 tracks\n3. Metal with 374 tracks"
  }
}
Extraction Priority:
- Messages from root run (AI responses with content)
- User-specified output fields (--output-fields)
- Common keys (answer, output)
- Full output dict
Important: The script always checks the root run first for the final response, so that intermediate tool outputs are not mistaken for the answer.
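The priority order above can be read as a simple fallback chain. The following is a minimal sketch under stated assumptions, not the script's implementation: `extract_final_response` is a hypothetical name, messages are assumed to be dicts with `type` and `content` keys (the shape shown in the Single Step structure), the last AI message with content is assumed to win, and --messages-only is assumed to disable the dict-key fallbacks.

```python
from typing import Any, Optional, Sequence

COMMON_KEYS = ("answer", "output")  # the common keys named in the list above


def extract_final_response(
    root_outputs: dict[str, Any],
    output_fields: Sequence[str] = (),
    messages_only: bool = False,
) -> Optional[Any]:
    """Pick the expected response from a root run's outputs, in priority order."""
    # 1. Last AI message that actually has content.
    for msg in reversed(root_outputs.get("messages", [])):
        if msg.get("type") == "ai" and msg.get("content"):
            return msg["content"]
    if messages_only:
        return None  # assumption: never fall back to output dict keys
    # 2. User-specified fields, then 3. common keys.
    for key in (*output_fields, *COMMON_KEYS):
        if root_outputs.get(key):
            return root_outputs[key]
    # 4. Full output dict as a last resort.
    return root_outputs or None
```

An AI message whose content is empty (for example, one that only carries tool calls) is skipped, which is the point of preferring messages with content.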
2. Single Step
Single node inputs/outputs - tests any specific node's behavior. Supports multiple occurrences per trace to capture conversation evolution.
# Extract all occurrences (default)
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name model \
  --output /tmp/single_step.json

# Sample 2 occurrences per trace
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name model \
  --sample-per-trace 2 \
  --output /tmp/single_step_sampled.json

# Target specific tool at depth 2
python generate_datasets.py --type single_step \
  --project my-project \
  --root-run-name LangGraph \
  --run-name sql_db_query \
  --output /tmp/sql_query.json
Structure:
{
  "trace_id": "...",
  "run_id": "...",
  "occurrence": 2,
  "inputs": {
    "messages": [
      {"type": "human", "content": "What are the top 3 genres?"},
      {"type": "ai", "content": "", "tool_calls": [...]},
      {"type": "tool", "content": "...results..."},
      ...
    ]
  },
  "outputs": {
    "expected_output": {
      "messages": [
        {"type": "ai", "content": "", "tool_calls": [...]}
      ]
    },
    "node_name": "model"
  }
}
Key Features:
- The occurrence field tracks which invocation this is (1st, 2nd, 3rd, etc.)
- Later occurrences have more conversation history → tests context handling
- --sample-per-trace randomly samples N occurrences per trace
- Use --run-name to target any node at any depth
Common targets:
- model (depth 1): LLM invocations with growing context
- tools (depth 1): Tool execution chain
- Any custom node name
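Occurrence numbering and per-trace sampling can be sketched as follows. This illustrates the idea rather than the script's code: `number_and_sample` is a hypothetical helper, and each run is assumed to be a dict with `trace_id` and `start_time` keys.

```python
import random
from collections import defaultdict
from typing import Any, Optional


def number_and_sample(
    runs: list[dict[str, Any]],
    sample_per_trace: Optional[int] = None,
    seed: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Tag each run with its 1-based occurrence within its trace, then sample."""
    rng = random.Random(seed)
    by_trace: dict[str, list[dict[str, Any]]] = defaultdict(list)
    # Chronological order makes "occurrence 2" mean the second invocation.
    for run in sorted(runs, key=lambda r: r["start_time"]):
        by_trace[run["trace_id"]].append(run)

    selected: list[dict[str, Any]] = []
    for trace_runs in by_trace.values():
        numbered = [{**run, "occurrence": i} for i, run in enumerate(trace_runs, start=1)]
        if sample_per_trace is not None and len(numbered) > sample_per_trace:
            # Sample without replacement, then restore chronological order.
            numbered = sorted(rng.sample(numbered, sample_per_trace),
                              key=lambda r: r["occurrence"])
        selected.extend(numbered)
    return selected
```

Numbering before sampling matters: a sampled example keeps its original occurrence index, so it stays clear how much conversation history it carries.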
3. Trajectory
Tool call sequence - tests execution path with configurable depth.
# Include all tool calls (all depths)
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --limit 30 \
  --output /tmp/trajectory_all.json

# Only tool calls up to depth 2
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --depth 2 \
  --output /tmp/trajectory_depth2.json

# Only root-level tool calls (depth 0) - usually empty if tools are at depth 2+
python generate_datasets.py --type trajectory \
  --project my-project \
  --depth 0 \
  --output /tmp/trajectory_root.json
Structure:
{
  "trace_id": "...",
  "inputs": {"query": "What are the top 3 genres?"},
  "outputs": {
    "expected_trajectory": [
      "sql_db_list_tables",
      "sql_db_schema",
      "sql_db_query_checker",
      "sql_db_query"
    ]
  }
}
Depth Control:
- Omit --depth = all levels (includes subagent tool calls)
- --depth 2 = root + 2 levels (typical for capturing all main tools)
- --depth 1 = often only middleware/chains, no actual tool calls
- --depth 0 = root only (no tool calls)
Note: Tool calls are typically at depth 2 in LangGraph/DeepAgents architecture.
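The effect of --depth can be pictured as a filter applied before the tool names are collected. The sketch below assumes each run is a dict with `name`, `run_type`, a precomputed `depth`, and `start_time`; `build_trajectory` and the sample data (including the `subagent_tool` name) are hypothetical.

```python
from typing import Any, Optional


def build_trajectory(runs: list[dict[str, Any]], max_depth: Optional[int] = None) -> list[str]:
    """Return tool names in start order, keeping only runs at depth <= max_depth."""
    tools = [
        run for run in runs
        if run["run_type"] == "tool"
        and (max_depth is None or run["depth"] <= max_depth)
    ]
    return [run["name"] for run in sorted(tools, key=lambda r: r["start_time"])]


# Hypothetical trace with main tools at depth 2 and one subagent tool at depth 4.
runs = [
    {"name": "LangGraph", "run_type": "chain", "depth": 0, "start_time": 0},
    {"name": "tools", "run_type": "chain", "depth": 1, "start_time": 1},
    {"name": "sql_db_list_tables", "run_type": "tool", "depth": 2, "start_time": 2},
    {"name": "sql_db_query", "run_type": "tool", "depth": 2, "start_time": 3},
    {"name": "subagent_tool", "run_type": "tool", "depth": 4, "start_time": 4},
]
print(build_trajectory(runs))               # ['sql_db_list_tables', 'sql_db_query', 'subagent_tool']
print(build_trajectory(runs, max_depth=2))  # ['sql_db_list_tables', 'sql_db_query']
print(build_trajectory(runs, max_depth=1))  # []
```

With this data, a depth of 1 yields an empty trajectory, which matches the note that tool calls typically sit at depth 2.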
4. RAG
Question/chunks/answer/citations - tests retrieval quality.
python generate_datasets.py --type rag \
  --project my-project \
  --limit 30 \
  --output /tmp/rag_ds.csv  # Supports .json or .csv
Structure (CSV format):
question,retrieved_chunks,answer,cited_chunks
"How do I...","Chunk 1\n\nChunk 2","The answer is...","[\"Chunk 1\"]"
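To make the column layout concrete, here is one way rows with these four columns could be written using Python's standard `csv` module. It is a sketch, not the script's serializer: `rag_rows_to_csv` is a hypothetical helper, and joining chunks with a blank line and storing `cited_chunks` as a JSON string are assumptions read off the sample row. The `csv` module escapes embedded quotes by doubling them, so its output will differ slightly from the backslash-escaped sample.

```python
import csv
import io
import json


def rag_rows_to_csv(rows: list[dict]) -> str:
    """Serialize RAG examples to CSV text with the four columns shown above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["question", "retrieved_chunks", "answer", "cited_chunks"])
    for row in rows:
        writer.writerow([
            row["question"],
            "\n\n".join(row["retrieved_chunks"]),  # chunks separated by a blank line
            row["answer"],
            json.dumps(row["cited_chunks"]),       # list stored as a JSON string
        ])
    return buffer.getvalue()
```

Reading the file back with `csv.reader` and calling `json.loads` on the last column recovers the list of cited chunks.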
Output Formats
All dataset types support both JSON and CSV:
# JSON output (default)
python generate_datasets.py --type trajectory --project my-project --output ds.json

# CSV output (use .csv extension)
python generate_datasets.py --type trajectory --project my-project --output ds.csv
Upload to LangSmith
# Generate and upload in one command
python generate_datasets.py --type trajectory \
  --project my-project \
  --root-run-name LangGraph \
  --limit 50 \
  --output /tmp/trajectory_ds.json \
  --upload "Skills: Trajectory"

# Use --replace to overwrite existing dataset
python generate_datasets.py --type final_response \
  --project my-project \
  --output /tmp/final.json \
  --upload "Skills: Final Response" \
  --replace
Naming Convention: Use "Skills: <Type>" format for consistency:
- "Skills: Final Response"
- "Skills: Single Step (model)"
- "Skills: Single Step (sql_db_query)"
- "Skills: Trajectory (all depths)"
- "Skills: Trajectory (depth=2)"
Query Datasets
# List all datasets
python query_datasets.py list-datasets

# Filter by name pattern
python query_datasets.py list-datasets | grep "Skills:"

# View dataset examples
python query_datasets.py show "Skills: Trajectory" --limit 5

# View local file
python query_datasets.py view-file /tmp/trajectory_ds.json --limit 3

# Analyze structure
python query_datasets.py structure /tmp/trajectory_ds.json

# Export from LangSmith to local
python query_datasets.py export "Skills: Final Response" /tmp/exported.json --limit 100
Tips for Dataset Generation
- Always use --root-run-name: Filter for a specific agent framework (e.g., "LangGraph")
- Start with successful traces: Use recent successful runs for baseline datasets
- Use time windows: --last-n-minutes 1440 covers the last 24 hours of data
- Sample for single_step: Use --sample-per-trace 2 to capture conversation evolution
- Match depth to needs: --depth 2 typically captures all main tool calls
- Review before upload: Use query_datasets.py view-file to inspect first
- Iterative refinement: Generate small batches (10-20) first, validate, then scale up
- Use --replace carefully: It overwrites existing datasets, which is useful for iteration
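As a complement to reviewing a file by eye, a few lines of Python can flag examples whose outputs came back empty. This is a sketch under one assumption: that the generated JSON file is a list of examples, each with an `outputs` dict as in the structures shown earlier. `summarize_dataset` is a hypothetical helper, not one of the skill's scripts.

```python
import json
from pathlib import Path


def summarize_dataset(path: str) -> dict[str, int]:
    """Count examples in a generated JSON file and how many have empty outputs."""
    examples = json.loads(Path(path).read_text(encoding="utf-8"))
    # An example counts as empty when every value in its outputs dict is falsy.
    empty = sum(1 for ex in examples if not any((ex.get("outputs") or {}).values()))
    return {"total": len(examples), "empty_outputs": empty}
```

A high `empty_outputs` count on a small batch is a cheap signal to revisit the filters before scaling up or uploading.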
Example Workflow
# 1. Generate fresh traces (if needed)
python tests/test_agent.py --batch  # Your test agent

# 2. Generate all dataset types from LangGraph traces
python generate_datasets.py --type final_response \
  --project skills --root-run-name LangGraph --limit 10 \
  --output /tmp/final.json --upload "Skills: Final Response" --replace

python generate_datasets.py --type single_step \
  --project skills --root-run-name LangGraph --run-name model \
  --sample-per-trace 2 --limit 10 \
  --output /tmp/model.json --upload "Skills: Single Step (model)" --replace

python generate_datasets.py --type trajectory \
  --project skills --root-run-name LangGraph --limit 10 \
  --output /tmp/traj.json --upload "Skills: Trajectory (all depths)" --replace

python generate_datasets.py --type trajectory \
  --project skills --root-run-name LangGraph --depth 2 --limit 10 \
  --output /tmp/traj_d2.json --upload "Skills: Trajectory (depth=2)" --replace

# 3. Review in LangSmith UI
# Visit https://smith.langchain.com → Datasets → Filter for "Skills:"

# 4. Query locally if needed
python query_datasets.py show "Skills: Final Response" --limit 3
Troubleshooting
Empty final_response outputs:
- Ensure --root-run-name matches your agent's root node
- Check that the root run has messages with AI responses
- Use --messages-only if the output dict is empty
No trajectory examples:
- Tools might be at a different depth: try removing --depth or use --depth 2
- Verify tool calls exist: python query_traces.py trace <id> --show-hierarchy
Too many single_step examples:
- Use --sample-per-trace 2 to limit examples per trace
- This reduces dataset size while maintaining diversity
Dataset upload fails:
- Check that the dataset doesn't already exist, or use --replace
- Verify LANGSMITH_API_KEY is set
Related Skills
- Use langsmith-trace skill to query and export traces
- Use langsmith-evaluator skill to create evaluators and measure performance