# Batch Execution Validator Skill

Validate production batch execution: trigger daily runs and analyze traces for architecture completeness and result quality.

Install from [claude-skill-registry](https://github.com/majiayu000/claude-skill-registry):

```shell
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry

# Or copy only this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/batch-execution-validator" ~/.claude/skills/majiayu000-claude-skill-registry-batch-execution-validator && rm -rf "$T"
```

Skill source: `skills/data/batch-execution-validator/SKILL.md`
## Purpose
End-to-end validation of production batch execution pipeline:
- Trigger batch execution for daily frequency via production API
- Wait for execution to complete
- Retrieve execution traces from Langfuse using advanced filters
- Analyze traces for architecture completeness and result quality
- Report findings with actionable recommendations
## When to Use
- "Run a batch execution test for daily frequency"
- "Validate the production pipeline is working correctly"
- "Check if Langfuse tracing captures all nodes"
- "Test batch execution and analyze results"
- "Verify daily batch runs are generating quality output"
## Required Environment Variables

- `API_SECRET_KEY`: Production API secret key
- `LANGFUSE_PUBLIC_KEY`: Langfuse public API key
- `LANGFUSE_SECRET_KEY`: Langfuse secret API key
- `LANGFUSE_HOST`: Langfuse host URL (default: https://cloud.langfuse.com)
## Workflow

### Step 1: Trigger Batch Execution

Use the `api_client.py` helper to trigger a batch execution for daily frequency:

```shell
cd .claude/skills/batch-execution-validator/helpers

# Trigger batch execution
python3 api_client.py \
  --api-url https://your-api.com \
  --frequency daily \
  --wait 180

# Output:
# - Batch execution triggered
# - Number of tasks found
# - Started timestamp
# - Task IDs (for trace retrieval)
```
What it does:

- POSTs to `/execute/batch` with `frequency="daily"`
- Extracts task IDs that will be processed
- Waits the specified time (default 180s = 3 min) for completion
- Returns execution metadata
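The trigger step can be sketched with the standard library. The `/execute/batch` path and `frequency` payload follow this doc; the Bearer-token header and the response field names (`started_at`, `task_ids`) are assumptions about the production API, not confirmed by it:

```python
import json
import urllib.request


def build_trigger_request(api_url: str, frequency: str, api_key: str) -> urllib.request.Request:
    """Build the POST /execute/batch request. The auth header shape is an assumption."""
    body = json.dumps({"frequency": frequency}).encode()
    return urllib.request.Request(
        f"{api_url.rstrip('/')}/execute/batch",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def parse_trigger_response(raw: bytes) -> dict:
    """Extract the two fields Step 2 needs; key names are assumed."""
    data = json.loads(raw)
    return {"started_at": data.get("started_at"),
            "task_ids": data.get("task_ids", [])}


if __name__ == "__main__":
    req = build_trigger_request("https://your-api.com", "daily", "sk-example")
    # urllib.request.urlopen(req, timeout=30)  # real call; needs network access
    print(req.full_url)
```

Keeping request construction separate from the network call makes the helper easy to unit-test without hitting production.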
### Step 2: Retrieve and Analyze Traces

Use the `trace_fetcher.py` helper to query Langfuse and analyze results:

```shell
# Retrieve traces for the batch execution
python3 trace_fetcher.py \
  --from-timestamp "2025-11-07T14:30:00Z" \
  --tags batch_execution daily \
  --session-ids "task-id-1,task-id-2,task-id-3" \
  --output /tmp/batch_validation_results.json

# Output:
# - Full trace data with nested observations
# - Architecture analysis (node coverage, hierarchy)
# - Quality assessment (sections, citations, performance)
# - Issues and warnings
```
What it does:

- Queries Langfuse with advanced filters:
  - `tags`: ["batch_execution", "daily"]
  - `timestamp >= started_at`
  - `session_id in [task_ids]`
- Fetches full trace details plus all child observations
- Analyzes trace architecture:
  - Node coverage (router, research, write, edit)
  - Hierarchy validation (parent-child relationships)
  - Metadata completeness
  - Error detection
- Assesses result quality:
  - Output structure (sections, citations)
  - Content completeness
  - Performance metrics (latency)
- Generates an analysis report
## Analysis Criteria

### Architecture Validation

Expected nodes:

- `router`: Route strategy selection
- `research`: Evidence gathering
- `write`: Content generation
- `edit`: Validation and refinement
Checks:
- ✓ All expected nodes present
- ✓ Trace metadata complete (task_id, frequency, callback_url)
- ✓ Correct trace hierarchy (all observations linked)
- ✓ No ERROR level observations
- ✓ All nodes have start_time, end_time, input, output
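The checks above can be sketched as one function over a fetched trace. The field names (`observations`, `name`, `level`, `metadata`, `startTime`, etc.) mirror Langfuse's trace shape but should be treated as assumptions:

```python
EXPECTED_NODES = {"router", "research", "write", "edit"}
REQUIRED_METADATA = {"task_id", "frequency", "callback_url"}


def check_architecture(trace: dict) -> dict:
    """Return PASS/FAIL plus the specific gaps found in one trace."""
    observations = trace.get("observations", [])
    node_names = {o.get("name") for o in observations}
    missing_nodes = sorted(EXPECTED_NODES - node_names)
    missing_metadata = sorted(REQUIRED_METADATA - set(trace.get("metadata", {})))
    errors = [o["name"] for o in observations if o.get("level") == "ERROR"]
    # every node must carry start/end timestamps and recorded input/output
    incomplete = [o["name"] for o in observations
                  if not all(o.get(k) for k in ("startTime", "endTime", "input", "output"))]
    passed = not (missing_nodes or missing_metadata or errors or incomplete)
    return {"status": "PASS" if passed else "FAIL",
            "missing_nodes": missing_nodes,
            "missing_metadata": missing_metadata,
            "errors": errors,
            "incomplete_observations": incomplete}
```

Returning the specific gaps (not just a boolean) is what makes the "actionable recommendations" in the report possible.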
### Quality Assessment
Output Structure:
- ✓ sections: Array with 2+ sections
- ✓ citations: Array with 3-10 citations
- ✓ metadata: evidence_count, strategy_slug present
Content Quality:
- ✓ Sections are substantive (>100 words each)
- ✓ Citations have title, url, snippet
- ✓ No placeholder text ("TBD", "TODO")
Performance:
- ✓ Total latency < 90s (warning threshold)
- ✓ Per-node latency reasonable
- ✓ No timeout errors
## Example Usage

```shell
# Full workflow example
cd .claude/skills/batch-execution-validator/helpers

# Step 1: Trigger batch
python3 api_client.py \
  --api-url https://research-agent-api.replit.app \
  --frequency daily \
  --wait 180

# Output shows:
# Batch triggered: 5 tasks found
# Started at: 2025-11-07T14:30:00Z
# Task IDs: abc-123, def-456, ghi-789, jkl-012, mno-345

# Step 2: Fetch and analyze traces (using output from step 1)
python3 trace_fetcher.py \
  --from-timestamp "2025-11-07T14:30:00Z" \
  --tags batch_execution daily \
  --session-ids "abc-123,def-456,ghi-789,jkl-012,mno-345" \
  --output /tmp/batch_validation_results.json

# Output shows:
# Retrieved 5 traces
# Architecture: 5/5 PASS
# Quality: 4 HIGH, 1 MEDIUM
# Issues: 2 warnings
# Report saved to: /tmp/batch_validation_results.json
```
## Output Format

`trace_fetcher.py` generates a JSON report with:

```json
{
  "execution_metadata": {
    "triggered_at": "2025-11-07T14:30:00Z",
    "frequency": "daily",
    "tasks_found": 5
  },
  "traces": [
    {
      "trace_id": "abc-123",
      "user_id": "test@example.com",
      "research_topic": "Latest AI developments",
      "architecture": {
        "status": "PASS",
        "nodes_found": ["router", "research", "write", "edit"],
        "metadata_complete": true,
        "errors": []
      },
      "quality": {
        "status": "HIGH",
        "sections_count": 4,
        "citations_count": 7,
        "avg_section_words": 185,
        "total_latency_ms": 48200,
        "issues": []
      }
    }
  ],
  "summary": {
    "total_traces": 5,
    "architecture_pass": 5,
    "architecture_fail": 0,
    "quality_high": 4,
    "quality_medium": 1,
    "quality_low": 0,
    "warnings": 2,
    "errors": 0
  },
  "recommendations": [
    "All traces passed architecture validation",
    "Quality is consistently high (4/5 HIGH)",
    "Warning: Trace ghi-789 has only 2 citations (expected 3-10)"
  ]
}
```
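The `summary` block is a straightforward roll-up of the per-trace verdicts; a sketch against the report shape shown above:

```python
def summarize(traces: list) -> dict:
    """Aggregate per-trace verdicts into the report's summary counters."""
    def count(section: str, status: str) -> int:
        return sum(1 for t in traces if t[section]["status"] == status)

    return {
        "total_traces": len(traces),
        "architecture_pass": count("architecture", "PASS"),
        "architecture_fail": count("architecture", "FAIL"),
        "quality_high": count("quality", "HIGH"),
        "quality_medium": count("quality", "MEDIUM"),
        "quality_low": count("quality", "LOW"),
    }
```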
## Interpreting Results

### Architecture Status
- PASS: All expected nodes present, no errors, metadata complete
- FAIL: Missing nodes, errors, incomplete hierarchy
### Quality Status
- HIGH: 3+ sections, 5-10 citations, >150 words/section, <60s latency
- MEDIUM: 2-3 sections, 3-5 citations, >100 words/section, <90s latency
- LOW: Incomplete sections, few citations, thin content, slow
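These thresholds translate directly into a classifier. A sketch follows; the exact cutoffs inside `trace_fetcher.py` may differ:

```python
def classify_quality(sections: int, citations: int,
                     avg_section_words: float, latency_ms: int) -> str:
    """Map one trace's metrics onto the doc's HIGH/MEDIUM/LOW bands."""
    if (sections >= 3 and 5 <= citations <= 10
            and avg_section_words > 150 and latency_ms < 60_000):
        return "HIGH"
    if (sections >= 2 and citations >= 3
            and avg_section_words > 100 and latency_ms < 90_000):
        return "MEDIUM"
    return "LOW"
```

For example, the sample trace in the report (4 sections, 7 citations, 185 words/section, 48.2s) lands in HIGH.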
## Common Issues
Architecture Issues:
- Missing nodes: Check if node was skipped or crashed
- ERROR observations: Review node logs and error messages
- Incomplete metadata: Check API payload and tracing setup
Quality Issues:
- Low citation count: Research node may have failed or returned poor results
- Thin content: Write node may need prompt tuning
- Slow performance: Identify bottleneck node (research usually)
## Tips
- Run during low traffic: Batch execution uses production resources
- Use realistic test data: Create test subscriptions with diverse topics
- Validate after changes: Run this skill after any deployment
- Monitor trends: Compare results over time to detect regressions
- Check callback logs: Ensure webhooks are being delivered
## Troubleshooting

"No tasks found for frequency":

- Create test subscriptions: `POST /tasks` with `frequency="daily"`
- Verify subscriptions are active: `GET /tasks?email=test@example.com`
"No traces retrieved":
- Increase wait time (may need >3min for multiple tasks)
- Check Langfuse credentials are correct
- Verify traces have correct tags
"Architecture validation fails":
- Check API logs for node execution errors
- Review Langfuse trace details manually
- Validate LangGraph configuration
"Quality is LOW":
- Check research node is returning evidence
- Validate write node prompts
- Review LLM responses in trace observations
## Next Steps After Validation
- If PASS: System is healthy, ready for optimization
- If architecture issues: Fix tracing, node execution, or configuration
- If quality issues: Tune prompts, improve research, optimize nodes
- Optimization: Use langfuse-optimization skill to analyze specific issues
Remember: This skill is for validation, not optimization. Use it to confirm the pipeline works end-to-end, then use specialized skills for tuning individual components.