# langfuse-optimization

Analyzes writing-ecosystem traces to fix style.yaml, template.yaml, and tools.yaml based on quality issues found in production runs.

## Installation

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or copy just this skill into your Claude skills directory:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/langfuse-optimization" ~/.claude/skills/majiayu000-claude-skill-registry-langfuse-optimization && rm -rf "$T"
```

---

`skills/data/langfuse-optimization/SKILL.md`

# Writing Ecosystem Config Optimizer

Analyzes Langfuse traces to identify what's wrong with your style.yaml, template.yaml, and tools.yaml files, then tells you exactly how to fix them.
## When to Use This Skill
- "Analyze traces and fix my config files"
- "My checks are failing - what's wrong with style.yaml?"
- "Optimize case 0001 configuration"
- "Why is the research node selecting wrong tools?"
## Required Environment Variables

- `LANGFUSE_PUBLIC_KEY`: Your Langfuse public API key
- `LANGFUSE_SECRET_KEY`: Your Langfuse secret API key
- `LANGFUSE_HOST`: Langfuse host URL (default: https://cloud.langfuse.com)
## What This Skill Does

**Input**: User request + case ID
**Output**: Specific fixes for style.yaml, template.yaml, tools.yaml

**3-Step Process**:
- Retrieve traces from Langfuse for specified case
- Extract problems from trace data (check failures, tool errors, structure issues)
- Generate fixes with exact YAML changes to make
## Workflow

### Step 1: Get User Request & Case ID
Ask for:
- Case ID (e.g., "0001", "0002", "The Prep")
- Time range (default: last 7 days)
- Specific focus (optional: "just style checks", "just tools", "everything")
### Step 2: Retrieve Trace Data

#### Option A: Unified Retrieval (Recommended - Simpler)
Use the unified helper to get traces and observations in one command:
```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Get last 5 traces with observations for a case (using tags - RECOMMENDED)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 5 \
  --tags "case:0001" \
  --output /tmp/langfuse_analysis/bundle.json

# Filter by metadata (e.g., specific case_id)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 3 \
  --metadata case_id=0001 \
  --output /tmp/langfuse_analysis/case_0001_bundle.json

# Get traces only (skip observations for faster retrieval)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 10 \
  --no-observations \
  --output /tmp/langfuse_analysis/traces_only.json

# Save separate files + unified bundle
python3 helpers/retrieve_traces_and_observations.py \
  --limit 5 \
  --output /tmp/langfuse_analysis/bundle.json \
  --traces-output /tmp/langfuse_analysis/traces.json \
  --observations-output /tmp/langfuse_analysis/observations.json

# RECOMMENDED: Strip bloat for 95% size reduction
python3 helpers/retrieve_traces_and_observations.py \
  --tags "case:0001" \
  --limit 1 \
  --filter-essential \
  --output /tmp/langfuse_analysis/filtered_bundle.json
```
**Output**: Single JSON bundle with:
- Query parameters (for reproducibility)
- Traces list
- Observations grouped by trace_id
- Trace count and IDs
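As a sketch of how a downstream step might consume this bundle, the snippet below loads it and prints a quick inventory. The exact key names (`traces`, `observations`, `id`) follow the output description above but are illustrative - adjust them to the helper's actual JSON layout.

```python
import json

def summarize_bundle(path):
    """Load a retrieval bundle and print an inventory of its contents.

    Assumes the bundle layout described above: a traces list plus
    observations grouped by trace_id (field names are illustrative).
    """
    with open(path) as f:
        bundle = json.load(f)
    traces = bundle.get("traces", [])
    obs_by_trace = bundle.get("observations", {})
    print(f"{len(traces)} traces retrieved")
    for trace in traces:
        tid = trace.get("id", "?")
        n_obs = len(obs_by_trace.get(tid, []))
        print(f"  {tid}: {n_obs} observations")
    return traces
```

Running this before any deeper analysis is a cheap sanity check that the retrieval actually matched traces.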
**Size Optimization Flags**:

`--filter-essential` (Config Optimization):
- Strips: `facts_pack` (391KB+), `validation_report` (45KB+), long text fields
- Replaces with compact summaries (facts count, size, failed checks)
- Reduction: ~95% (4.2MB → 200KB)
- Use case: Analyzing style.yaml, template.yaml, tools.yaml

`--filter-research-details` (Additional Reduction):
- Strips: `structured_citations` (34KB → 700B), `step_status` (8KB → 200B)
- Replaces with counts, domains, tools used, success/failure stats
- Reduction: ~70% additional (on top of essential)
- Use case: When citation URLs and detailed step logs are not needed

`--filter-all` (Maximum Reduction):
- Convenience flag: enables both `--filter-essential` + `--filter-research-details`
- Total reduction: ~96% (4.2MB → 30KB per trace)
- Use case: Large-scale trace collection, config optimization

**Comparison**:
- Without filtering: 4.2MB per trace (slow, all raw data)
- With `--filter-essential`: 200KB per trace (fast, config analysis)
- With `--filter-all`: 30KB per trace (fastest, minimal size)
#### Option A.1: Single Trace Retrieval (Fastest for Individual Analysis)
When you know the exact trace ID you want to analyze, use the single trace helper:
```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Essential filtering only (95% reduction)
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 --filter-essential

# Maximum filtering (96% reduction) - RECOMMENDED for most cases
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 --filter-all

# Essential + Research details (custom combination)
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 \
  --filter-essential --filter-research-details \
  --output /tmp/langfuse_analysis/single_trace.json

# Without filtering (keep all raw data)
python3 helpers/retrieve_single_trace.py abc123 --output /tmp/langfuse_analysis/trace.json
```
**Benefits over multi-trace retrieval**:
- **10-20x faster**: Only fetches one trace instead of all traces for a case
- **Lower API usage**: Fewer API calls, less rate limiting
- **Cleaner workflow**: No need for client-side extraction
- **Same structure**: Output is identical to `retrieve_traces_and_observations.py`

**Output**: Same bundle structure as Option A (compatible with all analysis tools)
When to use:
- Analyzing a specific trace ID from Langfuse dashboard
- Deep-diving into one workflow run
- Following up on a specific error or issue
- Comparing before/after changes to config
#### Option B: Two-Step Retrieval (Advanced - More Control)
For scenarios where you need separate retrieval stages:
```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Step 1: Get traces for a specific case (using tags)
python3 helpers/retrieve_traces.py \
  --tags "case:0001" \
  --days 7 \
  --limit 10 \
  --output /tmp/langfuse_analysis/traces.json

# Step 2: Get observations for those traces
python3 helpers/retrieve_observations.py \
  --trace-ids-file /tmp/langfuse_analysis/traces.json \
  --output /tmp/langfuse_analysis/observations.json

# Step 2 (with filtering): Strip bloat for 95% size reduction
python3 helpers/retrieve_observations.py \
  --trace-ids-file /tmp/langfuse_analysis/traces.json \
  --filter-essential \
  --output /tmp/langfuse_analysis/filtered_observations.json
```
### Step 2B: Retrieve Annotation Queue Data (Optional)
If you have human annotations/feedback in Langfuse annotation queues:
```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Get all annotated items from a queue
python3 helpers/retrieve_annotations.py \
  --queue-id <your_queue_id> \
  --output /tmp/langfuse_analysis/annotations.json

# Get only completed annotations (reviewed items)
python3 helpers/retrieve_annotations.py \
  --queue-id <your_queue_id> \
  --status completed \
  --output /tmp/langfuse_analysis/annotations.json

# Limit to recent 100 items
python3 helpers/retrieve_annotations.py \
  --queue-id <your_queue_id> \
  --limit 100
```
What you get:
- Human comments/notes on traces
- Manual scores assigned by reviewers
- Issues flagged during quality review
- Trace IDs linked to annotations
How to use in analysis:
- Cross-reference annotation comments with trace data
- Identify patterns in human-flagged issues
- Prioritize fixes based on manual feedback frequency
- Validate if automated checks catch the same issues humans flag
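The cross-referencing step above can be sketched as a simple join on trace IDs. Field names (`trace_id`, `comment`, `id`) are assumptions about the exported JSON, not a guaranteed schema:

```python
from collections import Counter

def cross_reference(annotations, traces):
    """Join human annotation comments onto their traces and rank flagged issues.

    annotations: list of dicts with at least trace_id and comment.
    traces: list of trace dicts with at least an id field.
    Field names are illustrative - adjust to your exported JSON.
    """
    traces_by_id = {t["id"]: t for t in traces}
    issue_counts = Counter()
    joined = []
    for ann in annotations:
        trace = traces_by_id.get(ann.get("trace_id"))
        if trace is None:
            continue  # annotation points at a trace outside this bundle
        issue_counts[ann.get("comment", "")] += 1
        joined.append({"trace": trace["id"], "comment": ann.get("comment")})
    # Most frequently flagged issues first -> prioritize those fixes
    return joined, issue_counts.most_common()
```

The ranked counts give you the "fix frequency" ordering directly, so the most common human complaint becomes Fix #1.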
### Step 2.5: Using Metadata Filters
Filter traces by metadata fields to focus analysis on specific subsets:
```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Single metadata filter - analyze specific case
python3 helpers/retrieve_traces_and_observations.py \
  --metadata case_id=0001 \
  --limit 10 \
  --output /tmp/langfuse_analysis/case_0001.json

# Multiple filters (AND logic - trace must match ALL)
python3 helpers/retrieve_traces_and_observations.py \
  --metadata case_id=0001 profile_name="Stock Deep Dive" \
  --limit 5 \
  --output /tmp/langfuse_analysis/filtered.json

# Use dot notation for nested metadata (if applicable)
python3 helpers/retrieve_traces_and_observations.py \
  --metadata workflow_version=1 \
  --output /tmp/langfuse_analysis/v1_workflows.json
```
How it works:
- Retrieves all traces from Langfuse within time range
- Applies client-side filtering by metadata fields
- Returns only traces matching ALL specified filters
- Limit applied AFTER filtering (ensures you get requested number of matching traces)
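A minimal sketch of the client-side filtering logic described above (the helper's actual implementation may differ; this just makes the AND / exact-match / dot-notation semantics concrete):

```python
def matches_metadata(trace, filters):
    """AND-match a trace's metadata against key=value filters, client-side.

    Exact, case-sensitive string comparison on every filter;
    dot notation in a key reaches into nested metadata dicts.
    """
    metadata = trace.get("metadata") or {}
    for key, expected in filters.items():
        value = metadata
        for part in key.split("."):  # dot notation for nested fields
            if not isinstance(value, dict) or part not in value:
                return False
            value = value[part]
        if str(value) != expected:
            return False
    return True

def filter_traces(traces, filters, limit=None):
    """Return traces matching ALL filters; limit applied after filtering."""
    matched = [t for t in traces if matches_metadata(t, filters)]
    return matched[:limit] if limit else matched
```

Note the `str(value)` coercion: metadata stored as numbers still matches the string you pass on the command line, under this assumed behavior.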
**Common Use Cases**:
- Analyze specific case: `--metadata case_id=0001`
- Compare workflow versions: `--metadata workflow_version=1` vs `--metadata workflow_version=2`
- Profile-specific issues: `--metadata profile_name="The Prep"`
- Combine filters: `--metadata case_id=0001 workflow_version=2` (both must match)

**Tips**:
- Metadata values are case-sensitive strings
- Use exact matches only (no wildcards/regex)
- Check available metadata: run without filter first, inspect trace metadata
- Common fields: `case_id`, `profile_name`, `workflow_version`
### Step 3: Extract Problems from Traces

Read `/tmp/langfuse_analysis/bundle.json` (or `observations.json` if using two-step retrieval) and extract:
#### A. Style Check Failures (for style.yaml)
From edit node observations, find:
- Which checks failed
- Failure rates (how often each check fails)
- Scores vs thresholds
- Example content that failed
Map to style.yaml issues:
- Vague rubric: Check description unclear, LLM can't grade consistently
- Wrong threshold: Check fails too often (>30%) or never fails
- Missing check: Quality issue exists but no check catches it
- Wrong weight: Check importance (MINOR/MAJOR/CRITICAL) doesn't match impact
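The failure-rate extraction and the mapping rules above can be sketched as follows. The observation shape (`output.checks` with `name`/`passed` fields) is an assumption about how the edit node logs its results - adapt it to your actual trace payloads:

```python
from collections import defaultdict

def check_failure_rates(edit_observations):
    """Compute per-check failure rates from edit-node observations.

    Assumes each observation's output carries a list of check results like
    {"name": ..., "passed": ...} - illustrative field names.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for obs in edit_observations:
        for check in obs.get("output", {}).get("checks", []):
            totals[check["name"]] += 1
            if not check.get("passed", True):
                failures[check["name"]] += 1
    rates = {name: failures[name] / totals[name] for name in totals}
    # Flag the style.yaml issue classes described above
    flags = {}
    for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
        if rate > 0.30:
            flags[name] = "vague rubric or wrong threshold"
        elif rate == 0.0:
            flags[name] = "threshold may be too loose"
    return rates, flags
```

The two flag branches encode the ">30% fails" and "never fails" heuristics directly, so the output maps each check name to a candidate style.yaml fix.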
#### B. Template Problems (for template.yaml)
From write node observations, find:
- Missing required sections
- Word count violations
- Structure mismatches (bullets vs narrative)
Map to template.yaml issues:
- Unclear section descriptions
- Unrealistic word limits
- Missing section definitions
#### C. Tool Selection Issues (for tools.yaml)
From research node observations, find:
- Which tools were selected
- Tool failures (API errors, timeouts)
- Wrong tool for topic (should have used X but used Y)
- Loop expansion failures (`for_each` errors)
Map to tools.yaml issues:
- Tool not available in pattern
- Wrong research pattern selected
- Loop directive path incorrect
- Missing fallback configuration
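A small sketch of the tool-issue tally this step describes. The per-observation fields (`tool`, `status`, `error`) are assumed names for whatever your research node records:

```python
from collections import Counter

def tool_issue_summary(research_observations):
    """Tally tool selections and failures from research-node observations.

    Assumes observations carry {"tool": ..., "status": ..., "error": ...} -
    illustrative field names, adjust to your traces.
    """
    selected = Counter()
    errors = Counter()
    for obs in research_observations:
        tool = obs.get("tool")
        if not tool:
            continue
        selected[tool] += 1
        if obs.get("status") == "error":
            # Group by (tool, error) so repeated for_each path errors stand out
            errors[(tool, obs.get("error", "unknown"))] += 1
    return selected, errors
```

If `selected` is missing an expected domain tool entirely (e.g. no `finnhub` on a financial case), that points at a tools.yaml pattern fix rather than a runtime failure.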
### Step 4: Generate Config Fixes
For each problem, create a recommendation:
````markdown
## Fix #N: [Problem description]

**File**: `writing_ecosystem/config/cases/XXXX/[style|template|tools].yaml`

**Problem**:
- [Specific issue found in traces]
- [Evidence: X failures in Y traces]

**Current Config**:
```yaml
[Show current YAML]
```

**Fixed Config**:
```yaml
[Show corrected YAML with inline comments explaining changes]
```

**Why this fixes it**:
- [Explanation of root cause]
- [Expected improvement]
````
### Step 5: Present Simple Report

````markdown
# Config Optimization Report - Case XXXX

**Traces Analyzed**: X traces from [date range]

---

## Problems Found

### style.yaml Issues
1. ❌ `tone_consistency` check failing 30% (vague rubric)
2. ❌ `ttr_constraint` threshold too strict (16% failures)
3. ⚠️ `formality` check never fails (threshold too loose)

### template.yaml Issues
1. ❌ "Context" section missing description
2. ❌ Word limit conflict: max 100 words but needs 5 bullets

### tools.yaml Issues
1. ❌ Research pattern missing `finnhub` for financial topics
2. ❌ Loop directive path wrong: `user.portfolio.symbols` (should be `user.portfolio.summary.symbols`)

---

## Recommended Fixes

### Fix #1: Improve tone_consistency Rubric (style.yaml)

**Problem**: Failing 30% of traces - rubric too vague

**Current**:
```yaml
signatures:
  tone_consistency:
    rubric: "Assess whether tone is consistent. Score 1-10."
    threshold: 7.0
```

**Fixed**:
```yaml
signatures:
  tone_consistency:
    rubric: |
      Check tone consistency across:
      1. FORMALITY: Professional terms only (not "pretty big", "kinda")
      2. OBJECTIVITY: Neutral facts (not "shocked markets")
      3. EXPERTISE: Assumes financial literacy
      Score 9-10: Perfect consistency
      Score 7-8: 1-2 minor lapses
      Score 5-6: Noticeable shifts
      Score <5: Multiple violations
    threshold: 7.0
```

**Why**: Specific dimensions + examples → LLM can grade consistently

### Fix #2: Lower TTR Threshold (style.yaml)

**Problem**: Failing 16% - too strict for financial jargon

**Current**:
```yaml
constraints:
  ttr_constraint:
    threshold: 0.55
```

**Fixed**:
```yaml
constraints:
  ttr_constraint:
    threshold: 0.50  # Financial terms naturally repeat
```

**Why**: Domain terminology (Fed, QE, yield curve) lowers lexical diversity

### Fix #3: Add Finnhub to Research Pattern (tools.yaml)

**Problem**: Financial topics not getting market data

**Current**:
```yaml
research_patterns:
  default: general_research
  patterns:
    general_research:
      steps:
        - tool: perplexity
```

**Fixed**:
```yaml
research_patterns:
  default: financial_research  # Changed default for case 0001
  patterns:
    financial_research:
      steps:
        - tool: perplexity
          save_as: news
        - tool: finnhub  # Added for market data
          input:
            endpoint: company_news
            symbol: "{{topic}}"  # Extract symbol from topic
          save_as: market_data
```

**Why**: Financial topics need both news (perplexity) + data (finnhub)
````
## Implementation

**1. Backup configs**:

```bash
cd writing_ecosystem/config/cases/0001
cp style.yaml style.yaml.backup
cp template.yaml template.yaml.backup
cp tools.yaml tools.yaml.backup
```

**2. Apply fixes**:
- Open each file in editor
- Apply changes from recommendations above
- Save files

**3. Test**:

```bash
python run_workflow.py --case 0001 --topic "Test topic"
# Check Langfuse trace for improvements
```

**4. Monitor**:
- Run 20-30 workflows
- Re-run this analysis
- Compare before/after failure rates
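The before/after comparison in the monitoring step can be sketched as a diff of two failure-rate dicts (e.g. as produced by any per-check analysis of the two trace batches; the function itself is a hypothetical helper, not part of the skill):

```python
def compare_runs(before_rates, after_rates):
    """Compare per-check failure rates from two analysis runs.

    Inputs are {check_name: failure_rate} dicts computed before and after
    applying the YAML fixes; returns deltas plus improved/regressed splits.
    """
    deltas = {}
    for name in sorted(set(before_rates) | set(after_rates)):
        before = before_rates.get(name, 0.0)
        after = after_rates.get(name, 0.0)
        deltas[name] = round(after - before, 4)
    improved = {n: d for n, d in deltas.items() if d < 0}
    regressed = {n: d for n, d in deltas.items() if d > 0}
    return deltas, improved, regressed
```

A non-empty `regressed` dict after a config change is the signal to roll back from the `.backup` files created in step 1.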
## Expected Results

- `tone_consistency` failures: 30% → ~15%
- `ttr_constraint` failures: 16% → ~8%
- Research quality: +20% (adding finnhub)
- Overall pre-flight score: 7.8 → 8.2

Ready to implement? Let me know which fixes to apply first, or if you want to see more detail on any issue.
## Analysis Patterns

### For style.yaml Issues

**Look for**:
1. **High failure rate** (>30%) → Vague rubric or wrong threshold
2. **Zero failures** → Threshold too loose or check not working
3. **Inconsistent scores** → Rubric needs examples and clear criteria
4. **Low edit fix rate** (<50%) → Check unclear about what to fix

**Common fixes**:
- Add specific dimensions to rubrics
- Provide good/bad examples
- Adjust thresholds based on domain (finance vs tech vs general)
- Add deterministic pre-checks for obvious violations

### For template.yaml Issues

**Look for**:
1. **Missing sections** in write node output
2. **Word count violations** (consistent over/under)
3. **Structure mismatches** (bullets vs narrative)

**Common fixes**:
- Add clear section descriptions
- Adjust word limits to realistic values
- Clarify format requirements (when to use bullets vs prose)

### For tools.yaml Issues

**Look for**:
1. **Wrong tool selected** for topic type
2. **Missing tools** for domain (finance needs finnhub)
3. **Loop expansion failures** (path errors in `for_each`)
4. **Tool errors** (API failures, timeouts)

**Common fixes**:
- Add domain-specific tools to patterns
- Fix loop directive paths
- Add fallback patterns
- Update default pattern for case

## Key Principles

### 1. Evidence-Based

Every recommendation must show:
- How many traces failed
- Example content that failed
- Why current config caused the failure

### 2. Specific

No generic advice like "improve rubric" - show EXACT YAML changes with inline comments.

### 3. Prioritized

Focus on:
- High-frequency issues first (affects >30% of traces)
- Quick wins (threshold adjustments)
- High-impact changes (missing tools for domain)

### 4. Actionable

Every fix includes:
- Exact file path
- Before/after YAML
- Expected improvement
- How to test

## Troubleshooting

**"No traces found"**:
- Verify case ID is correct
- Check trace naming: `writing-workflow-0001` vs `writing-workflow-001`
- Try broader: `--name "writing-workflow"` to see all cases

**"No check failures in traces"**:
- Workflow may be in fallback mode (no LLM)
- Edit node may have been skipped (pre-flight score >8.5)
- Verify edit node ran in observations

**"Can't identify issue"**:
- Read the actual style.yaml/template.yaml/tools.yaml files
- Compare trace output to config requirements
- Look for mismatches

**"Metadata filter returning no traces"**:
- Verify metadata fields exist in your traces (check raw trace JSON)
- Metadata values are case-sensitive strings
- Use exact matches only (no wildcards/regex)
- Try without metadata filter first to see available metadata fields
- Common fields: `case_id`, `profile_name`, `workflow_version`

## Success Criteria

Good recommendations should:
1. ✅ Show exact YAML before/after
2. ✅ Explain WHY issue occurred (root cause)
3. ✅ Quantify impact (X% failure rate → Y% expected)
4. ✅ Be implementable in <5 min per fix
5. ✅ Focus on top 3-5 issues (not 50 minor ones)

---

**Remember**: This skill is about **fixing config files**, not analyzing architecture. Keep it simple:
1. What's broken in the YAML?
2. Here's the fix
3. Here's why it works