# langfuse-optimization

Analyzes writing-ecosystem traces to fix style.yaml, template.yaml, and tools.yaml based on quality issues found in production runs.

## Installation

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or copy just this skill into your Claude skills directory:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/langfuse-optimization" ~/.claude/skills/majiayu000-claude-skill-registry-langfuse-optimization && rm -rf "$T"
```

---

`skills/data/langfuse-optimization/SKILL.md`

# Writing Ecosystem Config Optimizer

Analyzes Langfuse traces to identify what's wrong with your style.yaml, template.yaml, and tools.yaml files, then tells you exactly how to fix them.
## When to Use This Skill
- "Analyze traces and fix my config files"
- "My checks are failing - what's wrong with style.yaml?"
- "Optimize case 0001 configuration"
- "Why is the research node selecting wrong tools?"
## Required Environment Variables

- `LANGFUSE_PUBLIC_KEY`: Your Langfuse public API key
- `LANGFUSE_SECRET_KEY`: Your Langfuse secret API key
- `LANGFUSE_HOST`: Langfuse host URL (default: https://cloud.langfuse.com)
## What This Skill Does

**Input**: User request + case ID
**Output**: Specific fixes for style.yaml, template.yaml, tools.yaml

**3-Step Process**:
- Retrieve traces from Langfuse for specified case
- Extract problems from trace data (check failures, tool errors, structure issues)
- Generate fixes with exact YAML changes to make
## Workflow

### Step 1: Get User Request & Case ID
Ask for:
- Case ID (e.g., "0001", "0002", "The Prep")
- Time range (default: last 7 days)
- Specific focus (optional: "just style checks", "just tools", "everything")
### Step 2: Retrieve Trace Data

#### Option A: Unified Retrieval (Recommended - Simpler)
Use the unified helper to get traces and observations in one command:
```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Get last 5 traces with observations for a case (using tags - RECOMMENDED)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 5 \
  --tags "case:0001" \
  --output /tmp/langfuse_analysis/bundle.json

# Filter by metadata (e.g., specific case_id)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 3 \
  --metadata case_id=0001 \
  --output /tmp/langfuse_analysis/case_0001_bundle.json

# Get traces only (skip observations for faster retrieval)
python3 helpers/retrieve_traces_and_observations.py \
  --limit 10 \
  --no-observations \
  --output /tmp/langfuse_analysis/traces_only.json

# Save separate files + unified bundle
python3 helpers/retrieve_traces_and_observations.py \
  --limit 5 \
  --output /tmp/langfuse_analysis/bundle.json \
  --traces-output /tmp/langfuse_analysis/traces.json \
  --observations-output /tmp/langfuse_analysis/observations.json

# RECOMMENDED: Strip bloat for 95% size reduction
python3 helpers/retrieve_traces_and_observations.py \
  --tags "case:0001" \
  --limit 1 \
  --filter-essential \
  --output /tmp/langfuse_analysis/filtered_bundle.json
```
**Output**: Single JSON bundle with:
- Query parameters (for reproducibility)
- Traces list
- Observations grouped by trace_id
- Trace count and IDs
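As a sketch of how a downstream step might consume this bundle, the snippet below loads it and prints a quick inventory. The exact key names (`traces`, `observations`, `id`) follow the output description above but are illustrative - adjust them to the helper's actual JSON layout.

```python
import json

def summarize_bundle(path):
    """Load a retrieval bundle and print an inventory of its contents.

    Assumes the bundle layout described above: a traces list plus
    observations grouped by trace_id (field names are illustrative).
    """
    with open(path) as f:
        bundle = json.load(f)
    traces = bundle.get("traces", [])
    obs_by_trace = bundle.get("observations", {})
    print(f"{len(traces)} traces retrieved")
    for trace in traces:
        tid = trace.get("id", "?")
        n_obs = len(obs_by_trace.get(tid, []))
        print(f"  {tid}: {n_obs} observations")
    return traces
```

Running this before any deeper analysis is a cheap sanity check that the retrieval actually matched traces.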
**Size Optimization Flags**:

`--filter-essential` (Config Optimization):
- Strips: `facts_pack` (391KB+), `validation_report` (45KB+), long text fields
- Replaces with compact summaries (facts count, size, failed checks)
- Reduction: ~95% (4.2MB → 200KB)
- Use case: Analyzing style.yaml, template.yaml, tools.yaml

`--filter-research-details` (Additional Reduction):
- Strips: `structured_citations` (34KB → 700B), `step_status` (8KB → 200B)
- Replaces with counts, domains, tools used, success/failure stats
- Reduction: ~70% additional (on top of essential)
- Use case: When citation URLs and detailed step logs are not needed

`--filter-all` (Maximum Reduction):
- Convenience flag: enables both `--filter-essential` + `--filter-research-details`
- Total reduction: ~96% (4.2MB → 30KB per trace)
- Use case: Large-scale trace collection, config optimization

**Comparison**:
- Without filtering: 4.2MB per trace (slow, all raw data)
- With `--filter-essential`: 200KB per trace (fast, config analysis)
- With `--filter-all`: 30KB per trace (fastest, minimal size)
#### Option A.1: Single Trace Retrieval (Fastest for Individual Analysis)
When you know the exact trace ID you want to analyze, use the single trace helper:
```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Essential filtering only (95% reduction)
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 --filter-essential

# Maximum filtering (96% reduction) - RECOMMENDED for most cases
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 --filter-all

# Essential + Research details (custom combination)
python3 helpers/retrieve_single_trace.py 8fda46d7ac626327396d1a7962690807 \
  --filter-essential --filter-research-details \
  --output /tmp/langfuse_analysis/single_trace.json

# Without filtering (keep all raw data)
python3 helpers/retrieve_single_trace.py abc123 --output /tmp/langfuse_analysis/trace.json
```
**Benefits over multi-trace retrieval**:
- **10-20x faster**: Only fetches one trace instead of all traces for a case
- **Lower API usage**: Fewer API calls, less rate limiting
- **Cleaner workflow**: No need for client-side extraction
- **Same structure**: Output is identical to `retrieve_traces_and_observations.py`

**Output**: Same bundle structure as Option A (compatible with all analysis tools)
When to use:
- Analyzing a specific trace ID from Langfuse dashboard
- Deep-diving into one workflow run
- Following up on a specific error or issue
- Comparing before/after changes to config
#### Option B: Two-Step Retrieval (Advanced - More Control)
For scenarios where you need separate retrieval stages:
```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Step 1: Get traces for a specific case (using tags)
python3 helpers/retrieve_traces.py \
  --tags "case:0001" \
  --days 7 \
  --limit 10 \
  --output /tmp/langfuse_analysis/traces.json

# Step 2: Get observations for those traces
python3 helpers/retrieve_observations.py \
  --trace-ids-file /tmp/langfuse_analysis/traces.json \
  --output /tmp/langfuse_analysis/observations.json

# Step 2 (with filtering): Strip bloat for 95% size reduction
python3 helpers/retrieve_observations.py \
  --trace-ids-file /tmp/langfuse_analysis/traces.json \
  --filter-essential \
  --output /tmp/langfuse_analysis/filtered_observations.json
```
### Step 2B: Retrieve Annotation Queue Data (Optional)
If you have human annotations/feedback in Langfuse annotation queues:
```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Get all annotated items from a queue
python3 helpers/retrieve_annotations.py \
  --queue-id <your_queue_id> \
  --output /tmp/langfuse_analysis/annotations.json

# Get only completed annotations (reviewed items)
python3 helpers/retrieve_annotations.py \
  --queue-id <your_queue_id> \
  --status completed \
  --output /tmp/langfuse_analysis/annotations.json

# Limit to recent 100 items
python3 helpers/retrieve_annotations.py \
  --queue-id <your_queue_id> \
  --limit 100
```
What you get:
- Human comments/notes on traces
- Manual scores assigned by reviewers
- Issues flagged during quality review
- Trace IDs linked to annotations
How to use in analysis:
- Cross-reference annotation comments with trace data
- Identify patterns in human-flagged issues
- Prioritize fixes based on manual feedback frequency
- Validate if automated checks catch the same issues humans flag
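The cross-referencing step above can be sketched as a simple join on trace IDs. Field names (`trace_id`, `comment`, `id`) are assumptions about the exported JSON, not a guaranteed schema:

```python
from collections import Counter

def cross_reference(annotations, traces):
    """Join human annotation comments onto their traces and rank flagged issues.

    annotations: list of dicts with at least trace_id and comment.
    traces: list of trace dicts with at least an id field.
    Field names are illustrative - adjust to your exported JSON.
    """
    traces_by_id = {t["id"]: t for t in traces}
    issue_counts = Counter()
    joined = []
    for ann in annotations:
        trace = traces_by_id.get(ann.get("trace_id"))
        if trace is None:
            continue  # annotation points at a trace outside this bundle
        issue_counts[ann.get("comment", "")] += 1
        joined.append({"trace": trace["id"], "comment": ann.get("comment")})
    # Most frequently flagged issues first -> prioritize those fixes
    return joined, issue_counts.most_common()
```

The ranked counts give you the "fix frequency" ordering directly, so the most common human complaint becomes Fix #1.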
### Step 2.5: Using Metadata Filters
Filter traces by metadata fields to focus analysis on specific subsets:
```bash
cd /home/runner/workspace/.claude/skills/langfuse-optimization

# Single metadata filter - analyze specific case
python3 helpers/retrieve_traces_and_observations.py \
  --metadata case_id=0001 \
  --limit 10 \
  --output /tmp/langfuse_analysis/case_0001.json

# Multiple filters (AND logic - trace must match ALL)
python3 helpers/retrieve_traces_and_observations.py \
  --metadata case_id=0001 profile_name="Stock Deep Dive" \
  --limit 5 \
  --output /tmp/langfuse_analysis/filtered.json

# Use dot notation for nested metadata (if applicable)
python3 helpers/retrieve_traces_and_observations.py \
  --metadata workflow_version=1 \
  --output /tmp/langfuse_analysis/v1_workflows.json
```
How it works:
- Retrieves all traces from Langfuse within time range
- Applies client-side filtering by metadata fields
- Returns only traces matching ALL specified filters
- Limit applied AFTER filtering (ensures you get requested number of matching traces)
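A minimal sketch of the client-side filtering logic described above (the helper's actual implementation may differ; this just makes the AND / exact-match / dot-notation semantics concrete):

```python
def matches_metadata(trace, filters):
    """AND-match a trace's metadata against key=value filters, client-side.

    Exact, case-sensitive string comparison on every filter;
    dot notation in a key reaches into nested metadata dicts.
    """
    metadata = trace.get("metadata") or {}
    for key, expected in filters.items():
        value = metadata
        for part in key.split("."):  # dot notation for nested fields
            if not isinstance(value, dict) or part not in value:
                return False
            value = value[part]
        if str(value) != expected:
            return False
    return True

def filter_traces(traces, filters, limit=None):
    """Return traces matching ALL filters; limit applied after filtering."""
    matched = [t for t in traces if matches_metadata(t, filters)]
    return matched[:limit] if limit else matched
```

Note the `str(value)` coercion: metadata stored as numbers still matches the string you pass on the command line, under this assumed behavior.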
**Common Use Cases**:
- Analyze specific case: `--metadata case_id=0001`
- Compare workflow versions: `--metadata workflow_version=1` vs `--metadata workflow_version=2`
- Profile-specific issues: `--metadata profile_name="The Prep"`
- Combine filters: `--metadata case_id=0001 workflow_version=2` (both must match)

**Tips**:
- Metadata values are case-sensitive strings
- Use exact matches only (no wildcards/regex)
- Check available metadata: run without filter first, inspect trace metadata
- Common fields: `case_id`, `profile_name`, `workflow_version`
### Step 3: Extract Problems from Traces

Read `/tmp/langfuse_analysis/bundle.json` (or `observations.json` if using two-step retrieval) and extract:
#### A. Style Check Failures (for style.yaml)
From edit node observations, find:
- Which checks failed
- Failure rates (how often each check fails)
- Scores vs thresholds
- Example content that failed
Map to style.yaml issues:
- Vague rubric: Check description unclear, LLM can't grade consistently
- Wrong threshold: Check fails too often (>30%) or never fails
- Missing check: Quality issue exists but no check catches it
- Wrong weight: Check importance (MINOR/MAJOR/CRITICAL) doesn't match impact
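The failure-rate extraction and the mapping rules above can be sketched as follows. The observation shape (`output.checks` with `name`/`passed` fields) is an assumption about how the edit node logs its results - adapt it to your actual trace payloads:

```python
from collections import defaultdict

def check_failure_rates(edit_observations):
    """Compute per-check failure rates from edit-node observations.

    Assumes each observation's output carries a list of check results like
    {"name": ..., "passed": ...} - illustrative field names.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for obs in edit_observations:
        for check in obs.get("output", {}).get("checks", []):
            totals[check["name"]] += 1
            if not check.get("passed", True):
                failures[check["name"]] += 1
    rates = {name: failures[name] / totals[name] for name in totals}
    # Flag the style.yaml issue classes described above
    flags = {}
    for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
        if rate > 0.30:
            flags[name] = "vague rubric or wrong threshold"
        elif rate == 0.0:
            flags[name] = "threshold may be too loose"
    return rates, flags
```

The two flag branches encode the ">30% fails" and "never fails" heuristics directly, so the output maps each check name to a candidate style.yaml fix.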
#### B. Template Problems (for template.yaml)
From write node observations, find:
- Missing required sections
- Word count violations
- Structure mismatches (bullets vs narrative)
Map to template.yaml issues:
- Unclear section descriptions
- Unrealistic word limits
- Missing section definitions
#### C. Tool Selection Issues (for tools.yaml)
From research node observations, find:
- Which tools were selected
- Tool failures (API errors, timeouts)
- Wrong tool for topic (should have used X but used Y)
- Loop expansion failures (`for_each` errors)
Map to tools.yaml issues:
- Tool not available in pattern
- Wrong research pattern selected
- Loop directive path incorrect
- Missing fallback configuration
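A small sketch of the tool-issue tally this step describes. The per-observation fields (`tool`, `status`, `error`) are assumed names for whatever your research node records:

```python
from collections import Counter

def tool_issue_summary(research_observations):
    """Tally tool selections and failures from research-node observations.

    Assumes observations carry {"tool": ..., "status": ..., "error": ...} -
    illustrative field names, adjust to your traces.
    """
    selected = Counter()
    errors = Counter()
    for obs in research_observations:
        tool = obs.get("tool")
        if not tool:
            continue
        selected[tool] += 1
        if obs.get("status") == "error":
            # Group by (tool, error) so repeated for_each path errors stand out
            errors[(tool, obs.get("error", "unknown"))] += 1
    return selected, errors
```

If `selected` is missing an expected domain tool entirely (e.g. no `finnhub` on a financial case), that points at a tools.yaml pattern fix rather than a runtime failure.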
### Step 4: Generate Config Fixes
For each problem, create a recommendation:
````markdown
## Fix #N: [Problem description]

**File**: `writing_ecosystem/config/cases/XXXX/[style|template|tools].yaml`

**Problem**:
- [Specific issue found in traces]
- [Evidence: X failures in Y traces]

**Current Config**:
```yaml
[Show current YAML]
```

**Fixed Config**:
```yaml
[Show corrected YAML with inline comments explaining changes]
```

**Why this fixes it**:
- [Explanation of root cause]
- [Expected improvement]
````
### Step 5: Present Simple Report

````markdown
# Config Optimization Report - Case XXXX

**Traces Analyzed**: X traces from [date range]

---

## Problems Found

### style.yaml Issues
1. ❌ `tone_consistency` check failing 30% (vague rubric)
2. ❌ `ttr_constraint` threshold too strict (16% failures)
3. ⚠️ `formality` check never fails (threshold too loose)

### template.yaml Issues
1. ❌ "Context" section missing description
2. ❌ Word limit conflict: max 100 words but needs 5 bullets

### tools.yaml Issues
1. ❌ Research pattern missing `finnhub` for financial topics
2. ❌ Loop directive path wrong: `user.portfolio.symbols` (should be `user.portfolio.summary.symbols`)

---

## Recommended Fixes

### Fix #1: Improve tone_consistency Rubric (style.yaml)

**Problem**: Failing 30% of traces - rubric too vague

**Current**:
```yaml
signatures:
  tone_consistency:
    rubric: "Assess whether tone is consistent. Score 1-10."
    threshold: 7.0
```

**Fixed**:
```yaml
signatures:
  tone_consistency:
    rubric: |
      Check tone consistency across:
      1. FORMALITY: Professional terms only (not "pretty big", "kinda")
      2. OBJECTIVITY: Neutral facts (not "shocked markets")
      3. EXPERTISE: Assumes financial literacy
      Score 9-10: Perfect consistency
      Score 7-8: 1-2 minor lapses
      Score 5-6: Noticeable shifts
      Score <5: Multiple violations
    threshold: 7.0
```

**Why**: Specific dimensions + examples → LLM can grade consistently

### Fix #2: Lower TTR Threshold (style.yaml)

**Problem**: Failing 16% - too strict for financial jargon

**Current**:
```yaml
constraints:
  ttr_constraint:
    threshold: 0.55
```

**Fixed**:
```yaml
constraints:
  ttr_constraint:
    threshold: 0.50  # Financial terms naturally repeat
```

**Why**: Domain terminology (Fed, QE, yield curve) lowers lexical diversity

### Fix #3: Add Finnhub to Research Pattern (tools.yaml)

**Problem**: Financial topics not getting market data

**Current**:
```yaml
research_patterns:
  default: general_research
  patterns:
    general_research:
      steps:
        - tool: perplexity
```

**Fixed**:
```yaml
research_patterns:
  default: financial_research  # Changed default for case 0001
  patterns:
    financial_research:
      steps:
        - tool: perplexity
          save_as: news
        - tool: finnhub  # Added for market data
          input:
            endpoint: company_news
            symbol: "{{topic}}"  # Extract symbol from topic
          save_as: market_data
```

**Why**: Financial topics need both news (perplexity) + data (finnhub)
````
## Implementation

**1. Backup configs**:

```bash
cd writing_ecosystem/config/cases/0001
cp style.yaml style.yaml.backup
cp template.yaml template.yaml.backup
cp tools.yaml tools.yaml.backup
```

**2. Apply fixes**:
- Open each file in editor
- Apply changes from recommendations above
- Save files

**3. Test**:

```bash
python run_workflow.py --case 0001 --topic "Test topic"
# Check Langfuse trace for improvements
```

**4. Monitor**:
- Run 20-30 workflows
- Re-run this analysis
- Compare before/after failure rates
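The before/after comparison in the monitoring step can be sketched as a diff of two failure-rate dicts (e.g. as produced by any per-check analysis of the two trace batches; the function itself is a hypothetical helper, not part of the skill):

```python
def compare_runs(before_rates, after_rates):
    """Compare per-check failure rates from two analysis runs.

    Inputs are {check_name: failure_rate} dicts computed before and after
    applying the YAML fixes; returns deltas plus improved/regressed splits.
    """
    deltas = {}
    for name in sorted(set(before_rates) | set(after_rates)):
        before = before_rates.get(name, 0.0)
        after = after_rates.get(name, 0.0)
        deltas[name] = round(after - before, 4)
    improved = {n: d for n, d in deltas.items() if d < 0}
    regressed = {n: d for n, d in deltas.items() if d > 0}
    return deltas, improved, regressed
```

A non-empty `regressed` dict after a config change is the signal to roll back from the `.backup` files created in step 1.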
## Expected Results

- `tone_consistency` failures: 30% → ~15%
- `ttr_constraint` failures: 16% → ~8%
- Research quality: +20% (adding finnhub)
- Overall pre-flight score: 7.8 → 8.2

Ready to implement? Let me know which fixes to apply first, or if you want to see more detail on any issue.
## Analysis Patterns

### For style.yaml Issues

**Look for**:
1. **High failure rate** (>30%) → Vague rubric or wrong threshold
2. **Zero failures** → Threshold too loose or check not working
3. **Inconsistent scores** → Rubric needs examples and clear criteria
4. **Low edit fix rate** (<50%) → Check unclear about what to fix

**Common fixes**:
- Add specific dimensions to rubrics
- Provide good/bad examples
- Adjust thresholds based on domain (finance vs tech vs general)
- Add deterministic pre-checks for obvious violations

### For template.yaml Issues

**Look for**:
1. **Missing sections** in write node output
2. **Word count violations** (consistent over/under)
3. **Structure mismatches** (bullets vs narrative)

**Common fixes**:
- Add clear section descriptions
- Adjust word limits to realistic values
- Clarify format requirements (when to use bullets vs prose)

### For tools.yaml Issues

**Look for**:
1. **Wrong tool selected** for topic type
2. **Missing tools** for domain (finance needs finnhub)
3. **Loop expansion failures** (path errors in `for_each`)
4. **Tool errors** (API failures, timeouts)

**Common fixes**:
- Add domain-specific tools to patterns
- Fix loop directive paths
- Add fallback patterns
- Update default pattern for case

## Key Principles

### 1. Evidence-Based

Every recommendation must show:
- How many traces failed
- Example content that failed
- Why current config caused the failure

### 2. Specific

No generic advice like "improve rubric" - show EXACT YAML changes with inline comments.

### 3. Prioritized

Focus on:
- High-frequency issues first (affects >30% of traces)
- Quick wins (threshold adjustments)
- High-impact changes (missing tools for domain)

### 4. Actionable

Every fix includes:
- Exact file path
- Before/after YAML
- Expected improvement
- How to test

## Troubleshooting

**"No traces found"**:
- Verify case ID is correct
- Check trace naming: `writing-workflow-0001` vs `writing-workflow-001`
- Try broader: `--name "writing-workflow"` to see all cases

**"No check failures in traces"**:
- Workflow may be in fallback mode (no LLM)
- Edit node may have been skipped (pre-flight score >8.5)
- Verify edit node ran in observations

**"Can't identify issue"**:
- Read the actual style.yaml/template.yaml/tools.yaml files
- Compare trace output to config requirements
- Look for mismatches

**"Metadata filter returning no traces"**:
- Verify metadata fields exist in your traces (check raw trace JSON)
- Metadata values are case-sensitive strings
- Use exact matches only (no wildcards/regex)
- Try without metadata filter first to see available metadata fields
- Common fields: `case_id`, `profile_name`, `workflow_version`

## Success Criteria

Good recommendations should:
1. ✅ Show exact YAML before/after
2. ✅ Explain WHY issue occurred (root cause)
3. ✅ Quantify impact (X% failure rate → Y% expected)
4. ✅ Be implementable in <5 min per fix
5. ✅ Focus on top 3-5 issues (not 50 minor ones)

---

**Remember**: This skill is about **fixing config files**, not analyzing architecture. Keep it simple:
1. What's broken in the YAML?
2. Here's the fix
3. Here's why it works