Skills · langfuse
install
source · Clone the upstream repo
git clone https://github.com/TerminalSkills/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/langfuse" ~/.claude/skills/terminalskills-skills-langfuse && rm -rf "$T"
manifest:
skills/langfuse/SKILL.md
safety · automated scan (low risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- pip install
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content
Langfuse — Open-Source LLM Observability
You are an expert in Langfuse, the open-source LLM engineering platform. You help developers trace LLM calls, evaluate output quality, manage prompts, track costs and latency, run experiments, and build evaluation datasets — providing full observability into AI applications from development through production.
Core Capabilities
Tracing
```python
# Decorator-based tracing (Python)
# Assumes a configured OpenAI client (`openai`) and a `vector_store` with a
# `search(embedding, top_k)` method are available in scope.
from langfuse.decorators import observe, langfuse_context

@observe()
def answer_question(question: str) -> str:
    """Trace the entire RAG pipeline as a single trace."""
    # Step 1: Retrieve relevant docs
    docs = retrieve_docs(question)
    # Step 2: Generate answer
    answer = generate_answer(question, docs)
    # Add metadata to trace
    langfuse_context.update_current_trace(
        user_id="user-42",
        session_id="session-abc",
        tags=["production", "rag"],
        metadata={"model": "gpt-4o", "doc_count": len(docs)},
    )
    return answer

@observe()
def retrieve_docs(question: str) -> list[str]:
    """Traced as a span within the parent trace."""
    embeddings = openai.embeddings.create(model="text-embedding-3-small", input=question)
    results = vector_store.search(embeddings.data[0].embedding, top_k=5)
    return [r.text for r in results]

@observe(as_type="generation")
def generate_answer(question: str, docs: list[str]) -> str:
    """Traced as an LLM generation with token usage and cost."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context:\n{chr(10).join(docs)}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```
```typescript
// TypeScript / Vercel AI SDK integration
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { Langfuse } from "langfuse";
import { AISDKExporter } from "langfuse-vercel";

const langfuse = new Langfuse();
// AISDKExporter is registered with your OpenTelemetry setup so that
// AI SDK telemetry spans are exported to Langfuse.

// Wrap AI SDK calls
const result = await generateText({
  model: openai("gpt-4o"),
  prompt: question,
  experimental_telemetry: {
    isEnabled: true,
    functionId: "answer-question",
    metadata: { userId: "user-42" },
  },
});
```
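Langfuse sends events asynchronously in batches, so short-lived processes (scripts, serverless handlers) should flush the buffer before exiting. A minimal sketch using the decorator SDK shown above:

```python
from langfuse.decorators import langfuse_context

# Force-send any buffered events before the process exits; without this,
# traces from short-lived scripts or serverless invocations can be dropped.
langfuse_context.flush()
```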
Prompt Management
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch versioned prompt from Langfuse
prompt = langfuse.get_prompt("rag-system-prompt", version=3)

# Use in generation
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": prompt.compile(company_name="Acme", tone="professional")},
        {"role": "user", "content": user_question},
    ],
)

# Prompts managed via Langfuse UI — non-engineers can edit, version, A/B test
```
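The fetched prompt can also be linked to the generation that used it, so metrics in the Langfuse UI break down per prompt version. A sketch combining this with the decorator SDK from the Tracing section; the `answer_with_prompt` name and the `openai` client are assumptions:

```python
from langfuse.decorators import observe, langfuse_context

@observe(as_type="generation")
def answer_with_prompt(user_question: str) -> str:
    prompt = langfuse.get_prompt("rag-system-prompt", version=3)
    # Link this generation to the prompt version so quality and cost can be
    # compared across prompt versions in the Langfuse UI.
    langfuse_context.update_current_observation(prompt=prompt)
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt.compile(company_name="Acme", tone="professional")},
            {"role": "user", "content": user_question},
        ],
    )
    return response.choices[0].message.content
```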
Evaluation
```python
# Score traces for quality
langfuse.score(
    trace_id=trace.id,
    name="relevance",
    value=0.9,  # 0-1 scale
    comment="Answer directly addressed the question",
)

# LLM-as-judge evaluation
langfuse.score(
    trace_id=trace.id,
    name="hallucination",
    value=0.0,  # 0 = no hallucination
    data_type="NUMERIC",
)

# Create evaluation datasets
dataset = langfuse.create_dataset("rag-eval-v1")
for item in test_cases:
    langfuse.create_dataset_item(
        dataset_name="rag-eval-v1",
        input={"question": item["question"]},
        expected_output=item["expected_answer"],
    )
```
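Dataset items can then drive experiment runs: each item is executed through the application and its trace is linked to the run, so prompt changes can be regression-tested. A sketch under the assumption that `answer_question` from the Tracing section is the app under test:

```python
# Run an experiment over the dataset; each item's trace is linked to the run.
dataset = langfuse.get_dataset("rag-eval-v1")
for item in dataset.items:
    with item.observe(run_name="prompt-v3-experiment") as trace_id:
        output = answer_question(item.input["question"])
        # Score against the expected output (exact match here; swap in your own metric)
        langfuse.score(
            trace_id=trace_id,
            name="exact_match",
            value=1.0 if output == item.expected_output else 0.0,
        )
```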
Installation
```bash
pip install langfuse   # Python
npm install langfuse   # TypeScript

# Self-hosted: docker-compose up (Langfuse is open-source)
```
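Both SDKs authenticate with project API keys. A minimal configuration sketch in Python; the placeholder keys come from your Langfuse project settings, and the same values can instead be supplied via the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables:

```python
from langfuse import Langfuse

# Placeholder keys; copy real values from your Langfuse project settings.
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com",  # point at your own URL when self-hosting
)
```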
Best Practices
- Trace everything — Wrap all LLM calls with tracing; understand latency, cost, and quality per request
- Structured traces — Use nested spans (retrieve → generate → format); identify bottlenecks in pipeline
- Cost tracking — Langfuse auto-calculates token costs per model; track spending by user, feature, prompt version
- Prompt versioning — Manage prompts in Langfuse UI; A/B test versions, rollback safely
- Evaluation datasets — Create test sets from production traces; run regression tests on prompt changes
- LLM-as-judge — Use automated scoring for hallucination, relevance, helpfulness at scale (a minimal sketch follows this list)
- Session tracking — Group traces by session for conversational AI; see full conversation flow
- Self-hosted — Deploy with Docker for data sovereignty; same features as cloud, your infrastructure
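To make the LLM-as-judge bullet concrete, here is a minimal sketch: a small judge model grades an answer for unsupported claims, and the verdict is attached to the trace as a score. The judge prompt and the `judge_hallucination` helper are illustrative, not Langfuse APIs; `openai` and the scored `trace` are assumed from the earlier examples.

```python
from langfuse import Langfuse

langfuse = Langfuse()

def judge_hallucination(question: str, answer: str, context: str) -> float:
    """Hypothetical judge: ask a model whether the answer is grounded in the context."""
    verdict = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Does the answer contain claims not supported by the context? "
                f"Reply YES or NO.\n\nContext:\n{context}\n\n"
                f"Question: {question}\nAnswer: {answer}"
            ),
        }],
    ).choices[0].message.content
    return 1.0 if "YES" in verdict.upper() else 0.0  # 1.0 = hallucination detected

# Attach the judge verdict to the trace it evaluates
langfuse.score(
    trace_id=trace.id,
    name="hallucination",
    value=judge_hallucination(question, answer, context),
)
```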