LLM observability with Langfuse — OpenTelemetry-based tracing, evaluations, prompt management, datasets, and production best practices
Clone the repository:

```bash
git clone https://github.com/agents-inc/skills
```

Or install the skill directly:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/agents-inc/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/dist/plugins/api-ai-langfuse/skills/api-ai-langfuse" ~/.claude/skills/agents-inc-skills-api-ai-langfuse && rm -rf "$T"
```

Skill source: dist/plugins/api-ai-langfuse/skills/api-ai-langfuse/SKILL.md

Langfuse Observability Patterns
Quick Guide: Use the Langfuse TypeScript SDK (built on OpenTelemetry) to add observability to LLM applications. Install `@langfuse/tracing`, `@langfuse/otel`, and `@opentelemetry/sdk-node` for core tracing. Use `startActiveObservation()` for automatic context propagation or `observe()` to wrap functions. Use `observeOpenAI()` with `@langfuse/openai` for zero-config OpenAI tracing. Use `LangfuseClient` from `@langfuse/client` for prompt management, scores, and datasets. Always call `forceFlush()` or `sdk.shutdown()` in short-lived processes.
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)

(You MUST import and register `instrumentation.ts` at the top of your entry point BEFORE any other imports -- OpenTelemetry must instrument modules before they are loaded)

(You MUST call `forceFlush()` or `sdk.shutdown()` in short-lived processes (serverless, scripts, CLI tools) -- events are batched and will be lost without explicit flushing)

(You MUST use `observeOpenAI()` with `@langfuse/openai` for OpenAI SDK tracing -- do NOT manually create generation observations for OpenAI calls when the wrapper handles it automatically)

(You MUST set `LANGFUSE_SECRET_KEY`, `LANGFUSE_PUBLIC_KEY`, and `LANGFUSE_BASE_URL` via environment variables -- never hardcode credentials)

(You MUST use `startActiveObservation()` or `observe()` for nested tracing -- manual `startObservation()` requires explicit `.end()` calls and does NOT propagate context automatically)
</critical_requirements>
Auto-detection: Langfuse, langfuse, @langfuse/tracing, @langfuse/otel, @langfuse/client, @langfuse/openai, LangfuseSpanProcessor, LangfuseClient, startActiveObservation, startObservation, observeOpenAI, langfuse.score, langfuse.prompt, langfuse.dataset, LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, forceFlush
When to use:
- Adding observability and tracing to LLM application code (any provider)
- Wrapping OpenAI SDK calls for automatic token/cost tracking
- Managing prompt templates with versioning, labels, and variable compilation
- Evaluating LLM output quality with scores (numeric, categorical, boolean)
- Running experiments against datasets for regression testing
- Tracking sessions, users, and metadata across multi-turn conversations
- Monitoring LLM costs and token usage in production
Key patterns covered:
- OpenTelemetry setup with `LangfuseSpanProcessor`
- Tracing with `startActiveObservation`, `observe`, and manual `startObservation`
- Observation types (span, generation, agent, tool, retriever, evaluator, embedding, chain, guardrail)
- OpenAI SDK auto-instrumentation with `observeOpenAI()`
- Prompt management (get, compile, text vs chat prompts, versioning)
- Scores and evaluations (numeric, categorical, boolean)
- Datasets and experiments for testing
- Flush, shutdown, and lifecycle management
When NOT to use:
- You only need basic `console.log` debugging -- Langfuse is for structured production observability
- You want provider-specific tracing built into an AI SDK -- check if your framework has native observability
- You need APM/infrastructure monitoring (CPU, memory, HTTP latency) -- use a general-purpose observability tool
Examples Index
- Core: Setup & Configuration -- OpenTelemetry setup, instrumentation file, client init, flush/shutdown
- Tracing -- startActiveObservation, observe, manual tracing, nesting, observation types, metadata
- OpenAI Integration -- observeOpenAI wrapper, streaming, token tracking, custom attributes
- Prompt Management -- getPrompt, compile, text vs chat, versioning, caching
- Scores & Datasets -- Numeric/categorical/boolean scores, datasets, experiments
- Quick API Reference -- Package index, environment variables, observation types, score methods
<philosophy>
Philosophy
Langfuse provides open-source LLM observability built on OpenTelemetry. The SDK (v4+, August 2025) is a ground-up rewrite using OTel as the tracing backbone, meaning traces integrate naturally with the broader observability ecosystem.
Core principles:
- OpenTelemetry-native -- Built on OTel spans and context propagation. Langfuse observations are wrappers around OTel spans with LLM-specific attributes (model, tokens, cost). This means any OTel-compatible instrumentation library works alongside Langfuse.
- Zero-latency tracing -- All trace events are queued locally and flushed in background batches. Your application's response time is not affected by observability.
- Modular packages -- `@langfuse/tracing` for instrumentation, `@langfuse/client` for prompts/scores/datasets, `@langfuse/openai` for OpenAI auto-instrumentation. Install only what you need.
- Context-first -- `startActiveObservation()` automatically propagates parent-child relationships. Nested observations inherit context without manual ID threading.
- Observation types -- LLM-specific types (`generation`, `agent`, `tool`, `retriever`, `evaluator`, `embedding`) provide semantic meaning to traces, enabling richer dashboard views and filtering.
When to use Langfuse:
- You need production-grade LLM observability with tracing, cost tracking, and evaluations
- You want prompt management with versioning, A/B testing via labels, and variable compilation
- You need dataset-driven testing and experiment tracking for LLM quality assurance
- You want an open-source, self-hostable alternative to proprietary LLM observability platforms
When NOT to use:
- Simple debugging -- `console.log` is sufficient for local development
- Infrastructure monitoring -- use Datadog, Grafana, etc. for APM
- You need a complete AI agent framework -- Langfuse is observability, not orchestration
</philosophy>
<patterns>
Core Patterns
Pattern 1: OpenTelemetry Setup
Create an `instrumentation.ts` file and import it at the top of your entry point.
```typescript
// instrumentation.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseSpanProcessor } from "@langfuse/otel";

const sdk = new NodeSDK({
  spanProcessors: [new LangfuseSpanProcessor()],
});

sdk.start();

export { sdk };
```
```typescript
// index.ts -- import instrumentation FIRST
import "./instrumentation";

// All other imports AFTER instrumentation
import { startActiveObservation } from "@langfuse/tracing";
```
Why good: OTel must instrument modules before they are loaded; importing instrumentation first ensures all subsequent imports are traced automatically
```typescript
// BAD: importing instrumentation after other modules
import { startActiveObservation } from "@langfuse/tracing";
import "./instrumentation"; // TOO LATE -- tracing won't capture earlier imports
```
Why bad: OpenAI/LangChain auto-instrumentation requires OTel to be initialized before those SDKs are imported
See: examples/core.md for environment variables, sampling, masking, and production configuration
Pattern 2: Tracing with startActiveObservation
The primary instrumentation pattern. Creates an observation, makes it the active context, and automatically ends it when the callback completes.
```typescript
import { startActiveObservation } from "@langfuse/tracing";

async function handleRequest(query: string): Promise<string> {
  return await startActiveObservation("handle-request", async (span) => {
    span.update({ input: { query } });

    // Nested observation -- automatically becomes a child
    const result = await startActiveObservation(
      "process-query",
      async (child) => {
        child.update({ input: { query } });
        const answer = await callLLM(query);
        child.update({ output: { answer } });
        return answer;
      },
    );

    span.update({ output: { result } });
    return result;
  });
}
```
Why good: Automatic context propagation, automatic end on callback completion, nesting creates parent-child hierarchy without manual ID management
```typescript
// BAD: using startObservation without ending it
import { startObservation } from "@langfuse/tracing";

const span = startObservation("my-span");
await doWork();
// span.end() never called -- observation stays open forever
```
Why bad: Manual `startObservation` requires explicit `.end()` calls; forgetting creates open-ended observations
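When you do need manual control, a corrected sketch of the same pattern, ending the observation in a `finally` block so it closes even when the work throws (the `doWork()` helper is a placeholder):

```typescript
import { startObservation } from "@langfuse/tracing";

// Manual start/end control -- always pair startObservation with .end()
const span = startObservation("my-span");
try {
  const result = await doWork();
  span.update({ output: { result } });
} finally {
  span.end();
}
```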
See: examples/tracing.md for observe wrapper, observation types, metadata, and manual tracing
Pattern 3: The observe() Wrapper
Wraps a function to automatically capture inputs, outputs, timings, and errors.
```typescript
import { observe } from "@langfuse/tracing";

const classifyIntent = observe(
  async (query: string) => {
    const result = await callLLM(query);
    return result.intent;
  },
  { name: "classify-intent", asType: "generation" },
);

// Usage -- automatically traced
const intent = await classifyIntent("Book a flight to Paris");
```
Why good: Declarative tracing, inputs/outputs captured automatically, `asType` tags the observation type for richer dashboard filtering
Pattern 4: OpenAI Auto-Instrumentation
Use `observeOpenAI()` to wrap the OpenAI client for automatic tracing of all calls.
```typescript
import OpenAI from "openai";
import { observeOpenAI } from "@langfuse/openai";

const openai = observeOpenAI(new OpenAI());

// All calls automatically traced with model, tokens, cost
const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
});
```
Why good: Zero manual instrumentation, captures model name, token counts, estimated costs, latency, and streaming metrics automatically
```typescript
// BAD: manually creating generation observations for OpenAI calls
await startActiveObservation("openai-call", async (span) => {
  const result = await rawOpenai.chat.completions.create({ ... });
  span.update({
    model: "gpt-4o",
    input: messages,
    output: result.choices[0].message.content,
  });
}, { asType: "generation" });
```
Why bad: `observeOpenAI` handles all of this automatically with more accurate token/cost data; manual tracking is error-prone and duplicates effort
See: examples/openai-integration.md for streaming, custom attributes, and token tracking on streams
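For streaming, a brief sketch of the token-tracking point covered in the performance and red-flags sections below; it reuses the wrapped client from the example above and assumes your OpenAI API version supports `stream_options`:

```typescript
// Streaming with usage reporting so observeOpenAI can record token counts
const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Summarize this article." }],
  stream: true,
  stream_options: { include_usage: true },
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```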
Pattern 5: Prompt Management
Fetch versioned prompts, compile with variables, and link to traces.
```typescript
import { LangfuseClient } from "@langfuse/client";

const langfuse = new LangfuseClient();

// Fetch a text prompt (production label by default)
const prompt = await langfuse.prompt.get("summarize-article");
const compiled = prompt.compile({ topic: "AI safety", length: "brief" });
// -> "Write a brief summary about AI safety."

// Fetch a chat prompt
const chatPrompt = await langfuse.prompt.get("assistant-v2", { type: "chat" });
const messages = chatPrompt.compile({ userName: "Alice" });
// -> [{ role: "system", content: "You are helping Alice..." }, ...]
```
Why good: Centralized prompt management with versioning, labels for A/B testing, variable compilation, and built-in caching
See: examples/prompt-management.md for versioning, labels, cache control, and linking prompts to traces
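A small illustration of how `compile()` treats placeholders, assuming the stored template uses the `{{variable}}` syntax described under red flags below; the prompt content in the comments is hypothetical:

```typescript
// Hypothetical stored template:
//   "Write a {{length}} summary about {{topic}} for {{audience}}."
const summarize = await langfuse.prompt.get("summarize-article");

// Only two of the three placeholders are supplied...
const text = summarize.compile({ topic: "AI safety", length: "brief" });
// ...so the unmatched one survives as a literal:
//   "Write a brief summary about AI safety for {{audience}}."
```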
Pattern 6: Scores and Evaluations
Attach quality measurements to traces and observations.
```typescript
import { LangfuseClient } from "@langfuse/client";

const langfuse = new LangfuseClient();

// Numeric score
langfuse.score.create({
  traceId: "trace-123",
  name: "relevance",
  value: 0.95,
  dataType: "NUMERIC",
});

// Categorical score
langfuse.score.create({
  traceId: "trace-123",
  name: "quality",
  value: "good",
  dataType: "CATEGORICAL",
});

// Boolean score (0 or 1)
langfuse.score.create({
  traceId: "trace-123",
  name: "contains-hallucination",
  value: 0,
  dataType: "BOOLEAN",
});

// Score a specific observation within a trace
langfuse.score.create({
  traceId: "trace-123",
  observationId: "obs-456",
  name: "accuracy",
  value: 0.88,
  dataType: "NUMERIC",
});

// Flush in short-lived processes
await langfuse.score.flush();
```
Why good: Three data types cover all evaluation needs, scores attach at trace or observation level, fire-and-forget API with batching
See: examples/scores-datasets.md for active observation scoring, session scores, datasets, and experiments
Pattern 7: Flush and Shutdown
Always flush in short-lived processes. The SDK batches events and sends them asynchronously.
```typescript
import { sdk } from "./instrumentation";
import { LangfuseClient } from "@langfuse/client";

const langfuse = new LangfuseClient();

async function main() {
  // ... do work ...

  // Flush scores
  await langfuse.score.flush();

  // Shutdown OTel SDK (flushes all pending spans)
  await sdk.shutdown();
}

main();
```
Why good: Explicit flush/shutdown ensures all events are sent before the process exits; without this, data is silently lost in serverless and scripts
```typescript
// BAD: exiting without flushing
async function handler() {
  await startActiveObservation("my-trace", async (span) => {
    span.update({ output: "done" });
  });
  // Process exits -- batched events never sent
}
```
Why bad: Langfuse batches events locally; if the process exits before the flush interval, events are lost
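In serverless handlers you usually flush per invocation instead of shutting the SDK down. A sketch under two assumptions: `instrumentation.ts` also exports the `LangfuseSpanProcessor` instance (the example above only exports `sdk`), and `processEvent` is a placeholder for your handler logic:

```typescript
// Assumes instrumentation.ts adds:
//   export const langfuseSpanProcessor = new LangfuseSpanProcessor();
import { langfuseSpanProcessor } from "./instrumentation";

export async function handler(event: unknown): Promise<unknown> {
  const result = await processEvent(event);

  // Flush instead of shutdown so the warm instance keeps tracing
  await langfuseSpanProcessor.forceFlush();
  return result;
}
```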
</patterns>
<performance>
Performance Optimization
Sampling for High-Volume Applications
Reduce costs by sampling a subset of traces:
```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";
import { LangfuseSpanProcessor } from "@langfuse/otel";

const sdk = new NodeSDK({
  sampler: new TraceIdRatioBasedSampler(0.2), // Sample 20% of traces
  spanProcessors: [new LangfuseSpanProcessor()],
});
```
Or via environment variable:
LANGFUSE_SAMPLE_RATE=0.2
Key Optimization Patterns
- Batch flush tuning -- Configure `LANGFUSE_FLUSH_AT` (default 10) and `LANGFUSE_FLUSH_INTERVAL` (default 1s) for your workload
- Span filtering -- Use `shouldExportSpan` on `LangfuseSpanProcessor` to drop noisy non-LLM spans (see the sketch after this list)
- Data masking -- Redact PII before transmission with the `mask` option to avoid storing sensitive data
- Stream token tracking -- Set `stream_options: { include_usage: true }` on OpenAI streaming calls so `observeOpenAI` captures token counts
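A minimal configuration sketch combining span filtering and data masking. The `shouldExportSpan` and `mask` option names come from the bullets above, but the callback shapes shown here are assumptions -- check the current Langfuse SDK reference before relying on them:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseSpanProcessor } from "@langfuse/otel";

const sdk = new NodeSDK({
  spanProcessors: [
    new LangfuseSpanProcessor({
      // Assumed callback shape: drop noisy non-LLM spans by name
      shouldExportSpan: ({ otelSpan }) => !otelSpan.name.startsWith("http"),
      // Assumed callback shape: redact email addresses before export
      mask: ({ data }) =>
        typeof data === "string"
          ? data.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[redacted-email]")
          : data,
    }),
  ],
});

sdk.start();
```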
</performance>
<decision_framework>
Decision Framework
Which Packages to Install
```text
What do you need?
+-- Tracing LLM calls?
|   +-- YES -> npm install @langfuse/tracing @langfuse/otel @opentelemetry/sdk-node
|   +-- Also using OpenAI SDK?
|       +-- YES -> npm install @langfuse/openai
+-- Prompt management, scores, or datasets?
|   +-- YES -> npm install @langfuse/client
+-- Both tracing AND client features?
    +-- YES -> Install all: @langfuse/tracing @langfuse/otel @opentelemetry/sdk-node @langfuse/client
```
Which Tracing Method to Use
```text
How do you want to instrument?
+-- Wrapping a function? -> observe() (declarative, auto-captures inputs/outputs)
+-- Block of code with nesting? -> startActiveObservation() (context propagation, auto-end)
+-- Need manual start/end control? -> startObservation() (requires explicit .end())
+-- OpenAI SDK calls? -> observeOpenAI() (zero-config auto-instrumentation)
+-- Update active span without reference? -> updateActiveObservation()
```
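`updateActiveObservation()` from the tree above is not shown in the patterns; a minimal sketch of the idea, assuming it is exported from `@langfuse/tracing` alongside the other helpers and accepts the same attribute object as `span.update()` (`callLLM` is the same placeholder used in the patterns):

```typescript
import { startActiveObservation, updateActiveObservation } from "@langfuse/tracing";

async function lookupAnswer(query: string): Promise<string> {
  // No span reference in scope -- update whatever observation is active
  updateActiveObservation({ metadata: { cacheChecked: true } });
  return callLLM(query);
}

await startActiveObservation("handle-request", async (span) => {
  span.update({ input: { query: "What is Langfuse?" } });
  const answer = await lookupAnswer("What is Langfuse?");
  span.update({ output: { answer } });
});
```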
Which Observation Type (asType)
```text
What is this observation?
+-- LLM call (prompt -> completion) -> "generation"
+-- AI agent decision-making step -> "agent"
+-- External API or function call -> "tool"
+-- Vector store or DB retrieval -> "retriever"
+-- Quality assessment step -> "evaluator"
+-- Embedding creation -> "embedding"
+-- Link between application steps -> "chain"
+-- Content safety / jailbreak check -> "guardrail"
+-- Generic duration operation -> "span" (default)
+-- Point-in-time event -> "event"
```
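A sketch of these types in practice, assuming `startActiveObservation()` accepts the same `asType` option that `observe()` takes in Pattern 3 (`queryVectorStore` and `callLLM` are placeholders):

```typescript
import { startActiveObservation } from "@langfuse/tracing";

const question = "What does the retention policy say?";

// A retrieval-augmented step traced with semantic observation types
await startActiveObservation(
  "answer-question",
  async () => {
    const docs = await startActiveObservation(
      "search-docs",
      async () => queryVectorStore(question),
      { asType: "retriever" },
    );
    return startActiveObservation(
      "generate-answer",
      async () => callLLM(question, docs),
      { asType: "generation" },
    );
  },
  { asType: "agent" },
);
```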
</decision_framework>
<red_flags>
RED FLAGS
High Priority Issues:
- Not importing `instrumentation.ts` before other modules (auto-instrumentation silently fails)
- Exiting short-lived processes without `forceFlush()` or `sdk.shutdown()` (events are silently lost)
- Hardcoding `LANGFUSE_SECRET_KEY` or `LANGFUSE_PUBLIC_KEY` in source code (credential exposure)
- Using manual generation observations when `observeOpenAI()` would handle it automatically (duplicated effort, less accurate data)
- Using `startObservation()` without calling `.end()` (observation stays open indefinitely)
Medium Priority Issues:
- Not setting `stream_options: { include_usage: true }` on OpenAI streaming calls (token counts missing from `observeOpenAI` traces)
- Forgetting to call `langfuse.score.flush()` in short-lived processes (scores are batched and may be lost)
- Using `startObservation()` when `startActiveObservation()` would work (no automatic context propagation or auto-end)
- Not using `asType` on observations (all observations appear as generic spans, losing semantic meaning)
- Not setting `LANGFUSE_BASE_URL` for self-hosted instances (defaults to cloud.langfuse.com)
Common Mistakes:
- Importing `@langfuse/openai` without setting up the OTel `NodeSDK` first -- the OpenAI wrapper requires OTel context to send traces
- Confusing `LangfuseClient` (from `@langfuse/client`, for prompts/scores/datasets) with the OTel tracing functions (from `@langfuse/tracing`)
- Using `prompt.compile()` without matching all `{{variable}}` placeholders -- unmatched variables remain as literal `{{name}}` in output
- Calling `langfuse.score.create()` with a `value` of type `string` for `NUMERIC` scores or `number` for `CATEGORICAL` scores (type mismatch)
- Running dataset experiments without OTel setup -- experiment tasks run inside `startActiveObservation`, which requires OTel
Gotchas & Edge Cases:
- `observeOpenAI()` does NOT support the OpenAI Assistants API -- only the Chat Completions and Responses APIs
- The SDK's default span filter only exports Langfuse and GenAI spans. If you use a custom instrumentation library, you must configure `shouldExportSpan` to include it.
- `LangfuseClient.prompt.get()` caches prompts with a default TTL. If you update a prompt and don't see changes, set `cacheTtlSeconds: 0` to bypass caching.
- Boolean scores use float values (`0` or `1`), not JavaScript booleans (`true`/`false`).
- Self-hosted Langfuse requires platform version >= 3.95.0 for TypeScript SDK v4 compatibility.
- `score.create()` is fire-and-forget (synchronous) -- it queues the score for batched delivery. You only need `await` on `.flush()`.
- Dataset names with slashes (`evaluation/qa-dataset`) must be URL-encoded when used as path parameters.
- The v4+ SDK is a complete rewrite from v3 -- the `Langfuse` class, `trace()`, `span()`, and `generation()` from v3 are replaced by OTel-based APIs.
</red_flags>
<critical_reminders>
CRITICAL REMINDERS
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)

(You MUST import and register `instrumentation.ts` at the top of your entry point BEFORE any other imports -- OpenTelemetry must instrument modules before they are loaded)

(You MUST call `forceFlush()` or `sdk.shutdown()` in short-lived processes (serverless, scripts, CLI tools) -- events are batched and will be lost without explicit flushing)

(You MUST use `observeOpenAI()` with `@langfuse/openai` for OpenAI SDK tracing -- do NOT manually create generation observations for OpenAI calls when the wrapper handles it automatically)

(You MUST set `LANGFUSE_SECRET_KEY`, `LANGFUSE_PUBLIC_KEY`, and `LANGFUSE_BASE_URL` via environment variables -- never hardcode credentials)

(You MUST use `startActiveObservation()` or `observe()` for nested tracing -- manual `startObservation()` requires explicit `.end()` calls and does NOT propagate context automatically)
Failure to follow these rules will produce silent data loss, missing traces, or credential exposure in LLM observability.
</critical_reminders>