Vibeship-spawner-skills langfuse

Langfuse Skill

install
source: clone the upstream repo
git clone https://github.com/vibeforge1111/vibeship-spawner-skills
manifest: ai-agents/langfuse/skill.yaml
source content

Langfuse Skill

Open-source LLM observability and evaluation

id: langfuse
name: Langfuse
version: 1.0.0
layer: 2  # Integration layer

description: |
  Expert in Langfuse - the open-source LLM observability platform.
  Covers tracing, prompt management, evaluation, datasets, and
  integration with LangChain, LlamaIndex, and OpenAI. Essential for
  debugging, monitoring, and improving LLM applications in production.

owns:

  • LLM tracing and observability
  • Prompt management and versioning
  • Evaluation and scoring
  • Dataset management
  • Cost tracking
  • Performance monitoring
  • A/B testing prompts

pairs_with:

  • langgraph
  • crewai
  • structured-output
  • autonomous-agents

requires:

  • Python or TypeScript/JavaScript
  • Langfuse account (cloud or self-hosted)
  • LLM API keys

ecosystem:
  primary:
    - Langfuse Cloud
    - Langfuse Self-hosted
    - Python SDK
    - JS/TS SDK
  common_integrations:
    - LangChain
    - LlamaIndex
    - OpenAI SDK
    - Anthropic SDK
    - Vercel AI SDK
  platforms:
    - Any Python/JS backend
    - Serverless functions
    - Jupyter notebooks

prerequisites:

  • LLM application basics
  • API integration experience
  • Understanding of tracing concepts

limits:

  • Self-hosting requires infrastructure
  • High-volume usage may need optimization
  • The real-time dashboard has some latency
  • Evaluation requires setup
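
A common way to handle the high-volume limit above is head sampling: only record a trace for a fraction of requests. A minimal sketch; the helper name and rate are illustrative and not part of the Langfuse API:

```python
import random

def should_trace(sample_rate: float = 0.1) -> bool:
    """Simple head sampling: trace roughly sample_rate of requests."""
    # random.random() returns a value in [0, 1), so a rate of 1.0
    # always traces and a rate of 0.0 never does.
    return random.random() < sample_rate
```

Sampling decisions should be made once per request (not per span), so that a sampled request keeps its full trace.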

tags:

  • langfuse
  • observability
  • tracing
  • llm-monitoring
  • evaluation
  • prompt-management
  • debugging
  • analytics

triggers:

  • "langfuse"
  • "llm observability"
  • "llm tracing"
  • "prompt management"
  • "llm evaluation"
  • "monitor llm"
  • "debug llm"

history:

  • version: "1.0.0"
    date: "2025-01"
    changes: "Initial skill covering Langfuse patterns"

contrarian_insights:

  • claim: "Observability is optional for LLM apps"
    counter: "Without observability, you're flying blind on cost, quality, and latency"
    evidence: "Teams discover 40%+ cost waste and quality issues only after adding tracing"

  • claim: "Just log to files"
    counter: "Structured traces enable debugging, evaluation, and prompt iteration"
    evidence: "Finding issues in logs vs Langfuse: hours vs minutes"

  • claim: "Evaluation is too expensive"
    counter: "Catching regressions early saves more than eval costs"
    evidence: "One bad prompt update can cost more than months of evaluation"

identity:
  role: LLM Observability Architect
  personality: |
    You are an expert in LLM observability and evaluation. You think in
    terms of traces, spans, and metrics. You know that LLM applications
    need monitoring just like traditional software - but with different
    dimensions (cost, quality, latency). You use data to drive prompt
    improvements and catch regressions.
  expertise:
    - Tracing architecture
    - Prompt versioning
    - Evaluation strategies
    - Cost optimization
    - Quality monitoring

patterns:

  • name: Basic Tracing Setup
    description: Instrument LLM calls with Langfuse
    when_to_use: Any LLM application
    implementation: |
      import openai
      from langfuse import Langfuse

      # Initialize client
      langfuse = Langfuse(
          public_key="pk-...",
          secret_key="sk-...",
          host="https://cloud.langfuse.com"  # or self-hosted URL
      )

      # Create a trace for a user request
      trace = langfuse.trace(
          name="chat-completion",
          user_id="user-123",
          session_id="session-456",  # Groups related traces
          metadata={"feature": "customer-support"},
          tags=["production", "v2"]
      )

      # Log a generation (LLM call)
      generation = trace.generation(
          name="gpt-4o-response",
          model="gpt-4o",
          model_parameters={"temperature": 0.7},
          input={"messages": [{"role": "user", "content": "Hello"}]},
          metadata={"attempt": 1}
      )

      # Make the actual LLM call
      response = openai.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "user", "content": "Hello"}]
      )

      # Complete the generation with the output
      generation.end(
          output=response.choices[0].message.content,
          usage={
              "input": response.usage.prompt_tokens,
              "output": response.usage.completion_tokens
          }
      )

      # Score the trace
      trace.score(
          name="user-feedback",
          value=1,  # 1 = positive, 0 = negative
          comment="User clicked helpful"
      )

      # Flush before exit (important in serverless)
      langfuse.flush()

  • name: OpenAI Integration
    description: Automatic tracing with the OpenAI SDK
    when_to_use: OpenAI-based applications
    implementation: |
      from langfuse.openai import openai

      # Drop-in replacement for the OpenAI client:
      # all calls are automatically traced
      response = openai.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "user", "content": "Hello"}],
          # Langfuse-specific parameters
          name="greeting",  # Trace name
          session_id="session-123",
          user_id="user-456",
          tags=["test"],
          metadata={"feature": "chat"}
      )

      # Works with streaming
      stream = openai.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "user", "content": "Tell me a story"}],
          stream=True,
          name="story-generation"
      )

      for chunk in stream:
          print(chunk.choices[0].delta.content, end="")

      # Works with async
      import asyncio
      from langfuse.openai import AsyncOpenAI

      async_client = AsyncOpenAI()

      async def main():
          response = await async_client.chat.completions.create(
              model="gpt-4o",
              messages=[{"role": "user", "content": "Hello"}],
              name="async-greeting"
          )
          return response

      asyncio.run(main())

  • name: LangChain Integration
    description: Trace LangChain applications
    when_to_use: LangChain-based applications
    implementation: |
      from langchain_openai import ChatOpenAI
      from langchain_core.prompts import ChatPromptTemplate
      from langfuse.callback import CallbackHandler

      # Create the Langfuse callback handler
      langfuse_handler = CallbackHandler(
          public_key="pk-...",
          secret_key="sk-...",
          host="https://cloud.langfuse.com",
          session_id="session-123",
          user_id="user-456"
      )

      # Use with any LangChain component
      llm = ChatOpenAI(model="gpt-4o")

      prompt = ChatPromptTemplate.from_messages([
          ("system", "You are a helpful assistant."),
          ("user", "{input}")
      ])

      chain = prompt | llm

      # Pass the handler on invoke
      response = chain.invoke(
          {"input": "Hello"},
          config={"callbacks": [langfuse_handler]}
      )

      # Or set it as a default handler
      import langchain
      langchain.callbacks.manager.set_handler(langfuse_handler)

      # Then all calls are traced
      response = chain.invoke({"input": "Hello"})

      # Works with agents, retrievers, etc.
      from langchain.agents import AgentExecutor, create_openai_tools_agent

      agent = create_openai_tools_agent(llm, tools, prompt)
      agent_executor = AgentExecutor(agent=agent, tools=tools)

      result = agent_executor.invoke(
          {"input": "What's the weather?"},
          config={"callbacks": [langfuse_handler]}
      )

  • name: Prompt Management
    description: Version and deploy prompts
    when_to_use: Managing prompts across environments
    implementation: |
      import openai
      from langfuse import Langfuse

      langfuse = Langfuse()

      # Fetch a prompt from Langfuse
      # (create it in the UI or via the API first)
      prompt = langfuse.get_prompt("customer-support-v2")

      # Get the compiled prompt with variables filled in
      compiled = prompt.compile(
          customer_name="John",
          issue="billing question"
      )

      # Use with OpenAI
      response = openai.chat.completions.create(
          model=prompt.config.get("model", "gpt-4o"),
          messages=compiled,
          temperature=prompt.config.get("temperature", 0.7)
      )

      # Link the generation to the prompt version
      trace = langfuse.trace(name="support-chat")
      generation = trace.generation(
          name="response",
          model="gpt-4o",
          prompt=prompt  # Links to the specific version
      )

      # Create/update prompts via the API
      langfuse.create_prompt(
          name="customer-support-v3",
          prompt=[
              {"role": "system", "content": "You are a support agent..."},
              {"role": "user", "content": "{{user_message}}"}
          ],
          config={
              "model": "gpt-4o",
              "temperature": 0.7
          },
          labels=["production"]  # or ["staging", "development"]
      )

      # Fetch by label
      prompt = langfuse.get_prompt(
          "customer-support-v3",
          label="production"  # Gets the latest version with this label
      )

  • name: Evaluation and Scoring
    description: Evaluate LLM outputs systematically
    when_to_use: Quality assurance and improvement
    implementation: |
      import openai
      from langfuse import Langfuse

      langfuse = Langfuse()

      # Manual scoring in code
      trace = langfuse.trace(name="qa-flow")

      # After getting a response
      trace.score(
          name="relevance",
          value=0.85,  # 0-1 scale
          comment="Response addressed the question"
      )

      trace.score(
          name="correctness",
          value=1,  # Binary: 0 or 1
          data_type="BOOLEAN"
      )

      # LLM-as-judge evaluation
      def evaluate_response(question: str, response: str) -> float:
          eval_prompt = f"""
          Rate the response quality from 0 to 1.

          Question: {question}
          Response: {response}

          Output only a number between 0 and 1.
          """

          result = openai.chat.completions.create(
              model="gpt-4o-mini",  # Cheaper model for eval
              messages=[{"role": "user", "content": eval_prompt}]
          )

          return float(result.choices[0].message.content.strip())

      # Score asynchronously
      score = evaluate_response(question, response)
      trace.score(
          name="quality-llm-judge",
          value=score
      )

      # Create an evaluation dataset
      dataset = langfuse.create_dataset(name="support-qa-v1")

      # Add items to the dataset
      langfuse.create_dataset_item(
          dataset_name="support-qa-v1",
          input={"question": "How do I reset my password?"},
          expected_output="Go to settings > security > reset password"
      )

      # Run an evaluation over the dataset
      dataset = langfuse.get_dataset("support-qa-v1")

      for item in dataset.items:
          # Generate a response
          response = generate_response(item.input["question"])

          # Link to the dataset item
          trace = langfuse.trace(name="eval-run")
          trace.generation(
              name="response",
              input=item.input,
              output=response
          )

          # Score against the expected output
          similarity = calculate_similarity(response, item.expected_output)
          trace.score(name="similarity", value=similarity)

          # Link the trace to the dataset item
          item.link(trace, "eval-run-1")
    
  • name: Decorator Pattern
    description: Clean instrumentation with decorators
    when_to_use: Function-based applications
    implementation: |
      import openai
      from langfuse.decorators import observe, langfuse_context

      @observe()  # Creates a trace
      def chat_handler(user_id: str, message: str) -> str:
          # All nested @observe calls become spans
          context = get_context(message)
          response = generate_response(message, context)
          return response

      @observe()  # Becomes a span under the parent trace
      def get_context(message: str) -> str:
          # RAG retrieval
          docs = retriever.get_relevant_documents(message)
          return "\n".join([d.page_content for d in docs])

      @observe(as_type="generation")  # LLM generation span
      def generate_response(message: str, context: str) -> str:
          response = openai.chat.completions.create(
              model="gpt-4o",
              messages=[
                  {"role": "system", "content": f"Context: {context}"},
                  {"role": "user", "content": message}
              ]
          )
          return response.choices[0].message.content

      # Add metadata and scores
      @observe()
      def main_flow(user_input: str):
          # Update the current trace
          langfuse_context.update_current_trace(
              user_id="user-123",
              session_id="session-456",
              tags=["production"]
          )

          result = process(user_input)

          # Score the trace
          langfuse_context.score_current_trace(
              name="success",
              value=1 if result else 0
          )

          return result

      # Works with async
      @observe()
      async def async_handler(message: str):
          result = await async_generate(message)
          return result
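
A note on the LLM-as-judge pattern above: evaluate_response calls float() directly on the judge's reply, which raises if the model returns anything other than a bare number. A defensive parser is safer; this is a sketch (not part of the Langfuse SDK) that extracts the first number and clamps it to the 0-1 scale:

```python
import re
from typing import Optional

def parse_judge_score(raw: str) -> Optional[float]:
    """Extract a 0-1 score from an LLM judge's reply, or None if unparseable."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return None
    value = float(match.group())
    # Clamp so a judge that answers "10" or "1.5" cannot skew averages
    return max(0.0, min(1.0, value))

print(parse_judge_score("0.85"))           # → 0.85
print(parse_judge_score("Score: 0.7"))     # → 0.7
print(parse_judge_score("I cannot rate"))  # → None
```

Returning None (rather than a fake 0.0) lets the caller skip recording a score instead of polluting the metric.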

anti_patterns:

  • name: Not Flushing in Serverless
    description: Missing langfuse.flush() before exit
    why_bad: |
      Traces are batched. A serverless function may exit before the
      batch is flushed, and the data is lost.
    what_to_do_instead: |
      Always call langfuse.flush() at the end. Use context managers
      where available. Consider sync mode for critical traces.

  • name: Tracing Everything
    description: Instrumenting every tiny function
    why_bad: |
      Noisy traces and performance overhead make it hard to find the
      important information.
    what_to_do_instead: |
      Focus on LLM calls, key logic, and user actions. Group related
      operations. Use meaningful span names.

  • name: No User/Session IDs
    description: Traces without user context
    why_bad: |
      You can't debug specific users or track sessions, and analytics
      are limited.
    what_to_do_instead: |
      Always pass user_id and session_id. Use consistent identifiers.
      Add relevant metadata.

  • name: Ignoring Scores
    description: Tracing without evaluation
    why_bad: |
      You can see what happened, but not its quality. There is no
      feedback loop and no way to measure improvement.
    what_to_do_instead: |
      Add user feedback scores. Implement LLM-as-judge. Track key
      quality metrics.
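
The "always flush" rule from the first anti-pattern is easy to enforce with a context manager that flushes on exit even when the handler raises. A minimal sketch; the `_StubClient` stands in for the real Langfuse client (assumed only to expose `flush()`):

```python
from contextlib import contextmanager

class _StubClient:
    """Stand-in for the real Langfuse client (assumption: it has flush())."""
    def __init__(self):
        self.flushed = False

    def flush(self):
        self.flushed = True

@contextmanager
def flushing(client):
    """Guarantee client.flush() runs on exit, even if the body raises."""
    try:
        yield client
    finally:
        client.flush()

# Serverless-style handler: flush happens even on the error path
client = _StubClient()
try:
    with flushing(client):
        raise RuntimeError("handler failed")
except RuntimeError:
    pass

print(client.flushed)  # → True
```

The same pattern works with the real client: wrap the handler body in `with flushing(langfuse):` so no exit path can skip the flush.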

handoffs:

  • trigger: "agent|langgraph|workflow"
    to: langgraph
    context: "Need to build an agent to monitor"

  • trigger: "crewai|multi-agent|crew"
    to: crewai
    context: "Need to monitor a multi-agent system"

  • trigger: "structured output|parsing"
    to: structured-output
    context: "Need to trace structured extraction"