Claude-code-plugins-plus-skills langfuse-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/langfuse-pack/skills/langfuse-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-langfuse-incident-runbook && rm -rf "$T"
manifest:
plugins/saas-packs/langfuse-pack/skills/langfuse-incident-runbook/SKILL.mdsource content
Langfuse Incident Runbook
Overview
Step-by-step procedures for Langfuse-related incidents, from initial triage (2 min) through resolution and post-incident review. Your application should work without Langfuse -- these procedures focus on restoring observability.
Severity Classification
| Severity | Description | Response Time | Example |
|---|---|---|---|
| P1 | Application impacted by tracing | 15 min | SDK throwing unhandled errors, blocking requests |
| P2 | Traces not appearing, no app impact | 1 hour | Missing observability data |
| P3 | Degraded performance from tracing | 4 hours | High latency from flush backlog |
| P4 | Minor issues | 24 hours | Occasional missing traces |
Instructions
Step 1: Initial Assessment (2 Minutes)
set -euo pipefail echo "=== Langfuse Incident Triage ===" echo "Time: $(date -u)" # 1. Check Langfuse cloud status echo -n "Status page: " curl -s -o /dev/null -w "%{http_code}" https://status.langfuse.com || echo "UNREACHABLE" echo "" # 2. Test API connectivity HOST="${LANGFUSE_BASE_URL:-${LANGFUSE_HOST:-https://cloud.langfuse.com}}" echo -n "API health: " curl -s -o /dev/null -w "%{http_code} (%{time_total}s)" "$HOST/api/public/health" || echo "FAILED" echo "" # 3. Test auth if [ -n "${LANGFUSE_PUBLIC_KEY:-}" ] && [ -n "${LANGFUSE_SECRET_KEY:-}" ]; then AUTH=$(echo -n "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" | base64) echo -n "Auth test: " curl -s -o /dev/null -w "%{http_code}" \ -H "Authorization: Basic $AUTH" "$HOST/api/public/traces?limit=1" || echo "FAILED" echo "" fi # 4. Check app error logs echo "" echo "--- Recent errors ---" grep -i "langfuse\|trace.*error\|flush.*fail" /var/log/app/*.log 2>/dev/null | tail -10 || echo "No log files found"
Step 2: Determine Incident Type and Response
| Symptom | Likely Cause | Immediate Action |
|---|---|---|
| No traces appearing | SDK not flushing | Check shutdown handlers; set temporarily |
| Key rotation or mismatch | Verify keys match the correct project |
| Rate limited | Increase batch size, reduce flush frequency |
| SDK throwing errors | Unhandled exception | Wrap in try/catch; check SDK version |
| High request latency | Sync flush in hot path | Switch to async; increase |
| Complete Langfuse outage | Service-side issue | Enable fallback mode |
Step 3: Fallback Mode (P1 -- App Impacted)
If Langfuse is causing application issues, disable tracing immediately:
// Emergency disable via environment variable // Set LANGFUSE_ENABLED=false in your deployment // In your tracing initialization: if (process.env.LANGFUSE_ENABLED === "false") { console.warn("Langfuse tracing DISABLED (emergency fallback)"); // Don't initialize SDK -- all observe/startActiveObservation calls // will still work but produce no-op spans }
For v3, use the
enabled flag:
const langfuse = new Langfuse({ enabled: process.env.LANGFUSE_ENABLED !== "false", });
Step 4: Common Resolution Procedures
Procedure A: Missing Traces
// 1. Verify SDK is initialized console.log("Langfuse configured:", !!process.env.LANGFUSE_PUBLIC_KEY); // 2. Check flush is happening // v4+: Verify NodeSDK is started and shutdown is registered // v3: Verify flushAsync() or shutdownAsync() is called // 3. Temporarily set aggressive flush for debugging const processor = new LangfuseSpanProcessor({ exportIntervalMillis: 1000, maxExportBatchSize: 1, });
Procedure B: Rate Limit (429) Recovery
// Increase batching to reduce API calls const processor = new LangfuseSpanProcessor({ exportIntervalMillis: 30000, // 30s flush maxExportBatchSize: 200, // Large batches }); // Or temporarily enable sampling const EMERGENCY_SAMPLE_RATE = 0.1; // Only trace 10%
Procedure C: Self-Hosted Instance Down
set -euo pipefail # Check container status docker ps -a | grep langfuse # Check logs docker logs langfuse-langfuse-1 --tail 50 # Check database docker exec langfuse-postgres-1 pg_isready -U langfuse # Restart if needed docker compose restart langfuse
Step 5: Post-Incident Verification
set -euo pipefail # Verify traces are flowing again echo "=== Post-Incident Check ===" HOST="${LANGFUSE_BASE_URL:-https://cloud.langfuse.com}" AUTH=$(echo -n "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" | base64) # Check recent trace count TRACE_COUNT=$(curl -s \ -H "Authorization: Basic $AUTH" \ "$HOST/api/public/traces?limit=5" | python3 -c "import sys,json; print(len(json.load(sys.stdin).get('data',[])))" 2>/dev/null || echo "ERROR") echo "Recent traces: $TRACE_COUNT" if [ "$TRACE_COUNT" = "0" ] || [ "$TRACE_COUNT" = "ERROR" ]; then echo "WARNING: Traces may not be flowing yet" else echo "OK: Traces are appearing" fi
Step 6: Post-Incident Review (P1/P2)
Document for post-mortem:
- Timeline: When detected, when resolved, total duration
- Impact: Traces lost, application impact, user impact
- Root cause: Why did the incident occur?
- Resolution: What fixed it?
- Prevention: What changes prevent recurrence?
- Action items: Improvements to implement
Escalation Path
| Level | Who | When |
|---|---|---|
| L1 | On-call engineer | All incidents -- run triage |
| L2 | Platform team lead | P1/P2 unresolved after 30 min |
| L3 | Langfuse support | Confirmed service-side issue |
Langfuse support channels:
- Status Page -- check first
- Discord -- community support
- GitHub Issues -- bug reports
- Email support (enterprise customers)
Error Handling
| Issue | Immediate Fix | Permanent Fix |
|---|---|---|
| SDK crashes app | Set | Wrap all tracing in try/catch |
| Lost traces | Increase batch size | Add shutdown handlers |
| High latency | Disable sync flush | Use async-only patterns |
| Auth failures | Rotate and redeploy keys | Add key validation at startup |