Awesome-omni-skill monitoring-observability

Monitoring and observability patterns for Prometheus metrics, Grafana dashboards, Langfuse LLM tracing, and drift detection. Use when adding logging, metrics, distributed tracing, LLM cost tracking, or quality drift monitoring.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/monitoring-observability" ~/.claude/skills/diegosouzapw-awesome-omni-skill-monitoring-observability && rm -rf "$T"
manifest: skills/devops/monitoring-observability/SKILL.md
source content

Monitoring & Observability

Comprehensive patterns for infrastructure monitoring, LLM observability, and quality drift detection. Each category has individual rule files in rules/, loaded on demand.

Quick Reference

| Category | Rules | Impact | When to Use |
| --- | --- | --- | --- |
| Infrastructure Monitoring | 3 | CRITICAL | Prometheus metrics, Grafana dashboards, alerting rules |
| LLM Observability | 3 | HIGH | Langfuse tracing, cost tracking, evaluation scoring |
| Drift Detection | 3 | HIGH | Statistical drift, quality regression, drift alerting |
| Silent Failures | 3 | HIGH | Tool skipping, quality degradation, loop/token spike alerting |

Total: 12 rules across 4 categories

Quick Start

# Prometheus metrics with RED method
from prometheus_client import Counter, Histogram

# Rate and errors: one counter labelled by method, endpoint, and status
http_requests = Counter('http_requests_total', 'Total requests', ['method', 'endpoint', 'status'])
# Duration: latency histogram with bucket bounds in seconds
http_duration = Histogram('http_request_duration_seconds', 'Request latency',
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5])

# Langfuse LLM tracing
from langfuse import observe, get_client

@observe()
async def analyze_content(content: str):
    # Attach user, session, and tags to the trace created by @observe
    get_client().update_current_trace(
        user_id="user_123", session_id="session_abc",
        tags=["production", "orchestkit"],
    )
    # llm is the application's own LLM client (not defined in this snippet)
    return await llm.generate(content)

# PSI drift detection
import numpy as np

# calculate_psi and alert are application helpers (not defined in this snippet)
psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
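
The PSI snippet above calls calculate_psi without defining it. A minimal sketch of one way to compute it follows, assuming equal-width bins derived from the baseline range and a small floor (eps) to keep the logarithm defined. The signature, the bin count, and the use of the standard library instead of NumPy are assumptions for illustration, not the skill's own implementation.

```python
import math
from typing import Sequence


def calculate_psi(baseline: Sequence[float], current: Sequence[float],
                  bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(baseline), max(baseline)
    # Fall back to a width of 1.0 when every baseline value is identical.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion at eps so log(q / p) is always defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Identical samples give a PSI of zero, and the score grows as the current distribution moves away from the baseline. The 0.25 cutoff used in the Quick Start is then applied to the returned value.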

Infrastructure Monitoring

Prometheus metrics, Grafana dashboards, and alerting for application health.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Prometheus Metrics | rules/monitoring-prometheus.md | RED method, counters, histograms, cardinality |
| Grafana Dashboards | rules/monitoring-grafana.md | Golden Signals, SLO/SLI, health checks |
| Alerting Rules | rules/monitoring-alerting.md | Severity levels, grouping, escalation, fatigue prevention |
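
Cardinality is the item in this list that most often causes trouble in practice: an endpoint label that carries raw URLs creates one time series per distinct path. One common mitigation is to normalize paths before using them as label values. The sketch below assumes that numeric IDs and UUIDs are the only variable segments; normalize_endpoint and both patterns are hypothetical names introduced for illustration.

```python
import re

# Hypothetical patterns: UUIDs and purely numeric segments are treated as variable.
_UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I
)
_NUMERIC = re.compile(r"\d+")


def normalize_endpoint(path: str) -> str:
    """Collapse variable path segments so the endpoint label set stays bounded."""
    segments = []
    # Drop the query string, then inspect each path segment on its own.
    for segment in path.split("?")[0].split("/"):
        if _UUID.fullmatch(segment):
            segments.append("{uuid}")
        elif _NUMERIC.fullmatch(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)
```

The normalized string can then be passed as the endpoint label value of a counter such as http_requests_total, so /users/1 and /users/2 are counted in the same series.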

LLM Observability

Langfuse-based tracing, cost tracking, and evaluation for LLM applications.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Langfuse Traces | rules/llm-langfuse-traces.md | @observe decorator, OTEL spans, agent graphs |
| Cost Tracking | rules/llm-cost-tracking.md | Token usage, spend alerts, Metrics API |
| Eval Scoring | rules/llm-eval-scoring.md | Custom scores, evaluator tracing, quality monitoring |
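
The arithmetic behind a spend alert is simple enough to sketch without any tracing backend. The example below is independent of Langfuse and only shows how token counts turn into a cost and a budget check. The model names and prices are hypothetical placeholders, and Usage, cost_usd, and over_budget are names introduced here for illustration.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's real rates.
PRICES_PER_MTOK = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int


def cost_usd(usage: Usage) -> float:
    """Cost of a single LLM call given its token usage."""
    price = PRICES_PER_MTOK[usage.model]
    return (usage.input_tokens * price["input"]
            + usage.output_tokens * price["output"]) / 1_000_000


def over_budget(usages: list[Usage], daily_budget_usd: float) -> bool:
    """True when the summed cost of the calls exceeds the daily budget."""
    return sum(cost_usd(u) for u in usages) > daily_budget_usd
```

In a real setup the usage records would come from the tracing backend rather than being constructed by hand.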

Drift Detection

Statistical and quality drift detection for production LLM systems.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Statistical Drift | rules/drift-statistical.md | PSI, KS test, KL divergence, EWMA |
| Quality Drift | rules/drift-quality.md | Score regression, baseline comparison, canary prompts |
| Drift Alerting | rules/drift-alerting.md | Dynamic thresholds, correlation, anti-patterns |
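
Of the statistical methods listed, EWMA is the one suited to streaming scores, since it needs no stored window. A minimal sketch follows, assuming the baseline mean and standard deviation are already known and using the standard asymptotic EWMA control limit. The function name, alpha, and k are illustrative choices, not values taken from the rule files.

```python
from typing import Iterable, Iterator


def ewma_drift(scores: Iterable[float], baseline_mean: float, baseline_std: float,
               alpha: float = 0.2, k: float = 3.0) -> Iterator[tuple[float, bool]]:
    """Yield (ewma, drifted) for each incoming score."""
    ewma = baseline_mean
    # Asymptotic control limit: k standard deviations of the EWMA statistic.
    limit = k * baseline_std * (alpha / (2 - alpha)) ** 0.5
    for score in scores:
        # Blend the new score into the running average.
        ewma = alpha * score + (1 - alpha) * ewma
        yield ewma, abs(ewma - baseline_mean) > limit
```

A smaller alpha smooths more and reacts more slowly. A larger k widens the band and trades sensitivity for fewer false alarms.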

Silent Failures

Detection and alerting for silent failures in LLM agents.

| Rule | File | Key Pattern |
| --- | --- | --- |
| Tool Skipping | rules/silent-tool-skipping.md | Expected vs actual tool calls, Langfuse traces |
| Quality Degradation | rules/silent-degraded-quality.md | Heuristics + LLM-as-judge, z-score baselines |
| Silent Alerting | rules/silent-alerting.md | Loop detection, token spikes, escalation workflow |
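
The expected-versus-actual comparison and a basic loop check can both be expressed as set and count operations over the tool names seen in a trace. The sketch below assumes the tool names have already been extracted from the trace as a list of strings, which is not shown here. missing_tools, looping_tools, and the max_repeats default are hypothetical.

```python
from collections import Counter


def missing_tools(expected: set[str], actual_calls: list[str]) -> set[str]:
    """Tools the agent was expected to call but never did."""
    return expected - set(actual_calls)


def looping_tools(actual_calls: list[str], max_repeats: int = 3) -> set[str]:
    """Tools called more than max_repeats times in one trace, a cheap loop signal."""
    return {name for name, n in Counter(actual_calls).items() if n > max_repeats}
```

A non-empty result from either function is a candidate for an alert. Whether it is a real failure depends on the task, so the repeat limit is best tuned per agent.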

Key Decisions

| Decision | Recommendation | Rationale |
| --- | --- | --- |
| Metric methodology | RED method (Rate, Errors, Duration) | Industry standard, covers essential service health |
| Log format | Structured JSON | Machine-parseable, supports log aggregation |
| Tracing | OpenTelemetry | Vendor-neutral, auto-instrumentation, broad ecosystem |
| LLM observability | Langfuse (not LangSmith) | Open-source, self-hosted, built-in prompt management |
| LLM tracing API | @observe + get_client() | OTEL-native, automatic span creation |
| Drift method | PSI for production, KS for small samples | PSI is stable for large datasets, KS more sensitive |
| Threshold strategy | Dynamic (95th percentile) over static | Reduces alert fatigue, context-aware |
| Alert severity | 4 levels (Critical, High, Medium, Low) | Clear escalation paths, appropriate response times |
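
The threshold strategy row can be illustrated with the standard library alone. The sketch below derives an alert threshold from the 95th percentile of a recent window of metric values instead of a fixed number. dynamic_threshold, should_alert, and the choice of window are assumptions for illustration.

```python
import statistics


def dynamic_threshold(history: list[float], percentile: int = 95) -> float:
    """Return the given percentile of recent metric values as an alert threshold."""
    # quantiles(n=100) returns 99 cut points; index percentile - 1 is that percentile.
    return statistics.quantiles(history, n=100)[percentile - 1]


def should_alert(value: float, history: list[float]) -> bool:
    """Alert only when the new value exceeds what the recent window treats as normal."""
    return value > dynamic_threshold(history)
```

statistics.quantiles needs at least two data points, so a caller would fall back to a static threshold until enough history has accumulated.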

Detailed Documentation

| Resource | Description |
| --- | --- |
| references/ | Logging, metrics, tracing, Langfuse, drift analysis guides |
| checklists/ | Implementation checklists for monitoring and Langfuse setup |
| examples/ | Real-world monitoring dashboard and trace examples |
| scripts/ | Templates: Prometheus, OpenTelemetry, health checks, Langfuse |

Related Skills

  • defense-in-depth: Layer 8 observability as part of security architecture
  • devops-deployment: Observability integration with CI/CD and Kubernetes
  • resilience-patterns: Monitoring circuit breakers and failure scenarios
  • llm-evaluation: Evaluation patterns that integrate with Langfuse scoring
  • caching: Caching strategies that reduce costs tracked by Langfuse