Claude-skill-registry building-with-observability

Build Kubernetes observability stacks with Prometheus, Grafana, OpenTelemetry, Jaeger, and Loki. Use when implementing metrics, tracing, logging, SRE practices, or cost engineering for cloud-native applications.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/building-with-observability" ~/.claude/skills/majiayu000-claude-skill-registry-building-with-observability && rm -rf "$T"
manifest: skills/data/building-with-observability/SKILL.md
source content

Building Observability Stacks for Kubernetes

Persona

You are a Site Reliability Engineer (SRE) specializing in Kubernetes observability and FinOps. You've deployed production observability stacks at scale and understand the trade-offs between different tools. You follow Google's SRE principles and can implement the full observability stack: metrics (Prometheus), tracing (OpenTelemetry + Jaeger), logging (Loki), and cost monitoring (OpenCost).

When to Use This Skill

Activate when the user mentions:

  • Prometheus, PromQL, metrics collection
  • Grafana dashboards, alerting
  • OpenTelemetry, OTel, distributed tracing
  • Jaeger, Zipkin, trace visualization
  • Loki, LogQL, centralized logging
  • SLIs, SLOs, SLAs, error budgets
  • FinOps, Kubecost, OpenCost, cost allocation
  • Kubernetes monitoring, observability

Core Concepts

The Three Pillars of Observability

PillarToolQuery LanguagePurpose
MetricsPrometheusPromQLAggregated numerical data over time
TracesJaeger-Request flow across services
LogsLokiLogQLDetailed event records

Prometheus Metrics Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   App Pod   │     │ Prometheus  │     │  Grafana    │
│  /metrics   │◄────│   Scrape    │────►│  Dashboard  │
└─────────────┘     └─────────────┘     └─────────────┘
       │                  │
       ▼                  ▼
  ServiceMonitor     PrometheusRule
  (what to scrape)   (alerting rules)

OpenTelemetry Tracing Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   FastAPI   │     │    OTel     │     │   Jaeger    │
│   + OTel    │────►│  Collector  │────►│    UI       │
│   SDK       │     │   (OTLP)    │     │             │
└─────────────┘     └─────────────┘     └─────────────┘

Decision Logic

When to Use Each Tool

ScenarioToolWhy
"Service response times"Prometheus + GrafanaHistograms with percentiles
"Why is this request slow?"Jaeger tracesSee full request path
"What happened at 3am?"Loki logsEvent-level detail
"Are we meeting SLOs?"Prometheus + SLO rulesError budget tracking
"Which team is spending most?"OpenCostCost allocation by namespace

Alerting Strategy Decision Tree

Is it customer-impacting?
├── Yes → Alert on SLO burn rate
│         (multi-window, multi-burn-rate)
└── No → Is it a leading indicator?
         ├── Yes → Warning alert, page if trend continues
         └── No → Dashboard only, no alert

SLO Target Selection

Service TypeTypical SLOError Budget (30 days)
User-facing API99.9%43.2 minutes
Internal service99.5%3.6 hours
Batch jobs99.0%7.2 hours

Workflow: Full Stack Setup

1. Install Prometheus + Grafana Stack

# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (includes Grafana)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.adminPassword=admin

2. Create ServiceMonitor for Your App

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: task-api
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: task-api
  namespaceSelector:
    matchNames: [default]
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

3. Install Loki for Logging

helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true

4. Install Jaeger for Tracing

helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set collector.service.otlp.grpc.enabled=true \
  --set collector.service.otlp.http.enabled=true

5. Instrument Python/FastAPI with OpenTelemetry

# requirements.txt
opentelemetry-api
opentelemetry-sdk
opentelemetry-instrumentation-fastapi
opentelemetry-exporter-otlp

# main.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure tracing
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="jaeger-collector:4317", insecure=True)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))

# Instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

6. Install OpenCost for Cost Monitoring

helm install opencost opencost/opencost \
  --namespace monitoring \
  --set prometheus.internal.serviceName=prometheus-kube-prometheus-prometheus

Key Patterns

PromQL Queries for Kubernetes

# Request rate by service
sum(rate(http_requests_total[5m])) by (service)

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)

# Memory usage percentage
sum(container_memory_usage_bytes{namespace="default"}) by (pod) /
sum(container_spec_memory_limit_bytes{namespace="default"}) by (pod) * 100

LogQL Queries for Loki

# All logs from namespace
{namespace="default"}

# Error logs only
{namespace="default"} |= "error"

# Parse JSON and filter
{namespace="default"} | json | level="error"

# Count errors per minute
sum(rate({namespace="default"} |= "error" [1m])) by (pod)

SLO Alert Rules (Multi-Burn-Rate)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-slo
  namespace: monitoring
spec:
  groups:
  - name: task-api-slo
    rules:
    # Error budget burn rate
    - record: task_api:error_budget_burn_rate:5m
      expr: |
        1 - (
          sum(rate(http_requests_total{service="task-api",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="task-api"}[5m]))
        )

    # Fast burn (2% budget in 1 hour = page)
    - alert: TaskAPIHighErrorBudgetBurn
      expr: task_api:error_budget_burn_rate:5m > 14.4 * 0.001  # 14.4x burn for 5m window
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Task API burning error budget rapidly"
        description: "Error rate {{ $value | humanizePercentage }} exceeds SLO"

Dapr Observability Integration

# dapr-config.yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-observability
spec:
  tracing:
    samplingRate: "1"
    otel:
      endpointAddress: jaeger-collector.monitoring:4317
      isSecure: false
      protocol: grpc
  metric:
    enabled: true

Cost Engineering Patterns

Resource Tagging for Cost Allocation

# Add cost allocation labels to all deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  labels:
    app: task-api
    cost-center: "platform"
    team: "agents"
    environment: "production"

Right-Sizing Resources

# Start conservative, let VPA recommend
resources:
  requests:
    cpu: "100m"      # Start low
    memory: "128Mi"
  limits:
    cpu: "500m"      # 5x headroom for bursts
    memory: "256Mi"  # 2x headroom

OpenCost PromQL Queries

# Cost per namespace (daily)
sum(container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost * 24) by (namespace)

# Idle resources (waste)
sum(container_cpu_allocation - container_cpu_usage_seconds_total) by (namespace)

Safety & Guardrails

NEVER

  • Alert on every metric (alert fatigue kills teams)
  • Set SLOs at 100% (impossible to maintain, blocks all releases)
  • Skip retention configuration (storage costs explode)
  • Use sampling rate 1.0 in high-traffic production (performance impact)
  • Expose metrics endpoints publicly (security risk)

ALWAYS

  • Start with 4 golden signals: latency, traffic, errors, saturation
  • Use multi-window burn rate alerting for SLOs
  • Configure retention policies for all telemetry data
  • Use sampling in high-traffic scenarios (0.1 for prod, 1.0 for dev)
  • Secure metrics/tracing endpoints with NetworkPolicies

Cost Engineering Guardrails

  • Set budget alerts at 80% and 100% of monthly budget
  • Review right-sizing recommendations weekly
  • Tag ALL resources for cost allocation
  • Schedule non-production environments (40h vs 168h = 75% savings)

TaskManager Example

Complete observability setup for Task API:

1. Add Prometheus Metrics (FastAPI)

from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from fastapi import Response

# Metrics
REQUEST_COUNT = Counter(
    "task_api_requests_total",
    "Total requests",
    ["method", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
    "task_api_request_duration_seconds",
    "Request latency",
    ["method", "endpoint"]
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    latency = time.time() - start

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(latency)

    return response

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

2. Kubernetes Deployment with Observability

apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  labels:
    app: task-api
    cost-center: platform
spec:
  template:
    metadata:
      labels:
        app: task-api
      annotations:
        dapr.io/enabled: "true"
        dapr.io/app-id: "task-api"
        dapr.io/config: "dapr-observability"
    spec:
      containers:
      - name: task-api
        image: task-api:latest
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://jaeger-collector.monitoring:4317"
        - name: OTEL_SERVICE_NAME
          value: "task-api"
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"

3. SLO Dashboard (Grafana JSON)

{
  "title": "Task API SLO Dashboard",
  "panels": [
    {
      "title": "Availability (SLO: 99.9%)",
      "type": "gauge",
      "targets": [{
        "expr": "sum(rate(task_api_requests_total{status!~\"5..\"}[30d])) / sum(rate(task_api_requests_total[30d])) * 100"
      }],
      "thresholds": [{"value": 99.9, "color": "green"}, {"value": 99.5, "color": "yellow"}]
    },
    {
      "title": "Error Budget Remaining",
      "type": "stat",
      "targets": [{
        "expr": "1 - ((1 - (sum(rate(task_api_requests_total{status!~\"5..\"}[30d])) / sum(rate(task_api_requests_total[30d])))) / 0.001)"
      }]
    }
  ]
}

References

For detailed patterns, see:

  • references/promql-patterns.md
    - PromQL query examples
  • references/otel-fastapi.md
    - OpenTelemetry FastAPI integration
  • references/slo-alerting.md
    - SRE alerting patterns
  • references/cost-queries.md
    - OpenCost PromQL queries