Claude-skill-registry building-with-observability

Build Kubernetes observability stacks with Prometheus, Grafana, OpenTelemetry, Jaeger, and Loki. Use when implementing metrics, tracing, logging, SRE practices, or cost engineering for cloud-native applications.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/building-with-observability" ~/.claude/skills/majiayu000-claude-skill-registry-building-with-observability && rm -rf "$T"

manifest: skills/data/building-with-observability/SKILL.md

Building Observability Stacks for Kubernetes

Persona

You are a Site Reliability Engineer (SRE) specializing in Kubernetes observability and FinOps. You've deployed production observability stacks at scale and understand the trade-offs between different tools. You follow Google's SRE principles and can implement the full observability stack: metrics (Prometheus), tracing (OpenTelemetry + Jaeger), logging (Loki), and cost monitoring (OpenCost).

When to Use This Skill

Activate when the user mentions:

Prometheus, PromQL, metrics collection
Grafana dashboards, alerting
OpenTelemetry, OTel, distributed tracing
Jaeger, Zipkin, trace visualization
Loki, LogQL, centralized logging
SLIs, SLOs, SLAs, error budgets
FinOps, Kubecost, OpenCost, cost allocation
Kubernetes monitoring, observability

Core Concepts

The Three Pillars of Observability

Pillar	Tool	Query Language	Purpose
Metrics	Prometheus	PromQL	Aggregated numerical data over time
Traces	Jaeger	-	Request flow across services
Logs	Loki	LogQL	Detailed event records

Prometheus Metrics Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   App Pod   │     │ Prometheus  │     │  Grafana    │
│  /metrics   │◄────│   Scrape    │────►│  Dashboard  │
└─────────────┘     └─────────────┘     └─────────────┘
       │                  │
       ▼                  ▼
  ServiceMonitor     PrometheusRule
  (what to scrape)   (alerting rules)

OpenTelemetry Tracing Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   FastAPI   │     │    OTel     │     │   Jaeger    │
│   + OTel    │────►│  Collector  │────►│    UI       │
│   SDK       │     │   (OTLP)    │     │             │
└─────────────┘     └─────────────┘     └─────────────┘

Decision Logic

When to Use Each Tool

Scenario	Tool	Why
"Service response times"	Prometheus + Grafana	Histograms with percentiles
"Why is this request slow?"	Jaeger traces	See full request path
"What happened at 3am?"	Loki logs	Event-level detail
"Are we meeting SLOs?"	Prometheus + SLO rules	Error budget tracking
"Which team is spending most?"	OpenCost	Cost allocation by namespace

Alerting Strategy Decision Tree

Is it customer-impacting?
├── Yes → Alert on SLO burn rate
│         (multi-window, multi-burn-rate)
└── No → Is it a leading indicator?
         ├── Yes → Warning alert, page if trend continues
         └── No → Dashboard only, no alert
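
The tree above can be read as a two-question function. The sketch below is only an illustration of that reading; the `AlertAction` enum and the `choose_alert_action` name are hypothetical and not part of any tool in this stack.

```python
from enum import Enum


class AlertAction(Enum):
    """Outcomes of the alerting decision tree (names are illustrative)."""
    SLO_BURN_RATE_ALERT = "alert on SLO burn rate (multi-window, multi-burn-rate)"
    WARNING_ALERT = "warning alert, page if trend continues"
    DASHBOARD_ONLY = "dashboard only, no alert"


def choose_alert_action(customer_impacting: bool, leading_indicator: bool) -> AlertAction:
    """Walk the tree: customer impact is checked first, then leading indicators."""
    if customer_impacting:
        return AlertAction.SLO_BURN_RATE_ALERT
    if leading_indicator:
        return AlertAction.WARNING_ALERT
    return AlertAction.DASHBOARD_ONLY


if __name__ == "__main__":
    # A non-customer-facing signal that predicts trouble gets a warning alert.
    print(choose_alert_action(customer_impacting=False, leading_indicator=True).value)
```

Checking customer impact first means a signal that is both customer-impacting and a leading indicator still resolves to the SLO burn-rate alert.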

SLO Target Selection

Service Type	Typical SLO	Error Budget (30 days)
User-facing API	99.9%	43.2 minutes
Internal service	99.5%	3.6 hours
Batch jobs	99.0%	7.2 hours
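
The error budgets in the table follow from simple arithmetic: the window length multiplied by the allowed failure fraction. A minimal sketch, assuming an availability SLO over a rolling window (the `error_budget` helper is a hypothetical name):

```python
from datetime import timedelta


def error_budget(slo_percent: float, window_days: int = 30) -> timedelta:
    """Allowed downtime for an availability SLO over a rolling window."""
    if not 0 < slo_percent < 100:
        # A 100% target leaves no budget at all, so it is rejected here.
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window = timedelta(days=window_days)
    return window * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (99.9, 99.5, 99.0):
        minutes = error_budget(slo).total_seconds() / 60
        print(f"{slo}% over 30 days -> {minutes:.1f} minutes of budget")
```

Run over the three targets in the table, this should reproduce 43.2 minutes, 3.6 hours (216 minutes), and 7.2 hours (432 minutes).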

Workflow: Full Stack Setup

1. Install Prometheus + Grafana Stack

# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (includes Grafana).
# adminPassword=admin is only suitable for a local demo; use a real secret elsewhere.
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.adminPassword=admin

2. Create ServiceMonitor for Your App

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: task-api
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: task-api
  namespaceSelector:
    matchNames: [default]
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

3. Install Loki for Logging

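# Add the Grafana Helm repo, which hosts the loki-stack chart
helm repo add grafana https://grafana.github.io/helm-charts && helm repo update
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true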

4. Install Jaeger for Tracing

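# Add the Jaeger Helm repo
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts && helm repo update
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set collector.service.otlp.grpc.enabled=true \
  --set collector.service.otlp.http.enabled=true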

5. Instrument Python/FastAPI with OpenTelemetry

# requirements.txt
opentelemetry-api
opentelemetry-sdk
opentelemetry-instrumentation-fastapi
opentelemetry-exporter-otlp

# main.py
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

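# Configure tracing. The collector runs in the monitoring namespace,
# so address it by its namespace-qualified Service name.
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="jaeger-collector.monitoring:4317", insecure=True)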
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))

# Instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

6. Install OpenCost for Cost Monitoring

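# Add the OpenCost Helm repo
helm repo add opencost https://opencost.github.io/opencost-helm-chart && helm repo update
helm install opencost opencost/opencost \
  --namespace monitoring \
  --set prometheus.internal.serviceName=prometheus-kube-prometheus-prometheus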

Key Patterns

PromQL Queries for Kubernetes

# Request rate by service
sum(rate(http_requests_total[5m])) by (service)

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)

# Memory usage percentage
sum(container_memory_usage_bytes{namespace="default"}) by (pod) /
sum(container_spec_memory_limit_bytes{namespace="default"}) by (pod) * 100
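
The P95 query relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below illustrates that idea in plain Python; it is a simplified approximation rather than Prometheus's actual implementation, and the bucket values in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified: assumes non-negative observations and an le=+Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with an le=+Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Nothing to interpolate: report the highest finite bound seen.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds: (le, cumulative count).
    example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, example))
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: a P95 that lands in a wide bucket can be off by a large share of that bucket's width.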

LogQL Queries for Loki

# All logs from namespace
{namespace="default"}

# Error logs only
{namespace="default"} |= "error"

# Parse JSON and filter
{namespace="default"} | json | level="error"

# Count errors per minute (count_over_time counts lines; rate would give a per-second value)
sum(count_over_time({namespace="default"} |= "error" [1m])) by (pod)

SLO Alert Rules (Multi-Burn-Rate)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-slo
  namespace: monitoring
spec:
  groups:
  - name: task-api-slo
    rules:
    # 5m error ratio: the fraction of requests that returned 5xx.
    # Dividing it by the 0.1% error budget gives the burn rate.
    - record: task_api:error_budget_burn_rate:5m
      expr: |
        1 - (
          sum(rate(http_requests_total{service="task-api",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="task-api"}[5m]))
        )

    # Fast burn: a 14.4x burn rate spends 2% of the 30-day budget in 1 hour (page)
    - alert: TaskAPIHighErrorBudgetBurn
      expr: task_api:error_budget_burn_rate:5m > 14.4 * 0.001  # 14.4x the 0.1% budget of a 99.9% SLO
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Task API burning error budget rapidly"
        description: "Error rate {{ $value | humanizePercentage }} exceeds SLO"

Dapr Observability Integration

# dapr-config.yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-observability
spec:
  tracing:
    samplingRate: "1"  # samples every trace (dev); lower it, e.g. "0.1", for high-traffic production
    otel:
      endpointAddress: jaeger-collector.monitoring:4317
      isSecure: false
      protocol: grpc
  metric:
    enabled: true

Cost Engineering Patterns

Resource Tagging for Cost Allocation

# Add cost allocation labels to all deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  labels:
    app: task-api
    cost-center: "platform"
    team: "agents"
    environment: "production"

Right-Sizing Resources

# Start conservative, let VPA recommend
resources:
  requests:
    cpu: "100m"      # Start low
    memory: "128Mi"
  limits:
    cpu: "500m"      # 5x headroom for bursts
    memory: "256Mi"  # 2x headroom

OpenCost PromQL Queries

# Cost per namespace (daily)
sum(container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost * 24) by (namespace)

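# Idle CPU cores (waste): allocated cores minus cores actually used.
# The usage metric is a counter, so take its rate before subtracting.
sum(container_cpu_allocation) by (namespace)
  - sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)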
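
The daily cost query joins each allocation to its node's hourly CPU price and sums per namespace. The same arithmetic in plain Python may make the join easier to follow; this is an illustrative sketch with made-up node names and prices, not OpenCost's own calculation.

```python
from collections import defaultdict


def daily_cpu_cost_by_namespace(
    allocations: list[tuple[str, str, float]],
    node_hourly_cost: dict[str, float],
) -> dict[str, float]:
    """Sum cores * node hourly CPU price * 24 for each namespace.

    `allocations` holds (namespace, node, allocated_cores) rows.
    """
    totals: dict[str, float] = defaultdict(float)
    for namespace, node, cores in allocations:
        totals[namespace] += cores * node_hourly_cost[node] * 24
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    rows = [("default", "node-a", 0.5), ("default", "node-b", 1.0), ("monitoring", "node-a", 2.0)]
    prices = {"node-a": 0.03, "node-b": 0.04}
    for namespace, cost in daily_cpu_cost_by_namespace(rows, prices).items():
        print(f"{namespace}: {cost:.2f} per day")
```

This covers CPU only; a full allocation would add memory, storage, and network terms in the same way.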

Safety & Guardrails

NEVER

Alert on every metric (alert fatigue kills teams)
Set SLOs at 100% (impossible to maintain, blocks all releases)
Skip retention configuration (storage costs explode)
Use sampling rate 1.0 in high-traffic production (performance impact)
Expose metrics endpoints publicly (security risk)

ALWAYS

Start with 4 golden signals: latency, traffic, errors, saturation
Use multi-window burn rate alerting for SLOs
Configure retention policies for all telemetry data
Use sampling in high-traffic scenarios (0.1 for prod, 1.0 for dev)
Secure metrics/tracing endpoints with NetworkPolicies
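
A sampling ratio such as 0.1 is typically applied as a deterministic decision on the trace ID, so every service that sees the same trace reaches the same verdict. The sketch below shows the general idea using only the standard library; it is a conceptual illustration, not the OpenTelemetry SDK's code, and `should_sample` and `sampled_fraction` are hypothetical names.

```python
import secrets

TRACE_ID_LIMIT = 1 << 64  # the decision uses the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2^64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return (trace_id & (TRACE_ID_LIMIT - 1)) < round(ratio * TRACE_ID_LIMIT)


def sampled_fraction(ratio: float, n: int = 100_000) -> float:
    """Measure the kept fraction over n random 128-bit trace IDs."""
    kept = sum(should_sample(secrets.randbits(128), ratio) for _ in range(n))
    return kept / n


if __name__ == "__main__":
    print(f"ratio 0.1 keeps about {sampled_fraction(0.1):.3f} of traces")
    print(f"ratio 1.0 keeps {sampled_fraction(1.0):.3f} of traces")
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, which avoids partial traces in Jaeger.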

Cost Engineering Guardrails

Set budget alerts at 80% and 100% of monthly budget
Review right-sizing recommendations weekly
Tag ALL resources for cost allocation
Schedule non-production environments (running 40h instead of 168h per week saves about 76%)
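
Two of these guardrails reduce to short calculations. The sketch below is a rough illustration; the function names and the "ok" / "warning" / "critical" labels are assumptions, not OpenCost features.

```python
def schedule_savings(active_hours_per_week: float, total_hours_per_week: float = 168) -> float:
    """Fraction of cost avoided by running only during the active hours."""
    return 1 - active_hours_per_week / total_hours_per_week


def budget_alert_level(spend: float, monthly_budget: float) -> str:
    """Map month-to-date spend onto the 80% and 100% alert thresholds."""
    used = spend / monthly_budget
    if used >= 1.0:
        return "critical"
    if used >= 0.8:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(f"40h of 168h: {schedule_savings(40):.1%} saved")
    print(budget_alert_level(spend=850, monthly_budget=1000))
```

Running a non-production environment for 40 of 168 weekly hours avoids roughly three quarters of its compute cost, assuming cost scales with uptime.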

TaskManager Example

Complete observability setup for Task API:

1. Add Prometheus Metrics (FastAPI)

import time

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = FastAPI()  # or reuse the app already created in main.py

# Metrics
REQUEST_COUNT = Counter(
    "task_api_requests_total",
    "Total requests",
    ["method", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
    "task_api_request_duration_seconds",
    "Request latency",
    ["method", "endpoint"]
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    latency = time.time() - start

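    # Raw paths that embed IDs inflate label cardinality;
    # prefer the route template as the endpoint label where available.
    REQUEST_COUNT.labels(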
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(latency)

    return response

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

2. Kubernetes Deployment with Observability

apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  labels:
    app: task-api
    cost-center: platform
spec:
  selector:            # required; must match the pod template labels
    matchLabels:
      app: task-api
  template:
    metadata:
      labels:
        app: task-api
      annotations:
        dapr.io/enabled: "true"
        dapr.io/app-id: "task-api"
        dapr.io/config: "dapr-observability"
    spec:
      containers:
      - name: task-api
        image: task-api:latest
        ports:
        - containerPort: 8000
          name: http
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://jaeger-collector.monitoring:4317"
        - name: OTEL_SERVICE_NAME
          value: "task-api"
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"

3. SLO Dashboard (Grafana JSON)

{
  "title": "Task API SLO Dashboard",
  "panels": [
    {
      "title": "Availability (SLO: 99.9%)",
      "type": "gauge",
      "targets": [{
        "expr": "sum(rate(task_api_requests_total{status!~\"5..\"}[30d])) / sum(rate(task_api_requests_total[30d])) * 100"
      }],
      "thresholds": [{"value": 99.9, "color": "green"}, {"value": 99.5, "color": "yellow"}]
    },
    {
      "title": "Error Budget Remaining",
      "type": "stat",
      "targets": [{
        "expr": "1 - ((1 - (sum(rate(task_api_requests_total{status!~\"5..\"}[30d])) / sum(rate(task_api_requests_total[30d])))) / 0.001)"
      }]
    }
  ]
}
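
The "Error Budget Remaining" expression is one minus the observed error ratio divided by the allowed error ratio (0.001 for a 99.9% SLO). A minimal sketch of the same calculation on raw request counts, using a hypothetical helper name:

```python
def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_percent: float = 99.9) -> float:
    """Fraction of the error budget left: 1 - (observed error ratio / allowed error ratio)."""
    if total_requests == 0:
        return 1.0  # no traffic means no budget has been spent
    error_ratio = 1 - good_requests / total_requests
    allowed = 1 - slo_percent / 100
    return 1 - error_ratio / allowed


if __name__ == "__main__":
    # 500 failures in 1,000,000 requests is half of what a 99.9% SLO allows.
    print(f"{error_budget_remaining(999_500, 1_000_000):.2f}")
    # 2,000 failures overspends the budget, so the result goes negative.
    print(f"{error_budget_remaining(998_000, 1_000_000):.2f}")
```

A negative value means the budget is already exhausted for the window, which is the usual trigger for pausing risky releases.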

References

For detailed patterns, see:

references/promql-patterns.md - PromQL query examples
references/otel-fastapi.md - OpenTelemetry FastAPI integration
references/slo-alerting.md - SRE alerting patterns
references/cost-queries.md - OpenCost PromQL queries