Claude-skill-registry building-with-observability
Build Kubernetes observability stacks with Prometheus, Grafana, OpenTelemetry, Jaeger, and Loki. Use when implementing metrics, tracing, logging, SRE practices, or cost engineering for cloud-native applications.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/building-with-observability" ~/.claude/skills/majiayu000-claude-skill-registry-building-with-observability && rm -rf "$T"
manifest:
skills/data/building-with-observability/SKILL.md · source content
Building Observability Stacks for Kubernetes
Persona
You are a Site Reliability Engineer (SRE) specializing in Kubernetes observability and FinOps. You've deployed production observability stacks at scale and understand the trade-offs between different tools. You follow Google's SRE principles and can implement the full observability stack: metrics (Prometheus), tracing (OpenTelemetry + Jaeger), logging (Loki), and cost monitoring (OpenCost).
When to Use This Skill
Activate when the user mentions:
- Prometheus, PromQL, metrics collection
- Grafana dashboards, alerting
- OpenTelemetry, OTel, distributed tracing
- Jaeger, Zipkin, trace visualization
- Loki, LogQL, centralized logging
- SLIs, SLOs, SLAs, error budgets
- FinOps, Kubecost, OpenCost, cost allocation
- Kubernetes monitoring, observability
Core Concepts
The Three Pillars of Observability
| Pillar | Tool | Query Language | Purpose |
|---|---|---|---|
| Metrics | Prometheus | PromQL | Aggregated numerical data over time |
| Traces | Jaeger | - | Request flow across services |
| Logs | Loki | LogQL | Detailed event records |
Prometheus Metrics Architecture
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   App Pod   │      │ Prometheus  │      │   Grafana   │
│  /metrics   │◄─────│   Scrape    │─────►│  Dashboard  │
└─────────────┘      └─────────────┘      └─────────────┘
                        │          │
                        ▼          ▼
               ServiceMonitor    PrometheusRule
              (what to scrape)   (alerting rules)
```
OpenTelemetry Tracing Architecture
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   FastAPI   │      │    OTel     │      │   Jaeger    │
│   + OTel    │─────►│  Collector  │─────►│     UI      │
│     SDK     │      │   (OTLP)    │      │             │
└─────────────┘      └─────────────┘      └─────────────┘
```
Decision Logic
When to Use Each Tool
| Scenario | Tool | Why |
|---|---|---|
| "Service response times" | Prometheus + Grafana | Histograms with percentiles |
| "Why is this request slow?" | Jaeger traces | See full request path |
| "What happened at 3am?" | Loki logs | Event-level detail |
| "Are we meeting SLOs?" | Prometheus + SLO rules | Error budget tracking |
| "Which team is spending most?" | OpenCost | Cost allocation by namespace |
Alerting Strategy Decision Tree
```
Is it customer-impacting?
├── Yes → Alert on SLO burn rate
│         (multi-window, multi-burn-rate)
└── No  → Is it a leading indicator?
          ├── Yes → Warning alert, page if trend continues
          └── No  → Dashboard only, no alert
```
SLO Target Selection
| Service Type | Typical SLO | Error Budget (30 days) |
|---|---|---|
| User-facing API | 99.9% | 43.2 minutes |
| Internal service | 99.5% | 3.6 hours |
| Batch jobs | 99.0% | 7.2 hours |
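The error-budget column is just (1 − SLO) multiplied by the 30-day window. A quick sketch of that conversion, matching the table above:

```python
# Convert an SLO target into its 30-day error budget, as in the table above.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

for slo in (0.999, 0.995, 0.99):
    budget_min = (1 - slo) * WINDOW_MINUTES
    print(f"SLO {slo:.1%}: {budget_min:.1f} min (~{budget_min / 60:.1f} h) of error budget")
# 99.9% -> 43.2 min, 99.5% -> 216 min (3.6 h), 99.0% -> 432 min (7.2 h)
```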
Workflow: Full Stack Setup
1. Install Prometheus + Grafana Stack
```bash
# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (includes Grafana)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.adminPassword=admin
```
2. Create ServiceMonitor for Your App
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: task-api
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: task-api
  namespaceSelector:
    matchNames: [default]
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```
3. Install Loki for Logging
```bash
# Requires the Grafana Helm repo:
#   helm repo add grafana https://grafana.github.io/helm-charts && helm repo update
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true
```
4. Install Jaeger for Tracing
```bash
# Requires the Jaeger Helm repo:
#   helm repo add jaegertracing https://jaegertracing.github.io/helm-charts && helm repo update
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set collector.service.otlp.grpc.enabled=true \
  --set collector.service.otlp.http.enabled=true
```
5. Instrument Python/FastAPI with OpenTelemetry
```text
# requirements.txt
opentelemetry-api
opentelemetry-sdk
opentelemetry-instrumentation-fastapi
opentelemetry-exporter-otlp
```

```python
# main.py
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure tracing: export spans to the Jaeger collector over OTLP/gRPC
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="jaeger-collector:4317", insecure=True)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(otlp_exporter))

# Instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
```
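As a usage sketch on top of the instrumentation above, a handler can open its own child span so slow internal work shows up under the auto-instrumented request span in Jaeger. The route, span name, and attribute key below are illustrative assumptions, not part of the original setup:

```python
# Illustrative: add a custom child span inside a handler.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@app.get("/tasks/{task_id}")
async def get_task(task_id: str):
    with tracer.start_as_current_span("load-task") as span:
        span.set_attribute("task.id", task_id)  # hypothetical attribute key
        # ... fetch the task from storage here ...
        return {"id": task_id}
```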
6. Install OpenCost for Cost Monitoring
```bash
# Requires the OpenCost Helm repo:
#   helm repo add opencost https://opencost.github.io/opencost-helm-chart && helm repo update
helm install opencost opencost/opencost \
  --namespace monitoring \
  --set prometheus.internal.serviceName=prometheus-kube-prometheus-prometheus
```
Key Patterns
PromQL Queries for Kubernetes
```promql
# Request rate by service
sum(rate(http_requests_total[5m])) by (service)

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)

# Memory usage percentage
sum(container_memory_usage_bytes{namespace="default"}) by (pod)
  / sum(container_spec_memory_limit_bytes{namespace="default"}) by (pod) * 100
```
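These expressions can also be evaluated programmatically against Prometheus's HTTP API (`/api/v1/query`). A minimal sketch, assuming Prometheus is reachable at the URL below (for example via a local port-forward of the kube-prometheus-stack service):

```python
# Minimal sketch: evaluate a PromQL expression via the Prometheus HTTP API.
# Assumes: kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090
import requests

PROM_URL = "http://localhost:9090"  # assumed reachable Prometheus endpoint

def instant_query(expr: str) -> list[dict]:
    """Run an instant query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    payload = resp.json()
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

p95_expr = (
    "histogram_quantile(0.95, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
)
for sample in instant_query(p95_expr):
    print(sample["metric"], sample["value"])  # value is [timestamp, "string_value"]
```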
LogQL Queries for Loki
```logql
# All logs from namespace
{namespace="default"}

# Error logs only
{namespace="default"} |= "error"

# Parse JSON and filter
{namespace="default"} | json | level="error"

# Count errors per minute
sum(rate({namespace="default"} |= "error" [1m])) by (pod)
```
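Loki exposes a comparable HTTP API. A minimal sketch of running a LogQL range query against `/loki/api/v1/query_range`; the port-forward target and URL are assumptions for local use:

```python
# Minimal sketch: run a LogQL range query against Loki's HTTP API.
# Assumes: kubectl -n monitoring port-forward svc/loki 3100
import time
import requests

LOKI_URL = "http://localhost:3100"  # assumed reachable Loki endpoint

def query_range(logql: str, minutes: int = 15, limit: int = 100) -> list[dict]:
    """Return log streams matching a LogQL query over the last N minutes."""
    end = time.time()
    start = end - minutes * 60
    resp = requests.get(
        f"{LOKI_URL}/loki/api/v1/query_range",
        params={
            "query": logql,
            "start": int(start * 1e9),  # nanosecond epoch
            "end": int(end * 1e9),
            "limit": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

for stream in query_range('{namespace="default"} |= "error"'):
    for ts, line in stream["values"]:
        print(ts, line)
```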
SLO Alert Rules (Multi-Burn-Rate)
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-slo
  namespace: monitoring
spec:
  groups:
    - name: task-api-slo
      rules:
        # Error budget burn rate
        - record: task_api:error_budget_burn_rate:5m
          expr: |
            1 - (
              sum(rate(http_requests_total{service="task-api",status!~"5.."}[5m]))
              /
              sum(rate(http_requests_total{service="task-api"}[5m]))
            )
        # Fast burn (2% budget in 1 hour = page)
        - alert: TaskAPIHighErrorBudgetBurn
          expr: task_api:error_budget_burn_rate:5m > 14.4 * 0.001  # 14.4x burn for 5m window
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Task API burning error budget rapidly"
            description: "Error rate {{ $value | humanizePercentage }} exceeds SLO"
```
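The 14.4 multiplier follows the standard SRE-workbook convention: page when the current error rate would consume 2% of a 30-day error budget within 1 hour. A short sketch of that arithmetic (the window/budget pairs are the usual recommendations, not project-specific values):

```python
# Derive multi-window burn-rate thresholds for a 30-day SLO window.
# burn_rate = (budget fraction consumed) * (SLO window) / (alert window)
SLO = 0.999                 # 99.9% availability target
ERROR_BUDGET = 1 - SLO      # 0.001
SLO_WINDOW_HOURS = 30 * 24  # 720 h

def burn_rate(budget_fraction: float, window_hours: float) -> float:
    return budget_fraction * SLO_WINDOW_HOURS / window_hours

# Standard pairs: (budget consumed, window, action)
for fraction, hours, action in [(0.02, 1, "page"), (0.05, 6, "page"), (0.10, 72, "ticket")]:
    rate = burn_rate(fraction, hours)
    threshold = rate * ERROR_BUDGET  # error-ratio threshold used in the alert expr
    print(f"{fraction:.0%} of budget in {hours}h -> {rate:.1f}x burn, "
          f"error ratio > {threshold:.5f} ({action})")
# 2% in 1h -> 14.4x, which is the 14.4 * 0.001 threshold above
```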
Dapr Observability Integration
```yaml
# dapr-config.yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: dapr-observability
spec:
  tracing:
    samplingRate: "1"
    otel:
      endpointAddress: jaeger-collector.monitoring:4317
      isSecure: false
      protocol: grpc
  metric:
    enabled: true
```
Cost Engineering Patterns
Resource Tagging for Cost Allocation
```yaml
# Add cost allocation labels to all deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  labels:
    app: task-api
    cost-center: "platform"
    team: "agents"
    environment: "production"
```
Right-Sizing Resources
```yaml
# Start conservative, let VPA recommend
resources:
  requests:
    cpu: "100m"      # Start low
    memory: "128Mi"
  limits:
    cpu: "500m"      # 5x headroom for bursts
    memory: "256Mi"  # 2x headroom
```
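To illustrate how a right-sizing recommendation can be derived from observed usage, here is a hedged sketch. The percentile, headroom factor, and usage samples are assumptions for the example; this is the common "percentile plus headroom" heuristic, not VPA's actual algorithm:

```python
# Hypothetical right-sizing helper: recommend requests/limits from observed CPU usage.
import math

def recommend(samples_millicores: list[float], headroom: float = 1.2) -> dict:
    """Request = p95 usage * headroom; limit = 2x request; rounded up to 50m."""
    ordered = sorted(samples_millicores)
    p95 = ordered[min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)]
    request = math.ceil(p95 * headroom / 50) * 50
    return {"request_m": request, "limit_m": request * 2}

# Example: CPU usage samples (millicores) from rate(container_cpu_usage_seconds_total[...])
print(recommend([40, 55, 60, 72, 80, 85, 90, 95, 110, 130]))
# -> {'request_m': 200, 'limit_m': 400}
```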
OpenCost PromQL Queries
```promql
# Cost per namespace (daily)
sum(container_cpu_allocation * on(node) group_left() node_cpu_hourly_cost * 24) by (namespace)

# Idle CPU (allocated cores minus used cores = waste)
sum(container_cpu_allocation) by (namespace)
  - sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
```
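To make the first query concrete, the daily CPU cost it returns is simply allocated cores × hourly node CPU price × 24. A toy example with made-up allocations and prices:

```python
# Toy illustration of the "cost per namespace (daily)" arithmetic.
# Allocation and price figures below are made up for the example.
allocations = {          # CPU cores allocated, by namespace
    "default": 2.0,
    "monitoring": 1.5,
}
node_cpu_hourly_cost = 0.031  # assumed $/core-hour for the node type

for ns, cores in allocations.items():
    daily_cost = cores * node_cpu_hourly_cost * 24
    print(f"{ns}: ${daily_cost:.2f}/day")
# default: $1.49/day, monitoring: $1.12/day
```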
Safety & Guardrails
NEVER
- Alert on every metric (alert fatigue kills teams)
- Set SLOs at 100% (impossible to maintain, blocks all releases)
- Skip retention configuration (storage costs explode)
- Use sampling rate 1.0 in high-traffic production (performance impact)
- Expose metrics endpoints publicly (security risk)
ALWAYS
- Start with 4 golden signals: latency, traffic, errors, saturation
- Use multi-window burn rate alerting for SLOs
- Configure retention policies for all telemetry data
- Use sampling in high-traffic scenarios (0.1 for prod, 1.0 for dev)
- Secure metrics/tracing endpoints with NetworkPolicies
Cost Engineering Guardrails
- Set budget alerts at 80% and 100% of monthly budget
- Review right-sizing recommendations weekly
- Tag ALL resources for cost allocation
- Schedule non-production environments (40h vs 168h ≈ 76% savings); see the sketch below
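A quick sketch of the scheduling arithmetic behind that last guardrail, plus the 80%/100% budget-alert thresholds; the budget and spend figures are illustrative:

```python
# Illustrative FinOps arithmetic: schedule savings and budget-alert thresholds.
WEEK_HOURS = 168
BUSINESS_HOURS = 40  # run non-prod only 8h x 5 days

savings = 1 - BUSINESS_HOURS / WEEK_HOURS
print(f"Scheduling non-prod saves ~{savings:.0%} of its compute cost")  # ~76%

monthly_budget = 5_000.0  # assumed monthly budget in $
spend_to_date = 4_200.0   # assumed month-to-date spend in $
for level in (0.8, 1.0):
    if spend_to_date >= monthly_budget * level:
        print(f"ALERT: spend ${spend_to_date:,.0f} crossed {level:.0%} of budget")
```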
TaskManager Example
Complete observability setup for Task API:
1. Add Prometheus Metrics (FastAPI)
```python
import time

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = FastAPI()

# Metrics
REQUEST_COUNT = Counter(
    "task_api_requests_total", "Total requests", ["method", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
    "task_api_request_duration_seconds", "Request latency", ["method", "endpoint"]
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    latency = time.time() - start
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code,
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method, endpoint=request.url.path
    ).observe(latency)
    return response

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
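A hedged verification sketch: use FastAPI's test client to drive one request through the middleware and confirm the counter shows up on /metrics. The /health route is illustrative and assumes the app defined above:

```python
# Quick local check that the middleware and /metrics endpoint work together.
from fastapi.testclient import TestClient

@app.get("/health")
def health():
    return {"status": "ok"}

client = TestClient(app)
client.get("/health")                     # generate one request through the middleware
body = client.get("/metrics").text
assert "task_api_requests_total" in body  # counter was registered and incremented
print([line for line in body.splitlines() if line.startswith("task_api_requests_total")])
```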
2. Kubernetes Deployment with Observability
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-api
  labels:
    app: task-api
    cost-center: platform
spec:
  selector:
    matchLabels:
      app: task-api
  template:
    metadata:
      labels:
        app: task-api
      annotations:
        dapr.io/enabled: "true"
        dapr.io/app-id: "task-api"
        dapr.io/config: "dapr-observability"
    spec:
      containers:
        - name: task-api
          image: task-api:latest
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://jaeger-collector.monitoring:4317"
            - name: OTEL_SERVICE_NAME
              value: "task-api"
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
```
3. SLO Dashboard (Grafana JSON)
{ "title": "Task API SLO Dashboard", "panels": [ { "title": "Availability (SLO: 99.9%)", "type": "gauge", "targets": [{ "expr": "sum(rate(task_api_requests_total{status!~\"5..\"}[30d])) / sum(rate(task_api_requests_total[30d])) * 100" }], "thresholds": [{"value": 99.9, "color": "green"}, {"value": 99.5, "color": "yellow"}] }, { "title": "Error Budget Remaining", "type": "stat", "targets": [{ "expr": "1 - ((1 - (sum(rate(task_api_requests_total{status!~\"5..\"}[30d])) / sum(rate(task_api_requests_total[30d])))) / 0.001)" }] } ] }
References
For detailed patterns, see:
- PromQL query examples: references/promql-patterns.md
- OpenTelemetry FastAPI integration: references/otel-fastapi.md
- SRE alerting patterns: references/slo-alerting.md
- OpenCost PromQL queries: references/cost-queries.md