Monitoring & Observability
Monitoring and observability strategy, implementation, and troubleshooting. Use for designing metrics/logs/traces systems, setting up Prometheus/Grafana/Loki, creating alerts and dashboards, calculating SLOs and error budgets, analyzing performance issues, and comparing monitoring tools (Datadog, ELK, CloudWatch). Covers the Four Golden Signals, RED/USE methods, OpenTelemetry instrumentation, log aggregation patterns, and distributed tracing.
Install:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/ahmedasmar/devops-claude-skills/monitoring-observability" ~/.claude/skills/neversight-learn-skills-dev-monitoring-observability-c02fca && rm -rf "$T"
```
Overview
This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
When to use this skill:
- Setting up monitoring for new services
- Designing alerts and dashboards
- Troubleshooting performance issues
- Implementing SLO tracking and error budgets
- Choosing between monitoring tools
- Integrating OpenTelemetry instrumentation
- Analyzing metrics, logs, and traces
- Optimizing Datadog costs and finding waste
- Migrating from Datadog to open-source stack
Core Workflow: Observability Implementation
Use this decision tree to determine your starting point:
```
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
    ├─ YES → Go to "9. Troubleshooting & Analysis"
    └─ NO → Are you improving existing monitoring?
        ├─ Alerts → Go to "3. Alert Design"
        ├─ Dashboards → Go to "4. Dashboard & Visualization"
        ├─ SLOs → Go to "5. SLO & Error Budgets"
        ├─ Tool selection → Read references/tool_comparison.md
        └─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
```
1. Design Metrics Strategy
Start with The Four Golden Signals
Every service should monitor:
- Latency: Response time (p50, p95, p99)
- Traffic: Requests per second
- Errors: Failure rate
- Saturation: Resource utilization
For request-driven services, use the RED Method:
- Rate: Requests/sec
- Errors: Error rate
- Duration: Response time
For infrastructure resources, use the USE Method:
- Utilization: % time busy
- Saturation: Queue depth
- Errors: Error count
Quick Start - Web Application Example:
```promql
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))

# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Duration (p95 latency)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
```
Deep Dive: Metric Design
For comprehensive metric design guidance including:
- Metric types (counter, gauge, histogram, summary)
- Cardinality best practices
- Naming conventions
- Dashboard design principles
→ Read: references/metrics_design.md
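To get a quick feel for the four metric types before reading the deep dive, here is a minimal sketch using the prometheus_client Python library (the metric names and handler are illustrative, not taken from the reference):

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: monotonically increasing; rates are derived at query time
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])

# Gauge: a value that can go up and down
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being served")

# Histogram: bucketed observations, enables histogram_quantile() in PromQL
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5))

# Summary: client-side streaming aggregates (no cross-instance quantiles)
PAYLOAD = Summary("http_request_size_bytes", "Request payload size")

def handle_request():
    IN_FLIGHT.inc()
    with LATENCY.time():  # observes elapsed seconds into the histogram
        REQUESTS.labels(method="GET", status="200").inc()
        PAYLOAD.observe(512)
    IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    handle_request()
```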
Automated Metric Analysis
Detect anomalies and trends in your metrics:
```bash
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
  --endpoint http://localhost:9090 \
  --query 'rate(http_requests_total[5m])' \
  --hours 24

# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
  --namespace AWS/EC2 \
  --metric CPUUtilization \
  --dimensions InstanceId=i-1234567890abcdef0 \
  --hours 48
```
→ Script: scripts/analyze_metrics.py
2. Log Aggregation & Analysis
Structured Logging Checklist
Every log entry should include:
- ✅ Timestamp (ISO 8601 format)
- ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
- ✅ Message (human-readable)
- ✅ Service name
- ✅ Request ID (for tracing)
Example structured log (JSON):
{ "timestamp": "2024-10-28T14:32:15Z", "level": "error", "message": "Payment processing failed", "service": "payment-service", "request_id": "550e8400-e29b-41d4-a716-446655440000", "user_id": "user123", "order_id": "ORD-456", "error_type": "GatewayTimeout", "duration_ms": 5000 }
Log Aggregation Patterns
ELK Stack (Elasticsearch, Logstash, Kibana):
- Best for: Deep log analysis, complex queries
- Cost: High (infrastructure + operations)
- Complexity: High
Grafana Loki:
- Best for: Cost-effective logging, Kubernetes
- Cost: Low
- Complexity: Medium
CloudWatch Logs:
- Best for: AWS-centric applications
- Cost: Medium
- Complexity: Low
Log Analysis
Analyze logs for errors, patterns, and anomalies:
```bash
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log

# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors

# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
```
→ Script: scripts/log_analyzer.py
Deep Dive: Logging
For comprehensive logging guidance including:
- Structured logging implementation examples (Python, Node.js, Go, Java)
- Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
- Query patterns and best practices
- PII redaction and security
- Sampling and rate limiting
→ Read: references/logging_guide.md
3. Alert Design
Alert Design Principles
- Every alert must be actionable - If you can't do something, don't alert
- Alert on symptoms, not causes - Alert on user experience, not components
- Tie alerts to SLOs - Connect to business impact
- Reduce noise - Only page for critical issues
Alert Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| Critical | Page immediately | Service down, SLO violation |
| Warning | Ticket, review in hours | Elevated error rate, resource warning |
| Info | Log for awareness | Deployment completed, scaling event |
Multi-Window Burn Rate Alerting
Alert when error budget is consumed too quickly:
```yaml
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
```
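The 14.4 and 6 thresholds follow from the error-budget math: a burn rate of 14.4 sustained for one hour consumes 2% of a 30-day budget, and a rate of 6 over six hours consumes 5%. A sketch of the arithmetic (assuming a 30-day SLO window):

```python
# Burn rate = observed error rate / allowed error rate (1 - SLO)
slo = 0.999
budget_rate = 1 - slo            # 0.001

window_days = 30
window_hours = window_days * 24  # 720 hours in the SLO window

# Fast burn: rate 14.4 sustained over a 1h window
fast_burn = 14.4
print(fast_burn * 1 / window_hours)   # 0.02 -> 2% of budget gone in 1 hour

# Slow burn: rate 6 sustained over a 6h window
slow_burn = 6
print(slow_burn * 6 / window_hours)   # 0.05 -> 5% of budget gone in 6 hours
```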
Alert Quality Checker
Audit your alert rules against best practices:
```bash
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml

# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
```
Checks for:
- Alert naming conventions
- Required labels (severity, team)
- Required annotations (summary, description, runbook_url)
- PromQL expression quality
- 'for' clause to prevent flapping
→ Script: scripts/alert_quality_checker.py
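A stripped-down version of the kind of check the script performs, to show the idea (the real script is more thorough; this sketch assumes the standard Prometheus rule-file layout):

```python
import sys
import yaml  # PyYAML

REQUIRED_LABELS = {"severity", "team"}
REQUIRED_ANNOTATIONS = {"summary", "description", "runbook_url"}

def check_rules(path):
    with open(path) as f:
        doc = yaml.safe_load(f)
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # skip recording rules
            name = rule["alert"]
            missing_labels = REQUIRED_LABELS - set(rule.get("labels", {}))
            missing_annotations = REQUIRED_ANNOTATIONS - set(rule.get("annotations", {}))
            if missing_labels:
                print(f"{name}: missing labels {sorted(missing_labels)}")
            if missing_annotations:
                print(f"{name}: missing annotations {sorted(missing_annotations)}")
            if "for" not in rule:
                print(f"{name}: no 'for' clause, alert may flap")

if __name__ == "__main__":
    check_rules(sys.argv[1])
```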
Alert Templates
Production-ready alert rule templates:
→ Templates:
- assets/templates/prometheus-alerts/webapp-alerts.yml - Web application alerts
- assets/templates/prometheus-alerts/kubernetes-alerts.yml - Kubernetes alerts
Deep Dive: Alerting
For comprehensive alerting guidance including:
- Alert design patterns (multi-window, rate of change, threshold with hysteresis)
- Alert annotation best practices
- Alert routing (severity-based, team-based, time-based)
- Inhibition rules
- Runbook structure
- On-call best practices
→ Read: references/alerting_best_practices.md
Runbook Template
Create comprehensive runbooks for your alerts:
→ Template: assets/templates/runbooks/incident-runbook-template.md
4. Dashboard & Visualization
Dashboard Design Principles
- Top-down layout: Most important metrics first
- Color coding: Red (critical), yellow (warning), green (healthy)
- Consistent time windows: All panels use same time range
- Limit panels: 8-12 panels per dashboard maximum
- Include context: Show related metrics together
Recommended Dashboard Structure
```
┌─────────────────────────────────────┐
│ Overall Health (Single Stats)       │
│ [Requests/s] [Error%] [P95 Latency] │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Request Rate & Errors (Graphs)      │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Latency Distribution (Graphs)       │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Resource Usage (Graphs)             │
└─────────────────────────────────────┘
```
Generate Grafana Dashboards
Automatically generate dashboards from templates:
```bash
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
  --title "My API Dashboard" \
  --service my_api \
  --output dashboard.json

# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
  --title "K8s Production" \
  --namespace production \
  --output k8s-dashboard.json

# Database dashboard
python3 scripts/dashboard_generator.py database \
  --title "PostgreSQL" \
  --db-type postgres \
  --instance db.example.com:5432 \
  --output db-dashboard.json
```
Supports:
- Web applications (requests, errors, latency, resources)
- Kubernetes (pods, nodes, resources, network)
- Databases (PostgreSQL, MySQL)
→ Script: scripts/dashboard_generator.py
5. SLO & Error Budgets
SLO Fundamentals
SLI (Service Level Indicator): Measurement of service quality
- Example: Request latency, error rate, availability
SLO (Service Level Objective): Target value for an SLI
- Example: "99.9% of requests return in < 500ms"
Error Budget: Allowed failure amount = (100% - SLO)
- Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month
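The error-budget arithmetic is straightforward; a worked sketch for the 99.9% example (request counts match the calculator example below):

```python
slo = 0.999
period_days = 30

# Error budget as a fraction and as allowed downtime
budget_fraction = 1 - slo                      # 0.001
budget_minutes = budget_fraction * period_days * 24 * 60
print(budget_minutes)                          # 43.2 minutes/month

# Compliance from request counts: 1,500 failures out of 1,000,000 requests
total, failed = 1_000_000, 1_500
availability = 1 - failed / total              # 0.9985 -> SLO violated
budget_consumed = (failed / total) / budget_fraction
print(availability, budget_consumed)           # 0.9985, 1.5x budget spent
```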
Common SLO Targets
| Availability | Downtime/Month | Use Case |
|---|---|---|
| 99% | 7.2 hours | Internal tools |
| 99.9% | 43.2 minutes | Standard production |
| 99.95% | 21.6 minutes | Critical services |
| 99.99% | 4.3 minutes | High availability |
SLO Calculator
Calculate compliance, error budgets, and burn rates:
```bash
# Show SLO reference table
python3 scripts/slo_calculator.py --table

# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
  --slo 99.9 \
  --total-requests 1000000 \
  --failed-requests 1500 \
  --period-days 30

# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
  --slo 99.9 \
  --errors 50 \
  --requests 10000 \
  --window-hours 1
```
→ Script: scripts/slo_calculator.py
Deep Dive: SLO/SLA
For comprehensive SLO/SLA guidance including:
- Choosing appropriate SLIs
- Setting realistic SLO targets
- Error budget policies
- Burn rate alerting
- SLA structure and contracts
- Monthly reporting templates
→ Read: references/slo_sla_guide.md
6. Distributed Tracing
When to Use Tracing
Use distributed tracing when you need to:
- Debug performance issues across services
- Understand request flow through microservices
- Identify bottlenecks in distributed systems
- Find N+1 query problems
OpenTelemetry Implementation
Python example:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
```
Sampling Strategies
- Development: 100% (ALWAYS_ON)
- Staging: 50-100%
- Production: 1-10% (or error-based sampling)
Error-based sampling (always sample errors, 1% of successes):
```python
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # Always keep spans flagged as errors
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Keep ~1% of the rest (trace IDs are uniform, so 3/256 of them)
        if trace_id & 0xFF < 3:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"
```
OTel Collector Configuration
Production-ready OpenTelemetry Collector configuration:
→ Template: assets/templates/otel-config/collector-config.yaml
Features:
- Receives OTLP, Prometheus, and host metrics
- Batching and memory limiting
- Tail sampling (error-based, latency-based, probabilistic)
- Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)
Deep Dive: Tracing
For comprehensive tracing guidance including:
- OpenTelemetry instrumentation (Python, Node.js, Go, Java)
- Span attributes and semantic conventions
- Context propagation (W3C Trace Context)
- Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
- Analysis patterns (finding slow traces, N+1 queries)
- Integration with logs
→ Read: references/tracing_guide.md
7. Datadog Cost Optimization & Migration
Scenario 1: I'm Using Datadog and Costs Are Too High
If your Datadog bill is growing out of control, start by identifying waste:
Cost Analysis Script
Automatically analyze your Datadog usage and find cost optimization opportunities:
```bash
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY

# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY \
  --show-details
```
What it checks:
- Infrastructure host count and cost
- Custom metrics usage and high-cardinality metrics
- Log ingestion volume and trends
- APM host usage
- Unused or noisy monitors
- Container vs VM optimization opportunities
→ Script: scripts/datadog_cost_analyzer.py
Common Cost Optimization Strategies
1. Custom Metrics Optimization (typical savings: 20-40%):
- Remove high-cardinality tags (user IDs, request IDs)
- Delete unused custom metrics
- Aggregate metrics before sending
- Use metric prefixes to identify teams/services
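A quick sketch of the first two points, using the datadog Python client (the metric and tag names are illustrative, not from the skill's scripts):

```python
from datadog import statsd

# BAD: user_id as a tag creates one custom-metric time series per user
def record_login_bad(user_id):
    statsd.increment("app.logins", tags=[f"user_id:{user_id}"])

# GOOD: tag only low-cardinality dimensions; put user context in logs/traces
def record_login_good(plan, region):
    statsd.increment("app.logins", tags=[f"plan:{plan}", f"region:{region}"])
```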
2. Log Management (typical savings: 30-50%):
- Implement log sampling for high-volume services
- Use exclusion filters for debug/trace logs in production
- Archive cold logs to S3/GCS after 7 days
- Set log retention policies (15 days instead of 30)
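For the log-sampling point, a minimal Python logging filter that keeps everything at WARNING and above but samples roughly 1% of lower-severity lines (the thresholds are illustrative):

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep all WARNING+ records; keep ~1% of INFO/DEBUG."""
    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < 0.01

logging.getLogger().addFilter(SamplingFilter())
```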
3. APM Optimization (typical savings: 15-25%):
- Reduce trace sampling rates (10% → 5% in prod)
- Use head-based sampling instead of ingesting every trace
- Remove APM from non-critical services
- Use trace search with lower retention
4. Infrastructure Monitoring (typical savings: 10-20%):
- Switch from VM-based to container-based pricing where possible
- Remove agents from ephemeral instances
- Use Datadog's host reduction strategies
- Consolidate staging environments
Scenario 2: Migrating Away from Datadog
If you're considering migrating to a more cost-effective open-source stack:
Migration Overview
From Datadog → To Open Source Stack:
- Metrics: Datadog → Prometheus + Grafana
- Logs: Datadog Logs → Grafana Loki
- Traces: Datadog APM → Tempo or Jaeger
- Dashboards: Datadog → Grafana
- Alerts: Datadog Monitors → Prometheus Alertmanager
Estimated Cost Savings: 60-77% ($49.8k-61.8k/year for 100-host environment)
Migration Strategy
Phase 1: Run Parallel (Month 1-2):
- Deploy open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy
Phase 2: Migrate Dashboards & Alerts (Month 2-3):
- Convert Datadog dashboards to Grafana
- Translate alert rules (use DQL → PromQL guide below)
- Train team on new tools
Phase 3: Migrate Logs & Traces (Month 3-4):
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation
Phase 4: Decommission Datadog (Month 4-5):
- Confirm all functionality migrated
- Cancel Datadog subscription
Query Translation: DQL → PromQL
When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:
Quick examples:
```
# Average CPU
Datadog:    avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100

# Request rate
Datadog:    sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))

# P95 latency
Datadog:    p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate percentage
Datadog:    (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```
→ Full Translation Guide: references/dql_promql_translation.md
Cost Comparison
Example: 100-host infrastructure
| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|---|---|---|---|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| Total | $79,800 | $18,000 | $61,800 (77%) |
Deep Dive: Datadog Migration
For comprehensive migration guidance including:
- Detailed cost comparison and ROI calculations
- Step-by-step migration instructions
- Infrastructure sizing recommendations (CPU, RAM, storage)
- Dashboard conversion tools and examples
- Alert rule translation patterns
- Application instrumentation changes (DogStatsD → Prometheus client)
- Python scripts for exporting Datadog dashboards and monitors
- Common challenges and solutions
→ Read: references/datadog_migration.md
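To give a flavor of the instrumentation change (DogStatsD → Prometheus client), a hedged before/after sketch; the names are illustrative and the reference above covers the details:

```python
# Before: push-based DogStatsD
from datadog import statsd
statsd.histogram("request.duration", 0.254, tags=["endpoint:/api/users"])

# After: pull-based Prometheus client; Prometheus scrapes /metrics
from prometheus_client import Histogram, start_http_server

REQUEST_DURATION = Histogram("http_request_duration_seconds",
                             "Request duration", ["endpoint"])
REQUEST_DURATION.labels(endpoint="/api/users").observe(0.254)

start_http_server(8000)  # expose metrics for scraping instead of pushing
```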
8. Tool Selection & Comparison
Decision Matrix
Choose Prometheus + Grafana if:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious
Choose Datadog if:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)
Choose Grafana Stack (LGTM) if:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture
Choose ELK Stack if:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team
Choose Cloud Native (CloudWatch/etc) if:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup
Cost Comparison (100 hosts, 1TB logs/month)
| Solution | Monthly Cost | Setup | Ops Burden |
|---|---|---|---|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |
Deep Dive: Tool Comparison
For comprehensive tool comparison including:
- Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
- Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
- Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
- Full-stack observability comparison
- Recommendations by company size
→ Read: references/tool_comparison.md
9. Troubleshooting & Analysis
Health Check Validation
Validate health check endpoints against best practices:
```bash
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health

# Check multiple endpoints
python3 scripts/health_check_validator.py \
  https://api.example.com/health \
  https://api.example.com/readiness \
  --verbose
```
Checks for:
- ✓ Returns 200 status code
- ✓ Response time < 1 second
- ✓ Returns JSON format
- ✓ Contains 'status' field
- ✓ Includes version/build info
- ✓ Checks dependencies
- ✓ Disables caching
→ Script: scripts/health_check_validator.py
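For reference, a minimal Flask endpoint that would satisfy these checks (a sketch only; the version string and dependency probe are placeholders to adapt per service):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    return True  # replace with a real connectivity probe

@app.route("/health")
def health():
    deps = {"database": "ok" if check_database() else "failing"}
    status = "ok" if all(v == "ok" for v in deps.values()) else "degraded"
    resp = jsonify(status=status, version="1.4.2", dependencies=deps)
    resp.headers["Cache-Control"] = "no-store"  # disable caching of health state
    return resp, 200 if status == "ok" else 503

if __name__ == "__main__":
    app.run(port=8080)
```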
Common Troubleshooting Workflows
High Latency Investigation:
- Check dashboards for latency spike
- Query traces for slow operations
- Check database slow query log
- Check external API response times
- Review recent deployments
- Check resource utilization (CPU, memory)
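Steps 1-2 can be scripted; for example, a sketch that asks the Prometheus HTTP API for the slowest endpoints (the `endpoint` label and metric name are assumptions about your instrumentation):

```python
import requests

PROM = "http://localhost:9090/api/v1/query"

# p95 latency per endpoint over the last 5 minutes, worst first
query = """
topk(5, histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)))
"""

resp = requests.get(PROM, params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    endpoint = series["metric"].get("endpoint", "unknown")
    _, p95 = series["value"]
    print(f"{endpoint}: p95 = {float(p95):.3f}s")
```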
High Error Rate Investigation:
- Check error logs for patterns
- Identify affected endpoints
- Check dependency health
- Review recent deployments
- Check resource limits
- Verify configuration
Service Down Investigation:
- Check if pods/instances are running
- Check health check endpoint
- Review recent deployments
- Check resource availability
- Check network connectivity
- Review logs for startup errors
Quick Reference Commands
Prometheus Queries
```promql
# Request rate
sum(rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
Kubernetes Commands
```bash
# Check pod status
kubectl get pods -n <namespace>

# View pod logs
kubectl logs -f <pod-name> -n <namespace>

# Check pod resources
kubectl top pods -n <namespace>

# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>

# Check recent deployments
kubectl rollout history deployment/<name> -n <namespace>
```
Log Queries
Elasticsearch:
```
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```
Loki (LogQL):
{job="app", level="error"} |= "error" | json
CloudWatch Insights:
```
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
```
Resources Summary
Scripts (automation and analysis)
- scripts/analyze_metrics.py: Detect anomalies in Prometheus/CloudWatch metrics
- scripts/alert_quality_checker.py: Audit alert rules against best practices
- scripts/slo_calculator.py: Calculate SLO compliance and error budgets
- scripts/log_analyzer.py: Parse logs for errors and patterns
- scripts/dashboard_generator.py: Generate Grafana dashboards from templates
- scripts/health_check_validator.py: Validate health check endpoints
- scripts/datadog_cost_analyzer.py: Analyze Datadog usage and find cost waste
References (deep-dive documentation)
- references/metrics_design.md: Four Golden Signals, RED/USE methods, metric types
- references/alerting_best_practices.md: Alert design, runbooks, on-call practices
- references/logging_guide.md: Structured logging, aggregation patterns
- references/tracing_guide.md: OpenTelemetry, distributed tracing
- references/slo_sla_guide.md: SLI/SLO/SLA definitions, error budgets
- references/tool_comparison.md: Comprehensive comparison of monitoring tools
- references/datadog_migration.md: Complete guide for migrating from Datadog to an OSS stack
- references/dql_promql_translation.md: Datadog Query Language to PromQL translation reference
Templates (ready-to-use configurations)
- assets/templates/prometheus-alerts/webapp-alerts.yml: Production-ready web app alerts
- assets/templates/prometheus-alerts/kubernetes-alerts.yml: Kubernetes monitoring alerts
- assets/templates/otel-config/collector-config.yaml: OpenTelemetry Collector configuration
- assets/templates/runbooks/incident-runbook-template.md: Incident response template
Best Practices
Metrics
- Start with Four Golden Signals
- Use appropriate metric types (counter, gauge, histogram)
- Keep cardinality low (avoid high-cardinality labels)
- Follow naming conventions
Logging
- Use structured logging (JSON)
- Include request IDs for tracing
- Set appropriate log levels
- Redact PII before logging
Alerting
- Make every alert actionable
- Alert on symptoms, not causes
- Use multi-window burn rate alerts
- Include runbook links
Tracing
- Sample appropriately (1-10% in production)
- Always record errors
- Use semantic conventions
- Propagate context between services
SLOs
- Start with current performance
- Set realistic targets
- Define error budget policies
- Review and adjust quarterly