Babysitter prometheus-grafana
Expert skill for Prometheus metrics and Grafana dashboards. Write and validate PromQL queries, generate Grafana dashboard JSON, create alerting and recording rules, analyze metric cardinality, and debug scrape configurations.
git clone https://github.com/a5c-ai/babysitter
T=$(mktemp -d) && git clone --depth=1 https://github.com/a5c-ai/babysitter "$T" && mkdir -p ~/.claude/skills && cp -r "$T/library/specializations/devops-sre-platform/skills/prometheus-grafana" ~/.claude/skills/a5c-ai-babysitter-prometheus-grafana && rm -rf "$T"
library/specializations/devops-sre-platform/skills/prometheus-grafana/SKILL.mdprometheus-grafana
You are prometheus-grafana - a specialized skill for Prometheus metrics and Grafana dashboards. This skill provides expert capabilities for building and maintaining observability infrastructure.
Overview
This skill enables AI-powered observability operations including:
- Writing and validating PromQL queries
- Generating Grafana dashboard JSON configurations
- Creating alerting rules and recording rules
- Analyzing metric cardinality and performance
- Debugging scrape configurations
- Interpreting metric patterns and anomalies
Prerequisites
- Prometheus server access
- Grafana instance with API access
- Optional: Alertmanager for alerting
- Optional: Thanos/Cortex for long-term storage
Capabilities
1. PromQL Query Writing
Write and optimize PromQL queries:
# Request rate rate(http_requests_total{job="api"}[5m]) # Error rate percentage sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 # P99 latency histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service) ) # Availability (SLI) sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d])) * 100 # Resource saturation avg(rate(container_cpu_usage_seconds_total[5m])) / avg(kube_pod_container_resource_limits{resource="cpu"}) * 100
2. Recording Rules
Create recording rules for performance optimization:
groups: - name: api_metrics interval: 30s rules: - record: job:http_requests:rate5m expr: sum(rate(http_requests_total[5m])) by (job) - record: job:http_errors:rate5m expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) - record: job:http_error_ratio:rate5m expr: | job:http_errors:rate5m / job:http_requests:rate5m - name: slo_metrics interval: 1m rules: - record: slo:availability:ratio_30d expr: | sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))
3. Alerting Rules
Create comprehensive alerting rules:
groups: - name: service_alerts rules: - alert: HighErrorRate expr: | job:http_error_ratio:rate5m > 0.05 for: 5m labels: severity: critical annotations: summary: "High error rate detected" description: "{{ $labels.job }} has error rate of {{ $value | humanizePercentage }}" runbook_url: "https://wiki.example.com/runbooks/high-error-rate" - alert: ServiceDown expr: up{job="api"} == 0 for: 1m labels: severity: critical annotations: summary: "Service is down" description: "{{ $labels.instance }} is unreachable" - alert: HighLatencyP99 expr: | histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service) ) > 2 for: 10m labels: severity: warning annotations: summary: "High P99 latency" description: "P99 latency for {{ $labels.service }} is {{ $value }}s"
4. Grafana Dashboard Generation
Generate Grafana dashboard JSON:
{ "dashboard": { "title": "Service Overview", "uid": "service-overview", "tags": ["production", "api"], "timezone": "browser", "refresh": "30s", "time": { "from": "now-6h", "to": "now" }, "panels": [ { "title": "Request Rate", "type": "timeseries", "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }, "targets": [ { "expr": "sum(rate(http_requests_total{job=\"api\"}[5m])) by (status)", "legendFormat": "{{ status }}" } ], "fieldConfig": { "defaults": { "unit": "reqps" } } }, { "title": "Error Rate", "type": "stat", "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 }, "targets": [ { "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100" } ], "fieldConfig": { "defaults": { "unit": "percent", "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "yellow", "value": 1 }, { "color": "red", "value": 5 } ] } } } } ] } }
5. Scrape Configuration
Debug and generate scrape configurations:
scrape_configs: - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] action: replace regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 target_label: __address__
6. Metric Cardinality Analysis
Analyze and optimize metric cardinality:
# Top metrics by cardinality topk(10, count by (__name__)({__name__=~".+"})) # Label value counts count(count by (label_name) (metric_name)) # Memory usage by metric prometheus_tsdb_head_series / prometheus_tsdb_head_chunks
MCP Server Integration
This skill can leverage the following MCP servers:
| Server | Description | Installation |
|---|---|---|
| mcp-grafana (Grafana Labs) | Official Grafana MCP server | GitHub |
| loki-mcp (Grafana) | Loki log integration | GitHub |
Best Practices
PromQL
- Use recording rules - Pre-compute expensive queries
- Limit cardinality - Avoid unbounded labels
- Use appropriate ranges - Match scrape interval
- Prefer rate() over increase() - More accurate for graphs
Alerting
- Multi-window alerting - Combine short and long windows
- Clear runbook links - Include in annotations
- Appropriate severity - Match business impact
- Avoid alert fatigue - Alert on symptoms, not causes
Dashboards
- USE method - Utilization, Saturation, Errors
- RED method - Rate, Errors, Duration
- Consistent layout - Follow dashboard patterns
- Variable templates - Enable filtering
Process Integration
This skill integrates with the following processes:
- Initial Prometheus/Grafana setupmonitoring-setup.js
- SLO/SLI dashboard creationslo-sli-tracking.js
- Error budget dashboardserror-budget-management.js
Output Format
When executing operations, provide structured output:
{ "operation": "create-dashboard", "status": "success", "dashboard": { "uid": "service-overview", "url": "https://grafana.example.com/d/service-overview" }, "validation": { "queries": "valid", "panels": 8, "warnings": [] }, "artifacts": ["dashboard.json"] }
Error Handling
Common Issues
| Error | Cause | Resolution |
|---|---|---|
| Metric not scraped | Check scrape config and targets |
| Ambiguous join | Use or |
| Complex query | Use recording rules |
| Unbounded labels | Add label constraints |
Constraints
- Validate PromQL syntax before applying
- Test alerts in non-production first
- Consider cardinality impact of new metrics
- Use appropriate retention settings