# ai · prometheus

## Install

Clone the upstream repo:

```shell
git clone https://github.com/wpank/ai
```

Claude Code · install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/wpank/ai "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/prometheus" ~/.claude/skills/wpank-ai-prometheus && rm -rf "$T"
```

Manifest: `skills/devops/prometheus/SKILL.md`
# Prometheus

Production Prometheus setup covering scrape configuration, service discovery, recording rules, alert rules, and operational best practices for infrastructure and application monitoring.

## When to Use
| Scenario | Example |
|---|---|
| Set up metrics collection | New service needs Prometheus scraping |
| Configure service discovery | K8s pods, file-based, or static targets |
| Create recording rules | Pre-compute expensive PromQL queries |
| Design alert rules | SLO-based alerts for availability and latency |
| Production deployment | HA setup with retention and storage planning |
| Troubleshoot scraping | Targets down, metrics missing, relabeling issues |
## Architecture

```
Applications ──(/metrics)──→ Prometheus Server ──→ AlertManager → Slack/PD
     ↑                              │
client libraries                    ├──→ Grafana (dashboards)
  (prom client)                     └──→ Thanos/Cortex (long-term storage)
```
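Applications usually expose `/metrics` through an official client library (e.g. `prometheus_client`). As an illustration only, here is a minimal stdlib-Python sketch of the text exposition format Prometheus scrapes; the metric name and handler are hypothetical:

```python
# Minimal sketch of a /metrics endpoint using only the stdlib.
# In production, use an official client library instead; this just
# shows the text exposition format that Prometheus scrapes.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = {"GET": 0}  # hypothetical counter state, keyed by method


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for method, value in sorted(counters.items()):
        lines.append(f'http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS_TOTAL["GET"] += 1
        body = render_metrics(REQUESTS_TOTAL).encode()
        self.send_response(200)
        # version=0.0.4 is the standard text exposition content type
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


# To serve: HTTPServer(("", 9090), MetricsHandler).serve_forever()
```

Prometheus then scrapes this endpoint on the `scrape_interval` configured below.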
## Installation

### Kubernetes (Helm)

```shell
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageVolumeSize=50Gi
```
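Equivalently, the chart settings can live in a values file (a sketch mirroring the flags above; the key paths assume the kube-prometheus-stack chart layout):

```yaml
# values.yaml — sketch mirroring the --set flags above
prometheus:
  prometheusSpec:
    retention: 30d
    storageVolumeSize: 50Gi
```

Then install with `helm install prometheus prometheus-community/kube-prometheus-stack -f values.yaml --namespace monitoring --create-namespace`.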
## Core Configuration

### prometheus.yml

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-west-2

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: node-exporter
    static_configs:
      - targets: ["node1:9100", "node2:9100", "node3:9100"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Application metrics (TLS)
  - job_name: my-app
    scheme: https
    metrics_path: /metrics
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    static_configs:
      - targets: ["app1:9090", "app2:9090"]
```
## Service Discovery

### Kubernetes Pods (Annotation-Based)

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```
Pod annotations to enable scraping:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
```
### File-Based Discovery

```yaml
scrape_configs:
  - job_name: file-sd
    file_sd_configs:
      - files: ["/etc/prometheus/targets/*.json"]
        refresh_interval: 5m
```
`targets/production.json`:

```json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": { "env": "production", "service": "api" }
  }
]
```
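File-SD targets are often produced by tooling rather than edited by hand. A small Python sketch that writes the target file atomically (write to a temp file, then rename) so Prometheus never reads a partial JSON; the hosts and labels are just the example above:

```python
# Sketch: generate a file-SD target list from a (hypothetical) inventory.
# Prometheus watches the JSON file, so writes should be atomic.
import json
import os
import tempfile


def write_targets(path, hosts, labels):
    """Atomically write a Prometheus file-SD target list."""
    entry = [{"targets": hosts, "labels": labels}]
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(entry, f, indent=2)
    os.replace(tmp, path)  # atomic rename: readers never see a partial file


os.makedirs("targets", exist_ok=True)
write_targets(
    "targets/production.json",
    ["app1:9090", "app2:9090"],
    {"env": "production", "service": "api"},
)
```

Because `file_sd_configs` watches the files, the new targets are picked up without a Prometheus reload.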
### Discovery Method Comparison

| Method | Best For | Dynamic |
|---|---|---|
| `static_configs` | Fixed infrastructure, dev | No |
| `file_sd_configs` | CM-managed inventories | Yes (file watch) |
| `kubernetes_sd_configs` | K8s workloads | Yes (API watch) |
| `consul_sd_configs` | Consul service mesh | Yes (Consul watch) |
| `ec2_sd_configs` | AWS EC2 instances | Yes (API poll) |
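EC2 discovery is the only method in the table without an example above; a hedged sketch (the region, port, and tag filter are assumptions for illustration):

```yaml
scrape_configs:
  - job_name: ec2-nodes
    ec2_sd_configs:
      - region: us-west-2
        port: 9100                  # node-exporter port on the instances
        filters:
          - name: tag:Environment   # hypothetical tag filter
            values: [production]
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
```

Credentials come from the usual AWS chain (instance role, env vars) unless set explicitly in the config.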
## Recording Rules

Pre-compute expensive queries for dashboard and alert performance:

```yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: job:http_error_rate:ratio
        expr: job:http_errors:rate5m / job:http_requests:rate5m
      - record: job:http_duration:p95
        expr: >
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
  - name: resource_metrics
    interval: 30s
    rules:
      - record: instance:node_cpu:utilization
        expr: >
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory:utilization
        expr: >
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
      - record: instance:node_disk:utilization
        expr: >
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
```
### Naming Convention

`level:metric_name:operations`

| Part | Example | Meaning |
|---|---|---|
| level | `job`, `instance` | Aggregation level |
| metric_name | `http_requests` | Base metric |
| operations | `rate5m`, `ratio` | Applied functions |

For example, `job:http_requests:rate5m` reads as "request rate over 5m, aggregated by job".
## Alert Rules

```yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "{{ $labels.job }} down for >1 minute"
      - alert: HighErrorRate
        expr: job:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} for {{ $labels.job }}"
      - alert: HighP95Latency
        expr: job:http_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency {{ $value }}s for {{ $labels.job }}"
  - name: resources
    rules:
      - alert: HighCPU
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "CPU {{ $value }}% on {{ $labels.instance }}"
      - alert: HighMemory
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Memory {{ $value }}% on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Disk {{ $value }}% on {{ $labels.instance }}"
```
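Routing these alerts to Slack and PagerDuty is AlertManager's job. A minimal `alertmanager.yml` sketch that pages on `severity: critical` and sends everything else to Slack (the webhook URL and service key are placeholders):

```yaml
# alertmanager.yml — routing sketch keyed on the severity labels above
route:
  receiver: slack-default
  group_by: [alertname, job]
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder webhook
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: "<pagerduty-service-key>"           # placeholder key
```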
### Alert Severity Guide

| Severity | Threshold | Response |
|---|---|---|
| critical | Service down, data loss risk | Page on-call immediately |
| warning | Degraded, approaching limit | Investigate within hours |
| info | Notable but not urgent | Review in next business day |
## Validation

```shell
# Validate config syntax
promtool check config prometheus.yml

# Validate rule files
promtool check rules /etc/prometheus/rules/*.yml

# Test a query
promtool query instant http://localhost:9090 'up'

# Reload config without restart
curl -X POST http://localhost:9090/-/reload
```
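Beyond syntax checks, `promtool test rules` can unit-test rule behavior against synthetic series. A sketch exercising the ServiceDown alert (assumes the alert rules are saved as `alert_rules.yml` next to the test file):

```yaml
# servicedown_test.yml — run with: promtool test rules servicedown_test.yml
rule_files:
  - alert_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="my-app", instance="app1:9090"}'
        values: "0 0 0"           # target down for three consecutive minutes
    alert_rule_test:
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1:9090
            exp_annotations:
              summary: "app1:9090 is down"
              description: "my-app down for >1 minute"
```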
## Best Practices

| Practice | Detail |
|---|---|
| Naming | snake_case, `_total` suffix for counters, `_seconds` / `_bytes` for units |
| Scrape intervals 15–60s | Shorter wastes resources and storage |
| Recording rules for dashboards | Pre-compute anything queried repeatedly |
| Monitor Prometheus itself | Watch `prometheus_tsdb_head_series`, `prometheus_rule_evaluation_failures_total` |
| HA deployment | 2+ instances scraping same targets |
| Retention planning | Match `--storage.tsdb.retention.time` to disk capacity |
| Federation for scale | Global Prometheus aggregates from regional instances |
| Long-term storage | Thanos or Cortex for >30d retention |
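The federation pattern above can be sketched as a global instance scraping `/federate` on regional instances, pulling only pre-aggregated recording-rule series (the target hostnames are hypothetical):

```yaml
# Global Prometheus: federate aggregated series from regional instances
scrape_configs:
  - job_name: federate
    honor_labels: true            # keep labels set by the regional instances
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"job:.*"}'  # only recording-rule series, not raw metrics
    static_configs:
      - targets: ["prom-us-west:9090", "prom-eu-central:9090"]  # hypothetical
```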
## Troubleshooting Quick Reference

| Problem | Diagnosis | Fix |
|---|---|---|
| Target shows `DOWN` | Check the `/targets` page for the error | Fix firewall, verify endpoint, check TLS |
| Metrics missing | Query `up` for the job | Verify scrape config, check endpoint |
| High cardinality | `prometheus_tsdb_head_series` growing | Drop high-cardinality labels with `metric_relabel_configs` |
| Storage filling up | Check TSDB disk usage | Reduce retention, add disk, enable compaction |
| Slow queries | Check `prometheus_engine_query_duration_seconds` | Add recording rules, reduce range, limit series |
| Config not applied | Check Prometheus logs after reload | Fix syntax, POST `/-/reload` |
## NEVER Do

| Anti-Pattern | Why | Do Instead |
|---|---|---|
| Scrape interval < 5s | Overwhelms targets and storage | Use 15–60s intervals |
| High-cardinality labels (user ID, request ID) | Explodes TSDB series count | Use logs for high-cardinality data |
| Alert without `for:` duration | Fires on transient spikes | Always set a minimum `for:` |
| Skip recording rules | Dashboards compute expensive queries every load | Pre-compute with recording rules |
| Store secrets in prometheus.yml | Config often in Git | Use file-based secrets or env substitution |
| Ignore the `up` metric | Miss targets silently going down | Alert on `up == 0` for all jobs |
| Single Prometheus instance in prod | Single point of failure | Run 2+ replicas with shared targets |
| Unbounded retention | Disk fills, Prometheus crashes | Set explicit `--storage.tsdb.retention.time` |
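To act on the high-cardinality anti-pattern at scrape time, labels and whole metric families can be dropped with `metric_relabel_configs` (the label and metric names here are hypothetical):

```yaml
scrape_configs:
  - job_name: my-app
    static_configs:
      - targets: ["app1:9090"]
    metric_relabel_configs:
      - action: labeldrop
        regex: request_id         # drop a hypothetical high-cardinality label
      - source_labels: [__name__]
        action: drop
        regex: debug_.*           # drop hypothetical debug metric families
```

Unlike `relabel_configs` (applied to targets before the scrape), `metric_relabel_configs` runs on every sample after the scrape, so it is the right hook for trimming series.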
## Templates

| Template | Description |
|---|---|
| `templates/prometheus.yml` | Full config with static, file-based, and K8s discovery |
| `templates/alert-rules.yml` | 25+ alert rules by category |
| `templates/recording-rules.yml` | Pre-computed metrics for HTTP, latency, resources, SLOs |