Vibecosystem prometheus-patterns

PromQL queries, alerting rules, recording rules, Grafana dashboard JSON, and SLO error budgets

Install

Clone the upstream repo:

git clone https://github.com/vibeeval/vibecosystem

Manifest: skills/prometheus-patterns/skill.md

Prometheus Patterns

PromQL Essentials

Rate and Error Calculations

# Request rate (per second, 5m window)
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# P99 latency from histogram
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# P50 latency by endpoint
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler)
)

# Saturation: CPU usage per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
/ sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod) * 100
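
The same usage-over-limit pattern extends to memory; a sketch assuming cAdvisor and kube-state-metrics are scraped (they provide the metric names used here):

# Saturation: memory working set per pod, as % of limit
sum(container_memory_working_set_bytes) by (pod)
/ sum(kube_pod_container_resource_limits{resource="memory"}) by (pod) * 100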

SLO: Error Budget

# SLO: 99.9% availability over 30 days
# Error budget = 0.1% = 43.2 minutes/month

# Current burn rate (how fast the error budget is being consumed)
# Note: the error ratio must be computed before dividing by the budget,
# otherwise operator precedence applies the division first.
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
  )
) / (1 - 0.999)

# Remaining error budget (percentage)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  / (sum(increase(http_requests_total[30d])) * 0.001)
)
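
To sanity-check the burn-rate query, a worked example under the 99.9% SLO above:

# If 0.5% of requests failed over the last hour:
#   error ratio   = 0.005
#   allowed ratio = 1 - 0.999 = 0.001
#   burn rate     = 0.005 / 0.001 = 5x
# At a sustained 5x burn, the 30-day budget is exhausted in 30d / 5 = 6 days.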

Recording Rules

groups:
  - name: sli_rules
    interval: 30s
    rules:
      - record: job:http_request_rate:5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      - record: job:http_error_rate:5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)

      - record: job:http_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
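
Dashboards and alerts should then query the recorded series rather than re-evaluating the raw expression on every refresh; the job value here is illustrative:

# Grafana panel: error rate as a percentage for a single job
job:http_error_rate:5m{job="checkout"} * 100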

Alerting Rules

groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:5m > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for {{ $labels.job }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: HighLatency
        expr: job:http_latency_p99:5m > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms for {{ $labels.job }}"

      - alert: ErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > 14.4 * 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Burning error budget 14.4x faster than allowed"

Instrumentation (Go)

import "github.com/prometheus/client_golang/prometheus"

var (
    httpRequests = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "handler", "status"},
    )
    httpDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: []float64{0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
        },
        []string{"method", "handler"},
    )
)
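
The collectors above still need to be registered and wired into handlers. A minimal sketch continuing from those vars; the /orders route and ordersHandler are hypothetical, and the hard-coded "200" stands in for middleware that would capture the real status from a wrapped ResponseWriter:

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// ordersHandler is a hypothetical business handler.
func ordersHandler(w http.ResponseWriter, r *http.Request) {}

// instrument wraps a handler with RED metrics: request count and duration.
func instrument(name string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        timer := prometheus.NewTimer(httpDuration.WithLabelValues(r.Method, name))
        defer timer.ObserveDuration()
        next(w, r)
        httpRequests.WithLabelValues(r.Method, name, "200").Inc() // assumption: real code records the actual status
    }
}

func main() {
    prometheus.MustRegister(httpRequests, httpDuration)
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/orders", instrument("orders", ordersHandler))
    log.Fatal(http.ListenAndServe(":8080", nil))
}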

Checklist

  • RED metrics (Rate, Errors, Duration) for every service
  • Recording rules for expensive queries
  • Alerts have runbook links
  • Error budget alerts with multi-window burn rate (see the sketch after this list)
  • Histogram buckets match expected latency distribution
  • Labels have low cardinality (no user IDs, request IDs)
  • Grafana dashboards use recording rules, not raw queries
  • Alert severity matches response SLA
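
One way to implement the multi-window burn-rate item above, following the Google SRE Workbook's fast-burn pairing of a 1h and a 5m window (group and alert names are illustrative):

groups:
  - name: slo_burn_alerts
    rules:
      - alert: ErrorBudgetFastBurn
        # Both windows must exceed 14.4x: the long window catches the trend,
        # the short window lets the alert clear quickly after recovery.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > 14.4 * 0.001
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > 14.4 * 0.001
        labels:
          severity: critical
        annotations:
          summary: "Fast burn: 2% of the 30-day error budget consumed within 1h"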

Anti-Patterns

  • High cardinality labels (user_id, trace_id) causing metric explosion
  • Using avg() for latency instead of histograms/quantiles
  • Missing for clause in alerts, causing alert storms
  • Recording rules with too short intervals wasting resources
  • Alerting on symptoms without linking to causes
  • Not setting meaningful histogram bucket boundaries