Terminal-skills monitoring

监控与告警

install

source · Clone the upstream repo

git clone https://github.com/chaterm/terminal-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/chaterm/terminal-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/devops/monitoring" ~/.claude/skills/chaterm-terminal-skills-monitoring && rm -rf "$T"

manifest: devops/monitoring/SKILL.md

监控与告警

概述

Prometheus、Grafana、告警规则配置等技能。

Prometheus

基础查询（PromQL）

# 即时向量
http_requests_total
http_requests_total{job="api", status="200"}

# 范围向量
http_requests_total[5m]

# 偏移
http_requests_total offset 1h

# 聚合
sum(http_requests_total)
sum by (job) (http_requests_total)
sum without (instance) (http_requests_total)

# 速率
rate(http_requests_total[5m])
irate(http_requests_total[5m])

# 增量
increase(http_requests_total[1h])

# 直方图分位数
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

常用查询

# CPU 使用率
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 内存使用率
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# 磁盘使用率
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

# 网络流量
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# HTTP 请求速率
sum(rate(http_requests_total[5m])) by (status)

# 错误率
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# 延迟 P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

配置文件

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

告警规则

# rules/alerts.yml
groups:
  - name: node
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"

  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency"

Alertmanager

配置

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

Grafana

数据源配置

# provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Dashboard JSON 示例

{
  "dashboard": {
    "title": "Node Metrics",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
          }
        ]
      }
    ]
  }
}

常用面板查询

# CPU 使用率（时间序列）
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 内存使用（仪表盘）
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# 请求速率（柱状图）
sum(rate(http_requests_total[5m])) by (status)

# 延迟热力图
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)

常见场景

场景 1：Kubernetes 监控

# ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

场景 2：自定义指标

# Python 应用
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency', ['method', 'endpoint'])

@REQUEST_LATENCY.labels(method='GET', endpoint='/api').time()
def handle_request():
    REQUEST_COUNT.labels(method='GET', endpoint='/api', status='200').inc()
    # ...

start_http_server(8000)

场景 3：SLO 监控

# 可用性 SLO (99.9%)
1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d])))

# 错误预算消耗
(1 - (sum(rate(http_requests_total{status=~"5.."}[7d])) / sum(rate(http_requests_total[7d])))) / 0.999

# 延迟 SLO (P99 < 500ms)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[30d])) by (le)) < 0.5

场景 4：告警静默

# 创建静默
amtool silence add alertname=HighCPUUsage instance=node1 --duration=2h --comment="Maintenance"

# 查看静默
amtool silence query

# 删除静默
amtool silence expire <silence-id>

故障排查

问题	排查方法
指标缺失	检查 scrape 配置、target 状态
告警不触发	检查规则语法、Alertmanager 配置
查询慢	优化 PromQL、增加采样间隔
存储满	调整 retention、清理旧数据

# 检查 Prometheus targets
curl http://prometheus:9090/api/v1/targets

# 检查告警规则
curl http://prometheus:9090/api/v1/rules

# 检查 Alertmanager 状态
curl http://alertmanager:9093/api/v1/status

# 测试 PromQL
curl 'http://prometheus:9090/api/v1/query?query=up'