# Kafka Monitoring & Observability

Kafka monitoring and observability expert for Prometheus, Grafana, and JMX metrics. Use when setting up Kafka monitoring, configuring alerting rules, or building performance dashboards.

Install from [majiayu000/claude-skill-registry](https://github.com/majiayu000/claude-skill-registry):

```bash
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry

# Or copy only this skill (skills/data/kafka-observability/SKILL.md) into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/kafka-observability" ~/.claude/skills/majiayu000-claude-skill-registry-kafka-observability && rm -rf "$T"
```
Expert guidance for implementing comprehensive monitoring and observability for Apache Kafka using Prometheus and Grafana.
## When to Use This Skill
I activate when you need help with:
- Monitoring setup: "Set up Kafka monitoring", "configure Prometheus for Kafka", "Grafana dashboards for Kafka"
- Metrics collection: "Kafka JMX metrics", "export Kafka metrics to Prometheus"
- Alerting: "Kafka alerting rules", "alert on under-replicated partitions", "critical Kafka metrics"
- Troubleshooting: "Monitor Kafka performance", "track consumer lag", "broker health monitoring"
## What I Know

### Available Monitoring Components
This plugin provides a complete monitoring stack:
#### 1. Prometheus JMX Exporter Configuration

- Location: `plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml`
- Purpose: Export Kafka JMX metrics in Prometheus format
- Metrics Exported:
  - Broker topic metrics (bytes in/out, messages in, request rate)
  - Replica manager (under-replicated partitions, ISR shrinks/expands)
  - Controller metrics (active controller, offline partitions, leader elections)
  - Request metrics (produce/fetch latency)
  - Log metrics (flush rate, flush latency)
  - JVM metrics (heap, GC, threads, file descriptors)
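For orientation, the rules in that file follow the standard Prometheus JMX exporter format. A minimal illustrative excerpt (the shipped `kafka-jmx-exporter.yml` is far more extensive; the mapping below is an assumption, not a copy of it):

```yaml
# Illustrative JMX exporter rule — assumed, not copied from the plugin config.
# Maps kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec (Count attribute)
# to the kafka_server_broker_topic_metrics_bytesin_total counter used below.
lowercaseOutputName: true
rules:
  - pattern: kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec><>Count
    name: kafka_server_broker_topic_metrics_bytesin_total
    type: COUNTER
```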
#### 2. Grafana Dashboards (5 Dashboards)

- Location: `plugins/specweave-kafka/monitoring/grafana/dashboards/`
- Dashboards:
  - `kafka-cluster-overview.json` - Cluster health and throughput
  - `kafka-broker-metrics.json` - Per-broker performance
  - `kafka-consumer-lag.json` - Consumer lag monitoring
  - `kafka-topic-metrics.json` - Topic-level metrics
  - `kafka-jvm-metrics.json` - JVM health (heap, GC, threads)
#### 3. Grafana Provisioning

- Location: `plugins/specweave-kafka/monitoring/grafana/provisioning/`
- Files:
  - `dashboards/kafka.yml` - Dashboard provisioning config
  - `datasources/prometheus.yml` - Prometheus datasource config
## Setup Workflow 1: JMX Exporter (Self-Hosted Kafka)
For Kafka running on VMs or bare metal (non-Kubernetes).
### Step 1: Download JMX Prometheus Agent
```bash
# Download JMX Prometheus agent JAR
cd /opt
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar

# Copy JMX Exporter config
cp plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml /opt/kafka-jmx-exporter.yml
```
### Step 2: Configure Kafka Broker

Add the JMX exporter agent to the Kafka startup configuration:
```ini
# Edit Kafka startup (e.g., /etc/systemd/system/kafka.service)
[Service]
Environment="KAFKA_OPTS=-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
```
Or add it to `kafka-server-start.sh`:

```bash
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
```
### Step 3: Restart Kafka and Verify
```bash
# Restart Kafka broker
sudo systemctl restart kafka

# Verify JMX exporter is running (port 7071)
curl localhost:7071/metrics | grep kafka_server

# Expected output:
# kafka_server_broker_topic_metrics_bytesin_total{...} 12345
```
### Step 4: Configure Prometheus Scraping
Add Kafka brokers to Prometheus config:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets:
          - 'kafka-broker-1:7071'
          - 'kafka-broker-2:7071'
          - 'kafka-broker-3:7071'
    scrape_interval: 30s
```
```bash
# Reload Prometheus
sudo systemctl reload prometheus
# OR send SIGHUP
kill -HUP $(pidof prometheus)

# Verify scraping (the job label lives under .labels in the targets API)
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="kafka")'
```
## Setup Workflow 2: Strimzi (Kubernetes)
For Kafka running on Kubernetes with Strimzi Operator.
### Step 1: Create JMX Exporter ConfigMap
```bash
# Create ConfigMap from JMX exporter config
kubectl create configmap kafka-metrics \
  --from-file=kafka-metrics-config.yml=plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml \
  -n kafka
```
### Step 2: Configure Kafka CR with Metrics
```yaml
# kafka-cluster.yaml (add metricsConfig section)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-kafka-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    # ... other config ...
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
```
```bash
# Apply updated Kafka CR
kubectl apply -f kafka-cluster.yaml

# Verify metrics endpoint (wait for rolling restart)
# Strimzi names broker pods <cluster-name>-kafka-<index>
kubectl exec -it my-kafka-cluster-kafka-0 -n kafka -- curl localhost:9404/metrics | grep kafka_server
```
### Step 3: Install Prometheus Operator (if not installed)
```bash
# Add Prometheus Community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
```
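Before moving on, it helps to confirm the stack came up (service and pod names below assume the `prometheus` release name used above):

```bash
# All pods in the monitoring namespace should reach Running
kubectl get pods -n monitoring

# The services that later steps port-forward to should exist
kubectl get svc -n monitoring | grep -E 'prometheus|grafana'
```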
### Step 4: Create PodMonitor for Kafka
```yaml
# kafka-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-metrics
  namespace: kafka
  labels:
    app: strimzi
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
  podMetricsEndpoints:
    - port: tcp-prometheus
      interval: 30s
```
```bash
# Apply PodMonitor
kubectl apply -f kafka-podmonitor.yaml

# Verify Prometheus is scraping Kafka
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open: http://localhost:9090/targets
# Should see kafka-metrics/* targets
```
## Setup Workflow 3: Grafana Dashboards

### Installation (Docker Compose)
If using Docker Compose for local development:
```yaml
# docker-compose.yml (add to existing Kafka setup)
version: '3.8'

services:
  # ... Kafka services ...

  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
      - grafana-data:/var/lib/grafana

volumes:
  prometheus-data:
  grafana-data:
```
```bash
# Start monitoring stack
docker-compose up -d prometheus grafana

# Access Grafana
# URL: http://localhost:3000
# Username: admin
# Password: admin
```
### Installation (Kubernetes)
Dashboards are auto-provisioned if using kube-prometheus-stack:
```bash
# Create and label a ConfigMap for each dashboard
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  name=$(basename "$dashboard" .json)
  kubectl create configmap "kafka-dashboard-$name" \
    --from-file="$dashboard" \
    -n monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
  # Label for Grafana auto-discovery (kubectl does not glob resource names,
  # so label inside the loop rather than with kafka-dashboard-*)
  kubectl label configmap "kafka-dashboard-$name" -n monitoring grafana_dashboard=1 --overwrite
done

# Grafana will auto-import dashboards (wait 30-60 seconds)

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# URL: http://localhost:3000
# Username: admin
# Password: prom-operator (default kube-prometheus-stack password)
```
### Manual Dashboard Import
If auto-provisioning doesn't work:
```bash
# 1. Access Grafana UI
# 2. Go to: Dashboards → Import
# 3. Upload JSON files from:
#    plugins/specweave-kafka/monitoring/grafana/dashboards/

# Or use the Grafana API (the endpoint expects {"dashboard": {...}},
# so wrap raw dashboard exports in that envelope if needed)
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -d @"$dashboard"
done
```
## Dashboard Overview
### 1. Kafka Cluster Overview (`kafka-cluster-overview.json`)

Purpose: High-level cluster health
Key Metrics:
- Active Controller Count (should be exactly 1)
- Under-Replicated Partitions (should be 0) ⚠️ CRITICAL
- Offline Partitions Count (should be 0) ⚠️ CRITICAL
- Unclean Leader Elections (should be 0)
- Cluster Throughput (bytes in/out per second)
- Request Rate (produce, fetch requests per second)
- ISR Changes (shrinks/expands)
- Leader Election Rate
Use When: Checking overall cluster health
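To spot-check these panels directly in Prometheus, the queries below approximate what the panels compute (metric names assume the JMX exporter mapping used throughout this document):

```promql
# Exactly one active controller expected cluster-wide
sum(kafka_controller_active_controller_count)

# Under-replicated partitions — should return 0
sum(kafka_server_replica_manager_under_replicated_partitions)

# Cluster bytes-in throughput (per-second rate over 5 minutes)
sum(rate(kafka_server_broker_topic_metrics_bytesin_total[5m]))
```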
### 2. Kafka Broker Metrics (`kafka-broker-metrics.json`)

Purpose: Per-broker performance
Key Metrics:
- Broker CPU Usage (% utilization)
- Broker Heap Memory Usage
- Broker Network Throughput (bytes in/out)
- Request Handler Idle Percentage (low = CPU saturation)
- File Descriptors (open vs max)
- Log Flush Latency (p50, p99)
- JVM GC Collection Count/Time
Use When: Investigating broker performance issues
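Two illustrative queries for this dashboard. The request-handler metric name depends on your exporter rules; the name below is an assumption based on lowercasing the MBean `kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent`:

```promql
# Request handler idle ratio per broker — sustained low values
# indicate request-path CPU saturation (metric name assumed, see above)
avg by (instance) (kafka_server_kafka_request_handler_pool_request_handler_avg_idle_percent)

# Per-broker outbound network throughput
sum by (instance) (rate(kafka_server_broker_topic_metrics_bytesout_total[5m]))
```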
### 3. Kafka Consumer Lag (`kafka-consumer-lag.json`)

Purpose: Consumer lag monitoring
Key Metrics:
- Consumer Lag per Topic/Partition
- Total Lag per Consumer Group
- Offset Commit Rate
- Current Consumer Offset
- Log End Offset (producer offset)
- Consumer Group Members
Use When: Troubleshooting slow consumers or lag spikes
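These panels read from Kafka Exporter metrics rather than the JMX agent (see Troubleshooting below). Typical queries look like this; the group name is a placeholder:

```promql
# Total lag per consumer group (same metric the lag alert below uses)
sum by (consumergroup) (kafka_consumergroup_lag)

# Lag per topic/partition for one group — "my-group" is a placeholder
sum by (topic, partition) (kafka_consumergroup_lag{consumergroup="my-group"})
```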
### 4. Kafka Topic Metrics (`kafka-topic-metrics.json`)

Purpose: Topic-level metrics
Key Metrics:
- Messages Produced per Topic
- Bytes per Topic (in/out)
- Partition Count per Topic
- Replication Factor
- In-Sync Replicas
- Log Size per Partition
- Current Offset per Partition
- Partition Leader Distribution
Use When: Analyzing topic throughput and hotspots
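Per-topic queries like the ones below back these panels, assuming your JMX exporter rules preserve the `topic` MBean attribute as a label:

```promql
# Messages produced per topic, per second
sum by (topic) (rate(kafka_server_broker_topic_metrics_messagesin_total[5m]))

# Top 5 topics by inbound bytes — quick hotspot check
topk(5, sum by (topic) (rate(kafka_server_broker_topic_metrics_bytesin_total[5m])))
```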
### 5. Kafka JVM Metrics (`kafka-jvm-metrics.json`)

Purpose: JVM health monitoring
Key Metrics:
- Heap Memory Usage (used vs max)
- Heap Utilization Percentage
- GC Collection Rate (collections/sec)
- GC Collection Time (ms/sec)
- JVM Thread Count
- Heap Memory by Pool (young gen, old gen, survivor)
- Off-Heap Memory Usage (metaspace, code cache)
- GC Pause Time Percentiles (p50, p95, p99)
Use When: Investigating memory leaks or GC pauses
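Two of these panels correspond directly to the heap and GC alert expressions defined below, so you can verify them the same way:

```promql
# Heap utilization per broker (the heap alert fires above 0.9)
jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"}

# Time spent in GC, in ms per second (the GC alert fires above 500)
rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m])
```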
## Critical Alerts Configuration
Create Prometheus alerting rules for critical Kafka metrics:
```yaml
# kafka-alerts.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: monitoring
spec:
  groups:
    - name: kafka.rules
      interval: 30s
      rules:
        # CRITICAL: Under-Replicated Partitions
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replica_manager_under_replicated_partitions) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has under-replicated partitions"
            description: "{{ $value }} partitions are under-replicated. Data loss risk!"

        # CRITICAL: Offline Partitions
        - alert: KafkaOfflinePartitions
          expr: kafka_controller_offline_partitions_count > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has offline partitions"
            description: "{{ $value }} partitions are offline. Service degradation!"

        # CRITICAL: No Active Controller
        - alert: KafkaNoActiveController
          expr: kafka_controller_active_controller_count == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "No active Kafka controller"
            description: "Cluster has no active controller. Cannot perform administrative operations!"

        # WARNING: High Consumer Lag
        - alert: KafkaConsumerLagHigh
          expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} has high lag"
            description: "Lag is {{ $value }} messages. Consumers may be slow."

        # WARNING: High CPU Usage
        - alert: KafkaBrokerHighCPU
          expr: os_process_cpu_load{job="kafka"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} has high CPU usage"
            description: "CPU usage is {{ $value | humanizePercentage }}. Consider scaling."

        # WARNING: Low Heap Memory
        - alert: KafkaBrokerLowHeapMemory
          expr: jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} has low heap memory"
            description: "Heap usage is {{ $value | humanizePercentage }}. Risk of OOM!"

        # WARNING: High GC Time
        - alert: KafkaBrokerHighGCTime
          expr: rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m]) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} spending too much time in GC"
            description: "GC time is {{ $value }}ms/sec. Application pauses likely."
```
```bash
# Apply alerts (Kubernetes)
kubectl apply -f kafka-alerts.yml

# Verify alerts loaded
kubectl get prometheusrules -n monitoring
```
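Operator pickup can lag by a reconcile cycle. To confirm the group actually made it into Prometheus, query the rules API (service name assumes the kube-prometheus-stack release installed above):

```bash
# Port-forward and list the loaded kafka.rules alert names
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
curl -s http://localhost:9090/api/v1/rules | \
  jq '.data.groups[] | select(.name=="kafka.rules") | .rules[].name'
```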
## Troubleshooting

### "Prometheus not scraping Kafka metrics"
Symptoms: No Kafka metrics in Prometheus
Fix:
```bash
# 1. Verify JMX exporter is running
curl http://kafka-broker:7071/metrics

# 2. Check Prometheus targets (the job label lives under .labels)
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="kafka")'

# 3. Check Prometheus logs
kubectl logs -n monitoring prometheus-kube-prometheus-prometheus-0

# Common issues:
# - Firewall blocking port 7071
# - Incorrect scrape config
# - Kafka broker not running
```
"Grafana dashboards not loading"
Symptoms: Dashboards show "No data"
Fix:
```bash
# 1. Verify Prometheus datasource
#    Grafana UI → Configuration → Data Sources → Prometheus → Test

# 2. Check that Kafka metrics exist in Prometheus
#    Prometheus UI → Graph → Enter: kafka_server_broker_topic_metrics_bytesin_total

# 3. Verify dashboard queries match your Prometheus job name
#    Dashboard panels use job="kafka" by default;
#    if your job name differs, update the dashboard JSON
```
"Consumer lag metrics missing"
Symptoms: Consumer lag dashboard empty
Fix: Consumer lag metrics require Kafka Exporter (separate from JMX Exporter):
```bash
# Install Kafka Exporter (Kubernetes)
helm install kafka-exporter prometheus-community/prometheus-kafka-exporter \
  --namespace monitoring \
  --set kafkaServer={kafka-bootstrap:9092}

# Or run as a Docker container
docker run -d -p 9308:9308 \
  danielqsj/kafka-exporter \
  --kafka.server=kafka:9092 \
  --web.listen-address=:9308
```

```yaml
# Add to Prometheus scrape config
scrape_configs:
  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
```
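Once the exporter is running, a quick sanity check (lag series only appear for groups that have committed offsets, so an empty result may just mean no active consumers):

```bash
# Kafka Exporter should serve consumer-group and topic metrics on 9308
curl -s http://localhost:9308/metrics | grep -E 'kafka_consumergroup_lag|kafka_topic_partitions'
```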
## Integration with Other Skills
- kafka-iac-deployment: Set up monitoring during Terraform deployment
- kafka-kubernetes: Configure monitoring for Strimzi Kafka on K8s
- kafka-architecture: Use cluster sizing metrics to validate capacity planning
- kafka-cli-tools: Use kcat to generate test traffic and verify metrics
## Quick Reference Commands
```bash
# Check JMX exporter metrics
curl http://localhost:7071/metrics | grep -E "(kafka_server|kafka_controller)"

# Prometheus query example
curl -g 'http://localhost:9090/api/v1/query?query=kafka_server_replica_manager_under_replicated_partitions'

# Grafana dashboard export
curl http://admin:admin@localhost:3000/api/dashboards/uid/kafka-cluster-overview | jq .dashboard > backup.json

# Reload Prometheus config
kill -HUP $(pidof prometheus)

# Check Prometheus targets (the job label lives under .labels)
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="kafka")'
```
## Next Steps After Monitoring Setup
- Review all 5 Grafana dashboards to familiarize yourself with metrics
- Set up alerting (Slack, PagerDuty, email)
- Create runbooks for critical alerts (under-replicated partitions, offline partitions, no controller)
- Monitor for 7 days to establish baseline metrics
- Tune JVM settings based on GC metrics
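To make the baselining step concrete, one option is to pre-aggregate the key rates with Prometheus recording rules so week-over-week comparisons stay cheap. A minimal sketch using the same PrometheusRule CRD as the alerts above (the rule names are my own convention, not part of this plugin):

```yaml
# kafka-baselines.yml — hypothetical recording rules for baseline tracking
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-baselines
  namespace: monitoring
spec:
  groups:
    - name: kafka.baselines
      interval: 1m
      rules:
        # Cluster-wide bytes-in rate, pre-recorded for cheap long-range queries
        - record: kafka:bytesin:rate5m
          expr: sum(rate(kafka_server_broker_topic_metrics_bytesin_total[5m]))
        # Total lag per consumer group (requires Kafka Exporter)
        - record: kafka:consumergroup_lag:sum
          expr: sum by (consumergroup) (kafka_consumergroup_lag)
```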