Claude-skill-registry check-metrics
Query Prometheus metrics, check resource usage, and analyze performance on the Kagenti platform
install
source · Clone the upstream repo
```
git clone https://github.com/majiayu000/claude-skill-registry
```
Claude Code · Install into ~/.claude/skills/
```
T=$(mktemp -d) &&
  git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" &&
  mkdir -p ~/.claude/skills &&
  cp -r "$T/skills/data/check-metrics" ~/.claude/skills/majiayu000-claude-skill-registry-check-metrics &&
  rm -rf "$T"
```
manifest:
skills/data/check-metrics/SKILL.md · source content
Check Metrics Skill
This skill helps you query Prometheus metrics and analyze platform performance.
When to Use
- User asks about resource usage (CPU, memory, disk)
- Investigating performance issues
- Checking service health metrics
- After deployments to verify metrics collection
- Analyzing platform capacity and scaling needs
What This Skill Does
- Query Metrics: Execute PromQL queries against Prometheus
- Resource Usage: Check CPU, memory, disk usage
- Service Health: Verify service metrics and availability
- Performance Analysis: Analyze request rates, latency, errors
- Capacity Planning: Review resource trends
Examples
Access Prometheus UI
Prometheus UI: Port-forward to access locally
```
kubectl port-forward -n observability svc/prometheus 9090:9090 &
# Open http://localhost:9090
```
Grafana Explore: https://grafana.localtest.me:9443/explore
- Select Prometheus datasource
- Enter PromQL queries
Query Metrics via CLI
```
# Basic query
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=up' | python3 -m json.tool

# Query with time range
# (date -v-1H is BSD/macOS syntax; with GNU date use: date -u -d '1 hour ago' +%s)
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total[5m])' \
  --data-urlencode 'start='$(date -u -v-1H +%s) \
  --data-urlencode 'end='$(date -u +%s) \
  --data-urlencode 'step=60' | python3 -m json.tool
```
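If you run many ad-hoc queries, wrapping the kubectl exec boilerplate in a shell function keeps them short. A minimal sketch (the `promq` helper name is hypothetical), assuming the same observability namespace and in-cluster Prometheus endpoint as above:

```
# Hypothetical helper: run an instant PromQL query through the Grafana pod
promq() {
  kubectl exec -n observability deployment/grafana -- \
    curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
    --data-urlencode "query=$1" | python3 -m json.tool
}

# Usage
promq 'up'
promq 'topk(5, container_memory_working_set_bytes)'
```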
Common PromQL Queries
Service Health
```
# Check if services are up
up{job="kubernetes-pods"}

# Count running pods by namespace
count by (kubernetes_namespace) (up == 1)

# Check deployment replicas
kube_deployment_status_replicas_available

# Check StatefulSet replicas
kube_statefulset_status_replicas_ready
```
CPU Usage
```
# Pod CPU usage (percentage of limit)
sum(rate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m])) by (namespace, pod, container)
  /
sum(container_spec_cpu_quota{container!="",container!="POD"} / container_spec_cpu_period{container!="",container!="POD"}) by (namespace, pod, container)
  * 100

# Node CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Top CPU consuming pods
topk(10,
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
)
```
Memory Usage
```
# Pod memory usage (percentage of limit)
sum(container_memory_working_set_bytes{container!="",container!="POD"}) by (namespace, pod, container)
  /
sum(container_spec_memory_limit_bytes{container!="",container!="POD"}) by (namespace, pod, container)
  * 100

# Pod memory usage in bytes
container_memory_working_set_bytes{container!="",container!="POD"}

# Top memory consuming pods
topk(10,
  sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod)
)
```
Network Traffic
```
# Network receive rate
rate(container_network_receive_bytes_total[5m])

# Network transmit rate
rate(container_network_transmit_bytes_total[5m])

# Total network I/O by pod
sum by (pod) (
  rate(container_network_receive_bytes_total[5m])
  + rate(container_network_transmit_bytes_total[5m])
)
```
Disk Usage
```
# Filesystem usage percentage
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100

# PVC usage by namespace
sum by (namespace, persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes
)

# Disk I/O rate
rate(container_fs_writes_bytes_total[5m])
```
Pod Status
```
# Pods not running
kube_pod_status_phase{phase!="Running"}

# Pod restart count
kube_pod_container_status_restarts_total

# Pods waiting (pending)
kube_pod_status_phase{phase="Pending"}

# Pods in crash loop
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
```
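Note that kube_pod_container_status_restarts_total is a cumulative counter, so a high value may be old history. To surface only pods that restarted recently, an increase() over a window is usually more useful (the 1h window here is a judgment call):

```
# Pods that restarted within the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0
```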
Request Metrics (if instrumented)
```
# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# Request latency (p95)
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
```
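The raw 5xx rate above is absolute; for alerting it is usually normalized against total traffic. A sketch, assuming the same http_requests_total counter and status label as above:

```
# Error ratio as a percentage of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
```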
Check Specific Components
Prometheus Metrics
```
# Check Prometheus scrape targets
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/targets' | python3 -m json.tool

# Prometheus storage size
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/status/tsdb' | python3 -m json.tool
```
Grafana Metrics
```
# Grafana datasource queries
grafana_datasource_request_total

# Grafana dashboard loads
grafana_page_response_status_total
```
Keycloak Metrics (if exposed)
```
# Keycloak sessions
keycloak_sessions

# Keycloak login failures
keycloak_failed_login_attempts
```
Istio Metrics
```
# Istio requests
istio_requests_total

# Istio request duration
histogram_quantile(0.95,
  rate(istio_request_duration_milliseconds_bucket[5m])
)

# Istio error rate
rate(istio_requests_total{response_code=~"5.."}[5m])
```
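To compare services at a glance, the same pattern gives a per-service success rate. A sketch assuming the standard istio_requests_total labels (destination_service_name is the usual label in recent Istio versions):

```
# Success rate by destination service (1.0 = no 5xx errors)
sum by (destination_service_name) (rate(istio_requests_total{response_code!~"5.."}[5m]))
  / sum by (destination_service_name) (rate(istio_requests_total[5m]))
```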
Resource Monitoring via kubectl
Quick Resource Check
```
# Node resources
kubectl top nodes

# Pod resources (all namespaces)
kubectl top pods -A --sort-by=memory

# Pod resources (specific namespace)
kubectl top pods -n observability --sort-by=cpu

# Container resources in pod
kubectl top pod <pod-name> -n <namespace> --containers
```
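For a live view while investigating, these can be wrapped in watch (not installed by default on every system):

```
# Refresh the top memory consumers every 10 seconds
watch -n 10 'kubectl top pods -A --sort-by=memory | head -15'
```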
Resource Limits and Requests
```
# Show resource requests/limits for deployment
kubectl describe deployment <name> -n <namespace> | grep -A 5 "Limits\|Requests"

# Show all pod resource requests
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'
```
Grafana Dashboards
Access: https://grafana.localtest.me:9443/dashboards
Key Dashboards:
- Kubernetes / Compute Resources / Cluster - Overall cluster metrics
- Kubernetes / Compute Resources / Namespace (Pods) - Per-namespace pod resources
- Kubernetes / Compute Resources / Pod - Individual pod metrics
- Prometheus - Prometheus self-monitoring
- Loki Logs - Log volume and patterns
- Istio Mesh - Service mesh metrics
Create Custom Queries in Grafana
- Navigate to Explore (compass icon in sidebar)
- Select Prometheus datasource
- Enter PromQL query
- Click "Run query"
- Optionally save to dashboard
Troubleshooting with Metrics
Issue: High CPU Usage
```
# Find pods using >80% CPU
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod, container)
  / sum(container_spec_cpu_quota / container_spec_cpu_period) by (namespace, pod, container)
  * 100 > 80
```
Issue: High Memory Usage
```
# Find pods using >80% memory
sum(container_memory_working_set_bytes) by (namespace, pod, container)
  / sum(container_spec_memory_limit_bytes) by (namespace, pod, container)
  * 100 > 80
```
Issue: Service Not Responding
```
# Check if service endpoints are up
up{job="kubernetes-service-endpoints"}

# Check scrape failures
up == 0
```
Issue: Disk Full
```
# Find PVCs >80% full
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 80
```
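Any of these threshold queries can also be run from the CLI and reduced to a readable list. A sketch reusing the in-cluster endpoint from earlier; the label names match the kubelet_volume_stats_* series:

```
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 80' \
  | python3 -c "
import sys, json
# Print namespace, PVC name, and usage percentage for each match
for r in json.load(sys.stdin)['data']['result']:
    m = r['metric']
    print(m.get('namespace', '?'), m.get('persistentvolumeclaim', '?'), round(float(r['value'][1]), 1), '%')
"
```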
Alert Query Testing
When investigating alerts, test the PromQL query:
```
# Get alert query from Grafana
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    query = rule['data'][0]['model']['expr']
    print(f'Query: {query}')
"

# Test the query
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode "query=<QUERY_FROM_ABOVE>" | python3 -m json.tool
```
Metrics Collection Issues
Check if Metrics Are Being Scraped
```
# Check last scrape time
time() - timestamp(up)

# Check scrape duration
scrape_duration_seconds
```
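If a target has disappeared entirely, up == 0 will not match because no series exists at all; absent() covers that case. Substitute the job label you expect:

```
# Returns 1 when no up series exists for the job at all
absent(up{job="kubernetes-pods"})
```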
Verify Metric Exists
```
# List all metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/label/__name__/values' | python3 -m json.tool

# Search for specific metric
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/label/__name__/values' | grep "your_metric"
```
Pro Tips
- Use `rate()` for counters: `rate(metric[5m])` instead of raw counter values
- Aggregate with `by`/`without`: `sum by (namespace) (metric)` to group metrics
- Use recording rules for frequently used complex queries (see the sketch after this list)
- Set appropriate time ranges: use `[5m]` for rate calculations
- Test queries in Explore first, before adding to dashboards or alerts
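For the recording-rules tip, a minimal sketch of a Prometheus rule file; the record name follows the conventional level:metric:operations pattern, and how the file gets loaded depends on your Prometheus deployment:

```
groups:
  - name: kagenti-recording-rules
    rules:
      # Precompute per-namespace CPU usage so dashboards query the cheap series
      - record: namespace:container_cpu_usage_seconds:rate5m
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
```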
🤖 Generated with Claude Code