Claude-skill-registry holmesgpt-skill

Guide for implementing HolmesGPT - an AI agent for troubleshooting cloud-native environments. Use when investigating Kubernetes issues, analyzing alerts from Prometheus/AlertManager/PagerDuty, performing root cause analysis, configuring HolmesGPT installations (CLI/Helm/Docker), setting up AI providers (OpenAI/Anthropic/Azure), creating custom toolsets, or integrating with observability platforms (Grafana, Loki, Tempo, DataDog).

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/holmesgpt-skill" ~/.claude/skills/majiayu000-claude-skill-registry-holmesgpt-skill && rm -rf "$T"

manifest: skills/data/holmesgpt-skill/SKILL.md

HolmesGPT Skill

AI-powered troubleshooting for Kubernetes and cloud-native environments.

Overview

HolmesGPT is a CNCF Sandbox project that connects AI models with live observability data to investigate infrastructure problems, find root causes, and suggest remediations. It operates with read-only access and respects RBAC permissions, making it safe for production environments.

Quick Reference

Topic	Reference
Installation	`references/installation.md`
Configuration	`references/configuration.md`
Data Sources	`references/data-sources.md`
Commands	`references/commands.md`
Troubleshooting	`references/troubleshooting.md`
HTTP API	`references/http-api.md`
Integrations	`references/integrations.md`

Key Features

Root Cause Analysis: Investigates alerts and cluster issues
Multi-Source Integration: 30+ toolsets (K8s, Prometheus, Grafana)
Alert Integration: AlertManager, PagerDuty, OpsGenie, Jira, Slack
Interactive Mode: Troubleshooting with `/run`, `/show`, `/clear`
Custom Toolsets: Extend with proprietary tools via YAML configuration
CI/CD Integration: Automated deployment failure investigation

Installation Quick Start

CLI (Homebrew)

brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt
export ANTHROPIC_API_KEY="your-key"  # or OPENAI_API_KEY with an OpenAI model
holmes ask "what pods are unhealthy?" --model="anthropic/claude-sonnet-4-20250514"

Kubernetes (Helm)

helm repo add robusta https://robusta-charts.storage.googleapis.com
helm repo update
helm install holmesgpt robusta/holmes -f values.yaml

Docker

docker run -it --net=host \
  -e OPENAI_API_KEY="your-key" \
  -v ~/.kube/config:/root/.kube/config \
  us-central1-docker.pkg.dev/genuine-flight-317411/devel/holmes \
  ask "what pods are crashing?"

Essential Commands

# Basic investigation
holmes ask "what pods are unhealthy and why?"
holmes ask "why is my deployment failing?"

# Interactive mode
holmes ask "investigate issue" --interactive

# Alert investigation
holmes investigate alertmanager --alertmanager-url http://localhost:9093
holmes investigate pagerduty --pagerduty-api-key <KEY> --update

# With file context
holmes ask "summarize the key points" -f ./logs.txt

# CI/CD integration
holmes ask "why did deployment fail?" --destination slack --slack-token <TOKEN> --slack-channel <CHANNEL>
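
For the CI/CD case, one option is to trigger the investigation only when a rollout actually fails. The sketch below assumes `kubectl` and `holmes` are on the PATH; the function name `investigate_on_failure`, the 120-second timeout, and the example deployment and namespace are illustrative assumptions, not part of HolmesGPT.

```shell
# Hypothetical CI helper: wait for a rollout and ask HolmesGPT only if it fails.
investigate_on_failure() {
  deployment="$1"
  namespace="$2"
  # kubectl rollout status exits non-zero when the rollout does not complete in time.
  if ! kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=120s; then
    holmes ask "why did deployment $deployment in namespace $namespace fail?"
    # Propagate the failure so the pipeline step is still marked as failed.
    return 1
  fi
}

# Example call (names are placeholders):
# investigate_on_failure checkout-api payments
```

Returning a non-zero status after the investigation keeps the pipeline red while still attaching the analysis to the job output.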

Supported AI Providers

Provider	Environment Variable	Models
Anthropic	`ANTHROPIC_API_KEY`	Sonnet 4, Opus 4.5
OpenAI	`OPENAI_API_KEY`	GPT-4.1, GPT-4o
Azure OpenAI	`AZURE_API_KEY`	GPT-4.1
AWS Bedrock	AWS credentials	Claude 3.5 Sonnet
Google Gemini	`GEMINI_API_KEY`	Gemini 1.5 Pro
Vertex AI	`VERTEXAI_PROJECT`	Gemini 1.5 Pro
Ollama	Local install	Llama 3.1, Mistral
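
A missing key is a common reason a first run fails, so a small pre-flight check can help. The sketch below only looks at the key-style variables from the table; the function name `detect_provider_key` is a hypothetical helper, and Bedrock, Vertex AI, and Ollama are not covered because they do not use a single API-key variable.

```shell
# Hypothetical helper: print the name of the first provider key that is set.
detect_provider_key() {
  for var in ANTHROPIC_API_KEY OPENAI_API_KEY AZURE_API_KEY GEMINI_API_KEY; do
    # Indirect lookup of the variable named in $var (POSIX-compatible).
    eval "value=\${$var:-}"
    if [ -n "$value" ]; then
      printf '%s\n' "$var"
      return 0
    fi
  done
  echo "no provider API key found in the environment" >&2
  return 1
}

# Example: only call holmes when a key is present.
# detect_provider_key && holmes ask "what pods are unhealthy?"
```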

Basic Helm Values Structure

# values.yaml for Kubernetes deployment
image:
  repository: robustadev/holmes
  tag: latest

env:
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: holmesgpt-secrets
        key: anthropic-api-key

# Model configuration
modelList:
  sonnet:
    api_key: "{{ env.ANTHROPIC_API_KEY }}"
    model: anthropic/claude-sonnet-4-20250514
    temperature: 0

# Toolsets to enable
toolsets:
  kubernetes/core:
    enabled: true
  kubernetes/logs:
    enabled: true
  prometheus/metrics:
    enabled: true

# Resources
resources:
  requests:
    memory: "1024Mi"
    cpu: "100m"
  limits:
    memory: "1024Mi"

# RBAC (read-only by default)
createServiceAccount: true

Interactive Mode Commands

Command	Description
`/clear`	Reset context when changing topics
`/run`	Execute custom commands and share output with AI
`/show`	Display complete tool outputs
`/context`	Review accumulated investigation information

Custom Toolset Example

# custom-toolset.yaml
toolsets:
  my-custom-tool:
    description: "Custom diagnostic tool"
    tools:
      - name: check_service_health
        description: "Check health of a specific service"
        command: |
          curl -s http://{{ service_name }}.{{ namespace }}.svc.cluster.local/health
        parameters:
          - name: service_name
            description: "Name of the service"
          - name: namespace
            description: "Kubernetes namespace"

Use with:

holmes ask "check health" -t custom-toolset.yaml
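
To make the templating concrete, the snippet below mimics how the two parameters fill in the `command` URL. It is plain string substitution for illustration, not HolmesGPT's own renderer, and the service and namespace values are made up.

```shell
# Illustration only: substitute the two template parameters into the health URL.
render_health_url() {
  service_name="$1"
  namespace="$2"
  printf 'http://%s.%s.svc.cluster.local/health\n' "$service_name" "$namespace"
}

render_health_url payment-service prod
# prints: http://payment-service.prod.svc.cluster.local/health
```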

Kubernetes Annotations for Integration

# Add to Services/Deployments for HolmesGPT context
metadata:
  annotations:
    holmesgpt.dev/runbook: |
      This service handles payment processing.
      Common issues: database connectivity, API rate limits.
      Check: kubectl logs -l app=payment-service

Environment Variables Reference

Variable	Description	Default
`HOLMES_CONFIG_PATH`	Config file path	`~/.holmes/config.yaml`
`HOLMES_LOG_LEVEL`	Log verbosity	`INFO`
`PROMETHEUS_URL`	Prometheus server URL	-
`GITHUB_TOKEN`	GitHub API token	-
`DATADOG_API_KEY`	DataDog API key	-
`CONFLUENCE_BASE_URL`	Confluence URL	-
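
A wrapper script can apply the defaults listed in the table without overriding values that are already set. This is a sketch based only on the table above; `apply_holmes_defaults` is a hypothetical helper name.

```shell
# Hypothetical helper: fill in the documented defaults only when a variable
# is unset or empty, then export them for child processes such as holmes.
apply_holmes_defaults() {
  : "${HOLMES_CONFIG_PATH:=$HOME/.holmes/config.yaml}"
  : "${HOLMES_LOG_LEVEL:=INFO}"
  export HOLMES_CONFIG_PATH HOLMES_LOG_LEVEL
}

# Example: keep an explicit override, default everything else.
# HOLMES_LOG_LEVEL=DEBUG
# apply_holmes_defaults
```

Variables with no default in the table (`PROMETHEUS_URL`, `GITHUB_TOKEN`, and so on) are left alone here and have to be set explicitly when the matching integration is used.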

Best Practices

Use Specific Queries: Include namespace, deployment name, symptoms
Start with Claude Sonnet 4.0/4.5: Best accuracy for complex investigations
Enable Relevant Toolsets: Only enable what you need to reduce noise
Use Interactive Mode: For complex multi-step investigations
Set Up Runbooks: Provide context for known alert types
CI/CD Integration: Automate deployment failure analysis
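
As an example of the first practice, a small helper can force every question to carry a namespace, a deployment name, and a symptom. The function name `build_query` and the sample values are assumptions for illustration.

```shell
# Hypothetical helper: compose a specific question from its three parts.
build_query() {
  namespace="$1"
  deployment="$2"
  symptom="$3"
  printf 'In namespace %s, why is deployment %s %s?\n' "$namespace" "$deployment" "$symptom"
}

# Example:
# holmes ask "$(build_query payments checkout-api 'restarting every few minutes')"
```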

Security Considerations

HolmesGPT uses read-only access (`get`, `list`, `watch` only)
Respects existing RBAC permissions
Never modifies, creates, or deletes resources
API keys stored in Kubernetes Secrets
Data not used for model training
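
The read-only claim can be spot-checked against a live cluster with `kubectl auth can-i`. The sketch below assumes you know the namespace and name of the service account HolmesGPT runs as (both are placeholders here) and that your own user is allowed to impersonate it; `check_read_only` is a hypothetical helper that tests a single resource type.

```shell
# Hypothetical helper: succeed only if the service account can list pods
# but cannot delete them.
check_read_only() {
  sa="system:serviceaccount:$1:$2"
  [ "$(kubectl auth can-i list pods --as "$sa")" = "yes" ] &&
    [ "$(kubectl auth can-i delete pods --as "$sa")" = "no" ]
}

# Example call (namespace and service account name are placeholders):
# check_read_only default holmes && echo "read-only for pods"
```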

Official Resources

Documentation: https://holmesgpt.dev/
GitHub: https://github.com/robusta-dev/holmesgpt
Helm Chart: https://github.com/robusta-dev/holmesgpt/tree/master/helm/holmes
Slack Community: Cloud Native Slack