Claude-skill-registry holmesgpt-skill
Guide for implementing HolmesGPT - an AI agent for troubleshooting cloud-native environments. Use when investigating Kubernetes issues, analyzing alerts from Prometheus/AlertManager/PagerDuty, performing root cause analysis, configuring HolmesGPT installations (CLI/Helm/Docker), setting up AI providers (OpenAI/Anthropic/Azure), creating custom toolsets, or integrating with observability platforms (Grafana, Loki, Tempo, DataDog).
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/holmesgpt-skill" ~/.claude/skills/majiayu000-claude-skill-registry-holmesgpt-skill && rm -rf "$T"
HolmesGPT Skill
AI-powered troubleshooting for Kubernetes and cloud-native environments.
Overview
HolmesGPT is a CNCF Sandbox project that connects AI models with live observability data to investigate infrastructure problems, find root causes, and suggest remediations. It operates with read-only access and respects RBAC permissions, making it safe for production environments.
Quick Reference
| Topic | Reference |
|---|---|
| Installation | |
| Configuration | |
| Data Sources | |
| Commands | |
| Troubleshooting | |
| HTTP API | |
| Integrations | |
Key Features
- Root Cause Analysis: Investigates alerts and cluster issues
- Multi-Source Integration: 30+ toolsets (K8s, Prometheus, Grafana)
- Alert Integration: AlertManager, PagerDuty, OpsGenie, Jira, Slack
- Interactive Mode: Troubleshooting with `/run`, `/show`, `/clear`
- Custom Toolsets: Extend with proprietary tools via YAML configuration
- CI/CD Integration: Automated deployment failure investigation
Installation Quick Start
CLI (Homebrew)
```shell
brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt
export ANTHROPIC_API_KEY="your-key"  # or OPENAI_API_KEY
holmes ask "what pods are unhealthy?"
```
Kubernetes (Helm)
```shell
helm repo add robusta https://robusta-charts.storage.googleapis.com
helm repo update
helm install holmesgpt robusta/holmes -f values.yaml
```
Docker
```shell
docker run -it --net=host \
  -e OPENAI_API_KEY="your-key" \
  -v ~/.kube/config:/root/.kube/config \
  us-central1-docker.pkg.dev/genuine-flight-317411/devel/holmes \
  ask "what pods are crashing?"
```
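Each quick start path above needs a provider key in the environment before `holmes ask` can reach a model. A minimal sketch of a guard for that follows; `have_llm_key` and `ask_holmes` are hypothetical helper names, not part of HolmesGPT, and only the two variables named in the Homebrew example are checked.

```shell
# Return 0 if at least one of the provider keys from the quick start is set.
# have_llm_key is a hypothetical helper, not a HolmesGPT command.
have_llm_key() {
  [ -n "${ANTHROPIC_API_KEY:-}" ] || [ -n "${OPENAI_API_KEY:-}" ]
}

# Run `holmes ask` only when a key is available; otherwise explain and fail.
ask_holmes() {
  if ! have_llm_key; then
    echo "Set ANTHROPIC_API_KEY or OPENAI_API_KEY first" >&2
    return 1
  fi
  holmes ask "$1"
}
```

Calling `ask_holmes "what pods are unhealthy?"` then fails fast with a readable message instead of a provider authentication error.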
Essential Commands
```shell
# Basic investigation
holmes ask "what pods are unhealthy and why?"
holmes ask "why is my deployment failing?"

# Interactive mode
holmes ask "investigate issue" --interactive

# Alert investigation
holmes investigate alertmanager --alertmanager-url http://localhost:9093
holmes investigate pagerduty --pagerduty-api-key <KEY> --update

# With file context
holmes ask "summarize the key points" -f ./logs.txt

# CI/CD integration
holmes ask "why did deployment fail?" --destination slack --slack-token <TOKEN>
```
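The CI/CD command above can be wired into a pipeline step so that it runs only when a rollout actually fails. The following is a sketch under assumptions: `investigate_rollout` is a hypothetical wrapper, the 120 second timeout is arbitrary, and `SLACK_TOKEN` is an assumed variable name for the token. Only the `holmes` flags shown above are used; your setup may need further Slack options.

```shell
# investigate_rollout is a hypothetical wrapper, not a HolmesGPT command.
# Arguments: deployment name, namespace.
investigate_rollout() {
  deploy="$1"
  ns="$2"
  # kubectl rollout status exits non-zero if the rollout misses the timeout.
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s; then
    echo "rollout of $deploy succeeded"
    return 0
  fi
  # Same flags as the CI/CD example in Essential Commands.
  holmes ask "why did deployment $deploy in namespace $ns fail?" \
    --destination slack --slack-token "$SLACK_TOKEN"
  # Keep the pipeline step red even though the investigation ran.
  return 1
}
```

Returning 1 after the investigation keeps the failed deployment visible in the pipeline while the analysis goes to Slack.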
Supported AI Providers
| Provider | Environment Variable | Models |
|---|---|---|
| Anthropic | `ANTHROPIC_API_KEY` | Sonnet 4, Opus 4.5 |
| OpenAI | `OPENAI_API_KEY` | GPT-4.1, GPT-4o |
| Azure OpenAI | | GPT-4.1 |
| AWS Bedrock | AWS credentials | Claude 3.5 Sonnet |
| Google Gemini | | Gemini 1.5 Pro |
| Vertex AI | | Gemini 1.5 Pro |
| Ollama | Local install | Llama 3.1, Mistral |
Basic Helm Values Structure
```yaml
# values.yaml for Kubernetes deployment
image:
  repository: robustadev/holmes
  tag: latest

env:
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: holmesgpt-secrets
        key: anthropic-api-key

# Model configuration
modelList:
  sonnet:
    api_key: "{{ env.ANTHROPIC_API_KEY }}"
    model: anthropic/claude-sonnet-4-20250514
    temperature: 0

# Toolsets to enable
toolsets:
  kubernetes/core:
    enabled: true
  kubernetes/logs:
    enabled: true
  prometheus/metrics:
    enabled: true

# Resources
resources:
  requests:
    memory: "1024Mi"
    cpu: "100m"
  limits:
    memory: "1024Mi"

# RBAC (read-only by default)
createServiceAccount: true
```
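The values above read the API key from a Secret named `holmesgpt-secrets` under the key `anthropic-api-key`, so that Secret has to exist before `helm install`. One way to create it is sketched below; `create_holmes_secret` is a hypothetical helper, and the default namespace is an assumption that should match wherever the chart is installed.

```shell
# create_holmes_secret is a hypothetical helper. The secret name and key
# mirror the secretKeyRef in values.yaml; the namespace is an assumption.
create_holmes_secret() {
  ns="${1:-default}"
  kubectl create secret generic holmesgpt-secrets \
    --namespace "$ns" \
    --from-literal=anthropic-api-key="$ANTHROPIC_API_KEY"
}
```

For example, `create_holmes_secret monitoring` would create the Secret in a `monitoring` namespace, using the key exported in the current shell.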
Interactive Mode Commands
| Command | Description |
|---|---|
| `/clear` | Reset context when changing topics |
| `/run` | Execute custom commands and share output with AI |
| `/show` | Display complete tool outputs |
| | Review accumulated investigation information |
Custom Toolset Example
```yaml
# custom-toolset.yaml
toolsets:
  my-custom-tool:
    description: "Custom diagnostic tool"
    tools:
      - name: check_service_health
        description: "Check health of a specific service"
        command: |
          curl -s http://{{ service_name }}.{{ namespace }}.svc.cluster.local/health
        parameters:
          - name: service_name
            description: "Name of the service"
          - name: namespace
            description: "Kubernetes namespace"
```
Use with:
```shell
holmes ask "check health" -t custom-toolset.yaml
```
Kubernetes Annotations for Integration
```yaml
# Add to Services/Deployments for HolmesGPT context
metadata:
  annotations:
    holmesgpt.dev/runbook: |
      This service handles payment processing.
      Common issues: database connectivity, API rate limits.
      Check: kubectl logs -l app=payment-service
```
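If editing the manifest is inconvenient, the same annotation can be attached to a running workload with `kubectl annotate`. This is a sketch only: `annotate_runbook` is a hypothetical helper, it targets Deployments specifically, and the annotation key is the one shown above.

```shell
# annotate_runbook is a hypothetical helper.
# Arguments: deployment name, namespace, runbook text.
annotate_runbook() {
  # --overwrite replaces an existing runbook annotation instead of failing.
  kubectl annotate deployment "$1" -n "$2" --overwrite \
    "holmesgpt.dev/runbook=$3"
}
```

A call such as `annotate_runbook payment-service payments "Common issues: database connectivity, API rate limits."` sets the runbook text without a redeploy. Annotations applied this way are lost if the manifest is later re-applied without them.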
Environment Variables Reference
| Variable | Description | Default |
|---|---|---|
| | Config file path | |
| | Log verbosity | |
| | Prometheus server URL | - |
| | GitHub API token | - |
| | DataDog API key | - |
| | Confluence URL | - |
Best Practices
- Use Specific Queries: Include namespace, deployment name, symptoms
- Start with Claude Sonnet 4.0/4.5: Best accuracy for complex investigations
- Enable Relevant Toolsets: Only enable what you need to reduce noise
- Use Interactive Mode: For complex multi-step investigations
- Set Up Runbooks: Provide context for known alert types
- CI/CD Integration: Automate deployment failure analysis
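To make the first practice concrete, a question can be assembled from the namespace, deployment name, and observed symptom before it is passed to `holmes ask`. The sketch below uses a hypothetical helper, `build_query`; the wording of the question is illustrative, not a format HolmesGPT requires.

```shell
# build_query is a hypothetical helper that assembles a specific question.
# Arguments: namespace, deployment name, observed symptom.
build_query() {
  printf 'In namespace %s, deployment %s shows: %s. What is the root cause?\n' \
    "$1" "$2" "$3"
}

# Usage (requires the holmes CLI):
#   holmes ask "$(build_query payments payment-service 'pods in CrashLoopBackOff')"
```

Compared with a bare "why is my deployment failing?", the generated question names where to look and what was observed.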
Security Considerations
- HolmesGPT uses read-only access (`get`, `list`, `watch` only)
- Respects existing RBAC permissions
- Never modifies, creates, or deletes resources
- API keys stored in Kubernetes Secrets
- Data not used for model training
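The read-only claim can be spot-checked against a live cluster with `kubectl auth can-i`, impersonating the service account HolmesGPT runs as. The sketch below assumes you know that service account's namespace and name, which depend on your installation. `check_read_only` is a hypothetical helper, and it probes only write verbs on pods as a sample.

```shell
# check_read_only is a hypothetical helper.
# Arguments: namespace, service account name (both installation-specific).
check_read_only() {
  sa="system:serviceaccount:$1:$2"
  rc=0
  for verb in create update patch delete; do
    # `kubectl auth can-i` prints "yes" when the action is allowed.
    if [ "$(kubectl auth can-i "$verb" pods --as="$sa" -n "$1" 2>/dev/null)" = "yes" ]; then
      echo "unexpected: $2 can $verb pods"
      rc=1
    fi
  done
  return "$rc"
}
```

The function prints nothing and returns 0 when none of the write verbs are permitted. Running it requires that your own user is allowed to impersonate service accounts.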
Official Resources
- Documentation: https://holmesgpt.dev/
- GitHub: https://github.com/robusta-dev/holmesgpt
- Helm Chart: https://github.com/robusta-dev/holmesgpt/tree/master/helm/holmes
- Slack Community: Cloud Native Slack