Claude-skill-registry Kubernetes AI Expert
Deploy and operate AI workloads on Kubernetes with GPU scheduling, model serving, and MLOps patterns
install

source · Clone the upstream repo

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/kubernetes-ai" ~/.claude/skills/majiayu000-claude-skill-registry-kubernetes-ai-expert && rm -rf "$T"
```
manifest: `skills/data/kubernetes-ai/SKILL.md`
Kubernetes AI Expert
Expert in deploying AI/ML workloads on Kubernetes with GPU scheduling, model serving frameworks, and MLOps patterns.
GPU Workload Scheduling
NVIDIA GPU Operator
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator
```
GPU Resource Requests
| Resource | Description |
|---|---|
| `nvidia.com/gpu: N` | Request N full GPUs |
| `nvidia.com/mig-1g.5gb: 1` | Request a MIG slice |
| Node selector | Pin pods to nodes with a specific GPU type |
| Toleration | Allow scheduling onto tainted GPU nodes |
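As a minimal sketch combining these pieces (pod name and image are illustrative; the product label assumes GPU Feature Discovery is labeling nodes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference                  # illustrative name
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # assumes GPU Feature Discovery labels
  tolerations:
    - key: nvidia.com/gpu              # matches the common GPU node taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: inference
      image: vllm/vllm-openai:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1            # extended resources: requests default to limits
```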
Full manifests: `resources/manifests.yaml`
Model Serving Frameworks
Framework Comparison
| Framework | Best For | GPU Support | Scaling |
|---|---|---|---|
| vLLM | High-throughput LLMs | Excellent | HPA/KEDA |
| Triton | Multi-model serving | Excellent | HPA |
| TGI | HuggingFace models | Good | HPA |
vLLM Deployment
Key configurations (see the sketch after this list):
- Multi-GPU inference: `--tensor-parallel-size`
- Context window: `--max-model-len`
- Memory efficiency: `--gpu-memory-utilization`
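A minimal container-spec sketch wiring these flags together (the model name and values are placeholders, not recommendations):

```yaml
# Deployment container fragment; model and values are placeholders
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=meta-llama/Llama-3.1-8B-Instruct   # placeholder model
      - --tensor-parallel-size=2                   # shard weights across 2 GPUs
      - --max-model-len=8192                       # context window
      - --gpu-memory-utilization=0.90              # fraction of VRAM vLLM may use
    ports:
      - containerPort: 8000                        # OpenAI-compatible HTTP API
    resources:
      limits:
        nvidia.com/gpu: 2                          # must match tensor parallelism
```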
Triton Inference Server
- Multi-model serving from S3/GCS
- HTTP (8000), gRPC (8001), Metrics (8002)
- Model polling for dynamic updates
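A hedged container sketch tying these together; the bucket path and image tag are placeholders, and `--model-control-mode=poll` provides the dynamic-update behavior noted above:

```yaml
containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.05-py3   # pick a current tag
    args:
      - tritonserver
      - --model-repository=s3://my-models/triton   # placeholder bucket
      - --model-control-mode=poll                  # re-scan the repository for updates
    ports:
      - containerPort: 8000   # HTTP
      - containerPort: 8001   # gRPC
      - containerPort: 8002   # metrics
```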
Text Generation Inference (TGI)
- HuggingFace native
- Quantization support (`bitsandbytes-nf4`)
- Simple deployment
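A minimal sketch of a TGI container (the model ID is a placeholder):

```yaml
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args:
      - --model-id=mistralai/Mistral-7B-Instruct-v0.2   # placeholder model
      - --quantize=bitsandbytes-nf4                     # 4-bit NF4 quantization
    ports:
      - containerPort: 80    # TGI serves HTTP on port 80 by default
```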
Deployment manifests: `resources/manifests.yaml`
Helm Chart Pattern
```yaml
# values.yaml structure
inference:
  enabled: true
  replicas: 2
  framework: "vllm"   # vllm, tgi, triton
  resources:
    limits:
      nvidia.com/gpu: 1
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 10
vectorDB:
  enabled: true
  type: "qdrant"
monitoring:
  enabled: true
```
Auto-Scaling
Horizontal Pod Autoscaler (HPA)
Scale on:
- GPU utilization (`DCGM_FI_DEV_GPU_UTIL`)
- Inference queue length
- Custom metrics
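A sketch of an HPA on the GPU-utilization metric, assuming a Prometheus Adapter (or similar) exposes `DCGM_FI_DEV_GPU_UTIL` as a per-pod metric; the Deployment name is illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa                       # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm                         # assumed Deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL   # requires a custom-metrics adapter
        target:
          type: AverageValue
          averageValue: "80"           # target ~80% GPU utilization
```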
KEDA Event-Driven Scaling
Scale on:
- Prometheus metrics
- Message queue depth (RabbitMQ, SQS)
- Custom external metrics
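A KEDA `ScaledObject` sketch on a Prometheus query; the metric name and Prometheus address are assumptions for illustration. `minReplicaCount: 0` is what enables the scale-to-zero pattern mentioned under cost optimization below:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler                    # illustrative
spec:
  scaleTargetRef:
    name: vllm                         # assumed Deployment name
  minReplicaCount: 0                   # scale to zero when idle
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        query: sum(inference_queue_length)                     # hypothetical app metric
        threshold: "10"
```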
HPA/KEDA configs: `resources/manifests.yaml`
Networking
Ingress Configuration
- Rate limiting (nginx annotations)
- TLS with cert-manager
- Large body size for AI payloads
- Extended timeouts (300s+)
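A sketch combining these with ingress-nginx annotations (hostname, issuer, and backend are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference                                           # illustrative
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod        # assumes a cert-manager issuer
    nginx.ingress.kubernetes.io/limit-rps: "10"             # rate limiting
    nginx.ingress.kubernetes.io/proxy-body-size: 50m        # large AI payloads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"   # long generations
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]                              # placeholder host
      secretName: inference-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm                                  # assumed Service name
                port:
                  number: 8000
```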
Network Policies
- Restrict pod-to-pod communication
- Allow only gateway → inference
- Permit DNS egress
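A minimal NetworkPolicy sketch, assuming `app: gateway` and `app: vllm` pod labels:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-lockdown             # illustrative
spec:
  podSelector:
    matchLabels:
      app: vllm                        # assumed inference pod label
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: gateway             # only the gateway may reach inference
  egress:
    - ports:                           # permit DNS to any destination
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```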
Monitoring
Key Metrics
| Metric | Source | Purpose |
|---|---|---|
| GPU Utilization | DCGM Exporter | Scaling |
| Inference Latency | Prometheus | SLO |
| Tokens/Second | Custom | Throughput |
| Queue Length | App metrics | Scaling |
Setup
```bash
# Install DCGM Exporter
helm install dcgm-exporter nvidia/dcgm-exporter

# ServiceMonitor for Prometheus
# See resources/manifests.yaml
```
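A `ServiceMonitor` sketch for scraping the exporter, assuming the Prometheus Operator is installed and the exporter's Service exposes a port named `metrics`; the `release` label must match your Prometheus instance's selector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  labels:
    release: prometheus                        # assumed Prometheus selector label
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter    # assumed Service label
  endpoints:
    - port: metrics
      interval: 15s
```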
Managed Kubernetes
AWS EKS
- Instance types: `g5.2xlarge`, `p4d.24xlarge`
- AMI: `AL2_x86_64_GPU`
- GPU taints for isolation
Azure AKS
- VM sizes: `Standard_NC*`, `Standard_ND*`
- A100 support via `NC24ads_A100_v4`
OCI OKE
- Shapes: `BM.GPU.A100-v2.8`, `VM.GPU.A10`
- GPU node pools with taints
Terraform examples: `../terraform-iac/resources/modules.tf`
Best Practices
Resource Management
- Always set GPU limits = requests
- Use node selectors for GPU types
- Implement tolerations for GPU taints
- PVC for model caching
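For the model-cache point, a PVC sketch (storage class and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache                    # illustrative
spec:
  accessModes: [ReadWriteOnce]         # ReadWriteMany if replicas share one cache
  storageClassName: gp3                # placeholder storage class
  resources:
    requests:
      storage: 200Gi                   # size to fit cached model weights
```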
High Availability
- Multiple replicas across zones
- Pod disruption budgets
- Readiness/liveness probes
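A PodDisruptionBudget sketch for the inference Deployment (name and label are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb                       # illustrative
spec:
  minAvailable: 1                      # keep at least one replica serving during drains
  selector:
    matchLabels:
      app: vllm
```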
Cost Optimization
- Spot instances for dev/test
- Auto-scaling to zero when idle
- Right-size GPU instances
Resources
- `resources/manifests.yaml`: full GPU, serving, autoscaling, networking, and monitoring manifests referenced above
- `../terraform-iac/resources/modules.tf`: Terraform examples for managed Kubernetes

Deploy AI workloads at scale with GPU-optimized Kubernetes.