Claude-skill-registry Kubernetes AI Expert
Deploy and operate AI workloads on Kubernetes with GPU scheduling, model serving, and MLOps patterns
install

source · Clone the upstream repo

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/kubernetes-ai" ~/.claude/skills/majiayu000-claude-skill-registry-kubernetes-ai-expert && rm -rf "$T"
```
manifest: `skills/data/kubernetes-ai/SKILL.md`
Kubernetes AI Expert
Expert in deploying AI/ML workloads on Kubernetes with GPU scheduling, model serving frameworks, and MLOps patterns.
GPU Workload Scheduling
NVIDIA GPU Operator
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator
```
GPU Resource Requests
| Resource | Description |
|---|---|
| `nvidia.com/gpu: N` | Request N full GPUs |
| `nvidia.com/mig-1g.5gb: 1` | Request a MIG slice |
| Node selector | Pin pods to nodes with a specific GPU type |
| Toleration | Allow scheduling onto tainted GPU nodes |
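As a minimal sketch combining these pieces (pod name and image are illustrative; the product label assumes GPU Feature Discovery is labeling nodes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference                  # illustrative name
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # assumes GPU Feature Discovery labels
  tolerations:
    - key: nvidia.com/gpu              # matches the common GPU node taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: inference
      image: vllm/vllm-openai:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1            # extended resources: requests default to limits
```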
Full manifests: `resources/manifests.yaml`
Model Serving Frameworks
Framework Comparison
| Framework | Best For | GPU Support | Scaling |
|---|---|---|---|
| vLLM | High-throughput LLMs | Excellent | HPA/KEDA |
| Triton | Multi-model serving | Excellent | HPA |
| TGI | HuggingFace models | Good | HPA |
vLLM Deployment
Key configurations (see the sketch after this list):
- Multi-GPU inference: `--tensor-parallel-size`
- Context window: `--max-model-len`
- Memory efficiency: `--gpu-memory-utilization`
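A minimal container-spec sketch wiring these flags together (the model name and values are placeholders, not recommendations):

```yaml
# Deployment container fragment; model and values are placeholders
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=meta-llama/Llama-3.1-8B-Instruct   # placeholder model
      - --tensor-parallel-size=2                   # shard weights across 2 GPUs
      - --max-model-len=8192                       # context window
      - --gpu-memory-utilization=0.90              # fraction of VRAM vLLM may use
    ports:
      - containerPort: 8000                        # OpenAI-compatible HTTP API
    resources:
      limits:
        nvidia.com/gpu: 2                          # must match tensor parallelism
```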
Triton Inference Server
- Multi-model serving from S3/GCS
- HTTP (8000), gRPC (8001), Metrics (8002)
- Model polling for dynamic updates
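A hedged container sketch tying these together; the bucket path and image tag are placeholders, and `--model-control-mode=poll` provides the dynamic-update behavior noted above:

```yaml
containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.05-py3   # pick a current tag
    args:
      - tritonserver
      - --model-repository=s3://my-models/triton   # placeholder bucket
      - --model-control-mode=poll                  # re-scan the repository for updates
    ports:
      - containerPort: 8000   # HTTP
      - containerPort: 8001   # gRPC
      - containerPort: 8002   # metrics
```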
Text Generation Inference (TGI)
- HuggingFace native
- Quantization support (`bitsandbytes-nf4`)
- Simple deployment
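A minimal sketch of a TGI container (the model ID is a placeholder):

```yaml
containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args:
      - --model-id=mistralai/Mistral-7B-Instruct-v0.2   # placeholder model
      - --quantize=bitsandbytes-nf4                     # 4-bit NF4 quantization
    ports:
      - containerPort: 80    # TGI serves HTTP on port 80 by default
```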
Deployment manifests: `resources/manifests.yaml`
Helm Chart Pattern
```yaml
# values.yaml structure
inference:
  enabled: true
  replicas: 2
  framework: "vllm"   # vllm, tgi, triton
  resources:
    limits:
      nvidia.com/gpu: 1
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 10
vectorDB:
  enabled: true
  type: "qdrant"
monitoring:
  enabled: true
```
Auto-Scaling
Horizontal Pod Autoscaler (HPA)
Scale on:
- GPU utilization (`DCGM_FI_DEV_GPU_UTIL`)
- Inference queue length
- Custom metrics
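A sketch of an HPA on the GPU-utilization metric, assuming a Prometheus Adapter (or similar) exposes `DCGM_FI_DEV_GPU_UTIL` as a per-pod metric; the Deployment name is illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa                       # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm                         # assumed Deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL   # requires a custom-metrics adapter
        target:
          type: AverageValue
          averageValue: "80"           # target ~80% GPU utilization
```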
KEDA Event-Driven Scaling
Scale on:
- Prometheus metrics
- Message queue depth (RabbitMQ, SQS)
- Custom external metrics
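A KEDA `ScaledObject` sketch on a Prometheus query; the metric name and Prometheus address are assumptions for illustration. `minReplicaCount: 0` is what enables the scale-to-zero pattern mentioned under cost optimization below:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler                    # illustrative
spec:
  scaleTargetRef:
    name: vllm                         # assumed Deployment name
  minReplicaCount: 0                   # scale to zero when idle
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        query: sum(inference_queue_length)                     # hypothetical app metric
        threshold: "10"
```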
HPA/KEDA configs: `resources/manifests.yaml`
Networking
Ingress Configuration
- Rate limiting (nginx annotations)
- TLS with cert-manager
- Large body size for AI payloads
- Extended timeouts (300s+)
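A sketch combining these with ingress-nginx annotations (hostname, issuer, and backend are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference                                           # illustrative
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod        # assumes a cert-manager issuer
    nginx.ingress.kubernetes.io/limit-rps: "10"             # rate limiting
    nginx.ingress.kubernetes.io/proxy-body-size: 50m        # large AI payloads
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"   # long generations
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]                              # placeholder host
      secretName: inference-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm                                  # assumed Service name
                port:
                  number: 8000
```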
Network Policies
- Restrict pod-to-pod communication
- Allow only gateway → inference
- Permit DNS egress
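A minimal NetworkPolicy sketch, assuming `app: gateway` and `app: vllm` pod labels:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-lockdown             # illustrative
spec:
  podSelector:
    matchLabels:
      app: vllm                        # assumed inference pod label
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: gateway             # only the gateway may reach inference
  egress:
    - ports:                           # permit DNS to any destination
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```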
Monitoring
Key Metrics
| Metric | Source | Purpose |
|---|---|---|
| GPU Utilization | DCGM Exporter | Scaling |
| Inference Latency | Prometheus | SLO |
| Tokens/Second | Custom | Throughput |
| Queue Length | App metrics | Scaling |
Setup
```bash
# Install DCGM Exporter
helm install dcgm-exporter nvidia/dcgm-exporter

# ServiceMonitor for Prometheus
# See resources/manifests.yaml
```
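A `ServiceMonitor` sketch for scraping the exporter, assuming the Prometheus Operator is installed and the exporter's Service exposes a port named `metrics`; the `release` label must match your Prometheus instance's selector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  labels:
    release: prometheus                        # assumed Prometheus selector label
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter    # assumed Service label
  endpoints:
    - port: metrics
      interval: 15s
```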
Managed Kubernetes
AWS EKS
- Instance types: `g5.2xlarge`, `p4d.24xlarge`
- AMI: `AL2_x86_64_GPU`
- GPU taints for isolation
Azure AKS
- VM sizes: `Standard_NC*`, `Standard_ND*`
- A100 support via `NC24ads_A100_v4`
OCI OKE
- Shapes: `BM.GPU.A100-v2.8`, `VM.GPU.A10`
- GPU node pools with taints
Terraform examples: `../terraform-iac/resources/modules.tf`
Best Practices
Resource Management
- Always set GPU limits = requests
- Use node selectors for GPU types
- Implement tolerations for GPU taints
- PVC for model caching
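For the model-cache point, a PVC sketch (storage class and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache                    # illustrative
spec:
  accessModes: [ReadWriteOnce]         # ReadWriteMany if replicas share one cache
  storageClassName: gp3                # placeholder storage class
  resources:
    requests:
      storage: 200Gi                   # size to fit cached model weights
```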
High Availability
- Multiple replicas across zones
- Pod disruption budgets
- Readiness/liveness probes
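A PodDisruptionBudget sketch for the inference Deployment (name and label are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb                       # illustrative
spec:
  minAvailable: 1                      # keep at least one replica serving during drains
  selector:
    matchLabels:
      app: vllm
```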
Cost Optimization
- Spot instances for dev/test
- Auto-scaling to zero when idle
- Right-size GPU instances
Resources
- `resources/manifests.yaml`: full GPU, serving, autoscaling, networking, and monitoring manifests referenced above
- `../terraform-iac/resources/modules.tf`: Terraform examples for managed Kubernetes

Deploy AI workloads at scale with GPU-optimized Kubernetes.