Claude-skill-registry Kubernetes AI Expert

Deploy and operate AI workloads on Kubernetes with GPU scheduling, model serving, and MLOps patterns

Install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/kubernetes-ai" ~/.claude/skills/majiayu000-claude-skill-registry-kubernetes-ai-expert && rm -rf "$T"
manifest: skills/data/kubernetes-ai/SKILL.md
source content

Kubernetes AI Expert

Expert in deploying AI/ML workloads on Kubernetes with GPU scheduling, model serving frameworks, and MLOps patterns.

GPU Workload Scheduling

NVIDIA GPU Operator

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace

GPU Resource Requests

Resource                  | Description
nvidia.com/gpu: N         | Request N whole GPUs (set as a container limit)
nvidia.com/mig-3g.40gb: 1 | Request one MIG slice (3g.40gb profile)
Node selector             | nvidia.com/gpu.product (pin pods to a specific GPU model)
Toleration                | nvidia.com/gpu (match the taint on GPU nodes)
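
Putting these together, a minimal smoke-test pod might look like the following sketch (the gpu.product value and CUDA image tag are illustrative; substitute what your nodes actually report):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB  # illustrative label value
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # GPUs are set as limits; requests are implied to match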

Full manifests:

resources/manifests.yaml

Model Serving Frameworks

Framework Comparison

Framework | Best For             | GPU Support | Scaling
vLLM      | High-throughput LLMs | Excellent   | HPA/KEDA
Triton    | Multi-model serving  | Excellent   | HPA
TGI       | HuggingFace models   | Good        | HPA

vLLM Deployment

Key configurations (wired together in the Deployment sketch below):

  • --tensor-parallel-size: shard the model across multiple GPUs
  • --max-model-len: cap the context window (in tokens)
  • --gpu-memory-utilization: fraction of GPU memory vLLM may allocate (e.g. 0.90)
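
A minimal Deployment sketch using these flags (image tag and model name are placeholders, not recommendations):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest  # pin a real tag in practice
          args:
            - --model=meta-llama/Llama-3.1-8B-Instruct  # placeholder model
            - --tensor-parallel-size=1
            - --max-model-len=8192
            - --gpu-memory-utilization=0.90
          ports:
            - containerPort: 8000  # OpenAI-compatible HTTP API
          resources:
            limits:
              nvidia.com/gpu: 1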

Triton Inference Server

  • Multi-model serving from S3/GCS
  • HTTP (8000), gRPC (8001), Metrics (8002)
  • Model polling for dynamic updates (see the container sketch below)
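
A sketch of the container portion of a Triton Deployment; S3 credentials are assumed to be injected separately (e.g. env vars or IRSA):

containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.08-py3  # illustrative tag
    command: ["tritonserver"]
    args:
      - --model-repository=s3://my-bucket/models  # placeholder bucket
      - --model-control-mode=poll                 # pick up new/updated models
    ports:
      - containerPort: 8000  # HTTP
      - containerPort: 8001  # gRPC
      - containerPort: 8002  # metrics
    resources:
      limits:
        nvidia.com/gpu: 1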

Text Generation Inference (TGI)

  • HuggingFace native
  • Quantization support (bitsandbytes-nf4)
  • Simple deployment (container fragment below)
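
A comparable container fragment for TGI (model ID is a placeholder; gated models also need a HuggingFace token in the environment):

containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest  # pin a tag in practice
    args:
      - --model-id=mistralai/Mistral-7B-Instruct-v0.3  # placeholder model
      - --quantize=bitsandbytes-nf4
    ports:
      - containerPort: 80  # TGI's default HTTP port
    resources:
      limits:
        nvidia.com/gpu: 1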

Deployment manifests:

resources/manifests.yaml

Helm Chart Pattern

# values.yaml structure
inference:
  enabled: true
  replicas: 2
  framework: "vllm"  # vllm, tgi, triton
  resources:
    limits:
      nvidia.com/gpu: 1
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10

vectorDB:
  enabled: true
  type: "qdrant"

monitoring:
  enabled: true

Auto-Scaling

Horizontal Pod Autoscaler (HPA)

Scale on:

  • GPU utilization (DCGM_FI_DEV_GPU_UTIL; example HPA below)
  • Inference queue length
  • Custom metrics
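
A sketch of an autoscaling/v2 HPA targeting GPU utilization, assuming prometheus-adapter exposes DCGM_FI_DEV_GPU_UTIL as a per-pod metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL  # assumes prometheus-adapter exposes this
        target:
          type: AverageValue
          averageValue: "80"  # target ~80% average GPU utilization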

KEDA Event-Driven Scaling

Scale on:

  • Prometheus metrics
  • Message queue depth (RabbitMQ, SQS)
  • Custom external metrics (example ScaledObject below)
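
For example, a ScaledObject scaling the vLLM Deployment on queue depth (the Prometheus address and metric name are assumptions to verify against your setup):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-queue
spec:
  scaleTargetRef:
    name: vllm               # the Deployment to scale
  minReplicaCount: 0         # allows scale-to-zero when idle
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # placeholder address
        query: sum(vllm:num_requests_waiting)             # vLLM queue-depth metric (verify name)
        threshold: "10"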

HPA/KEDA configs:

resources/manifests.yaml

Networking

Ingress Configuration

  • Rate limiting (nginx annotations)
  • TLS with cert-manager
  • Large body size for AI payloads
  • Extended timeouts (300s+; example below)
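
A minimal sketch combining those annotations (hostname and issuer name are placeholders):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # assumes this ClusterIssuer exists
    nginx.ingress.kubernetes.io/limit-rps: "10"
    nginx.ingress.kubernetes.io/proxy-body-size: "16m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]
      secretName: inference-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm
                port:
                  number: 8000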

Network Policies

  • Restrict pod-to-pod communication
  • Allow only gateway → inference
  • Permit DNS egress (sketch below)
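
A sketch implementing that for the inference pods (label names are illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-isolation
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: gateway  # only the gateway may reach inference
      ports:
        - port: 8000
  egress:
    - ports:  # DNS only
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP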

Monitoring

Key Metrics

Metric            | Source        | Purpose
GPU Utilization   | DCGM Exporter | Scaling
Inference Latency | Prometheus    | SLO
Tokens/Second     | Custom        | Throughput
Queue Length      | App metrics   | Scaling

Setup

# Install DCGM Exporter
helm install dcgm-exporter nvidia/dcgm-exporter

# ServiceMonitor for Prometheus
# See resources/manifests.yaml
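
A ServiceMonitor sketch for scraping the exporter (the selector and release labels depend on your chart values; verify against the installed Service):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  labels:
    release: prometheus  # must match your Prometheus serviceMonitorSelector (assumption)
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter  # assumption: chart's Service label
  endpoints:
    - port: metrics
      interval: 15s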

Managed Kubernetes

AWS EKS

  • Instance types: g5.2xlarge, p4d.24xlarge
  • AMI type: AL2_x86_64_GPU
  • GPU taints for isolation

Azure AKS

  • VM sizes: Standard_NC*, Standard_ND*
  • A100 support via Standard_NC24ads_A100_v4

OCI OKE

  • Shapes: BM.GPU.A100-v2.8, VM.GPU.A10
  • GPU node pools with taints

Terraform examples:

../terraform-iac/resources/modules.tf

Best Practices

Resource Management

  • Always set GPU limits = requests
  • Use node selectors for GPU types
  • Implement tolerations for GPU taints
  • PVC for model caching (sketch below)
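
For model caching, a PVC sketch (size and storage class are illustrative); mount it at the serving container's cache path so restarts skip the model download:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi     # size to fit your model weights
  storageClassName: gp3  # illustrative; use a class your cluster offers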

High Availability

  • Multiple replicas across zones
  • Pod disruption budgets (sketch after this list)
  • Readiness/liveness probes
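
A minimal PodDisruptionBudget sketch keeping at least one inference replica up during voluntary disruptions:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm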

Cost Optimization

  • Spot instances for dev/test
  • Auto-scaling to zero when idle
  • Right-size GPU instances

Deploy AI workloads at scale with GPU-optimized Kubernetes.