Agent-almanac deploy-ml-model-serving
```bash
# Clone the full repository
git clone https://github.com/pjt222/agent-almanac

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" && mkdir -p ~/.claude/skills && cp -r "$T/i18n/zh-CN/skills/deploy-ml-model-serving" ~/.claude/skills/pjt222-agent-almanac-deploy-ml-model-serving-1007fc && rm -rf "$T"
```
i18n/zh-CN/skills/deploy-ml-model-serving/SKILL.md

Deploy ML Model Serving
See Extended Examples for complete configuration files and templates.
Deploy machine learning models to production with scalable serving infrastructure, monitoring, and A/B testing.
When to Use
- Deploying trained models to production for real-time inference
- Setting up REST or gRPC APIs for model predictions
- Implementing autoscaling for variable load patterns
- Running A/B tests between model versions
- Migrating from batch to real-time inference
- Building low-latency prediction services
- Managing multiple model versions in production
Inputs
- Required: A registered model in the MLflow Model Registry, or a trained model artifact
- Required: A Kubernetes cluster or container orchestration platform
- Required: A serving framework choice (MLflow, BentoML, Seldon Core, TorchServe)
- Optional: GPU resources for deep learning models
- Optional: Monitoring infrastructure (Prometheus, Grafana)
- Optional: A load balancer and ingress controller
Steps
Step 1: Deploy with MLflow Models Serving
Use MLflow's built-in serving for quick deployment of scikit-learn, PyTorch, and TensorFlow models.
```bash
# Serve model locally for testing
mlflow models serve \
  --model-uri models:/customer-churn-classifier/Production \
  --port 5001 \
  --host 0.0.0.0

# Test endpoint
curl -X POST http://localhost:5001/invocations \
  -H 'Content-Type: application/json' \
  -d '{
    "dataframe_records": [
      {"feature1": 1.0, "feature2": 2.0, "feature3": 3.0}
    ]
  }'
```
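The same request can be made from Python. A minimal stdlib-only sketch (the URL, feature names, and the `build_invocation_payload`/`send_invocation` helpers are illustrative, not part of MLflow):

```python
import json
import urllib.request

def build_invocation_payload(records):
    """Build the JSON body MLflow's /invocations endpoint expects
    for pandas-record input."""
    return json.dumps({"dataframe_records": records})

def send_invocation(url, records, timeout=10):
    """POST feature records to a running MLflow model server
    and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=build_invocation_payload(records).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())

# Payload matching the curl call above
payload = build_invocation_payload(
    [{"feature1": 1.0, "feature2": 2.0, "feature3": 3.0}]
)
```

With a server running locally, `send_invocation("http://localhost:5001/invocations", [...])` performs the same call as the curl example.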
Docker deployment:
```dockerfile
# Dockerfile.mlflow-serving
FROM python:3.9-slim

# Install MLflow and dependencies
RUN pip install mlflow boto3 scikit-learn

# Set environment variables
ENV MLFLOW_TRACKING_URI=http://mlflow-server:5000

# ... (see EXAMPLES.md for complete implementation)
```
Docker Compose for local testing:
```yaml
# docker-compose.mlflow-serving.yml
version: '3.8'
services:
  model-server:
    build:
      context: .
      dockerfile: Dockerfile.mlflow-serving
    # ... (see EXAMPLES.md for complete implementation)
```
Test the deployment:
```python
# test_mlflow_serving.py
import requests
import json

def test_prediction():
    url = "http://localhost:8080/invocations"
    # Prepare input data
    # ... (see EXAMPLES.md for complete implementation)
```
Expected result: The model server starts successfully, responds to HTTP POST requests, returns predictions in JSON format, and the Docker container runs without errors.
Failure handling: Check that the model URI is valid and the model is registered (e.g., via `MlflowClient().search_registered_models()`), verify the MLflow tracking server is reachable, ensure all model dependencies are installed in the container, check port availability (`netstat -tulpn | grep 8080`), verify model flavor compatibility, and inspect the container logs (`docker logs <container-id>`).
Step 2: Deploy with BentoML for Production Scale
Use BentoML for advanced serving with better performance and features.
```python
# bentoml_service.py
import bentoml
from bentoml.io import JSON, NumpyNdarray
import numpy as np
import pandas as pd

# Load model from MLflow
import mlflow
# ... (see EXAMPLES.md for complete implementation)
```
Build and containerize:
```bash
# Build Bento
bentoml build

# Containerize
bentoml containerize customer_churn_classifier:latest \
  --image-tag customer-churn:v1.0

# Run container
docker run -p 3000:3000 customer-churn:v1.0
```
BentoML configuration:
```yaml
# bentofile.yaml
service: "bentoml_service:ChurnPredictionService"
include:
  - "bentoml_service.py"
  - "preprocessing.py"
python:
  packages:
    - scikit-learn==1.0.2
    - pandas==1.4.0
    - numpy==1.22.0
    - mlflow==2.0.1
docker:
  distro: debian
  python_version: "3.9"
  cuda_version: null  # Set to "11.6" for GPU support
```
Kubernetes deployment:
```yaml
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-prediction
  labels:
    app: churn-prediction
spec:
  # ... (see EXAMPLES.md for complete implementation)
```
Deploy to Kubernetes:
```bash
# Apply Kubernetes manifests
kubectl apply -f k8s/deployment.yaml

# Check deployment status
kubectl get deployments
kubectl get pods
kubectl get services

# Test endpoint
EXTERNAL_IP=$(kubectl get svc churn-prediction-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -X POST http://$EXTERNAL_IP/predict \
  -H 'Content-Type: application/json' \
  -d '{"instances": [{"tenure": 12, "monthly_charges": 70.35}]}'
```
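A load balancer IP can take a while to provision and pods a while to become ready, so scripted smoke tests should poll rather than fail on the first refused connection. A generic, framework-agnostic sketch (`wait_until_ready` is a hypothetical helper, not part of any of the tools above):

```python
import time

def wait_until_ready(probe, attempts=10, delay=1.0):
    """Poll `probe` (a zero-arg callable that returns True when the
    service is healthy, and may raise on connection errors) until it
    succeeds or the attempts are exhausted."""
    for i in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # treat connection errors as "not ready yet"
        if i < attempts - 1:
            time.sleep(delay)
    return False
```

Wire `probe` to an HTTP GET of the service's health endpoint (or the curl call above) before running the real smoke test.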
Expected result: The BentoML service builds successfully, the container runs and serves predictions, the Kubernetes deployment creates 3 replicas, the load balancer exposes an external endpoint, and health checks pass.
Failure handling: Verify the BentoML installation (`bentoml --version`), check that the model exists in the BentoML store (`bentoml models list`), ensure the Docker daemon is running, verify Kubernetes cluster access (`kubectl cluster-info`), check that resource limits are not exceeded, inspect pod logs (`kubectl logs <pod-name>`), and verify the service selector matches the pod labels.
Step 3: Implement Seldon Core for Advanced Features
Use Seldon Core for multi-model serving, A/B testing, and explainability.
```python
# seldon_wrapper.py
import logging
from typing import Dict, List, Union

import numpy as np
import mlflow

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)
```
Seldon deployment configuration:
```yaml
# seldon-deployment.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: churn-classifier
  namespace: seldon
spec:
  name: churn-classifier
  # ... (see EXAMPLES.md for complete implementation)
```
A/B testing configuration:
```yaml
# seldon-ab-test.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: churn-classifier-ab
spec:
  name: churn-classifier-ab
  predictors:
    # ... (see EXAMPLES.md for complete implementation)
```
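For intuition, the weighted split Seldon applies between predictors can be modeled as deterministic hash-based routing. This toy router (an illustration, not Seldon's implementation) shows how a 90/10 split distributes requests:

```python
import hashlib

def assign_variant(request_id, weights):
    """Deterministically map a request ID to a variant.

    `weights` maps variant name -> traffic fraction (summing to 1.0).
    The same ID always lands on the same variant, which keeps a
    user's experience stable during an A/B test.
    """
    # Hash the ID into a uniform bucket in [0, 1)
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variant  # guard against floating-point rounding

# Simulate 10,000 requests through a 90/10 split
counts = {"model-a": 0, "model-b": 0}
for i in range(10_000):
    counts[assign_variant(f"req-{i}", {"model-a": 0.9, "model-b": 0.1})] += 1
```

Seldon performs this routing server-side based on the `traffic` field of each predictor; the sketch only illustrates why the observed split converges to the configured weights.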
Deploy to Kubernetes:
```bash
# Install Seldon Core operator
kubectl create namespace seldon-system
helm install seldon-core seldon-core-operator \
  --repo https://storage.googleapis.com/seldon-charts \
  --namespace seldon-system \
  --set usageMetrics.enabled=true

# Create namespace for models
# ... (see EXAMPLES.md for complete implementation)
```
Expected result: The Seldon Core operator installs successfully, the model deployment creates pods, the REST endpoint responds to predictions, the A/B test splits traffic correctly, and Seldon Analytics records metrics.
Failure handling: Verify the Seldon Core operator is running (`kubectl get pods -n seldon-system`), check the SeldonDeployment status (`kubectl describe seldondeployment`), ensure the image registry is accessible from the cluster, verify model URI resolution, check RBAC permissions for the Seldon operator, and inspect the model container logs.
Step 4: Implement Monitoring and Observability
Add comprehensive monitoring for model serving infrastructure.
```python
# monitoring.py
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import logging

logger = logging.getLogger(__name__)

# Prometheus metrics
# ... (see EXAMPLES.md for complete implementation)
```
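For intuition on what the Prometheus `Histogram` metric records, here is a toy cumulative-bucket histogram with a quantile estimate in the style of `histogram_quantile` (an illustration of the mechanism, not the `prometheus_client` API):

```python
import bisect

class LatencyHistogram:
    """Toy bucketed histogram mirroring how Prometheus records
    request latencies: each observation increments the first bucket
    whose upper bound it does not exceed."""

    def __init__(self, buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0)):
        self.buckets = list(buckets)          # upper bounds, seconds
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf
        self.total = 0

    def observe(self, seconds):
        self.counts[bisect.bisect_left(self.buckets, seconds)] += 1
        self.total += 1

    def quantile(self, q):
        """Return the upper bound of the bucket containing the q-th
        sample, i.e. a conservative quantile estimate."""
        target = q * self.total
        cumulative = 0
        for bound, count in zip(self.buckets, self.counts):
            cumulative += count
            if cumulative >= target:
                return bound
        return float("inf")

# Record a few example latencies (milliseconds converted to seconds)
h = LatencyHistogram()
for ms in [3, 8, 12, 40, 90, 200, 700]:
    h.observe(ms / 1000)
```

This is why bucket boundaries matter: a p95 estimate can only ever be one of the configured upper bounds, so buckets should straddle the latency SLO.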
Prometheus configuration:
```yaml
# prometheus-config.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'model-serving'
    kubernetes_sd_configs:
      # ... (see EXAMPLES.md for complete implementation)
```
Grafana dashboard JSON:
```json
{
  "dashboard": {
    "title": "ML Model Serving Metrics",
    "panels": [
      {
        "title": "Predictions Per Second",
        "targets": [
          {
            # ... (see EXAMPLES.md for complete implementation)
```
Expected result: Prometheus scrapes metrics successfully; Grafana dashboards display prediction throughput, latency percentiles, error rates, and active requests in real time.
Failure handling: Verify Prometheus scrape targets are UP (`http://prometheus:9090/targets`), check that the metrics endpoint is reachable (`curl http://model-pod:8000/metrics`), ensure Kubernetes service discovery is configured, verify the Grafana data source connection, and check firewall rules for the metrics port.
Step 5: Implement Autoscaling
Configure horizontal pod autoscaling based on request load.
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-prediction-hpa
  namespace: seldon
spec:
  scaleTargetRef:
    # ... (see EXAMPLES.md for complete implementation)
```
Apply autoscaling:
```bash
# Enable metrics server (if not already installed)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Apply HPA
kubectl apply -f hpa.yaml

# Check HPA status
kubectl get hpa -n seldon
kubectl describe hpa churn-prediction-hpa -n seldon

# Load test to trigger scaling
kubectl run -it --rm load-generator --image=busybox --restart=Never -- \
  /bin/sh -c "while sleep 0.01; do wget -q -O- http://churn-prediction-service/predict; done"

# Watch scaling
kubectl get hpa -n seldon --watch
```
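The scaling decision the HPA makes follows the documented formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the configured bounds. A small sketch:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10):
    """Kubernetes HPA scaling rule: scale the replica count in
    proportion to how far the observed metric is from its target,
    then clamp to the [min, max] bounds from the HPA spec."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 replicas at 90% average CPU with a 60% target scale to ceil(3 × 90/60) = 5 replicas. This is also why resource requests must be defined: CPU utilization is computed relative to the requested amount.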
Expected result: The HPA monitors CPU/memory/custom metrics, scales replicas up under load, scales down after the stabilization period, and respects the min/max replica limits.
Failure handling: Verify metrics-server is running (`kubectl get deployment metrics-server -n kube-system`), check that pod resource requests are defined (the HPA requires them), ensure custom metrics are available if used, verify RBAC permissions for the HPA controller, and check that the stabilization windows are not too restrictive.
Step 6: Implement a Canary Deployment Strategy
Gradually roll out new model versions with traffic shifting.
```yaml
# canary-deployment.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: churn-classifier-canary
spec:
  name: churn-classifier-canary
  predictors:
    # ... (see EXAMPLES.md for complete implementation)
```
Gradual rollout script:
```python
# canary_rollout.py
import time
import subprocess
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)
```
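The control flow of such a rollout script can be sketched framework-agnostically; `shift_traffic` and `is_healthy` stand in for patching the SeldonDeployment and querying Prometheus (hypothetical callables, not the EXAMPLES.md implementation):

```python
def run_canary_rollout(stages, is_healthy, shift_traffic):
    """Walk through canary traffic stages, rolling back on failure.

    stages: canary traffic percentages, e.g. [5, 25, 50, 100]
    is_healthy: zero-arg callable checking canary metrics
        (e.g. error rate / latency pulled from Prometheus)
    shift_traffic: one-arg callable applying a canary percentage
        (e.g. by patching the SeldonDeployment predictor weights)
    Returns True on full rollout, False after a rollback.
    """
    for pct in stages:
        shift_traffic(pct)
        if not is_healthy():
            shift_traffic(0)  # roll back: all traffic to the stable version
            return False
    return True
```

In a real script, each stage would also wait out a soak period before evaluating health, so transient spikes do not trigger a rollback.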
Expected result: The canary deployment starts with 0% traffic, traffic shifts gradually and automatically, health checks pass at each stage, rollback is triggered if metrics degrade, and the rollout completes after all stages pass.
Failure handling: Verify the Seldon deployment has multiple predictors, check that traffic percentages sum to 100, ensure the canary image exists and is pullable, verify Prometheus metrics are available for health checks, check that the rollback logic executes correctly, and inspect pod logs for both versions.
Verification Checklist
- Model server responds to prediction requests
- REST/gRPC endpoints functional and documented
- Docker containers build and run successfully
- Kubernetes deployment creates expected replicas
- Load balancer exposes external endpoint
- Health checks (liveness/readiness) pass
- Prometheus metrics exported and scraped
- Grafana dashboards display real-time metrics
- Autoscaling triggers under load
- A/B test splits traffic correctly
- Canary deployment rolls out gradually
- Rollback works when canary fails
Common Issues
- Cold start latency: First request slow due to model loading - use readiness probes with adequate delay, implement model caching
- Memory leaks: Long-running servers accumulate memory - monitor memory usage, implement periodic restarts, profile code
- Dependency conflicts: Model dependencies incompatible with serving framework - use exact pinned versions, test in Docker before deployment
- Resource limits too low: Pods OOMKilled or CPU throttled - profile resource usage, set appropriate limits based on load testing
- Missing health checks: Kubernetes routes traffic to unhealthy pods - implement proper liveness/readiness probes
- No rollback strategy: Bad deployment without easy rollback - use canary deployments, keep previous version available
- Ignoring latency: Focusing only on accuracy, not inference speed - benchmark latency, optimize model/code, use batching
- Single replica: No high availability, downtime during deployments - use min 2 replicas, configure anti-affinity
- No monitoring: Issues not detected until customers complain - implement comprehensive metrics from day one
- GPU not utilized: GPU available but not used - set CUDA visible devices, verify GPU allocation in Kubernetes
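The batching advice above can be sketched as a simple size-based micro-batcher (real serving frameworks such as BentoML also flush on a timeout; `predict_batch` is a hypothetical stand-in for a batched model call):

```python
def micro_batch(requests, max_batch_size, predict_batch):
    """Group individual requests into batches before calling the model.

    Batched inference amortizes per-call overhead (and fills GPU
    capacity) at the cost of a small queuing delay. This sketch
    batches by size only; production batchers also flush after a
    max-latency timeout so a lone request is not stuck waiting.
    """
    results = []
    for i in range(0, len(requests), max_batch_size):
        results.extend(predict_batch(requests[i:i + max_batch_size]))
    return results
```

Tuning `max_batch_size` (and the flush timeout in a real batcher) trades p99 latency against throughput, so both should be measured under realistic load.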
Related Skills
- `register-ml-model`: Register models before deploying them
- `run-ab-test-models`: Implement A/B testing between model versions
- `deploy-to-kubernetes`: General Kubernetes deployment patterns
- `monitor-ml-model-performance`: Monitor model drift and degradation
- `orchestrate-ml-pipeline`: Automate model retraining and deployment