Claude-Skills senior-ml-engineer

install
source · Clone the upstream repo
git clone https://github.com/borghei/Claude-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/engineering/senior-ml-engineer" ~/.claude/skills/borghei-claude-skills-senior-ml-engineer && rm -rf "$T"
manifest: engineering/senior-ml-engineer/SKILL.md

Senior ML Engineer

Production ML engineering patterns for model deployment, MLOps infrastructure, and LLM integration.


Model Deployment Workflow

Deploy a trained model to production with monitoring:

  1. Export model to standardized format (ONNX, TorchScript, SavedModel)
  2. Package model with dependencies in Docker container
  3. Deploy to staging environment
  4. Run integration tests against staging
  5. Deploy canary (5% traffic) to production
  6. Monitor latency and error rates for 1 hour
  7. Promote to full production if metrics pass
  8. Validation: p95 latency < 100ms, error rate < 0.1%
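The promotion decision in steps 6 to 8 can be sketched as a small gate function. This is a minimal illustration rather than part of the skill's scripts: `CanaryMetrics` and `should_promote` are hypothetical names, and the nearest-rank percentile is only one of several reasonable definitions.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    """Observations collected during the canary window (hypothetical structure)."""
    latencies_ms: list
    total_requests: int
    failed_requests: int


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def should_promote(metrics, max_p95_ms=100.0, max_error_rate=0.001):
    """Return True only if both validation thresholds from step 8 hold."""
    if metrics.total_requests == 0 or not metrics.latencies_ms:
        return False  # no evidence, so do not promote
    error_rate = metrics.failed_requests / metrics.total_requests
    return p95(metrics.latencies_ms) < max_p95_ms and error_rate < max_error_rate
```

The default `max_error_rate=0.001` encodes the 0.1% limit as a fraction; an empty canary window is treated as a failed gate so that a misrouted canary cannot be promoted by accident.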

Container Template

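FROM python:3.11-slim

# Run from /app so that "src.server:app" resolves as a module path
WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model/ /app/model/
COPY src/ /app/src/

# python:3.11-slim does not ship curl, so probe /health with the standard library
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1

EXPOSE 8080
CMD ["uvicorn", "src.server:app", "--host", "0.0.0.0", "--port", "8080"]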

Serving Options

| Option | Latency | Throughput | Use Case |
|---|---|---|---|
| FastAPI + Uvicorn | Low | Medium | REST APIs, small models |
| Triton Inference Server | Very Low | Very High | GPU inference, batching |
| TensorFlow Serving | Low | High | TensorFlow models |
| TorchServe | Low | High | PyTorch models |
| Ray Serve | Medium | High | Complex pipelines, multi-model |

MLOps Pipeline Setup

Establish automated training and deployment:

  1. Configure feature store (Feast, Tecton) for training data
  2. Set up experiment tracking (MLflow, Weights & Biases)
  3. Create training pipeline with hyperparameter logging
  4. Register model in model registry with version metadata
  5. Configure staging deployment triggered by registry events
  6. Set up A/B testing infrastructure for model comparison
  7. Enable drift monitoring with alerting
  8. Validation: New models automatically evaluated against baseline
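The baseline check in step 8 can be sketched as a comparison of metric dictionaries. The function name `evaluate_against_baseline` and the assumption that every metric is "higher is better" are illustrative choices, not something the skill prescribes.

```python
def evaluate_against_baseline(candidate, baseline, min_improvement=0.0):
    """Compare metrics shared by both models and decide whether to promote.

    Assumes every metric is "higher is better" (e.g. accuracy, F1).
    """
    shared = sorted(set(candidate) & set(baseline))
    deltas = {name: candidate[name] - baseline[name] for name in shared}
    # Promote only if there is something to compare and nothing regressed.
    promote = bool(shared) and all(d >= min_improvement for d in deltas.values())
    return {"promote": promote, "deltas": deltas}
```

A registry event handler could call this with the metrics logged for the new version and for the current production version, and trigger the staging deployment only when `promote` is true.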

Feature Store Pattern

from datetime import timedelta

# Legacy Feast API (Feature/ValueType); newer releases use Field and schema= instead
from feast import Entity, Feature, FeatureView, FileSource, ValueType

user = Entity(name="user_id", value_type=ValueType.INT64)

user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    ttl=timedelta(days=1),  # online values older than one day are treated as stale
    features=[
        Feature(name="purchase_count_30d", dtype=ValueType.INT64),
        Feature(name="avg_order_value", dtype=ValueType.FLOAT),
    ],
    online=True,
    source=FileSource(path="data/user_features.parquet"),
)

Retraining Triggers

| Trigger | Detection | Action |
|---|---|---|
| Scheduled | Cron (weekly/monthly) | Full retrain |
| Performance drop | Accuracy < threshold | Immediate retrain |
| Data drift | PSI > 0.2 | Evaluate, then retrain |
| New data volume | X new samples | Incremental update |
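The PSI (population stability index) used for the data drift trigger can be computed as follows. This is a standard-library sketch under stated assumptions: equal-width bins derived from the reference sample, ten bins, and a small `eps` floor to avoid taking the log of zero. The function names are hypothetical.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population stability index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant reference

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def drift_action(psi_value, threshold=0.2):
    """Map a PSI value to the action listed in the trigger table."""
    return "evaluate_then_retrain" if psi_value > threshold else "none"
```

Identical samples give a PSI of zero, and the value grows as mass moves between bins, which is why a fixed cutoff such as 0.2 can serve as a trigger.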

LLM Integration Workflow

Integrate LLM APIs into production applications:

  1. Create provider abstraction layer for vendor flexibility
  2. Implement retry logic with exponential backoff
  3. Configure fallback to secondary provider
  4. Set up token counting and context truncation
  5. Add response caching for repeated queries
  6. Implement cost tracking per request
  7. Add structured output validation with Pydantic
  8. Validation: Response parses correctly, cost within budget
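Steps 4 and 5 (context truncation and response caching) can be combined in a thin wrapper. In this sketch `CachedLLM` is a hypothetical class, `complete_fn` stands in for any provider call, and whitespace splitting is a crude substitute for a real tokenizer.

```python
import hashlib


class CachedLLM:
    """Wrap a completion function with prompt truncation and an in-memory cache."""

    def __init__(self, complete_fn, max_prompt_tokens=1000):
        self._complete = complete_fn
        self._cache = {}
        self.max_prompt_tokens = max_prompt_tokens

    def _truncate(self, prompt):
        # Assumption: whitespace-separated words approximate tokens.
        tokens = prompt.split()
        return " ".join(tokens[: self.max_prompt_tokens])

    def complete(self, prompt):
        truncated = self._truncate(prompt)
        key = hashlib.sha256(truncated.encode("utf-8")).hexdigest()
        if key not in self._cache:
            # Only cache misses reach the provider, so repeated queries cost nothing.
            self._cache[key] = self._complete(truncated)
        return self._cache[key]
```

Keying the cache on the truncated prompt means two inputs that differ only beyond the token limit share one entry. A production cache would typically also include the model name and sampling parameters in the key.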

Provider Abstraction

from abc import ABC, abstractmethod
from tenacity import retry, stop_after_attempt, wait_exponential

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        pass

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(provider: LLMProvider, prompt: str) -> str:
    return provider.complete(prompt)
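Fallback to a secondary provider (step 3 of the workflow) fits the same abstraction. The sketch below redefines `LLMProvider` so that it runs on its own; `FallbackProvider`, `FailingProvider`, and `EchoProvider` are hypothetical classes introduced only for illustration.

```python
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        ...


class FallbackProvider(LLMProvider):
    """Try each provider in order and return the first successful completion."""

    def __init__(self, providers):
        self.providers = list(providers)

    def complete(self, prompt: str, **kwargs) -> str:
        last_error = None
        for provider in self.providers:
            try:
                return provider.complete(prompt, **kwargs)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                last_error = exc
        raise RuntimeError("all providers failed") from last_error


class FailingProvider(LLMProvider):
    """Stub that simulates an outage."""

    def complete(self, prompt: str, **kwargs) -> str:
        raise ConnectionError("primary unavailable")


class EchoProvider(LLMProvider):
    """Stub that returns a deterministic response."""

    def complete(self, prompt: str, **kwargs) -> str:
        return f"echo: {prompt}"
```

Because `FallbackProvider` is itself an `LLMProvider`, it can be passed to a retrying caller unchanged, so retries wrap the whole primary-then-secondary sequence.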

Cost Management

| Provider | Input Cost (per 1K tokens) | Output Cost (per 1K tokens) |
|---|---|---|
| GPT-4 | $0.03 | $0.06 |
| GPT-3.5 | $0.0005 | $0.0015 |
| Claude 3 Opus | $0.015 | $0.075 |
| Claude 3 Haiku | $0.00025 | $0.00125 |
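Per-request cost tracking follows directly from these rates. The sketch below copies the prices listed above into a lookup table; `request_cost` and `CostTracker` are hypothetical names, and token counts are assumed to come from the provider's usage data.

```python
# Prices in USD per 1K tokens as (input, output), copied from the table above.
PRICES_PER_1K = {
    "GPT-4": (0.03, 0.06),
    "GPT-3.5": (0.0005, 0.0015),
    "Claude 3 Opus": (0.015, 0.075),
    "Claude 3 Haiku": (0.00025, 0.00125),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request, given its token usage."""
    input_price, output_price = PRICES_PER_1K[model]
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price


class CostTracker:
    """Accumulate spend and compare it with a budget."""

    def __init__(self, budget_usd):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def record(self, model, input_tokens, output_tokens):
        cost = request_cost(model, input_tokens, output_tokens)
        self.spent_usd += cost
        return cost

    def within_budget(self):
        return self.spent_usd <= self.budget_usd
```

For example, a GPT-4 request with 1,000 input tokens and 500 output tokens costs 0.03 + 0.03 = 0.06 USD, while the same request on Claude 3 Haiku costs 0.00025 + 0.000625 = 0.000875 USD.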

RAG System Implementation

Build a retrieval-augmented generation pipeline:

  1. Choose vector database (Pinecone, Qdrant, Weaviate)
  2. Select embedding model based on quality/cost tradeoff
  3. Implement document chunking strategy
  4. Create ingestion pipeline with metadata extraction
  5. Build retrieval with query embedding
  6. Add reranking for relevance improvement
  7. Format context and send to LLM
  8. Validation: Response references retrieved context, no hallucinations
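Steps 5 and 7 (retrieval by query embedding, then context formatting) can be illustrated without a vector database. In this toy sketch a bag-of-words counter stands in for a real embedding model, and a sorted list stands in for the vector store; all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Format retrieved chunks and the question into a single prompt."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Swapping `embed` for a real embedding call and `retrieve` for a vector database query keeps the same shape; a reranker (step 6) would sit between `retrieve` and `build_prompt`.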

Vector Database Selection

| Database | Hosting | Scale | Latency | Best For |
|---|---|---|---|---|
| Pinecone | Managed | High | Low | Production, managed |
| Qdrant | Both | High | Very Low | Performance-critical |
| Weaviate | Both | High | Low | Hybrid search |
| Chroma | Self-hosted | Medium | Low | Prototyping |
| pgvector | Self-hosted | Medium | Medium | Existing Postgres |

Chunking Strategies

| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed | 500-1000 tokens | 50-100 tokens | General text |
| Sentence | 3-5 sentences | 1 sentence | Structured text |
| Semantic | Variable | Based on meaning | Research papers |
| Recursive | Hierarchical | Parent-child | Long documents |
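The fixed strategy is the simplest to show in code. The sketch below operates on an already tokenized sequence, so the tokenizer is left as an assumption; `chunk_fixed` is a hypothetical helper name.

```python
def chunk_fixed(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks
```

With ten tokens, a chunk size of 4, and an overlap of 1, this yields three chunks whose boundaries share one token each, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.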

Model Monitoring

Monitor production models for drift and degradation:

  1. Set up latency tracking (p50, p95, p99)
  2. Configure error rate alerting
  3. Implement input data drift detection
  4. Track prediction distribution shifts
  5. Log ground truth when available
  6. Compare model versions with A/B metrics
  7. Set up automated retraining triggers
  8. Validation: Alerts fire before user-visible degradation
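Latency tracking (step 1) can be sketched with a rolling window of recent samples. `LatencyTracker` is a hypothetical class, and the nearest-rank method with integer percentiles is an assumption; metrics backends often use histograms or interpolation instead.

```python
from collections import deque


class LatencyTracker:
    """Keep the most recent latency samples and report percentiles."""

    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, q):
        """Nearest-rank percentile for an integer q between 1 and 100."""
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q% of n
        return ordered[rank - 1]

    def summary(self):
        return {f"p{q}": self.percentile(q) for q in (50, 95, 99)}
```

The bounded deque keeps memory constant and makes the percentiles reflect recent traffic only, which is usually what an alert on latency regressions needs.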

Drift Detection

from scipy.stats import ks_2samp

def detect_drift(reference, current, threshold=0.05):
    statistic, p_value = ks_2samp(reference, current)
    return {
        "drift_detected": p_value < threshold,
        "ks_statistic": statistic,
        "p_value": p_value
    }

Alert Thresholds

| Metric | Warning | Critical |
|---|---|---|
| p95 latency | > 100ms | > 200ms |
| Error rate | > 0.1% | > 1% |
| PSI (drift) | > 0.1 | > 0.2 |
| Accuracy drop | > 2% | > 5% |
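These thresholds translate into a small lookup that maps a metric value to a severity. The metric keys and the `severity` function are hypothetical names; the numbers are the ones listed above, with percentages kept as percentages.

```python
# (warning, critical) thresholds per metric, taken from the table above.
THRESHOLDS = {
    "p95_latency_ms": (100, 200),
    "error_rate_pct": (0.1, 1.0),
    "psi": (0.1, 0.2),
    "accuracy_drop_pct": (2.0, 5.0),
}


def severity(metric, value):
    """Return "ok", "warning", or "critical" for a metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

Checking the critical bound first matters: a reading above the critical threshold is also above the warning threshold, so the order of the comparisons decides which label wins.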

Reference Documentation

MLOps Production Patterns

references/mlops_production_patterns.md contains:

  • Model deployment pipeline with Kubernetes manifests
  • Feature store architecture with Feast examples
  • Model monitoring with drift detection code
  • A/B testing infrastructure with traffic splitting
  • Automated retraining pipeline with MLflow

LLM Integration Guide

references/llm_integration_guide.md contains:

  • Provider abstraction layer pattern
  • Retry and fallback strategies with tenacity
  • Prompt engineering templates (few-shot, CoT)
  • Token optimization with tiktoken
  • Cost calculation and tracking

RAG System Architecture

references/rag_system_architecture.md contains:

  • RAG pipeline implementation with code
  • Vector database comparison and integration
  • Chunking strategies (fixed, semantic, recursive)
  • Embedding model selection guide
  • Hybrid search and reranking patterns

Tools

Model Deployment Pipeline

python scripts/model_deployment_pipeline.py -i ./models/classifier.pkl -o ./deploy/

Generates deployment artifacts: Dockerfile, Kubernetes manifests, health checks.

RAG System Builder

python scripts/rag_system_builder.py -i ./documents/ -o ./rag-pipeline/ -c rag_config.yaml

Scaffolds a RAG pipeline with vector store integration and retrieval logic.

ML Monitoring Suite

python scripts/ml_monitoring_suite.py -i ./model-metrics/ -o ./monitoring/ -c monitoring.yaml

Sets up drift detection, alerting, and performance dashboards.


Tech Stack

| Category | Tools |
|---|---|
| ML Frameworks | PyTorch, TensorFlow, Scikit-learn, XGBoost |
| LLM Frameworks | LangChain, LlamaIndex, DSPy |
| MLOps | MLflow, Weights & Biases, Kubeflow |
| Data | Spark, Airflow, dbt, Kafka |
| Deployment | Docker, Kubernetes, Triton |
| Databases | PostgreSQL, BigQuery, Pinecone, Redis |

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Model latency spikes after deployment | Container resource limits too low or cold starts on serverless | Pre-warm instances, increase CPU/memory limits, enable GPU request batching |
| Data drift alerts firing constantly | Reference distribution outdated or threshold too sensitive | Recalibrate reference window to recent 30 days, raise PSI warning threshold to 0.15 |
| Feature store serving stale features | TTL misconfigured or materialization job failing silently | Verify TTL matches data freshness SLA, add alerting on materialization job status |
| RAG retrieval returns irrelevant chunks | Chunk size too large or embedding model mismatch | Reduce chunk size to 300-500 tokens, switch to domain-tuned embedding model, add reranker |
| LLM provider rate limits hit in production | No request queuing or burst traffic exceeds quota | Implement token bucket rate limiter, add request queue with backpressure, configure fallback provider |
| Model accuracy degrades gradually | Concept drift in underlying data distribution | Enable automated retraining triggers on accuracy drop > 2%, schedule weekly evaluation jobs |
| A/B test results inconclusive after weeks | Insufficient traffic split or high-variance metric chosen | Increase treatment allocation to 10-20%, switch to lower-variance proxy metric, extend test duration |
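The token bucket rate limiter suggested for provider rate limits can be sketched in a few lines. `TokenBucket` is a hypothetical class; the injectable `clock` parameter is an assumption added so the behaviour can be tested without sleeping.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def allow(self, cost=1.0):
        """Consume `cost` tokens if available; return whether the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller that receives `False` can enqueue the request or route it to a fallback provider, which ties the limiter to the other two remedies in the same row.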

Success Criteria

  • Model serving latency p99 under 100ms for real-time inference endpoints
  • Zero data drift alerts unresolved for more than 48 hours
  • Automated retraining pipeline triggers within 1 hour of performance threshold breach
  • RAG system retrieval accuracy (hit rate at k=5) above 90% on evaluation set
  • LLM integration uptime at 99.9% with provider fallback activating in under 2 seconds
  • Feature store materialization freshness within defined TTL for all online features
  • Model deployment rollback completes in under 5 minutes with zero dropped requests
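The retrieval criterion (hit rate at k=5) can be measured with a short helper. This sketch assumes an evaluation set where each query has a ranked list of retrieved document IDs and a set of relevant IDs; `hit_rate_at_k` is a hypothetical name.

```python
def hit_rate_at_k(retrieved, relevant, k=5):
    """Fraction of queries with at least one relevant document in the top k.

    retrieved: list of ranked ID lists, one per query
    relevant:  list of sets of relevant IDs, aligned with `retrieved`
    """
    if len(retrieved) != len(relevant) or not relevant:
        raise ValueError("inputs must be non-empty and the same length")
    hits = sum(1 for docs, rel in zip(retrieved, relevant) if rel & set(docs[:k]))
    return hits / len(relevant)
```

Under this definition the 90% target means that at most one query in ten may have no relevant document among its first five results.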

Scope & Limitations

This skill covers:

  • End-to-end model deployment pipelines (packaging, containerization, serving, canary rollout)
  • MLOps infrastructure setup (feature stores, experiment tracking, model registries, retraining)
  • LLM integration patterns (provider abstraction, retries, caching, cost tracking)
  • RAG system architecture (vector databases, chunking, retrieval, reranking)

This skill does NOT cover:

  • Model training algorithms or hyperparameter tuning (see senior-data-scientist)
  • Raw data pipeline construction and ETL orchestration (see senior-data-engineer)
  • Prompt engineering techniques, few-shot design, or prompt optimization (see senior-prompt-engineer)
  • Image/video model architectures or computer vision inference optimization (see senior-computer-vision)

Integration Points

| Skill | Integration | Data Flow |
|---|---|---|
| senior-data-scientist | Receives trained models and evaluation metrics for deployment | Data Scientist exports model artifacts and baseline metrics; ML Engineer packages and deploys |
| senior-data-engineer | Consumes feature pipelines and data quality outputs | Data Engineer builds ETL and feature pipelines; ML Engineer reads from feature store for serving |
| senior-prompt-engineer | Provides LLM serving infrastructure for prompt workflows | Prompt Engineer designs prompts; ML Engineer deploys provider abstraction and manages cost/latency |
| senior-devops | Leverages CI/CD and Kubernetes infrastructure for model serving | DevOps manages cluster and pipelines; ML Engineer defines deployment manifests and health checks |
| senior-computer-vision | Deploys vision models through shared serving infrastructure | CV Engineer trains and exports models; ML Engineer handles Triton/TorchServe deployment and monitoring |
| senior-security | Applies security scanning to model containers and API endpoints | Security reviews container images and endpoint auth; ML Engineer remediates findings before promotion |

Tool Reference

model_deployment_pipeline.py

Purpose: Generates deployment artifacts for productionizing ML models, including Dockerfiles, Kubernetes manifests, and health check configurations.

Usage:

python scripts/model_deployment_pipeline.py --input <path> --output <path> [--config <file>] [--verbose]

Flags/Parameters:

| Flag | Short | Required | Description |
|---|---|---|---|
| --input | -i | Yes | Input path (model artifact or directory) |
| --output | -o | Yes | Output path for generated deployment artifacts |
| --config | -c | No | Configuration file for deployment settings |
| --verbose | -v | No | Enable debug-level logging output |

Example:

python scripts/model_deployment_pipeline.py -i ./models/classifier.pkl -o ./deploy/

Output Formats: JSON to stdout containing status, start_time, end_time, and processed_items. Logs progress to stderr.


rag_system_builder.py

Purpose: Scaffolds a RAG pipeline with vector store integration, retrieval logic, and ingestion configuration.

Usage:

python scripts/rag_system_builder.py --input <path> --output <path> [--config <file>] [--verbose]

Flags/Parameters:

| Flag | Short | Required | Description |
|---|---|---|---|
| --input | -i | Yes | Input path (document corpus or configuration directory) |
| --output | -o | Yes | Output path for generated RAG pipeline artifacts |
| --config | -c | No | Configuration file for RAG settings (vector DB, chunking, embedding) |
| --verbose | -v | No | Enable debug-level logging output |

Example:

python scripts/rag_system_builder.py -i ./documents/ -o ./rag-pipeline/ -c rag_config.yaml

Output Formats: JSON to stdout containing status, start_time, end_time, and processed_items. Logs progress to stderr.


ml_monitoring_suite.py

Purpose: Sets up drift detection, performance alerting, and monitoring dashboards for production ML models.

Usage:

python scripts/ml_monitoring_suite.py --input <path> --output <path> [--config <file>] [--verbose]

Flags/Parameters:

| Flag | Short | Required | Description |
|---|---|---|---|
| --input | -i | Yes | Input path (model metrics, reference data, or monitoring config) |
| --output | -o | Yes | Output path for generated monitoring configuration and dashboards |
| --config | -c | No | Configuration file for monitoring thresholds and alert rules |
| --verbose | -v | No | Enable debug-level logging output |

Example:

python scripts/ml_monitoring_suite.py -i ./model-metrics/ -o ./monitoring/ -c monitoring.yaml -v

Output Formats: JSON to stdout containing status, start_time, end_time, and processed_items. Logs progress to stderr.