# mlflow-evaluation
MLflow 3 GenAI evaluation for agent development. Use when (1) writing mlflow.genai.evaluate() code, (2) creating @scorer functions, (3) building evaluation datasets from traces, (4) using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), (5) analyzing traces for latency/errors/architecture, (6) optimizing agent context/prompts/token usage, (7) debugging evaluation failures. Covers the full eval workflow: trace analysis -> dataset building -> scorer creation -> evaluation execution.
Clone the full registry:

```bash
git clone https://github.com/majiayu000/claude-skill-registry-data
```

Or install just this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mlflow-evaluation" ~/.claude/skills/majiayu000-claude-skill-registry-data-mlflow-evaluation && rm -rf "$T"
```
Source: `data/mlflow-evaluation/SKILL.md`

## MLflow 3 GenAI Evaluation
### Before Writing Any Code

- Read `GOTCHAS.md` - 15+ common mistakes that cause failures
- Read `CRITICAL-interfaces.md` - exact API signatures and data schemas
### End-to-End Workflows

Follow the workflow that matches your goal. Each step indicates which reference files to read.
#### Workflow 1: First-Time Evaluation Setup

For users new to MLflow GenAI evaluation or setting up evaluation for a new agent; a code sketch follows the table.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand what to evaluate | (Journey 0: Strategy) |
| 2 | Learn API patterns | + |
| 3 | Build initial dataset | (Patterns 1-4) |
| 4 | Choose/create scorers | + (built-in list) |
| 5 | Run evaluation | (Patterns 1-3) |
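A minimal sketch of steps 3-5, assuming a toy `my_agent` function (hypothetical) and the nested data format described under Critical API Facts below:

```python
import mlflow
from mlflow.genai.scorers import Correctness, Safety

def my_agent(query: str) -> str:
    # Hypothetical agent under test -- replace with your real agent call.
    return f"Answer to: {query}"

# predict_fn receives each row's "inputs" keys as unpacked kwargs,
# so its parameter names must match those keys exactly.
def predict_fn(query: str) -> str:
    return my_agent(query)

# Initial dataset: arguments nest under "inputs"; ground truth goes under
# "expectations" (Correctness reads expected_response / expected_facts).
eval_data = [
    {
        "inputs": {"query": "What is MLflow?"},
        "expectations": {"expected_response": "MLflow is an open source MLOps platform."},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[Correctness(), Safety()],
)
```

Note that built-in scorers like `Correctness` and `Safety` are LLM judges, so they need a judge model available; how that is configured depends on your environment.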
#### Workflow 2: Production Traces -> Evaluation Dataset

For building evaluation datasets from production traces; a sketch of the trace-to-dataset conversion follows the table.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Search and filter traces | (MCP tools section) |
| 2 | Analyze trace quality | (Patterns 1-7) |
| 3 | Tag traces for inclusion | (Patterns 16-17) |
| 4 | Build dataset from traces | (Patterns 6-7) |
| 5 | Add expectations/ground truth | (Pattern 2) |
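A sketch of steps 1 and 4, assuming `mlflow.search_traces()` returns its usual pandas DataFrame; the `request`/`response` column names are an assumption, so check `traces.columns` on your MLflow version:

```python
import json
import mlflow

# Step 1: pull candidate traces (narrow with experiment_ids / filter_string).
traces = mlflow.search_traces(max_results=50)

# Step 4: convert traces into evaluation rows in the nested format.
# ASSUMPTION: the "request" and "response" columns hold JSON-serialized
# inputs and outputs -- verify against traces.columns first.
eval_rows = []
for _, row in traces.iterrows():
    eval_rows.append({
        "inputs": json.loads(row["request"]),
        # Production outputs are only a starting point for expectations;
        # curate them before treating them as ground truth (step 5).
        "expectations": {"expected_response": row["response"]},
    })
```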
#### Workflow 3: Performance Optimization

For debugging slow or expensive agent execution; a span-profiling sketch follows the table.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Profile latency by span | (Patterns 4-6) |
| 2 | Analyze token usage | (Pattern 9) |
| 3 | Detect context issues | (Section 5) |
| 4 | Apply optimizations | (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | (Patterns 6-7) |
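For step 1, a sketch that ranks one trace's spans by latency using the nanosecond timestamps on MLflow span entities (fill in your own trace ID):

```python
import mlflow

trace = mlflow.get_trace("<trace-id>")  # placeholder trace ID

# Sort spans by duration so the slowest steps surface first.
spans = sorted(
    trace.data.spans,
    key=lambda s: s.end_time_ns - s.start_time_ns,
    reverse=True,
)
for span in spans:
    latency_ms = (span.end_time_ns - span.start_time_ns) / 1e6
    print(f"{span.name:<40} {latency_ms:10.1f} ms")
```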
#### Workflow 4: Regression Detection

For comparing agent versions and finding regressions; a baseline-vs-candidate sketch follows the table.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Establish baseline | (Pattern 4: named runs) |
| 2 | Run current version | (Pattern 1) |
| 3 | Compare metrics | (Patterns 6-7) |
| 4 | Analyze failing traces | (Pattern 7) |
| 5 | Debug specific failures | (Patterns 8-9) |
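A sketch of steps 1-3, assuming `eval_data`, `agent_v1`, and `agent_v2` already exist (all hypothetical) and that the result object exposes aggregate scores via `.metrics`:

```python
import mlflow
from mlflow.genai.scorers import Safety

# Steps 1-2: evaluate each version under its own named run.
with mlflow.start_run(run_name="baseline-v1"):
    baseline = mlflow.genai.evaluate(
        data=eval_data, predict_fn=agent_v1, scorers=[Safety()]
    )
with mlflow.start_run(run_name="candidate-v2"):
    candidate = mlflow.genai.evaluate(
        data=eval_data, predict_fn=agent_v2, scorers=[Safety()]
    )

# Step 3: compare aggregate metrics; exact key names depend on the scorers.
for name, value in candidate.metrics.items():
    base = baseline.metrics.get(name)
    if base is not None:
        print(f"{name}: {value:.3f} ({value - base:+.3f} vs baseline)")
```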
#### Workflow 5: Custom Scorer Development

For creating project-specific evaluation metrics; a minimal `@scorer` example follows the table.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand scorer interface | (Scorer section) |
| 2 | Choose scorer pattern | (Patterns 4-11) |
| 3 | For multi-agent scorers | (Patterns 13-16) |
| 4 | Test with evaluation | (Pattern 1) |
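A minimal custom scorer sketch: the `@scorer` decorator wraps a plain function, and MLflow passes only the parameters the function declares (drawn from `inputs`, `outputs`, `expectations`, `trace`). The conciseness rule itself is a hypothetical project metric:

```python
import mlflow
from mlflow.genai.scorers import scorer

@scorer
def response_is_concise(outputs) -> bool:
    # Hypothetical rule: pass if the response stays under 500 characters.
    return len(str(outputs)) < 500

def predict_fn(query: str) -> str:
    return f"Answer to: {query}"  # stand-in agent for testing the scorer

# Step 4: exercise the scorer inside a real evaluation run.
mlflow.genai.evaluate(
    data=[{"inputs": {"query": "What is MLflow?"}}],
    predict_fn=predict_fn,
    scorers=[response_is_concise],
)
```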
### Reference Files Quick Lookup

| Reference | Purpose | When to Read |
|---|---|---|
| `GOTCHAS.md` | Common mistakes | Always read first before writing code |
| `CRITICAL-interfaces.md` | API signatures, schemas | When writing any evaluation code |
| | Running evals, comparing | When executing evaluations |
| | Custom scorer creation | When built-in scorers aren't enough |
| | Dataset building | When preparing evaluation data |
| | Trace debugging | When analyzing agent behavior |
| | Token/latency fixes | When agent is slow or expensive |
| | High-level workflows | When starting a new evaluation project |
### Critical API Facts

- Use `mlflow.genai.evaluate()` (NOT `mlflow.evaluate()`)
- Data format: `{"inputs": {"query": "..."}}` (nested structure required)
- `predict_fn` receives unpacked keyword arguments (not a dict)

See `GOTCHAS.md` for the complete list.
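The three facts together, as a wrong-vs-right sketch (`my_agent` is a hypothetical stand-in):

```python
import mlflow
from mlflow.genai.scorers import Safety

def my_agent(query: str) -> str:
    return f"Answer to: {query}"  # hypothetical agent

# WRONG: flat rows and a dict-taking function will fail.
#   data = [{"query": "hi"}]
#   def predict_fn(row): ...

# RIGHT: nest arguments under "inputs"; predict_fn's parameter names
# must match the inputs keys, which arrive as unpacked kwargs.
data = [{"inputs": {"query": "hi"}}]

def predict_fn(query: str) -> str:
    return my_agent(query)

mlflow.genai.evaluate(data=data, predict_fn=predict_fn, scorers=[Safety()])
```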