claude-skill-registry-data · mlflow-evaluation

MLflow 3 GenAI evaluation for agent development. Use when (1) writing mlflow.genai.evaluate() code, (2) creating @scorer functions, (3) building evaluation datasets from traces, (4) using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), (5) analyzing traces for latency/errors/architecture, (6) optimizing agent context/prompts/token usage, (7) debugging evaluation failures. Covers the full eval workflow: trace analysis -> dataset building -> scorer creation -> evaluation execution.

Install

Source · Clone the upstream repo:

    git clone https://github.com/majiayu000/claude-skill-registry-data

Claude Code · Install into ~/.claude/skills/:

    T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mlflow-evaluation" ~/.claude/skills/majiayu000-claude-skill-registry-data-mlflow-evaluation && rm -rf "$T"

Manifest: data/mlflow-evaluation/SKILL.md

Source content

MLflow 3 GenAI Evaluation

Before Writing Any Code

  1. Read GOTCHAS.md - 15+ common mistakes that cause failures
  2. Read CRITICAL-interfaces.md - Exact API signatures and data schemas

End-to-End Workflows

Follow these workflows based on your goal. Each step indicates which reference files to read.

Workflow 1: First-Time Evaluation Setup

For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.

Step · Action · Reference Files
1 · Understand what to evaluate · user-journeys.md (Journey 0: Strategy)
2 · Learn API patterns · GOTCHAS.md + CRITICAL-interfaces.md
3 · Build initial dataset · patterns-datasets.md (Patterns 1-4)
4 · Choose/create scorers · patterns-scorers.md + CRITICAL-interfaces.md (built-in list)
5 · Run evaluation · patterns-evaluation.md (Patterns 1-3)
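
Once these steps are done, a first evaluation can be as small as the sketch below. This is a minimal example, not the skill's canonical code: the echo predict_fn is a stand-in for your agent, and Safety is one of the built-in scorers listed in CRITICAL-interfaces.md.

    # Minimal first evaluation (sketch). Assumes an MLflow 3 environment where
    # mlflow.genai is available; replace the echo function with a real agent call.
    import mlflow
    from mlflow.genai.scorers import Safety

    # Each record nests its fields under "inputs" (see Critical API Facts below).
    eval_data = [
        {"inputs": {"query": "What is MLflow?"}},
        {"inputs": {"query": "How do I log a trace?"}},
    ]

    # predict_fn receives the inputs dict unpacked as keyword arguments.
    def predict_fn(query: str) -> str:
        return f"Echo: {query}"

    mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn, scorers=[Safety()])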

Workflow 2: Production Trace -> Evaluation Dataset

For building evaluation datasets from production traces.

Step · Action · Reference Files
1 · Search and filter traces · patterns-trace-analysis.md (MCP tools section)
2 · Analyze trace quality · patterns-trace-analysis.md (Patterns 1-7)
3 · Tag traces for inclusion · patterns-datasets.md (Patterns 16-17)
4 · Build dataset from traces · patterns-datasets.md (Patterns 6-7)
5 · Add expectations/ground truth · patterns-datasets.md (Pattern 2)
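
Steps 1 and 4 together look roughly like the sketch below. It is hedged: the filter string is illustrative, the shape of the search_traces result (a pandas DataFrame with a "request" column in recent MLflow versions) may differ in yours, and expectations are left empty to be filled by a labeling pass (Pattern 2).

    # Sketch: pull successful production traces and turn them into eval records.
    import json

    import mlflow

    traces = mlflow.search_traces(
        filter_string="attributes.status = 'OK'",  # illustrative filter
        max_results=50,
    )

    records = []
    for _, row in traces.iterrows():
        inputs = row["request"]
        if isinstance(inputs, str):  # the request column is often JSON-encoded
            inputs = json.loads(inputs)
        # Ground truth is typically added afterwards (patterns-datasets.md, Pattern 2).
        records.append({"inputs": inputs, "expectations": {}})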

Workflow 3: Performance Optimization

For debugging slow or expensive agent execution.

Step · Action · Reference Files
1 · Profile latency by span · patterns-trace-analysis.md (Patterns 4-6)
2 · Analyze token usage · patterns-trace-analysis.md (Pattern 9)
3 · Detect context issues · patterns-context-optimization.md (Section 5)
4 · Apply optimizations · patterns-context-optimization.md (Sections 1-4, 6)
5 · Re-evaluate to measure impact · patterns-evaluation.md (Patterns 6-7)
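
For step 1, a per-span latency profile of a single trace can be computed directly from the trace object. A sketch, assuming the MLflow 3 trace entity where spans expose start_time_ns and end_time_ns; the trace id is a placeholder:

    # Sketch: find which spans dominate a trace's latency.
    from collections import defaultdict

    import mlflow

    trace = mlflow.get_trace("tr-...")  # placeholder trace id

    latency_ms = defaultdict(float)
    for span in trace.data.spans:
        latency_ms[span.name] += (span.end_time_ns - span.start_time_ns) / 1e6

    for name, ms in sorted(latency_ms.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {ms:.1f} ms")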

Workflow 4: Regression Detection

For comparing agent versions and finding regressions.

Step · Action · Reference Files
1 · Establish baseline · patterns-evaluation.md (Pattern 4: named runs)
2 · Run current version · patterns-evaluation.md (Pattern 1)
3 · Compare metrics · patterns-evaluation.md (Patterns 6-7)
4 · Analyze failing traces · patterns-trace-analysis.md (Pattern 7)
5 · Debug specific failures · patterns-trace-analysis.md (Patterns 8-9)
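
For step 3, the aggregate metrics of two named evaluation runs can be diffed from the tracking store. A sketch using mlflow.search_runs; the run names are assumptions, and it expects both runs to live in the active experiment:

    # Sketch: diff metrics between a baseline and a candidate evaluation run.
    import mlflow

    runs = mlflow.search_runs()  # runs in the active experiment, as a DataFrame
    baseline = runs[runs["tags.mlflow.runName"] == "baseline-v1"].iloc[0]    # assumed name
    candidate = runs[runs["tags.mlflow.runName"] == "candidate-v2"].iloc[0]  # assumed name

    for col in [c for c in runs.columns if c.startswith("metrics.")]:
        delta = candidate[col] - baseline[col]
        print(f"{col}: {baseline[col]:.3f} -> {candidate[col]:.3f} ({delta:+.3f})")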

Workflow 5: Custom Scorer Development

For creating project-specific evaluation metrics.

Step · Action · Reference Files
1 · Understand scorer interface · CRITICAL-interfaces.md (Scorer section)
2 · Choose scorer pattern · patterns-scorers.md (Patterns 4-11)
3 · For multi-agent scorers · patterns-scorers.md (Patterns 13-16)
4 · Test with evaluation · patterns-evaluation.md (Pattern 1)
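
The core shape of a custom scorer: decorate a function with @scorer and declare only the parameters you need (inputs, outputs, expectations, trace); MLflow passes the matching values per row. The metric below is a made-up example, not a recommended one:

    from mlflow.genai.scorers import scorer

    @scorer
    def mentions_source(inputs, outputs) -> bool:
        # Made-up project metric: answers should cite a source.
        return "source:" in str(outputs).lower()

    # Used like any built-in scorer:
    # mlflow.genai.evaluate(data=..., predict_fn=..., scorers=[mentions_source])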

Reference Files Quick Lookup

Reference · Purpose · When to Read
GOTCHAS.md · Common mistakes · Always read first before writing code
CRITICAL-interfaces.md · API signatures, schemas · When writing any evaluation code
patterns-evaluation.md · Running evals, comparing · When executing evaluations
patterns-scorers.md · Custom scorer creation · When built-in scorers aren't enough
patterns-datasets.md · Dataset building · When preparing evaluation data
patterns-trace-analysis.md · Trace debugging · When analyzing agent behavior
patterns-context-optimization.md · Token/latency fixes · When the agent is slow or expensive
user-journeys.md · High-level workflows · When starting a new evaluation project

Critical API Facts

  • Use mlflow.genai.evaluate() (NOT mlflow.evaluate()).
  • Data format: {"inputs": {"query": "..."}} (nested structure required).
  • predict_fn receives **unpacked kwargs (not a dict); see the sketch below.
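
Concretely, the nesting and the unpacking fit together like this (a minimal sketch):

    # For each row, MLflow calls predict_fn(**row["inputs"]).
    data = [{"inputs": {"query": "hi", "user_id": "u1"}}]

    def predict_fn(query, user_id):  # keys of "inputs" become keyword arguments
        ...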

See GOTCHAS.md for the complete list.