databricks-mlflow-evaluation
MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.
Installation
git clone https://github.com/msbaek/dotfiles
T=$(mktemp -d) && git clone --depth=1 https://github.com/msbaek/dotfiles "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/databricks-mlflow-evaluation" ~/.claude/skills/msbaek-dotfiles-databricks-mlflow-evaluation && rm -rf "$T"
.claude/skills/databricks-mlflow-evaluation/SKILL.md
MLflow 3 GenAI Evaluation
Before Writing Any Code
- Read GOTCHAS.md - 15+ common mistakes that cause failures
- Read CRITICAL-interfaces.md - Exact API signatures and data schemas
End-to-End Workflows
Follow these workflows based on your goal. Each step indicates which reference files to read.
Workflow 1: First-Time Evaluation Setup
For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand what to evaluate | user-journeys.md (Journey 0: Strategy) |
| 2 | Learn API patterns | GOTCHAS.md + CRITICAL-interfaces.md |
| 3 | Build initial dataset | (Patterns 1-4) |
| 4 | Choose/create scorers | + (built-in list) |
| 5 | Run evaluation | (Patterns 1-3) |
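A minimal sketch of steps 3-5, assuming a hypothetical `my_agent` callable. Note the nested `inputs` dict and that its keys are **unpacked into `predict_fn` (see Critical API Facts below):

```python
import mlflow
from mlflow.genai.scorers import Guidelines, Safety

# Step 3: the nested {"inputs": {...}} structure is required
eval_data = [
    {"inputs": {"query": "How do I log a model in MLflow?"}},
    {"inputs": {"query": "What is a trace?"}},
]

# predict_fn receives the keys of "inputs" as **unpacked kwargs,
# so the parameter name must match the key ("query" here)
def predict_fn(query: str) -> str:
    return my_agent(query)  # my_agent is a placeholder for your app

# Steps 4-5: mix built-in scorers; Guidelines takes plain-language rules
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[
        Safety(),
        Guidelines(name="tone", guidelines="Responses must be concise and professional."),
    ],
)
```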
Workflow 2: Production Trace -> Evaluation Dataset
For building evaluation datasets from production traces.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Search and filter traces | (MCP tools section) |
| 2 | Analyze trace quality | (Patterns 1-7) |
| 3 | Tag traces for inclusion | (Patterns 16-17) |
| 4 | Build dataset from traces | (Patterns 6-7) |
| 5 | Add expectations/ground truth | (Pattern 2) |
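A sketch of steps 1 and 3-4, assuming candidate traces carry a hypothetical `eval_candidate` tag and using a placeholder UC table name; the exact filter grammar and the `merge_records` input types may vary by MLflow version:

```python
import mlflow
from mlflow.genai.datasets import create_dataset

# Step 1: pull healthy production traces (filter string is an assumption --
# check the MCP tools section for the exact grammar)
traces = mlflow.search_traces(
    filter_string="tags.eval_candidate = 'true'",
    max_results=100,
)

# Step 3: tag additional traces for inclusion as you review them
mlflow.set_trace_tag("<trace-id>", "eval_candidate", "true")

# Step 4: create a UC-backed dataset and merge the traces in
dataset = create_dataset(uc_table_name="catalog.schema.agent_eval")  # placeholder table
dataset.merge_records(traces)
```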
Workflow 3: Performance Optimization
For debugging slow or expensive agent execution.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Profile latency by span | (Patterns 4-6) |
| 2 | Analyze token usage | (Pattern 9) |
| 3 | Detect context issues | (Section 5) |
| 4 | Apply optimizations | (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | (Pattern 6-7) |
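For step 1, a minimal sketch that aggregates wall-clock time per span name from a single trace (the trace ID is a placeholder taken from a search result):

```python
from collections import defaultdict

import mlflow

trace = mlflow.get_trace("<trace-id>")  # placeholder ID from search_traces()

# Sum latency per span name to find the slow steps
latency_ms: dict[str, float] = defaultdict(float)
for span in trace.data.spans:
    latency_ms[span.name] += (span.end_time_ns - span.start_time_ns) / 1e6

for name, ms in sorted(latency_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:<40} {ms:>10.1f} ms")
```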
Workflow 4: Regression Detection
For comparing agent versions and finding regressions.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Establish baseline | (Pattern 4: named runs) |
| 2 | Run current version | (Pattern 1) |
| 3 | Compare metrics | (Patterns 6-7) |
| 4 | Analyze failing traces | (Pattern 7) |
| 5 | Debug specific failures | (Patterns 8-9) |
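A sketch of steps 1-3, assuming hypothetical `agent_v1`/`agent_v2` callables and reusing `eval_data` and the scorer list from Workflow 1; named runs make the two result sets easy to compare in the MLflow UI:

```python
import mlflow

# Steps 1-2: evaluate each agent version under its own named run
with mlflow.start_run(run_name="agent-v1-baseline"):
    baseline = mlflow.genai.evaluate(data=eval_data, predict_fn=agent_v1, scorers=scorers)

with mlflow.start_run(run_name="agent-v2-candidate"):
    candidate = mlflow.genai.evaluate(data=eval_data, predict_fn=agent_v2, scorers=scorers)

# Step 3: results expose aggregate scores via .metrics (names mirror the scorers)
for key, value in baseline.metrics.items():
    print(f"{key}: {value} -> {candidate.metrics.get(key)}")
```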
Workflow 5: Custom Scorer Development
For creating project-specific evaluation metrics.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand scorer interface | (Scorer section) |
| 2 | Choose scorer pattern | (Patterns 4-11) |
| 3 | For multi-agent scorers | (Patterns 13-16) |
| 4 | Test with evaluation | (Pattern 1) |
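For steps 1-2, a minimal custom-scorer sketch using the `@scorer` decorator; a scorer declares only the arguments it needs (`inputs`, `outputs`, `expectations`, or `trace`) and can return a number, a bool, or a `Feedback` with a rationale. The email check itself is illustrative:

```python
import re

from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def no_email_leak(outputs) -> Feedback:
    """Illustrative check: fail any response containing an email address."""
    leaked = bool(re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", str(outputs)))
    return Feedback(
        value="no" if leaked else "yes",
        rationale="Email address found in output" if leaked else "No email addresses found",
    )

# Step 4: test it like any other scorer:
# mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn, scorers=[no_email_leak])
```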
Workflow 6: Unity Catalog Trace Ingestion & Production Monitoring
For storing traces in Unity Catalog, instrumenting applications, and enabling continuous production monitoring.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Link UC schema to experiment | (Patterns 1-2) |
| 2 | Set trace destination | (Patterns 3-4) |
| 3 | Instrument your application | (Patterns 5-8) |
| 4 | Configure trace sources (Apps/Serving/OTEL) | (Patterns 9-11) |
| 5 | Enable production monitoring | (Patterns 12-13) |
| 6 | Query and analyze UC traces | (Pattern 14) |
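For step 3, an instrumentation sketch; it assumes the experiment was already linked to a UC schema in steps 1-2, and `mlflow.openai.autolog()` stands in for whichever autolog flavor matches your LLM client:

```python
import mlflow

mlflow.set_experiment("/Shared/agent-prod")  # assumed already linked to a UC schema (steps 1-2)
mlflow.openai.autolog()  # auto-trace LLM calls; swap for your client's autolog flavor

@mlflow.trace(span_type="AGENT")
def handle_request(query: str) -> str:
    # Each invocation produces one trace routed to the configured destination
    ...
```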
Workflow 7: Judge Alignment with MemAlign
For aligning an LLM judge to match domain expert preferences. A well-aligned judge improves every downstream use: evaluation accuracy, production monitoring signal, and prompt optimization quality. This workflow is valuable on its own, independent of prompt optimization.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Design base judge with make_judge() (any feedback type) | (Pattern 1) |
| 2 | Run evaluate(), tag successful traces | (Pattern 2) |
| 3 | Build UC dataset + create SME labeling session | (Pattern 3) |
| 4 | Align judge with MemAlign after labeling completes | (Pattern 4) |
| 5 | Register aligned judge to experiment | (Pattern 5) |
| 6 | Re-evaluate with aligned judge (baseline) | (Pattern 6) |
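A compressed sketch of steps 1, 4, and 5. `make_judge()` and `align()` are real MLflow GenAI APIs; the `MemAlignOptimizer` import path, its `embedding_model` keyword, and the registration call are assumptions to verify against your MLflow version:

```python
import mlflow
from mlflow.genai.judges import make_judge

# Step 1: base judge -- its name MUST match the label schema name in the
# labeling session, or align() cannot pair scores (see Critical API Facts)
judge = make_judge(
    name="helpfulness",
    instructions="Evaluate whether {{ outputs }} fully answers {{ inputs }}.",
)

# Step 4: align on SME-labeled traces once the labeling session completes.
# MemAlignOptimizer's path and embedding_model kwarg are assumptions.
from mlflow.genai.judges.optimizers import MemAlignOptimizer

labeled_traces = mlflow.search_traces(run_id="<labeling-run-id>")  # placeholder run ID
aligned = judge.align(
    traces=labeled_traces,
    optimizer=MemAlignOptimizer(embedding_model="<embedding-endpoint>"),
)

# Step 5: register the aligned judge to the experiment (assumed API)
aligned.register(experiment_id="<experiment-id>")
```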
Workflow 8: Automated Prompt Optimization with GEPA
For automatically improving a registered system prompt using optimize_prompts(). It works with any scorer, but pairing it with an aligned judge (Workflow 7) gives the most domain-accurate signal. For the full end-to-end loop combining alignment and optimization, see user-journeys.md Journey 10.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Build optimization dataset (inputs + expectations) | (Pattern 1) |
| 2 | Run optimize_prompts() with GEPA + scorer | (Pattern 2) |
| 3 | Register new version, promote conditionally | (Pattern 3) |
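A sketch of the loop; `optimize_prompts()` with GEPA ships in MLflow >= 3.5.0, but the exact parameter and result attribute names below are assumptions to check against the function's signature:

```python
import mlflow

# Step 1: optimization records need BOTH inputs AND expectations
train_data = [
    {
        "inputs": {"query": "How do I enable tracing?"},
        "expectations": {"expected_response": "Call the autolog API for your client..."},
    },
]

# Step 2: optimize the registered prompt with GEPA, scored by the aligned judge
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,                                   # from Workflow 1
    train_data=train_data,
    prompt_uris=["prompts:/catalog.schema.agent_prompt/1"],  # placeholder prompt URI
    scorers=[aligned_judge],                                 # aligned judge from Workflow 7
)

# Step 3: inspect the new version before promoting it (attribute name assumed)
print(result.optimized_prompts)
```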
Reference Files Quick Lookup
| Reference | Purpose | When to Read |
|---|---|---|
| GOTCHAS.md | Common mistakes | Always read first before writing code |
| CRITICAL-interfaces.md | API signatures, schemas | When writing any evaluation code |
| | Running evals, comparing | When executing evaluations |
| | Custom scorer creation | When built-in scorers aren't enough |
| | Dataset building | When preparing evaluation data |
| | Trace debugging | When analyzing agent behavior |
| | Token/latency fixes | When agent is slow or expensive |
| | UC trace setup, monitoring | When setting up trace storage or production monitoring |
| | MemAlign judge alignment, labeling sessions, SME feedback | When aligning judges to domain expert preferences |
| | GEPA optimization: build dataset, optimize_prompts(), promote | When running automated prompt improvement |
| user-journeys.md | High-level workflows, full domain-expert optimization loop | When starting a new evaluation project or running the full align + optimize cycle |
Critical API Facts
- Use: `mlflow.genai.evaluate()` (NOT `mlflow.evaluate()`)
- Data format: `{"inputs": {"query": "..."}}` (nested structure required)
- predict_fn: Receives **unpacked kwargs (not a dict)
- MemAlign: Scorer-agnostic (works with any `feedback_value_type` -- float, bool, categorical); token-heavy on the embedding model, so set `embedding_model` explicitly
- Label schema name matching: The label schema `name` in the labeling session MUST match the judge `name` used in `evaluate()` for `align()` to pair scores
- Aligned judge scores: May be lower than unaligned judge scores -- this is expected and means the judge is now more accurate, not that the agent regressed
- GEPA optimization dataset: Must have both `inputs` AND `expectations` per record (different from eval dataset)
- Episodic memory: Lazily loaded -- `get_scorer()` results won't show episodic memory on print until the judge is first used
- optimize_prompts: Requires MLflow >= 3.5.0

See GOTCHAS.md for the complete list.
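The first three facts in code form, using hypothetical names:

```python
# Correct: nested "inputs" whose keys match predict_fn's parameter names
data = [{"inputs": {"query": "What is MLflow?"}}]

def predict_fn(query: str) -> str:   # called as predict_fn(query=...)
    ...

# Wrong: flat records, or a predict_fn that expects a single dict
# data = [{"query": "What is MLflow?"}]
# def predict_fn(inputs: dict) -> str: ...
```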
Related Skills
- databricks-docs - General Databricks documentation reference
- databricks-model-serving - Deploying models and agents to serving endpoints
- databricks-agent-bricks - Building agents that can be evaluated with this skill
- databricks-python-sdk - SDK patterns used alongside MLflow APIs
- databricks-unity-catalog - Unity Catalog tables for managed evaluation datasets