databricks-mlflow-evaluation
MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.
Installation
git clone https://github.com/msbaek/dotfiles
T=$(mktemp -d) && git clone --depth=1 https://github.com/msbaek/dotfiles "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/databricks-mlflow-evaluation" ~/.claude/skills/msbaek-dotfiles-databricks-mlflow-evaluation && rm -rf "$T"
.claude/skills/databricks-mlflow-evaluation/SKILL.md
MLflow 3 GenAI Evaluation
Before Writing Any Code
- Read GOTCHAS.md - 15+ common mistakes that cause failures
- Read CRITICAL-interfaces.md - Exact API signatures and data schemas
End-to-End Workflows
Follow these workflows based on your goal. Each step indicates which reference files to read.
Workflow 1: First-Time Evaluation Setup
For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand what to evaluate | user-journeys.md (Journey 0: Strategy) |
| 2 | Learn API patterns | GOTCHAS.md + CRITICAL-interfaces.md |
| 3 | Build initial dataset | (Patterns 1-4) |
| 4 | Choose/create scorers | + (built-in list) |
| 5 | Run evaluation | (Patterns 1-3) |
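A minimal sketch of steps 3-5, assuming a hypothetical `my_agent` callable. Note the nested `inputs` dict and that its keys are **unpacked into `predict_fn` (see Critical API Facts below):

```python
import mlflow
from mlflow.genai.scorers import Guidelines, Safety

# Step 3: the nested {"inputs": {...}} structure is required
eval_data = [
    {"inputs": {"query": "How do I log a model in MLflow?"}},
    {"inputs": {"query": "What is a trace?"}},
]

# predict_fn receives the keys of "inputs" as **unpacked kwargs,
# so the parameter name must match the key ("query" here)
def predict_fn(query: str) -> str:
    return my_agent(query)  # my_agent is a placeholder for your app

# Steps 4-5: mix built-in scorers; Guidelines takes plain-language rules
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[
        Safety(),
        Guidelines(name="tone", guidelines="Responses must be concise and professional."),
    ],
)
```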
Workflow 2: Production Trace -> Evaluation Dataset
For building evaluation datasets from production traces.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Search and filter traces | (MCP tools section) |
| 2 | Analyze trace quality | (Patterns 1-7) |
| 3 | Tag traces for inclusion | (Patterns 16-17) |
| 4 | Build dataset from traces | (Patterns 6-7) |
| 5 | Add expectations/ground truth | (Pattern 2) |
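A sketch of steps 1 and 3-4, assuming candidate traces carry a hypothetical `eval_candidate` tag and using a placeholder UC table name; the exact filter grammar and the `merge_records` input types may vary by MLflow version:

```python
import mlflow
from mlflow.genai.datasets import create_dataset

# Step 1: pull healthy production traces (filter string is an assumption --
# check the MCP tools section for the exact grammar)
traces = mlflow.search_traces(
    filter_string="tags.eval_candidate = 'true'",
    max_results=100,
)

# Step 3: tag additional traces for inclusion as you review them
mlflow.set_trace_tag("<trace-id>", "eval_candidate", "true")

# Step 4: create a UC-backed dataset and merge the traces in
dataset = create_dataset(uc_table_name="catalog.schema.agent_eval")  # placeholder table
dataset.merge_records(traces)
```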
Workflow 3: Performance Optimization
For debugging slow or expensive agent execution.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Profile latency by span | (Patterns 4-6) |
| 2 | Analyze token usage | (Pattern 9) |
| 3 | Detect context issues | (Section 5) |
| 4 | Apply optimizations | (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | (Pattern 6-7) |
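For step 1, a minimal sketch that aggregates wall-clock time per span name from a single trace (the trace ID is a placeholder taken from a search result):

```python
from collections import defaultdict

import mlflow

trace = mlflow.get_trace("<trace-id>")  # placeholder ID from search_traces()

# Sum latency per span name to find the slow steps
latency_ms: dict[str, float] = defaultdict(float)
for span in trace.data.spans:
    latency_ms[span.name] += (span.end_time_ns - span.start_time_ns) / 1e6

for name, ms in sorted(latency_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:<40} {ms:>10.1f} ms")
```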
Workflow 4: Regression Detection
For comparing agent versions and finding regressions.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Establish baseline | (Pattern 4: named runs) |
| 2 | Run current version | (Pattern 1) |
| 3 | Compare metrics | (Patterns 6-7) |
| 4 | Analyze failing traces | (Pattern 7) |
| 5 | Debug specific failures | (Patterns 8-9) |
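A sketch of steps 1-3, assuming hypothetical `agent_v1`/`agent_v2` callables and reusing `eval_data` and the scorer list from Workflow 1; named runs make the two result sets easy to compare in the MLflow UI:

```python
import mlflow

# Steps 1-2: evaluate each agent version under its own named run
with mlflow.start_run(run_name="agent-v1-baseline"):
    baseline = mlflow.genai.evaluate(data=eval_data, predict_fn=agent_v1, scorers=scorers)

with mlflow.start_run(run_name="agent-v2-candidate"):
    candidate = mlflow.genai.evaluate(data=eval_data, predict_fn=agent_v2, scorers=scorers)

# Step 3: results expose aggregate scores via .metrics (names mirror the scorers)
for key, value in baseline.metrics.items():
    print(f"{key}: {value} -> {candidate.metrics.get(key)}")
```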
Workflow 5: Custom Scorer Development
For creating project-specific evaluation metrics.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Understand scorer interface | (Scorer section) |
| 2 | Choose scorer pattern | (Patterns 4-11) |
| 3 | For multi-agent scorers | (Patterns 13-16) |
| 4 | Test with evaluation | (Pattern 1) |
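For steps 1-2, a minimal custom-scorer sketch using the `@scorer` decorator; a scorer declares only the arguments it needs (`inputs`, `outputs`, `expectations`, or `trace`) and can return a number, a bool, or a `Feedback` with a rationale. The email check itself is illustrative:

```python
import re

from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def no_email_leak(outputs) -> Feedback:
    """Illustrative check: fail any response containing an email address."""
    leaked = bool(re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", str(outputs)))
    return Feedback(
        value="no" if leaked else "yes",
        rationale="Email address found in output" if leaked else "No email addresses found",
    )

# Step 4: test it like any other scorer:
# mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn, scorers=[no_email_leak])
```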
Workflow 6: Unity Catalog Trace Ingestion & Production Monitoring
For storing traces in Unity Catalog, instrumenting applications, and enabling continuous production monitoring.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Link UC schema to experiment | (Patterns 1-2) |
| 2 | Set trace destination | (Patterns 3-4) |
| 3 | Instrument your application | (Patterns 5-8) |
| 4 | Configure trace sources (Apps/Serving/OTEL) | (Patterns 9-11) |
| 5 | Enable production monitoring | (Patterns 12-13) |
| 6 | Query and analyze UC traces | (Pattern 14) |
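For step 3, an instrumentation sketch; it assumes the experiment was already linked to a UC schema in steps 1-2, and `mlflow.openai.autolog()` stands in for whichever autolog flavor matches your LLM client:

```python
import mlflow

mlflow.set_experiment("/Shared/agent-prod")  # assumed already linked to a UC schema (steps 1-2)
mlflow.openai.autolog()  # auto-trace LLM calls; swap for your client's autolog flavor

@mlflow.trace(span_type="AGENT")
def handle_request(query: str) -> str:
    # Each invocation produces one trace routed to the configured destination
    ...
```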
Workflow 7: Judge Alignment with MemAlign
For aligning an LLM judge to match domain expert preferences. A well-aligned judge improves every downstream use: evaluation accuracy, production monitoring signal, and prompt optimization quality. This workflow is valuable on its own, independent of prompt optimization.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Design base judge with make_judge() (any feedback type) | (Pattern 1) |
| 2 | Run evaluate(), tag successful traces | (Pattern 2) |
| 3 | Build UC dataset + create SME labeling session | (Pattern 3) |
| 4 | Align judge with MemAlign after labeling completes | (Pattern 4) |
| 5 | Register aligned judge to experiment | (Pattern 5) |
| 6 | Re-evaluate with aligned judge (baseline) | (Pattern 6) |
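A compressed sketch of steps 1, 4, and 5. `make_judge()` and `align()` are real MLflow GenAI APIs; the `MemAlignOptimizer` import path, its `embedding_model` keyword, and the registration call are assumptions to verify against your MLflow version:

```python
import mlflow
from mlflow.genai.judges import make_judge

# Step 1: base judge -- its name MUST match the label schema name in the
# labeling session, or align() cannot pair scores (see Critical API Facts)
judge = make_judge(
    name="helpfulness",
    instructions="Evaluate whether {{ outputs }} fully answers {{ inputs }}.",
)

# Step 4: align on SME-labeled traces once the labeling session completes.
# MemAlignOptimizer's path and embedding_model kwarg are assumptions.
from mlflow.genai.judges.optimizers import MemAlignOptimizer

labeled_traces = mlflow.search_traces(run_id="<labeling-run-id>")  # placeholder run ID
aligned = judge.align(
    traces=labeled_traces,
    optimizer=MemAlignOptimizer(embedding_model="<embedding-endpoint>"),
)

# Step 5: register the aligned judge to the experiment (assumed API)
aligned.register(experiment_id="<experiment-id>")
```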
Workflow 8: Automated Prompt Optimization with GEPA
For automatically improving a registered system prompt using optimize_prompts(). It works with any scorer, but pairing it with an aligned judge (Workflow 7) gives the most domain-accurate signal. For the full end-to-end loop combining alignment and optimization, see user-journeys.md Journey 10.
| Step | Action | Reference Files |
|---|---|---|
| 1 | Build optimization dataset (inputs + expectations) | (Pattern 1) |
| 2 | Run optimize_prompts() with GEPA + scorer | (Pattern 2) |
| 3 | Register new version, promote conditionally | (Pattern 3) |
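A sketch of the loop; `optimize_prompts()` with GEPA ships in MLflow >= 3.5.0, but the exact parameter and result attribute names below are assumptions to check against the function's signature:

```python
import mlflow

# Step 1: optimization records need BOTH inputs AND expectations
train_data = [
    {
        "inputs": {"query": "How do I enable tracing?"},
        "expectations": {"expected_response": "Call the autolog API for your client..."},
    },
]

# Step 2: optimize the registered prompt with GEPA, scored by the aligned judge
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,                                   # from Workflow 1
    train_data=train_data,
    prompt_uris=["prompts:/catalog.schema.agent_prompt/1"],  # placeholder prompt URI
    scorers=[aligned_judge],                                 # aligned judge from Workflow 7
)

# Step 3: inspect the new version before promoting it (attribute name assumed)
print(result.optimized_prompts)
```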
Reference Files Quick Lookup
| Reference | Purpose | When to Read |
|---|---|---|
| GOTCHAS.md | Common mistakes | Always read first before writing code |
| CRITICAL-interfaces.md | API signatures, schemas | When writing any evaluation code |
| | Running evals, comparing | When executing evaluations |
| | Custom scorer creation | When built-in scorers aren't enough |
| | Dataset building | When preparing evaluation data |
| | Trace debugging | When analyzing agent behavior |
| | Token/latency fixes | When agent is slow or expensive |
| | UC trace setup, monitoring | When setting up trace storage or production monitoring |
| | MemAlign judge alignment, labeling sessions, SME feedback | When aligning judges to domain expert preferences |
| | GEPA optimization: build dataset, optimize_prompts(), promote | When running automated prompt improvement |
| user-journeys.md | High-level workflows, full domain-expert optimization loop | When starting a new evaluation project or running the full align + optimize cycle |
Critical API Facts
- Use: `mlflow.genai.evaluate()` (NOT `mlflow.evaluate()`)
- Data format: `{"inputs": {"query": "..."}}` (nested structure required)
- predict_fn: Receives **unpacked kwargs (not a dict)
- MemAlign: Scorer-agnostic (works with any `feedback_value_type` -- float, bool, categorical); token-heavy on the embedding model, so set `embedding_model` explicitly
- Label schema name matching: The label schema `name` in the labeling session MUST match the judge `name` used in `evaluate()` for `align()` to pair scores
- Aligned judge scores: May be lower than unaligned judge scores -- this is expected and means the judge is now more accurate, not that the agent regressed
- GEPA optimization dataset: Must have both `inputs` AND `expectations` per record (different from eval dataset)
- Episodic memory: Lazily loaded -- `get_scorer()` results won't show episodic memory on print until the judge is first used
- optimize_prompts: Requires MLflow >= 3.5.0

See GOTCHAS.md for the complete list.
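The first three facts in code form, using hypothetical names:

```python
# Correct: nested "inputs" whose keys match predict_fn's parameter names
data = [{"inputs": {"query": "What is MLflow?"}}]

def predict_fn(query: str) -> str:   # called as predict_fn(query=...)
    ...

# Wrong: flat records, or a predict_fn that expects a single dict
# data = [{"query": "What is MLflow?"}]
# def predict_fn(inputs: dict) -> str: ...
```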
Related Skills
- databricks-docs - General Databricks documentation reference
- databricks-model-serving - Deploying models and agents to serving endpoints
- databricks-agent-bricks - Building agents that can be evaluated with this skill
- databricks-python-sdk - SDK patterns used alongside MLflow APIs
- databricks-unity-catalog - Unity Catalog tables for managed evaluation datasets