# Creating a New Built-in Classification Evaluator
A built-in evaluator is a YAML config (source of truth) that gets compiled into Python and TypeScript code, wrapped in evaluator classes, benchmarked, and documented. The whole pipeline is linear — follow these steps in order.
## Step 0: Gather Requirements
Before writing anything, clarify with the user:
- What does this evaluator measure? Get a one-sentence description of the quality dimension.
- What input data is available? This determines the template placeholders (e.g., `{{input}}`, `{{output}}`, `{{reference}}`, `{{tool_definitions}}`). If the user is vague, ask follow-up questions — the placeholders are the contract between the evaluator and the caller.
- What labels make sense? Binary is most common (e.g., correct/incorrect, faithful/unfaithful), but some metrics use more. Labels map to scores.
- Should this appear in the dataset experiments UI? If yes, it needs the `promoted_dataset_evaluator` label. Currently only correctness, tool_selection, and tool_invocation have this; most new evaluators won't need it.
## Step 1: Create the YAML Config

Create `prompts/classification_evaluator_configs/{NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml`.

Read an existing config to match the current schema. Start with `CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml` for a simple example, or `TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml` if your evaluator needs structured span data.
### Key Decision Points

- `choices` — Maps label strings to numeric scores. For binary evaluators, use positive/negative labels (e.g., `correct: 1.0` / `incorrect: 0.0`). The labels you pick here flow through to the Python class, TS factory, and benchmarks.
- `optimization_direction` — Use `maximize` when the positive label is the desired outcome (most evaluators). Use `minimize` only if the metric measures something undesirable (e.g., hallucination). This affects how Phoenix displays the metric in the UI.
- `labels` — Optional list. Add `promoted_dataset_evaluator` only if this evaluator should appear in the dataset experiments UI sidebar.
- `substitutions` — Only needed if the evaluator is a `promoted_dataset_evaluator` and works with structured span data (tool definitions, tool calls, message arrays). These reference formatter snippets defined in `prompts/formatters/server.yaml`. Read that file if you need substitutions — it defines what structured data formats are available. Most evaluators that use only simple text fields (input, output, reference) don't need substitutions.
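For orientation, here is a rough skeleton of a config. Only `choices`, `optimization_direction`, `labels`, and `substitutions` are field names confirmed above; the `template` key and the surrounding structure are illustrative assumptions, so read an existing config for the real schema.

```yaml
# Illustrative sketch only; copy the actual schema from an existing config
# such as CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml.
choices:
  helpful: 1.0      # positive label maps to score 1.0
  unhelpful: 0.0    # negative label maps to score 0.0
optimization_direction: maximize  # the positive label is the desired outcome
labels: []                        # add promoted_dataset_evaluator only for the experiments UI
template: |                       # assumed key name for the judge prompt
  ...
```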
### Prompt Writing Tips
- Be explicit about what makes each label correct — the LLM judge needs a clear rubric
- Separate concerns: if evaluating X, explicitly state you're NOT evaluating Y
- Wrap inputs in XML-style tags (e.g., `<context>`, `<output>`) for clear data formatting
- Tell the judge to reason before deciding — this improves accuracy
- Use `{{placeholder}}` (Mustache syntax) for template variables
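Applying these tips, a judge prompt body might look like the following. The metric, tags, and labels here are invented for illustration:

```text
You are judging whether the output is faithful to the provided context.
Evaluate ONLY faithfulness; do NOT judge style, tone, or completeness.

<context>
{{reference}}
</context>

<output>
{{output}}
</output>

First, explain your reasoning step by step. Then answer with exactly one
label: "faithful" or "unfaithful".
```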
## Step 2: Compile Prompts

```bash
make codegen-prompts
```
This generates code in three places:
- `packages/phoenix-evals/src/phoenix/evals/__generated__/classification_evaluator_configs/` (Python)
- `src/phoenix/__generated__/classification_evaluator_configs/` (Python, server copy)
- `js/packages/phoenix-evals/src/__generated__/default_templates/` (TypeScript)
Verify the generated files look correct before moving on.
## Step 3: Create the Python Evaluator

Create `packages/phoenix-evals/src/phoenix/evals/metrics/{name}.py`.

Read `correctness.py` in that directory — it's the canonical example. Your evaluator follows the same pattern: subclass `ClassificationEvaluator`, pull constants from the generated config, and define a Pydantic input schema with fields matching your template placeholders.
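As a rough sketch of that shape (the import paths, generated-config constant names, and constructor arguments below are assumptions; copy the real ones from `correctness.py`):

```python
# Hypothetical sketch; mirror correctness.py for the real imports and names.
from pydantic import BaseModel

# Assumed import locations:
from phoenix.evals.evaluators import ClassificationEvaluator
from phoenix.evals.__generated__.classification_evaluator_configs import (
    helpfulness_config,  # assumed name of the generated config object
)


class HelpfulnessInput(BaseModel):
    """Fields must match the template placeholders from Step 1."""

    input: str
    output: str


class HelpfulnessEvaluator(ClassificationEvaluator):
    """Wraps the generated config in the shared classification machinery."""

    def __init__(self, model):
        super().__init__(
            model=model,
            prompt_template=helpfulness_config.template,  # assumed attribute
            choices=helpfulness_config.choices,           # assumed attribute
        )
```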
After creating the file, add it to the exports in `metrics/__init__.py` — both the import and the `__all__` list. Read the current `__init__.py` to see the existing pattern.
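The export update is typically just two touch points (names continue the hypothetical example above):

```python
# metrics/__init__.py (illustrative)
from .helpfulness import HelpfulnessEvaluator

__all__ = [
    # ...existing exports...
    "HelpfulnessEvaluator",
]
```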
## Step 4: Create the TypeScript Evaluator

Create `js/packages/phoenix-evals/src/llm/create{Name}Evaluator.ts`.

Read `createCorrectnessEvaluator.ts` — it's the canonical example. The pattern is a factory function that wraps `createClassificationEvaluator` with defaults from the generated config.
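A sketch under the same caveat: the import paths, generated-template export, and option names are assumptions to be replaced with what `createCorrectnessEvaluator.ts` actually uses.

```typescript
// Hypothetical sketch; mirror createCorrectnessEvaluator.ts for real names.
import { createClassificationEvaluator } from "./createClassificationEvaluator"; // assumed path
import { HELPFULNESS_TEMPLATE } from "../__generated__/default_templates"; // assumed export

// Factory that applies the generated defaults for this metric.
export function createHelpfulnessEvaluator(options: { model: unknown }) {
  return createClassificationEvaluator({
    ...options,
    name: "helpfulness",
    promptTemplate: HELPFULNESS_TEMPLATE, // assumed option name
    choices: { helpful: 1.0, unhelpful: 0.0 },
  });
}
```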
Then:
- Add the export to `js/packages/phoenix-evals/src/llm/index.ts`
- Add a vitest test — read `createFaithfulnessEvaluator.test.ts` for the test pattern
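A minimal vitest shape, with placeholder assertions; the real pattern to follow is in `createFaithfulnessEvaluator.test.ts`:

```typescript
// Hypothetical test sketch; follow createFaithfulnessEvaluator.test.ts instead.
import { describe, expect, it } from "vitest";
import { createHelpfulnessEvaluator } from "./createHelpfulnessEvaluator";

describe("createHelpfulnessEvaluator", () => {
  it("creates an evaluator from the generated defaults", () => {
    const evaluator = createHelpfulnessEvaluator({ model: {} });
    expect(evaluator).toBeDefined();
  });
});
```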
## Step 5: Build JS

```bash
cd js && pnpm build
```
Fix any TypeScript errors before proceeding.
## Step 6: Write the Benchmark

Create `js/benchmarks/evals-benchmarks/src/{name}_benchmark.ts`.

Read existing benchmarks in that directory to match the current patterns:

- `tool_invocation_benchmark.ts` — confusion matrix printing, multi-category analysis
### Benchmark Requirements
- 30-50 synthetic examples organized by category
- 2-4 examples per category covering: success cases, failure modes, and edge cases
- Accuracy evaluator that compares predicted vs expected labels
- Failed examples printer — this is critical for debugging. For each misclassified example, print: category, input, output (truncated), expected vs actual label, and the LLM judge's explanation
- Per-category accuracy breakdown in the output
- For binary evaluators, a confusion matrix is helpful
The task function must return `input` and `output` text in its result so the failed examples printer has access to them.
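For illustration, the example records and task result might be shaped like this (all names are invented; copy the actual structure from the existing benchmarks):

```typescript
// Hypothetical shapes; match the existing benchmarks in this directory.
interface BenchmarkExample {
  category: string;      // e.g. "success", "failure-mode", "edge-case"
  input: string;
  output: string;
  expectedLabel: string; // compared against the judge's label for accuracy
}

interface JudgeResult {
  label: string;
  explanation: string;
}

// `judge` stands in for the evaluator created in Step 4.
async function runTask(
  example: BenchmarkExample,
  judge: (e: BenchmarkExample) => Promise<JudgeResult>,
) {
  const result = await judge(example);
  return {
    // Echo input/output so the failed-examples printer can display them.
    input: example.input,
    output: example.output,
    label: result.label,
    explanation: result.explanation,
  };
}
```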
Consider using a separate agent session for synthetic dataset generation if the examples need realistic domain-specific content — this keeps the dataset creation focused and avoids context-switching.
## Step 7: Run the Benchmark

```bash
# Terminal 1: Start Phoenix
PHOENIX_WORKING_DIR=/tmp/phoenix-test phoenix serve

# Terminal 2: Run the benchmark
cd js/benchmarks/evals-benchmarks
pnpm tsx src/{name}_benchmark.ts
```
Target >80% accuracy. If accuracy is low, look at the failed examples output to decide whether to adjust the prompt (Step 1) or the benchmark examples (Step 6). Iterate until accuracy is acceptable.
## Step 8: Create Documentation

Create `docs/phoenix/evaluation/pre-built-metrics/{name}.mdx`.

Read `faithfulness.mdx` in that directory — it's the template. Follow the same section structure:
- Overview — when to use, what it measures
- Supported Levels — span/trace/session, relevant span kinds
- Input Requirements — required fields table
- Output Interpretation — labels, scores, direction
- Usage Examples — Python and TypeScript in tabs
- Using Input Mapping — lambda example if applicable
- Viewing/Modifying the Prompt — link to GitHub config, custom prompt usage
- Configuration — link to LLM config docs
- Using with Phoenix — links to traces and experiments docs
- Benchmarks — "Coming soon" placeholder (until benchmark results are published)
- API Reference — links to Python and TypeScript API docs
- Related — links to related evaluators
### Navigation Updates
After creating the docs page, update these three files:
- `docs.json` — add the page to the Evaluation > Pre-built Metrics nav group
- `docs/phoenix/evaluation/pre-built-metrics.mdx` — add a card to the landing page grid
- `docs/phoenix/sitemap.xml` — add the new URL
Read each file to see the existing pattern before editing.
## Checklist
Before calling it done, verify:
- YAML config created with clear rubric and appropriate labels/choices
- `make codegen-prompts` ran successfully
- Python evaluator class with input schema matching template placeholders
- Python exports updated in `metrics/__init__.py`
- TypeScript evaluator factory with types
- TypeScript export added to `llm/index.ts`
- Vitest test for TypeScript evaluator
- JS packages rebuilt (`cd js && pnpm build`)
- Benchmark with 30-50 examples, category breakdown, failed examples printer
- Benchmark accuracy >80%
- Documentation page following the template structure
- `docs.json` nav updated
- Landing page card added
- Sitemap updated
## Retrospection
After completing the workflow, verify these instructions matched reality:
- Did any file paths, export patterns, or command names change from what's described here?
- Did the YAML config schema gain or lose fields since this was written?
- Did the benchmark or docs patterns evolve from the referenced examples?
- Did `make codegen-prompts` generate to different locations?
If anything drifted, update this SKILL.md before finishing so the next person (or agent) doesn't hit the same surprises.