# claude-skill-registry · azure-ai-evaluation-py

## Install

Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code: install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/azure-ai-evaluation-py" ~/.claude/skills/majiayu000-claude-skill-registry-azure-ai-evaluation-py && rm -rf "$T"
```

Manifest: `skills/data/azure-ai-evaluation-py/SKILL.md`
# Azure AI Evaluation SDK for Python

Assess generative AI application performance with built-in and custom evaluators.
## Installation

```bash
pip install azure-ai-evaluation

# With remote evaluation support
pip install azure-ai-evaluation[remote]
```
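To verify the install, you can print the package version (assuming the package exports `__version__`, as Azure SDK libraries typically do):

```python
import azure.ai.evaluation

print(azure.ai.evaluation.__version__)
```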
## Environment Variables

```bash
# For AI-assisted evaluators
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini

# For Foundry project integration
AIPROJECT_CONNECTION_STRING=<your-connection-string>
```
## Built-in Evaluators

### Quality Evaluators (AI-Assisted)

```python
import os

from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
    RetrievalEvaluator,
)

# Initialize with Azure OpenAI model config
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}

groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)
```
### Quality Evaluators (NLP-based)

```python
from azure.ai.evaluation import (
    F1ScoreEvaluator,
    RougeScoreEvaluator,
    RougeType,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator,
)

f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)  # rouge_type is required
bleu = BleuScoreEvaluator()
```
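NLP evaluators score a response against a ground-truth answer; a minimal call (strings illustrative):

```python
result = f1(
    response="Azure AI provides AI services and tools.",
    ground_truth="Azure AI is Microsoft's platform of AI services and tools.",
)
print(result["f1_score"])  # float between 0 and 1
```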
### Safety Evaluators

```python
from azure.ai.evaluation import (
    ViolenceEvaluator,
    SexualEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    IndirectAttackEvaluator,
    ProtectedMaterialEvaluator,
)
from azure.identity import DefaultAzureCredential

# Safety evaluators target an Azure AI project rather than a model config
project_scope = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}
credential = DefaultAzureCredential()

violence = ViolenceEvaluator(azure_ai_project=project_scope, credential=credential)
sexual = SexualEvaluator(azure_ai_project=project_scope, credential=credential)
```
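A single safety evaluation returns a severity label, a 0-7 score, and the reasoning behind it:

```python
result = violence(
    query="What is Azure AI?",
    response="Azure AI provides AI services and tools.",
)
print(result["violence"])         # severity label, e.g. "Very low"
print(result["violence_score"])   # 0-7 severity score
print(result["violence_reason"])  # model's reasoning
```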
## Single Row Evaluation

```python
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config)

result = groundedness(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools.",
)

print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")
```
## Batch Evaluation with evaluate()

```python
from azure.ai.evaluation import evaluate

result = evaluate(
    data="test_data.jsonl",
    evaluators={
        "groundedness": groundedness,
        "relevance": relevance,
        "coherence": coherence,
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}",
            }
        }
    },
)

print(result["metrics"])
```
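The data file is JSON Lines, one record per line, with fields matching the column mapping above; an illustrative row:

```jsonl
{"query": "What is Azure AI?", "context": "Azure AI is Microsoft's AI platform...", "response": "Azure AI provides AI services and tools."}
```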
## Composite Evaluators

```python
from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator

# All quality metrics in one
qa_evaluator = QAEvaluator(model_config)

# All safety metrics in one
safety_evaluator = ContentSafetyEvaluator(azure_ai_project=project_scope, credential=credential)

result = evaluate(
    data="data.jsonl",
    evaluators={
        "qa": qa_evaluator,
        "content_safety": safety_evaluator,
    },
)
```
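Composite evaluators also work row by row; a minimal sketch of a QAEvaluator call (it expects a ground truth because it bundles similarity and F1):

```python
result = qa_evaluator(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools.",
    ground_truth="Azure AI is Microsoft's platform of AI services and tools.",
)
# One dict holding each quality metric, e.g. result["groundedness"], result["f1_score"]
```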
## Evaluate Application Target

```python
from azure.ai.evaluation import evaluate
from my_app import chat_app  # Your application

result = evaluate(
    data="queries.jsonl",
    target=chat_app,  # Callable that takes a query, returns a response
    evaluators={
        "groundedness": groundedness,
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}",
            }
        }
    },
)
```
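A target is any callable that accepts the data columns as keyword arguments and returns a dict; the returned keys surface as `${outputs.*}` columns. A minimal sketch (the retrieval and generation helpers are hypothetical):

```python
def chat_app(*, query: str) -> dict:
    # Produce whatever the evaluators need; keys become ${outputs.*} columns
    context = retrieve_documents(query)          # hypothetical retrieval step
    response = generate_answer(query, context)   # hypothetical generation step
    return {"context": context, "response": response}
```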
## Custom Evaluators

### Code-Based

Any callable that returns a dict of metrics can serve as an evaluator; no decorator is required:

```python
from azure.ai.evaluation import evaluate

def word_count_evaluator(*, response: str, **kwargs) -> dict:
    return {"word_count": len(response.split())}

# Use in evaluate()
result = evaluate(
    data="data.jsonl",
    evaluators={"word_count": word_count_evaluator},
)
```
### Prompt-Based

A prompt-based evaluator is a class whose `__call__` sends a grading prompt to a model; a minimal sketch using the `openai` package's `AzureOpenAI` client:

```python
from openai import AzureOpenAI

class CustomEvaluator:
    def __init__(self, model_config: dict):
        self._client = AzureOpenAI(
            azure_endpoint=model_config["azure_endpoint"],
            api_key=model_config["api_key"],
            api_version="2024-06-01",
        )
        self._deployment = model_config["azure_deployment"]

    def __call__(self, *, query: str, response: str) -> dict:
        prompt = (
            "Rate this response 1-5. Reply with only the number.\n"
            f"Query: {query}\nResponse: {response}"
        )
        completion = self._client.chat.completions.create(
            model=self._deployment,
            messages=[{"role": "user", "content": prompt}],
        )
        return {"custom_score": int(completion.choices[0].message.content.strip())}
```
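Used like any other evaluator:

```python
custom = CustomEvaluator(model_config)
print(custom(query="What is Azure AI?", response="Azure AI provides AI services and tools."))
# {"custom_score": 4}  (illustrative)
```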
## Log to Foundry Project

```python
import os

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
    credential=DefaultAzureCredential(),
)

result = evaluate(
    data="data.jsonl",
    evaluators={"groundedness": groundedness},
    azure_ai_project=project.scope,  # Logs results to Foundry
)

print(f"View results: {result['studio_url']}")
```
## Evaluator Reference

| Evaluator | Type | Metrics |
|---|---|---|
| GroundednessEvaluator | AI | groundedness (1-5) |
| RelevanceEvaluator | AI | relevance (1-5) |
| CoherenceEvaluator | AI | coherence (1-5) |
| FluencyEvaluator | AI | fluency (1-5) |
| SimilarityEvaluator | AI | similarity (1-5) |
| RetrievalEvaluator | AI | retrieval (1-5) |
| F1ScoreEvaluator | NLP | f1_score (0-1) |
| RougeScoreEvaluator | NLP | rouge scores |
| ViolenceEvaluator | Safety | violence (0-7) |
| SexualEvaluator | Safety | sexual (0-7) |
| SelfHarmEvaluator | Safety | self_harm (0-7) |
| HateUnfairnessEvaluator | Safety | hate_unfairness (0-7) |
| QAEvaluator | Composite | All quality metrics |
| ContentSafetyEvaluator | Composite | All safety metrics |
## Best Practices
- Use composite evaluators for comprehensive assessment
- Map columns correctly — mismatched columns cause silent failures (see the mapping sketch after this list)
- Log to Foundry for tracking and comparison across runs
- Create custom evaluators for domain-specific metrics
- Use NLP evaluators when you have ground truth answers
- Safety evaluators require Azure AI project scope
- Batch evaluation is more efficient than single-row loops
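For example, an NLP evaluator needs the `ground_truth` column mapped explicitly; a minimal sketch, assuming the data file carries a `ground_truth` field:

```python
from azure.ai.evaluation import F1ScoreEvaluator, evaluate

result = evaluate(
    data="data.jsonl",
    evaluators={"f1": F1ScoreEvaluator()},
    evaluator_config={
        "f1": {
            "column_mapping": {
                "response": "${data.response}",
                "ground_truth": "${data.ground_truth}",
            }
        }
    },
)
```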
## Reference Files
| File | Contents |
|---|---|
| references/built-in-evaluators.md | Detailed patterns for AI-assisted, NLP-based, and Safety evaluators with configuration tables |
| references/custom-evaluators.md | Creating code-based and prompt-based custom evaluators, testing patterns |
| scripts/run_batch_evaluation.py | CLI tool for running batch evaluations with quality, safety, and custom evaluators |