# Agent Evaluation Testing

Develop and run agent behavior evaluations. Use this skill when asked to "write evals", "test agent behavior", "create eval cases", "run evals", "add eval tests", "test tool selection", "verify agent responses", or when developing tests for agents. Covers YAML eval case creation, assertion types, mock configuration, multi-model matrix testing, and LLM-as-judge scoring.

Install from [claude-skill-registry](https://github.com/majiayu000/claude-skill-registry):

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/eval-testing" ~/.claude/skills/majiayu000-claude-skill-registry-eval-testing && rm -rf "$T"
```

Source: `skills/data/eval-testing/SKILL.md`
System for testing multi-agent behavior consistency across prompts, tools, skills, models, and agent configs.
## Quick Reference - Commands
```bash
# Run all evals with default model
npm run eval

# Run with fast model (Haiku)
npm run eval:fast

# Run with all models (Sonnet, Opus, Haiku)
npm run eval:full

# CI mode (exit 1 on failure)
npm run eval:ci

# Filter by type
npm run eval -- --type tool_selection
npm run eval -- --type response_quality
npm run eval -- --type skill_invocation
npm run eval -- --type multi_step_workflow

# Filter by agent
npm run eval -- --agent pm-assistant
npm run eval -- --agent communicator

# Filter by pattern
npm run eval -- --pattern "jira-*"

# Run via Vitest
npm run test:eval
```
## Directory Structure
```
evals/
├── config/
│   └── models.yaml          # Model matrix definitions
├── schemas/
│   └── eval-schema.yaml     # JSON Schema for validation
├── tool-selection/          # Tool selection evals
├── response-quality/        # Response quality evals
├── skill-invocation/        # Skill activation evals
└── multi-step/              # Workflow evals
```
## Eval Types
| Type | Purpose | Key Assertions |
|---|---|---|
| `tool_selection` | Verify correct tools are called | `tool_calls.required`, `tool_calls.forbidden` |
| `response_quality` | Check response content | `response_mentions`, `response_matches`, LLM-as-judge |
| `skill_invocation` | Test skill activation | `skills.activated` |
| `multi_step_workflow` | Multi-step sequences | `workflow_completed` |
## YAML Eval Case Schema
```yaml
name: unique-eval-name
description: Human-readable description
type: tool_selection # tool_selection | response_quality | skill_invocation | multi_step_workflow
agent: pm-assistant # Agent ID to test

# Optional context
context:
  platform: slack # slack | whatsapp | opencode | cursor

# User input
input:
  prompt: 'Check for blocked tickets'
  conversationHistory: # Optional prior messages
    - role: user
      content: 'Previous message'

# Mock external service responses
mocks:
  jira:
    ai_first_get_blockers:
      response:
        count: 2
        issues: [...]
      error: null # Optional error to simulate
      delay: 100 # Optional delay in ms
  slack:
    ai_first_slack_send_message:
      response:
        success: true
        ts: '1705670400.000001'

# Expected behavior
expect:
  tool_calls:
    required:
      - name: ai_first_get_blockers
        arguments: # Optional partial match
          status: 'Blocked'
    forbidden:
      - ai_first_get_all_issues
    order: strict # strict | any
  skills:
    activated:
      - jira-management
    content_used:
      - pattern: 'blocker'
  workflow: # For multi_step_workflow type
    steps:
      - name: check_blockers
        tools: [ai_first_get_blockers]
      - name: notify_slack
        depends_on: check_blockers
        tools: [ai_first_slack_send_message]
  assertions:
    - type: response_mentions
      values: ['blocked', 'PROJ-123']
    - type: response_matches
      pattern: 'blocked|waiting'

# LLM-as-judge scoring (optional)
scoring:
  llm_judge:
    enabled: true
    criteria:
      - name: accuracy
        description: 'Correctly identifies blockers'
        weight: 0.5
      - name: clarity
        description: 'Clear and concise response'
        weight: 0.5
    threshold: 0.7
    rubric: |
      Score 1.0: Excellent - all blockers identified, clear summary
      Score 0.7: Good - most blockers found, minor issues
      Score 0.4: Needs work - incomplete or unclear
      Score 0.0: Poor - wrong information
```
## Assertion Types
| Type | Purpose | Required Fields |
|---|---|---|
| `tool_called` | Verify tool was invoked | `tool` |
| `tool_not_called` | Verify tool was NOT invoked | `tool` |
| `tool_arguments` | Check tool arguments | `tool`, `arguments` |
| `skill_activated` | Verify skill loaded | `skill` |
| `response_mentions` | Check response contains values | `values` |
| `response_matches` | Regex match on response | `pattern` |
| `workflow_completed` | Multi-step verification | `steps` |
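Taken together, an `expect.assertions` block might combine several of these. A minimal sketch follows; note that `tool_not_called`, `tool_arguments`, and `skill_activated` are names inferred from the purposes above rather than shown in this document's examples, so verify them against `evals/schemas/eval-schema.yaml`:

```yaml
expect:
  assertions:
    - type: tool_called              # confirmed by the examples below
      tool: ai_first_get_blockers
    - type: tool_not_called          # inferred name; check eval-schema.yaml
      tool: ai_first_get_all_issues
    - type: response_mentions
      values: ['blocked']
    - type: response_matches
      pattern: 'block|waiting'
    - type: workflow_completed
      steps: [check_blockers, notify_slack]
```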
## Mock Services

Available mock services: `jira`, `slack`, `google`, `whatsapp`
### Common Tool Mocks
JIRA:

- `ai_first_get_blockers`
- `ai_first_get_in_progress`
- `ai_first_get_all_issues`
- `ai_first_get_weekly_summary`
- `ai_first_jira_create_issue`

Slack:

- `ai_first_slack_send_message`
- `ai_first_slack_send_dm`
- `ai_first_slack_lookup_user_by_email`

Google Slides:

- `ai_first_slides_get_presentation`
- `ai_first_slides_duplicate_template`
- `ai_first_slides_update_slide_text`

WhatsApp:

- `ai_first_whatsapp_search_messages`
- `ai_first_whatsapp_get_chat_history`
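Each mocked tool takes the `response` / `error` / `delay` shape from the schema above. As a sketch, simulating a slow, failing Slack send might look like this (the error string is hypothetical; check `eval-schema.yaml` for accepted values):

```yaml
mocks:
  slack:
    ai_first_slack_send_message:
      response: null
      error: 'channel_not_found'  # hypothetical error value to simulate a failed send
      delay: 250                  # respond after 250 ms
```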
## Examples by Type

### Tool Selection Eval
```yaml
name: jira-blockers-detection
description: Agent should use blockers tool when asked about blocked tickets
type: tool_selection
agent: pm-assistant
input:
  prompt: 'Are there any blocked tickets?'
mocks:
  jira:
    ai_first_get_blockers:
      response:
        count: 2
        issues:
          - key: PROJ-190
            summary: 'Waiting for API access'
            status: 'Blocked'
            blockedDays: 5
expect:
  tool_calls:
    required:
      - name: ai_first_get_blockers
    forbidden:
      - ai_first_get_all_issues
  assertions:
    - type: response_mentions
      values: ['PROJ-190', 'blocked']
```
### Response Quality Eval
```yaml
name: communicator-slack-format
description: Communicator should use Slack mrkdwn correctly
type: response_quality
agent: communicator
context:
  platform: slack
input:
  prompt: 'Format a standup: Yesterday I finished PROJ-150, today PROJ-151'
expect:
  assertions:
    - type: response_matches
      pattern: 'yesterday|today'
scoring:
  llm_judge:
    enabled: true
    criteria:
      - name: slack_formatting
        description: 'Uses *single asterisks* for bold, not **double**'
        weight: 0.5
      - name: structure
        description: 'Clear Yesterday/Today/Blockers format'
        weight: 0.5
    threshold: 0.7
```
### Multi-Step Workflow Eval
```yaml
name: weekly-report-workflow
description: Agent should gather data and update slides
type: multi_step_workflow
agent: pm-assistant
input:
  prompt: 'Update the weekly presentation with the latest sprint data'
mocks:
  jira:
    ai_first_get_weekly_summary:
      response:
        sprint: 'Sprint 12'
        velocity: 42
        completedStories: 8
  google:
    ai_first_slides_duplicate_template:
      response:
        slideId: 'slide_123'
    ai_first_slides_update_slide_text:
      response:
        success: true
expect:
  workflow:
    steps:
      - name: gather_data
        tools: [ai_first_get_weekly_summary]
      - name: create_slide
        depends_on: gather_data
        tools: [ai_first_slides_duplicate_template]
      - name: update_content
        depends_on: create_slide
        tools: [ai_first_slides_update_slide_text]
  assertions:
    - type: workflow_completed
      steps: [gather_data, create_slide, update_content]
```
## Model Matrix Configuration

Edit `evals/config/models.yaml`:
```yaml
models:
  default:
    - anthropic/claude-sonnet-4-20250514
  full_matrix:
    - anthropic/claude-sonnet-4-20250514
    - anthropic/claude-opus-4-20250514
    - anthropic/claude-haiku-3-5-20241022
  fast:
    - anthropic/claude-haiku-3-5-20241022
```
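Judging by the script names and comments in the quick reference, these groups presumably map to the npm scripts as follows (an assumption, not confirmed by this document):

```bash
npm run eval        # uses the `default` group
npm run eval:fast   # uses the `fast` group (Haiku only)
npm run eval:full   # uses the `full_matrix` group (Sonnet, Opus, Haiku)
```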
## LLM-as-Judge Setup

Requires the `ANTHROPIC_API_KEY` environment variable set in `.env`. The CLI automatically loads dotenv, so ensure your API key is configured:
```bash
# In .env file
ANTHROPIC_API_KEY=sk-ant-api03-...
```
The judge uses Claude to evaluate response quality against defined criteria.
Criteria weights must sum to 1.0.
```yaml
scoring:
  llm_judge:
    enabled: true
    criteria:
      - name: accuracy
        description: 'Information is correct'
        weight: 0.4
      - name: completeness
        description: 'All requested info included'
        weight: 0.3
      - name: clarity
        description: 'Easy to understand'
        weight: 0.3
    threshold: 0.7 # Minimum score to pass
```
## Creating New Evals

1. Choose the appropriate type based on what you're testing
2. Create a YAML file in the correct subdirectory (`evals/<type>/`)
3. Define mocks for any external services
4. Add assertions for expected behavior
5. Optionally add LLM-as-judge for quality scoring
6. Run with `npm run eval -- --pattern "your-eval-name"` (see the sketch below)
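Putting the steps together, a minimal tool-selection eval might look like the following sketch (the file path, eval name, and prompt are illustrative; the tool names come from the mock list above):

```yaml
# evals/tool-selection/in-progress-check.yaml (illustrative path)
name: in-progress-check
description: Agent should use the in-progress tool when asked about active work
type: tool_selection
agent: pm-assistant
input:
  prompt: 'What is the team working on right now?'
mocks:
  jira:
    ai_first_get_in_progress:
      response:
        count: 1
expect:
  tool_calls:
    required:
      - name: ai_first_get_in_progress
  assertions:
    - type: response_matches
      pattern: 'progress|working|active'
```

Then run it with `npm run eval -- --pattern "in-progress-check"`.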
## Debugging Failed Evals

Check the JSON output in `eval-results/` for:

- `executionTrace.toolCalls`: what tools were actually called
- `executionTrace.skillActivations`: which skills loaded
- `executionTrace.responseText`: full response text
- `assertions`: which assertions failed and why
- `judgeScore.criteria`: per-criterion scores with reasoning
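The exact result shape isn't documented here, but based on the field names above, a failing entry might look roughly like this sketch:

```json
{
  "name": "jira-blockers-detection",
  "passed": false,
  "executionTrace": {
    "toolCalls": [{ "name": "ai_first_get_all_issues", "arguments": {} }],
    "skillActivations": ["jira-management"],
    "responseText": "Here are all the current issues..."
  },
  "assertions": [
    { "type": "response_matches", "pattern": "block|stuck|waiting", "passed": false }
  ],
  "judgeScore": {
    "criteria": [{ "name": "accuracy", "score": 0.4, "reasoning": "..." }]
  }
}
```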
## Key Files

The framework's source modules cover:

- Type definitions
- Main runner
- Assertion logic
- LLM-as-judge
- Mock service registry
- CLI interface
- OpenCode API client for agent invocation
## How Tool Tracking Works

Tool calls are extracted from the OpenCode session history, not the immediate response. The flow is:

1. Agent receives the prompt via `POST /chat`
2. OpenCode returns a response with `step-start`, `reasoning`, `text`, and `step-finish` parts
3. Tool calls appear only in session history (not in the immediate response)
4. After the response completes, the eval runner fetches `GET /session/{id}/message`
5. Tool parts have `type: "tool"` with the tool name in the `tool` field
6. Tool names are prefixed with the MCP server name (e.g., `orienter_ai_first_get_blockers`)
7. The prefix is stripped to get the canonical tool name (`ai_first_get_blockers`)
This is why mocks in eval YAML files don't directly return data to the agent - the agent calls real APIs through OpenCode, and the eval system verifies which tools were called.
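Concretely, one tool part in the fetched session history would look something like this sketch (the overall shape is assumed; the `type` and `tool` fields and the prefix behavior are from steps 5-7 above):

```yaml
# One part from GET /session/{id}/message (assumed shape)
type: 'tool'
tool: 'orienter_ai_first_get_blockers'  # "orienter_" MCP prefix stripped -> ai_first_get_blockers
```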
## Best Practices for Assertions

### Test Behavior, Not Mock Data

Since agents call real APIs (not mocks), assertions should test behavior patterns rather than specific mock values:
```yaml
# BAD - Tests specific mock IDs that won't exist in real API
assertions:
  - type: response_mentions
    values: ["PROJ-123", "PROJ-124"]

# GOOD - Tests that agent discusses the right concepts
assertions:
  - type: response_matches
    pattern: "block|stuck|impediment|waiting"
```
### Use Flexible Regex Patterns

Match word stems to catch variations:
```yaml
# Matches: "completed", "complete", "completion", "completing"
pattern: "complet|finish|done"

# Matches: "notification", "notified", "notify", "notifying"
pattern: "notif|sent|posted|messag"
```
### One Behavior Per Assertion

Keep assertions focused on single behaviors for clearer failure diagnostics:
```yaml
assertions:
  # Tests blocker detection
  - type: response_matches
    pattern: 'block|stuck|waiting'

  # Tests notification action (separate assertion)
  - type: response_matches
    pattern: 'sent|posted|notif'
```
### Use `tool_calls.required` Over Assertions

For tool selection tests, prefer the structured `tool_calls` section:
```yaml
# GOOD - Clear, structured
expect:
  tool_calls:
    required:
      - name: ai_first_get_blockers

# Less preferred - assertion-based
expect:
  assertions:
    - type: tool_called
      tool: ai_first_get_blockers
```