Claude-skill-registry eval-testing

Develop and run agent behavior evaluations. Use this skill when asked to "write evals", "test agent behavior", "create eval cases", "run evals", "add eval tests", "test tool selection", "verify agent responses", or when developing tests for agents. Covers YAML eval case creation, assertion types, mock configuration, multi-model matrix testing, and LLM-as-judge scoring.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/eval-testing" ~/.claude/skills/majiayu000-claude-skill-registry-eval-testing && rm -rf "$T"

manifest: skills/data/eval-testing/SKILL.md

source content

Agent Evaluation Testing

System for testing multi-agent behavior consistency across prompts, tools, skills, models, and agent configs.

Quick Reference - Commands

# Run all evals with default model
npm run eval

# Run with fast model (Haiku)
npm run eval:fast

# Run with all models (Sonnet, Opus, Haiku)
npm run eval:full

# CI mode (exit 1 on failure)
npm run eval:ci

# Filter by type
npm run eval -- --type tool_selection
npm run eval -- --type response_quality
npm run eval -- --type skill_invocation
npm run eval -- --type multi_step_workflow

# Filter by agent
npm run eval -- --agent pm-assistant
npm run eval -- --agent communicator

# Filter by pattern
npm run eval -- --pattern "jira-*"

# Run via Vitest
npm run test:eval

Directory Structure

evals/
├── config/
│   └── models.yaml           # Model matrix definitions
├── schemas/
│   └── eval-schema.yaml      # JSON Schema for validation
├── tool-selection/           # Tool selection evals
├── response-quality/         # Response quality evals
├── skill-invocation/         # Skill activation evals
└── multi-step/               # Workflow evals

Eval Types

Type	Purpose	Key Assertions
`tool_selection`	Verify correct tools are called	`tool_called`, `tool_not_called`
`response_quality`	Check response content	`response_mentions`, `response_matches`, LLM-as-judge
`skill_invocation`	Test skill activation	`skill_activated`
`multi_step_workflow`	Multi-step sequences	`workflow_completed`

YAML Eval Case Schema

name: unique-eval-name
description: Human-readable description
type: tool_selection # tool_selection | response_quality | skill_invocation | multi_step_workflow
agent: pm-assistant # Agent ID to test

# Optional context
context:
  platform: slack # slack | whatsapp | opencode | cursor

# User input
input:
  prompt: 'Check for blocked tickets'
  conversationHistory: # Optional prior messages
    - role: user
      content: 'Previous message'

# Mock external service responses
mocks:
  jira:
    ai_first_get_blockers:
      response:
        count: 2
        issues: [...]
      error: null # Optional error to simulate
      delay: 100 # Optional delay in ms
  slack:
    ai_first_slack_send_message:
      response:
        success: true
        ts: '1705670400.000001'

# Expected behavior
expect:
  tool_calls:
    required:
      - name: ai_first_get_blockers
        arguments: # Optional partial match
          status: 'Blocked'
    forbidden:
      - ai_first_get_all_issues
    order: strict # strict | any

  skills:
    activated:
      - jira-management
    content_used:
      - pattern: 'blocker'

  workflow: # For multi_step_workflow type
    steps:
      - name: check_blockers
        tools: [ai_first_get_blockers]
      - name: notify_slack
        depends_on: check_blockers
        tools: [ai_first_slack_send_message]

  assertions:
    - type: response_mentions
      values: ['blocked', 'PROJ-123']
    - type: response_matches
      pattern: 'blocked|waiting'

# LLM-as-judge scoring (optional)
scoring:
  llm_judge:
    enabled: true
    criteria:
      - name: accuracy
        description: 'Correctly identifies blockers'
        weight: 0.5
      - name: clarity
        description: 'Clear and concise response'
        weight: 0.5
    threshold: 0.7
    rubric: |
      Score 1.0: Excellent - all blockers identified, clear summary
      Score 0.7: Good - most blockers found, minor issues
      Score 0.4: Needs work - incomplete or unclear
      Score 0.0: Poor - wrong information

Assertion Types

Type	Purpose	Required Fields
`tool_called`	Verify tool was invoked	`tool`
`tool_not_called`	Verify tool was NOT invoked	`tool`
`tool_arguments`	Check tool arguments	`tool`, `values`
`skill_activated`	Verify skill loaded	`skill`
`response_mentions`	Check response contains values	`values`
`response_matches`	Regex match on response	`pattern`
`workflow_completed`	Multi-step verification	`steps`
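
As an illustration of the two response assertions in the table above, here is a minimal TypeScript sketch of how they might be evaluated. The `ResponseAssertion` and `AssertionResult` types and the `checkResponseAssertion` function are hypothetical names, and case-insensitive matching is an assumption; the actual logic in the project's assertion module may differ.

```typescript
// Hypothetical shapes for illustration; the project's real types may differ.
type ResponseAssertion =
  | { type: "response_mentions"; values: string[] }
  | { type: "response_matches"; pattern: string };

interface AssertionResult {
  passed: boolean;
  reason: string;
}

// Assumption: both checks ignore letter case.
function checkResponseAssertion(
  assertion: ResponseAssertion,
  responseText: string
): AssertionResult {
  if (assertion.type === "response_mentions") {
    // Every listed value must appear somewhere in the response.
    const haystack = responseText.toLowerCase();
    const missing = assertion.values.filter(
      (value) => !haystack.includes(value.toLowerCase())
    );
    return {
      passed: missing.length === 0,
      reason:
        missing.length === 0
          ? "all values mentioned"
          : `missing: ${missing.join(", ")}`,
    };
  }
  // response_matches: the regex must match anywhere in the response.
  const matched = new RegExp(assertion.pattern, "i").test(responseText);
  return {
    passed: matched,
    reason: matched ? "pattern matched" : `no match for /${assertion.pattern}/`,
  };
}
```

Returning a `reason` alongside the pass/fail flag keeps failure output readable when many assertions run in one eval.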

Mock Services

Available mock services: `jira`, `slack`, `google`, `whatsapp`

Common Tool Mocks

JIRA: `ai_first_get_blockers`, `ai_first_get_in_progress`, `ai_first_get_all_issues`, `ai_first_get_weekly_summary`, `ai_first_jira_create_issue`

Slack: `ai_first_slack_send_message`, `ai_first_slack_send_dm`, `ai_first_slack_lookup_user_by_email`

Google Slides: `ai_first_slides_get_presentation`, `ai_first_slides_duplicate_template`, `ai_first_slides_update_slide_text`

WhatsApp: `ai_first_whatsapp_search_messages`, `ai_first_whatsapp_get_chat_history`

Examples by Type

Tool Selection Eval

name: jira-blockers-detection
description: Agent should use blockers tool when asked about blocked tickets
type: tool_selection
agent: pm-assistant

input:
  prompt: 'Are there any blocked tickets?'

mocks:
  jira:
    ai_first_get_blockers:
      response:
        count: 2
        issues:
          - key: PROJ-190
            summary: 'Waiting for API access'
            status: 'Blocked'
            blockedDays: 5

expect:
  tool_calls:
    required:
      - name: ai_first_get_blockers
    forbidden:
      - ai_first_get_all_issues
  assertions:
    - type: response_matches
      pattern: 'block|stuck|waiting'

Response Quality Eval

name: communicator-slack-format
description: Communicator should use Slack mrkdwn correctly
type: response_quality
agent: communicator

context:
  platform: slack

input:
  prompt: 'Format a standup: Yesterday I finished PROJ-150, today PROJ-151'

expect:
  assertions:
    - type: response_matches
      pattern: 'yesterday|today'

scoring:
  llm_judge:
    enabled: true
    criteria:
      - name: slack_formatting
        description: 'Uses *single asterisks* for bold, not **double**'
        weight: 0.5
      - name: structure
        description: 'Clear Yesterday/Today/Blockers format'
        weight: 0.5
    threshold: 0.7

Multi-Step Workflow Eval

name: weekly-report-workflow
description: Agent should gather data and update slides
type: multi_step_workflow
agent: pm-assistant

input:
  prompt: 'Update the weekly presentation with the latest sprint data'

mocks:
  jira:
    ai_first_get_weekly_summary:
      response:
        sprint: 'Sprint 12'
        velocity: 42
        completedStories: 8
  google:
    ai_first_slides_duplicate_template:
      response:
        slideId: 'slide_123'
    ai_first_slides_update_slide_text:
      response:
        success: true

expect:
  workflow:
    steps:
      - name: gather_data
        tools: [ai_first_get_weekly_summary]
      - name: create_slide
        depends_on: gather_data
        tools: [ai_first_slides_duplicate_template]
      - name: update_content
        depends_on: create_slide
        tools: [ai_first_slides_update_slide_text]
  assertions:
    - type: workflow_completed
      steps: [gather_data, create_slide, update_content]
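
One way to read the `workflow` block above is that a step completes when all of its tools were called, and a step with `depends_on` only counts tool calls made after its dependency finished. The sketch below implements that reading. The `WorkflowStep` type and both function names are hypothetical, the ordering rule is an assumption rather than the runner's documented behavior, and steps are assumed to be listed after the steps they depend on.

```typescript
// Hypothetical shape mirroring one entry under `expect.workflow.steps`.
interface WorkflowStep {
  name: string;
  tools: string[];
  depends_on?: string;
}

// Returns the names of steps that completed, given tool names in call order.
function completedSteps(steps: WorkflowStep[], toolCalls: string[]): string[] {
  // Step name -> index of the last tool call that completed it.
  const finishedAt = new Map<string, number>();
  for (const step of steps) {
    let earliest = 0;
    if (step.depends_on !== undefined) {
      const dependencyEnd = finishedAt.get(step.depends_on);
      if (dependencyEnd === undefined) continue; // dependency never completed
      earliest = dependencyEnd + 1;
    }
    let last = earliest - 1;
    let allFound = true;
    for (const tool of step.tools) {
      // Only calls at or after `earliest` count toward this step.
      const index = toolCalls.indexOf(tool, earliest);
      if (index === -1) {
        allFound = false;
        break;
      }
      last = Math.max(last, index);
    }
    if (allFound) finishedAt.set(step.name, last);
  }
  return steps.filter((s) => finishedAt.has(s.name)).map((s) => s.name);
}

// True when every expected step name is among the completed steps.
function workflowCompleted(
  expected: string[],
  steps: WorkflowStep[],
  toolCalls: string[]
): boolean {
  const done = new Set(completedSteps(steps, toolCalls));
  return expected.every((name) => done.has(name));
}
```

Under this reading, calling `ai_first_slides_duplicate_template` before `ai_first_get_weekly_summary` would leave `create_slide` incomplete, and `update_content` with it.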

Model Matrix Configuration

Edit `evals/config/models.yaml`:

models:
  default:
    - anthropic/claude-sonnet-4-20250514
  full_matrix:
    - anthropic/claude-sonnet-4-20250514
    - anthropic/claude-opus-4-20250514
    - anthropic/claude-3-5-haiku-20241022
  fast:
    - anthropic/claude-3-5-haiku-20241022
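
Conceptually, a model matrix means each selected eval case runs once per model in the chosen list. The sketch below shows that cross product. `EvalRun` and `expandMatrix` are hypothetical names, and how the runner actually schedules runs is an assumption.

```typescript
// Hypothetical record describing one scheduled run.
interface EvalRun {
  evalName: string;
  model: string;
}

// Pair every eval case with every model in the selected matrix.
function expandMatrix(evalNames: string[], models: string[]): EvalRun[] {
  const runs: EvalRun[] = [];
  for (const evalName of evalNames) {
    for (const model of models) {
      runs.push({ evalName, model });
    }
  }
  return runs;
}
```

With two eval cases and the two-entry list `["anthropic/claude-sonnet-4-20250514", "anthropic/claude-opus-4-20250514"]`, this yields four runs, so a larger matrix multiplies both run time and API cost.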

LLM-as-Judge Setup

Requires the `ANTHROPIC_API_KEY` environment variable, set in `.env`. The CLI loads dotenv automatically, so make sure your API key is configured there:

# In .env file
ANTHROPIC_API_KEY=sk-ant-api03-...

The judge uses Claude to evaluate response quality against defined criteria.

Criteria weights must sum to 1.0.

scoring:
  llm_judge:
    enabled: true
    criteria:
      - name: accuracy
        description: 'Information is correct'
        weight: 0.4
      - name: completeness
        description: 'All requested info included'
        weight: 0.3
      - name: clarity
        description: 'Easy to understand'
        weight: 0.3
    threshold: 0.7 # Minimum score to pass
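
To make the weight and threshold rules concrete, here is a minimal sketch of combining per-criterion scores into one weighted score. `Criterion`, `weightedJudgeScore`, and `passesThreshold` are hypothetical names, and the assumption that each criterion is scored between 0 and 1 comes from the rubric shown earlier, not from the judge's actual implementation.

```typescript
// Hypothetical type for illustration; not the project's actual judge API.
interface Criterion {
  name: string;
  weight: number;
}

// Combine per-criterion scores (each assumed to be in 0..1) into one weighted score.
function weightedJudgeScore(
  criteria: Criterion[],
  scores: Record<string, number>
): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  // Allow for floating-point rounding when summing weights such as 0.4 + 0.3 + 0.3.
  if (Math.abs(totalWeight - 1.0) > 1e-9) {
    throw new Error(`criteria weights sum to ${totalWeight}, expected 1.0`);
  }
  return criteria.reduce((sum, c) => {
    const score = scores[c.name];
    if (score === undefined) {
      throw new Error(`missing score for criterion "${c.name}"`);
    }
    return sum + c.weight * score;
  }, 0);
}

// The eval passes when the weighted score meets or exceeds the threshold.
function passesThreshold(score: number, threshold: number): boolean {
  return score >= threshold;
}
```

For the three criteria above, scores of 1.0 for accuracy, 0.7 for completeness, and 0.4 for clarity give roughly 0.4 + 0.21 + 0.12 = 0.73, which clears a threshold of 0.7.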

Creating New Evals

1. Choose the appropriate type based on what you're testing
2. Create a YAML file in the matching subdirectory under `evals/` (for example, `evals/tool-selection/` for `tool_selection` evals)
3. Define mocks for any external services
4. Add assertions for expected behavior
5. Optionally add LLM-as-judge for quality scoring
6. Run with:

npm run eval -- --pattern "your-eval-name"

Debugging Failed Evals

Check the JSON output in `eval-results/` for:

- `executionTrace.toolCalls` - what tools were actually called
- `executionTrace.skillActivations` - which skills loaded
- `executionTrace.responseText` - full response text
- `assertions` - which assertions failed and why
- `judgeScore.criteria` - per-criterion scores with reasoning

Key Files

File	Purpose
`src/eval/types.ts`	Type definitions
`src/eval/runner/index.ts`	Main runner
`src/eval/runner/assertions.ts`	Assertion logic
`src/eval/judge/index.ts`	LLM-as-judge
`src/eval/mocks/registry.ts`	Mock service registry
`src/eval/cli.ts`	CLI interface
`src/services/openCodeClient.ts`	OpenCode API client for agent invocation

How Tool Tracking Works

Tool calls are extracted from the OpenCode session history, not the immediate response. The flow is:

1. Agent receives prompt via `POST /chat`
2. OpenCode returns response with `step-start`, `reasoning`, `text`, `step-finish` parts
3. Tool calls appear only in session history (not in immediate response)
4. After response completes, the eval runner fetches `GET /session/{id}/message`
5. Tool parts have `type: "tool"` with the tool name in the `tool` field
6. Tool names are prefixed with MCP server name (e.g., `orienter_ai_first_get_blockers`)
7. The prefix is stripped to get the canonical tool name (`ai_first_get_blockers`)

This is why mocks in eval YAML files don't directly return data to the agent - the agent calls real APIs through OpenCode, and the eval system verifies which tools were called.
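
The extraction flow described in this section can be sketched as follows. The `MessagePart` and `SessionMessage` shapes are reduced to the fields mentioned above and are assumptions about the session history payload. `canonicalToolName` and `extractToolCalls` are hypothetical names, and passing the MCP server prefixes in as a list is an illustrative choice.

```typescript
// Hypothetical message shape, reduced to the fields described in this section.
interface MessagePart {
  type: string;
  tool?: string;
}

interface SessionMessage {
  parts: MessagePart[];
}

// Strip a known MCP server prefix, so "orienter_ai_first_get_blockers"
// becomes "ai_first_get_blockers". The prefix list is an assumption.
function canonicalToolName(name: string, serverPrefixes: string[]): string {
  for (const prefix of serverPrefixes) {
    const full = `${prefix}_`;
    if (name.startsWith(full)) {
      return name.slice(full.length);
    }
  }
  return name; // no known prefix: leave the name unchanged
}

// Collect canonical tool names, in call order, from the session history.
function extractToolCalls(
  messages: SessionMessage[],
  serverPrefixes: string[]
): string[] {
  const calls: string[] = [];
  for (const message of messages) {
    for (const part of message.parts) {
      // Only parts with type "tool" carry a tool name.
      if (part.type === "tool" && part.tool !== undefined) {
        calls.push(canonicalToolName(part.tool, serverPrefixes));
      }
    }
  }
  return calls;
}
```

Matching on an explicit prefix list avoids guessing where the server name ends, since tool names themselves contain underscores.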

Best Practices for Assertions

Test Behavior, Not Mock Data

Since agents call real APIs (not mocks), assertions should test behavior patterns rather than specific mock values:

# BAD - Tests specific mock IDs that won't exist in real API
assertions:
  - type: response_mentions
    values: ["PROJ-123", "PROJ-124"]

# GOOD - Tests that agent discusses the right concepts
assertions:
  - type: response_matches
    pattern: "block|stuck|impediment|waiting"

Use Flexible Regex Patterns

Match word stems to catch variations:

# Matches: "completed", "complete", "completion", "completing"
pattern: "complet|finish|done"

# Matches: "notification", "notified", "notify", "notifying"
pattern: "notif|sent|posted|messag"
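
Stem patterns are unanchored, so they also match inside unrelated words. The sketch below makes that easy to check before committing a pattern. `matchesStem` and `unmatchedSamples` are hypothetical helpers, and case-insensitive matching is an assumption about how the runner applies `response_matches`.

```typescript
// Assumption: the pattern is applied case-insensitively anywhere in the text.
function matchesStem(pattern: string, text: string): boolean {
  return new RegExp(pattern, "i").test(text);
}

// Return the sample sentences that the pattern fails to match.
function unmatchedSamples(pattern: string, samples: string[]): string[] {
  return samples.filter((sample) => !matchesStem(pattern, sample));
}
```

For example, `complet|finish|done` matches "Task completed" and "I am finishing up", but it also matches "The task was abandoned", because "abandoned" contains "done". Writing that alternative as `\bdone\b` restricts it to the whole word. In YAML, single-quote such a pattern so that `\b` is not read as a backspace escape.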

One Behavior Per Assertion

Keep assertions focused on single behaviors for clearer failure diagnostics:

assertions:
  # Tests blocker detection
  - type: response_matches
    pattern: 'block|stuck|waiting'

  # Tests notification action (separate assertion)
  - type: response_matches
    pattern: 'sent|posted|notif'

Use `tool_calls.required` Over Assertions

For tool selection tests, prefer the structured `tool_calls` section:

# GOOD - Clear, structured
expect:
  tool_calls:
    required:
      - name: ai_first_get_blockers

# Less preferred - assertion-based
expect:
  assertions:
    - type: tool_called
      tool: ai_first_get_blockers
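
For a sense of what the structured `tool_calls` section expresses, here is a minimal sketch of checking `required`, `forbidden`, and `order` against the observed calls. All type and function names are hypothetical. Treating `order: strict` as "required tools must appear in the listed order" and comparing argument values with `===` are assumptions, not the runner's documented semantics.

```typescript
// Hypothetical shapes mirroring the YAML `expect.tool_calls` section.
interface RequiredCall {
  name: string;
  arguments?: Record<string, unknown>;
}

interface ToolCallExpectation {
  required?: RequiredCall[];
  forbidden?: string[];
  order?: "strict" | "any";
}

interface ObservedCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Partial match: every expected key must be present with an equal value.
// Assumption: values are compared with ===, so only primitives are handled.
function argumentsMatch(
  expected: Record<string, unknown> | undefined,
  actual: Record<string, unknown>
): boolean {
  if (expected === undefined) return true;
  return Object.keys(expected).every((key) => actual[key] === expected[key]);
}

// Returns failure messages; an empty list means the expectation was met.
function checkToolCalls(
  expectation: ToolCallExpectation,
  observed: ObservedCall[]
): string[] {
  const failures: string[] = [];
  const strict = expectation.order === "strict";
  let cursor = 0; // only advanced when order is "strict"

  for (const req of expectation.required ?? []) {
    const start = strict ? cursor : 0;
    const index = observed.findIndex(
      (call, i) =>
        i >= start &&
        call.name === req.name &&
        argumentsMatch(req.arguments, call.arguments)
    );
    if (index === -1) {
      const suffix = strict ? " in the expected order" : "";
      failures.push(`required tool not called${suffix}: ${req.name}`);
    } else if (strict) {
      cursor = index + 1;
    }
  }

  for (const name of expectation.forbidden ?? []) {
    if (observed.some((call) => call.name === name)) {
      failures.push(`forbidden tool was called: ${name}`);
    }
  }
  return failures;
}
```

A single structured check like this covers presence, absence, arguments, and ordering together, which is harder to express with separate `tool_called` assertions.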