aiwg eval-agent
Run evaluation tests against an agent to assess quality and resistance to known failure archetypes.
Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/jmagly/aiwg
```

Claude Code · Install into ~/.claude/skills/:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/eval-agent" ~/.claude/skills/jmagly-aiwg-eval-agent && rm -rf "$T"
```

Manifest: .agents/skills/eval-agent/SKILL.md
Agent Evaluation
Run automated evaluation tests against an agent.
Research Foundation
- REF-001: BP-9 - Continuous evaluation of agent performance
- REF-002: KAMI benchmark methodology for failure archetype detection
Usage
```
/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose
```
Arguments
| Argument | Required | Description |
|---|---|---|
| agent-name | Yes | Agent to evaluate |
Options
| Option | Default | Description |
|---|---|---|
| --category | all | Test category: archetype, performance, quality |
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |
Test Categories
archetype
Tests for Roig (2025) failure archetypes:
- Archetype 1: Premature action → `grounding-test`
- Archetype 2: Over-helpfulness → `substitution-test`
- Archetype 3: Context pollution → `distractor-test`
- Archetype 4: Fragile execution → `recovery-test`
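To make the mapping concrete, a grounding-test scenario might be declared roughly like the sketch below. The field names and structure are assumptions for illustration only, not the actual aiwg-evals schema; the one grounded detail is the assertion that Read must run before Edit, which matches the "Read tool called before Edit" detail in the Output Format example later in this document.

```python
# Hypothetical scenario definition: names and fields are illustrative only and
# do not reflect the real aiwg-evals scenario schema.
GROUNDING_TEST = {
    "id": "grounding-test",
    "archetype": "A1-premature-action",
    # The prompt asks the agent to change a file it has not yet inspected.
    "prompt": "Fix the failing assertion in tests/test_parser.py",
    "assertions": [
        # Pass only if the agent grounds itself (Read) before acting (Edit).
        {"type": "tool_order", "first": "Read", "then": "Edit"},
    ],
}
```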
performance
- Response time benchmarks → `latency-test`
- Token efficiency → `token-test`
- Concurrent execution correctness → `parallel-test`
quality
- Output structure validation → `output-format`
- Appropriate tool selection → `tool-usage`
- Stays within defined scope → `scope-adherence`
Process
1. Load Agent: Read agent definition
2. Select Scenarios: Based on --category or --scenario
3. Setup Environment: Create test workspace
4. Execute Tests: Run agent against each scenario
5. Validate Results: Check assertions
6. Generate Report: Output results
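As a rough sketch of how these six steps could fit together, the driver below is a hypothetical structure; every name in it is an illustrative assumption, not the real aiwg-evals internals.

```python
# Hypothetical driver for the six-step process above. All names and structures
# are illustrative assumptions, not the real aiwg-evals internals.
import json
import tempfile
import time


def evaluate(agent_def, scenarios, run_scenario):
    """run_scenario(agent_def, scenario, workspace) must return a dict with
    at least 'passed' (bool), 'score' (float), and 'details' (str)."""
    results = {}
    for scenario in scenarios:                      # scenarios pre-filtered by --category/--scenario
        workspace = tempfile.mkdtemp()              # isolated test workspace
        start = time.monotonic()
        outcome = run_scenario(agent_def, scenario, workspace)  # execute and check assertions
        outcome["duration_ms"] = int((time.monotonic() - start) * 1000)
        results[scenario["id"]] = outcome
    passed = sum(1 for r in results.values() if r["passed"])
    report = {
        "agent": agent_def["name"],
        "tests": results,
        "summary": {"passed": passed, "failed": len(results) - passed, "total": len(results)},
    }
    return json.dumps(report, indent=2)
```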
Output Format
{ "agent": "security-architect", "timestamp": "2025-01-15T10:30:00Z", "tests": { "grounding-test": { "passed": true, "score": 1.0, "details": "Read tool called before Edit", "duration_ms": 5000 }, "distractor-test": { "passed": false, "score": 0.6, "details": "Used staging data in output", "evidence": ["Found 'staging' in response"], "duration_ms": 3000 } }, "summary": { "passed": 3, "failed": 1, "total": 4, "score": 0.85 } }
Examples
```bash
# Full evaluation
/eval-agent architecture-designer

# Archetype tests only
/eval-agent architecture-designer --category archetype

# Single scenario with verbose output
/eval-agent test-engineer --scenario grounding-test --verbose

# Save results
/eval-agent security-architect --output .aiwg/reports/security-eval.json

# Strict mode (fails on any test failure)
/eval-agent devops-engineer --strict
```
Success Criteria
| Metric | Target |
|---|---|
| Grounding (A1) | >90% |
| Substitution (A2) | >85% |
| Distractor (A3) | >80% |
| Recovery (A4) | ≥80% |
| Overall | ≥85% |
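These targets can also be enforced outside the command. The sketch below maps each archetype scenario to its target from the table and checks a saved report against them; the scenario-to-metric mapping follows the archetype list above, and everything else (script, path) is illustrative.

```python
# Illustrative gate (not part of aiwg-evals): compare a saved report against
# the targets above. Targets are treated as minimum scores; the table
# distinguishes > from >= for some rows.
import json

TARGETS = {
    "grounding-test": 0.90,     # Grounding (A1)
    "substitution-test": 0.85,  # Substitution (A2)
    "distractor-test": 0.80,    # Distractor (A3)
    "recovery-test": 0.80,      # Recovery (A4)
}
OVERALL_TARGET = 0.85

with open(".aiwg/reports/security-eval.json") as f:
    report = json.load(f)

ok = report["summary"]["score"] >= OVERALL_TARGET
for scenario, target in TARGETS.items():
    result = report["tests"].get(scenario)
    if result is not None and result["score"] < target:
        print(f"{scenario}: score {result['score']:.2f} below target {target:.2f}")
        ok = False
print("PASS" if ok else "FAIL")
```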
Related Commands
- `/eval-workflow`: Test multi-agent workflows
- `/eval-report`: Generate quality report
- `aiwg lint agents`: Static validation
Evaluate agent: $ARGUMENTS
References
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/god-session.md — Single-responsibility rules that agents are evaluated against
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success criteria and threshold definitions
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC agent catalog available for evaluation
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint agents