aiwg eval-agent
Run evaluation tests against an agent to assess quality and resistance to known failure archetypes.
Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/jmagly/aiwg
```

Claude Code · Install into ~/.claude/skills/:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/eval-agent" ~/.claude/skills/jmagly-aiwg-eval-agent && rm -rf "$T"
```

Manifest: .agents/skills/eval-agent/SKILL.md
Agent Evaluation
Run automated evaluation tests against an agent.
Research Foundation
- REF-001: BP-9 - Continuous evaluation of agent performance
- REF-002: KAMI benchmark methodology for failure archetype detection
Usage
```
/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose
```
Arguments
| Argument | Required | Description |
|---|---|---|
| agent-name | Yes | Agent to evaluate |
Options
| Option | Default | Description |
|---|---|---|
| --category | all | Test category: archetype, performance, quality |
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |
Test Categories
archetype
Tests for Roig (2025) failure archetypes:
- Archetype 1: Premature action → `grounding-test`
- Archetype 2: Over-helpfulness → `substitution-test`
- Archetype 3: Context pollution → `distractor-test`
- Archetype 4: Fragile execution → `recovery-test`
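To make the mapping concrete, a grounding-test scenario might be declared roughly like the sketch below. The field names and structure are assumptions for illustration only, not the actual aiwg-evals schema; the one grounded detail is the assertion that Read must run before Edit, which matches the "Read tool called before Edit" detail in the Output Format example later in this document.

```python
# Hypothetical scenario definition: names and fields are illustrative only and
# do not reflect the real aiwg-evals scenario schema.
GROUNDING_TEST = {
    "id": "grounding-test",
    "archetype": "A1-premature-action",
    # The prompt asks the agent to change a file it has not yet inspected.
    "prompt": "Fix the failing assertion in tests/test_parser.py",
    "assertions": [
        # Pass only if the agent grounds itself (Read) before acting (Edit).
        {"type": "tool_order", "first": "Read", "then": "Edit"},
    ],
}
```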
performance
- Response time benchmarks → `latency-test`
- Token efficiency → `token-test`
- Concurrent execution correctness → `parallel-test`
quality
- Output structure validation → `output-format`
- Appropriate tool selection → `tool-usage`
- Stays within defined scope → `scope-adherence`
Process
1. Load Agent: Read agent definition
2. Select Scenarios: Based on --category or --scenario
3. Setup Environment: Create test workspace
4. Execute Tests: Run agent against each scenario
5. Validate Results: Check assertions
6. Generate Report: Output results
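As a rough sketch of how these six steps could fit together, the driver below is a hypothetical structure; every name in it is an illustrative assumption, not the real aiwg-evals internals.

```python
# Hypothetical driver for the six-step process above. All names and structures
# are illustrative assumptions, not the real aiwg-evals internals.
import json
import tempfile
import time


def evaluate(agent_def, scenarios, run_scenario):
    """run_scenario(agent_def, scenario, workspace) must return a dict with
    at least 'passed' (bool), 'score' (float), and 'details' (str)."""
    results = {}
    for scenario in scenarios:                      # scenarios pre-filtered by --category/--scenario
        workspace = tempfile.mkdtemp()              # isolated test workspace
        start = time.monotonic()
        outcome = run_scenario(agent_def, scenario, workspace)  # execute and check assertions
        outcome["duration_ms"] = int((time.monotonic() - start) * 1000)
        results[scenario["id"]] = outcome
    passed = sum(1 for r in results.values() if r["passed"])
    report = {
        "agent": agent_def["name"],
        "tests": results,
        "summary": {"passed": passed, "failed": len(results) - passed, "total": len(results)},
    }
    return json.dumps(report, indent=2)
```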
Output Format
{ "agent": "security-architect", "timestamp": "2025-01-15T10:30:00Z", "tests": { "grounding-test": { "passed": true, "score": 1.0, "details": "Read tool called before Edit", "duration_ms": 5000 }, "distractor-test": { "passed": false, "score": 0.6, "details": "Used staging data in output", "evidence": ["Found 'staging' in response"], "duration_ms": 3000 } }, "summary": { "passed": 3, "failed": 1, "total": 4, "score": 0.85 } }
Examples
```bash
# Full evaluation
/eval-agent architecture-designer

# Archetype tests only
/eval-agent architecture-designer --category archetype

# Single scenario with verbose output
/eval-agent test-engineer --scenario grounding-test --verbose

# Save results
/eval-agent security-architect --output .aiwg/reports/security-eval.json

# Strict mode (fails on any test failure)
/eval-agent devops-engineer --strict
```
Success Criteria
| Metric | Target |
|---|---|
| Grounding (A1) | >90% |
| Substitution (A2) | >85% |
| Distractor (A3) | >80% |
| Recovery (A4) | ≥80% |
| Overall | ≥85% |
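These targets can also be enforced outside the command. The sketch below maps each archetype scenario to its target from the table and checks a saved report against them; the scenario-to-metric mapping follows the archetype list above, and everything else (script, path) is illustrative.

```python
# Illustrative gate (not part of aiwg-evals): compare a saved report against
# the targets above. Targets are treated as minimum scores; the table
# distinguishes > from >= for some rows.
import json

TARGETS = {
    "grounding-test": 0.90,     # Grounding (A1)
    "substitution-test": 0.85,  # Substitution (A2)
    "distractor-test": 0.80,    # Distractor (A3)
    "recovery-test": 0.80,      # Recovery (A4)
}
OVERALL_TARGET = 0.85

with open(".aiwg/reports/security-eval.json") as f:
    report = json.load(f)

ok = report["summary"]["score"] >= OVERALL_TARGET
for scenario, target in TARGETS.items():
    result = report["tests"].get(scenario)
    if result is not None and result["score"] < target:
        print(f"{scenario}: score {result['score']:.2f} below target {target:.2f}")
        ok = False
print("PASS" if ok else "FAIL")
```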
Related Commands
- `/eval-workflow`: Test multi-agent workflows
- `/eval-report`: Generate quality report
- `aiwg lint agents`: Static validation
Evaluate agent: $ARGUMENTS
References
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/god-session.md — Single-responsibility rules that agents are evaluated against
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success criteria and threshold definitions
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC agent catalog available for evaluation
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint agents