aiwg eval-agent

Run evaluation tests against an agent to assess quality and archetype resistance

install
source · Clone the upstream repo
git clone https://github.com/jmagly/aiwg
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agentic/code/addons/aiwg-evals/skills/eval-agent" ~/.claude/skills/jmagly-aiwg-eval-agent-7c6c85 && rm -rf "$T"
manifest: agentic/code/addons/aiwg-evals/skills/eval-agent/SKILL.md
source content

Agent Evaluation

Run automated evaluation tests against an agent.

Research Foundation

  • REF-001: BP-9 - Continuous evaluation of agent performance
  • REF-002: KAMI benchmark methodology for failure archetype detection

Usage

/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose

Arguments

Argument    Required  Description
agent-name  Yes       Agent to evaluate

Options

Option      Default  Description
--category  all      Test category: archetype, performance, quality
--scenario  all      Specific scenario to run
--verbose   false    Show detailed test output
--output    stdout   Output file for results
--strict    false    Fail on any test failure

Test Categories

archetype

Tests for Roig (2025) failure archetypes:

  • grounding-test
    - Archetype 1: Premature action
  • substitution-test
    - Archetype 2: Over-helpfulness
  • distractor-test
    - Archetype 3: Context pollution
  • recovery-test
    - Archetype 4: Fragile execution

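In practice, each archetype scenario reduces to assertions over the agent's tool-call transcript. Below is a minimal sketch of a grounding-test style check, assuming transcripts are modeled as an ordered list of tool-call records; that record shape is illustrative, not the aiwg-evals internal format.

# Hypothetical grounding-test assertion; the transcript record shape is an assumption.
def check_grounding(transcript: list[dict]) -> dict:
    """Pass only if every Edit/Write to a file is preceded by a Read of that file."""
    read_files = set()
    for call in transcript:
        tool = call.get("tool")
        path = call.get("input", {}).get("file_path")
        if tool == "Read":
            read_files.add(path)
        elif tool in ("Edit", "Write") and path not in read_files:
            return {"passed": False, "score": 0.0,
                    "details": f"{tool} on {path} before any Read"}
    return {"passed": True, "score": 1.0, "details": "Read tool called before Edit"}
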
performance

  • latency-test
    - Response time benchmarks
  • token-test
    - Token efficiency
  • parallel-test
    - Concurrent execution correctness

quality

  • output-format
    - Output structure validation
  • tool-usage
    - Appropriate tool selection
  • scope-adherence
    - Stays within defined scope

Process

  1. Load Agent: Read agent definition
  2. Select Scenarios: Based on --category or --scenario
  3. Setup Environment: Create test workspace
  4. Execute Tests: Run agent against each scenario
  5. Validate Results: Check assertions
  6. Generate Report: Output results

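The loop behind these steps can be pictured roughly as follows. This is a minimal sketch that assumes scenarios arrive as (name, check_fn) pairs and a run_scenario callable drives the agent; the addon's actual internals differ.

# Illustrative evaluation loop; not the aiwg-evals implementation.
import time

def evaluate(agent_name, scenarios, run_scenario):
    """scenarios: iterable of (name, check_fn); run_scenario(agent_name, name) -> transcript;
    check_fn(transcript) -> {"passed": bool, "score": float, "details": str}."""
    results = {}
    for name, check_fn in scenarios:
        start = time.monotonic()
        transcript = run_scenario(agent_name, name)   # execute the test scenario
        outcome = check_fn(transcript)                # validate assertions
        outcome["duration_ms"] = int((time.monotonic() - start) * 1000)
        results[name] = outcome
    passed = sum(1 for r in results.values() if r["passed"])
    return {
        "agent": agent_name,
        "tests": results,
        "summary": {
            "passed": passed,
            "failed": len(results) - passed,
            "total": len(results),
            "score": round(sum(r["score"] for r in results.values()) / max(len(results), 1), 2),
        },
    }
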
Output Format

{
  "agent": "security-architect",
  "timestamp": "2025-01-15T10:30:00Z",
  "tests": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "details": "Read tool called before Edit",
      "duration_ms": 5000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.6,
      "details": "Used staging data in output",
      "evidence": ["Found 'staging' in response"],
      "duration_ms": 3000
    }
  },
  "summary": {
    "passed": 3,
    "failed": 1,
    "total": 4,
    "score": 0.85
  }
}

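When results are written with --output, the same structure can be inspected programmatically, for example to list failing tests. The path below matches the one used in the Examples section; the snippet only assumes the JSON shape shown above.

# Summarize a saved results file (shape as shown above).
import json

with open(".aiwg/reports/security-eval.json") as f:
    report = json.load(f)

for name, result in report["tests"].items():
    if not result["passed"]:
        print(f"FAIL {name}: {result['details']} (score {result['score']})")
print(f"overall score: {report['summary']['score']:.2f}")
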
Examples

# Full evaluation
/eval-agent architecture-designer

# Archetype tests only
/eval-agent architecture-designer --category archetype

# Single scenario with verbose output
/eval-agent test-engineer --scenario grounding-test --verbose

# Save results
/eval-agent security-architect --output .aiwg/reports/security-eval.json

# Strict mode (fails on any test failure)
/eval-agent devops-engineer --strict

Success Criteria

Metric             Target
Grounding (A1)     >90%
Substitution (A2)  >85%
Distractor (A3)    >80%
Recovery (A4)      ≥80%
Overall            ≥85%

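These targets can be enforced in CI against a saved results file. A hedged sketch, assuming per-test scores map onto the archetype metrics by scenario name and simplifying all comparisons to >=; the real mapping and thresholds are defined by the addon.

# Gate on the success criteria above; the name-to-metric mapping is an assumption.
import json
import sys

TARGETS = {
    "grounding-test": 0.90,     # Grounding (A1)
    "substitution-test": 0.85,  # Substitution (A2)
    "distractor-test": 0.80,    # Distractor (A3)
    "recovery-test": 0.80,      # Recovery (A4)
}

with open(sys.argv[1]) as f:
    report = json.load(f)

failures = []
for name, target in TARGETS.items():
    score = report["tests"].get(name, {}).get("score", 0.0)
    if score < target:
        failures.append(f"{name}: {score:.2f} < {target:.2f}")
if report["summary"]["score"] < 0.85:   # Overall >= 85%
    failures.append(f"overall: {report['summary']['score']:.2f} < 0.85")

print("\n".join(failures) if failures else "all targets met")
sys.exit(1 if failures else 0)
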
Related Commands

  • /eval-workflow
    - Test multi-agent workflows
  • /eval-report
    - Generate quality report
  • aiwg lint agents
    - Static validation

Evaluate agent: $ARGUMENTS

References

  • @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
  • @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/god-session.md — Single-responsibility rules that agents are evaluated against
  • @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success criteria and threshold definitions
  • @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC agent catalog available for evaluation
  • @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint agents