aiwg eval-report

Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations

Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/jmagly/aiwg
```

Claude Code · Install into ~/.claude/skills/:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agentic/code/addons/aiwg-evals/skills/eval-report" ~/.claude/skills/jmagly-aiwg-eval-report-93ab4d && rm -rf "$T"
```

Manifest: agentic/code/addons/aiwg-evals/skills/eval-report/SKILL.md

Source content

Evaluation Report

Generate a quality report from accumulated evaluation results.

Research Foundation

  • REF-001: BP-9 - Continuous evaluation of agent performance
  • REF-002: KAMI benchmark methodology for real agentic task evaluation

Usage

```
/eval-report
/eval-report --output .aiwg/reports/quality-report.md
/eval-report --compare previous-report.json
/eval-report --mode sdlc --format json
```

Options

| Option | Default | Description |
|-------------|----------|------------------------------------------------|
| --output | stdout | Output file path |
| --compare | none | Previous report to diff against |
| --mode | all | Agent category: sdlc, marketing, forensics, all |
| --format | markdown | Output format: markdown, json |
| --since | none | Only include results after this date (ISO 8601) |
| --threshold | 0.85 | Score below this triggers a warning |

Process

  1. Collect Results: Read all `eval-*.json` files from `.aiwg/reports/`
  2. Aggregate Scores: Compute per-agent and per-archetype scores
  3. Detect Regressions: Compare against --compare baseline if provided
  4. Rank Agents: Sort by overall score, flag below-threshold agents
  5. Build Recommendations: Surface specific agents and archetypes needing attention
  6. Output Report: Write markdown or JSON to --output or stdout
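
The aggregation logic in steps 1-4 is straightforward to sketch. Here is a minimal Python version, assuming each result file carries `agent`, `archetype`, and `score` fields; these names are illustrative, not the addon's actual eval-*.json schema:

```python
# Minimal sketch of steps 1-4: collect, aggregate, rank, flag.
# Field names (agent, archetype, score) are assumptions for illustration.
import glob
import json
from collections import defaultdict

THRESHOLD = 0.85  # matches the --threshold default

def collect_results(report_dir=".aiwg/reports"):
    """Step 1: read every eval-*.json file in the reports directory."""
    results = []
    for path in glob.glob(f"{report_dir}/eval-*.json"):
        with open(path) as f:
            results.append(json.load(f))
    return results

def aggregate(results):
    """Steps 2-4: per-agent and per-archetype means, ranking, flags."""
    by_agent = defaultdict(list)
    by_archetype = defaultdict(list)
    for r in results:
        by_agent[r["agent"]].append(r["score"])
        by_archetype[r["archetype"]].append(r["score"])
    agent_scores = {a: sum(s) / len(s) for a, s in by_agent.items()}
    archetype_scores = {a: sum(s) / len(s) for a, s in by_archetype.items()}
    ranked = sorted(agent_scores.items(), key=lambda kv: kv[1])
    flagged = [(agent, s) for agent, s in ranked if s < THRESHOLD]
    return agent_scores, archetype_scores, flagged
```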

Report Sections

Summary Dashboard

Overall health at a glance — total agents tested, aggregate score, regression count.

By Archetype

Pass rates per Roig (2025) failure archetype across all agents.

Agents Needing Attention

Agents below the --threshold, with consecutive-failure streaks flagged.
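
One way to compute a consecutive-failure streak, assuming each result also records a `timestamp` and a boolean `passed` (hypothetical fields, not confirmed by the addon's schema):

```python
def failure_streak(results, agent):
    """Count trailing consecutive failures for one agent, newest first.
    Assumes 'agent', 'timestamp', and 'passed' fields (illustrative names)."""
    runs = sorted(
        (r for r in results if r["agent"] == agent),
        key=lambda r: r["timestamp"],
        reverse=True,
    )
    streak = 0
    for r in runs:
        if r["passed"]:
            break  # streak ends at the most recent passing run
        streak += 1
    return streak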

Regression Analysis

When --compare is provided: agents whose scores dropped since the baseline.
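
A sketch of that comparison, where both reports map agent name to score; the small tolerance guard against score noise is an assumption, not documented behavior:

```python
def detect_regressions(current, baseline, tolerance=0.01):
    """Flag agents whose score dropped below the --compare baseline.
    'current' and 'baseline' map agent name -> score (0.0 to 1.0)."""
    regressions = {}
    for agent, score in current.items():
        prev = baseline.get(agent)
        if prev is not None and score < prev - tolerance:
            regressions[agent] = {
                "before": prev,
                "after": score,
                "delta": round(score - prev, 4),
            }
    return regressions
```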

Recommendations

Prioritized action list: which agents to review, which archetypes to harden.

Output Format (Markdown)

```markdown
# Agent Quality Report

**Generated**: 2026-04-01T10:30:00Z
**Agents Tested**: 58
**Overall Score**: 87%
**Regressions**: 2

## By Archetype

| Archetype | Pass Rate | Trend |
|-----------|-----------|-------|
| #1 Grounding | 92% | ↑ |
| #2 Substitution | 88% | → |
| #3 Distractor | 78% | ↓ |
| #4 Recovery | 90% | ↑ |

## Agents Needing Attention

| Agent | Score | Consecutive Failures | Issue |
|-------|-------|---------------------|-------|
| data-analyst | 72% | 3 | distractor-test |
| api-designer | 79% | 1 | latency regression (+40%) |

## Recommendations

1. Review `data-analyst` context filtering — failed distractor-test 3 consecutive runs
2. Investigate `api-designer` tool selection — latency regression
3. Increase distractor-test scenarios for marketing agents (78% pass rate below 80% target)
```

Output Format (JSON)

```json
{
  "generated": "2026-04-01T10:30:00Z",
  "summary": {
    "agents_tested": 58,
    "overall_score": 0.87,
    "regressions": 2
  },
  "by_archetype": {
    "grounding": 0.92,
    "substitution": 0.88,
    "distractor": 0.78,
    "recovery": 0.90
  },
  "agents_needing_attention": [
    {"agent": "data-analyst", "score": 0.72, "consecutive_failures": 3, "issue": "distractor-test"}
  ],
  "recommendations": [
    "Review data-analyst context filtering"
  ]
}
```
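
The JSON output lends itself to CI gating. A hypothetical gate script that fails the build on regressions or a low overall score (the exit-code policy and filename here are examples, not part of the command):

```python
# Hypothetical CI gate for the JSON report. Run
# /eval-report --format json --output quality-report.json first.
import json
import sys

with open("quality-report.json") as f:
    report = json.load(f)

summary = report["summary"]
if summary["regressions"] > 0 or summary["overall_score"] < 0.85:
    print(f"quality gate failed: {summary}", file=sys.stderr)
    sys.exit(1)
print("quality gate passed")
```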

Examples

```
# Standard report to stdout
/eval-report

# Save to file
/eval-report --output .aiwg/reports/quality-$(date +%Y%m%d).md

# Compare against baseline
/eval-report --compare .aiwg/reports/quality-20260301.json

# JSON for CI consumption
/eval-report --format json --threshold 0.80

# SDLC agents only
/eval-report --mode sdlc
```

Related Commands

  • /eval-agent - Test individual agents
  • /eval-workflow - Test multi-agent workflows
  • aiwg lint agents - Static validation

Generate evaluation report: $ARGUMENTS

References

  • @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
  • @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete threshold and scoring requirements
  • @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context for agent evaluation scope
  • @$AIWG_ROOT/docs/cli-reference.md — CLI reference for evaluation-related commands