# aiwg eval-report

Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations.
## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/jmagly/aiwg
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agentic/code/addons/aiwg-evals/skills/eval-report" ~/.claude/skills/jmagly-aiwg-eval-report-93ab4d && rm -rf "$T"
```

**Manifest**: `agentic/code/addons/aiwg-evals/skills/eval-report/SKILL.md`
# Evaluation Report

Generate a quality report from accumulated evaluation results.

## Research Foundation

- REF-001: BP-9 - Continuous evaluation of agent performance
- REF-002: KAMI benchmark methodology for real agentic task evaluation

## Usage

```bash
/eval-report
/eval-report --output .aiwg/reports/quality-report.md
/eval-report --compare previous-report.json
/eval-report --mode sdlc --format json
```
## Options
| Option | Default | Description |
|---|---|---|
| --output | stdout | Output file path |
| --compare | none | Previous report to diff against |
| --mode | all | Agent category: sdlc, marketing, forensics, all |
| --format | markdown | Output format: markdown, json |
| --since | none | Only include results after this date (ISO 8601) |
| --threshold | 0.85 | Score below this triggers a warning |
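
The table pins down defaults but not mechanics. As a rough illustration, `--since` filtering and `--threshold` warnings could be implemented like the Python sketch below; the result timestamp field and the inclusive cutoff are assumptions, not documented behavior.

```python
# Minimal sketch of --since / --threshold handling. Assumes each result
# carries an ISO 8601 timestamp; cutoff inclusivity is a guess.
from datetime import datetime, timezone

def parse_iso(ts: str) -> datetime:
    # Accept a trailing "Z" and normalize naive timestamps to UTC so
    # comparisons between results and the cutoff never mix tz-awareness.
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)

def include(result_ts: str, since: str | None) -> bool:
    """Keep a result when no --since is given or it is at/after the cutoff."""
    return since is None or parse_iso(result_ts) >= parse_iso(since)

def warns(score: float, threshold: float = 0.85) -> bool:
    """True when a score should trigger a warning (below --threshold)."""
    return score < threshold
```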
## Process

1. **Collect Results**: Read all `eval-*.json` files from `.aiwg/reports/`
2. **Aggregate Scores**: Compute per-agent and per-archetype scores (see the sketch after this list)
3. **Detect Regressions**: Compare against the `--compare` baseline if provided
4. **Rank Agents**: Sort by overall score and flag below-threshold agents
5. **Build Recommendations**: Surface specific agents and archetypes needing attention
6. **Output Report**: Write markdown or JSON to `--output` or stdout
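
A hedged sketch of steps 1, 2, and 4: the actual schema of the `eval-*.json` files isn't reproduced in this doc, so the `agent`, `archetype`, and `score` fields below are illustrative assumptions.

```python
# Sketch of collect/aggregate/rank, assuming one flat record per file
# shaped like {"agent": ..., "archetype": ..., "score": 0.0-1.0}.
import json
from collections import defaultdict
from pathlib import Path

def aggregate(results_dir: Path = Path(".aiwg/reports"), threshold: float = 0.85):
    per_agent = defaultdict(list)      # agent name -> list of scores
    per_archetype = defaultdict(list)  # archetype name -> list of scores
    for path in sorted(results_dir.glob("eval-*.json")):
        record = json.loads(path.read_text())
        per_agent[record["agent"]].append(record["score"])
        per_archetype[record["archetype"]].append(record["score"])

    mean = lambda xs: sum(xs) / len(xs)
    agent_scores = {a: mean(s) for a, s in per_agent.items()}
    archetype_scores = {k: mean(s) for k, s in per_archetype.items()}

    # Rank worst-first and flag anything under --threshold.
    ranked = sorted(agent_scores.items(), key=lambda kv: kv[1])
    flagged = [(agent, s) for agent, s in ranked if s < threshold]
    return agent_scores, archetype_scores, flagged
```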
## Report Sections

### Summary Dashboard

Overall health at a glance: total agents tested, aggregate score, regression count.

### By Archetype

Pass rates per Roig (2025) failure archetype across all agents.

### Agents Needing Attention

Agents below the `--threshold`, with consecutive-failure streaks flagged.
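
The streak flag presumably counts failures at the tail of an agent's chronological run history; a small illustrative helper, not the addon's actual code:

```python
def consecutive_failures(outcomes: list[bool]) -> int:
    """Count trailing failures in an oldest-to-newest pass/fail history."""
    streak = 0
    for passed in reversed(outcomes):
        if passed:
            break
        streak += 1
    return streak

# Three failures in a row at the tail, like the data-analyst row in the
# sample report below:
assert consecutive_failures([True, True, False, False, False]) == 3
```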
### Regression Analysis

When `--compare` is provided: agents whose scores dropped since the baseline.
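
A minimal sketch of the diff, assuming the baseline JSON exposes per-agent scores under an `agent_scores` key (the real layout may differ) and treating a drop of two points or more as a regression (the tolerance is a guess):

```python
import json

def regressions(current: dict[str, float], baseline_path: str,
                min_drop: float = 0.02) -> list[tuple[str, float]]:
    """Return (agent, drop) pairs for agents that fell since the baseline."""
    with open(baseline_path) as f:
        baseline = json.load(f).get("agent_scores", {})
    drops = [(agent, prev - score)
             for agent, score in current.items()
             if (prev := baseline.get(agent)) is not None
             and prev - score >= min_drop]
    return sorted(drops, key=lambda pair: pair[1], reverse=True)
```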
### Recommendations

Prioritized action list: which agents to review, which archetypes to harden.
## Output Format (Markdown)

```markdown
# Agent Quality Report

**Generated**: 2026-04-01T10:30:00Z
**Agents Tested**: 58
**Overall Score**: 87%
**Regressions**: 2

## By Archetype

| Archetype | Pass Rate | Trend |
|-----------|-----------|-------|
| #1 Grounding | 92% | ↑ |
| #2 Substitution | 88% | → |
| #3 Distractor | 78% | ↓ |
| #4 Recovery | 90% | ↑ |

## Agents Needing Attention

| Agent | Score | Consecutive Failures | Issue |
|-------|-------|---------------------|-------|
| data-analyst | 72% | 3 | distractor-test |
| api-designer | 79% | 1 | latency regression (+40%) |

## Recommendations

1. Review `data-analyst` context filtering — failed distractor-test 3 consecutive runs
2. Investigate `api-designer` tool selection — latency regression
3. Increase distractor-test scenarios for marketing agents (78% pass rate below 80% target)
```
## Output Format (JSON)

```json
{
  "generated": "2026-04-01T10:30:00Z",
  "summary": {
    "agents_tested": 58,
    "overall_score": 0.87,
    "regressions": 2
  },
  "by_archetype": {
    "grounding": 0.92,
    "substitution": 0.88,
    "distractor": 0.78,
    "recovery": 0.90
  },
  "agents_needing_attention": [
    {"agent": "data-analyst", "score": 0.72, "consecutive_failures": 3, "issue": "distractor-test"}
  ],
  "recommendations": [
    "Review data-analyst context filtering"
  ]
}
```
## Examples

```bash
# Standard report to stdout
/eval-report

# Save to file
/eval-report --output .aiwg/reports/quality-$(date +%Y%m%d).md

# Compare against baseline
/eval-report --compare .aiwg/reports/quality-20260301.json

# JSON for CI consumption
/eval-report --format json --threshold 0.80

# SDLC agents only
/eval-report --mode sdlc
```
## Related Commands

- `/eval-agent` - Test individual agents
- `/eval-workflow` - Test multi-agent workflows
- `aiwg lint agents` - Static validation
Generate evaluation report: $ARGUMENTS
## References
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete threshold and scoring requirements
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC framework context for agent evaluation scope
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for evaluation-related commands