aiwg eval-workflow
Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance
Install

Source: clone the upstream repo.

```bash
git clone https://github.com/jmagly/aiwg
```

Claude Code: install into `~/.claude/skills/`.

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/eval-workflow" ~/.claude/skills/jmagly-aiwg-eval-workflow && rm -rf "$T"
```
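To confirm the skill landed where Claude Code discovers it, a quick check of the target directory (assuming the default skills path used above):

```bash
# The copied directory should contain SKILL.md from the upstream repo
ls ~/.claude/skills/jmagly-aiwg-eval-workflow
```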
Manifest: `.agents/skills/eval-workflow/SKILL.md` (source content below)
Workflow Evaluation
Run automated evaluation tests against a multi-agent workflow.
Research Foundation
- REF-001: BP-9 - Continuous evaluation of agent performance
- REF-002: KAMI benchmark methodology for real agentic task evaluation
Usage
```bash
/eval-workflow flow-security-review-cycle
/eval-workflow flow-inception-to-elaboration --scenario distractor-test
/eval-workflow flow-deploy-to-production --verbose --strict
```
Arguments
| Argument | Required | Description |
|---|---|---|
| workflow-name | Yes | Workflow (flow command) to evaluate |
Options
| Option | Default | Description |
|---|---|---|
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |
| --timeout | 300 | Maximum seconds per scenario |
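For example, a long-running workflow might be evaluated with a raised timeout and the results written to a report file (the workflow name and output path here are illustrative):

```bash
/eval-workflow flow-deploy-to-production --timeout 600 --output .aiwg/reports/deploy-eval.json
```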
What Gets Evaluated
Orchestration Quality
- Agent coordination: Parallel agents launched correctly in a single message
- Handoff fidelity: Artifacts pass correctly between phases
- Gate enforcement: Phase gates checked before transition
Archetype Resistance
- Archetype 1: Premature action without reading state (`grounding-test`)
- Archetype 3: Context pollution from irrelevant artifacts (`distractor-test`)
- Archetype 4: Fragile execution when a subagent fails (`recovery-test`)
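Each archetype maps to a named scenario, so a single resistance check can be run in isolation, for example:

```bash
/eval-workflow flow-inception-to-elaboration --scenario recovery-test --verbose
```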
Output Validation
- Required artifacts created in correct `.aiwg/` paths
- Document structure matches templates
- Traceability links intact
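A manual spot check of the same kind might look like the sketch below; the artifact paths are hypothetical and depend on the workflow's templates:

```bash
# Hypothetical artifact list for a security-review flow; actual paths come from the templates
for f in .aiwg/security/threat-model.md .aiwg/reports/security-gate.md; do
  [ -f "$f" ] && echo "present: $f" || echo "MISSING: $f"
done
```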
Process
- Load Workflow: Read flow command definition
- Select Scenarios: Based on --scenario flag or all applicable
- Setup Workspace: Create isolated test space in `.aiwg/working/`
- Execute Flow: Run workflow against each scenario
- Validate Outputs: Check artifact presence, structure, and content
- Generate Report: Output results with pass/fail per assertion
- Cleanup: Remove test workspace
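A minimal sketch of that workspace lifecycle, assuming the `.aiwg/working/` convention above (not the actual harness implementation):

```bash
WORKSPACE=".aiwg/working/eval-$(date +%s)"   # isolated test space per run
mkdir -p "$WORKSPACE"
# ... execute the flow and validators against "$WORKSPACE" ...
rm -rf "$WORKSPACE"                          # cleanup once the report is written
```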
Output Format
{ "workflow": "flow-security-review-cycle", "timestamp": "2026-04-01T10:30:00Z", "scenarios": { "grounding-test": { "passed": true, "score": 1.0, "assertions": [ {"name": "threat-model-created", "passed": true}, {"name": "security-gate-run", "passed": true} ], "duration_ms": 45000 }, "distractor-test": { "passed": false, "score": 0.7, "assertions": [ {"name": "correct-assets-only", "passed": false, "evidence": "Distractor file referenced in output"} ], "duration_ms": 38000 } }, "summary": { "passed": 4, "failed": 1, "total": 5, "score": 0.80 } }
Examples
```bash
# Full evaluation of a workflow
/eval-workflow flow-security-review-cycle

# Single scenario with verbose output
/eval-workflow flow-inception-to-elaboration --scenario grounding-test --verbose

# Strict mode with output saved
/eval-workflow flow-deploy-to-production --strict --output .aiwg/reports/deploy-eval.json
```
Success Criteria
| Metric | Target |
|---|---|
| Artifact creation | 100% |
| Grounding compliance | >90% |
| Distractor resistance | >80% |
| Recovery success | ≥80% |
| Overall | ≥85% |
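One way to enforce the overall target from a saved results file; this is a sketch that gates on the aggregate score, whereas --strict already fails on any single failing test:

```bash
score=$(jq -r '.summary.score' .aiwg/reports/deploy-eval.json)
awk -v s="$score" 'BEGIN { exit !(s >= 0.85) }' || echo "Overall score $score is below the 85% target"
```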
Related Commands
- `/eval-agent`: Test individual agents
- `/eval-report`: Generate aggregate quality report
- `aiwg lint agents`: Static validation
Evaluate workflow: $ARGUMENTS
References
- @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md — Parallel agent coordination rules evaluated in workflows
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success thresholds and test criteria
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC flow commands available for workflow evaluation
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint and eval commands