aiwg eval-workflow

Run evaluation tests against a multi-agent workflow to assess orchestration quality and resistance to failure archetypes.

Install

Source · Clone the upstream repo:

git clone https://github.com/jmagly/aiwg

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/eval-workflow" ~/.claude/skills/jmagly-aiwg-eval-workflow && rm -rf "$T"

Manifest: .agents/skills/eval-workflow/SKILL.md

Source content

Workflow Evaluation

Run automated evaluation tests against a multi-agent workflow.

Research Foundation

  • REF-001: BP-9 - Continuous evaluation of agent performance
  • REF-002: KAMI benchmark methodology for real agentic task evaluation

Usage

/eval-workflow flow-security-review-cycle
/eval-workflow flow-inception-to-elaboration --scenario distractor-test
/eval-workflow flow-deploy-to-production --verbose --strict

Arguments

Argument       Required  Description
workflow-name  Yes       Workflow (flow command) to evaluate

Options

Option      Default  Description
--scenario  all      Specific scenario to run
--verbose   false    Show detailed test output
--output    stdout   Output file for results
--strict    false    Fail on any test failure
--timeout   300      Maximum seconds per scenario
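
For illustration, the documented surface maps onto a conventional argument parser. A minimal Python sketch, assuming a standalone CLI; the real skill runs as a slash command, and none of this is its actual implementation:

import argparse

# Hypothetical stand-in for the documented interface; defaults mirror the
# Options table above.
parser = argparse.ArgumentParser(prog="eval-workflow")
parser.add_argument("workflow_name", help="Workflow (flow command) to evaluate")
parser.add_argument("--scenario", default="all", help="Specific scenario to run")
parser.add_argument("--verbose", action="store_true", help="Show detailed test output")
parser.add_argument("--output", default="stdout", help="Output file for results")
parser.add_argument("--strict", action="store_true", help="Fail on any test failure")
parser.add_argument("--timeout", type=int, default=300, help="Maximum seconds per scenario")

args = parser.parse_args(["flow-security-review-cycle", "--scenario", "grounding-test"])
print(args.workflow_name, args.scenario, args.timeout)  # flow-security-review-cycle grounding-test 300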

What Gets Evaluated

Orchestration Quality

  • Agent coordination: Parallel agents launched correctly in a single message
  • Handoff fidelity: Artifacts pass correctly between phases
  • Gate enforcement: Phase gates checked before transition
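
To make handoff fidelity concrete, a harness can assert that each phase's artifacts exist before the next phase consumes them. A minimal sketch, assuming a simple (phase, artifact) handoff contract; the contract format and names are illustrative, not the skill's actual internals:

from pathlib import Path

def check_handoffs(workspace: Path, contract: list) -> list:
    # For each (phase, artifact) pair in the assumed handoff contract,
    # assert the upstream phase actually left the artifact in place before
    # the downstream phase is allowed to consume it.
    results = []
    for phase, artifact in contract:
        results.append({
            "name": f"{phase}-handoff-{Path(artifact).name}",
            "passed": (workspace / artifact).is_file(),
        })
    return results

# Example contract: the review phase must hand a threat model to the gate phase.
# checks = check_handoffs(Path(".aiwg/working"), [("security-review", "threat-model.md")])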

Archetype Resistance

  • grounding-test — Archetype 1: Premature action without reading state
  • distractor-test — Archetype 3: Context pollution from irrelevant artifacts
  • recovery-test — Archetype 4: Fragile execution when a subagent fails
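
The distractor-test, for example, reduces to a seed-then-scan pattern: plant an irrelevant artifact, then fail if any output references it. A rough sketch; the planted file name is invented, and the assertion name is borrowed from the sample report below:

from pathlib import Path

DISTRACTOR = "legacy-notes-DO-NOT-USE.md"  # hypothetical planted file

def seed_distractor(workspace: Path) -> None:
    # Step 1: plant a plausible-looking but irrelevant artifact before the run.
    (workspace / DISTRACTOR).write_text("Outdated notes, irrelevant to this flow.")

def assert_distractor_ignored(outputs: list) -> dict:
    # Step 2: after the run, fail if any produced artifact references the plant.
    polluted = [p.name for p in outputs if DISTRACTOR in p.read_text()]
    return {
        "name": "correct-assets-only",
        "passed": not polluted,
        "evidence": f"Distractor file referenced in {polluted}" if polluted else None,
    }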

Output Validation

  • Required artifacts created in correct .aiwg/ paths
  • Document structure matches templates
  • Traceability links intact
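
A sketch of what that validation could look like, assuming a per-flow manifest of required paths and template headings; the file names and headings here are invented for illustration:

from pathlib import Path

# Invented example manifest; real requirements come from each flow's templates.
REQUIRED = {
    "security/threat-model.md": ["## Assets", "## Threats"],
    "security/gate-report.md": [],
}

def validate_outputs(aiwg_root: Path) -> list:
    assertions = []
    for rel, headings in REQUIRED.items():
        path = aiwg_root / rel
        assertions.append({"name": f"artifact-{rel}", "passed": path.is_file()})
        text = path.read_text() if path.is_file() else ""
        for heading in headings:
            # Structure check: the template's required section must be present.
            assertions.append({"name": f"section-{heading}", "passed": heading in text})
    return assertions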

Process

  1. Load Workflow: Read flow command definition
  2. Select Scenarios: Based on --scenario flag or all applicable
  3. Setup Workspace: Create isolated .aiwg/working/ test space
  4. Execute Flow: Run workflow against each scenario
  5. Validate Outputs: Check artifact presence, structure, and content
  6. Generate Report: Output results with pass/fail per assertion
  7. Cleanup: Remove test workspace
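
Taken together, steps 1-7 amount to a loop like the following sketch. execute_flow and validate_outputs are stubs standing in for the real skill internals, which this document does not specify:

import shutil
import tempfile
from pathlib import Path

def execute_flow(workflow: str, scenario: str, workspace: Path) -> None:
    """Stub for step 4; the real harness drives the flow command here."""

def validate_outputs(workspace: Path) -> list:
    """Stub for step 5; see the validation sketch above."""
    return []

def evaluate(workflow: str, scenarios: list) -> dict:
    report = {"workflow": workflow, "scenarios": {}}            # step 6 accumulates here
    for scenario in scenarios:                                  # step 2: selected scenarios
        workspace = Path(tempfile.mkdtemp(prefix="aiwg-eval-")) # step 3: isolated test space
        try:
            execute_flow(workflow, scenario, workspace)         # step 4: run the flow
            assertions = validate_outputs(workspace)            # step 5: check artifacts
            report["scenarios"][scenario] = {
                "passed": all(a["passed"] for a in assertions),
                "assertions": assertions,
            }
        finally:
            shutil.rmtree(workspace, ignore_errors=True)        # step 7: cleanup
    return report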

Output Format

{
  "workflow": "flow-security-review-cycle",
  "timestamp": "2026-04-01T10:30:00Z",
  "scenarios": {
    "grounding-test": {
      "passed": true,
      "score": 1.0,
      "assertions": [
        {"name": "threat-model-created", "passed": true},
        {"name": "security-gate-run", "passed": true}
      ],
      "duration_ms": 45000
    },
    "distractor-test": {
      "passed": false,
      "score": 0.7,
      "assertions": [
        {"name": "correct-assets-only", "passed": false, "evidence": "Distractor file referenced in output"}
      ],
      "duration_ms": 38000
    }
  },
  "summary": {
    "passed": 4,
    "failed": 1,
    "total": 5,
    "score": 0.80
  }
}
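
One plausible reading of the summary block is that it counts individual assertions across all scenarios; the sample's 4/1/5 totals fit that reading only because some assertions are elided above. A sketch under that assumption:

def summarize(scenarios: dict) -> dict:
    # Flatten every assertion from every scenario, then roll up counts.
    assertions = [a for s in scenarios.values() for a in s["assertions"]]
    passed = sum(1 for a in assertions if a["passed"])
    return {
        "passed": passed,
        "failed": len(assertions) - passed,
        "total": len(assertions),
        "score": round(passed / len(assertions), 2) if assertions else 0.0,
    }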

Examples

# Full evaluation of a workflow
/eval-workflow flow-security-review-cycle

# Single scenario with verbose output
/eval-workflow flow-inception-to-elaboration --scenario grounding-test --verbose

# Strict mode with output saved
/eval-workflow flow-deploy-to-production --strict --output .aiwg/reports/deploy-eval.json

Success Criteria

Metric                 Target
Artifact creation      100%
Grounding compliance   >90%
Distractor resistance  >80%
Recovery success       ≥80%
Overall                ≥85%
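
Expressed as code, the targets form a simple gate. A sketch assuming metrics arrive as fractions in [0, 1]; note the table mixes strict (>) and inclusive (≥) thresholds, and the metric key names are invented:

import operator

# Mirrors the Success Criteria table above; metric keys are hypothetical names.
TARGETS = {
    "artifact_creation":     (operator.ge, 1.00),
    "grounding_compliance":  (operator.gt, 0.90),
    "distractor_resistance": (operator.gt, 0.80),
    "recovery_success":      (operator.ge, 0.80),
    "overall":               (operator.ge, 0.85),
}

def meets_targets(metrics: dict) -> bool:
    # Every metric must clear its floor; missing metrics count as 0.0.
    return all(op(metrics.get(name, 0.0), floor)
               for name, (op, floor) in TARGETS.items())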

Related Commands

  • /eval-agent - Test individual agents
  • /eval-report - Generate aggregate quality report
  • aiwg lint agents - Static validation

Evaluate workflow: $ARGUMENTS

References

  • @$AIWG_ROOT/agentic/code/addons/aiwg-evals/README.md — aiwg-evals addon overview
  • @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/subagent-scoping.md — Parallel agent coordination rules evaluated in workflows
  • @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/vague-discretion.md — Concrete success thresholds and test criteria
  • @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/README.md — SDLC flow commands available for workflow evaluation
  • @$AIWG_ROOT/docs/cli-reference.md — CLI reference for aiwg lint and eval commands