Personal_AI_Infrastructure Evals
Agent evaluation framework based on Anthropic's best practices. USE WHEN eval, evaluate, test agent, benchmark, verify behavior, regression test, capability test. Includes three grader types (code-based, model-based, human), transcript capture, pass@k/pass^k metrics, and ALGORITHM integration.
git clone https://github.com/danielmiessler/Personal_AI_Infrastructure
T=$(mktemp -d) && git clone --depth=1 https://github.com/danielmiessler/Personal_AI_Infrastructure "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Releases/v2.5/.claude/skills/Evals" ~/.claude/skills/danielmiessler-personal-ai-infrastructure-evals-831dbe && rm -rf "$T"
Releases/v2.5/.claude/skills/Evals/SKILL.md

Customization
Before executing, check for user customizations at:
~/.claude/skills/PAI/USER/SKILLCUSTOMIZATIONS/Evals/
If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.
🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)
You MUST send this notification BEFORE doing anything else when this skill is invoked.
Send voice notification:

```bash
curl -s -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
  > /dev/null 2>&1 &
```
Output text notification:
Running the **WorkflowName** workflow in the **Evals** skill to ACTION...
This is not optional. Execute this curl command immediately upon skill invocation.
Evals - AI Agent Evaluation Framework
Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).
Key differentiator: Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.
When to Activate
- "run evals", "test this agent", "evaluate", "check quality", "benchmark"
- "regression test", "capability test"
- Compare agent behaviors across changes
- Validate agent workflows before deployment
- Verify ALGORITHM ISC rows
- Create new evaluation tasks from failures
Core Concepts
Three Grader Types
| Type | Strengths | Weaknesses | Use For |
|---|---|---|---|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |
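A minimal sketch of how the three grader types above could share one contract inside a suite runner; the interface names and fields are illustrative assumptions, not the skill's actual type definitions:

```typescript
// Hypothetical grader contract (illustrative only; not the skill's real types).
interface Transcript {
  finalOutput: string;                                           // agent's final answer
  toolCalls: { name: string; args: Record<string, unknown> }[];  // full trajectory
}

interface GradeResult {
  score: number;        // 0..1
  passed: boolean;
  rationale?: string;   // model-based and human graders can attach reasoning
}

interface Grader {
  kind: "code" | "model" | "human";
  grade(transcript: Transcript): Promise<GradeResult>;
}
```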
Evaluation Types
| Type | Pass Target | Purpose |
|---|---|---|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |
Key Metrics
- pass@k: Probability of at least 1 success in k trials (measures capability)
- pass^k: Probability all k trials succeed (measures consistency/reliability); see the estimator sketch below
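A minimal sketch of both estimators, assuming n independent trials with c passes; this is illustrative math, not the skill's runner code:

```typescript
// pass@k: probability that at least one of k sampled trials succeeds,
// using the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0;                  // too few failures to fill all k slots
  let allFail = 1.0;
  for (let i = 0; i < k; i++) allFail *= (n - c - i) / (n - i);
  return 1 - allFail;
}

// pass^k: probability that all k trials succeed, from the empirical pass rate.
function passHatK(n: number, c: number, k: number): number {
  return Math.pow(c / n, k);
}

// Example: 10 trials, 7 passes
// passAtK(10, 7, 3)  ≈ 0.99  (high capability)
// passHatK(10, 7, 3) ≈ 0.34  (much lower consistency)
```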
Workflow Routing
| Trigger | Workflow |
|---|---|
| "run evals", "evaluate suite" | Run suite via |
| "log failure" | Log failure via |
| "convert failures" | Convert to tasks via |
| "create suite" | Create suite via |
| "check saturation" | Check via |
Quick Reference
CLI Commands
```bash
# Run an eval suite
bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s <suite>

# Log a failure for later conversion
bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts log "description" -c category -s severity

# Convert failures to test tasks
bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts convert-all

# Manage suites
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts list
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts check-saturation <name>
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts graduate <name>
```
ALGORITHM Integration
Evals is a verification method for THE ALGORITHM ISC rows:
```bash
# Run eval and update ISC row
bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u
```
ISC rows can specify eval verification:
| # | What Ideal Looks Like | Verify |
|---|----------------------|--------|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |
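A hedged sketch of how an `eval:` verify cell could be dispatched to the bridge; the spawn details and exit-code convention are assumptions, not documented AlgorithmBridge.ts behavior:

```typescript
// Illustrative only: map an ISC "Verify" cell like "eval:auth-security"
// to a suite run via the CLI shown above.
async function verifyIscRow(verifyCell: string): Promise<boolean> {
  const match = verifyCell.match(/^eval:(.+)$/);
  if (!match) return false;                          // not an eval-backed row
  const suite = match[1];
  const proc = Bun.spawn([
    "bun", "run",
    `${process.env.HOME}/.claude/skills/Evals/Tools/AlgorithmBridge.ts`,
    "-s", suite,
  ]);
  return (await proc.exited) === 0;                  // assumes exit 0 means the suite passed
}
```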
Available Graders
Code-Based (Fast, Deterministic)
| Grader | Use Case |
|---|---|
| | Exact substring matching |
| | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |
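As an example of the code-based style, a tool-call grader can be a few lines of deterministic logic; the transcript shape here is an assumption, not the skill's grader implementation:

```typescript
// Illustrative tool_calls-style grader: check that expected tools appear
// in order (subsequence match), with partial credit for partial matches.
function gradeToolCalls(
  transcript: { toolCalls: { name: string }[] },
  expected: string[],
): { score: number; passed: boolean } {
  let matched = 0;
  for (const call of transcript.toolCalls) {
    if (matched < expected.length && call.name === expected[matched]) matched++;
  }
  const score = expected.length === 0 ? 1 : matched / expected.length;
  return { score, passed: matched === expected.length };
}

// gradeToolCalls({ toolCalls: [{ name: "read_file" }, { name: "edit_file" },
//   { name: "run_tests" }] }, ["read_file", "edit_file", "run_tests"])
// => { score: 1, passed: true }
```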
Model-Based (Nuanced)
| Grader | Use Case |
|---|---|
| llm_rubric | Score against detailed rubric |
| natural_language_assert | Check assertions are true |
| | Compare to reference with position swap |
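The position-swap comparison in the last row can be sketched like this; `judge()` stands in for whatever LLM call the real grader makes and is purely an assumption:

```typescript
// Illustrative pairwise grader: ask the judge twice with positions swapped
// so position bias cannot decide the outcome on its own.
async function comparePairwise(
  candidate: string,
  reference: string,
  judge: (a: string, b: string) => Promise<"A" | "B">,  // hypothetical LLM judge
): Promise<boolean> {
  const first = await judge(candidate, reference);   // candidate shown in slot A
  const second = await judge(reference, candidate);  // candidate shown in slot B
  return first === "A" && second === "B";            // must win both orderings
}
```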
Domain Patterns
Pre-configured grader stacks for common agent types:
| Domain | Primary Graders |
|---|---|
| | binary_tests + static_analysis + tool_calls + llm_rubric |
| | llm_rubric + natural_language_assert + state_check |
| | llm_rubric + natural_language_assert + tool_calls |
| | state_check + tool_calls + llm_rubric |
See Data/DomainPatterns.yaml for full configurations.
Task Schema (YAML)
```yaml
task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression  # or capability
  domain: coding
  graders:
    - type: binary_tests
      required: [test_empty_pw.py]
      weight: 0.30
    - type: tool_calls
      weight: 0.20
      params:
        sequence: [read_file, edit_file, run_tests]
    - type: llm_rubric
      weight: 0.50
      params:
        rubric: prompts/security_review.md
  trials: 3
  pass_threshold: 0.75
```
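For programmatic validation, the same schema can be expressed as a TypeScript shape; field names follow the YAML example above, but this is a sketch under assumed field semantics, not the skill's actual type definitions:

```typescript
// Illustrative mirror of the task YAML (assumed shape, not the shipped types).
type GraderType =
  | "binary_tests" | "static_analysis" | "state_check" | "tool_calls"
  | "llm_rubric" | "natural_language_assert";

interface GraderSpec {
  type: GraderType;
  weight: number;                    // weights across graders sum to 1.0
  required?: string[];               // e.g. test files for binary_tests
  params?: Record<string, unknown>;  // grader-specific configuration
}

interface EvalTask {
  id: string;
  description: string;
  type: "regression" | "capability";
  domain: string;
  graders: GraderSpec[];
  trials: number;                    // multi-trial execution for pass@k / pass^k
  pass_threshold: number;            // weighted score required to pass
}
```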
Resource Index
| Resource | Purpose |
|---|---|
| | Core type definitions |
| | Deterministic graders |
| | LLM-powered graders |
| | Capture agent trajectories |
| | Multi-trial execution with pass@k |
| Tools/SuiteManager.ts | Suite management and saturation |
| Tools/FailureToTask.ts | Convert failures to test tasks |
| Tools/AlgorithmBridge.ts | ALGORITHM integration |
| Data/DomainPatterns.yaml | Domain-specific grader configs |
Key Principles (from Anthropic)
- Start with 20-50 real failures - Don't overthink, capture what actually broke
- Unambiguous tasks - Two experts should reach identical verdicts
- Balanced problem sets - Test both "should do" AND "should NOT do"
- Grade outputs, not paths - Don't penalize valid creative solutions
- Calibrate LLM judges - Against human expert judgment
- Check transcripts regularly - Verify graders work correctly
- Monitor saturation - Graduate to regression when hitting 95%+ (see the sketch after this list)
- Build infrastructure early - Evals shape how quickly you can adopt new models
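A minimal saturation check under the 95% rule above; the result shape is an assumption, not SuiteManager.ts's API:

```typescript
// Illustrative saturation check: a capability suite that consistently passes
// at 95%+ is a candidate to graduate into the regression suite.
const SATURATION_THRESHOLD = 0.95;

function isSaturated(results: { passed: boolean }[]): boolean {
  if (results.length === 0) return false;
  const passRate = results.filter(r => r.passed).length / results.length;
  return passRate >= SATURATION_THRESHOLD;
}
```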
Related
- ALGORITHM: Evals is a verification method
- Science: Evals implements the scientific method
- Browser: For visual verification graders