Claude-skill-registry evaluation
Evaluate agent systems with quality gates and LLM-as-judge. Use when you need to measure component quality or implement quality gates. Not for simple unit testing or binary pass/fail checks without nuance.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/evaluation-git-fg-meta-plugin-manager" ~/.claude/skills/majiayu000-claude-skill-registry-evaluation-c6b22c && rm -rf "$T"
skills/data/evaluation-git-fg-meta-plugin-manager/SKILL.md
Evaluation Methods for Agent Systems
<mission_control> <objective>Build quality gates and measure component quality using outcome-focused evaluation that accounts for non-determinism and multiple valid paths</objective> <success_criteria>Multi-dimensional rubric implemented with weighted scoring, evidence requirements, and threshold-based quality gates</success_criteria> </mission_control>
<trigger>When building quality gates, measuring component quality, or implementing LLM-as-judge. Not for: Simple unit testing or binary pass/fail checks without nuance.</trigger>
<interaction_schema> DEFINE_RUBRIC → BUILD_TEST_SET → IMPLEMENT_EVALUATION → TRACK_METRICS </interaction_schema>
Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. A robust framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.
Core Concepts
The 95% Finding
Research on BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:
| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Critical Insight: Model upgrades often provide larger gains than doubling token budgets; upgrading to Claude Sonnet 4.5 outperformed giving the previous Sonnet twice the tokens.
Evaluation Challenges
Non-Determinism and Multiple Valid Paths
- Agents may take different valid paths to goals
- Traditional evaluations checking specific steps fail
- Solution: Outcome-focused evaluation judging results, not paths
Context-Dependent Failures
- Success on simple queries ≠ success on complex ones
- Failures emerge only after extended interaction
- Solution: Test across complexity levels, include extended interactions
Composite Quality Dimensions
- Agent quality is multi-dimensional
- Includes: factual accuracy, completeness, coherence, tool efficiency
- Solution: Multi-dimensional rubrics with appropriate weighting
Evaluation Framework
Multi-Dimensional Rubrics
Design Principles:
- Cover key quality dimensions
- Use descriptive levels (excellent, good, fair, poor, failed)
- Convert to numeric scores (0.0 to 1.0)
- Weight dimensions based on use case
Core Dimensions:
Factual Accuracy
- Claims match ground truth
- 1.0: All facts correct, no hallucinations
- 0.7: Mostly correct, minor inaccuracies
- 0.5: Mixed accuracy, some errors
- 0.3: Many errors, significant inaccuracies
- 0.0: Mostly false, major hallucinations
Completeness
- Output covers all requested aspects
- 1.0: Addresses all requirements comprehensively
- 0.7: Covers most requirements with minor gaps
- 0.5: Partial coverage, missing some aspects
- 0.3: Minimal coverage, many gaps
- 0.0: Fails to address core requirements
Portability (Seed System Specific)
- Component works without external dependencies
- 1.0: Zero dependencies, self-contained, portable
- 0.7: Minimal dependencies, mostly portable
- 0.5: Some dependencies, requires configuration
- 0.3: Many dependencies, limited portability
- 0.0: Tightly coupled, non-portable
Context Efficiency (Seed System Specific)
- Uses context optimally (progressive disclosure)
- 1.0: Excellent use of progressive disclosure, minimal context
- 0.7: Good context management, some optimization
- 0.5: Adequate context usage, could be improved
- 0.3: Inefficient context usage, verbose
- 0.0: Wasteful context usage, bloats prompts
Tool Efficiency
- Uses appropriate tools a reasonable number of times
- 1.0: Optimal tool selection, minimal calls
- 0.7: Good tool usage, slightly inefficient
- 0.5: Adequate tool usage, some redundancy
- 0.3: Inefficient tool usage, many redundant calls
- 0.0: Poor tool selection, excessive calls
Scoring System
Individual Dimension Scores: 0.0 to 1.0 for each dimension
Weighted Overall Score:
overall_score = sum(score[dim] * weight[dim] for dim in dimensions)
Pass Threshold: Set based on use case
- Production components: ≥ 0.8
- Development components: ≥ 0.7
- Experimental: ≥ 0.6
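A minimal sketch of the weighted scoring and threshold gate, assuming dict-based scores and weights; the function names, tier labels, and example values are illustrative, not part of any required API:

```python
# Illustrative weighted scoring and threshold gate. Function names, tier
# labels, and example values are assumptions, not a required interface.
def weighted_overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (0.0-1.0) into a weighted overall score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1.0"
    return sum(scores[dim] * weights[dim] for dim in weights)

THRESHOLDS = {"production": 0.8, "development": 0.7, "experimental": 0.6}

def passes_gate(scores, weights, tier="development"):
    """Return True if the weighted overall score clears the tier's threshold."""
    return weighted_overall(scores, weights) >= THRESHOLDS[tier]

weights = {"factual_accuracy": 0.25, "completeness": 0.25, "portability": 0.25,
           "context_efficiency": 0.15, "tool_efficiency": 0.10}
scores = {"factual_accuracy": 0.8, "completeness": 0.9, "portability": 0.7,
          "context_efficiency": 0.6, "tool_efficiency": 0.9}
print(round(weighted_overall(scores, weights), 2))   # 0.78
print(passes_gate(scores, weights, "development"))   # True
print(passes_gate(scores, weights, "production"))    # False
```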
LLM-as-Judge Pattern
Direct Scoring
- Evaluate against weighted criteria with rubrics
- Provide clear task description
- Include agent output and ground truth (if available)
- Request structured judgment with evidence
Prompt Template:
    Task: [Description]
    Agent Output: [Output]
    Evaluation Criteria: [Rubric]

    Evaluate the agent output on each dimension:
    1. Factual Accuracy (0.0-1.0): [Score] - [Evidence]
    2. Completeness (0.0-1.0): [Score] - [Evidence]
    3. Portability (0.0-1.0): [Score] - [Evidence]
    4. Context Efficiency (0.0-1.0): [Score] - [Evidence]
    5. Tool Efficiency (0.0-1.0): [Score] - [Evidence]

    Overall Score: [Weighted average]
    Pass/Fail: [Threshold-based]
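A minimal sketch of wiring the template into an LLM-as-judge call, assuming a hypothetical `llm_call` helper (any function that sends a prompt to your judge model and returns its text) and a judge instructed to reply in JSON:

```python
import json

def judge_direct(task: str, output: str, rubric: str, llm_call) -> dict:
    """Direct scoring via LLM-as-judge. `llm_call` is a hypothetical hook:
    it takes a prompt string and returns the judge model's text response."""
    prompt = (
        f"Task: {task}\n"
        f"Agent Output: {output}\n"
        f"Evaluation Criteria: {rubric}\n\n"
        "Score each dimension from 0.0 to 1.0 and cite evidence for every score.\n"
        'Respond with JSON only: {"scores": {<dimension>: <float>}, '
        '"evidence": {<dimension>: <string>}}'
    )
    # Fail loudly if the judge does not return valid JSON rather than
    # silently accepting a malformed judgment.
    return json.loads(llm_call(prompt))
```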
Pairwise Comparison
- Compare two outputs with position bias mitigation
- Automatically swap positions to reduce bias
- Ask judge to choose better overall output
Position Swapping:
    def evaluate_pairwise(output_a, output_b):
        # First comparison: A vs B
        result_1 = judge_evaluate(output_a, output_b)
        # Second comparison: B vs A (swapped)
        result_2 = judge_evaluate(output_b, output_a)
        # Combine results
        return reconcile_comparisons(result_1, result_2)
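`reconcile_comparisons` is not spelled out above; one plausible sketch, assuming each judge call returns "first" or "second" to name the preferred position:

```python
def reconcile_comparisons(result_1: str, result_2: str) -> str:
    """Combine the A-vs-B and B-vs-A verdicts. Assumed convention: each result
    is "first" or "second", naming the position the judge preferred."""
    a_preferred_1 = result_1 == "first"    # A held the first position
    a_preferred_2 = result_2 == "second"   # A held the second position after the swap
    if a_preferred_1 and a_preferred_2:
        return "A"
    if not a_preferred_1 and not a_preferred_2:
        return "B"
    # The verdict flipped with position, so the preference is position-driven.
    return "tie"
```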
Evaluation Methods
Test Set Design
Sample Selection
- Start small during development (early iterations have dramatic impacts, so a few cases reveal a lot)
- Sample from real usage patterns
- Add known edge cases
- Ensure coverage across complexity levels
Complexity Stratification
- Simple: Single tool call, clear requirements
- Medium: Multiple tool calls, some ambiguity
- Complex: Many tool calls, significant ambiguity
- Very Complex: Extended interaction, deep reasoning
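A minimal sketch of a complexity-stratified test set, assuming a simple dataclass representation; the field names mirror the strata above but are otherwise illustrative:

```python
import random
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str
    expected_outcome: str   # ground truth or a description of the desired result
    complexity: str         # "simple" | "medium" | "complex" | "very_complex"

def stratified_sample(cases: list[TestCase], per_level: int, seed: int = 0) -> list[TestCase]:
    """Draw up to `per_level` cases from each complexity level so the test set
    covers the full range rather than clustering on easy queries."""
    rng = random.Random(seed)
    by_level: dict[str, list[TestCase]] = {}
    for case in cases:
        by_level.setdefault(case.complexity, []).append(case)
    sample: list[TestCase] = []
    for level_cases in by_level.values():
        rng.shuffle(level_cases)
        sample.extend(level_cases[:per_level])
    return sample
```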
Context Engineering Evaluation
Testing Context Strategies
- Run with different context strategies on same test set
- Compare quality scores, token usage, efficiency metrics
- Validate progressive disclosure effectiveness
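A minimal sketch of strategy comparison on a shared test set, assuming a hypothetical `run_with_strategy` hook that returns a (quality_score, tokens_used) pair per test case:

```python
def compare_strategies(test_set, strategy_names, run_with_strategy):
    """Report mean quality and mean token usage per context strategy on the
    same test set. `run_with_strategy(test, name)` is a hypothetical hook
    returning (quality_score, tokens_used)."""
    report = {}
    for name in strategy_names:
        results = [run_with_strategy(test, name) for test in test_set]
        report[name] = {
            "mean_score": sum(score for score, _ in results) / len(results),
            "mean_tokens": sum(tokens for _, tokens in results) / len(results),
        }
    return report

# e.g. compare_strategies(tests, ["full_context", "progressive_disclosure"], run_fn)
```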
Degradation Testing
- Test at different context sizes
- Identify performance cliffs
- Establish safe operating limits
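A minimal sketch of a degradation sweep, assuming hypothetical `run_with_context_budget` and `score_result` hooks into your agent and evaluator:

```python
def degradation_curve(test_set, context_sizes, run_with_context_budget, score_result):
    """Run the same test set at several context sizes and return mean quality
    per size; a sharp drop between adjacent sizes marks a performance cliff.
    Both callables are hypothetical hooks into your agent and evaluator."""
    curve = {}
    for size in context_sizes:
        scores = [score_result(run_with_context_budget(test, size)) for test in test_set]
        curve[size] = sum(scores) / len(scores)
    return curve

# e.g. degradation_curve(tests, [8_000, 32_000, 128_000], run_fn, score_fn)
```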
Continuous Evaluation
Evaluation Pipeline
- Run automatically on component changes
- Track results over time
- Compare versions to identify improvements/regressions
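A minimal sketch of version-over-version regression detection, assuming evaluation results are stored as per-dimension score dicts; the tolerance value is illustrative:

```python
def detect_regressions(baseline: dict[str, float], current: dict[str, float],
                       tolerance: float = 0.05) -> list[str]:
    """Flag dimensions whose score dropped by more than `tolerance` versus the
    stored baseline. Both inputs map dimension name -> score (0.0-1.0)."""
    return [
        dim for dim, base_score in baseline.items()
        if current.get(dim, 0.0) < base_score - tolerance
    ]

# e.g. detect_regressions({"completeness": 0.85}, {"completeness": 0.72}) -> ["completeness"]
```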
Production Monitoring
- Sample interactions in production
- Evaluate the sampled interactions automatically
- Set alerts for quality drops
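A minimal sketch of production sampling with a quality alert, assuming hypothetical `evaluate_interaction` and `send_alert` hooks specific to your stack:

```python
import random

def monitor_sample(interactions, evaluate_interaction, send_alert,
                   sample_rate: float = 0.05, alert_threshold: float = 0.7):
    """Evaluate a random slice of production interactions and alert on drops.
    `evaluate_interaction` and `send_alert` are hypothetical hooks."""
    sampled = [i for i in interactions if random.random() < sample_rate]
    if not sampled:
        return None
    mean_score = sum(evaluate_interaction(i) for i in sampled) / len(sampled)
    if mean_score < alert_threshold:
        send_alert(f"Quality dropped to {mean_score:.2f} "
                   f"across {len(sampled)} sampled interactions")
    return mean_score
```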
Practical Implementation
Building Evaluation Frameworks
Step 1: Define quality dimensions relevant to the use case
Step 2: Create rubrics with clear level descriptions
Step 3: Build test sets from real patterns and edge cases
Step 4: Implement automated evaluation pipelines
Step 5: Establish baseline metrics before changes
Step 6: Run evaluations on all significant changes
Step 7: Track metrics over time
Step 8: Supplement with human review
Example: Component Evaluation
    def evaluate_component(component, test_set):
        """Evaluate a Seed System component against the multi-dimensional rubric."""
        rubric = {
            "factual_accuracy": {"weight": 0.25},
            "completeness": {"weight": 0.25},
            "portability": {"weight": 0.25},
            "context_efficiency": {"weight": 0.15},
            "tool_efficiency": {"weight": 0.10},
        }

        # Collect per-dimension scores across the whole test set rather than
        # letting each test overwrite the previous one.
        per_dimension = {dimension: [] for dimension in rubric}
        for test in test_set:
            result = run_test(component, test)
            for dimension in rubric:
                per_dimension[dimension].append(assess_dimension(result, dimension))

        # Average each dimension across tests, then combine with rubric weights.
        scores = {dim: sum(vals) / len(vals) for dim, vals in per_dimension.items()}
        overall = weighted_average(scores, rubric)
        passed = overall >= 0.7

        return {
            "passed": passed,
            "scores": scores,
            "overall": overall,
            "threshold": 0.7,
        }
Avoiding Evaluation Pitfalls
❌ Overfitting to specific paths
- Evaluate outcomes, not specific steps
❌ Ignoring edge cases
- Include diverse test scenarios
❌ Single-metric obsession
- Use multi-dimensional rubrics
❌ Neglecting context effects
- Test with realistic context sizes
❌ Skipping human evaluation
- Automated evaluation misses subtle issues
Enhanced Validation Workflow
Phase 1: Component Generation
- Generate component using meta-skills
- Apply progressive disclosure principles
- Optimize context usage
Phase 2: Multi-Dimensional Evaluation
- Run through evaluation framework
- Score on all 5 dimensions
- Calculate weighted overall score
Phase 3: Quality Gate
- Block if below threshold (e.g., 0.7)
- Provide detailed feedback
- Suggest improvements
Phase 4: Evidence Collection
- Store evaluation results
- Track metrics over time
- Enable regression detection
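A minimal sketch of the Phase 3 quality gate, assuming the per-dimension scores from Phase 2 and the rubric weights above; the feedback format is illustrative:

```python
def quality_gate(scores: dict[str, float], weights: dict[str, float],
                 threshold: float = 0.7) -> dict:
    """Block components below the threshold and point at the weakest dimensions.
    Scores and weights follow the rubric structure above; the feedback format
    is an assumption."""
    overall = sum(scores[dim] * weights[dim] for dim in weights)
    weakest = sorted(scores, key=scores.get)[:2]
    return {
        "passed": overall >= threshold,
        "overall": round(overall, 2),
        "threshold": threshold,
        "feedback": [f"Improve {dim} (scored {scores[dim]:.2f})" for dim in weakest],
    }
```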
Example: Validation Report
    # Validation Report
    component: my-skill
    timestamp: 2026-01-26
    overall_score: 0.82
    threshold: 0.70
    passed: true
    dimensions:
      factual_accuracy:
        score: 0.90
        evidence: "All technical claims verified"
      completeness:
        score: 0.85
        evidence: "Covers all requirements with minor gaps"
      portability:
        score: 0.80
        evidence: "Self-contained, zero external dependencies"
      context_efficiency:
        score: 0.75
        evidence: "Good progressive disclosure, some optimization possible"
      tool_efficiency:
        score: 0.85
        evidence: "Optimal tool selection, minimal redundant calls"
    recommendations:
      - "Consider further context optimization for large components"
      - "Add more examples to references/"
Guidelines
- Judge outcomes, not paths - Multiple valid routes to goals
- Use multi-dimensional rubrics - Quality is composite
- Test across complexity levels - Simple ≠ Complex
- Implement position swapping - Mitigate pairwise bias
- Require evidence - Justify all scores
- Track over time - Detect regressions
- Combine automated and human - Catch what automation misses
- Test context strategies - Validate progressive disclosure
References
<routing_table> Skills referenced for related evaluation capabilities.
| When You Need To... | Use This Skill | Routing Command |
|---|---|---|
| Implement progressive disclosure | | |
| Validate component quality | | |
| Build evaluation rubrics and scoring | (this skill) | (current context) |
</routing_table>
Research References:
- BrowseComp evaluation on performance drivers
- Eugene Yan on LLM-evaluators
- Position bias in pairwise comparison
Key Principle: Evaluation should be outcome-focused, multi-dimensional, and continuously validated. Judge whether components achieve the right outcomes while following reasonable processes.
Genetic Code
This component carries essential Seed System principles for context-fork isolation:
<critical_constraint>
MANDATORY: All components MUST be self-contained (zero .claude/rules dependency)
MANDATORY: Achieve 80-95% autonomy (0-5 AskUserQuestion rounds per session)
MANDATORY: Description MUST use What-When-Not format in third person
MANDATORY: No component references another component by name in description
MANDATORY: Progressive disclosure - references/ for detailed content
MANDATORY: Use XML for control (mission_control, critical_constraint), Markdown for data

No exceptions. Portability invariant must be maintained.
</critical_constraint>
Delta Standard: Good Component = Expert Knowledge − What Claude Already Knows
Recognition Questions:
- "Would Claude know this without being told?" → Delete (zero delta)
- "Can this work standalone?" → Fix if no (non-self-sufficient)
- "Did I read the actual file, or just see it in grep?" → Verify before claiming MANDATORY: Use multi-dimensional rubrics (not single metrics) MANDATORY: Require evidence for all scores MANDATORY: Implement position swapping for pairwise comparisons MANDATORY: Block below threshold (≥0.7 for production) No exceptions. Evaluation without evidence is opinion, not assessment. </critical_constraint>