Awesome-omni-skills advanced-evaluation

Advanced Evaluation workflow skill. Use this skill when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", or "mitigate evaluation bias", or when the request mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment. The operator should preserve the upstream workflow, copied support files, and provenance before merging or handing off.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/advanced-evaluation" ~/.claude/skills/diegosouzapw-awesome-omni-skills-advanced-evaluation && rm -rf "$T"
manifest: skills/advanced-evaluation/SKILL.md
source content

Advanced Evaluation

Overview

This public intake copy packages plugins/antigravity-awesome-skills-claude/skills/advanced-evaluation from https://github.com/sickn33/antigravity-awesome-skills into the native Omni Skills editorial shape without hiding its origin.

Use it when the operator needs the upstream workflow, support files, and repository context to stay intact while the public validator and private enhancer continue their normal downstream flow.

This intake keeps the copied upstream files intact and uses metadata.json plus ORIGIN.md as the provenance anchor for review.

This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems. Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

Imported source sections that did not map cleanly to the public headings are still preserved below or in the support files. Notable imported sections: Core Concepts, Evaluation Approaches, Task, Original Prompt, Response to Evaluate, Criteria.

When to Use This Skill

Use this section as the trigger filter. It should make the activation boundary explicit before the operator loads files, runs commands, or opens a pull request.

  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Debugging evaluation systems that show inconsistent results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation

Operating Table

| Situation | Start here | Why it matters |
|---|---|---|
| First-time use | metadata.json | Confirms repository, branch, commit, and imported path before touching the copied workflow |
| Provenance review | ORIGIN.md | Gives reviewers a plain-language audit trail for the imported source |
| Workflow execution | SKILL.md | Starts with the smallest copied file that materially changes execution |
| Supporting context | SKILL.md | Adds the next most relevant copied source file without loading the entire package |
| Handoff decision | Related Skills section | Helps the operator switch to a stronger native skill when the task drifts |

Workflow

This workflow is intentionally editorial and operational at the same time. It keeps the imported source useful to the operator while still satisfying the public intake standards that feed the downstream enhancer flow.

For each criterion (direct scoring):

  1. Find specific evidence in the response
  2. Score according to the rubric (1-{max} scale)
  3. Justify your score with evidence
  4. Suggest one specific improvement

For pairwise comparison:

  1. Analyze each response independently first
  2. Compare them on each criterion
  3. Determine overall winner with confidence level

Imported Workflow Notes

Imported: Instructions

For each criterion:

  1. Find specific evidence in the response
  2. Score according to the rubric (1-{max} scale)
  3. Justify your score with evidence
  4. Suggest one specific improvement

Imported: Instructions

  1. Analyze each response independently first
  2. Compare them on each criterion
  3. Determine overall winner with confidence level

Imported: Core Concepts

The Evaluation Taxonomy

Evaluation approaches fall into two primary categories with distinct reliability profiles:

Direct Scoring: A single LLM rates one response on a defined scale.

  • Best for: Objective criteria (factual accuracy, instruction following, toxicity)
  • Reliability: Moderate to high for well-defined criteria
  • Failure mode: Score calibration drift, inconsistent scale interpretation

Pairwise Comparison: An LLM compares two responses and selects the better one.

  • Best for: Subjective preferences (tone, style, persuasiveness)
  • Reliability: Higher than direct scoring for preferences
  • Failure mode: Position bias, length bias

Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.

The Bias Landscape

LLM judges exhibit systematic biases that must be actively mitigated:

Position Bias: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.

Length Bias: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.

Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation.

Verbosity Bias: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.

Authority Bias: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.

Metric Selection Framework

Choose metrics based on the evaluation task structure:

| Task type | Primary metrics | Secondary metrics |
|---|---|---|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |

The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
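As a sketch of how chance-corrected agreement might be computed for the binary pass/fail case, here is a minimal Cohen's κ in plain Python (the function name and sample labels are illustrative, not taken from any upstream script):

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(judge_labels) == len(human_labels) and judge_labels
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement under independence, from marginal label frequencies.
    jc, hc = Counter(judge_labels), Counter(human_labels)
    expected = sum(jc[lbl] * hc[lbl] for lbl in jc.keys() | hc.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (observed - expected) / (1 - expected)

judge = ["pass", "pass", "fail", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(judge, human), 3))  # → 0.333
```

Tracking κ per criterion over time is one way to surface the systematic disagreement patterns described above, rather than relying on raw agreement alone.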

Examples

Example 1: Ask for the upstream workflow directly

Use @advanced-evaluation to handle <task>. Start from the copied upstream workflow, load only the files that change the outcome, and keep provenance visible in the answer.

Explanation: This is the safest starting point when the operator needs the imported workflow, but not the entire repository.

Example 2: Ask for a provenance-grounded review

Review @advanced-evaluation against metadata.json and ORIGIN.md, then explain which copied upstream files you would load first and why.

Explanation: Use this before review or troubleshooting when you need a precise, auditable explanation of origin and file selection.

Example 3: Narrow the copied support files before execution

Use @advanced-evaluation for <task>. Load only the copied references, examples, or scripts that change the outcome, and name the files explicitly before proceeding.

Explanation: This keeps the skill aligned with progressive disclosure instead of loading the whole copied package by default.

Example 4: Build a reviewer packet

Review @advanced-evaluation using the copied upstream files plus provenance, then summarize any gaps before merge.

Explanation: This is useful when the PR is waiting for human review and you want a repeatable audit packet.

Imported Usage Notes

Imported: Examples

Example 1: Direct Scoring for Accuracy

Input:

Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun, 
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5

Output:

{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct 
scientific reasoning. Both the axial tilt and its effect on sunlight distribution 
are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5°) for completeness."
}
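A downstream consumer should validate judge output of this shape before trusting it. The sketch below assumes the field names from the example above; the validator itself is hypothetical, not part of the imported package:

```python
import json

# Field names follow the direct-scoring example above.
REQUIRED_FIELDS = {"criterion", "score", "evidence", "justification", "improvement"}

def validate_judgment(raw: str, scale_max: int = 5) -> dict:
    """Parse judge output and reject malformed or out-of-range results."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(data["score"], int) or not (1 <= data["score"] <= scale_max):
        raise ValueError(f"score {data['score']!r} outside 1-{scale_max}")
    if not data["evidence"]:
        raise ValueError("score must be grounded in at least one piece of evidence")
    return data

ok = validate_judgment('{"criterion": "Factual Accuracy", "score": 5, '
                       '"evidence": ["axial tilt identified"], '
                       '"justification": "Accurate reasoning.", '
                       '"improvement": "Add the 23.5 degree tilt angle."}')
print(ok["score"])  # → 5
```

Rejecting outputs with empty evidence enforces the justification-before-score requirement mechanically instead of trusting the prompt alone.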

Example 2: Pairwise Comparison with Position Swap

Input:

Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]

First Pass (A first):

{ "winner": "B", "confidence": 0.8 }

Second Pass (B first):

{ "winner": "A", "confidence": 0.6 }

(Note: the raw second-pass winner "A" is the response shown in first position during that pass, which is actually B; it must be mapped back before comparing passes.)

Mapped Second Pass:

{ "winner": "B", "confidence": 0.6 }

Final Result:

{
  "winner": "B",
  "confidence": 0.7,
  "positionConsistency": {
    "consistent": true,
    "firstPassWinner": "B",
    "secondPassWinner": "B"
  }
}

Example 3: Rubric Generation

Input:

criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"

Output (abbreviated):

{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments present for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Code is well-structured but uses domain-specific abbreviations",
      "guidance": "Score based on readability for domain experts, not general audience"
    }
  ]
}

Best Practices

Treat the generated public skill as a reviewable packaging layer around the upstream repository. The goal is to keep provenance explicit and load only the copied source material that materially improves execution.

  • Always require justification before scores - Chain-of-thought prompting improves reliability by 15-25%
  • Always swap positions in pairwise comparison - Single-pass comparison is corrupted by position bias
  • Match scale granularity to rubric specificity - Don't use 1-10 without detailed level descriptions
  • Separate objective and subjective criteria - Use direct scoring for objective, pairwise for subjective
  • Include confidence scores - Calibrate to position consistency and evidence strength
  • Define edge cases explicitly - Ambiguous situations cause the most evaluation variance
  • Use domain-specific rubrics - Generic rubrics produce generic (less useful) evaluations

Imported Operating Notes

Imported: Guidelines

  1. Always require justification before scores - Chain-of-thought prompting improves reliability by 15-25%

  2. Always swap positions in pairwise comparison - Single-pass comparison is corrupted by position bias

  3. Match scale granularity to rubric specificity - Don't use 1-10 without detailed level descriptions

  4. Separate objective and subjective criteria - Use direct scoring for objective, pairwise for subjective

  5. Include confidence scores - Calibrate to position consistency and evidence strength

  6. Define edge cases explicitly - Ambiguous situations cause the most evaluation variance

  7. Use domain-specific rubrics - Generic rubrics produce generic (less useful) evaluations

  8. Validate against human judgments - Automated evaluation is only valuable if it correlates with human assessment

  9. Monitor for systematic bias - Track disagreement patterns by criterion, response type, model

  10. Design for iteration - Evaluation systems improve with feedback loops

Troubleshooting

Problem: The operator skipped the imported context and answered too generically

Symptoms: The result ignores the upstream workflow in plugins/antigravity-awesome-skills-claude/skills/advanced-evaluation, fails to mention provenance, or does not use any copied source files at all.

Solution: Re-open metadata.json, ORIGIN.md, and the most relevant copied upstream files. Load only the files that materially change the answer, then restate the provenance before continuing.

Problem: The imported workflow feels incomplete during review

Symptoms: Reviewers can see the generated SKILL.md, but they cannot quickly tell which references, examples, or scripts matter for the current task.

Solution: Point at the exact copied references, examples, scripts, or assets that justify the path you took. If the gap is still real, record it in the PR instead of hiding it.

Problem: The task drifted into a different specialization

Symptoms: The imported skill starts in the right place, but the work turns into debugging, architecture, design, security, or release orchestration that a native skill handles better.

Solution: Use the related skills section to hand off deliberately. Keep the imported provenance visible so the next skill inherits the right context instead of starting blind.

Related Skills

  • @00-andruia-consultant
  • @10-andruia-skill-smith
  • @20-andruia-niche-intelligence
  • @3d-web-experience

Use any of these when the work is better handled by that native specialization after this imported skill establishes context.

Additional Resources

Use this support matrix and the linked files below as the operator packet for this imported skill. They should reflect real copied source material, not generic scaffolding.

| Resource family | What it gives the reviewer | Example path |
|---|---|---|
| references | copied reference notes, guides, or background material from upstream | references/n/a |
| examples | worked examples or reusable prompts copied from upstream | examples/n/a |
| scripts | upstream helper scripts that change execution or validation | scripts/n/a |
| agents | routing or delegation notes that are genuinely part of the imported package | agents/n/a |
| assets | supporting assets or schemas copied from the source package | assets/n/a |

Imported Reference Notes

Imported: References

Internal reference:

  • LLM-as-Judge Implementation Patterns
  • Bias Mitigation Techniques
  • Metric Selection Guide

External research:

Related skills in this collection:

  • evaluation - Foundational evaluation concepts
  • context-fundamentals - Context structure for evaluation prompts
  • tool-design - Building evaluation tools

Imported: Evaluation Approaches

Direct Scoring Implementation

Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format.

Criteria Definition Pattern:

Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
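Given per-criterion weights in the 0-1 range, the overall score is typically a weighted mean. A minimal sketch (the function name and sample criteria are illustrative):

```python
def weighted_overall(scores: dict, weights: dict) -> float:
    """Weighted mean of per-criterion scores; weights need not sum to 1."""
    total_weight = sum(weights[c] for c in scores)
    if total_weight == 0:
        raise ValueError("at least one criterion must carry weight")
    return sum(scores[c] * weights[c] for c in scores) / total_weight

scores = {"factual_accuracy": 5, "clarity": 3}
weights = {"factual_accuracy": 1.0, "clarity": 0.5}
# (5 * 1.0 + 3 * 0.5) / (1.0 + 0.5)
print(round(weighted_overall(scores, weights), 3))  # → 4.333
```

Normalizing by the weight total keeps the aggregate on the same 1-{max} scale as the individual criteria.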

Scale Calibration:

  • 1-3 scales: Binary with neutral option, lowest cognitive load
  • 1-5 scales: Standard Likert, good balance of granularity and reliability
  • 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics

Prompt Structure for Direct Scoring:

You are an expert evaluator assessing response quality.

#### Imported: Task

Evaluate the following response against each criterion.

#### Imported: Original Prompt

{prompt}

#### Imported: Response to Evaluate

{response}

#### Imported: Criteria

{for each criterion: name, description, weight}

#### Imported: Output Format

Respond with structured JSON containing scores, justifications, and summary.

Chain-of-Thought Requirement: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.
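The template above can be assembled mechanically. This sketch mirrors its section order and bakes in the justification-before-score requirement; the builder function and its heading strings are illustrative, not copied from upstream:

```python
def build_direct_scoring_prompt(prompt: str, response: str, criteria: list) -> str:
    """Assemble a direct-scoring prompt that demands justification before scores."""
    criteria_block = "\n".join(
        f"- {c['name']} (weight {c['weight']}): {c['description']}" for c in criteria
    )
    return (
        "You are an expert evaluator assessing response quality.\n\n"
        "## Task\nEvaluate the following response against each criterion.\n\n"
        f"## Original Prompt\n{prompt}\n\n"
        f"## Response to Evaluate\n{response}\n\n"
        f"## Criteria\n{criteria_block}\n\n"
        "## Output Format\n"
        "For each criterion, give evidence and justification BEFORE the score, "
        "then respond with structured JSON containing scores, justifications, and summary."
    )

p = build_direct_scoring_prompt(
    "What causes seasons on Earth?",
    "Seasons are caused by Earth's tilted axis.",
    [{"name": "Factual Accuracy", "weight": 1.0, "description": "Is the answer correct?"}],
)
print("## Criteria" in p)  # → True
```

Keeping the prompt a pure function of its inputs makes it easy to version and to A/B test wording changes.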

Pairwise Comparison Implementation

Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.

Position Bias Mitigation Protocol:

  1. First pass: Response A in first position, Response B in second
  2. Second pass: Response B in first position, Response A in second
  3. Consistency check: If passes disagree, return TIE with reduced confidence
  4. Final verdict: Consistent winner with averaged confidence
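The four-step protocol can be sketched as a wrapper around any pairwise judge. Here `judge_fn` is a stand-in for a model call that returns a verdict relative to presentation order; the wrapper and the toy judge are illustrative:

```python
def judge_with_swap(judge_fn, prompt, resp_a, resp_b):
    """Run a pairwise judge twice with swapped positions and reconcile.

    judge_fn(prompt, first, second) -> (winner, confidence), where winner is
    "first", "second", or "tie" relative to presentation order.
    """
    w1, c1 = judge_fn(prompt, resp_a, resp_b)   # pass 1: A shown first
    w2, c2 = judge_fn(prompt, resp_b, resp_a)   # pass 2: B shown first
    # Map presentation-order verdicts back to response identities.
    map1 = {"first": "A", "second": "B", "tie": "TIE"}
    map2 = {"first": "B", "second": "A", "tie": "TIE"}
    v1, v2 = map1[w1], map2[w2]
    if v1 == v2 and v1 != "TIE":
        return {"winner": v1, "confidence": (c1 + c2) / 2, "consistent": True}
    # Disagreement (or double tie): reduced-confidence TIE per the protocol.
    return {"winner": "TIE", "confidence": 0.5, "consistent": v1 == v2}

# A toy judge that always prefers the shorter response, regardless of position.
toy = lambda prompt, first, second: (
    ("first", 0.8) if len(first) <= len(second) else ("second", 0.6)
)
print(judge_with_swap(toy, "Explain ML", "short answer", "a much longer answer"))
```

Because the toy judge's preference survives the swap, the verdicts agree and the confidences are averaged; a position-biased judge would flip and collapse to a TIE.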

Prompt Structure for Pairwise Comparison:

You are an expert evaluator comparing two AI responses.

#### Imported: Critical Instructions

- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent

#### Imported: Original Prompt

{prompt}

#### Imported: Response A

{response_a}

#### Imported: Response B

{response_b}

#### Imported: Comparison Criteria

{criteria list}

#### Imported: Output Format

JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.

Confidence Calibration: Confidence scores should reflect position consistency:

  • Both passes agree: confidence = average of individual confidences
  • Passes disagree: confidence = 0.5, verdict = TIE

Rubric Generation

Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.

Rubric Components:

  1. Level descriptions: Clear boundaries for each score level
  2. Characteristics: Observable features that define each level
  3. Examples: Representative text for each level (optional but valuable)
  4. Edge cases: Guidance for ambiguous situations
  5. Scoring guidelines: General principles for consistent application
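These components map naturally onto a small data model. A sketch with hypothetical class names, assuming rubrics that define only some score levels (as in the 1/3/5 example above):

```python
from dataclasses import dataclass, field

@dataclass
class RubricLevel:
    score: int
    label: str
    description: str
    characteristics: list = field(default_factory=list)

@dataclass
class Rubric:
    criterion: str
    levels: list
    edge_cases: list = field(default_factory=list)

    def level_for(self, score: int) -> RubricLevel:
        """Return the defined level at or just below a score
        (rubrics often define only levels 1, 3, and 5)."""
        candidates = [lvl for lvl in self.levels if lvl.score <= score]
        if not candidates:
            raise ValueError(f"no rubric level defined at or below {score}")
        return max(candidates, key=lambda lvl: lvl.score)

rubric = Rubric("Code Readability", [
    RubricLevel(1, "Poor", "Hard to understand without significant effort"),
    RubricLevel(3, "Adequate", "Understandable with some effort"),
    RubricLevel(5, "Excellent", "Immediately clear and maintainable"),
])
print(rubric.level_for(4).label)  # → Adequate
```

Attaching edge cases to the rubric object keeps the ambiguity guidance versioned alongside the levels it qualifies.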

Strictness Calibration:

  • Lenient: Lower bar for passing scores, appropriate for encouraging iteration
  • Balanced: Fair, typical expectations for production use
  • Strict: High standards, appropriate for safety-critical or high-stakes evaluation

Domain Adaptation: Rubrics should use domain-specific terminology. A "code readability" rubric mentions variables, functions, and comments. A "medical accuracy" rubric references clinical terminology and evidence standards.

Imported: Practical Guidance

Evaluation Pipeline Design

Production evaluation systems require multiple layers:

┌───────────────────────────────────────────────────┐
│                Evaluation Pipeline                │
├───────────────────────────────────────────────────┤
│                                                   │
│  Input: Response + Prompt + Context               │
│           │                                       │
│           ▼                                       │
│  ┌─────────────────────┐                         │
│  │   Criteria Loader   │ ◄── Rubrics, weights    │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Primary Scorer    │ ◄── Direct or Pairwise  │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Bias Mitigation   │ ◄── Position swap, etc. │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │ Confidence Scoring  │ ◄── Calibration         │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  Output: Scores + Justifications + Confidence     │
│                                                   │
└───────────────────────────────────────────────────┘

Common Anti-Patterns

Anti-pattern: Scoring without justification

  • Problem: Scores lack grounding, difficult to debug or improve
  • Solution: Always require evidence-based justification before score

Anti-pattern: Single-pass pairwise comparison

  • Problem: Position bias corrupts results
  • Solution: Always swap positions and check consistency

Anti-pattern: Overloaded criteria

  • Problem: Criteria measuring multiple things are unreliable
  • Solution: One criterion = one measurable aspect

Anti-pattern: Missing edge case guidance

  • Problem: Evaluators handle ambiguous cases inconsistently
  • Solution: Include edge cases in rubrics with explicit guidance

Anti-pattern: Ignoring confidence calibration

  • Problem: High-confidence wrong judgments are worse than low-confidence
  • Solution: Calibrate confidence to position consistency and evidence strength

Decision Framework: Direct vs. Pairwise

Use this decision tree:

Is there an objective ground truth?
├── Yes → Direct Scoring
│   └── Examples: factual accuracy, instruction following, format compliance
│
└── No → Is it a preference or quality judgment?
    ├── Yes → Pairwise Comparison
    │   └── Examples: tone, style, persuasiveness, creativity
    │
    └── No → Consider reference-based evaluation
        └── Examples: summarization (compare to source), translation (compare to reference)
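The decision tree above can be expressed as a small routing function; the flags and return labels are illustrative names, and the final fallback is an assumption for cases the tree leaves open:

```python
def choose_approach(has_ground_truth: bool, is_preference: bool,
                    has_reference: bool) -> str:
    """Route an evaluation task per the direct-vs-pairwise decision tree."""
    if has_ground_truth:
        return "direct_scoring"        # e.g. factual accuracy, format compliance
    if is_preference:
        return "pairwise_comparison"   # e.g. tone, style, persuasiveness
    if has_reference:
        return "reference_based"       # e.g. summarization, translation
    return "needs_human_review"        # assumption: fall back when nothing applies

print(choose_approach(True, False, False))   # → direct_scoring
print(choose_approach(False, True, False))   # → pairwise_comparison
print(choose_approach(False, False, True))   # → reference_based
```

Encoding the routing in code makes the choice auditable and keeps pipelines from silently applying direct scoring to preference tasks.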

Scaling Evaluation

For high-volume evaluation:

  1. Panel of LLMs (PoLL): Use multiple models as judges, aggregate votes

    • Reduces individual model bias
    • More expensive but more reliable for high-stakes decisions
  2. Hierarchical evaluation: Fast cheap model for screening, expensive model for edge cases

    • Cost-effective for large volumes
    • Requires calibration of screening threshold
  3. Human-in-the-loop: Automated evaluation for clear cases, human review for low-confidence

    • Best reliability for critical applications
    • Design feedback loop to improve automated evaluation
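The hierarchical pattern (option 2) can be sketched as a screening loop; both judges below are stand-ins for real model calls, and the threshold value is illustrative:

```python
def hierarchical_evaluate(items, cheap_judge, expensive_judge, threshold=0.75):
    """Screen with a cheap judge; escalate low-confidence cases.

    Each judge is a callable item -> (verdict, confidence)."""
    results, escalated = [], 0
    for item in items:
        verdict, conf = cheap_judge(item)
        if conf < threshold:                     # below the calibrated threshold
            verdict, conf = expensive_judge(item)
            escalated += 1
        results.append((item, verdict, conf))
    return results, escalated

# Toy judges: the cheap one is only confident about longer inputs.
cheap = lambda x: ("pass", 0.9) if len(x) > 5 else ("pass", 0.4)
expensive = lambda x: ("fail", 0.95)
results, n_escalated = hierarchical_evaluate(["longer item", "tiny"], cheap, expensive)
print(n_escalated)  # → 1 (only "tiny" falls below the screening threshold)
```

The escalation count is the quantity to monitor when calibrating the screening threshold against cost.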

Imported: Integration

This skill integrates with:

  • context-fundamentals - Evaluation prompts require effective context structure
  • tool-design - Evaluation tools need proper schemas and error handling
  • context-optimization - Evaluation prompts can be optimized for token efficiency
  • evaluation (foundational) - This skill extends the foundational evaluation concepts

Imported: Skill Metadata

Created: 2024-12-24
Last Updated: 2024-12-24
Author: Muratcan Koylan
Version: 1.0.0

Imported: Limitations

  • Use this skill only when the task clearly matches the scope described above.
  • Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
  • Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.