Claude-skill-registry LLM Judge Patterns
Comprehensive guide to using LLMs as judges for automated evaluation including prompt patterns, calibration, bias reduction, and multi-judge ensembles
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-judge-patterns" ~/.claude/skills/majiayu000-claude-skill-registry-llm-judge-patterns && rm -rf "$T"
skills/data/llm-judge-patterns/SKILL.md

LLM Judge Patterns
What is LLM-as-Judge?
Definition: Using LLMs (GPT-4, Claude) to evaluate other LLM outputs automatically.
Model
Input: Question + Answer (to evaluate)
Judge LLM: GPT-4 or Claude
Output: Score + Reasoning

Example:
Question: "What is the capital of France?"
Answer: "Paris is the capital of France."
Judge: "Score: 5/5 - Correct, concise, directly answers question"
Why LLM-as-Judge?
Human Eval is Slow and Expensive
Comparison:
Human evaluation:
- 100 answers × 5 min each = 500 min = 8.3 hours
- Cost: $20/hour × 8.3 hours = $166

LLM-as-judge:
- 100 answers × 2 sec each = 200 sec = 3.3 min
- Cost: 100 × $0.01 = $1
Need to Evaluate Thousands of Outputs
Scale:
Development: Test 1000+ variations
Production: Evaluate millions of responses
Human eval: Impossible at this scale
LLM-judge: Feasible
Research Shows High Correlation with Human Judgment
Studies:
- GPT-4 as judge correlates 0.8+ with human ratings
- Works well for subjective quality (fluency, helpfulness)
- Less reliable for factual correctness
Enables Continuous Evaluation
Workflow:
Every response → LLM judge → Score logged → Dashboard

Detect regressions in real time.
When to Use LLM-as-Judge
Subjective Quality (Fluency, Relevance, Helpfulness)
Good Use Cases:
- Is this answer helpful?
- Is this text fluent and natural?
- Is this response relevant to the question?
- Is this summary coherent?
Complex Rubrics (Multi-Criteria)
Example:
Evaluate on:
1. Accuracy (1-5)
2. Completeness (1-5)
3. Clarity (1-5)
4. Tone (1-5)

An LLM can handle multi-dimensional evaluation.
Large-Scale Evaluation
When:
Need to evaluate 1000+ examples
Human eval too slow/expensive
Rapid Iteration
Development:
Test 10 prompt variations
Evaluate each on 100 examples

LLM-judge: Minutes
Human eval: Days
When NOT to Use LLM-as-Judge
Objective Correctness (Factual Answers)
Problem:
Question: "What is 2+2?" Answer: "5" LLM judge might say: "The answer is clear and confident" (wrong!) Better: Exact match or computation
Mathematical Reasoning (Verify with Computation)
Better Approach:
Execute code to verify the answer
Not: Ask the LLM if the math is correct
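As a minimal sketch of this idea (helper name hypothetical), a numeric answer can be parsed out of the model's text and checked against a computed value:

```python
import re

def verify_numeric_answer(answer_text, expected, tol=1e-6):
    """Check a numeric answer by computation, not by LLM judgment."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_text)
    if not numbers:
        return False
    # Compare the last number in the answer against the computed value
    return abs(float(numbers[-1]) - expected) <= tol

# "What is 2+2?" -> compute the expected value, then compare
assert verify_numeric_answer("The answer is 4.", expected=2 + 2)
assert not verify_numeric_answer("The answer is 5.", expected=2 + 2)
```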
Code Correctness (Run Tests)
Better Approach:
Run unit tests
Check if the code compiles
Not: Ask the LLM if the code is correct
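For example, a minimal sketch that trusts the test runner's exit code instead of a judge (the test path is a placeholder):

```python
import subprocess

def code_passes_tests(test_path="tests/"):
    """Run the unit tests; a zero exit code means they all passed."""
    result = subprocess.run(
        ["pytest", test_path, "-q"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```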
Safety-Critical (Use Human Evaluation)
Examples:
- Medical advice
- Legal guidance
- Financial recommendations

→ Always use human experts
Judge Model Selection
GPT-4 (Most Commonly Used)
Pros:
- High quality judgments
- Good correlation with humans
- Widely tested
Cons:
- Expensive ($0.03/1K tokens)
- Can be slow
Claude Sonnet 4 (Excellent Reasoning)
Pros:
- Excellent reasoning
- Good for complex evaluations
- Fast
Cons:
- Expensive
- Less tested than GPT-4
GPT-3.5 (Cheaper, Less Accurate)
Pros:
- Cheap ($0.001/1K tokens)
- Fast
Cons:
- Less accurate
- More biased
Open-Source (Llama, Mixtral)
Pros:
- Free (if self-hosted)
- Privacy (on-prem)
Cons:
- Lower quality
- Requires infrastructure
Judge Prompt Patterns
Single-Answer Grading
Pattern:
```
You are evaluating an AI assistant's response.

Question: {question}
Answer: {answer}

Rate the answer on a scale of 1-5:
1 = Poor
5 = Excellent

Consider:
- Accuracy
- Relevance
- Completeness

Score:
```
Example:
```python
def single_answer_grading(question, answer):
    prompt = f"""
You are evaluating an AI assistant's response.

Question: {question}
Answer: {answer}

Rate the answer on a scale of 1-5:
1 = Poor (incorrect, irrelevant, or incomplete)
5 = Excellent (correct, relevant, and complete)

Provide:
- Score (1-5)
- Brief reasoning

Format:
Score: [number]
Reasoning: [explanation]
"""
    response = llm.generate(prompt)
    score = extract_score(response)
    return score
```
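The snippet above leaves `extract_score` undefined; a minimal regex-based sketch, assuming the judge follows the requested `Score: [number]` format:

```python
import re

def extract_score(response):
    """Parse 'Score: 4' or 'Score: 4.5' out of the judge's response."""
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", response)
    return float(match.group(1)) if match else None
```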
Pairwise Comparison (A vs B)
Pattern:
```
Which answer is better?

Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Which is better? A or B? Explain why.
```
More Reliable:
Pairwise comparison reduces absolute-scoring bias.
Humans also find comparisons easier than absolute ratings.
Example:
```python
def pairwise_comparison(question, answer_a, answer_b):
    prompt = f"""
Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Which answer is better? A or B?

Consider:
- Accuracy
- Relevance
- Clarity

Respond with:
- Winner: A or B
- Reasoning: Why is it better?

Format:
Winner: [A or B]
Reasoning: [explanation]
"""
    response = llm.generate(prompt)
    winner = extract_winner(response)
    return winner
```
Aggregate via Elo Ratings:
```python
# After many pairwise comparisons,
# calculate an Elo rating for each model.
# Higher Elo = better model.
```
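For reference, a minimal sketch of the standard Elo update applied after each pairwise verdict (the K-factor of 32 and the 1000 starting rating are conventional defaults, not from this guide):

```python
def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Standard Elo update after one pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Start every model at 1000 and fold in each judge verdict
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], a_won=True
)
```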
Multi-Aspect Evaluation (Rubric)
Pattern:
```
Evaluate on multiple criteria:
1. Accuracy (1-5)
2. Relevance (1-5)
3. Completeness (1-5)
4. Clarity (1-5)

Score each separately
```
Example:
```python
def multi_aspect_evaluation(question, answer):
    prompt = f"""
Question: {question}
Answer: {answer}

Evaluate on these criteria (1-5 scale):

1. Accuracy: Is the information correct?
   1 = Incorrect, 5 = Perfectly accurate
2. Relevance: Does it answer the question?
   1 = Irrelevant, 5 = Highly relevant
3. Completeness: Does it cover all aspects?
   1 = Incomplete, 5 = Comprehensive
4. Clarity: Is it clear and well-written?
   1 = Confusing, 5 = Very clear

Provide scores and brief reasoning for each.

Format:
Accuracy: [score] - [reasoning]
Relevance: [score] - [reasoning]
Completeness: [score] - [reasoning]
Clarity: [score] - [reasoning]
Overall: [average score]
"""
    response = llm.generate(prompt)
    scores = extract_scores(response)
    return scores
```
Chain-of-Thought Judging
Pattern:
First, explain your reasoning.
Then, provide the score.

This increases reliability.
Example:
```python
def cot_judging(question, answer):
    prompt = f"""
Question: {question}
Answer: {answer}

Evaluate this answer step by step:

Step 1: Is the answer factually correct?
Step 2: Does it fully address the question?
Step 3: Is it clear and well-written?

Based on your analysis, rate the answer (1-5).

Format:
Step 1: [analysis]
Step 2: [analysis]
Step 3: [analysis]
Final Score: [number]
"""
    response = llm.generate(prompt)
    # Return the full response so the reasoning steps are kept alongside the score
    return response
```
Judge Prompt Template
Comprehensive Template:
```
You are an expert evaluator assessing AI assistant responses.

Question: {question}
Answer: {answer}
{optional: Ground Truth: {ground_truth}}
{optional: Context: {context}}

Evaluate the answer on these criteria:

1. **Accuracy** (1-5): Is the information factually correct?
   - 1 = Completely incorrect
   - 3 = Partially correct
   - 5 = Fully correct

2. **Relevance** (1-5): Does it address the question?
   - 1 = Completely irrelevant
   - 3 = Partially relevant
   - 5 = Directly addresses question

3. **Completeness** (1-5): Does it cover all aspects?
   - 1 = Missing most information
   - 3 = Covers some aspects
   - 5 = Comprehensive

4. **Clarity** (1-5): Is it clear and well-written?
   - 1 = Confusing or poorly written
   - 3 = Acceptable clarity
   - 5 = Very clear and well-written

Provide:
- Score for each criterion (1-5)
- Brief reasoning for each score
- Overall score (average of all criteria)

Format:
Accuracy: [score] - [reasoning]
Relevance: [score] - [reasoning]
Completeness: [score] - [reasoning]
Clarity: [score] - [reasoning]
Overall: [average score]
```
Judge Calibration
Compare Judge Scores to Human Scores
Process:
1. Get 100 examples
2. Human annotators rate each (1-5)
3. LLM judge rates each (1-5)
4. Calculate correlation
Correlation:
```python
from scipy.stats import pearsonr

human_scores = [4, 5, 3, 4, 2, ...]
judge_scores = [4.2, 4.8, 3.1, 4.5, 2.3, ...]

correlation, p_value = pearsonr(human_scores, judge_scores)
print(f"Correlation: {correlation:.2f}")

# Target: >0.7 (good correlation)
# If <0.7: Adjust prompt or use a different judge
```
Calculate Correlation
See above
Adjust Prompt if Low Correlation
If correlation <0.7:
1. Analyze disagreements (where the judge differs from humans)
2. Update the prompt to address the issues
3. Re-test correlation
4. Iterate until >0.7
Test on Multiple Examples
Validation Set:
Use 100-500 examples with human ratings.
Ensure the set is diverse (easy cases, hard cases, edge cases).
Reducing Judge Bias
Position Bias (Favors First Option in A/B)
Problem:
The judge tends to prefer Answer A over Answer B, even when B is better.
Mitigation:
```python
# Randomize order to cancel out position bias
import random

if random.random() < 0.5:
    winner = compare(question, answer_a, answer_b)
else:
    winner = compare(question, answer_b, answer_a)
    winner = "A" if winner == "B" else "B"  # Flip back to original labels
```
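A stricter variant (a sketch, not from this guide) judges both orders and only accepts the verdict when the two runs agree, treating disagreement as a tie:

```python
def debiased_comparison(question, answer_a, answer_b):
    """Judge in both orders; keep the verdict only if the two runs agree."""
    first = compare(question, answer_a, answer_b)   # A shown first
    second = compare(question, answer_b, answer_a)  # B shown first
    second = "A" if second == "B" else "B"          # Map back to original labels
    return first if first == second else "tie"     # Disagreement -> tie
```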
Length Bias (Favors Longer Answers)
Problem:
The judge tends to prefer longer answers, even if the shorter answer is better.
Mitigation:
Prompt: "Do not favor longer answers. Concise answers can be better." Or: Normalize scores by length
Self-Preference Bias (Favors Own Outputs)
Problem:
GPT-4 as judge tends to prefer GPT-4 outputs over Claude outputs.
Mitigation:
Use an external judge (e.g., Claude to judge GPT-4 outputs)

Or: Blind evaluation (don't reveal which model produced each answer)
Multi-Judge Ensemble
Use Multiple Judges (GPT-4 + Claude)
Approach:
```python
def multi_judge_ensemble(question, answer):
    # Judge 1: GPT-4
    score_gpt4 = gpt4_judge(question, answer)

    # Judge 2: Claude
    score_claude = claude_judge(question, answer)

    # Judge 3: GPT-3.5 (cheaper, as tiebreaker)
    score_gpt35 = gpt35_judge(question, answer)

    return {
        "gpt4": score_gpt4,
        "claude": score_claude,
        "gpt35": score_gpt35,
    }
```
Aggregate Scores (Majority Vote, Average)
Majority Vote:
```python
scores = [4, 5, 4]  # Three judges
majority = max(set(scores), key=scores.count)  # 4
```
Average:
```python
scores = [4.2, 4.8, 4.5]
average = sum(scores) / len(scores)  # 4.5
```
Weighted Average:
scores = {"gpt4": 4.8, "claude": 4.5, "gpt35": 4.0} weights = {"gpt4": 0.5, "claude": 0.4, "gpt35": 0.1} weighted_avg = sum(scores[j] * weights[j] for j in scores) # 4.58
Increases Reliability
Why:
A single judge can be wrong.
Multiple judges reduce variance.
An ensemble is more robust.
Cost Optimization
Use Cheaper Judge for Initial Filtering
Two-Stage:
Stage 1: GPT-3.5 judge (cheap, fast)
- Filter out clearly bad answers (score <3)

Stage 2: GPT-4 judge (expensive, accurate)
- Evaluate borderline cases (score 3-4)
Use Expensive Judge for Borderline Cases
See above
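A minimal sketch of this cascade, reusing the hypothetical `gpt35_judge` and `gpt4_judge` functions from the ensemble example:

```python
def two_stage_judge(question, answer):
    """Cheap judge first; escalate only borderline cases to the expensive judge."""
    cheap_score = gpt35_judge(question, answer)
    if cheap_score < 3 or cheap_score > 4:
        # Clearly bad or clearly good: trust the cheap judge
        return cheap_score
    # Borderline (3-4): pay for the accurate judge
    return gpt4_judge(question, answer)
```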
Cache Judge Results
Caching:
```python
import hashlib

cache = {}

def cached_judge(question, answer):
    # Create cache key from the question-answer pair
    key = hashlib.md5(f"{question}{answer}".encode()).hexdigest()

    # Check cache
    if key in cache:
        return cache[key]

    # Call judge
    score = llm_judge(question, answer)

    # Cache result
    cache[key] = score
    return score
```
Judge Evaluation Frameworks
G-Eval (Using GPT-4)
Paper: "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"
Approach:
Use GPT-4 to generate evaluation criteria.
Then use GPT-4 to evaluate based on those criteria.
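A rough sketch of that two-step flow (prompts paraphrased; not the paper's exact implementation):

```python
def g_eval_style(question, answer, task="question answering"):
    # Step 1: have the LLM draft task-specific criteria
    criteria = llm.generate(
        f"List 3-5 evaluation criteria for judging a {task} response, "
        "one per line."
    )
    # Step 2: evaluate the answer against those criteria
    return llm.generate(
        f"Question: {question}\nAnswer: {answer}\n\n"
        f"Evaluate the answer against these criteria:\n{criteria}\n\n"
        "Give a 1-5 score per criterion, then 'Overall: [number]'."
    )
```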
Prometheus (Using Llama)
Open-Source Judge:
Fine-tuned Llama model for evaluation
Free to use
Lower quality than GPT-4, but no API costs
Custom Implementation
See examples throughout this document
Metrics to Track
Judge-Human Correlation
Target: >0.7
Calculation:
```python
from scipy.stats import pearsonr

correlation, p_value = pearsonr(human_scores, judge_scores)
```
Inter-Judge Agreement (If Multiple Judges)
Kappa Score:
```python
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(judge1_scores, judge2_scores)
# >0.7 = good agreement
```
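Since 1-5 ratings are ordinal, quadratic-weighted kappa, which penalizes large disagreements more heavily than near-misses, is often a better fit; scikit-learn supports it via the `weights` parameter:

```python
from sklearn.metrics import cohen_kappa_score

# Quadratic weighting: a 4-vs-5 disagreement counts as milder than 1-vs-5
kappa_w = cohen_kappa_score(judge1_scores, judge2_scores, weights="quadratic")
```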
Judge Consistency (Same Input → Same Output)
Test:
```python
import numpy as np

# Evaluate the same example 10 times
scores = [judge(question, answer) for _ in range(10)]

# Calculate variance
variance = np.var(scores)
# Low variance = consistent judge
```
Real-World Judge Use Cases
RAG Answer Evaluation
See RAG Evaluation skill
Chatbot Response Quality
Criteria:
- Helpfulness
- Relevance
- Safety
- Tone
Content Moderation
Criteria:
- Toxicity
- Hate speech
- Misinformation
- Spam
Translation Quality
Criteria:
- Accuracy
- Fluency
- Preserves meaning
Summarization Quality
Criteria:
- Completeness
- Conciseness
- Accuracy
Limitations
Judge Can Be Wrong (Validate with Humans)
Always:
Spot-check judge results with human evaluation.
Don't blindly trust the judge.
Expensive (API Costs)
Cost:
1,000 evaluations × $0.01 = $10
10,000 evaluations × $0.01 = $100

Costs can add up quickly.
Judge Bias (Needs Careful Prompting)
See "Reducing Judge Bias" section
Not Suitable for All Tasks
See "When NOT to Use" section
Implementation
Judge Prompt Templates
See "Judge Prompt Template" section
Multi-Judge Aggregation
See "Multi-Judge Ensemble" section
Calibration Scripts
```python
from scipy.stats import pearsonr

def calibrate_judge(judge_fn, test_set):
    """
    test_set: List of (question, answer, human_score) tuples
    """
    judge_scores = []
    human_scores = []

    for question, answer, human_score in test_set:
        judge_score = judge_fn(question, answer)
        judge_scores.append(judge_score)
        human_scores.append(human_score)

    correlation, p_value = pearsonr(human_scores, judge_scores)

    return {
        "correlation": correlation,
        "p_value": p_value,
        "judge_scores": judge_scores,
        "human_scores": human_scores,
    }
```
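A hypothetical usage example (the test set here is illustrative), wiring in the `single_answer_grading` judge defined earlier:

```python
test_set = [
    ("What is the capital of France?", "Paris is the capital of France.", 5),
    ("What is 2+2?", "5", 1),
    # ... more (question, answer, human_score) triples
]

report = calibrate_judge(single_answer_grading, test_set)
print(f"Judge-human correlation: {report['correlation']:.2f}")
```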
Summary
Quick Reference
LLM-as-Judge: Use LLMs to evaluate other LLM outputs
Why:
- Fast and cheap vs human eval
- Scales to thousands of examples
- High correlation with humans (>0.8)
When to Use:
- Subjective quality
- Complex rubrics
- Large-scale evaluation
When NOT:
- Objective correctness
- Math/code (use computation)
- Safety-critical (use humans)
Judge Models:
- GPT-4 (best quality)
- Claude (excellent reasoning)
- GPT-3.5 (cheaper)
- Open-source (free but lower quality)
Prompt Patterns:
- Single-answer grading
- Pairwise comparison (more reliable)
- Multi-aspect (rubric)
- Chain-of-thought (increases reliability)
Bias Reduction:
- Position bias: Randomize order
- Length bias: Normalize or prompt
- Self-preference: External judge
Multi-Judge:
- Use multiple judges
- Aggregate (majority vote, average)
- Increases reliability
Cost Optimization:
- Cheap judge for filtering
- Expensive judge for borderline
- Cache results
Calibration:
- Compare to human scores
- Target correlation >0.7
- Adjust prompt if low
Limitations:
- Can be wrong (validate with humans)
- Expensive (API costs)
- Biased (careful prompting)