Claude-skill-registry LLM Judge Patterns
Comprehensive guide to using LLMs as judges for automated evaluation including prompt patterns, calibration, bias reduction, and multi-judge ensembles
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-judge-patterns" ~/.claude/skills/majiayu000-claude-skill-registry-llm-judge-patterns && rm -rf "$T"
skills/data/llm-judge-patterns/SKILL.md

LLM Judge Patterns
What is LLM-as-Judge?
Definition: Using LLMs (GPT-4, Claude) to evaluate other LLM outputs automatically.
Model
Input: Question + Answer (to evaluate)
Judge LLM: GPT-4 or Claude
Output: Score + Reasoning

Example:
Question: "What is the capital of France?"
Answer: "Paris is the capital of France."
Judge: "Score: 5/5 - Correct, concise, directly answers question"
Why LLM-as-Judge?
Human Eval is Slow and Expensive
Comparison:
Human evaluation:
- 100 answers × 5 min each = 500 min = 8.3 hours
- Cost: $20/hour × 8.3 hours = $166

LLM-as-judge:
- 100 answers × 2 sec each = 200 sec = 3.3 min
- Cost: 100 × $0.01 = $1
Need to Evaluate Thousands of Outputs
Scale:
Development: Test 1000+ variations
Production: Evaluate millions of responses
Human eval: Impossible at this scale
LLM-judge: Feasible
Research Shows High Correlation with Human Judgment
Studies:
- GPT-4 as judge correlates 0.8+ with human ratings
- Works well for subjective quality (fluency, helpfulness)
- Less reliable for factual correctness
Enables Continuous Evaluation
Workflow:
Every response → LLM judge → Score logged → Dashboard

Detect regressions in real time.
When to Use LLM-as-Judge
Subjective Quality (Fluency, Relevance, Helpfulness)
Good Use Cases:
- Is this answer helpful?
- Is this text fluent and natural?
- Is this response relevant to the question?
- Is this summary coherent?
Complex Rubrics (Multi-Criteria)
Example:
Evaluate on:
1. Accuracy (1-5)
2. Completeness (1-5)
3. Clarity (1-5)
4. Tone (1-5)

An LLM can handle multi-dimensional evaluation.
Large-Scale Evaluation
When:
Need to evaluate 1000+ examples
Human eval too slow/expensive
Rapid Iteration
Development:
Test 10 prompt variations
Evaluate each on 100 examples

LLM-judge: Minutes
Human eval: Days
When NOT to Use LLM-as-Judge
Objective Correctness (Factual Answers)
Problem:
Question: "What is 2+2?" Answer: "5" LLM judge might say: "The answer is clear and confident" (wrong!) Better: Exact match or computation
Mathematical Reasoning (Verify with Computation)
Better Approach:
Execute code to verify the answer
Not: Ask the LLM if the math is correct
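As a minimal sketch of this idea (helper name hypothetical), a numeric answer can be parsed out of the model's text and checked against a computed value:

```python
import re

def verify_numeric_answer(answer_text, expected, tol=1e-6):
    """Check a numeric answer by computation, not by LLM judgment."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_text)
    if not numbers:
        return False
    # Compare the last number in the answer against the computed value
    return abs(float(numbers[-1]) - expected) <= tol

# "What is 2+2?" -> compute the expected value, then compare
assert verify_numeric_answer("The answer is 4.", expected=2 + 2)
assert not verify_numeric_answer("The answer is 5.", expected=2 + 2)
```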
Code Correctness (Run Tests)
Better Approach:
Run unit tests
Check if the code compiles
Not: Ask the LLM if the code is correct
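For example, a minimal sketch that trusts the test runner's exit code instead of a judge (the test path is a placeholder):

```python
import subprocess

def code_passes_tests(test_path="tests/"):
    """Run the unit tests; a zero exit code means they all passed."""
    result = subprocess.run(
        ["pytest", test_path, "-q"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```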
Safety-Critical (Use Human Evaluation)
Examples:
- Medical advice
- Legal guidance
- Financial recommendations

→ Always use human experts
Judge Model Selection
GPT-4 (Most Commonly Used)
Pros:
- High quality judgments
- Good correlation with humans
- Widely tested
Cons:
- Expensive ($0.03/1K tokens)
- Can be slow
Claude Sonnet 4 (Excellent Reasoning)
Pros:
- Excellent reasoning
- Good for complex evaluations
- Fast
Cons:
- Expensive
- Less tested than GPT-4
GPT-3.5 (Cheaper, Less Accurate)
Pros:
- Cheap ($0.001/1K tokens)
- Fast
Cons:
- Less accurate
- More biased
Open-Source (Llama, Mixtral)
Pros:
- Free (if self-hosted)
- Privacy (on-prem)
Cons:
- Lower quality
- Requires infrastructure
Judge Prompt Patterns
Single-Answer Grading
Pattern:
```
You are evaluating an AI assistant's response.

Question: {question}
Answer: {answer}

Rate the answer on a scale of 1-5:
1 = Poor
5 = Excellent

Consider:
- Accuracy
- Relevance
- Completeness

Score:
```
Example:
```python
def single_answer_grading(question, answer):
    prompt = f"""
You are evaluating an AI assistant's response.

Question: {question}
Answer: {answer}

Rate the answer on a scale of 1-5:
1 = Poor (incorrect, irrelevant, or incomplete)
5 = Excellent (correct, relevant, and complete)

Provide:
- Score (1-5)
- Brief reasoning

Format:
Score: [number]
Reasoning: [explanation]
"""
    response = llm.generate(prompt)
    score = extract_score(response)
    return score
```
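The snippet above leaves `extract_score` undefined; a minimal regex-based sketch, assuming the judge follows the requested `Score: [number]` format:

```python
import re

def extract_score(response):
    """Parse 'Score: 4' or 'Score: 4.5' out of the judge's response."""
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", response)
    return float(match.group(1)) if match else None
```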
Pairwise Comparison (A vs B)
Pattern:
```
Which answer is better?

Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Which is better? A or B? Explain why.
```
More Reliable:
Pairwise comparison reduces absolute-scoring bias.
Humans also find comparisons easier than absolute ratings.
Example:
```python
def pairwise_comparison(question, answer_a, answer_b):
    prompt = f"""
Question: {question}

Answer A: {answer_a}
Answer B: {answer_b}

Which answer is better? A or B?

Consider:
- Accuracy
- Relevance
- Clarity

Respond with:
- Winner: A or B
- Reasoning: Why is it better?

Format:
Winner: [A or B]
Reasoning: [explanation]
"""
    response = llm.generate(prompt)
    winner = extract_winner(response)
    return winner
```
Aggregate via Elo Ratings:
```python
# After many pairwise comparisons,
# calculate an Elo rating for each model.
# Higher Elo = better model.
```
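For reference, a minimal sketch of the standard Elo update applied after each pairwise verdict (the K-factor of 32 and the 1000 starting rating are conventional defaults, not from this guide):

```python
def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Standard Elo update after one pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Start every model at 1000 and fold in each judge verdict
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], a_won=True
)
```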
Multi-Aspect Evaluation (Rubric)
Pattern:
```
Evaluate on multiple criteria:
1. Accuracy (1-5)
2. Relevance (1-5)
3. Completeness (1-5)
4. Clarity (1-5)

Score each separately
```
Example:
```python
def multi_aspect_evaluation(question, answer):
    prompt = f"""
Question: {question}
Answer: {answer}

Evaluate on these criteria (1-5 scale):

1. Accuracy: Is the information correct?
   1 = Incorrect, 5 = Perfectly accurate
2. Relevance: Does it answer the question?
   1 = Irrelevant, 5 = Highly relevant
3. Completeness: Does it cover all aspects?
   1 = Incomplete, 5 = Comprehensive
4. Clarity: Is it clear and well-written?
   1 = Confusing, 5 = Very clear

Provide scores and brief reasoning for each.

Format:
Accuracy: [score] - [reasoning]
Relevance: [score] - [reasoning]
Completeness: [score] - [reasoning]
Clarity: [score] - [reasoning]
Overall: [average score]
"""
    response = llm.generate(prompt)
    scores = extract_scores(response)
    return scores
```
Chain-of-Thought Judging
Pattern:
First, explain your reasoning.
Then, provide the score.

This increases reliability.
Example:
```python
def cot_judging(question, answer):
    prompt = f"""
Question: {question}
Answer: {answer}

Evaluate this answer step by step:

Step 1: Is the answer factually correct?
Step 2: Does it fully address the question?
Step 3: Is it clear and well-written?

Based on your analysis, rate the answer (1-5).

Format:
Step 1: [analysis]
Step 2: [analysis]
Step 3: [analysis]
Final Score: [number]
"""
    response = llm.generate(prompt)
    # Return the full response so the reasoning steps are kept alongside the score
    return response
```
Judge Prompt Template
Comprehensive Template:
```
You are an expert evaluator assessing AI assistant responses.

Question: {question}
Answer: {answer}
{optional: Ground Truth: {ground_truth}}
{optional: Context: {context}}

Evaluate the answer on these criteria:

1. **Accuracy** (1-5): Is the information factually correct?
   - 1 = Completely incorrect
   - 3 = Partially correct
   - 5 = Fully correct

2. **Relevance** (1-5): Does it address the question?
   - 1 = Completely irrelevant
   - 3 = Partially relevant
   - 5 = Directly addresses question

3. **Completeness** (1-5): Does it cover all aspects?
   - 1 = Missing most information
   - 3 = Covers some aspects
   - 5 = Comprehensive

4. **Clarity** (1-5): Is it clear and well-written?
   - 1 = Confusing or poorly written
   - 3 = Acceptable clarity
   - 5 = Very clear and well-written

Provide:
- Score for each criterion (1-5)
- Brief reasoning for each score
- Overall score (average of all criteria)

Format:
Accuracy: [score] - [reasoning]
Relevance: [score] - [reasoning]
Completeness: [score] - [reasoning]
Clarity: [score] - [reasoning]
Overall: [average score]
```
Judge Calibration
Compare Judge Scores to Human Scores
Process:
1. Get 100 examples
2. Human annotators rate each (1-5)
3. LLM judge rates each (1-5)
4. Calculate correlation
Correlation:
```python
from scipy.stats import pearsonr

human_scores = [4, 5, 3, 4, 2, ...]
judge_scores = [4.2, 4.8, 3.1, 4.5, 2.3, ...]

correlation, p_value = pearsonr(human_scores, judge_scores)
print(f"Correlation: {correlation:.2f}")

# Target: >0.7 (good correlation)
# If <0.7: Adjust prompt or use a different judge
```
Calculate Correlation
See above
Adjust Prompt if Low Correlation
If correlation <0.7:
1. Analyze disagreements (where the judge differs from humans)
2. Update the prompt to address the issues
3. Re-test correlation
4. Iterate until >0.7
Test on Multiple Examples
Validation Set:
Use 100-500 examples with human ratings.
Ensure the set is diverse (easy cases, hard cases, edge cases).
Reducing Judge Bias
Position Bias (Favors First Option in A/B)
Problem:
The judge tends to prefer Answer A over Answer B, even when B is better.
Mitigation:
```python
# Randomize order to cancel out position bias
import random

if random.random() < 0.5:
    winner = compare(question, answer_a, answer_b)
else:
    winner = compare(question, answer_b, answer_a)
    winner = "A" if winner == "B" else "B"  # Flip back to original labels
```
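A stricter variant (a sketch, not from this guide) judges both orders and only accepts the verdict when the two runs agree, treating disagreement as a tie:

```python
def debiased_comparison(question, answer_a, answer_b):
    """Judge in both orders; keep the verdict only if the two runs agree."""
    first = compare(question, answer_a, answer_b)   # A shown first
    second = compare(question, answer_b, answer_a)  # B shown first
    second = "A" if second == "B" else "B"          # Map back to original labels
    return first if first == second else "tie"     # Disagreement -> tie
```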
Length Bias (Favors Longer Answers)
Problem:
The judge tends to prefer longer answers, even if the shorter answer is better.
Mitigation:
Prompt: "Do not favor longer answers. Concise answers can be better." Or: Normalize scores by length
Self-Preference Bias (Favors Own Outputs)
Problem:
GPT-4 as judge tends to prefer GPT-4 outputs over Claude outputs.
Mitigation:
Use an external judge (e.g., Claude to judge GPT-4 outputs)

Or: Blind evaluation (don't reveal which model produced each answer)
Multi-Judge Ensemble
Use Multiple Judges (GPT-4 + Claude)
Approach:
```python
def multi_judge_ensemble(question, answer):
    # Judge 1: GPT-4
    score_gpt4 = gpt4_judge(question, answer)

    # Judge 2: Claude
    score_claude = claude_judge(question, answer)

    # Judge 3: GPT-3.5 (cheaper, as tiebreaker)
    score_gpt35 = gpt35_judge(question, answer)

    return {
        "gpt4": score_gpt4,
        "claude": score_claude,
        "gpt35": score_gpt35,
    }
```
Aggregate Scores (Majority Vote, Average)
Majority Vote:
```python
scores = [4, 5, 4]  # Three judges
majority = max(set(scores), key=scores.count)  # 4
```
Average:
```python
scores = [4.2, 4.8, 4.5]
average = sum(scores) / len(scores)  # 4.5
```
Weighted Average:
scores = {"gpt4": 4.8, "claude": 4.5, "gpt35": 4.0} weights = {"gpt4": 0.5, "claude": 0.4, "gpt35": 0.1} weighted_avg = sum(scores[j] * weights[j] for j in scores) # 4.58
Increases Reliability
Why:
A single judge can be wrong.
Multiple judges reduce variance.
An ensemble is more robust.
Cost Optimization
Use Cheaper Judge for Initial Filtering
Two-Stage:
Stage 1: GPT-3.5 judge (cheap, fast)
- Filter out clearly bad answers (score <3)

Stage 2: GPT-4 judge (expensive, accurate)
- Evaluate borderline cases (score 3-4)
Use Expensive Judge for Borderline Cases
See above
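A minimal sketch of this cascade, reusing the hypothetical `gpt35_judge` and `gpt4_judge` functions from the ensemble example:

```python
def two_stage_judge(question, answer):
    """Cheap judge first; escalate only borderline cases to the expensive judge."""
    cheap_score = gpt35_judge(question, answer)
    if cheap_score < 3 or cheap_score > 4:
        # Clearly bad or clearly good: trust the cheap judge
        return cheap_score
    # Borderline (3-4): pay for the accurate judge
    return gpt4_judge(question, answer)
```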
Cache Judge Results
Caching:
```python
import hashlib

cache = {}

def cached_judge(question, answer):
    # Create cache key from the question-answer pair
    key = hashlib.md5(f"{question}{answer}".encode()).hexdigest()

    # Check cache
    if key in cache:
        return cache[key]

    # Call judge
    score = llm_judge(question, answer)

    # Cache result
    cache[key] = score
    return score
```
Judge Evaluation Frameworks
G-Eval (Using GPT-4)
Paper: "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"
Approach:
Use GPT-4 to generate evaluation criteria.
Then use GPT-4 to evaluate based on those criteria.
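A rough sketch of that two-step flow (prompts paraphrased; not the paper's exact implementation):

```python
def g_eval_style(question, answer, task="question answering"):
    # Step 1: have the LLM draft task-specific criteria
    criteria = llm.generate(
        f"List 3-5 evaluation criteria for judging a {task} response, "
        "one per line."
    )
    # Step 2: evaluate the answer against those criteria
    return llm.generate(
        f"Question: {question}\nAnswer: {answer}\n\n"
        f"Evaluate the answer against these criteria:\n{criteria}\n\n"
        "Give a 1-5 score per criterion, then 'Overall: [number]'."
    )
```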
Prometheus (Using Llama)
Open-Source Judge:
Fine-tuned Llama model for evaluation
Free to use
Lower quality than GPT-4, but no API costs
Custom Implementation
See examples throughout this document
Metrics to Track
Judge-Human Correlation
Target: >0.7
Calculation:
```python
from scipy.stats import pearsonr

correlation, p_value = pearsonr(human_scores, judge_scores)
```
Inter-Judge Agreement (If Multiple Judges)
Kappa Score:
```python
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(judge1_scores, judge2_scores)
# >0.7 = good agreement
```
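Since 1-5 ratings are ordinal, quadratic-weighted kappa, which penalizes large disagreements more heavily than near-misses, is often a better fit; scikit-learn supports it via the `weights` parameter:

```python
from sklearn.metrics import cohen_kappa_score

# Quadratic weighting: a 4-vs-5 disagreement counts as milder than 1-vs-5
kappa_w = cohen_kappa_score(judge1_scores, judge2_scores, weights="quadratic")
```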
Judge Consistency (Same Input → Same Output)
Test:
```python
import numpy as np

# Evaluate the same example 10 times
scores = [judge(question, answer) for _ in range(10)]

# Calculate variance
variance = np.var(scores)
# Low variance = consistent judge
```
Real-World Judge Use Cases
RAG Answer Evaluation
See RAG Evaluation skill
Chatbot Response Quality
Criteria:
- Helpfulness
- Relevance
- Safety
- Tone
Content Moderation
Criteria:
- Toxicity
- Hate speech
- Misinformation
- Spam
Translation Quality
Criteria:
- Accuracy
- Fluency
- Preserves meaning
Summarization Quality
Criteria:
- Completeness
- Conciseness
- Accuracy
Limitations
Judge Can Be Wrong (Validate with Humans)
Always:
Spot-check judge results with human evaluation.
Don't blindly trust the judge.
Expensive (API Costs)
Cost:
1,000 evaluations × $0.01 = $10
10,000 evaluations × $0.01 = $100

Costs can add up quickly.
Judge Bias (Needs Careful Prompting)
See "Reducing Judge Bias" section
Not Suitable for All Tasks
See "When NOT to Use" section
Implementation
Judge Prompt Templates
See "Judge Prompt Template" section
Multi-Judge Aggregation
See "Multi-Judge Ensemble" section
Calibration Scripts
```python
from scipy.stats import pearsonr

def calibrate_judge(judge_fn, test_set):
    """
    test_set: List of (question, answer, human_score) tuples
    """
    judge_scores = []
    human_scores = []

    for question, answer, human_score in test_set:
        judge_score = judge_fn(question, answer)
        judge_scores.append(judge_score)
        human_scores.append(human_score)

    correlation, p_value = pearsonr(human_scores, judge_scores)

    return {
        "correlation": correlation,
        "p_value": p_value,
        "judge_scores": judge_scores,
        "human_scores": human_scores,
    }
```
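A hypothetical usage example (the test set here is illustrative), wiring in the `single_answer_grading` judge defined earlier:

```python
test_set = [
    ("What is the capital of France?", "Paris is the capital of France.", 5),
    ("What is 2+2?", "5", 1),
    # ... more (question, answer, human_score) triples
]

report = calibrate_judge(single_answer_grading, test_set)
print(f"Judge-human correlation: {report['correlation']:.2f}")
```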
Summary
Quick Reference
LLM-as-Judge: Use LLMs to evaluate other LLM outputs
Why:
- Fast and cheap vs human eval
- Scales to thousands of examples
- High correlation with humans (>0.8)
When to Use:
- Subjective quality
- Complex rubrics
- Large-scale evaluation
When NOT:
- Objective correctness
- Math/code (use computation)
- Safety-critical (use humans)
Judge Models:
- GPT-4 (best quality)
- Claude (excellent reasoning)
- GPT-3.5 (cheaper)
- Open-source (free but lower quality)
Prompt Patterns:
- Single-answer grading
- Pairwise comparison (more reliable)
- Multi-aspect (rubric)
- Chain-of-thought (increases reliability)
Bias Reduction:
- Position bias: Randomize order
- Length bias: Normalize or prompt
- Self-preference: External judge
Multi-Judge:
- Use multiple judges
- Aggregate (majority vote, average)
- Increases reliability
Cost Optimization:
- Cheap judge for filtering
- Expensive judge for borderline
- Cache results
Calibration:
- Compare to human scores
- Target correlation >0.7
- Adjust prompt if low
Limitations:
- Can be wrong (validate with humans)
- Expensive (API costs)
- Biased (careful prompting)