ai-design-components · evaluating-llms

Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.

install
source · Clone the upstream repo
git clone https://github.com/ancoleman/ai-design-components
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ancoleman/ai-design-components "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/evaluating-llms" ~/.claude/skills/ancoleman-ai-design-components-evaluating-llms && rm -rf "$T"
manifest: skills/evaluating-llms/SKILL.md
source content

LLM Evaluation

Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.

When to Use This Skill

Apply this skill when:

  • Testing individual prompts for correctness and formatting
  • Validating RAG (Retrieval-Augmented Generation) pipeline quality
  • Measuring hallucinations, bias, or toxicity in LLM outputs
  • Comparing different models or prompt configurations (A/B testing)
  • Running benchmark tests (MMLU, HumanEval) to assess model capabilities
  • Setting up production monitoring for LLM applications
  • Integrating LLM quality checks into CI/CD pipelines

Common triggers:

  • "How do I test if my RAG system is working correctly?"
  • "How can I measure hallucinations in LLM outputs?"
  • "What metrics should I use to evaluate generation quality?"
  • "How do I compare GPT-4 vs Claude for my use case?"
  • "How do I detect bias in LLM responses?"

Evaluation Strategy Selection

Decision Framework: Which Evaluation Approach?

By Task Type:

| Task Type | Primary Approach | Metrics | Tools |
|---|---|---|---|
| Classification (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| Generation (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, quality rubric | GPT-4/Claude for judging |
| Question Answering | Exact match + semantic similarity | EM, F1, cosine similarity | Custom evaluators |
| RAG Systems | RAGAS framework | Faithfulness, answer/context relevance | RAGAS library |
| Code Generation | Unit tests + execution | Pass@k, test pass rate | HumanEval, pytest |
| Multi-step Agents | Task completion + tool accuracy | Success rate, efficiency | Custom evaluators |

By Volume and Cost:

| Samples | Speed | Cost | Recommended Approach |
|---|---|---|---|
| 1,000+ | Immediate | $0 | Automated metrics (regex, JSON validation) |
| 100-1,000 | Minutes | $0.01-0.10 each | LLM-as-judge (GPT-4, Claude) |
| < 100 | Hours | $1-10 each | Human evaluation (pairwise comparison) |

Layered Approach (Recommended for Production):

  1. Layer 1: Automated metrics for all outputs (fast, cheap)
  2. Layer 2: LLM-as-judge for 10% sample (nuanced quality)
  3. Layer 3: Human review for 1% edge cases (validation)
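
A minimal sketch of this layered pipeline; passes_automated_checks and judge_score are hypothetical helpers standing in for the Layer 1 and Layer 2 checks described above:

import random

def evaluate_batch(outputs: list[dict]) -> list[dict]:
    """Route each output through progressively more expensive checks."""
    for record in outputs:
        # Layer 1: automated metrics on every output (fast, cheap)
        record["automated_pass"] = passes_automated_checks(record["response"])  # hypothetical helper

        # Layer 2: LLM-as-judge on a ~10% random sample (nuanced quality)
        if random.random() < 0.10:
            record["judge_score"] = judge_score(record["prompt"], record["response"])  # hypothetical helper

        # Layer 3: route ~1% (plus low judge scores) to human review
        record["human_review"] = random.random() < 0.01 or record.get("judge_score", 5) <= 2
    return outputs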

Core Evaluation Patterns

Unit Evaluation (Individual Prompts)

Test single prompt-response pairs for correctness.

Methods:

  • Exact Match: Response exactly matches expected output
  • Regex Matching: Response follows expected pattern
  • JSON Schema Validation: Structured output validation
  • Keyword Presence: Required terms appear in response
  • LLM-as-Judge: Binary pass/fail using evaluation prompt

Example Use Cases:

  • Email classification (spam/not spam)
  • Entity extraction (dates, names, locations)
  • JSON output formatting validation
  • Sentiment analysis (positive/negative/neutral)

Quick Start (Python):

import pytest
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify sentiment as positive, negative, or neutral. Return only the label."},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

def test_positive_sentiment():
    result = classify_sentiment("I love this product!")
    assert result == "positive"
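
The quick start above covers exact match. A sketch of the regex and JSON schema methods, assuming the third-party jsonschema package (pip install jsonschema):

import json
import re
import jsonschema

def check_date_format(response: str) -> bool:
    # Regex matching: require an ISO date (YYYY-MM-DD) somewhere in the response
    return re.search(r"\b\d{4}-\d{2}-\d{2}\b", response) is not None

ENTITY_SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "date": {"type": "string"}},
    "required": ["name", "date"],
}

def check_json_output(response: str) -> bool:
    # JSON schema validation: structured output must parse and conform
    try:
        jsonschema.validate(json.loads(response), ENTITY_SCHEMA)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False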

For complete unit evaluation examples, see examples/python/unit_evaluation.py and examples/typescript/unit-evaluation.ts.

RAG (Retrieval-Augmented Generation) Evaluation

Evaluate RAG systems using RAGAS framework metrics.

Critical Metrics (Priority Order):

  1. Faithfulness (Target: > 0.8) - MOST CRITICAL

    • Measures: Is the answer grounded in retrieved context?
    • Prevents hallucinations
    • If failing: Adjust prompt to emphasize grounding, require citations
  2. Answer Relevance (Target: > 0.7)

    • Measures: How well does the answer address the query?
    • If failing: Improve prompt instructions, add few-shot examples
  3. Context Relevance (Target: > 0.7)

    • Measures: Are retrieved chunks relevant to the query?
    • If failing: Improve retrieval (better embeddings, hybrid search)
  4. Context Precision (Target: > 0.5)

    • Measures: Are relevant chunks ranked higher than irrelevant?
    • If failing: Add re-ranking step to retrieval pipeline
  5. Context Recall (Target: > 0.8)

    • Measures: Are all relevant chunks retrieved?
    • If failing: Increase retrieval count, improve chunking strategy

Quick Start (Python with RAGAS):

from ragas import evaluate
# Metric names vary across ragas versions; in newer releases context_relevancy
# has been replaced by context_precision and context_recall.
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris"]
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(f"Faithfulness: {results['faithfulness']:.2f}")

For comprehensive RAG evaluation patterns, see references/rag-evaluation.md and examples/python/ragas_example.py.

LLM-as-Judge Evaluation

Use powerful LLMs (GPT-4, Claude Opus) to evaluate other LLM outputs.

When to Use:

  • Generation quality assessment (summaries, creative writing)
  • Nuanced evaluation criteria (tone, clarity, helpfulness)
  • Custom rubrics for domain-specific tasks
  • Medium-volume evaluation (100-1,000 samples)

Correlation with Human Judgment: 0.75-0.85 for well-designed rubrics

Best Practices:

  • Use clear, specific rubrics (1-5 scale with detailed criteria)
  • Include few-shot examples in evaluation prompt
  • Average multiple evaluations to reduce variance
  • Be aware of biases (position bias, verbosity bias, self-preference)

Quick Start (Python):

import re

from openai import OpenAI

client = OpenAI()

def evaluate_quality(prompt: str, response: str) -> tuple[int, str]:
    """Returns (score 1-5, reasoning)"""
    eval_prompt = f"""
Rate the following LLM response on relevance and helpfulness.

USER PROMPT: {prompt}
LLM RESPONSE: {response}

Provide:
Score: [1-5, where 5 is best]
Reasoning: [1-2 sentences]
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0.3
    )
    content = result.choices[0].message.content
    # Parse with regex so preamble or blank lines don't break extraction
    score_match = re.search(r"Score:\s*([1-5])", content)
    reasoning_match = re.search(r"Reasoning:\s*(.+)", content)
    if not score_match or not reasoning_match:
        raise ValueError(f"Could not parse judge output: {content!r}")
    return int(score_match.group(1)), reasoning_match.group(1).strip()
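
To reduce variance, as recommended above, average several judge runs; this wrapper reuses the evaluate_quality function from the quick start:

def evaluate_quality_averaged(prompt: str, response: str, runs: int = 3) -> float:
    # Average multiple judge calls to smooth out sampling variance
    scores = [evaluate_quality(prompt, response)[0] for _ in range(runs)]
    return sum(scores) / len(scores)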

For detailed LLM-as-judge patterns and prompt templates, see references/llm-as-judge.md and examples/python/llm_as_judge.py.

Safety and Alignment Evaluation

Measure hallucinations, bias, and toxicity in LLM outputs.

Hallucination Detection

Methods:

  1. Faithfulness to Context (RAG):

    • Use RAGAS faithfulness metric
    • LLM checks if claims are supported by context
    • Score: Supported claims / Total claims
  2. Factual Accuracy (Closed-Book):

    • LLM-as-judge with access to reliable sources
    • Fact-checking APIs (Google Fact Check)
    • Entity-level verification (dates, names, statistics)
  3. Self-Consistency:

    • Generate multiple responses to same question
    • Measure agreement between responses
    • Low consistency suggests hallucination
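
A minimal self-consistency sketch, assuming an OpenAI client; it uses pairwise exact agreement as a crude proxy (production checkers typically compare claims semantically), so it works best for short factual answers:

from itertools import combinations
from openai import OpenAI

client = OpenAI()

def self_consistency(question: str, n: int = 5) -> float:
    """Return the fraction of response pairs that agree exactly."""
    responses = []
    for _ in range(n):
        result = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": question}],
            temperature=0.8,  # deliberately sample diverse answers
        )
        responses.append(result.choices[0].message.content.strip().lower())
    pairs = list(combinations(responses, 2))
    # Low agreement across samples suggests the model is guessing (hallucinating)
    return sum(1 for a, b in pairs if a == b) / len(pairs)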

Bias Evaluation

Types of Bias:

  • Gender bias (stereotypical associations)
  • Racial/ethnic bias (discriminatory outputs)
  • Cultural bias (Western-centric assumptions)
  • Age/disability bias (ableist or ageist language)

Evaluation Methods:

  1. Stereotype Tests:

    • BBQ (Bias Benchmark for QA): 58,000 question-answer pairs
    • BOLD (Bias in Open-Ended Language Generation)
  2. Counterfactual Evaluation:

    • Generate responses with demographic swaps
    • Example: "Dr. Smith (he/she) recommended..." → compare outputs
    • Measure consistency across variations
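
A counterfactual-swap sketch; generate is a hypothetical wrapper around your model call, and the template is illustrative:

def counterfactual_bias_check(template: str, variants: list[str]) -> dict[str, str]:
    # Fill the same template with each demographic variant and collect outputs;
    # generate(prompt) -> str is a hypothetical model wrapper
    return {v: generate(template.format(person=v)) for v in variants}

# Outputs should be substantively consistent across variants
outputs = counterfactual_bias_check(
    "Dr. Smith ({person}) recommended a treatment plan. Summarize their expertise.",
    variants=["he", "she", "they"],
)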

Toxicity Detection

Tools:

  • Perspective API (Google): Toxicity, threat, insult scores
  • Detoxify (HuggingFace): Open-source toxicity classifier
  • OpenAI Moderation API: Hate, harassment, violence detection
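
A toxicity-check sketch using Detoxify (pip install detoxify); scores are per-category probabilities in [0, 1]:

from detoxify import Detoxify

# Loads a pretrained classifier (downloads weights on first use)
model = Detoxify("original")

scores = model.predict("You are a wonderful person.")
print(scores["toxicity"])  # near 0 for benign text

# Flag responses above a tuned threshold for human review
is_toxic = scores["toxicity"] > 0.5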

For comprehensive safety evaluation patterns, see references/safety-evaluation.md.

Benchmark Testing

Assess model capabilities using standardized benchmarks.

Standard Benchmarks:

| Benchmark | Coverage | Format | Difficulty | Use Case |
|---|---|---|---|---|
| MMLU | 57 subjects (STEM, humanities) | Multiple choice | High school to professional | General intelligence |
| HellaSwag | Sentence completion | Multiple choice | Common sense | Reasoning validation |
| GPQA | PhD-level science | Multiple choice | Very high (expert-level) | Frontier model testing |
| HumanEval | 164 Python problems | Code generation | Medium | Code capability |
| MATH | 12,500 competition problems | Math solving | High school competitions | Math reasoning |

Domain-Specific Benchmarks:

  • Medical: MedQA (USMLE), PubMedQA
  • Legal: LegalBench
  • Finance: FinQA, ConvFinQA

When to Use Benchmarks:

  • Comparing multiple models (GPT-4 vs Claude vs Llama)
  • Model selection for specific domains
  • Baseline capability assessment
  • Academic research and publication

Quick Start (lm-evaluation-harness):

pip install lm-eval

# Evaluate GPT-4 on MMLU with 5-shot prompting (the backend name varies by
# lm-eval version; recent releases register it as openai-chat-completions)
lm_eval --model openai-chat-completions --model_args model=gpt-4 --tasks mmlu --num_fewshot 5

For detailed benchmark testing patterns, see references/benchmarks.md and scripts/benchmark_runner.py.

Production Evaluation

Monitor and optimize LLM quality in production environments.

A/B Testing

Compare two LLM configurations:

  • Variant A: GPT-4 (expensive, high quality)
  • Variant B: Claude Sonnet (cheaper, fast)

Metrics:

  • User satisfaction scores (thumbs up/down)
  • Task completion rates
  • Response time and latency
  • Cost per successful interaction
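
A minimal A/B routing sketch; call_model and log_interaction are hypothetical, and assignment is hashed on user ID so each user keeps a stable variant:

import hashlib

VARIANTS = {"A": "gpt-4", "B": "claude-sonnet"}  # illustrative identifiers

def assign_variant(user_id: str) -> str:
    # Deterministic bucketing: the same user always gets the same variant
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    return "A" if bucket == 0 else "B"

def handle_request(user_id: str, prompt: str) -> str:
    variant = assign_variant(user_id)
    response = call_model(VARIANTS[variant], prompt)     # hypothetical dispatcher
    log_interaction(user_id, variant, prompt, response)  # hypothetical logger
    return response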

Online Evaluation

Real-time quality monitoring:

  • Response Quality: LLM-as-judge scoring every Nth response
  • User Feedback: Explicit ratings, thumbs up/down
  • Business Metrics: Conversion rates, support ticket resolution
  • Cost Tracking: Tokens used, inference costs

Human-in-the-Loop

Sample-based human evaluation:

  • Random Sampling: Evaluate 10% of responses
  • Confidence-Based: Evaluate low-confidence outputs
  • Error-Triggered: Flag suspicious responses for review

For production evaluation patterns and monitoring strategies, see references/production-evaluation.md.

Classification Task Evaluation

For tasks with discrete outputs (sentiment, intent, category).

Metrics:

  • Accuracy: Correct predictions / Total predictions
  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1 Score: Harmonic mean of precision and recall
  • Confusion Matrix: Detailed breakdown of prediction errors

Quick Start (Python):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "neutral", "neutral", "negative"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
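
The confusion matrix listed above can be added to the same snippet:

from sklearn.metrics import confusion_matrix

labels = ["positive", "negative", "neutral"]
# Rows are true labels, columns are predictions, ordered by `labels`
print(confusion_matrix(y_true, y_pred, labels=labels))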

For complete classification evaluation examples, see examples/python/classification_metrics.py.

Generation Task Evaluation

For open-ended text generation (summaries, creative writing, responses).

Automated Metrics (Use with Caution):

  • BLEU: N-gram overlap with reference text (0-1 score)
  • ROUGE: Recall-oriented overlap (ROUGE-1, ROUGE-L)
  • METEOR: Semantic similarity with stemming
  • BERTScore: Contextual embedding similarity (0-1 score)

Limitation: Automated metrics correlate weakly with human judgment for creative/subjective generation.
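
A minimal ROUGE sketch, assuming Google's rouge-score package (pip install rouge-score):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The cat sat on the mat.",        # reference text
    "A cat was sitting on the mat.",  # candidate text
)
print(scores["rougeL"].fmeasure)  # 0-1; higher means more overlap with the reference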

Recommended Approach:

  1. Automated metrics: Fast feedback for objective aspects (length, format)
  2. LLM-as-judge: Nuanced quality assessment (relevance, coherence, helpfulness)
  3. Human evaluation: Final validation for subjective criteria (preference, creativity)

For detailed generation evaluation patterns, see references/evaluation-types.md.

Quick Reference Tables

Evaluation Framework Selection

| If Task Is... | Use This Framework | Primary Metric |
|---|---|---|
| RAG system | RAGAS | Faithfulness > 0.8 |
| Classification | scikit-learn metrics | Accuracy, F1 |
| Generation quality | LLM-as-judge | Quality rubric (1-5) |
| Code generation | HumanEval | Pass@1, test pass rate |
| Model comparison | Benchmark testing | MMLU, HellaSwag scores |
| Safety validation | Hallucination detection | Faithfulness, fact-check |
| Production monitoring | Online evaluation | User feedback, business KPIs |

Python Library Recommendations

| Library | Use Case | Installation |
|---|---|---|
| RAGAS | RAG evaluation | pip install ragas |
| DeepEval | General LLM evaluation, pytest integration | pip install deepeval |
| LangSmith | Production monitoring, A/B testing | pip install langsmith |
| lm-eval | Benchmark testing (MMLU, HumanEval) | pip install lm-eval |
| scikit-learn | Classification metrics | pip install scikit-learn |

Safety Evaluation Priority Matrix

| Application | Hallucination Risk | Bias Risk | Toxicity Risk | Evaluation Priority |
|---|---|---|---|---|
| Customer Support | High | Medium | High | 1. Faithfulness, 2. Toxicity, 3. Bias |
| Medical Diagnosis | Critical | High | Low | 1. Factual accuracy, 2. Hallucination, 3. Bias |
| Creative Writing | Low | Medium | Medium | 1. Quality/fluency, 2. Content policy |
| Code Generation | Medium | Low | Low | 1. Functional correctness, 2. Security |
| Content Moderation | Low | Critical | Critical | 1. Bias, 2. False positives/negatives |

Detailed References

For comprehensive documentation on specific topics:

  • Evaluation types (classification, generation, QA, code): references/evaluation-types.md
  • RAG evaluation deep dive (RAGAS framework): references/rag-evaluation.md
  • Safety evaluation (hallucination, bias, toxicity): references/safety-evaluation.md
  • Benchmark testing (MMLU, HumanEval, domain benchmarks): references/benchmarks.md
  • LLM-as-judge best practices and prompts: references/llm-as-judge.md
  • Production evaluation (A/B testing, monitoring): references/production-evaluation.md
  • All metrics definitions and formulas: references/metrics-reference.md

Working Examples

Python Examples:

  • examples/python/unit_evaluation.py - Basic prompt testing with pytest
  • examples/python/ragas_example.py - RAGAS RAG evaluation
  • examples/python/deepeval_example.py - DeepEval framework usage
  • examples/python/llm_as_judge.py - GPT-4 as evaluator
  • examples/python/classification_metrics.py - Accuracy, precision, recall
  • examples/python/benchmark_testing.py - HumanEval example

TypeScript Examples:

  • examples/typescript/unit-evaluation.ts - Vitest + OpenAI
  • examples/typescript/llm-as-judge.ts - GPT-4 evaluation
  • examples/typescript/langsmith-integration.ts - Production monitoring

Executable Scripts

Run evaluations without loading code into context (token-free):

  • scripts/run_ragas_eval.py - Run RAGAS evaluation on dataset
  • scripts/compare_models.py - A/B test two models
  • scripts/benchmark_runner.py - Run MMLU/HumanEval benchmarks
  • scripts/hallucination_checker.py - Detect hallucinations in outputs

Example usage:

# Run RAGAS evaluation on custom dataset
python scripts/run_ragas_eval.py --dataset data/qa_dataset.json --output results.json

# Compare GPT-4 vs Claude on benchmark
python scripts/compare_models.py --model-a gpt-4 --model-b claude-3-opus --tasks mmlu,humaneval

Integration with Other Skills

Related Skills:

  • building-ai-chat: Evaluate AI chat applications (this skill tests what that skill builds)
  • prompt-engineering: Test prompt quality and effectiveness
  • testing-strategies: Apply the testing pyramid to LLM evaluation (unit → integration → E2E)
  • observability: Production monitoring and alerting for LLM quality
  • building-ci-pipelines: Integrate LLM evaluation into CI/CD

Workflow Integration:

  1. Write prompt (use prompt-engineering skill)
  2. Unit test prompt (use llm-evaluation skill)
  3. Build AI feature (use building-ai-chat skill)
  4. Integration test RAG pipeline (use llm-evaluation skill)
  5. Deploy to production (use deploying-applications skill)
  6. Monitor quality (use llm-evaluation + observability skills)

Common Pitfalls

1. Over-reliance on Automated Metrics for Generation

  • BLEU/ROUGE correlate weakly with human judgment for creative text
  • Solution: Layer LLM-as-judge or human evaluation

2. Ignoring Faithfulness in RAG Systems

  • Hallucinations are the #1 RAG failure mode
  • Solution: Prioritize faithfulness metric (target > 0.8)

3. No Production Monitoring

  • Models can degrade over time, prompts can break with updates
  • Solution: Set up continuous evaluation (LangSmith, custom monitoring)

4. Biased LLM-as-Judge Evaluation

  • Evaluator LLMs have biases (position bias, verbosity bias)
  • Solution: Average multiple evaluations, use diverse evaluation prompts

5. Insufficient Benchmark Coverage

  • Single benchmark doesn't capture full model capability
  • Solution: Use 3-5 benchmarks across different domains

6. Missing Safety Evaluation

  • Production LLMs can generate harmful content
  • Solution: Add toxicity, bias, and hallucination checks to evaluation pipeline