Claude-skill-registry evaluating-rag

Evaluate RAG systems with hit rate, MRR, faithfulness metrics and compare retrieval strategies. Use when testing retrieval quality, generating evaluation datasets, comparing embeddings or retrievers, A/B testing, or measuring production RAG performance.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/evaluating-rag" ~/.claude/skills/majiayu000-claude-skill-registry-evaluating-rag && rm -rf "$T"
manifest: skills/data/evaluating-rag/SKILL.md
source content

Evaluating RAG Systems

Guide for measuring RAG performance, comparing strategies, and implementing continuous evaluation. Focus on key metrics and practical testing approaches.

When to Use This Skill

  • Testing retrieval quality and accuracy
  • Generating evaluation datasets for your domain
  • Comparing different retrieval strategies (vector vs BM25 vs hybrid)
  • A/B testing embedding models or rerankers
  • Measuring production RAG performance
  • Validating improvements after optimizations
  • Comparing your 7 retrieval strategies in src/ or src-iLand/

Key Evaluation Metrics

Retrieval Metrics

Hit Rate: Fraction of queries where correct answer found in top-k

  • Perfect: 1.0 (all queries found relevant docs)
  • Good: 0.85+ (85%+ queries successful)
  • Needs work: <0.70

MRR (Mean Reciprocal Rank): Quality of ranking

  • Perfect: 1.0 (relevant doc always ranked #1)
  • Good: 0.80+ (relevant doc typically in top 2-3)
  • Formula: average of 1/rank of the first relevant doc across queries (worked example below)
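
As a quick illustration of the formula (hypothetical ranks, not from a real run):

# MRR worked example: the first relevant doc appears at ranks 1, 3, and 2 for three queries
ranks = [1, 3, 2]
mrr = sum(1 / r for r in ranks) / len(ranks)  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
print(f"MRR: {mrr:.3f}")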

Response Metrics

Faithfulness: No hallucinations; the answer is grounded in the retrieved context
Correctness: Factually accurate compared to a reference answer
Relevancy: Directly addresses the query
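
Faithfulness and relevancy are demonstrated in Pattern 5 below. For correctness, LlamaIndex's CorrectnessEvaluator grades the response against a reference answer; a minimal sketch (the query, reference string, and query_engine are placeholders):

from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

# Correctness grades the response against a reference answer on a 1-5 scale
correctness_evaluator = CorrectnessEvaluator(llm=OpenAI(model="gpt-4o-mini"))

result = correctness_evaluator.evaluate(
    query="What is machine learning?",  # placeholder query
    response=str(query_engine.query("What is machine learning?")),  # query_engine assumed built elsewhere
    reference="Machine learning is the field of AI that learns patterns from data.",  # placeholder reference
)
print(f"Correctness: score={result.score}, passing={result.passing}")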

Quick Decision Guide

When to Evaluate

  • After implementing → Baseline performance
  • After optimization → Validate improvements
  • Before production → Quality gate
  • In production → Continuous monitoring

What to Measure

  • Development → Hit rate + MRR (retrieval quality)
  • Production → All metrics (retrieval + response quality)
  • A/B testing → Comparative metrics

Dataset Size

  • Quick test → 20-50 Q&A pairs
  • Thorough eval → 100-200 pairs
  • Production → 500+ pairs

Quick Start Patterns

Pattern 1: Basic Retrieval Evaluation

from llama_index.core.evaluation import RetrieverEvaluator

# Create evaluator
evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"],
    retriever=retriever
)

# Run evaluation (returns one result per query in the dataset)
eval_results = await evaluator.aevaluate_dataset(qa_dataset)

# Aggregate per-query metric values into dataset-level scores
hit_rate = sum(r.metric_vals_dict["hit_rate"] for r in eval_results) / len(eval_results)
mrr = sum(r.metric_vals_dict["mrr"] for r in eval_results) / len(eval_results)

print(f"Hit Rate: {hit_rate:.3f}")
print(f"MRR: {mrr:.3f}")

Pattern 2: Generate Evaluation Dataset

from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.llms.openai import OpenAI

# Generate Q&A pairs from your documents
llm = OpenAI(model="gpt-4o-mini")
qa_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=2
)

# Filter invalid entries (filter_qa_dataset is a custom helper, sketched below)
qa_dataset = filter_qa_dataset(qa_dataset)

# Save for reuse
qa_dataset.save_json("evaluation_dataset.json")
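
filter_qa_dataset is not a LlamaIndex function; the full version lives in reference-metrics.md. A minimal sketch of what it typically does (drop auto-generated boilerplate and keep the matching corpus entries; the boilerplate phrase and the filtering rules are assumptions, adjust for your language and domain):

from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

def filter_qa_dataset(qa_dataset: EmbeddingQAFinetuneDataset) -> EmbeddingQAFinetuneDataset:
    """Drop auto-generated entries that are boilerplate rather than real questions."""
    bad_phrases = ["Here are 2 questions based on provided context"]  # assumed LLM artifact
    queries = {
        qid: q for qid, q in qa_dataset.queries.items()
        if q.strip() and not any(p in q for p in bad_phrases)  # add e.g. q.endswith("?") if it fits your language
    }
    relevant_docs = {qid: qa_dataset.relevant_docs[qid] for qid in queries}
    return EmbeddingQAFinetuneDataset(
        queries=queries,
        corpus=qa_dataset.corpus,
        relevant_docs=relevant_docs,
    )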

Pattern 3: Compare Multiple Strategies

strategies = {
    "vector": vector_retriever,
    "bm25": bm25_retriever,
    "hybrid": hybrid_retriever,
    "metadata": metadata_retriever,
}

results = {}
for strategy_name, retriever in strategies.items():
    evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"],
        retriever=retriever
    )
    eval_results = await evaluator.aevaluate_dataset(qa_dataset)
    # Average per-query metrics into one score per strategy
    results[strategy_name] = {
        metric: sum(r.metric_vals_dict[metric] for r in eval_results) / len(eval_results)
        for metric in ("hit_rate", "mrr")
    }
    print(f"{strategy_name}: {results[strategy_name]}")

# Find best strategy by hit rate
best_strategy = max(results, key=lambda name: results[name]["hit_rate"])
print(f"\nBest strategy: {best_strategy}")

Pattern 4: Compare With/Without Reranking

# Without reranking
retriever_no_rerank = index.as_retriever(similarity_top_k=5)

# With reranking: retrieve more candidates, then rerank down to 5.
# The reranker is applied via the evaluator's node_postprocessors,
# since as_retriever() does not accept postprocessors directly.
from llama_index.postprocessor.cohere_rerank import CohereRerank
retriever_with_rerank = index.as_retriever(similarity_top_k=10)
reranker = CohereRerank(top_n=5)

# Evaluate both
results = {}
for name, retriever, postprocessors in [
    ("no_rerank", retriever_no_rerank, None),
    ("with_rerank", retriever_with_rerank, [reranker]),
]:
    evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"],
        retriever=retriever,
        node_postprocessors=postprocessors,
    )
    eval_results = await evaluator.aevaluate_dataset(qa_dataset)
    results[name] = {
        metric: sum(r.metric_vals_dict[metric] for r in eval_results) / len(eval_results)
        for metric in ("hit_rate", "mrr")
    }
    print(f"{name}: Hit Rate={results[name]['hit_rate']:.3f}, MRR={results[name]['mrr']:.3f}")

# Calculate relative improvement from reranking
improvement = (results["with_rerank"]["hit_rate"] - results["no_rerank"]["hit_rate"]) / results["no_rerank"]["hit_rate"]
print(f"Improvement: {improvement * 100:.1f}%")

Pattern 5: Response Quality Evaluation

from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator
)

# Initialize evaluators (they use Settings.llm unless an llm= argument is passed)
faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()

# Generate response
response = query_engine.query("What is machine learning?")

# Evaluate faithfulness (no hallucinations)
faithfulness_result = faithfulness_evaluator.evaluate_response(
    response=response
)
print(f"Faithfulness: {faithfulness_result.passing}")

# Evaluate relevancy
relevancy_result = relevancy_evaluator.evaluate_response(
    query="What is machine learning?",
    response=response
)
print(f"Relevancy: {relevancy_result.passing}")

Your Codebase Integration

For src/ Pipeline (7 Strategies)

Compare All Strategies:

strategies = {
    "vector": "src/10_basic_query_engine.py",
    "summary": "src/11_document_summary_retriever.py",
    "recursive": "src/12_recursive_retriever.py",
    "metadata": "src/14_metadata_filtering.py",
    "chunk_decoupling": "src/15_chunk_decoupling.py",
    "hybrid": "src/16_hybrid_search.py",
    "planner": "src/17_query_planning_agent.py",
}

# Create evaluation framework to compare all 7

Baseline Performance:

  1. Generate Q&A dataset from your documents
  2. Evaluate each strategy
  3. Identify best performer
  4. Use as baseline for improvements

For src-iLand/ Pipeline (Thai Land Deeds)

Thai-Specific Evaluation:

# Generate Thai Q&A pairs
llm = OpenAI(model="gpt-4o-mini")  # Supports Thai
qa_dataset = generate_question_context_pairs(
    thai_nodes,
    llm=llm,
    num_questions_per_chunk=2
)

# Test with Thai queries
thai_queries = [
    "โฉนดที่ดินในกรุงเทพ",  # Land deeds in Bangkok
    "นส.3 คืออะไร",  # What is NS.3
    "ที่ดินในสมุทรปราการ"  # Land in Samut Prakan
]
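
A quick smoke test over the Thai queries above before running a full evaluation (a sketch; thai_index is assumed to be built from the same Thai nodes):

retriever = thai_index.as_retriever(similarity_top_k=5)  # thai_index assumed built elsewhere
for query in thai_queries:
    nodes = retriever.retrieve(query)
    top_score = nodes[0].score if nodes else None
    print(f"{query} -> {len(nodes)} nodes, top score: {top_score}")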

Router Evaluation (src-iLand/retrieval/router.py):

  • Test index classification accuracy
  • Test strategy selection appropriateness
  • Measure end-to-end performance
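
A sketch of an index-classification accuracy check; the router API and index names here are hypothetical, so adapt them to what src-iLand/retrieval/router.py actually exposes:

# Hypothetical labeled queries: each mapped to the index the router should select
labeled_queries = [
    ("โฉนดที่ดินในกรุงเทพ", "bangkok_deeds"),   # expected index name is hypothetical
    ("นส.3 คืออะไร", "deed_type_docs"),          # expected index name is hypothetical
]

correct = 0
for query, expected_index in labeled_queries:
    selected = router.select_index(query)  # hypothetical method; use the router's real API
    correct += int(selected == expected_index)

print(f"Index classification accuracy: {correct / len(labeled_queries):.2%}")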

Fast Metadata Testing:

  • Validate <50ms response time (see the timing sketch below)
  • Test filtering accuracy
  • Compare with/without fast indexing
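
A minimal timing check against the 50ms target (a sketch; metadata_retriever and test_queries are placeholders for your fast-metadata retriever and a representative query list):

import time

latencies_ms = []
for query in test_queries:  # placeholder list of representative queries
    start = time.perf_counter()
    metadata_retriever.retrieve(query)
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p95 = latencies_ms[int(0.95 * (len(latencies_ms) - 1))]
print(f"p50: {latencies_ms[len(latencies_ms) // 2]:.1f} ms, p95: {p95:.1f} ms")
print("PASS" if p95 < 50 else "FAIL: exceeds 50 ms target")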

Detailed References

Load these when you need comprehensive details:

  • reference-metrics.md: Complete evaluation guide

    • All metrics (hit rate, MRR, faithfulness, correctness)
    • Dataset generation techniques
    • A/B testing frameworks
    • Production monitoring
    • Statistical significance testing
  • reference-agents.md: Advanced techniques

    • Agents (FunctionAgent, ReActAgent)
    • Multi-agent systems
    • Query engines (Router, SubQuestion)
    • Workflow orchestration
    • Observability and debugging

Common Workflows

Workflow 1: Create Evaluation Dataset

  • Step 1: Prepare representative documents

    • Sample from different categories
    • Include edge cases
  • Step 2: Generate Q&A pairs

    qa_dataset = generate_question_context_pairs(
        nodes, llm=llm, num_questions_per_chunk=2
    )
    
  • Step 3: Filter invalid entries

    • Remove auto-generated artifacts
    • Load reference-metrics.md for filtering code
  • Step 4: Manual review (optional)

    • Check 10-20 samples
    • Ensure question quality
  • Step 5: Save for reuse

    qa_dataset.save_json("eval_dataset.json")
    

Workflow 2: Compare Retrieval Strategies

  • Step 1: Load evaluation dataset

    from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
    qa_dataset = EmbeddingQAFinetuneDataset.from_json("eval_dataset.json")
    
  • Step 2: Define strategies to compare

    • List all retrievers to test
    • For src/: All 7 strategies
    • For src-iLand/: Router + individual strategies
  • Step 3: Run evaluation for each

    for name, retriever in strategies.items():
        results[name] = evaluate(retriever, qa_dataset)  # evaluate() is sketched after this workflow
    
  • Step 4: Compare results

    • Identify best hit rate
    • Identify best MRR
    • Consider trade-offs (latency, cost)
  • Step 5: Document findings

    • Record baseline performance
    • Note best strategies for different query types
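
The evaluate() helper used in Workflows 2 and 3 is not a library call; a minimal sketch that wraps RetrieverEvaluator and returns dataset-averaged metrics (asyncio.run is used because aevaluate_dataset is async):

import asyncio
from llama_index.core.evaluation import RetrieverEvaluator

def evaluate(retriever, qa_dataset, metrics=("hit_rate", "mrr")):
    """Run retrieval evaluation and return averaged metrics, e.g. {"hit_rate": 0.87, "mrr": 0.74}."""
    evaluator = RetrieverEvaluator.from_metric_names(list(metrics), retriever=retriever)
    eval_results = asyncio.run(evaluator.aevaluate_dataset(qa_dataset))
    return {
        metric: sum(r.metric_vals_dict[metric] for r in eval_results) / len(eval_results)
        for metric in metrics
    }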

Workflow 3: A/B Test an Optimization

  • Step 1: Measure baseline

    baseline_results = evaluate(current_retriever, qa_dataset)
    
  • Step 2: Apply optimization

    • Add reranking
    • Change embedding model
    • Adjust chunk size
    • etc.
  • Step 3: Measure optimized version

    optimized_results = evaluate(optimized_retriever, qa_dataset)
    
  • Step 4: Calculate improvement

    improvement = (optimized_results["hit_rate"] - baseline_results["hit_rate"]) / baseline_results["hit_rate"] * 100
    print(f"Hit Rate improvement: {improvement:.1f}%")
    
  • Step 5: Decide based on data

    • If improvement > 5%: Deploy
    • If improvement < 2%: Consider cost/complexity
    • If negative: Rollback

Workflow 4: Production Monitoring

  • Step 1: Create production evaluation set

    • Sample real user queries
    • Include ground truth when available
  • Step 2: Set up continuous evaluation

    class ProductionEvaluator:
        def evaluate_query(self, query, response):
            # Log metrics and track over time
            ...  # fuller sketch after this workflow
    
  • Step 3: Define alerts

    • Hit rate < 0.80 → Alert
    • MRR < 0.70 → Alert
    • Latency p95 > 2s → Alert
  • Step 4: Monitor trends

    • Daily/weekly metrics
    • Detect degradation early
  • Step 5: Iterate based on data

    • Identify failure patterns
    • Generate new test cases
    • Improve weak areas
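
A minimal sketch of the ProductionEvaluator stub from Step 2, assuming the caller measures latency and quality is tracked via an LLM-based faithfulness check on sampled traffic; hit rate / MRR alerts would additionally need the labeled query set from Step 1 and are omitted here. Thresholds mirror Step 3.

import logging
from collections import deque

from llama_index.core.evaluation import FaithfulnessEvaluator

logger = logging.getLogger("rag_monitoring")

class ProductionEvaluator:
    """Rolling-window monitor for production RAG quality (sketch)."""

    def __init__(self, window_size=200, latency_p95_target_ms=2000.0, min_faithfulness=0.90):
        self.faithfulness = FaithfulnessEvaluator()  # uses Settings.llm by default
        self.latencies_ms = deque(maxlen=window_size)
        self.faithfulness_passes = deque(maxlen=window_size)
        self.latency_p95_target_ms = latency_p95_target_ms
        self.min_faithfulness = min_faithfulness

    def evaluate_query(self, query, response, latency_ms):
        """Log metrics for one query; `response` is the query engine's Response object."""
        self.latencies_ms.append(latency_ms)
        result = self.faithfulness.evaluate_response(response=response)
        self.faithfulness_passes.append(bool(result.passing))
        self._check_alerts()

    def _check_alerts(self):
        if len(self.latencies_ms) < 20:
            return  # not enough data yet
        ordered = sorted(self.latencies_ms)
        p95 = ordered[int(0.95 * (len(ordered) - 1))]
        pass_rate = sum(self.faithfulness_passes) / len(self.faithfulness_passes)
        if p95 > self.latency_p95_target_ms:
            logger.warning("Latency p95 %.0f ms exceeds target", p95)
        if pass_rate < self.min_faithfulness:
            logger.warning("Faithfulness pass rate dropped to %.2f", pass_rate)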

Workflow 5: Evaluate All 7 Strategies (src/)

  • Step 1: Generate comprehensive dataset

    • Cover different query types
    • Factual, summarization, comparison
  • Step 2: Run each strategy

    python src/10_basic_query_engine.py  # Vector
    python src/11_document_summary_retriever.py  # Summary
    python src/12_recursive_retriever.py  # Recursive
    python src/14_metadata_filtering.py  # Metadata
    python src/15_chunk_decoupling.py  # Chunk decoupling
    python src/16_hybrid_search.py  # Hybrid
    python src/17_query_planning_agent.py  # Planner
    
  • Step 3: Collect metrics

    • Hit rate for each
    • MRR for each
    • Latency for each
  • Step 4: Create comparison table

    | Strategy | Hit Rate | MRR | Latency | Use Case     |
    |----------|----------|-----|---------|--------------|
    | Vector   | ...      | ... | ...     | General      |
    | Hybrid   | ...      | ... | ...     | Best overall |
    | ...      | ...      | ... | ...     | ...          |
  • Step 5: Document recommendations

    • Best for factual queries
    • Best for complex queries
    • Best for production (speed + quality)

Evaluation Metrics Reference

Hit Rate Interpretation

  • 1.0 → Perfect (all queries successful)
  • 0.90+ → Excellent
  • 0.80-0.89 → Good
  • 0.70-0.79 → Acceptable
  • <0.70 → Needs improvement

MRR Interpretation

  • 1.0 → Perfect ranking (relevant doc always #1)
  • 0.85+ → Excellent (relevant doc typically #1 or #2)
  • 0.70-0.84 → Good
  • 0.50-0.69 → Acceptable
  • <0.50 → Poor ranking quality

Latency Targets

  • <100ms → Excellent
  • 100-500ms → Good
  • 500ms-1s → Acceptable
  • >1s → Needs optimization

Performance Benchmarks

Embedding Model Comparison (from reference docs)

| Embedding   | Reranker           | Hit Rate | MRR   |
|-------------|--------------------|----------|-------|
| JinaAI Base | bge-reranker-large | 0.938    | 0.869 |
| JinaAI Base | CohereRerank       | 0.933    | 0.874 |
| OpenAI      | CohereRerank       | 0.927    | 0.866 |
| OpenAI      | bge-reranker-large | 0.910    | 0.856 |

Typical Improvements

  • Adding reranking: +5-15% hit rate
  • Hybrid vs vector: +3-8% hit rate
  • Optimal chunk size: +2-5% hit rate
  • Better embeddings: +3-10% hit rate

Scripts

This skill includes utility scripts in the scripts/ directory:

generate_qa_dataset.py

Generate evaluation Q&A pairs from documents:

python .claude/skills/evaluating-rag/scripts/generate_qa_dataset.py \
    --documents-dir ./data \
    --output eval_dataset.json \
    --num-questions-per-chunk 2

compare_retrievers.py

Compare multiple retrieval strategies:

python .claude/skills/evaluating-rag/scripts/compare_retrievers.py \
    --dataset eval_dataset.json \
    --strategies vector,bm25,hybrid \
    --output comparison_results.json

Outputs:

  • Hit rate and MRR for each strategy
  • Performance comparison table
  • Recommendations

run_evaluation.py

Run comprehensive evaluation:

python .claude/skills/evaluating-rag/scripts/run_evaluation.py \
    --retriever-config config.yaml \
    --dataset eval_dataset.json \
    --metrics hit_rate,mrr,faithfulness

Reports:

  • All requested metrics
  • Per-query breakdown
  • Summary statistics

Key Reminders

Dataset Quality:

  • Generate from your actual documents
  • Include diverse query types
  • Filter invalid auto-generated entries
  • Manual review recommended for critical domains

Evaluation Best Practices:

  • Start with baseline (before optimization)
  • Test one change at a time (for clear attribution)
  • Use same dataset for comparisons
  • Statistical significance matters: on small evaluation sets, differences under roughly 5% are often noise (a quick check is sketched below)
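
A quick paired bootstrap to check whether a hit-rate difference between strategies A and B is likely real (a sketch; hits_a and hits_b are per-query 0/1 hit indicators computed on the same evaluation set):

import random

def paired_bootstrap_p(hits_a, hits_b, n_resamples=2000, seed=42):
    """Approximate probability that B's hit-rate advantage over A is due to chance."""
    rng = random.Random(seed)
    n = len(hits_a)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(hits_b[i] - hits_a[i] for i in idx) / n
        not_better += diff <= 0
    return not_better / n_resamples

# Example usage: a value below ~0.05 suggests the improvement is not noise
# p = paired_bootstrap_p(hits_a, hits_b)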

Production Monitoring:

  • Continuous evaluation on sample queries
  • Track trends over time
  • Alert on degradation
  • Regular dataset refresh

For Your Pipelines:

  • src/: Compare all 7 strategies systematically
  • src-iLand/: Test with Thai queries and metadata
  • Both: Establish baselines before optimizations

Next Steps

After evaluation:

  • Optimize: Use optimizing-rag skill to improve low scores
  • Implement: Use implementing-rag skill to rebuild weak components
  • Monitor: Set up continuous evaluation in production
  • Iterate: Regular evaluation → optimization → re-evaluation cycle