Claude-skill-registry evaluating-rag
Evaluate RAG systems with hit rate, MRR, and faithfulness metrics, and compare retrieval strategies. Use when testing retrieval quality, generating evaluation datasets, comparing embeddings or retrievers, A/B testing, or measuring production RAG performance.
```bash
git clone https://github.com/majiayu000/claude-skill-registry

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/evaluating-rag" ~/.claude/skills/majiayu000-claude-skill-registry-evaluating-rag && rm -rf "$T"
```
skills/data/evaluating-rag/SKILL.md

Evaluating RAG Systems
Guide for measuring RAG performance, comparing strategies, and implementing continuous evaluation. Focus on key metrics and practical testing approaches.
When to Use This Skill
- Testing retrieval quality and accuracy
- Generating evaluation datasets for your domain
- Comparing different retrieval strategies (vector vs BM25 vs hybrid)
- A/B testing embedding models or rerankers
- Measuring production RAG performance
- Validating improvements after optimizations
- Comparing your 7 retrieval strategies in src/ or src-iLand/
Key Evaluation Metrics
Retrieval Metrics
Hit Rate: Fraction of queries where a relevant document is found in the top-k results
- Perfect: 1.0 (all queries found relevant docs)
- Good: 0.85+ (85%+ queries successful)
- Needs work: <0.70
MRR (Mean Reciprocal Rank): Quality of ranking
- Perfect: 1.0 (relevant doc always ranked #1)
- Good: 0.80+ (relevant doc typically in top 2-3)
- Formula: average of 1/rank of the first relevant document, across all queries (see the sketch below)
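A tiny self-contained sketch of both formulas (the ranks are made-up example values, not real results):

```python
# Rank of the first relevant document for each query (None = not found in top-k)
ranks = [1, 2, None, 1, 3]

hit_rate = sum(r is not None for r in ranks) / len(ranks)        # 4/5 = 0.80
mrr = sum(1.0 / r for r in ranks if r is not None) / len(ranks)  # (1 + 0.5 + 1 + 1/3) / 5 ≈ 0.57
print(f"Hit Rate: {hit_rate:.2f}, MRR: {mrr:.2f}")
```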
Response Metrics
Faithfulness: No hallucinations; grounded in the retrieved context
Correctness: Factually accurate vs the reference answer
Relevancy: Directly addresses the query
Quick Decision Guide
When to Evaluate
- After implementing → Baseline performance
- After optimization → Validate improvements
- Before production → Quality gate
- In production → Continuous monitoring
What to Measure
- Development → Hit rate + MRR (retrieval quality)
- Production → All metrics (retrieval + response quality)
- A/B testing → Comparative metrics
Dataset Size
- Quick test → 20-50 Q&A pairs
- Thorough eval → 100-200 pairs
- Production → 500+ pairs
Quick Start Patterns
Pattern 1: Basic Retrieval Evaluation
```python
from llama_index.core.evaluation import RetrieverEvaluator

# Create evaluator
evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# Run evaluation (returns one result per query in the dataset)
eval_results = await evaluator.aevaluate_dataset(qa_dataset)

# Aggregate per-query metrics into dataset-level averages
hit_rate = sum(r.metric_vals_dict["hit_rate"] for r in eval_results) / len(eval_results)
mrr = sum(r.metric_vals_dict["mrr"] for r in eval_results) / len(eval_results)
print(f"Hit Rate: {hit_rate:.3f}")
print(f"MRR: {mrr:.3f}")
```
Pattern 2: Generate Evaluation Dataset
```python
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.llms.openai import OpenAI

# Generate Q&A pairs from your documents
llm = OpenAI(model="gpt-4o-mini")
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)

# Filter invalid entries (see the sketch below and reference-metrics.md)
qa_dataset = filter_qa_dataset(qa_dataset)

# Save for reuse
qa_dataset.save_json("evaluation_dataset.json")
```
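`filter_qa_dataset` above is not a LlamaIndex built-in. A minimal sketch of what such a filter might look like, assuming the dataset is an `EmbeddingQAFinetuneDataset` (queries/corpus/relevant_docs dicts); see reference-metrics.md for the full version:

```python
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

def filter_qa_dataset(qa_dataset: EmbeddingQAFinetuneDataset) -> EmbeddingQAFinetuneDataset:
    """Drop auto-generated artifacts such as numbered stubs or empty questions."""
    bad_markers = ("1.", "2.", "Here are", "questions based on")
    kept_queries = {
        qid: q for qid, q in qa_dataset.queries.items()
        if q.strip() and not q.strip().startswith(bad_markers)
    }
    kept_relevant = {qid: qa_dataset.relevant_docs[qid] for qid in kept_queries}
    return EmbeddingQAFinetuneDataset(
        queries=kept_queries,
        corpus=qa_dataset.corpus,
        relevant_docs=kept_relevant,
    )
```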
Pattern 3: Compare Multiple Strategies
```python
def summarize(eval_results):
    # Average the per-query metric values across the dataset
    return {
        "hit_rate": sum(r.metric_vals_dict["hit_rate"] for r in eval_results) / len(eval_results),
        "mrr": sum(r.metric_vals_dict["mrr"] for r in eval_results) / len(eval_results),
    }

strategies = {
    "vector": vector_retriever,
    "bm25": bm25_retriever,
    "hybrid": hybrid_retriever,
    "metadata": metadata_retriever,
}

results = {}
for strategy_name, retriever in strategies.items():
    evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever
    )
    eval_results = await evaluator.aevaluate_dataset(qa_dataset)
    results[strategy_name] = summarize(eval_results)
    print(f"{strategy_name}: {results[strategy_name]}")

# Find best strategy
best_strategy = max(results, key=lambda name: results[name]["hit_rate"])
print(f"\nBest strategy: {best_strategy}")
```
Pattern 4: Compare With/Without Reranking
```python
from llama_index.core.retrievers import BaseRetriever
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Without reranking
retriever_no_rerank = index.as_retriever(similarity_top_k=5)

# With reranking: wrap the base retriever so the reranker runs on its output
class RerankingRetriever(BaseRetriever):
    def __init__(self, retriever, reranker):
        super().__init__()
        self._retriever = retriever
        self._reranker = reranker

    def _retrieve(self, query_bundle):
        nodes = self._retriever.retrieve(query_bundle)
        return self._reranker.postprocess_nodes(nodes, query_bundle=query_bundle)

retriever_with_rerank = RerankingRetriever(
    index.as_retriever(similarity_top_k=10), CohereRerank(top_n=5)
)

# Evaluate both
results = {}
for name, retriever in [("No Rerank", retriever_no_rerank), ("With Rerank", retriever_with_rerank)]:
    evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever
    )
    eval_results = await evaluator.aevaluate_dataset(qa_dataset)
    results[name] = summarize(eval_results)  # summarize() from Pattern 3
    print(f"{name}: Hit Rate={results[name]['hit_rate']:.3f}, MRR={results[name]['mrr']:.3f}")

# Calculate improvement
no_rerank, with_rerank = results["No Rerank"], results["With Rerank"]
improvement = (with_rerank["hit_rate"] - no_rerank["hit_rate"]) / no_rerank["hit_rate"]
print(f"Improvement: {improvement * 100:.1f}%")
```
Pattern 5: Response Quality Evaluation
```python
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)

# Initialize evaluators (both use Settings.llm as the judge by default)
faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()

# Generate response
response = query_engine.query("What is machine learning?")

# Evaluate faithfulness (no hallucinations)
faithfulness_result = faithfulness_evaluator.evaluate_response(response=response)
print(f"Faithfulness: {faithfulness_result.passing}")

# Evaluate relevancy
relevancy_result = relevancy_evaluator.evaluate_response(
    query="What is machine learning?", response=response
)
print(f"Relevancy: {relevancy_result.passing}")
```
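The Correctness metric above has no pattern of its own. A minimal sketch, assuming a reference answer is available and using LlamaIndex's CorrectnessEvaluator (scores on a 1-5 scale):

```python
from llama_index.core.evaluation import CorrectnessEvaluator

correctness_evaluator = CorrectnessEvaluator()

# Compare the generated answer against a known-good reference answer
correctness_result = correctness_evaluator.evaluate(
    query="What is machine learning?",
    response=str(response),
    reference="Machine learning is a field of AI that learns patterns from data.",
)
print(f"Correctness: {correctness_result.score} (passing={correctness_result.passing})")
```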
Your Codebase Integration
For the src/ Pipeline (7 Strategies)
Compare All Strategies:
```python
strategies = {
    "vector": "src/10_basic_query_engine.py",
    "summary": "src/11_document_summary_retriever.py",
    "recursive": "src/12_recursive_retriever.py",
    "metadata": "src/14_metadata_filtering.py",
    "chunk_decoupling": "src/15_chunk_decoupling.py",
    "hybrid": "src/16_hybrid_search.py",
    "planner": "src/17_query_planning_agent.py",
}
# Create an evaluation framework to compare all 7
```
Baseline Performance:
- Generate Q&A dataset from your documents
- Evaluate each strategy
- Identify best performer
- Use as baseline for improvements
For the src-iLand/ Pipeline (Thai Land Deeds)
Thai-Specific Evaluation:
```python
# Generate Thai Q&A pairs
llm = OpenAI(model="gpt-4o-mini")  # Supports Thai
qa_dataset = generate_question_context_pairs(
    thai_nodes, llm=llm, num_questions_per_chunk=2
)

# Test with Thai queries
thai_queries = [
    "โฉนดที่ดินในกรุงเทพ",   # Land deeds in Bangkok
    "นส.3 คืออะไร",          # What is NS.3
    "ที่ดินในสมุทรปราการ",    # Land in Samut Prakan
]
```
Router Evaluation (src-iLand/retrieval/router.py):
- Test index classification accuracy
- Test strategy selection appropriateness
- Measure end-to-end performance (a classification-accuracy sketch follows this list)
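A minimal sketch of the classification-accuracy check. The `router.select_strategy()` method name and the expected labels are assumptions for illustration, not the actual src-iLand/retrieval/router.py API:

```python
# Hypothetical labeled queries: (query, expected strategy) pairs
labeled_queries = [
    ("โฉนดที่ดินในกรุงเทพ", "metadata"),
    ("นส.3 คืออะไร", "vector"),
]

correct = 0
for query, expected in labeled_queries:
    selected = router.select_strategy(query)  # assumed method name
    correct += int(selected == expected)

accuracy = correct / len(labeled_queries)
print(f"Router strategy-selection accuracy: {accuracy:.2%}")
```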
Fast Metadata Testing (see the latency sketch after this list):
- Validate <50ms response time
- Test filtering accuracy
- Compare with/without fast indexing
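A minimal sketch of the <50ms check. `fast_retriever` and `sample_queries` are placeholders for your actual components:

```python
import time

latencies_ms = []
for query in sample_queries:
    start = time.perf_counter()
    fast_retriever.retrieve(query)  # placeholder for the fast-metadata retriever
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p95 = latencies_ms[int(0.95 * (len(latencies_ms) - 1))]
print(f"p95 latency: {p95:.1f} ms (target < 50 ms)")
```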
Detailed References
Load these when you need comprehensive details:
- reference-metrics.md: Complete evaluation guide
- All metrics (hit rate, MRR, faithfulness, correctness)
- Dataset generation techniques
- A/B testing frameworks
- Production monitoring
- Statistical significance testing
- reference-agents.md: Advanced techniques
- Agents (FunctionAgent, ReActAgent)
- Multi-agent systems
- Query engines (Router, SubQuestion)
- Workflow orchestration
- Observability and debugging
Common Workflows
Workflow 1: Create Evaluation Dataset
- Step 1: Prepare representative documents
- Sample from different categories
- Include edge cases
- Step 2: Generate Q&A pairs

  ```python
  qa_dataset = generate_question_context_pairs(
      nodes, llm=llm, num_questions_per_chunk=2
  )
  ```

- Step 3: Filter invalid entries
- Remove auto-generated artifacts
- Load reference-metrics.md for the filtering code
- Step 4: Manual review (optional)
- Check 10-20 samples
- Ensure question quality
- Step 5: Save for reuse

  ```python
  qa_dataset.save_json("eval_dataset.json")
  ```
Workflow 2: Compare Retrieval Strategies
- Step 1: Load evaluation dataset

  ```python
  from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

  # Matches the dataset saved in Workflow 1
  qa_dataset = EmbeddingQAFinetuneDataset.from_json("eval_dataset.json")
  ```

- Step 2: Define strategies to compare
- List all retrievers to test
- For src/: all 7 strategies
- For src-iLand/: router + individual strategies
- Step 3: Run evaluation for each

  ```python
  # evaluate() = the RetrieverEvaluator + summarize() flow from Patterns 1 and 3
  results = {}
  for name, retriever in strategies.items():
      results[name] = evaluate(retriever, qa_dataset)
  ```

- Step 4: Compare results
- Identify best hit rate
- Identify best MRR
- Consider trade-offs (latency, cost)
- Step 5: Document findings
- Record baseline performance
- Note the best strategies for different query types (a small table-printing sketch follows)
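A small sketch for turning the `results` dict from Step 3 into a comparison table, assuming each entry is a dict of aggregated metrics as in the patterns above:

```python
# Print a simple comparison table, best hit rate first
print(f"{'Strategy':<18}{'Hit Rate':>10}{'MRR':>10}")
for name, metrics in sorted(results.items(), key=lambda kv: kv[1]["hit_rate"], reverse=True):
    print(f"{name:<18}{metrics['hit_rate']:>10.3f}{metrics['mrr']:>10.3f}")
```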
Workflow 3: A/B Test an Optimization
- Step 1: Measure baseline

  ```python
  baseline_results = evaluate(current_retriever, qa_dataset)
  ```

- Step 2: Apply optimization
- Add reranking
- Change embedding model
- Adjust chunk size
- etc.
- Step 3: Measure optimized version

  ```python
  optimized_results = evaluate(optimized_retriever, qa_dataset)
  ```

- Step 4: Calculate improvement

  ```python
  improvement = (
      optimized_results["hit_rate"] - baseline_results["hit_rate"]
  ) / baseline_results["hit_rate"] * 100
  print(f"Hit Rate improvement: {improvement:.1f}%")
  ```

- Step 5: Decide based on data
- If improvement > 5%: Deploy
- If improvement < 2%: Consider cost/complexity
- If negative: Rollback
Workflow 4: Production Monitoring
- Step 1: Create production evaluation set
- Sample real user queries
- Include ground truth when available
- Step 2: Set up continuous evaluation

  ```python
  class ProductionEvaluator:
      def __init__(self):
          self.history = []

      def evaluate_query(self, query, response):
          # Log metrics for this query and track them over time
          record = {"query": query, "response_length": len(str(response))}
          self.history.append(record)
          return record
  ```

- Step 3: Define alerts (a threshold-check sketch follows this list)
- Hit rate < 0.80 → Alert
- MRR < 0.70 → Alert
- Latency p95 > 2s → Alert
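A minimal sketch of these threshold checks; the metrics dict layout is an assumption, and wiring the messages to a real alerting channel is left out:

```python
ALERT_THRESHOLDS = {"hit_rate": 0.80, "mrr": 0.70, "latency_p95_s": 2.0}

def check_alerts(metrics: dict) -> list[str]:
    # Returns human-readable alert messages for any breached threshold
    alerts = []
    if metrics["hit_rate"] < ALERT_THRESHOLDS["hit_rate"]:
        alerts.append(f"Hit rate {metrics['hit_rate']:.2f} below 0.80")
    if metrics["mrr"] < ALERT_THRESHOLDS["mrr"]:
        alerts.append(f"MRR {metrics['mrr']:.2f} below 0.70")
    if metrics["latency_p95_s"] > ALERT_THRESHOLDS["latency_p95_s"]:
        alerts.append(f"p95 latency {metrics['latency_p95_s']:.2f}s above 2s")
    return alerts
```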
- Step 4: Monitor trends
- Daily/weekly metrics
- Detect degradation early
- Step 5: Iterate based on data
- Identify failure patterns
- Generate new test cases
- Improve weak areas
Workflow 5: Evaluate All 7 Strategies (src/)
- Step 1: Generate comprehensive dataset
- Cover different query types
- Factual, summarization, comparison
- Step 2: Run each strategy

  ```bash
  python src/10_basic_query_engine.py          # Vector
  python src/11_document_summary_retriever.py  # Summary
  python src/12_recursive_retriever.py         # Recursive
  python src/14_metadata_filtering.py          # Metadata
  python src/15_chunk_decoupling.py            # Chunk decoupling
  python src/16_hybrid_search.py               # Hybrid
  python src/17_query_planning_agent.py        # Planner
  ```

- Step 3: Collect metrics
- Hit rate for each
- MRR for each
- Latency for each
- Step 4: Create comparison table

  | Strategy | Hit Rate | MRR | Latency | Use Case |
  |---|---|---|---|---|
  | Vector | ... | ... | ... | General |
  | Hybrid | ... | ... | ... | Best overall |
  | ... | ... | ... | ... | ... |

- Step 5: Document recommendations
- Best for factual queries
- Best for complex queries
- Best for production (speed + quality)
Evaluation Metrics Reference
Hit Rate Interpretation
- 1.0 → Perfect (all queries successful)
- 0.90+ → Excellent
- 0.80-0.89 → Good
- 0.70-0.79 → Acceptable
- <0.70 → Needs improvement
MRR Interpretation
- 1.0 → Perfect ranking (relevant doc always #1)
- 0.85+ → Excellent (relevant doc typically #1 or #2)
- 0.70-0.84 → Good
- 0.50-0.69 → Acceptable
- <0.50 → Poor ranking quality
Latency Targets
- <100ms → Excellent
- 100-500ms → Good
- 500ms-1s → Acceptable
- >1s → Needs optimization
Performance Benchmarks
Embedding Model Comparison (from reference docs)
| Embedding | Reranker | Hit Rate | MRR |
|---|---|---|---|
| JinaAI Base | bge-reranker-large | 0.938 | 0.869 |
| JinaAI Base | CohereRerank | 0.933 | 0.874 |
| OpenAI | CohereRerank | 0.927 | 0.866 |
| OpenAI | bge-reranker-large | 0.910 | 0.856 |
Typical Improvements
- Adding reranking: +5-15% hit rate
- Hybrid vs vector: +3-8% hit rate
- Optimal chunk size: +2-5% hit rate
- Better embeddings: +3-10% hit rate
Scripts
This skill includes utility scripts in the scripts/ directory:
generate_qa_dataset.py
Generate evaluation Q&A pairs from documents:
```bash
python .claude/skills/evaluating-rag/scripts/generate_qa_dataset.py \
  --documents-dir ./data \
  --output eval_dataset.json \
  --num-questions-per-chunk 2
```
compare_retrievers.py
Compare multiple retrieval strategies:
```bash
python .claude/skills/evaluating-rag/scripts/compare_retrievers.py \
  --dataset eval_dataset.json \
  --strategies vector,bm25,hybrid \
  --output comparison_results.json
```
Outputs:
- Hit rate and MRR for each strategy
- Performance comparison table
- Recommendations
run_evaluation.py
Run comprehensive evaluation:
```bash
python .claude/skills/evaluating-rag/scripts/run_evaluation.py \
  --retriever-config config.yaml \
  --dataset eval_dataset.json \
  --metrics hit_rate,mrr,faithfulness
```
Reports:
- All requested metrics
- Per-query breakdown
- Summary statistics
Key Reminders
Dataset Quality:
- Generate from your actual documents
- Include diverse query types
- Filter invalid auto-generated entries
- Manual review recommended for critical domains
Evaluation Best Practices:
- Start with baseline (before optimization)
- Test one change at a time (for clear attribution)
- Use same dataset for comparisons
- Statistical significance matters; treat improvements under ~5% with caution (a paired-bootstrap sketch follows this list)
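A paired-bootstrap sketch for checking whether a hit-rate difference is likely real, assuming you have per-query 0/1 hit indicators for both variants on the same dataset (illustration only, not one of the skill's scripts):

```python
import random

def bootstrap_p_value(baseline_hits, optimized_hits, n_boot=10_000):
    # Paired bootstrap: resample queries with replacement and count how often
    # the optimized variant fails to beat the baseline.
    n = len(baseline_hits)
    worse_or_equal = 0
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        base = sum(baseline_hits[i] for i in idx) / n
        opt = sum(optimized_hits[i] for i in idx) / n
        worse_or_equal += opt <= base
    return worse_or_equal / n_boot

# Example with made-up per-query hit indicators (1 = relevant doc retrieved)
p = bootstrap_p_value([1, 0, 1, 1, 0, 1], [1, 1, 1, 1, 0, 1])
print(f"Approximate p-value: {p:.3f}")
```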
Production Monitoring:
- Continuous evaluation on sample queries
- Track trends over time
- Alert on degradation
- Regular dataset refresh
For Your Pipelines:
- src/: Compare all 7 strategies systematically
- src-iLand/: Test with Thai queries and metadata
- Both: Establish baselines before optimizations
Next Steps
After evaluation:
- Optimize: Use the optimizing-rag skill to improve low scores
- Implement: Use the implementing-rag skill to rebuild weak components
- Monitor: Set up continuous evaluation in production
- Iterate: Regular evaluation → optimization → re-evaluation cycle