Claude-skill-registry eval-frameworks
Evaluation framework patterns for RAG and LLMs, including faithfulness metrics, synthetic dataset generation, and LLM-as-a-judge patterns. Triggers: ragas, deepeval, llm-eval, faithfulness, hallucination-check, synthetic-data.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/eval-frameworks" ~/.claude/skills/majiayu000-claude-skill-registry-eval-frameworks && rm -rf "$T"
manifest: skills/data/eval-frameworks/SKILL.md
Evaluation Frameworks
Overview
Traditional ML metrics (accuracy, F1) fail to capture the quality of open-ended LLM outputs. Evaluation frameworks like Ragas and DeepEval use "LLM-as-a-judge" to quantify qualities like faithfulness, relevance, and professionalism.
When to Use
- RAG Benchmarking: To verify if answers are supported by retrieved context (Faithfulness).
- Regression Testing: Ensuring that a prompt change or model upgrade doesn't break existing behavior.
- Synthetic Benchmarking: Creating evaluation sets when manual gold-standard data is unavailable.
Decision Tree
- Do you want to check for hallucinations?
- YES: Run a Faithfulness metric.
- Is the retrieved context actually useful for the question?
- YES: Run a Retrieval Relevance metric.
- Do you need to scale evaluation without manual labeling?
- YES: Use Synthetic Data Generation.
Workflows
1. Evaluating RAG Faithfulness
- Capture the `query`, the `retrieved_context`, and the `actual_output` from the system.
- Run a Faithfulness metric (from Ragas or LlamaIndex), which uses an LLM to verify whether the output's claims are supported by the context (see the sketch after this list).
- If the score is low, investigate whether the context was irrelevant or the model hallucinated.
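A minimal sketch of this loop, assuming the classic Ragas `evaluate`/`faithfulness` API (column names and import paths changed in later Ragas releases, and the sample data is illustrative):

```python
# pip install ragas datasets -- requires a judge-LLM API key (e.g. OpenAI) configured
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# One captured interaction from the RAG system: query, retrieved context, and model answer.
samples = {
    "question": ["When was the Eiffel Tower completed?"],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair in Paris."]],
    "answer": ["The Eiffel Tower was completed in 1889."],
}

# The faithfulness metric prompts a judge LLM to check whether each claim in the
# answer is supported by the retrieved contexts; scores near 1.0 mean no hallucination.
result = evaluate(Dataset.from_dict(samples), metrics=[faithfulness])
print(result)  # e.g. {'faithfulness': 0.95}
```

A low score points to either irrelevant retrieval or an ungrounded generation, which is exactly the distinction the last step asks you to investigate.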
2. Unit Testing LLM Outputs (DeepEval)
- Install `deepeval` and create a test file `test_example.py`.
- Define an `LLMTestCase` with input, actual_output, and expected_output.
- Apply a `GEval` metric with a custom `criteria` (e.g., 'professionalism').
- Run `deepeval test run` to assert that the score meets the defined threshold (a minimal test file is sketched below).
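A minimal `test_example.py` sketch following these steps, assuming DeepEval's `GEval` and `assert_test` APIs (the input/output strings and the 0.7 threshold are illustrative):

```python
# test_example.py -- run with: deepeval test run test_example.py
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_professional_tone():
    test_case = LLMTestCase(
        input="Write a status update for the customer about last night's outage.",
        actual_output="We experienced a brief service interruption overnight and have fully restored service.",
        expected_output="A calm, professional summary of the outage and its resolution.",
    )
    professionalism = GEval(
        name="Professionalism",
        criteria="Determine whether the actual output maintains a professional, customer-appropriate tone.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.7,  # the test fails if the judge LLM scores below 0.7
    )
    assert_test(test_case, [professionalism])
```

Because `deepeval test run` executes this like a pytest suite and fails whenever the judged score drops below the threshold, the same file doubles as a regression gate for prompt or model changes.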
3. Automated Question Generation
- Point the evaluation framework (e.g., LlamaIndex) at a set of source documents.
- Use the `QuestionGeneration` module to synthetically create test cases (question-context pairs); see the sketch after this list.
- Run the RAG pipeline against these generated questions to benchmark performance across the entire dataset.
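A minimal sketch of this workflow, assuming LlamaIndex's `DatasetGenerator` helper (the question-generation class has been renamed across llama-index releases, so check the import path for your version; the `docs/` directory is a placeholder):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.evaluation import DatasetGenerator

# Load the source documents the RAG pipeline is built on.
documents = SimpleDirectoryReader("docs/").load_data()

# Ask a judge LLM to generate evaluation questions grounded in those documents.
generator = DatasetGenerator.from_documents(documents)
questions = generator.generate_questions_from_nodes()

# Each generated question can now be fed through the RAG pipeline and scored
# with the faithfulness / relevance metrics above.
for q in questions[:5]:
    print(q)
```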
Non-Obvious Insights
- LLM-as-a-Judge: A stronger model (e.g., GPT-4o) can effectively grade the outputs of a smaller/faster model (e.g., Llama 3) with human-like accuracy using research-backed metrics like GEval.
- Separation of Concerns: Good evaluation splits into 'Response' (was the answer good?) and 'Retrieval' (did we find the right docs?). Fixing one doesn't always fix the other.
- Synthetic Scaling: Manual evaluation doesn't scale; using an LLM to generate 1000 edge cases from your data is the only way to reach high production confidence.
Evidence
- "Faithfulness: Evaluates if the answer is faithful to the retrieved contexts (in other words, whether if there’s hallucination)." - LlamaIndex
- "GEval is a research-backed metric... for you to evaluate your LLM output's on any custom metric with human-like accuracy." - DeepEval
- "Traditional evaluation metrics don't capture what matters for LLM applications." - Ragas
Scripts
- `scripts/eval-frameworks_tool.py`: Script defining a Faithfulness evaluation loop.
- `scripts/eval-frameworks_tool.js`: Node.js simulation for calculating relevance scores.
Dependencies
ragas, deepeval, llama-index