Claude-skill-registry eval-frameworks

Evaluation framework patterns for RAG and LLMs, including faithfulness metrics, synthetic dataset generation, and LLM-as-a-judge patterns. Triggers: ragas, deepeval, llm-eval, faithfulness, hallucination-check, synthetic-data.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/eval-frameworks" ~/.claude/skills/majiayu000-claude-skill-registry-eval-frameworks && rm -rf "$T"

manifest: skills/data/eval-frameworks/SKILL.md

source content

Evaluation Frameworks

Overview

Traditional classification metrics such as accuracy and F1 do not capture the quality of free-form LLM outputs. Evaluation frameworks like Ragas and DeepEval use an "LLM-as-a-judge" approach to score subjective qualities such as faithfulness, relevance, and professionalism.

When to Use

RAG Benchmarking: Verify that answers are supported by the retrieved context (Faithfulness).
Regression Testing: Ensure that a prompt change or model upgrade doesn't break existing behavior.
Synthetic Benchmarking: Create evaluation sets when manual gold-standard data is unavailable.

Decision Tree

Do you want to check for hallucinations?
- YES: Run a Faithfulness metric.
Do you need to check whether the retrieved context is actually useful for the question?
- YES: Run a Retrieval Relevance metric.
Do you need to scale evaluation without manual labeling?
- YES: Use Synthetic Data Generation.
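The three questions above are independent, so more than one branch can apply at once. A minimal sketch of that selection logic follows; the function name and the returned labels are hypothetical and not taken from any of the frameworks.

```python
from typing import List


def choose_evaluations(
    check_hallucinations: bool,
    check_retrieval: bool,
    no_manual_labels: bool,
) -> List[str]:
    """Map the three decision-tree questions to the evaluation steps to run.

    The returned labels are illustrative names, not framework identifiers.
    """
    steps: List[str] = []
    if check_hallucinations:
        steps.append("faithfulness")
    if check_retrieval:
        steps.append("retrieval_relevance")
    if no_manual_labels:
        steps.append("synthetic_data_generation")
    return steps


# A RAG system with no labelled data where hallucination is the main worry.
plan = choose_evaluations(
    check_hallucinations=True,
    check_retrieval=False,
    no_manual_labels=True,
)
```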

Workflows

1. Evaluating RAG Faithfulness

Capture the `query`, the `retrieved_context`, and the `actual_output` from the system.
Run a Faithfulness metric (from Ragas or LlamaIndex) which uses an LLM to verify if the output claims are supported by the context.
If the score is low, investigate whether the context was irrelevant or the model hallucinated.
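The steps above can be sketched without any framework as follows. This is an illustration under stated assumptions, not the Ragas or LlamaIndex implementation: `RagSample`, `split_claims`, and `keyword_judge` are hypothetical names, and the keyword check is only a stand-in for the LLM call that a real Faithfulness metric would make.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical judge signature: (claim, context) -> True if the context supports the claim.
Judge = Callable[[str, str], bool]


@dataclass
class RagSample:
    query: str
    retrieved_context: List[str]
    actual_output: str


def split_claims(text: str) -> List[str]:
    """Naive claim extraction: one claim per sentence (a real metric would use an LLM)."""
    normalized = text.replace("!", ".").replace("?", ".")
    return [s.strip() for s in normalized.split(".") if s.strip()]


def keyword_judge(claim: str, context: str) -> bool:
    """Stand-in for an LLM judge: every word longer than 3 characters must appear in the context."""
    words = [w.lower().strip(",;:") for w in claim.split() if len(w) > 3]
    return all(w in context.lower() for w in words)


def faithfulness(sample: RagSample, judge: Judge = keyword_judge) -> float:
    """Fraction of claims in the output that the retrieved context supports."""
    claims = split_claims(sample.actual_output)
    if not claims:
        return 0.0
    context = " ".join(sample.retrieved_context)
    supported = sum(judge(claim, context) for claim in claims)
    return supported / len(claims)


sample = RagSample(
    query="Where is the Eiffel Tower?",
    retrieved_context=["The Eiffel Tower is located in Paris, France."],
    actual_output="The Eiffel Tower is located in Paris. It was built in 1850.",
)
score = faithfulness(sample)  # the second claim has no support in the context
```

Passing the judge as a parameter keeps the scoring loop unchanged when the keyword stub is swapped for a real LLM call.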

2. Unit Testing LLM Outputs (DeepEval)

Install `deepeval` and create a test file `test_example.py`.
Define an `LLMTestCase` with input, actual_output, and expected_output.
Apply a `GEval` metric with a custom `criteria` (e.g., 'professionalism').
Run `deepeval test run` to assert that the score meets the defined threshold.
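The shape of such a test can be sketched in plain Python. This does not use the DeepEval API: `ToyTestCase`, `ToyCriteriaMetric`, `stub_professionalism_scorer`, and `assert_case` are hypothetical stand-ins that mirror the fields named in the steps, and the scorer is a keyword heuristic standing in for the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyTestCase:
    """Hypothetical stand-in carrying the three fields named in the steps above."""
    input: str
    actual_output: str
    expected_output: str


@dataclass
class ToyCriteriaMetric:
    """Hypothetical criteria-based metric with a pass threshold."""
    criteria: str
    threshold: float
    scorer: Callable[[str, ToyTestCase], float]

    def measure(self, case: ToyTestCase) -> float:
        return self.scorer(self.criteria, case)


def stub_professionalism_scorer(criteria: str, case: ToyTestCase) -> float:
    """Stand-in for an LLM judge: subtract 0.5 for each informal marker found."""
    informal_markers = ("lol", "gonna", "!!")
    hits = sum(marker in case.actual_output.lower() for marker in informal_markers)
    return max(0.0, 1.0 - 0.5 * hits)


def assert_case(case: ToyTestCase, metric: ToyCriteriaMetric) -> None:
    """Fail the test when the score falls below the metric's threshold."""
    score = metric.measure(case)
    assert score >= metric.threshold, (
        f"{metric.criteria}: {score:.2f} < {metric.threshold}"
    )


def test_refund_reply_is_professional() -> None:
    case = ToyTestCase(
        input="When will I get my refund?",
        actual_output="Your refund will be processed within five business days.",
        expected_output="Refunds are processed within five business days.",
    )
    metric = ToyCriteriaMetric(
        criteria="professionalism",
        threshold=0.7,
        scorer=stub_professionalism_scorer,
    )
    assert_case(case, metric)
```

Because the test is an ordinary function that raises `AssertionError` on failure, it fits a standard test runner and can gate a prompt or model change in CI.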

3. Automated Question Generation

Point the evaluation framework (e.g., LlamaIndex) at a set of source documents.
Use the `QuestionGeneration` module to synthetically create test cases (question-context pairs).
Run the RAG pipeline against these generated questions to benchmark performance across the entire dataset.
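A framework-free sketch of this workflow is shown below. It is not the LlamaIndex module: `chunk`, `template_generator`, `build_dataset`, and `benchmark` are hypothetical helpers, and the template generator stands in for the LLM call that would write real questions.

```python
from typing import Callable, Dict, List

# Hypothetical generator signature: context chunk -> list of questions about it.
QuestionGenerator = Callable[[str], List[str]]


def chunk(text: str, max_words: int = 50) -> List[str]:
    """Split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def template_generator(context: str) -> List[str]:
    """Stand-in for an LLM call: one templated question per chunk."""
    opening = " ".join(context.split()[:4])
    return [f"What does the passage starting with '{opening}' say?"]


def build_dataset(
    documents: List[str],
    generate: QuestionGenerator = template_generator,
    max_words: int = 50,
) -> List[Dict[str, str]]:
    """Create question-context pairs from the source documents."""
    dataset: List[Dict[str, str]] = []
    for document in documents:
        for context in chunk(document, max_words):
            for question in generate(context):
                dataset.append({"question": question, "context": context})
    return dataset


def benchmark(
    dataset: List[Dict[str, str]],
    pipeline: Callable[[str], str],
    score: Callable[[str, str], float],
) -> float:
    """Mean score of the pipeline's answers, each judged against its source context."""
    if not dataset:
        return 0.0
    total = sum(score(pipeline(row["question"]), row["context"]) for row in dataset)
    return total / len(dataset)


docs = ["alpha beta gamma delta epsilon zeta", "eta theta"]
dataset = build_dataset(docs, max_words=3)
```

Keeping the source context next to each generated question lets the same dataset feed both a Faithfulness check and a retrieval check later.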

Non-Obvious Insights

LLM-as-a-Judge: A stronger model (e.g., GPT-4o) can grade a smaller or faster model (e.g., Llama 3) using research-backed metrics like GEval, which DeepEval describes as reaching human-like accuracy.
Separation of Concerns: Good evaluation splits into 'Response' (was the answer good?) and 'Retrieval' (did we find the right docs?). Fixing one doesn't always fix the other.
Synthetic Scaling: Manual evaluation doesn't scale; using an LLM to generate a large set of edge cases (e.g., 1000) from your data is a practical way to build confidence before production.
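The separation between 'Retrieval' and 'Response' can be made concrete by reporting the two scores side by side. The sketch below is illustrative only: the function names and document ids are hypothetical, hit rate is one of several possible retrieval metrics, and the response scores are hard-coded where a judge would normally supply them.

```python
from typing import Dict, List


def retrieval_hit_rate(retrieved_ids: List[List[str]], relevant_ids: List[str]) -> float:
    """Fraction of queries whose relevant document appears among the retrieved ones."""
    if not relevant_ids:
        return 0.0
    hits = sum(relevant in retrieved for retrieved, relevant in zip(retrieved_ids, relevant_ids))
    return hits / len(relevant_ids)


def mean(scores: List[float]) -> float:
    return sum(scores) / len(scores) if scores else 0.0


def report(
    retrieved_ids: List[List[str]],
    relevant_ids: List[str],
    response_scores: List[float],
) -> Dict[str, float]:
    """Keep the two concerns as separate numbers instead of one blended score."""
    return {
        "retrieval_hit_rate": retrieval_hit_rate(retrieved_ids, relevant_ids),
        "response_score": mean(response_scores),
    }


result = report(
    retrieved_ids=[["d1", "d2"], ["d3"], ["d4", "d5"]],  # what the retriever returned per query
    relevant_ids=["d1", "d9", "d5"],                     # the document each query needed
    response_scores=[0.9, 0.2, 0.8],                     # assumed judge scores per answer
)
```

In this example the second query misses its document and also has the weakest answer, which points at retrieval rather than the prompt as the first thing to fix.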

Evidence

"Faithfulness: Evaluates if the answer is faithful to the retrieved contexts (in other words, whether if there’s hallucination)." - LlamaIndex
"GEval is a research-backed metric... for you to evaluate your LLM output's on any custom metric with human-like accuracy." - DeepEval
"Traditional evaluation metrics don't capture what matters for LLM applications." - Ragas

Scripts

`scripts/eval-frameworks_tool.py`: Script defining a Faithfulness evaluation loop.
`scripts/eval-frameworks_tool.js`: Node.js simulation for calculating relevance scores.

Dependencies

`ragas`
`deepeval`
`llama-index`

References

references/README.md