Claude-skill-registry libeval

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/libeval" ~/.claude/skills/majiayu000-claude-skill-registry-libeval && rm -rf "$T"
manifest: skills/data/libeval/SKILL.md
source content

libeval Skill

When to Use

  • Evaluating RAG agent response quality
  • Measuring retrieval recall and precision
  • Running automated quality assessments
  • Benchmarking agent performance over time

Key Concepts

Evaluator: Main orchestrator that runs test cases through the agent and collects metrics.

CriteriaEvaluator: Uses LLM-as-judge to score responses against defined criteria and rubrics.

RecallEvaluator: Measures how well the retrieval system returns relevant documents.

TraceEvaluator: Analyzes execution traces for performance and correctness.

Usage Patterns

Pattern 1: Run evaluation suite

import { Evaluator } from "@copilot-ld/libeval";

// config and testCases are defined elsewhere (evaluation is configured per config/eval.yml)
const evaluator = new Evaluator(config);

// Run every test case through the agent and collect metrics
const results = await evaluator.run(testCases);
console.log(results.summary);
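
The shape of a test case is not shown in the source. Below is a minimal sketch, assuming each case pairs a query with the documents or rubric used to judge the answer; every field name here is hypothetical, not the @copilot-ld/libeval schema.

// Hypothetical test-case shape; field names are illustrative assumptions
interface TestCase {
  id: string;               // stable identifier for tracking results over time
  query: string;            // input sent to the agent
  expectedDocs?: string[];  // documents retrieval is expected to return
  rubric?: string;          // criteria for the LLM-as-judge
}

const testCases: TestCase[] = [
  {
    id: "faq-001",
    query: "How do I rotate an API key?",
    expectedDocs: ["docs/security/api-keys.md"],
    rubric: "Mentions the rotation endpoint and the grace period.",
  },
];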

Pattern 2: Criteria-based evaluation

import { CriteriaEvaluator } from "@copilot-ld/libeval";

// llmClient is the libllm client used as the LLM-as-judge
const criteria = new CriteriaEvaluator(llmClient);

// Score the response against the rubric's criteria
const score = await criteria.evaluate(response, rubric);
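
No pattern is shown for the RecallEvaluator listed under Key Concepts. A minimal sketch follows, assuming an evaluate-style call analogous to CriteriaEvaluator; the constructor and method signature are assumptions, not documented API.

import { RecallEvaluator } from "@copilot-ld/libeval";

// Hypothetical usage: compare retrieved documents against the expected set
const recall = new RecallEvaluator();
const recallScore = await recall.evaluate(retrievedDocs, expectedDocs);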

Integration

Configured via config/eval.yml. Run via make eval. Uses libllm for LLM-as-judge.
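
The contents of config/eval.yml are not documented here; the sketch below is hypothetical, and every key in it is an assumption rather than the actual schema.

# Illustrative only; the real config/eval.yml schema may differ
judge:
  model: gpt-4o              # model libllm uses for LLM-as-judge scoring
cases: eval/cases.json       # test cases fed to the Evaluator
output: eval/results/        # where result summaries are written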