OpenJudge
Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/agentscope-ai/OpenJudge
```

Claude Code · Install into ~/.claude/skills/:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/agentscope-ai/OpenJudge "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/openjudge" ~/.claude/skills/agentscope-ai-openjudge-openjudge && rm -rf "$T"
```
manifest: skills/openjudge/SKILL.md
OpenJudge Skill
Build evaluation pipelines for LLM applications using the openjudge library.
When to Use This Skill
- User wants to evaluate LLM output quality (correctness, relevance, hallucination, etc.)
- User wants to compare two or more models and rank them
- User wants to design a scoring rubric and automate evaluation
- User wants to analyze evaluation results statistically
- User wants to build a reward model or quality filter
Sub-documents — Read When Relevant
| Topic | File | Read when… |
|---|---|---|
| Grader selection & configuration | | User needs to pick or configure an evaluator |
| Batch evaluation pipeline | | User needs to run evaluation over a dataset |
| Auto-generate graders from data | | No rubric yet; generate from labeled examples |
| Analyze & compare results | | User wants win rates, statistics, or metrics |
Read the relevant sub-document before writing any code.
Install
```bash
pip install py-openjudge
```
Architecture Overview
```
Dataset (List[dict])
  │
  ▼
GradingRunner  ← orchestrates everything
  │
  ├─► Grader A ──► EvaluationStrategy ──► _aevaluate() ──► GraderScore / GraderRank
  ├─► Grader B ──► EvaluationStrategy ──► _aevaluate() ──► GraderScore / GraderRank
  └─► Grader C ...
  │
  ├─► Aggregator (optional)  ← combine multiple grader scores into one
  │
  └─► RunnerResult  ← {grader_name: [GraderScore, ...]}
  │
  ▼
Analyzer  ← statistics, win rates, validation metrics
```
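To make the fan-out concrete, here is a minimal sketch that runs two graders over one dataset and reads the per-grader lists out of the RunnerResult dict. The RelevanceGrader class and its import path are assumptions for illustration (only CorrectnessGrader appears elsewhere in this skill); substitute whichever graders your openjudge version actually ships.

```python
import asyncio

from openjudge.models.openai_chat_model import OpenAIChatModel
from openjudge.graders.common.correctness import CorrectnessGrader
# ASSUMPTION: this relevance grader path is illustrative only; check
# your installed openjudge version for the real module and class name.
from openjudge.graders.common.relevance import RelevanceGrader
from openjudge.runner.grading_runner import GradingRunner

model = OpenAIChatModel(model="gpt-4o", api_key="sk-...")

async def main():
    runner = GradingRunner(
        grader_configs={
            "correctness": CorrectnessGrader(model=model),
            "relevance": RelevanceGrader(model=model),  # assumed grader
        },
        max_concurrency=8,
    )
    dataset = [
        {
            "query": "What is the capital of France?",
            "response": "Paris is the capital of France.",
            "reference_response": "Paris.",
        },
    ]
    results = await runner.arun(dataset)
    # RunnerResult is keyed by grader name; each value is a list of
    # results aligned with the dataset rows.
    for name, grader_results in results.items():
        print(name, [getattr(r, "score", None) for r in grader_results])

asyncio.run(main())
```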
5-Minute Quick Start
Evaluate responses for correctness using a built-in grader:
```python
import asyncio

from openjudge.models.openai_chat_model import OpenAIChatModel
from openjudge.graders.common.correctness import CorrectnessGrader
from openjudge.runner.grading_runner import GradingRunner

# 1. Configure the judge model (OpenAI-compatible endpoint)
model = OpenAIChatModel(
    model="qwen-plus",
    api_key="sk-xxx",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# 2. Instantiate a grader
grader = CorrectnessGrader(model=model)

# 3. Prepare dataset
dataset = [
    {
        "query": "What is the capital of France?",
        "response": "Paris is the capital of France.",
        "reference_response": "Paris.",
    },
    {
        "query": "What is 2 + 2?",
        "response": "The answer is five.",
        "reference_response": "4.",
    },
]

# 4. Run evaluation
async def main():
    runner = GradingRunner(
        grader_configs={"correctness": grader},
        max_concurrency=8,
    )
    results = await runner.arun(dataset)
    for i, result in enumerate(results["correctness"]):
        print(f"[{i}] score={result.score} reason={result.reason}")

asyncio.run(main())
```
Expected output:
```
[0] score=5 reason=The response accurately states Paris as capital...
[1] score=1 reason=The response gives the wrong answer (five vs 4)...
```
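Because each result list is aligned with the input rows, the quick start doubles as a quality filter (one of the use cases above). A minimal sketch that continues from the code above, assuming the 1-5 scale shown in the expected output:

```python
# Continues from the quick start: `dataset` and `results` are in scope.
# ASSUMPTION: scores use the 1-5 scale shown in the expected output.
PASS_THRESHOLD = 4

def filter_by_score(rows, grader_results, threshold=PASS_THRESHOLD):
    """Keep only rows the judge scored at or above `threshold`."""
    kept = []
    for row, result in zip(rows, grader_results):
        score = getattr(result, "score", None)  # failed rows have no score
        if score is not None and score >= threshold:
            kept.append(row)
    return kept

high_quality = filter_by_score(dataset, results["correctness"])
print(f"kept {len(high_quality)} of {len(dataset)} rows")
```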
Key Data Types
| Type | Description |
|---|---|
| GraderScore | Pointwise result: score (float), reason (str), metadata (dict) |
| GraderRank | Listwise result: rank (List[int]), reason (str), metadata (dict) |
| GraderError | Error during evaluation: error (str), reason (str) |
| RunnerResult | Mapping of grader name to result list: {grader_name: [GraderScore, ...]} |
Result Handling Pattern
```python
from openjudge.graders.schema import GraderScore, GraderRank, GraderError

for grader_name, grader_results in results.items():
    for i, result in enumerate(grader_results):
        if isinstance(result, GraderScore):
            print(f"{grader_name}[{i}]: score={result.score}")
        elif isinstance(result, GraderRank):
            print(f"{grader_name}[{i}]: rank={result.rank}")
        elif isinstance(result, GraderError):
            print(f"{grader_name}[{i}]: ERROR — {result.error}")
```
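Building on the same pattern, a small helper can summarize one grader's scores while skipping failed rows. This sketch uses only the stdlib and the types above; the returned dict shape is illustrative, not an openjudge API:

```python
from statistics import mean

from openjudge.graders.schema import GraderError, GraderScore

def summarize(grader_results):
    """Summarize one grader's results, skipping GraderError entries."""
    scores = [r.score for r in grader_results if isinstance(r, GraderScore)]
    n_errors = sum(isinstance(r, GraderError) for r in grader_results)
    return {
        "n": len(grader_results),
        "n_errors": n_errors,
        "mean_score": mean(scores) if scores else None,
    }

print(summarize(results["correctness"]))
```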
Model Configuration
All LLM-based graders accept either a BaseChatModel instance or a dict config:
```python
# Option A: instance
from openjudge.models.openai_chat_model import OpenAIChatModel

model = OpenAIChatModel(model="gpt-4o", api_key="sk-...")

# Option B: dict (auto-creates OpenAIChatModel)
model_cfg = {"model": "gpt-4o", "api_key": "sk-..."}
grader = CorrectnessGrader(model=model_cfg)

# OpenAI-compatible endpoints (DashScope / local / etc.)
model = OpenAIChatModel(
    model="qwen-plus",
    api_key="sk-xxx",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
```
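Hard-coded keys are fine for a demo; in practice you will likely read them from the environment. A minimal sketch using the standard os.environ pattern; the JUDGE_* variable names are made up for illustration, not an openjudge convention:

```python
import os

from openjudge.models.openai_chat_model import OpenAIChatModel

# ASSUMPTION: JUDGE_MODEL / JUDGE_BASE_URL are illustrative env var
# names, not something openjudge reads on its own.
kwargs = {
    "model": os.environ.get("JUDGE_MODEL", "gpt-4o"),
    "api_key": os.environ["OPENAI_API_KEY"],
}
base_url = os.environ.get("JUDGE_BASE_URL")
if base_url:  # only pass base_url when targeting a compatible endpoint
    kwargs["base_url"] = base_url

model = OpenAIChatModel(**kwargs)
```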