# OpenJudge ref-hallucination-arena
## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/agentscope-ai/OpenJudge
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/agentscope-ai/OpenJudge "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ref-hallucination-arena" ~/.claude/skills/agentscope-ai-openjudge-ref-hallucination-arena && rm -rf "$T"
```

Manifest: `skills/ref-hallucination-arena/SKILL.md`
# Reference Hallucination Arena Skill

Evaluate how accurately LLMs recommend real academic references using the OpenJudge `RefArenaPipeline`:
- Load queries — from a JSON/JSONL dataset
- Collect responses — BibTeX-formatted references from target models
- Extract references — parse BibTeX entries from model output (see the example entry after this list)
- Verify references — cross-check against Crossref / PubMed / arXiv / DBLP
- Score & rank — compute verification rate, per-field accuracy, discipline breakdown
- Generate report — Markdown report + visualization charts
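
For illustration, a well-formed entry of the kind the extractor parses (a real paper, used here purely as a format example):

```bibtex
@inproceedings{vaswani2017attention,
  title     = {Attention Is All You Need},
  author    = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and
               Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and
               Kaiser, Lukasz and Polosukhin, Illia},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2017}
}
```

The title, author list, and year of each extracted entry are the fields cross-checked against the bibliographic databases.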
## Prerequisites

```bash
# Install OpenJudge
pip install py-openjudge

# Extra dependency for ref_hallucination_arena (chart generation)
pip install matplotlib
```
## Gather from user before running

| Info | Required? | Notes |
|---|---|---|
| Config YAML path | Yes | Defines endpoints, dataset, verification settings |
| Dataset path | Yes | JSON/JSONL file with queries (can be set in config) |
| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |
| Crossref email | No | Improves API rate limits for verification |
| PubMed API key | No | Improves PubMed rate limits |
| Output directory | No | Default: `evaluation_results/ref_hallucination_arena/` |
| Report language | No | `en` (default) or `zh` |
| Tavily API key | No | Required only if using tool-augmented mode |
## Quick start

### CLI

```bash
# Run evaluation with config file
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Resume from checkpoint (default behavior)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Start fresh, ignore checkpoint
python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save

# Override output directory
python -m cookbooks.ref_hallucination_arena --config config.yaml \
  --output_dir ./my_results --save
```
### Python API

```python
import asyncio

from cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline


async def main():
    pipeline = RefArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()
    for rank, (model, score) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {score:.1%}")

asyncio.run(main())
```
## CLI options

| Flag | Default | Description |
|---|---|---|
| `--config` | — | Path to YAML configuration file (required) |
| `--output_dir` | config value | Override output directory |
| `--save` | off | Save results to file |
| `--fresh` | off | Start fresh, ignore checkpoint |
## Minimal config file

```yaml
task:
  description: "Evaluate LLM reference recommendation capabilities"

dataset:
  path: "./data/queries.json"

target_endpoints:
  model_a:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."
  model_b:
    base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    api_key: "${DASHSCOPE_API_KEY}"
    model: "qwen3-max"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."
```
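
The `${...}` values are substituted from environment variables. As a minimal pre-flight sketch using generic YAML tooling (not part of the OpenJudge API), you can confirm the file parses and that every referenced variable is set before starting a run:

```python
import os
import re

import yaml  # pip install pyyaml

# Parse the config and verify every ${VAR} placeholder it references
# is actually set in the environment.
with open("config.yaml") as f:
    raw = f.read()
config = yaml.safe_load(raw)

missing = [v for v in re.findall(r"\$\{(\w+)\}", raw) if not os.environ.get(v)]
if missing:
    raise SystemExit(f"Unset environment variables: {missing}")
print("Config sections:", ", ".join(config))
```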
## Full config reference

### task

| Field | Required | Description |
|---|---|---|
| `description` | Yes | Evaluation task description |
| | No | Usage scenario |
### dataset

| Field | Default | Description |
|---|---|---|
| `path` | — | Path to JSON/JSONL dataset file (required) |
| | | Shuffle queries before evaluation |
| | | Max queries to use (default: all) |
### `target_endpoints.<name>`

| Field | Default | Description |
|---|---|---|
| `base_url` | — | API base URL (required) |
| `api_key` | — | API key; supports `${ENV_VAR}` substitution (required) |
| `model` | — | Model name (required) |
| `system_prompt` | built-in | System prompt; use the `{num_refs}` placeholder |
| | | Max concurrent requests for this endpoint |
| | — | Extra API request params |
| | | Enable ReAct agent with Tavily web search |
| | env var | Tavily API key (falls back to the environment variable) |
| | | Max ReAct iterations (1–30) |
| | | or |
### verification

| Field | Default | Description |
|---|---|---|
| | — | Email for Crossref polite pool |
| | — | PubMed API key |
| | | Concurrent verification threads (1–50) |
| | | Per-request timeout in seconds |
| | | Min composite score to count as `VERIFIED` |
### evaluation

| Field | Default | Description |
|---|---|---|
| | | Model API request timeout in seconds |
| | | Number of retry attempts |
### output

| Field | Default | Description |
|---|---|---|
| | `evaluation_results/ref_hallucination_arena` | Output directory |
| | | Save loaded queries |
| | | Save model responses |
| | | Save verification details |
### report

| Field | Default | Description |
|---|---|---|
| | | Enable report generation |
| | `en` | Report language: `en` or `zh` |
| | | Examples per section (1–10) |
| | | Generate charts |
| | | or |
| | | Show values on bars |
| | | Highlight best model |
## Dataset format

Each query in the JSON/JSONL dataset is an object like:

```json
{
  "query": "Please recommend papers on Transformer architectures for NLP.",
  "discipline": "computer_science",
  "num_refs": 5,
  "language": "en",
  "year_constraint": {"min_year": 2020}
}
```
| Field | Required | Description |
|---|---|---|
| `query` | Yes | Prompt for reference recommendation |
| `discipline` | No | One of seven supported disciplines, e.g. `computer_science` |
| `num_refs` | No | Expected number of references (default: 5) |
| `language` | No | `en` or `zh` (default: `en`) |
| `year_constraint` | No | Year filter, e.g. `{"min_year": 2020}` |
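
In JSONL form, each line is one such object. A two-line illustrative dataset (the queries are invented for this example; the field names follow the schema above):

```jsonl
{"query": "Please recommend papers on Transformer architectures for NLP.", "discipline": "computer_science", "num_refs": 5, "language": "en"}
{"query": "Please recommend papers on graph neural networks for molecular property prediction.", "discipline": "computer_science", "num_refs": 3, "language": "en", "year_constraint": {"min_year": 2020}}
```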
Official dataset: OpenJudge/ref-hallucination-arena
## Interpreting results
Overall accuracy (verification rate):
- > 75% — Excellent: model rarely hallucinates references
- 60–75% — Good: most references are real, some fabrication
- 40–60% — Fair: significant hallucination, use with caution
- < 40% — Poor: model frequently fabricates references
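
These bands, expressed as a small illustrative helper:

```python
def grade(rate: float) -> str:
    # Map an overall verification rate to the qualitative bands above.
    if rate > 0.75:
        return "Excellent"
    if rate >= 0.60:
        return "Good"
    if rate >= 0.40:
        return "Fair"
    return "Poor"
```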
Per-field accuracy:
- `title_accuracy` — % of titles matching real papers
- `author_accuracy` — % of correct author lists
- `year_accuracy` — % of correct publication years
- `doi_accuracy` — % of valid DOIs
Verification status:
- `VERIFIED` — title + author + year all exactly match a real paper
- `SUSPECT` — partial match (e.g. title matches but authors differ)
- `NOT_FOUND` — no match in any database
- `ERROR` — API timeout or network failure
Ranking order (ties broken left to right): overall accuracy → year compliance rate → avg confidence → completeness
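
A minimal sketch for recomputing the verification rate from the saved per-reference details, assuming `verification_results.json` is a flat list of records with a `status` field holding the values above (the actual schema may differ):

```python
import json
from collections import Counter

# Tally verification statuses and compute the overall verification rate,
# excluding ERROR records (network failures) from the denominator.
with open("evaluation_results/ref_hallucination_arena/verification_results.json") as f:
    records = json.load(f)
counts = Counter(r["status"] for r in records)
checked = sum(counts.values()) - counts["ERROR"]
rate = counts["VERIFIED"] / checked if checked else 0.0
print(dict(counts), f"verification rate: {rate:.1%}")
```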
## Output files

```
evaluation_results/ref_hallucination_arena/
├── evaluation_report.md         # Detailed Markdown report
├── evaluation_results.json      # Rankings, per-field accuracy, scores
├── verification_chart.png       # Per-field accuracy bar chart
├── discipline_chart.png         # Per-discipline accuracy chart
├── queries.json                 # Loaded evaluation queries
├── responses.json               # Raw model responses
├── extracted_refs.json          # Extracted BibTeX references
├── verification_results.json    # Per-reference verification details
└── checkpoint.json              # Pipeline checkpoint for resume
```
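
The JSON schemas are not documented here, so a quick schema-agnostic peek (path taken from the tree above) is useful before writing any parsing code:

```python
import json
from pathlib import Path

# Inspect the top-level structure of the results file before relying
# on any particular schema.
results = Path("evaluation_results/ref_hallucination_arena/evaluation_results.json")
data = json.loads(results.read_text())
print(type(data).__name__, list(data)[:10] if isinstance(data, dict) else len(data))
```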
## API key by model

| Model prefix | Environment variable |
|---|---|
| `gpt` | `OPENAI_API_KEY` |
| | |
| `qwen` | `DASHSCOPE_API_KEY` |
| | |
| Custom endpoint | Set `base_url` + `api_key` in config |
## Additional resources

- Full config examples: `cookbooks/ref_hallucination_arena/examples/`
- Documentation: `docs/validating_graders/ref_hallucination_arena.md`
- Official dataset: `OpenJudge/ref-hallucination-arena` on HuggingFace
- Leaderboard: openjudge.me/leaderboard