OpenJudge ref-hallucination-arena

install
source · Clone the upstream repo
git clone https://github.com/agentscope-ai/OpenJudge
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/agentscope-ai/OpenJudge "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ref-hallucination-arena" ~/.claude/skills/agentscope-ai-openjudge-ref-hallucination-arena && rm -rf "$T"
manifest: skills/ref-hallucination-arena/SKILL.md
source content

Reference Hallucination Arena Skill

Evaluate how accurately LLMs recommend real academic references using the OpenJudge RefArenaPipeline:

  1. Load queries — from JSON/JSONL dataset
  2. Collect responses — BibTeX-formatted references from target models
  3. Extract references — parse BibTeX entries from model output
  4. Verify references — cross-check against Crossref / PubMed / arXiv / DBLP
  5. Score & rank — compute verification rate, per-field accuracy, discipline breakdown
  6. Generate report — Markdown report + visualization charts
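
For intuition, here is a minimal sketch of steps 3 and 4: parsing one BibTeX entry and looking its title up via the Crossref REST API. This is illustrative only, not the pipeline's actual parser or scoring logic; it assumes the requests package is installed, and fuzzy matching is omitted entirely.

# Minimal sketch of steps 3-4: parse one BibTeX entry, then look its title up
# on Crossref. Illustrative only; the real pipeline's extraction and scoring
# are more involved.
import re
import requests

BIBTEX_FIELD = re.compile(r"(\w+)\s*=\s*[{\"](.+?)[}\"]\s*,?\s*$", re.MULTILINE)

def extract_fields(bibtex_entry: str) -> dict:
    """Pull simple key = {value} pairs out of a single BibTeX entry."""
    return {key.lower(): value for key, value in BIBTEX_FIELD.findall(bibtex_entry)}

def crossref_lookup(title: str, mailto: str = "you@example.com") -> dict | None:
    """Return the best Crossref candidate for a title, or None if nothing matches."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1, "mailto": mailto},
        timeout=30,
    )
    items = resp.json()["message"]["items"]
    return items[0] if items else None

entry = """@article{vaswani2017attention,
  title = {Attention Is All You Need},
  author = {Vaswani, Ashish and others},
  year = {2017},
}"""

fields = extract_fields(entry)
candidate = crossref_lookup(fields["title"])
if candidate:
    print("Crossref candidate:", candidate.get("title", ["?"])[0], candidate.get("DOI"))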

Prerequisites

# Install OpenJudge
pip install py-openjudge

# Extra dependency for ref_hallucination_arena (chart generation)
pip install matplotlib

Gather from user before running

| Info | Required? | Notes |
|---|---|---|
| Config YAML path | Yes | Defines endpoints, dataset, verification settings |
| Dataset path | Yes | JSON/JSONL file with queries (can be set in config) |
| API keys | Yes | Env vars: OPENAI_API_KEY, DASHSCOPE_API_KEY, etc. |
| CrossRef email | No | Improves API rate limits for verification |
| PubMed API key | No | Improves PubMed rate limits |
| Output directory | No | Default: ./evaluation_results/ref_hallucination_arena |
| Report language | No | "en" (default) or "zh" |
| Tavily API key | No | Required only if using tool-augmented mode |
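
Before a run it can help to confirm the relevant environment variables are actually set. A small pre-flight check; the required names below are examples matching the table, the optional name is hypothetical, and rate-limit credentials can also be written directly into the config instead.

# Pre-flight check for API keys referenced from config.yaml via ${ENV_VAR}.
# Adjust the lists to whichever variables your endpoints actually use.
import os

required = ["OPENAI_API_KEY", "DASHSCOPE_API_KEY"]
optional = ["TAVILY_API_KEY"]          # only relevant for tool-augmented mode

missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {missing}")

for name in optional:
    if not os.environ.get(name):
        print(f"Note: {name} is not set (optional)")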

Quick start

CLI

# Run evaluation with config file
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Resume from checkpoint (default behavior)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Start fresh, ignore checkpoint
python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save

# Override output directory
python -m cookbooks.ref_hallucination_arena --config config.yaml \
  --output_dir ./my_results --save

Python API

import asyncio
from cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline

async def main():
    pipeline = RefArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()

    for rank, (model, score) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {score:.1%}")

asyncio.run(main())

CLI options

| Flag | Default | Description |
|---|---|---|
| --config | | Path to YAML configuration file (required) |
| --output_dir | config value | Override output directory |
| --save | False | Save results to file |
| --fresh | False | Start fresh, ignore checkpoint |

Minimal config file

task:
  description: "Evaluate LLM reference recommendation capabilities"

dataset:
  path: "./data/queries.json"

target_endpoints:
  model_a:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."

  model_b:
    base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    api_key: "${DASHSCOPE_API_KEY}"
    model: "qwen3-max"
    system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format. Only recommend papers you are confident actually exist."
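
If you are comparing many endpoints, the YAML can also be generated programmatically rather than hand-edited. A sketch, assuming PyYAML is installed; the model list and file name are placeholders, and the structure simply mirrors the minimal config above:

# Generate config.yaml for several OpenAI-compatible endpoints.
# Sketch only; model names and the shared prompt are placeholders.
import yaml  # PyYAML

PROMPT = ("You are an academic literature recommendation expert. Recommend "
          "{num_refs} real papers in BibTeX format. Only recommend papers you "
          "are confident actually exist.")

config = {
    "task": {"description": "Evaluate LLM reference recommendation capabilities"},
    "dataset": {"path": "./data/queries.json"},
    "target_endpoints": {
        name: {
            "base_url": "https://api.openai.com/v1",
            "api_key": "${OPENAI_API_KEY}",
            "model": model,
            "system_prompt": PROMPT,
        }
        for name, model in [("model_a", "gpt-4"), ("model_b", "gpt-4o-mini")]
    },
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)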

Full config reference

task

| Field | Required | Description |
|---|---|---|
| description | Yes | Evaluation task description |
| scenario | No | Usage scenario |

dataset

| Field | Default | Description |
|---|---|---|
| path | | Path to JSON/JSONL dataset file (required) |
| shuffle | false | Shuffle queries before evaluation |
| max_queries | null | Max queries to use (null = all) |

target_endpoints.<name>

| Field | Default | Description |
|---|---|---|
| base_url | | API base URL (required) |
| api_key | | API key, supports ${ENV_VAR} (required) |
| model | | Model name (required) |
| system_prompt | built-in | System prompt; use {num_refs} placeholder |
| max_concurrency | 5 | Max concurrent requests for this endpoint |
| extra_params | | Extra API request params (e.g. temperature) |
| tool_config.enabled | false | Enable ReAct agent with Tavily web search |
| tool_config.tavily_api_key | env var | Tavily API key |
| tool_config.max_iterations | 10 | Max ReAct iterations (1–30) |
| tool_config.search_depth | "advanced" | "basic" or "advanced" |

verification

| Field | Default | Description |
|---|---|---|
| crossref_mailto | | Email for Crossref polite pool |
| pubmed_api_key | | PubMed API key |
| max_workers | 10 | Concurrent verification threads (1–50) |
| timeout | 30 | Per-request timeout in seconds |
| verified_threshold | 0.7 | Min composite score to count as VERIFIED |
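
How the composite score interacts with verified_threshold is easiest to see with a toy example. The field weights below are invented for illustration and are not the pipeline's actual scoring formula:

# Toy composite score: weighted average of per-field match scores, compared
# against verification.verified_threshold. Weights are invented for illustration.
def composite_score(field_scores: dict[str, float]) -> float:
    weights = {"title": 0.5, "authors": 0.25, "year": 0.15, "doi": 0.10}
    return sum(weights[f] * field_scores.get(f, 0.0) for f in weights)

def status(field_scores: dict[str, float], threshold: float = 0.7) -> str:
    score = composite_score(field_scores)
    if score >= threshold:
        return "VERIFIED"
    return "SUSPECT" if score > 0.0 else "NOT_FOUND"

# Title matches, authors partially match, year matches, no DOI returned:
print(status({"title": 1.0, "authors": 0.5, "year": 1.0, "doi": 0.0}))  # VERIFIED (0.775)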

evaluation

| Field | Default | Description |
|---|---|---|
| timeout | 120 | Model API request timeout in seconds |
| retry_times | 3 | Number of retry attempts |
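
In practice these two settings mean: each model API call gets timeout seconds to complete, and a failed call is retried up to retry_times more times. A simplified illustration (not the pipeline's actual client); call_model stands in for one chat-completion request:

# Simplified illustration of evaluation.timeout / evaluation.retry_times.
import asyncio

async def call_with_retries(call_model, timeout: float = 120, retry_times: int = 3):
    last_error = None
    for attempt in range(1 + retry_times):        # first attempt plus retries
        try:
            return await asyncio.wait_for(call_model(), timeout=timeout)
        except Exception as exc:                  # timeout or API error
            last_error = exc
    raise last_error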

output

| Field | Default | Description |
|---|---|---|
| output_dir | ./evaluation_results/ref_hallucination_arena | Output directory |
| save_queries | true | Save loaded queries |
| save_responses | true | Save model responses |
| save_details | true | Save verification details |

report

| Field | Default | Description |
|---|---|---|
| enabled | true | Enable report generation |
| language | "zh" | Report language: "zh" or "en" |
| include_examples | 3 | Examples per section (1–10) |
| chart.enabled | true | Generate charts |
| chart.orientation | "vertical" | "horizontal" or "vertical" |
| chart.show_values | true | Show values on bars |
| chart.highlight_best | true | Highlight best model |
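
The chart options map onto fairly standard matplotlib choices. A self-contained sketch of a bar chart honoring orientation, show_values, and highlight_best; the accuracy numbers are invented and this is not the pipeline's actual plotting code:

# Sketch of how the chart.* options could map onto matplotlib.
import matplotlib.pyplot as plt

accuracies = {"model_a": 0.72, "model_b": 0.61}   # invented verification rates
orientation, show_values, highlight_best = "vertical", True, True

best = max(accuracies, key=accuracies.get)
colors = ["tab:orange" if (highlight_best and m == best) else "tab:blue"
          for m in accuracies]

fig, ax = plt.subplots()
if orientation == "vertical":
    bars = ax.bar(list(accuracies), list(accuracies.values()), color=colors)
    ax.set_ylabel("Verification rate")
else:
    bars = ax.barh(list(accuracies), list(accuracies.values()), color=colors)
    ax.set_xlabel("Verification rate")

if show_values:
    ax.bar_label(bars, fmt="%.2f")   # print the value on each bar

fig.savefig("verification_chart.png", dpi=150)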

Dataset format

Each query in the JSON/JSONL dataset:

{
  "query": "Please recommend papers on Transformer architectures for NLP.",
  "discipline": "computer_science",
  "num_refs": 5,
  "language": "en",
  "year_constraint": {"min_year": 2020}
}

| Field | Required | Description |
|---|---|---|
| query | Yes | Prompt for reference recommendation |
| discipline | No | computer_science, biomedical, physics, chemistry, social_science, interdisciplinary, other |
| num_refs | No | Expected number of references (default: 5) |
| language | No | "zh" or "en" (default: "zh") |
| year_constraint | No | {"exact": 2023}, {"min_year": 2020}, {"max_year": 2015}, or {"min_year": 2020, "max_year": 2024} |

Official dataset: OpenJudge/ref-hallucination-arena
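
Creating a small dataset of your own is just a matter of writing one JSON object per line (JSONL) or a JSON array. For example, using the fields documented above (the queries themselves are invented examples):

# Write a tiny JSONL dataset using the documented query fields.
import json
import os

queries = [
    {"query": "Please recommend papers on Transformer architectures for NLP.",
     "discipline": "computer_science", "num_refs": 5, "language": "en",
     "year_constraint": {"min_year": 2020}},
    {"query": "Recommend foundational papers on CRISPR gene editing.",
     "discipline": "biomedical", "num_refs": 3, "language": "en"},
]

os.makedirs("data", exist_ok=True)
with open("data/queries.jsonl", "w", encoding="utf-8") as f:
    for q in queries:                       # one JSON object per line
        f.write(json.dumps(q, ensure_ascii=False) + "\n")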

Interpreting results

Overall accuracy (verification rate):

  • > 75% — Excellent: model rarely hallucinates references
  • 60–75% — Good: most references are real, some fabrication
  • 40–60% — Fair: significant hallucination, use with caution
  • < 40% — Poor: model frequently fabricates references

Per-field accuracy:

  • title_accuracy — % of titles matching real papers
  • author_accuracy — % of correct author lists
  • year_accuracy — % of correct publication years
  • doi_accuracy — % of valid DOIs

Verification status:

  • VERIFIED — title + author + year all exactly match a real paper
  • SUSPECT — partial match (e.g. title matches but authors differ)
  • NOT_FOUND — no match in any database
  • ERROR — API timeout or network failure

Ranking order: overall accuracy → year compliance rate → avg confidence → completeness
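
That ordering is a lexicographic tie-break: models are compared on overall accuracy first, and later criteria only matter when earlier ones are tied. Equivalent sort, sketched with invented numbers and invented key names (the pipeline's result objects may differ):

# Lexicographic ranking sketch matching the stated tie-break order.
model_scores = {
    "model_a": {"accuracy": 0.71, "year_compliance": 0.90, "confidence": 0.82, "completeness": 0.95},
    "model_b": {"accuracy": 0.71, "year_compliance": 0.94, "confidence": 0.78, "completeness": 0.93},
}

ranking = sorted(
    model_scores,
    key=lambda m: (
        model_scores[m]["accuracy"],
        model_scores[m]["year_compliance"],
        model_scores[m]["confidence"],
        model_scores[m]["completeness"],
    ),
    reverse=True,
)
print(ranking)  # ['model_b', 'model_a']: tied on accuracy, model_b wins on year compliance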

Output files

evaluation_results/ref_hallucination_arena/
├── evaluation_report.md          # Detailed Markdown report
├── evaluation_results.json       # Rankings, per-field accuracy, scores
├── verification_chart.png        # Per-field accuracy bar chart
├── discipline_chart.png          # Per-discipline accuracy chart
├── queries.json                  # Loaded evaluation queries
├── responses.json                # Raw model responses
├── extracted_refs.json           # Extracted BibTeX references
├── verification_results.json     # Per-reference verification details
└── checkpoint.json               # Pipeline checkpoint for resume
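
Everything except the Markdown report and the two PNG charts is plain JSON, so results can be inspected programmatically. The internal key layout of evaluation_results.json is not documented here, so this snippet only loads the file and lists its top-level structure:

# Load the machine-readable results and print only their top-level shape,
# since the exact schema is not documented above.
import json
from pathlib import Path

results_dir = Path("evaluation_results/ref_hallucination_arena")
with open(results_dir / "evaluation_results.json", encoding="utf-8") as f:
    results = json.load(f)

if isinstance(results, dict):
    print("top-level keys:", list(results))
else:
    print("top-level type:", type(results).__name__, "with", len(results), "items")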

API key by model

| Model prefix | Environment variable |
|---|---|
| gpt-*, o1-*, o3-* | OPENAI_API_KEY |
| claude-* | ANTHROPIC_API_KEY |
| qwen-*, dashscope/* | DASHSCOPE_API_KEY |
| deepseek-* | DEEPSEEK_API_KEY |
| Custom endpoint | set api_key + base_url in config |
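
When building configs programmatically, the table above reduces to a simple prefix lookup. A sketch; the fallback behavior for unmatched prefixes is an assumption, not pipeline logic:

# Map a model name to the environment variable from the table above.
PREFIX_TO_ENV = {
    ("gpt-", "o1-", "o3-"): "OPENAI_API_KEY",
    ("claude-",): "ANTHROPIC_API_KEY",
    ("qwen-", "dashscope/"): "DASHSCOPE_API_KEY",
    ("deepseek-",): "DEEPSEEK_API_KEY",
}

def env_var_for(model: str) -> str | None:
    for prefixes, env_var in PREFIX_TO_ENV.items():
        if model.startswith(prefixes):      # str.startswith accepts a tuple
            return env_var
    return None                             # custom endpoint: set api_key in config

print(env_var_for("qwen-max"))   # DASHSCOPE_API_KEY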

Additional resources