OpenJudge auto-arena
install
source · Clone the upstream repo

```bash
git clone https://github.com/agentscope-ai/OpenJudge
```

Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/agentscope-ai/OpenJudge "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/auto-arena" ~/.claude/skills/agentscope-ai-openjudge-auto-arena && rm -rf "$T"
```
manifest: skills/auto-arena/SKILL.md
Auto Arena Skill
End-to-end automated model comparison using the OpenJudge
AutoArenaPipeline:
- Generate queries — LLM creates diverse test queries from task description
- Collect responses — query all target endpoints concurrently (sketched below)
- Generate rubrics — LLM produces evaluation criteria from task + sample queries
- Pairwise evaluation — judge model compares every model pair (with position-bias swap)
- Analyze & rank — compute win rates, win matrix, and rankings
- Report & charts — Markdown report + win-rate bar chart + optional matrix heatmap
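To make step 2 concrete, here is a minimal sketch of fanning queries out to all endpoints concurrently with asyncio. The `query_endpoint` helper is a hypothetical stand-in for a real OpenAI-compatible call, not part of the OpenJudge API:

```python
import asyncio

async def query_endpoint(name: str, query: str) -> str:
    # Hypothetical stand-in for one OpenAI-compatible chat completion call.
    await asyncio.sleep(0.1)  # placeholder for network I/O
    return f"[{name}] response to: {query}"

async def collect_responses(endpoints: list[str], queries: list[str]) -> dict:
    # Step 2: fan out every (endpoint, query) pair concurrently.
    tasks = {
        (name, q): asyncio.create_task(query_endpoint(name, q))
        for name in endpoints
        for q in queries
    }
    return {key: await task for key, task in tasks.items()}

responses = asyncio.run(collect_responses(["gpt4", "qwen"], ["How do I get a refund?"]))
print(responses)
```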
Prerequisites
```bash
# Install OpenJudge
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib
```
Gather from user before running
| Info | Required? | Notes |
|---|---|---|
| Task description | Yes | What the models/agents should do (set in config YAML) |
| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |
| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. gpt-4, qwen-max) |
| API keys | Yes | Env vars: OPENAI_API_KEY, DASHSCOPE_API_KEY, etc. |
| Number of queries | No | Pipeline default used if unset |
| Seed queries | No | Example queries to guide generation style |
| System prompts | No | Per-endpoint system prompts |
| Output directory | No | Default: evaluation_results/ |
| Report language | No | en or zh |
Quick start
CLI
```bash
# Run evaluation
python -m cookbooks.auto_arena --config config.yaml --save

# Use pre-generated queries
python -m cookbooks.auto_arena --config config.yaml \
  --queries_file queries.json --save

# Start fresh, ignore checkpoint
python -m cookbooks.auto_arena --config config.yaml --fresh --save

# Re-run only pairwise evaluation with new judge model
# (keeps queries, responses, and rubrics)
python -m cookbooks.auto_arena --config config.yaml --rerun-judge --save
```
Python API
```python
import asyncio

from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline

async def main():
    pipeline = AutoArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()

    print(f"Best model: {result.best_pipeline}")
    for rank, (model, win_rate) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {win_rate:.1%}")

asyncio.run(main())
```
Minimal Python API (no config file)
```python
import asyncio

from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline
from cookbooks.auto_arena.schema import OpenAIEndpoint

async def main():
    pipeline = AutoArenaPipeline(
        task_description="Customer service chatbot for e-commerce",
        target_endpoints={
            "gpt4": OpenAIEndpoint(
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                model="gpt-4",
            ),
            "qwen": OpenAIEndpoint(
                base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
                api_key="sk-...",
                model="qwen-max",
            ),
        },
        judge_endpoint=OpenAIEndpoint(
            base_url="https://api.openai.com/v1",
            api_key="sk-...",
            model="gpt-4",
        ),
        num_queries=20,
    )
    result = await pipeline.evaluate()
    print(f"Best: {result.best_pipeline}")

asyncio.run(main())
```
CLI options
| Flag | Default | Description |
|---|---|---|
| `--config` | — | Path to YAML configuration file (required) |
| — | config value | Override output directory |
| `--queries_file` | — | Path to pre-generated queries JSON (skip generation) |
| `--save` | false | Save results to file |
| `--fresh` | false | Start fresh, ignore checkpoint |
| `--rerun-judge` | false | Re-run pairwise evaluation only (keep queries/responses/rubrics) |
Minimal config file
```yaml
task:
  description: "Academic GPT assistant for research and writing tasks"

target_endpoints:
  model_v1:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
  model_v2:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-3.5-turbo"

judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
```
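The `${OPENAI_API_KEY}` placeholders above are resolved from environment variables when the config is loaded. A minimal sketch of that substitution idea (the pipeline's actual config loader may differ):

```python
import os
import re

def expand_env(value: str) -> str:
    # Replace each ${VAR} with os.environ["VAR"]; leave unknown vars untouched.
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),
        value,
    )

print(expand_env("${OPENAI_API_KEY}"))  # -> the key from your environment
```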
Full config reference
task
| Field | Required | Description |
|---|---|---|
| `description` | Yes | Clear description of the task models will be tested on |
| — | No | Usage scenario for additional context |
target_endpoints.<name>
| Field | Default | Description |
|---|---|---|
| `base_url` | — | API base URL (required) |
| `api_key` | — | API key, supports `${ENV_VAR}` references (required) |
| `model` | — | Model name (required) |
| — | — | System prompt for this endpoint |
| — | — | Extra API params passed through to the endpoint |
judge_endpoint
Same fields as `target_endpoints.<name>`. Use a strong model (e.g. gpt-4, qwen-max) with low temperature (~0.1) for consistent judgments.
query_generation
| Field | Default | Description |
|---|---|---|
| `num_queries` | — | Total number of queries to generate |
| — | — | Example queries to guide generation |
| — | — | Query categories with weights for stratified generation |
| — | judge endpoint | Custom endpoint for query generation |
| — | — | Queries generated per API call (1–50) |
| — | — | Parallel generation batches |
| — | — | Sampling temperature (0.0–2.0) |
| — | — | Top-p sampling (0.0–1.0) |
| — | — | Dedup similarity threshold (0.0–1.0) |
| — | — | Enable Evol-Instruct complexity evolution |
| — | — | Evolution rounds (0–3) |
| — | — | Evolution strategies |
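As a rough illustration of the similarity-threshold dedup above, here is a sketch using stdlib `difflib`; OpenJudge's actual similarity metric is not specified here and may differ:

```python
from difflib import SequenceMatcher

def dedup(queries: list[str], threshold: float = 0.85) -> list[str]:
    kept: list[str] = []
    for q in queries:
        # Drop q if it is too similar to any query we already kept.
        if all(SequenceMatcher(None, q, k).ratio() < threshold for k in kept):
            kept.append(q)
    return kept

print(dedup([
    "How do I return an item?",
    "How do I return an item??",   # near-duplicate, filtered out
    "What is your shipping time?",
]))
```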
evaluation
| Field | Default | Description |
|---|---|---|
| — | — | Max concurrent API requests |
| — | — | Request timeout in seconds |
| — | — | Retry attempts for failed requests |
output
| Field | Default | Description |
|---|---|---|
| — | evaluation_results/ | Output directory |
| — | — | Save generated queries |
| — | — | Save model responses |
| — | — | Save detailed results |
report
| Field | Default | Description |
|---|---|---|
| — | — | Enable Markdown report generation |
| — | — | Report language: en or zh |
| — | — | Examples per section (1–10) |
| — | — | Generate win-rate chart |
| — | — | Show values on bars |
| — | — | Highlight best model |
| — | — | Generate win-rate matrix heatmap |
| — | — | Chart format (e.g. png) |
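For reference, the chart options above drive matplotlib (the extra dependency from Prerequisites). A minimal, hypothetical sketch of a win-rate bar chart in the same spirit, using illustrative win rates; it is not the pipeline's actual plotting code:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs in scripts/CI
import matplotlib.pyplot as plt

win_rates = {"gpt4_baseline": 0.80, "qwen_candidate": 0.60, "llama_finetuned": 0.50}

fig, ax = plt.subplots()
bars = ax.bar(list(win_rates), list(win_rates.values()))
ax.bar_label(bars, labels=[f"{v:.0%}" for v in win_rates.values()])  # values on bars
ax.set_ylabel("Win rate")
ax.set_ylim(0, 1)
fig.savefig("win_rate_chart.png")
```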
Interpreting results
Win rate: percentage of pairwise comparisons a model wins. Each pair is evaluated in both orders (original + swapped) to eliminate position bias.
Rankings example:
```
1. gpt4_baseline    [################----] 80.0%
2. qwen_candidate   [############--------] 60.0%
3. llama_finetuned  [##########----------] 50.0%
```
Win matrix: `win_matrix[A][B]` = how often model A beats model B across all queries.
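A sketch of how win rates and the win matrix fall out of pairwise judgments, assuming hypothetical judgment data; each unordered pair appears twice (original plus swapped order) to model the position-bias swap:

```python
# Hypothetical judgments: (first_position, second_position) -> winner.
judgments = {
    ("gpt4", "qwen"): "gpt4",  ("qwen", "gpt4"): "gpt4",
    ("gpt4", "llama"): "gpt4", ("llama", "gpt4"): "llama",
    ("qwen", "llama"): "qwen", ("llama", "qwen"): "qwen",
}

models = ["gpt4", "qwen", "llama"]
wins = {m: 0 for m in models}
games = {m: 0 for m in models}
win_matrix = {a: {b: 0 for b in models if b != a} for a in models}

for (a, b), winner in judgments.items():
    loser = b if winner == a else a
    win_matrix[winner][loser] += 1   # win_matrix[A][B]: how often A beats B
    wins[winner] += 1
    games[a] += 1
    games[b] += 1

ranked = sorted(models, key=lambda m: wins[m] / games[m], reverse=True)
for rank, m in enumerate(ranked, 1):
    print(f"{rank}. {m}: {wins[m] / games[m]:.1%}")  # win rate per model
```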
Checkpoint & resume
The pipeline saves progress after each step. Interrupted runs resume automatically:
- `--fresh` — ignore checkpoint, start from scratch
- `--rerun-judge` — re-run only the pairwise evaluation step (useful when switching judge models); keeps queries, responses, and rubrics intact
- Adding new endpoints to the config triggers incremental response collection; existing responses are preserved
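A minimal sketch of the step-wise checkpoint pattern described above, assuming a simple JSON file of completed steps; the real schema of the `checkpoint.json` listed under Output files is not documented here:

```python
import json
from pathlib import Path

CKPT = Path("evaluation_results/checkpoint.json")
STEPS = ["queries", "responses", "rubrics", "pairwise", "report"]

def load_done() -> set[str]:
    return set(json.loads(CKPT.read_text())["done"]) if CKPT.exists() else set()

def run(fresh: bool = False) -> None:
    done = set() if fresh else load_done()    # --fresh ignores the checkpoint
    for step in STEPS:
        if step in done:
            continue                          # resume: skip finished steps
        print(f"running {step}...")           # stand-in for the real work
        done.add(step)
        CKPT.parent.mkdir(parents=True, exist_ok=True)
        CKPT.write_text(json.dumps({"done": sorted(done)}))

run()
```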
Output files
```
evaluation_results/
├── evaluation_results.json   # Rankings, win rates, win matrix
├── evaluation_report.md      # Detailed Markdown report (if enabled)
├── win_rate_chart.png        # Win-rate bar chart (if enabled)
├── win_rate_matrix.png       # Matrix heatmap (if matrix_enabled)
├── queries.json              # Generated test queries
├── responses.json            # All model responses
├── rubrics.json              # Generated evaluation rubrics
├── comparison_details.json   # Pairwise comparison details
└── checkpoint.json           # Pipeline checkpoint
```
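To consume the saved results programmatically, something like the following should work; note that the `rankings` field layout ([model, win_rate] pairs) is an assumption inferred from the Python API example above, not a documented schema:

```python
import json

# Assumed layout: {"rankings": [["model_name", 0.8], ...], ...}
with open("evaluation_results/evaluation_results.json") as f:
    results = json.load(f)

for rank, (model, win_rate) in enumerate(results["rankings"], 1):
    print(f"{rank}. {model}: {win_rate:.1%}")
```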
API key by model
| Model prefix | Environment variable |
|---|---|
| gpt-* | OPENAI_API_KEY |
| qwen-* | DASHSCOPE_API_KEY |
| Custom endpoint | Set `base_url` and `api_key` in the config |
Additional resources
- Full config examples: cookbooks/auto_arena/examples/
- Documentation: Auto Arena Guide