Showdown-claude-skill judge
Judge mode — models evaluate each other's showdown responses blind
git clone https://github.com/vanderheijden86/showdown-claude-skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/vanderheijden86/showdown-claude-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/judge" ~/.claude/skills/vanderheijden86-showdown-claude-skill-judge && rm -rf "$T"
judge/SKILL.md

You are executing the `/showdown judge` skill. Models will cross-judge each other's responses from a previous showdown, blind and anonymized.
Step 1: Determine Input Source
Check if a `/showdown` was run earlier in this conversation and you still have the responses in context.
- If yes: Use the in-memory responses (prompt + 3 model responses). Skip to Step 2.
- If no (or the user provided a file path as argument):
  - If `$ARGUMENTS` contains a file path, read that file
  - Otherwise, list available showdown files:
    `ls -t ./showdown-output/showdown-*.md 2>/dev/null | head -10`
  - If files found, present a numbered list using `AskUserQuestion` and let the user pick
  - If no files found, tell the user: "No showdown output files found. Run `/showdown` first."
  - Read the chosen file and extract: the original prompt, and each model's full response (between the `---` separators), as sketched below
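A rough sketch of that extraction, assuming the saved file uses bare `---` lines as separators and that the prompt comes before the responses (the filename is a hypothetical example):

```bash
# Split a saved showdown file into sections on its "---" separator lines.
# Layout assumption: section 0 is the prompt, sections 1-3 are responses.
FILE=./showdown-output/showdown-2025-01-01-120000.md  # hypothetical example
awk 'BEGIN { n = 0 }
     /^---$/ { n++; next }
     { print > ("section-" n ".md") }' "$FILE"
```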
Step 2: Choose Custom Dimension
Based on the original prompt topic, choose ONE custom scoring dimension that's relevant. Examples:
- Technical prompt → "Technical Correctness"
- Creative prompt → "Creativity"
- Business/strategy → "Actionability"
- Debate/opinion → "Persuasiveness"
- Code-related → "Code Quality"
The 4 fixed dimensions are always: Accuracy, Depth, Clarity, Originality.
Announce the 5 dimensions to the user before proceeding.
Step 3: Anonymize Responses
Randomly assign letters A, B, C to the three models. Record the mapping internally (e.g., A=Claude, B=GPT, C=Gemini). The assignment MUST be randomized — do not always use the same order.
To randomize, use:
echo "A B C" | tr ' ' '\n' | sort -R | tr '\n' ' '
Assign the first letter to the first model in models.conf order, second to second, etc.
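A minimal sketch of the full mapping, assuming `models.conf` lives in the showdown skill directory and lists one model per line (both the path and the format are assumptions):

```bash
# Shuffle the letters once, then pair them with models in models.conf order.
# Path and one-model-per-line format are assumptions, not confirmed here.
SHUFFLED=($(echo "A B C" | tr ' ' '\n' | sort -R))
i=0
while IFS= read -r model; do
  echo "${SHUFFLED[$i]}=$model"
  i=$((i + 1))
done < ~/.claude/skills/showdown/models.conf
```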
Step 4: Construct Judge Prompts
For each judge model, construct a prompt containing ONLY the 2 responses from the OTHER models (not the judge's own). Use the anonymous letters.
Judge prompt template:
```
You are evaluating two AI-generated responses to the following prompt.

---
ORIGINAL PROMPT:
{original_prompt}
---
RESPONSE {letter_1}:
{response_from_other_model_1}
---
RESPONSE {letter_2}:
{response_from_other_model_2}
---

SCORING RUBRIC: Rate each response 1-10 on the following dimensions. For each score, provide a 1-2 sentence justification.

1. **Accuracy** — Factual correctness and absence of hallucinations
2. **Depth** — Thoroughness of analysis, nuance, and insight
3. **Clarity** — How well-organized, readable, and understandable the response is
4. **Originality** — Novel framing, unique insights, or creative approach
5. **{custom_dimension}** — {custom_description}

IMPORTANT: Evaluate solely on content quality. Do not attempt to identify which model wrote which response. One of these may share your architecture — judge purely on merit.

OUTPUT FORMAT (use exactly this structure):

### Response {letter_1}
- Accuracy: X/10 — justification
- Depth: X/10 — justification
- Clarity: X/10 — justification
- Originality: X/10 — justification
- {custom_dimension}: X/10 — justification

**Overall:** 2-3 sentence assessment

### Response {letter_2}
- Accuracy: X/10 — justification
- Depth: X/10 — justification
- Clarity: X/10 — justification
- Originality: X/10 — justification
- {custom_dimension}: X/10 — justification

**Overall:** 2-3 sentence assessment

### Winner: {letter} (or Tie)
**Reasoning:** 2-3 sentences
```
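The pairing logic behind "only the other models" is small; a throwaway illustration:

```bash
# Each judge at index i is shown the two response letters that are not its own.
LETTERS=(A B C)
for i in 0 1 2; do
  pair=()
  for j in 0 1 2; do
    [ "$j" -ne "$i" ] && pair+=("${LETTERS[$j]}")
  done
  echo "Judge $i sees responses ${pair[0]} and ${pair[1]}"
done
```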
Step 5: Fire Judge Calls in Parallel
Build the JSON input for judge.sh and pipe it in:
echo '<json>' | bash ~/.claude/skills/showdown/scripts/judge.sh
The JSON format:
{ "judges": [ { "model": "claude-opus-4-6", "display_name": "Claude Opus 4.6", "prompt": "<constructed judge prompt for Claude>" }, { "model": "gpt-5.3-codex", "display_name": "GPT-5.3 Codex", "prompt": "<constructed judge prompt for GPT>" }, { "model": "gemini-3-pro-preview", "display_name": "Gemini 3 Pro", "prompt": "<constructed judge prompt for Gemini>" } ] }
Important: Use a temp file for the JSON input since it may be very large:
```bash
# Write JSON to temp file, then pipe
TMPJSON=$(mktemp)
cat > "$TMPJSON" << 'ENDJSON'
{...}
ENDJSON
cat "$TMPJSON" | bash ~/.claude/skills/showdown/scripts/judge.sh
rm "$TMPJSON"
```
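Because the constructed prompts contain arbitrary text (quotes, newlines, markdown), hand-assembling that JSON is fragile. A minimal sketch of a safer alternative, assuming `jq` is installed and the Step 4 prompts were written to files (the `/tmp/judge-prompt-*.txt` paths are hypothetical):

```bash
# Build the judges JSON with jq so prompt text is escaped correctly.
# The prompt file paths are hypothetical placeholders for Step 4 output.
TMPJSON=$(mktemp)
jq -n \
  --arg p1 "$(cat /tmp/judge-prompt-claude.txt)" \
  --arg p2 "$(cat /tmp/judge-prompt-gpt.txt)" \
  --arg p3 "$(cat /tmp/judge-prompt-gemini.txt)" \
  '{judges: [
    {model: "claude-opus-4-6",      display_name: "Claude Opus 4.6", prompt: $p1},
    {model: "gpt-5.3-codex",        display_name: "GPT-5.3 Codex",   prompt: $p2},
    {model: "gemini-3-pro-preview", display_name: "Gemini 3 Pro",    prompt: $p3}
  ]}' > "$TMPJSON"
bash ~/.claude/skills/showdown/scripts/judge.sh < "$TMPJSON"
rm "$TMPJSON"
```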
Step 6: Present Anonymized Verdicts
Show each judge's verdict as returned, keeping responses anonymous (A/B/C):
```
## Judge Verdicts (Anonymized)

### Judge: Claude Opus 4.6 (judging responses {letter_x} and {letter_y}) — <duration>s

<judge response verbatim>

---

### Judge: GPT-5.3 Codex (judging responses {letter_x} and {letter_y}) — <duration>s

<judge response verbatim>

---

### Judge: Gemini 3 Pro (judging responses {letter_x} and {letter_y}) — <duration>s

<judge response verbatim>
```
If a judge failed, note:
**<Judge>**: Failed — <error message>
Step 7: Reveal & Leaderboard
After ALL verdicts are shown, reveal the mapping and generate the leaderboard.
```
## Reveal

| Letter | Model |
|--------|-------|
| A | <model name> |
| B | <model name> |
| C | <model name> |
```
Then parse all judge scores and compute the leaderboard. For each model, average the scores it received from the two judges that evaluated it (a parsing sketch follows the table).
```
### Leaderboard

| Model | Accuracy | Depth | Clarity | Originality | {Custom} | Avg | Wins |
|-------|----------|-------|---------|-------------|----------|-----|------|
| ... | ... | ... | ... | ... | ... | ... | ... |

*Scores averaged across judges. Wins = times picked as winner by judges.*
```
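A rough parsing sketch for the averaging, assuming the judges followed the `X/10` output format exactly (the `verdict-*.md` file names are hypothetical):

```bash
# Average all "X/10" scores given to Response A across the saved verdicts.
# Each "### Response A" block is followed by five "- Dimension: X/10" lines.
grep -h -A5 '^### Response A' verdict-*.md |
  grep -oE '[0-9]+/10' | cut -d/ -f1 |
  awk '{ sum += $1; n++ } END { if (n) printf "Response A avg: %.1f\n", sum / n }'
```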
Then write a Narrative Synthesis (3-5 sentences):
- Who won overall and why
- Where judges agreed/disagreed
- Any surprising patterns (e.g., a model rated highest by its competitors)
- Whether the blind judging aligned with or diverged from the original comparison analysis
Step 8: Save (Append to Existing File)
Ask the user if they want to save the judge results using `AskUserQuestion`.
If yes:
- If the source was a markdown file from `./showdown-output/`, append the entire judge section (Steps 6-7 output) to that same file.
- If the source was in-memory (same session), check if a showdown markdown was saved earlier. If so, append to it. If not, save as a new file: `showdown-YYYY-MM-DD-HHMMSS-judge.md`
The appended section should be:
```
---

## Judge Verdicts

**Custom Dimension:** {custom_dimension} — {custom_description}
**Anonymization:** A={model}, B={model}, C={model}

### Judge: <name> (judging {letters}) — <duration>s

<full verdict>

---

### Judge: <name> (judging {letters}) — <duration>s

<full verdict>

---

### Judge: <name> (judging {letters}) — <duration>s

<full verdict>

---

### Leaderboard

| Model | Accuracy | Depth | Clarity | Originality | {Custom} | Avg | Wins |
|-------|----------|-------|---------|-------------|----------|-----|------|
| ... | ... | ... | ... | ... | ... | ... | ... |

### Narrative Synthesis

<3-5 sentence synthesis>
```
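Appending itself is a plain shell append; a sketch with hypothetical paths:

```bash
# Append the rendered judge section to the chosen showdown file.
TARGET=./showdown-output/showdown-2025-01-01-120000.md  # hypothetical path
cat /tmp/judge-section.md >> "$TARGET"                  # hypothetical temp file
echo "Judge verdicts appended to $TARGET"
```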
Tell the user the file path after saving.