Showdown-claude-skill judge

Judge mode — models evaluate each other's showdown responses blind

install
source · Clone the upstream repo
git clone https://github.com/vanderheijden86/showdown-claude-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/vanderheijden86/showdown-claude-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/judge" ~/.claude/skills/vanderheijden86-showdown-claude-skill-judge && rm -rf "$T"
manifest: judge/SKILL.md
source content

You are executing the `/showdown judge` skill. Models will cross-judge each other's responses from a previous showdown, blind and anonymized.

Step 1: Determine Input Source

Check if a `/showdown` was run earlier in this conversation and you still have the responses in context.

  • If yes: Use the in-memory responses (prompt + 3 model responses). Skip to Step 2.
  • If no (or the user provided a file path as an argument):
    1. If `$ARGUMENTS` contains a file path, read that file.
    2. Otherwise, list available showdown files: `ls -t ./showdown-output/showdown-*.md 2>/dev/null | head -10`
    3. If files are found, present a numbered list using `AskUserQuestion` and let the user pick.
    4. If no files are found, tell the user: "No showdown output files found. Run `/showdown` first."
    5. Read the chosen file and extract the original prompt and each model's full response (between the `---` separators); a sketch of this step follows the list.
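
If the file route is taken, here is a minimal sketch of listing a saved showdown file and inspecting its sections. The exact file layout is an assumption, and the chosen path below is hypothetical:

# Sketch only: pick a recent showdown file and inspect its "---"-delimited sections.
ls -t ./showdown-output/showdown-*.md 2>/dev/null | head -10
FILE=./showdown-output/showdown-2025-01-01-120000.md   # hypothetical choice
# Number each section so the prompt and the three responses can be pulled out.
awk 'BEGIN{n=1} /^---$/{n++; next} {print "[" n "] " $0}' "$FILE"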

Step 2: Choose Custom Dimension

Based on the original prompt topic, choose ONE custom scoring dimension that's relevant. Examples:

  • Technical prompt → "Technical Correctness"
  • Creative prompt → "Creativity"
  • Business/strategy → "Actionability"
  • Debate/opinion → "Persuasiveness"
  • Code-related → "Code Quality"

The 4 fixed dimensions are always: Accuracy, Depth, Clarity, Originality.

Announce the 5 dimensions to the user before proceeding.

Step 3: Anonymize Responses

Randomly assign letters A, B, C to the three models. Record the mapping internally (e.g., A=Claude, B=GPT, C=Gemini). The assignment MUST be randomized — do not always use the same order.

To randomize, use:

echo "A B C" | tr ' ' '\n' | sort -R | tr '\n' ' '

Assign the first letter to the first model in `models.conf` order, the second to the second, and so on.
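
A minimal sketch of recording the mapping, assuming models.conf lists one model identifier per line and lives in the showdown skill directory (both assumptions):

# Sketch only: pair the shuffled letters with the models from models.conf.
CONF=~/.claude/skills/showdown/models.conf   # assumed location
echo "A B C" | tr ' ' '\n' | sort -R | paste -d= - "$CONF"
# Example output (letters vary per run):
#   B=claude-opus-4-6
#   A=gpt-5.3-codex
#   C=gemini-3-pro-preview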

Step 4: Construct Judge Prompts

For each judge model, construct a prompt containing ONLY the 2 responses from the OTHER models (not the judge's own). Use the anonymous letters.
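
A minimal sketch of that pairing, purely illustrative; the array variables are assumptions and not part of judge.sh:

# Sketch only: each judge sees the two responses it did not write.
LETTERS=(A B C)                                        # shuffled order from Step 3
JUDGES=("Claude Opus 4.6" "GPT-5.3 Codex" "Gemini 3 Pro")
for j in 0 1 2; do
  others=""
  for r in 0 1 2; do
    [ "$r" -ne "$j" ] && others="$others ${LETTERS[$r]}"
  done
  echo "${JUDGES[$j]} judges responses:$others"
done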

Judge prompt template:

You are evaluating two AI-generated responses to the following prompt.

---
ORIGINAL PROMPT:
{original_prompt}
---

RESPONSE {letter_1}:
{response_from_other_model_1}

---

RESPONSE {letter_2}:
{response_from_other_model_2}

---

SCORING RUBRIC:
Rate each response 1-10 on the following dimensions. For each score, provide a 1-2 sentence justification.

1. **Accuracy** — Factual correctness and absence of hallucinations
2. **Depth** — Thoroughness of analysis, nuance, and insight
3. **Clarity** — How well-organized, readable, and understandable the response is
4. **Originality** — Novel framing, unique insights, or creative approach
5. **{custom_dimension}** — {custom_description}

IMPORTANT: Evaluate solely on content quality. Do not attempt to identify which model wrote which response. One of these may share your architecture — judge purely on merit.

OUTPUT FORMAT (use exactly this structure):

### Response {letter_1}
- Accuracy: X/10 — justification
- Depth: X/10 — justification
- Clarity: X/10 — justification
- Originality: X/10 — justification
- {custom_dimension}: X/10 — justification
**Overall:** 2-3 sentence assessment

### Response {letter_2}
- Accuracy: X/10 — justification
- Depth: X/10 — justification
- Clarity: X/10 — justification
- Originality: X/10 — justification
- {custom_dimension}: X/10 — justification
**Overall:** 2-3 sentence assessment

### Winner: {letter} (or Tie)
**Reasoning:** 2-3 sentences

Step 5: Fire Judge Calls in Parallel

Build the JSON input for judge.sh and pipe it in:

echo '<json>' | bash ~/.claude/skills/showdown/scripts/judge.sh

The JSON format:

{
  "judges": [
    {
      "model": "claude-opus-4-6",
      "display_name": "Claude Opus 4.6",
      "prompt": "<constructed judge prompt for Claude>"
    },
    {
      "model": "gpt-5.3-codex",
      "display_name": "GPT-5.3 Codex",
      "prompt": "<constructed judge prompt for GPT>"
    },
    {
      "model": "gemini-3-pro-preview",
      "display_name": "Gemini 3 Pro",
      "prompt": "<constructed judge prompt for Gemini>"
    }
  ]
}

Important: Use a temp file for the JSON input since it may be very large:

# Write JSON to temp file, then pipe
TMPJSON=$(mktemp)
cat > "$TMPJSON" << 'ENDJSON'
{...}
ENDJSON
cat "$TMPJSON" | bash ~/.claude/skills/showdown/scripts/judge.sh
rm "$TMPJSON"
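
Since the prompts contain quotes, markdown, and newlines, hand-escaping the JSON is fragile. Here is a sketch using jq instead; jq 1.6+ and the prompt-file paths are assumptions, while the model IDs and field names mirror the example above:

# Sketch only: build the judges JSON with jq so string escaping is handled for you.
TMPJSON=$(mktemp)
jq -n \
  --rawfile p1 /tmp/judge-prompt-claude.txt \
  --rawfile p2 /tmp/judge-prompt-gpt.txt \
  --rawfile p3 /tmp/judge-prompt-gemini.txt \
  '{judges: [
    {model: "claude-opus-4-6",      display_name: "Claude Opus 4.6", prompt: $p1},
    {model: "gpt-5.3-codex",        display_name: "GPT-5.3 Codex",   prompt: $p2},
    {model: "gemini-3-pro-preview", display_name: "Gemini 3 Pro",    prompt: $p3}
  ]}' > "$TMPJSON"
bash ~/.claude/skills/showdown/scripts/judge.sh < "$TMPJSON"
rm "$TMPJSON"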

Step 6: Present Anonymized Verdicts

Show each judge's verdict as returned, keeping responses anonymous (A/B/C):

## Judge Verdicts (Anonymized)

### Judge: Claude Opus 4.6 (judging responses {letter_x} and {letter_y}) — <duration>s

<judge response verbatim>

---

### Judge: GPT-5.3 Codex (judging responses {letter_x} and {letter_y}) — <duration>s

<judge response verbatim>

---

### Judge: Gemini 3 Pro (judging responses {letter_x} and {letter_y}) — <duration>s

<judge response verbatim>

If a judge failed, note:

**<Judge>**: Failed — <error message>

Step 7: Reveal & Leaderboard

After ALL verdicts are shown, reveal the mapping and generate the leaderboard.

## Reveal

| Letter | Model |
|--------|-------|
| A | <model name> |
| B | <model name> |
| C | <model name> |

Then parse all judge scores and compute the leaderboard. For each model, average the scores it received from the 2 judges that evaluated it.

### Leaderboard

| Model | Accuracy | Depth | Clarity | Originality | {Custom} | Avg | Wins |
|-------|----------|-------|---------|-------------|----------|-----|------|
| ... | ... | ... | ... | ... | ... | ... | ... |

*Scores averaged across judges. Wins = times picked as winner by judges.*
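
A minimal sketch of the averaging arithmetic, assuming the per-judge scores have first been flattened into model,dimension,score lines (that intermediate format and the numbers below are placeholders, not real results):

# Sketch only: average each model's per-dimension scores across its two judges.
cat > /tmp/scores.csv << 'EOF'
Claude,Accuracy,8
Claude,Accuracy,9
Claude,Depth,7
Claude,Depth,8
EOF
awk -F, '{sum[$1","$2]+=$3; n[$1","$2]++}
         END {for (k in sum) printf "%s,%.1f\n", k, sum[k]/n[k]}' /tmp/scores.csv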

Then write a Narrative Synthesis (3-5 sentences):

  • Who won overall and why
  • Where judges agreed/disagreed
  • Any surprising patterns (e.g., a model rated highest by its competitors)
  • Whether the blind judging aligned with or diverged from the original comparison analysis

Step 8: Save (Append to Existing File)

Ask the user if they want to save the judge results using `AskUserQuestion`.

If yes:

  • If the source was a markdown file from `./showdown-output/`, append the entire judge section (Steps 6-7 output) to that same file.
  • If the source was in-memory (same session), check if a showdown markdown file was saved earlier. If so, append to it. If not, save as a new file: `showdown-YYYY-MM-DD-HHMMSS-judge.md`.

The appended section should be:


---

## Judge Verdicts

**Custom Dimension:** {custom_dimension} — {custom_description}
**Anonymization:** A={model}, B={model}, C={model}

### Judge: <name> (judging {letters}) — <duration>s

<full verdict>

---

### Judge: <name> (judging {letters}) — <duration>s

<full verdict>

---

### Judge: <name> (judging {letters}) — <duration>s

<full verdict>

---

### Leaderboard

| Model | Accuracy | Depth | Clarity | Originality | {Custom} | Avg | Wins |
|-------|----------|-------|---------|-------------|----------|-----|------|
| ... | ... | ... | ... | ... | ... | ... | ... |

### Narrative Synthesis

<3-5 sentence synthesis>

Tell the user the file path after saving.
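
A minimal sketch of the save step; the variable names are assumptions, and the timestamp format follows the filename pattern above:

# Sketch only: append to the source file if there is one, otherwise create a new file.
JUDGE_SECTION=/tmp/judge-section.md            # the Steps 6-7 output, already rendered
if [ -n "$SOURCE_FILE" ]; then
  cat "$JUDGE_SECTION" >> "$SOURCE_FILE"
  echo "Appended judge verdicts to $SOURCE_FILE"
else
  mkdir -p ./showdown-output
  OUT="./showdown-output/showdown-$(date +%Y-%m-%d-%H%M%S)-judge.md"
  cp "$JUDGE_SECTION" "$OUT"
  echo "Saved judge verdicts to $OUT"
fi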