PaperOrchestra paper-autoraters

Run the four paper-quality autoraters from PaperOrchestra (arXiv:2604.05018, App. F.3) — Citation F1 (P0/P1 partition + Precision/Recall/F1), Literature Review Quality (6-axis 0-100 with anti-inflation rules), SxS Overall Paper Quality (side-by-side), and SxS Literature Review Quality (side-by-side). TRIGGER when the user asks to "score this paper draft", "evaluate against the benchmark", "compare two papers", or "run the autoraters".

install
source · Clone the upstream repo
git clone https://github.com/Ar9av/PaperOrchestra
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Ar9av/PaperOrchestra "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/paper-autoraters" ~/.claude/skills/ar9av-paperorchestra-paper-autoraters && rm -rf "$T"
manifest: skills/paper-autoraters/SKILL.md
source content

Paper Autoraters (App. F.3)

Faithful implementation of the four LLM-as-judge autoraters used in PaperOrchestra (Song et al., 2026, arXiv:2604.05018, §5 and App. F.3).

These are the metrics the paper uses to demonstrate that PaperOrchestra beats single-agent and AI-Scientist-v2 baselines. Use them to:

  1. Score a generated paper against a ground-truth paper.
  2. Compare two paper-writing pipelines side-by-side.
  3. Validate your own host-agent execution of the paper-orchestra pipeline.

The four autoraters

| Autorater | What it does | Inputs | Output |
|---|---|---|---|
| Citation F1 — P0/P1 partition | Partitions the reference list into P0 (must-cite) and P1 (good-to-cite) given the paper text | one paper text + its references list | JSON {ref_num: "P0" or "P1"} |
| Literature Review Quality | 6-axis 0-100 score for Intro + Related Work, with anti-inflation hard caps | one paper PDF/text + reference avg citation count | JSON with axis_scores, penalties, summary, overall_score |
| SxS Overall Paper Quality | Holistic side-by-side preference judgment | two papers (PDF or text) | JSON with winner ∈ {paper_1, paper_2, tie} |
| SxS Literature Review Quality | Side-by-side preference, Intro + Related Work only | two papers | JSON with winner ∈ {paper_1, paper_2, tie} |

The paper uses Gemini-3.1-Pro and GPT-5 as judges, at temperature 0.0 for Gemini and the default 1.0 for GPT-5 (which does not allow temperature adjustment). Use whatever LLM your host provides.

Workflow

Citation F1 (compute Precision / Recall / F1 vs ground truth)

This is a two-step procedure:

Step 1: Partition the reference lists into P0 / P1

For both the ground-truth paper AND the generated paper, run the LLM with references/citation-f1-prompt.md:

inputs:
  paper_text:    full paper LaTeX or markdown
  references_str: numbered reference list (e.g., "1. Vaswani et al. (2017)
                  Attention Is All You Need. NeurIPS. 2. He et al. (2016)
                  Deep Residual Learning for Image Recognition. CVPR. ...")

output: JSON {"1": "P0", "2": "P1", "3": "P0", ...}

Save both partitions:

  • bench/<paper_id>/gt_partition.json
  • bench/<paper_id>/gen_partition.json
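
A minimal sketch of this step follows, assuming a hypothetical call_llm() helper for your host's judge model, paper texts and reference strings already loaded into gt_text / gt_refs_str / gen_text / gen_refs_str, and that the two inputs are simply appended to the prompt (where they are actually spliced in is defined by the prompt file itself):

import json
from pathlib import Path

PROMPT = Path("skills/paper-autoraters/references/citation-f1-prompt.md").read_text()

def partition_refs(paper_text, references_str, out_path):
    # Assumption: inputs are appended after the prompt instructions.
    prompt = "\n\n".join([PROMPT, "PAPER TEXT:", paper_text, "REFERENCES:", references_str])
    raw = call_llm(prompt)                      # hypothetical judge-model helper
    partition = json.loads(raw)                 # {"1": "P0", "2": "P1", ...}
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text(json.dumps(partition, indent=2))
    return partition

paper_id = "example_0001"                       # illustrative benchmark id
partition_refs(gt_text, gt_refs_str, f"bench/{paper_id}/gt_partition.json")
partition_refs(gen_text, gen_refs_str, f"bench/{paper_id}/gen_partition.json")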

Step 2: Resolve references to entity IDs and compute F1

The paper uses Semantic Scholar paper IDs to match references between the two lists. The compute_f1.py script does this deterministically given two input lists:

python skills/paper-autoraters/scripts/compute_f1.py \
    --gt-partition gt_partition.json \
    --gt-refs gt_refs.json \
    --gen-partition gen_partition.json \
    --gen-refs gen_refs.json \
    --out f1_report.json

Where gt_refs.json and gen_refs.json are lists of {ref_num, paper_id, title} produced by your host's S2-resolution pass (the same fuzzy match + S2 verification used by literature-review-agent/scripts/).

Output JSON contains P0 / P1 / overall Precision, Recall, F1.
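
The exact arithmetic lives in scripts/compute_f1.py; the sketch below only illustrates the matching rule, under the assumption that references are compared by their Semantic Scholar paper_id and that precision/recall are computed per priority level (see the script for how the overall numbers combine P0 and P1):

import json

def ids_for_level(partition_path, refs_path, level):
    # Collect the S2 paper IDs whose reference number was labeled `level` ("P0" or "P1").
    partition = json.load(open(partition_path))          # {"1": "P0", "2": "P1", ...}
    refs = {str(r["ref_num"]): r["paper_id"] for r in json.load(open(refs_path))}
    return {refs[n] for n, lvl in partition.items() if lvl == level and n in refs}

def prf(gt_ids, gen_ids):
    tp = len(gt_ids & gen_ids)
    precision = tp / len(gen_ids) if gen_ids else 0.0
    recall = tp / len(gt_ids) if gt_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

report = {
    level: prf(ids_for_level("gt_partition.json", "gt_refs.json", level),
               ids_for_level("gen_partition.json", "gen_refs.json", level))
    for level in ("P0", "P1")
}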

Literature Review Quality (single paper, 6 axes)

Load references/litreview-quality-prompt.md. Inputs:

  • The full paper PDF (or LaTeX/markdown if your host lacks PDF input)
  • avg_citation_count for the venue/field, used as the baseline for citation-count anchoring (e.g., 58.52 for CVPR 2025 and 59.18 for ICLR 2025, per the paper)

The prompt instructs the model to evaluate ONLY the literature-review function of the paper (Introduction + Related Work / Background sections). It produces a strict JSON output with per-axis scores and justifications.

Critical anti-inflation rules baked into the prompt:

| Rule | Cap |
|---|---|
| Default expectation | overall 45-70 |
| Overall > 85 | requires strong evidence on ALL axes |
| Overall > 90 | extremely rare (near-survey-level mastery) |
| Any axis < 50 | overall rarely > 75 |
| Mostly descriptive review | Critical Analysis ≤ 60 |
| Novelty asserted without comparison | Positioning ≤ 60 |
| Sparse/inconsistent citations | Citation Rigor ≤ 60 |
| Citation count < 50% of avg | Coverage ≤ 55 |
| Citation count > 120% of avg | Coverage = "strong" |

Plus penalty table:

| Penalty | Range |
|---|---|
| Overclaiming novelty | -5 to -15 |
| Missing key recent work | -5 to -15 |
| Mostly descriptive review | -5 to -10 |
| Weak gap statements | -5 to -10 |
| Citation dumping | -5 to -10 |
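
The judge applies these caps and penalties itself as part of the rubric, but they are mechanical enough to double-check after the fact. The sketch below is an illustrative (not exhaustive) sanity check on the judge's output; it assumes axis_scores is keyed by the axis names used in the caps table and that you know the paper's citation count:

def check_caps(result, citation_count, avg_citation_count):
    # Flag scores that violate the rubric's hard caps; an empty list means no obvious inflation.
    axes = result["axis_scores"]              # e.g. {"Coverage": 62, "Critical Analysis": 55, ...}
    overall = result["overall_score"]
    problems = []
    if any(s < 50 for s in axes.values()) and overall > 75:
        problems.append("an axis is below 50 but overall exceeds 75")
    if citation_count < 0.5 * avg_citation_count and axes.get("Coverage", 0) > 55:
        problems.append("citation count under 50% of avg but Coverage exceeds 55")
    if overall > 85:
        problems.append("overall above 85: verify strong evidence on ALL axes")
    return problems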

Save the output to litreview_quality_score.json. The score JSON is the same shape used by content-refinement-agent/scripts/score_delta.py, so you can re-use the halt-rule logic to compare iterations.

SxS Overall Paper Quality (side-by-side, full paper)

Load references/sxs-paper-quality-prompt.md. Inputs:

  • Two paper PDFs or LaTeX files (call them paper_1 and paper_2)

The prompt produces a JSON with paper_1_holistic_analysis, paper_2_holistic_analysis, comparison_justification, and winner ∈ {paper_1, paper_2, tie}.
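
For concreteness, a parsed judge response has this shape (the field names are the ones above; the prose and the winner value are made up for illustration):

example_response = {
    "paper_1_holistic_analysis": "Clear motivation, strong experiments, thin related work ...",
    "paper_2_holistic_analysis": "Broader related-work coverage, weaker ablations ...",
    "comparison_justification": "Paper 1's empirical evidence outweighs its narrower review ...",
    "winner": "paper_1",   # one of "paper_1", "paper_2", "tie"
}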

To mitigate LLM positional bias (the paper notes this in §5.4), run the comparison twice with the order swapped:

call_1: paper_A → paper_1, paper_B → paper_2  → winner1
call_2: paper_B → paper_1, paper_A → paper_2  → winner2

The final outcome is a win (both calls agree on paper A), a tie (one win + one tie, or two ties), or a loss (both agree on paper B). The paper uses this exact ordering protocol.
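
A minimal sketch of that protocol, assuming a hypothetical judge_sxs() wrapper that runs the SxS prompt on one ordered pair and returns its winner field; the text above does not spell out how a split decision (each call favoring a different paper) is scored, so this sketch treats it as a tie:

def run_sxs(paper_a, paper_b):
    # Returns "win" / "tie" / "loss" from paper A's perspective, using the order-swap protocol.
    w1 = judge_sxs(paper_1=paper_a, paper_2=paper_b)   # call 1: A presented first
    w2 = judge_sxs(paper_1=paper_b, paper_2=paper_a)   # call 2: order swapped
    a_wins = (w1 == "paper_1") + (w2 == "paper_2")     # times A was preferred
    b_wins = (w1 == "paper_2") + (w2 == "paper_1")     # times B was preferred
    if a_wins == 2:
        return "win"
    if b_wins == 2:
        return "loss"
    return "tie"                                       # one win + one tie, two ties, or a split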

SxS Literature Review Quality (side-by-side, Intro+RW only)

Load references/sxs-litreview-prompt.md. Same input/output shape as the SxS paper-quality autorater, but the model is instructed to evaluate only the Introduction and Related Work / Background sections of each paper. Same positional-bias mitigation: run twice, swap order.

Resources

  • references/citation-f1-prompt.md — verbatim P0/P1 partition prompt from App. F.3
  • references/litreview-quality-prompt.md — verbatim 6-axis litreview rubric from App. F.3
  • references/sxs-paper-quality-prompt.md — verbatim SxS paper-quality prompt from App. F.3
  • references/sxs-litreview-prompt.md — verbatim SxS litreview prompt from App. F.3
  • scripts/compute_f1.py — Precision / Recall / F1 from two partition JSONs