PaperOrchestra paper-autoraters

Run the four paper-quality autoraters from PaperOrchestra (arXiv:2604.05018, App. F.3) — Citation F1 (P0/P1 partition + Precision/Recall/F1), Literature Review Quality (6-axis 0-100 with anti-inflation rules), SxS Overall Paper Quality (side-by-side), and SxS Literature Review Quality (side-by-side). TRIGGER when the user asks to "score this paper draft", "evaluate against the benchmark", "compare two papers", or "run the autoraters".

install
source · Clone the upstream repo
git clone https://github.com/Ar9av/PaperOrchestra
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Ar9av/PaperOrchestra "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/paper-autoraters" ~/.claude/skills/ar9av-paperorchestra-paper-autoraters && rm -rf "$T"
manifest: skills/paper-autoraters/SKILL.md
source content

Paper Autoraters (App. F.3)

Faithful implementation of the four LLM-as-judge autoraters used in PaperOrchestra (Song et al., 2026, arXiv:2604.05018, §5 and App. F.3).

These are the metrics the paper uses to demonstrate that PaperOrchestra beats single-agent and AI-Scientist-v2 baselines. Use them to:

  1. Score a generated paper against a ground-truth paper.
  2. Compare two paper-writing pipelines side-by-side.
  3. Validate your own host-agent execution of the paper-orchestra pipeline.

The four autoraters

| Autorater | What it does | Inputs | Output |
|---|---|---|---|
| Citation F1 — P0/P1 partition | Partitions the reference list into P0 (must-cite) and P1 (good-to-cite) given the paper text | one paper text + its references list | JSON {ref_num: "P0" or "P1"} |
| Literature Review Quality | 6-axis 0-100 score for Intro + Related Work, with anti-inflation hard caps | one paper PDF/text + reference avg citation count | JSON with axis_scores, penalties, summary, overall_score |
| SxS Overall Paper Quality | Holistic side-by-side preference judgment | two papers (PDF or text) | JSON with winner ∈ {paper_1, paper_2, tie} |
| SxS Literature Review Quality | Side-by-side preference, Intro + Related Work only | two papers | JSON with winner ∈ {paper_1, paper_2, tie} |

The paper uses Gemini-3.1-Pro and GPT-5 as judges, at temperature 0.0 for Gemini and the default 1.0 for GPT-5 (which does not allow temperature adjustment). Use whatever LLM your host provides.

Workflow

Citation F1 (compute Precision / Recall / F1 vs ground truth)

This is a two-step procedure:

Step 1: Partition the reference lists into P0 / P1

For both the ground-truth paper AND the generated paper, run the LLM with references/citation-f1-prompt.md:

inputs:
  paper_text:    full paper LaTeX or markdown
  references_str: numbered reference list (e.g., "1. Vaswani et al. (2017)
                  Attention Is All You Need. NeurIPS. 2. He et al. (2016)
                  Deep Residual Learning for Image Recognition. CVPR. ...")

output: JSON {"1": "P0", "2": "P1", "3": "P0", ...}

Save both partitions:

  • bench/<paper_id>/gt_partition.json
  • bench/<paper_id>/gen_partition.json
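
A minimal sketch of this step follows, assuming a hypothetical call_llm() helper for your host's judge model, paper texts and reference strings already loaded into gt_text / gt_refs_str / gen_text / gen_refs_str, and that the two inputs are simply appended to the prompt (where they are actually spliced in is defined by the prompt file itself):

import json
from pathlib import Path

PROMPT = Path("skills/paper-autoraters/references/citation-f1-prompt.md").read_text()

def partition_refs(paper_text, references_str, out_path):
    # Assumption: inputs are appended after the prompt instructions.
    prompt = "\n\n".join([PROMPT, "PAPER TEXT:", paper_text, "REFERENCES:", references_str])
    raw = call_llm(prompt)                      # hypothetical judge-model helper
    partition = json.loads(raw)                 # {"1": "P0", "2": "P1", ...}
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text(json.dumps(partition, indent=2))
    return partition

paper_id = "example_0001"                       # illustrative benchmark id
partition_refs(gt_text, gt_refs_str, f"bench/{paper_id}/gt_partition.json")
partition_refs(gen_text, gen_refs_str, f"bench/{paper_id}/gen_partition.json")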

Step 2: Resolve references to entity IDs and compute F1

The paper uses Semantic Scholar paper IDs to match references between the two lists. The compute_f1.py script does this deterministically given two input lists:

python skills/paper-autoraters/scripts/compute_f1.py \
    --gt-partition gt_partition.json \
    --gt-refs gt_refs.json \
    --gen-partition gen_partition.json \
    --gen-refs gen_refs.json \
    --out f1_report.json

Where gt_refs.json and gen_refs.json are lists of {ref_num, paper_id, title} produced by your host's S2-resolution pass (the same fuzzy match + S2 verification used by literature-review-agent/scripts/).

Output JSON contains P0 / P1 / overall Precision, Recall, F1.
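
The exact arithmetic lives in scripts/compute_f1.py; the sketch below only illustrates the matching rule, under the assumption that references are compared by their Semantic Scholar paper_id and that precision/recall are computed per priority level (see the script for how the overall numbers combine P0 and P1):

import json

def ids_for_level(partition_path, refs_path, level):
    # Collect the S2 paper IDs whose reference number was labeled `level` ("P0" or "P1").
    partition = json.load(open(partition_path))          # {"1": "P0", "2": "P1", ...}
    refs = {str(r["ref_num"]): r["paper_id"] for r in json.load(open(refs_path))}
    return {refs[n] for n, lvl in partition.items() if lvl == level and n in refs}

def prf(gt_ids, gen_ids):
    tp = len(gt_ids & gen_ids)
    precision = tp / len(gen_ids) if gen_ids else 0.0
    recall = tp / len(gt_ids) if gt_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

report = {
    level: prf(ids_for_level("gt_partition.json", "gt_refs.json", level),
               ids_for_level("gen_partition.json", "gen_refs.json", level))
    for level in ("P0", "P1")
}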

Literature Review Quality (single paper, 6 axes)

Load references/litreview-quality-prompt.md. Inputs:

  • The full paper PDF (or LaTeX/markdown if your host lacks PDF input)
  • avg_citation_count for the venue/field, used as the baseline for citation-count anchoring (e.g., 58.52 for CVPR 2025 and 59.18 for ICLR 2025, per the paper)

The prompt instructs the model to evaluate ONLY the literature-review function of the paper (Introduction + Related Work / Background sections). It produces a strict JSON output with per-axis scores and justifications.

Critical anti-inflation rules baked into the prompt:

| Rule | Cap |
|---|---|
| Default expectation | overall 45-70 |
| Overall > 85 | requires strong evidence on ALL axes |
| Overall > 90 | extremely rare (near-survey-level mastery) |
| Any axis < 50 | overall rarely > 75 |
| Mostly descriptive review | Critical Analysis ≤ 60 |
| Novelty asserted without comparison | Positioning ≤ 60 |
| Sparse/inconsistent citations | Citation Rigor ≤ 60 |
| Citation count < 50% of avg | Coverage ≤ 55 |
| Citation count > 120% of avg | Coverage = "strong" |

Plus penalty table:

| Penalty | Range |
|---|---|
| Overclaiming novelty | -5 to -15 |
| Missing key recent work | -5 to -15 |
| Mostly descriptive review | -5 to -10 |
| Weak gap statements | -5 to -10 |
| Citation dumping | -5 to -10 |
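
The judge applies these caps and penalties itself as part of the rubric, but they are mechanical enough to double-check after the fact. The sketch below is an illustrative (not exhaustive) sanity check on the judge's output; it assumes axis_scores is keyed by the axis names used in the caps table and that you know the paper's citation count:

def check_caps(result, citation_count, avg_citation_count):
    # Flag scores that violate the rubric's hard caps; an empty list means no obvious inflation.
    axes = result["axis_scores"]              # e.g. {"Coverage": 62, "Critical Analysis": 55, ...}
    overall = result["overall_score"]
    problems = []
    if any(s < 50 for s in axes.values()) and overall > 75:
        problems.append("an axis is below 50 but overall exceeds 75")
    if citation_count < 0.5 * avg_citation_count and axes.get("Coverage", 0) > 55:
        problems.append("citation count under 50% of avg but Coverage exceeds 55")
    if overall > 85:
        problems.append("overall above 85: verify strong evidence on ALL axes")
    return problems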

Save the output to litreview_quality_score.json. The score JSON is the same shape used by content-refinement-agent/scripts/score_delta.py, so you can re-use the halt-rule logic to compare iterations.

SxS Overall Paper Quality (side-by-side, full paper)

Load references/sxs-paper-quality-prompt.md. Inputs:

  • Two paper PDFs or LaTeX files (call them paper_1 and paper_2)

The prompt produces a JSON with paper_1_holistic_analysis, paper_2_holistic_analysis, comparison_justification, and winner ∈ {paper_1, paper_2, tie}.
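
For concreteness, a parsed judge response has this shape (the field names are the ones above; the prose and the winner value are made up for illustration):

example_response = {
    "paper_1_holistic_analysis": "Clear motivation, strong experiments, thin related work ...",
    "paper_2_holistic_analysis": "Broader related-work coverage, weaker ablations ...",
    "comparison_justification": "Paper 1's empirical evidence outweighs its narrower review ...",
    "winner": "paper_1",   # one of "paper_1", "paper_2", "tie"
}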

To mitigate LLM positional bias (the paper notes this in §5.4), run the comparison twice with the order swapped:

call_1: paper_A → paper_1, paper_B → paper_2  → winner1
call_2: paper_B → paper_1, paper_A → paper_2  → winner2

The final outcome is a win (both calls agree on paper A), a tie (one win + one tie, or two ties), or a loss (both agree on paper B). The paper uses this exact ordering protocol.
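
A minimal sketch of that protocol, assuming a hypothetical judge_sxs() wrapper that runs the SxS prompt on one ordered pair and returns its winner field; the text above does not spell out how a split decision (each call favoring a different paper) is scored, so this sketch treats it as a tie:

def run_sxs(paper_a, paper_b):
    # Returns "win" / "tie" / "loss" from paper A's perspective, using the order-swap protocol.
    w1 = judge_sxs(paper_1=paper_a, paper_2=paper_b)   # call 1: A presented first
    w2 = judge_sxs(paper_1=paper_b, paper_2=paper_a)   # call 2: order swapped
    a_wins = (w1 == "paper_1") + (w2 == "paper_2")     # times A was preferred
    b_wins = (w1 == "paper_2") + (w2 == "paper_1")     # times B was preferred
    if a_wins == 2:
        return "win"
    if b_wins == 2:
        return "loss"
    return "tie"                                       # one win + one tie, two ties, or a split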

SxS Literature Review Quality (side-by-side, Intro+RW only)

Load references/sxs-litreview-prompt.md. Same input/output shape as the SxS paper-quality autorater, but the model is instructed to evaluate only the Introduction and Related Work / Background sections of each paper. Same positional-bias mitigation: run twice, swap order.

Resources

  • references/citation-f1-prompt.md — verbatim P0/P1 partition prompt from App. F.3
  • references/litreview-quality-prompt.md — verbatim 6-axis litreview rubric from App. F.3
  • references/sxs-paper-quality-prompt.md — verbatim SxS paper-quality prompt from App. F.3
  • references/sxs-litreview-prompt.md — verbatim SxS litreview prompt from App. F.3
  • scripts/compute_f1.py — Precision / Recall / F1 from two partition JSONs