PaperOrchestra paper-writing-bench
Reverse-engineer raw materials (Sparse idea, Dense idea, experimental log) from an existing AI research paper to build a benchmark case for evaluating paper-writing pipelines. Replicates the PaperWritingBench dataset construction procedure from arXiv:2604.05018 §3 / App. C. TRIGGER when the user asks to "build a benchmark case from this paper", "reverse-engineer raw materials", or "evaluate my pipeline against PaperWritingBench".
git clone https://github.com/Ar9av/PaperOrchestra
T=$(mktemp -d) && git clone --depth=1 https://github.com/Ar9av/PaperOrchestra "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/paper-writing-bench" ~/.claude/skills/ar9av-paperorchestra-paper-writing-bench && rm -rf "$T"
skills/paper-writing-bench/SKILL.md

PaperWritingBench (§3)
Faithful implementation of the PaperWritingBench dataset construction procedure from PaperOrchestra (Song et al., 2026, arXiv:2604.05018, §3 and App. C, F.2).
The original benchmark contains 200 papers (100 CVPR 2025 + 100 ICLR 2025). For each paper, the authors reverse-engineer the (I, E) tuple by stripping narrative flow from the original PDF using the three prompts in App. F.2. You can use this skill to reverse-engineer your own benchmark cases from any paper PDF.
What this skill does
Given an existing AI research paper (PDF or markdown extract), produce:
- idea.md (Sparse variant) — high-level concept note, no math, no experimental results
- idea.md (Dense variant) — detailed technical proposal with LaTeX equations and variable definitions, but still no experimental results
- experimental_log.md — exhaustive raw experimental setup, numeric data, and qualitative observations, with all narrative references stripped
These three files give a complete (I, E) input pair (one idea variant plus the experimental log) for the paper-orchestra pipeline. You can then run the pipeline and compare its output to the original paper using paper-autoraters.
Inputs
- A paper PDF or extracted markdown text. The paper uses MinerU (Wang et al., 2024) for PDF→markdown extraction; you (the host agent) should use whatever PDF extractor your environment provides.
- For controlled experiments, you may also extract figures separately (PDFFigures 2.0 in the paper).
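If no extractor is available, a minimal plain-text fallback with PyMuPDF is one option. This is only a sketch: it loses tables and layout, so a structured extractor such as MinerU is preferable, and the input file name is a placeholder.

```python
# Minimal fallback: plain-text PDF extraction with PyMuPDF (pip install pymupdf).
# Loses tables and layout, so prefer a structured extractor when you have one.
import fitz  # PyMuPDF

def pdf_to_text(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    # Join per-page plain text; page order is preserved.
    return "\n\n".join(page.get_text() for page in doc)

paper_content = pdf_to_text("paper.pdf")  # placeholder filename
```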
Outputs
- bench/<paper_id>/idea_sparse.md — Sparse variant
- bench/<paper_id>/idea_dense.md — Dense variant
- bench/<paper_id>/experimental_log.md — Experimental log
Workflow
For each paper, run three independent LLM calls using the verbatim prompts below:
1. Sparse idea generation
Load references/sparse-idea-prompt.md. Pass the paper text (or markdown extract) as {paper_content}. The prompt instructs the model to:
- Stop extracting at empirical verification (no Experiments / Results / Comparisons)
- Use first-person future tense ("We propose to explore...")
- Avoid LaTeX math; describe components by function
- Anonymize authors and titles
Output: idea_sparse.md with the four sections (Problem Statement, Core Hypothesis, Proposed Methodology (high-level), Expected Contribution).
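A minimal sketch of how one of these calls might be scripted. The call_llm helper and my_model are placeholders for whatever model client your environment provides; the skill itself only fixes the prompt file and the output contract.

```python
from pathlib import Path

def generate_artifact(prompt_file, paper_content, out_file, call_llm):
    """One independent LLM call: fill the prompt template, run it, save the output.

    call_llm is a placeholder: any callable that takes a prompt string and
    returns the completion text.
    """
    template = Path(prompt_file).read_text()
    # str.replace (not str.format) so stray braces in the paper text are harmless.
    prompt = template.replace("{paper_content}", paper_content)
    Path(out_file).write_text(call_llm(prompt))

# Step 1, sparse variant (paths are illustrative):
# generate_artifact("references/sparse-idea-prompt.md", paper_content,
#                   "bench/example_paper/idea_sparse.md", call_llm=my_model)
```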
2. Dense idea generation
Load references/dense-idea-prompt.md. Same input. The prompt instructs the model to:
- Preserve mathematical formulations using LaTeX
- Define every variable used in equations
- Include specific architectural choices and dimensions
- Same exclusion zone (no experiments)
Output: idea_dense.md with the four sections (Problem Statement, Core Hypothesis, Proposed Methodology (detailed), Expected Contribution).
3. Experimental log generation
Load references/experimental-log-prompt.md. Same input. The prompt instructs the model to:
- Use past-tense persona ("We ran...", "The results were...")
- Strip all references to figure/table numbers
- Deconstruct tables into raw numeric data
- Log figure findings as factual observations
- Anonymize authors
Output: experimental_log.md with sections for Setup, Raw Numeric Data, and Qualitative Observations.
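Steps 2 and 3 follow the same pattern, so a driver can run all three calls in one pass. A sketch, reusing the placeholder generate_artifact helper and model client from the step-1 sketch and an illustrative paper_id:

```python
from pathlib import Path

# Placeholders: paper_content, my_model, and generate_artifact come from the
# earlier sketches; paper_id is whatever naming scheme you choose.
PROMPTS = {
    "idea_sparse.md": "references/sparse-idea-prompt.md",
    "idea_dense.md": "references/dense-idea-prompt.md",
    "experimental_log.md": "references/experimental-log-prompt.md",
}

paper_id = "example_paper"
out_dir = Path("bench") / paper_id
out_dir.mkdir(parents=True, exist_ok=True)

# Three independent calls: no output is fed back into another prompt.
for out_name, prompt_file in PROMPTS.items():
    generate_artifact(prompt_file, paper_content, str(out_dir / out_name),
                      call_llm=my_model)
```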
Critical rules from the prompts
These are excerpted from App. F.2. The host agent MUST honor them:
- No citations. None of the three outputs may contain \cite commands, reference numbers, or author names from the source paper.
- No URLs. Strip all hyperlinks.
- Anonymize. Author identities, affiliations, acknowledgements all removed.
- Self-contained. Each file must make sense without the original paper.
- No experimental leakage in idea files. The Sparse and Dense ideas must stop where empirical verification begins. They describe what will be done, not what was done.
- No table/figure references in experimental log. No "as shown in Table 1", "see Fig. 5". The downstream paper-orchestra pipeline will generate its own figures and tables — the log must not assume any particular ones exist.
- 100% numeric accuracy in experimental log. This becomes the ground truth for the hallucination checks run by the section-writing-agent and content-refinement-agent.
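Some of these rules can be spot-checked mechanically before you accept the outputs. A sketch using regex heuristics; the patterns are illustrative and a clean run is no substitute for reading the files.

```python
import re
from pathlib import Path

# Heuristic patterns only; a clean lint run is necessary, not sufficient.
CHECKS = {
    "citation command": re.compile(r"\\cite\w*\{"),
    "URL": re.compile(r"https?://"),
    "table/figure reference": re.compile(r"\b(Table|Tab\.|Figure|Fig\.)\s*\d"),
}

def lint_artifact(path: str) -> list[str]:
    """Return human-readable rule violations found in one generated file."""
    text = Path(path).read_text()
    return [f"{name}: {match.group(0)!r}"
            for name, pattern in CHECKS.items()
            for match in pattern.finditer(text)]

# for name in ("idea_sparse.md", "idea_dense.md", "experimental_log.md"):
#     print(name, lint_artifact(f"bench/example_paper/{name}"))
```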
How the bench is used
After producing (idea_sparse.md, idea_dense.md, experimental_log.md) for a paper:
- Pick a variant (Sparse or Dense) — the paper ablates both, with Dense producing more rigorous methodology and Sparse exercising the system's robustness on under-specified inputs.
- Drop the chosen idea.md, plus experimental_log.md, plus a template.tex for the target conference, plus a conference_guidelines.md, into a paper-orchestra workspace.
- Run the pipeline.
- Compare the generated paper against the original using paper-autoraters (citation F1, lit review quality, SxS paper quality).
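A sketch of assembling the workspace from a bench case. The workspace path and layout here are assumptions; check the paper-orchestra skill for the exact file names it expects.

```python
import shutil
from pathlib import Path

bench_dir = Path("bench/example_paper")      # illustrative paper id
workspace = Path("workspace/example_paper")  # assumed workspace location
workspace.mkdir(parents=True, exist_ok=True)

# Dense variant chosen here; copy idea_sparse.md instead to exercise the
# pipeline on under-specified inputs.
shutil.copy(bench_dir / "idea_dense.md", workspace / "idea.md")
shutil.copy(bench_dir / "experimental_log.md", workspace / "experimental_log.md")
# template.tex and conference_guidelines.md describe the target venue rather
# than the bench case, so copy them in from wherever you keep them.
```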
Resources
- references/bench-overview.md — the 200-paper bench, venue cutoffs, sizes
- references/sparse-idea-prompt.md — verbatim from App. F.2
- references/dense-idea-prompt.md — verbatim from App. F.2
- references/experimental-log-prompt.md — verbatim from App. F.2