PaperOrchestra paper-writing-bench
Reverse-engineer raw materials (Sparse idea, Dense idea, experimental log) from an existing AI research paper to build a benchmark case for evaluating paper-writing pipelines. Replicates the PaperWritingBench dataset construction procedure from arXiv:2604.05018 §3 / App. C. TRIGGER when the user asks to "build a benchmark case from this paper", "reverse-engineer raw materials", or "evaluate my pipeline against PaperWritingBench".
git clone https://github.com/Ar9av/PaperOrchestra
T=$(mktemp -d) && git clone --depth=1 https://github.com/Ar9av/PaperOrchestra "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/paper-writing-bench" ~/.claude/skills/ar9av-paperorchestra-paper-writing-bench && rm -rf "$T"
skills/paper-writing-bench/SKILL.md

PaperWritingBench (§3)
Faithful implementation of the PaperWritingBench dataset construction procedure from PaperOrchestra (Song et al., 2026, arXiv:2604.05018, §3 and App. C, F.2).
The original benchmark contains 200 papers (100 CVPR 2025 + 100 ICLR 2025). For each paper, the authors reverse-engineer the (I, E) tuple by stripping narrative flow from the original PDF using the three prompts in App. F.2. You can use this skill to reverse-engineer your own benchmark cases from any paper PDF.
What this skill does
Given an existing AI research paper (PDF or markdown extract), produce:
- idea.md (Sparse variant) — high-level concept note, no math, no experimental results
- idea.md (Dense variant) — detailed technical proposal with LaTeX equations and variable definitions, but still no experimental results
- experimental_log.md — exhaustive raw experimental setup, numeric data, and qualitative observations, with all narrative references stripped
These three files give a complete (I, E) input pair (one idea variant plus the experimental log) for the paper-orchestra pipeline. You can then run the pipeline and compare its output to the original paper using paper-autoraters.
Inputs
- A paper PDF or extracted markdown text. The paper uses MinerU (Wang et al., 2024) for PDF→markdown extraction; you (the host agent) should use whatever PDF extractor your environment provides.
- For controlled experiments, you may also extract figures separately (PDFFigures 2.0 in the paper).
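If no extractor is available, a minimal plain-text fallback with PyMuPDF is one option. This is only a sketch: it loses tables and layout, so a structured extractor such as MinerU is preferable, and the input file name is a placeholder.

```python
# Minimal fallback: plain-text PDF extraction with PyMuPDF (pip install pymupdf).
# Loses tables and layout, so prefer a structured extractor when you have one.
import fitz  # PyMuPDF

def pdf_to_text(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    # Join per-page plain text; page order is preserved.
    return "\n\n".join(page.get_text() for page in doc)

paper_content = pdf_to_text("paper.pdf")  # placeholder filename
```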
Outputs
- bench/<paper_id>/idea_sparse.md — Sparse variant
- bench/<paper_id>/idea_dense.md — Dense variant
- bench/<paper_id>/experimental_log.md — Experimental log
Workflow
For each paper, run three independent LLM calls using the verbatim prompts below:
1. Sparse idea generation
Load references/sparse-idea-prompt.md. Pass the paper text (or markdown extract) as {paper_content}. The prompt instructs the model to:
- Stop extracting at empirical verification (no Experiments / Results / Comparisons)
- Use first-person future tense ("We propose to explore...")
- Avoid LaTeX math; describe components by function
- Anonymize authors and titles
Output: idea_sparse.md with the four sections (Problem Statement, Core Hypothesis, Proposed Methodology (high-level), Expected Contribution).
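A minimal sketch of how one of these calls might be scripted. The call_llm helper and my_model are placeholders for whatever model client your environment provides; the skill itself only fixes the prompt file and the output contract.

```python
from pathlib import Path

def generate_artifact(prompt_file, paper_content, out_file, call_llm):
    """One independent LLM call: fill the prompt template, run it, save the output.

    call_llm is a placeholder: any callable that takes a prompt string and
    returns the completion text.
    """
    template = Path(prompt_file).read_text()
    # str.replace (not str.format) so stray braces in the paper text are harmless.
    prompt = template.replace("{paper_content}", paper_content)
    Path(out_file).write_text(call_llm(prompt))

# Step 1, sparse variant (paths are illustrative):
# generate_artifact("references/sparse-idea-prompt.md", paper_content,
#                   "bench/example_paper/idea_sparse.md", call_llm=my_model)
```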
2. Dense idea generation
Load references/dense-idea-prompt.md. Same input. The prompt instructs the model to:
- Preserve mathematical formulations using LaTeX
- Define every variable used in equations
- Include specific architectural choices and dimensions
- Same exclusion zone (no experiments)
Output: idea_dense.md with the four sections (Problem Statement, Core Hypothesis, Proposed Methodology (detailed), Expected Contribution).
3. Experimental log generation
Load references/experimental-log-prompt.md. Same input. The prompt instructs the model to:
- Use past-tense persona ("We ran...", "The results were...")
- Strip all references to figure/table numbers
- Deconstruct tables into raw numeric data
- Log figure findings as factual observations
- Anonymize authors
Output: experimental_log.md with sections for Setup, Raw Numeric Data, and Qualitative Observations.
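Steps 2 and 3 follow the same pattern, so a driver can run all three calls in one pass. A sketch, reusing the placeholder generate_artifact helper and model client from the step-1 sketch and an illustrative paper_id:

```python
from pathlib import Path

# Placeholders: paper_content, my_model, and generate_artifact come from the
# earlier sketches; paper_id is whatever naming scheme you choose.
PROMPTS = {
    "idea_sparse.md": "references/sparse-idea-prompt.md",
    "idea_dense.md": "references/dense-idea-prompt.md",
    "experimental_log.md": "references/experimental-log-prompt.md",
}

paper_id = "example_paper"
out_dir = Path("bench") / paper_id
out_dir.mkdir(parents=True, exist_ok=True)

# Three independent calls: no output is fed back into another prompt.
for out_name, prompt_file in PROMPTS.items():
    generate_artifact(prompt_file, paper_content, str(out_dir / out_name),
                      call_llm=my_model)
```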
Critical rules from the prompts
These are excerpted from App. F.2. The host agent MUST honor them:
- No citations. None of the three outputs may contain \cite commands, reference numbers, or author names from the source paper.
- No URLs. Strip all hyperlinks.
- Anonymize. Author identities, affiliations, acknowledgements all removed.
- Self-contained. Each file must make sense without the original paper.
- No experimental leakage in idea files. The Sparse and Dense ideas must stop where empirical verification begins. They describe what will be done, not what was done.
- No table/figure references in experimental log. No "as shown in Table 1", "see Fig. 5". The downstream paper-orchestra pipeline will generate its own figures and tables — the log must not assume any particular ones exist.
- 100% numeric accuracy in experimental log. This becomes the ground truth for the hallucination checks run by the section-writing-agent and content-refinement-agent.
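Some of these rules can be spot-checked mechanically before you accept the outputs. A sketch using regex heuristics; the patterns are illustrative and a clean run is no substitute for reading the files.

```python
import re
from pathlib import Path

# Heuristic patterns only; a clean lint run is necessary, not sufficient.
CHECKS = {
    "citation command": re.compile(r"\\cite\w*\{"),
    "URL": re.compile(r"https?://"),
    "table/figure reference": re.compile(r"\b(Table|Tab\.|Figure|Fig\.)\s*\d"),
}

def lint_artifact(path: str) -> list[str]:
    """Return human-readable rule violations found in one generated file."""
    text = Path(path).read_text()
    return [f"{name}: {match.group(0)!r}"
            for name, pattern in CHECKS.items()
            for match in pattern.finditer(text)]

# for name in ("idea_sparse.md", "idea_dense.md", "experimental_log.md"):
#     print(name, lint_artifact(f"bench/example_paper/{name}"))
```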
How the bench is used
After producing (idea_sparse.md, idea_dense.md, experimental_log.md) for a paper:
- Pick a variant (Sparse or Dense) — the paper ablates both, with Dense producing more rigorous methodology and Sparse exercising the system's robustness on under-specified inputs.
- Drop the chosen idea.md, plus experimental_log.md, plus a template.tex for the target conference, plus a conference_guidelines.md, into a paper-orchestra workspace.
- Run the pipeline.
- Compare the generated paper against the original using paper-autoraters (citation F1, lit review quality, SxS paper quality).
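A sketch of assembling the workspace from a bench case. The workspace path and layout here are assumptions; check the paper-orchestra skill for the exact file names it expects.

```python
import shutil
from pathlib import Path

bench_dir = Path("bench/example_paper")      # illustrative paper id
workspace = Path("workspace/example_paper")  # assumed workspace location
workspace.mkdir(parents=True, exist_ok=True)

# Dense variant chosen here; copy idea_sparse.md instead to exercise the
# pipeline on under-specified inputs.
shutil.copy(bench_dir / "idea_dense.md", workspace / "idea.md")
shutil.copy(bench_dir / "experimental_log.md", workspace / "experimental_log.md")
# template.tex and conference_guidelines.md describe the target venue rather
# than the bench case, so copy them in from wherever you keep them.
```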
Resources
- references/bench-overview.md — the 200-paper bench, venue cutoffs, sizes
- references/sparse-idea-prompt.md — verbatim from App. F.2
- references/dense-idea-prompt.md — verbatim from App. F.2
- references/experimental-log-prompt.md — verbatim from App. F.2