git clone https://github.com/dwzhu-pku/PaperBanana
T=$(mktemp -d) && git clone --depth=1 https://github.com/dwzhu-pku/PaperBanana "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skill" ~/.claude/skills/dwzhu-pku-paperbanana-paperbanana && rm -rf "$T"
skill/SKILL.md

PaperBanana
Generate publication-quality academic diagrams and pipeline figures from a paper's methodology section and figure caption. PaperBanana orchestrates a multi-agent pipeline (Retriever, Planner, Stylist, Visualizer, Critic) to produce camera-ready figures suitable for venues like NeurIPS, ICML, and ACL.
Environment Setup
cd <repo-root>
uv pip install -r requirements.txt
Set your API key via environment variable or in configs/model_config.yaml.
Option 1 (Recommended): OpenRouter API key — one key for both text reasoning and image generation:
export OPENROUTER_API_KEY="sk-or-v1-..."
Option 2: Google API key — direct access to Gemini API:
export GOOGLE_API_KEY="your-key-here"
If both keys are configured, OpenRouter is used by default.
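A minimal sketch for checking up front which provider a run would use, mirroring the precedence described above (OpenRouter wins when both keys are set). The `detect_provider` helper name is illustrative, not part of PaperBanana:

```shell
# Report which API provider would be selected, following the documented
# precedence: OPENROUTER_API_KEY first, then GOOGLE_API_KEY.
detect_provider() {
  if [ -n "${OPENROUTER_API_KEY:-}" ]; then
    echo "openrouter"
  elif [ -n "${GOOGLE_API_KEY:-}" ]; then
    echo "google"
  else
    return 1
  fi
}

# Example: fail fast before kicking off a long generation run.
detect_provider || echo "no API key configured" >&2
```

This avoids discovering a missing key only after the pipeline has started.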
Usage
python skill/run.py \
  --content "METHOD_TEXT" \
  --caption "FIGURE_CAPTION" \
  --task diagram \
  --output output.png
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| `--content` | Yes* | | Method section text to visualize |
| `--content-file` | Yes* | | Path to a file containing the method text (alternative to `--content`) |
| `--caption` | Yes | | Figure caption or visual intent |
| `--task` | No | | Task type |
| `--output` | No | | Output image file path |
| | No | | Aspect ratio |
| | No | 3 | Maximum critic refinement iterations |
| `--num-candidates` | No | 10 | Number of parallel candidates to generate |
| | No | | Retrieval mode |
| | No | | Main model for the VLM agents; provider auto-detected from the configured API key |
| | No | | Model for image generation |
| | No | | Pipeline variant: with or without the Stylist |
*One of `--content` or `--content-file` is required.
When `--num-candidates` > 1, output files are named `<stem>_0.png`, `<stem>_1.png`, etc.
Output
The absolute path of each saved image is printed to stdout, one per line.
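Because the paths are printed one per line, they are easy to consume in a shell loop. A minimal sketch using simulated output from a hypothetical `--num-candidates 3` run (the paths are made up):

```shell
# Simulated stdout from a run with --num-candidates 3; real runs print
# the absolute path of each saved image, one per line.
run_output="/tmp/figs/architecture_0.png
/tmp/figs/architecture_1.png
/tmp/figs/architecture_2.png"

# Iterate over the candidate images, e.g. to copy or inspect each one.
printf '%s\n' "$run_output" | while IFS= read -r p; do
  echo "candidate: $p"
done
```

In a real invocation, `run_output=$(python skill/run.py ...)` captures the same list directly.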
Examples
Diagram
python skill/run.py \
  --content "We propose a transformer-based encoder-decoder architecture. The encoder consists of 12 self-attention layers with residual connections. The decoder uses cross-attention to attend to encoder outputs and generates the target sequence autoregressively." \
  --caption "Figure 1: Overview of the proposed transformer architecture" \
  --task diagram \
  --output architecture.png
Important Notes
- Runtime: A single candidate typically takes 3-10 minutes depending on model and network conditions. With the default 10 candidates running in parallel, expect ~10-30 minutes total. Plan accordingly.
- API calls: Each candidate involves multiple LLM calls (Retriever + Planner + Stylist + Visualizer + up to 3 Critic rounds). Candidates run in parallel for efficiency.
- Image generation: The Visualizer agent calls an image generation model (Gemini Image) to render diagrams.
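Given the 10-30 minute wall time noted above, one option is to run PaperBanana in the background and log its output so the terminal session can be closed; a sketch, with placeholder file names:

```shell
# Run in the background with nohup; stdout (the saved image paths)
# and stderr both go to a log file.
nohup python skill/run.py \
  --content-file method.txt \
  --caption "Figure 1: Overview of the proposed method" \
  --task diagram \
  --output figs/fig1.png \
  > paperbanana.log 2>&1 &
echo "started, PID: $!"
# Later: tail -f paperbanana.log to watch progress.
```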
About
PaperBanana is based on the PaperVizAgent framework, a reference-driven multi-agent system for automated academic illustration. It was developed as part of the research paper:
PaperBanana: Automating Academic Illustration for AI Scientists
Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, Jinsung Yoon
arXiv:2601.23265
The framework introduces a collaborative team of five specialized agents — Retriever, Planner, Stylist, Visualizer, and Critic — to transform raw scientific content into publication-quality diagrams. Evaluation is conducted on the PaperBananaBench benchmark.