Learn-skills.dev paper-fetch

Use when the user wants to download a paper PDF from a DOI, title, or URL via legal open-access sources. Tries Unpaywall, Semantic Scholar, arXiv, PubMed Central, and bioRxiv/medRxiv in order. Never uses Sci-Hub or any other paywall bypass.

Install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agents365-ai/paper-fetch/paper-fetch" ~/.claude/skills/neversight-learn-skills-dev-paper-fetch && rm -rf "$T"
manifest: data/skills-md/agents365-ai/paper-fetch/paper-fetch/SKILL.md

paper-fetch

Fetch the legal open-access PDF for a paper given a DOI (or title). Tries multiple OA sources in priority order and stops at the first hit.

Resolution order

  1. Unpaywall: https://api.unpaywall.org/v2/{doi}?email=$UNPAYWALL_EMAIL, read best_oa_location.url_for_pdf (skipped if UNPAYWALL_EMAIL is not set)
  2. Semantic Scholar: https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=openAccessPdf,externalIds
  3. arXiv: if externalIds.ArXiv is present, https://arxiv.org/pdf/{arxiv_id}.pdf
  4. PubMed Central OA: if a PMCID is present, https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/
  5. bioRxiv / medRxiv: if the DOI prefix is 10.1101, query https://api.biorxiv.org/details/{server}/{doi} for the latest version's PDF URL
  6. Otherwise: report failure with title/authors so the user can request via ILL
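
The priority order above can be sketched as a pure function that builds the lookup URLs without touching the network. This is an illustration only, not the code in fetch.py: the function name plan_lookups, the source labels arxiv, pmc, biorxiv and medrxiv, and the choice to queue both preprint servers for a 10.1101 DOI are assumptions. The arXiv ID and PMCID are taken as inputs because they only become known after an earlier lookup returns them.

```python
from typing import List, Optional, Tuple
from urllib.parse import quote, urlencode


def plan_lookups(
    doi: str,
    email: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    pmcid: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Return (source, url) pairs in the documented priority order."""
    doi_path = quote(doi, safe="/")  # keep the slash between prefix and suffix
    plan: List[Tuple[str, str]] = []
    if email:  # step 1 is skipped when UNPAYWALL_EMAIL is unset
        query = urlencode({"email": email})
        plan.append(("unpaywall", f"https://api.unpaywall.org/v2/{doi_path}?{query}"))
    plan.append((
        "semantic_scholar",
        "https://api.semanticscholar.org/graph/v1/paper/"
        f"DOI:{doi_path}?fields=openAccessPdf,externalIds",
    ))
    if arxiv_id:
        plan.append(("arxiv", f"https://arxiv.org/pdf/{arxiv_id}.pdf"))
    if pmcid:
        plan.append(("pmc", f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/"))
    if doi.startswith("10.1101/"):
        # The prefix alone does not say which server hosts the preprint,
        # so this sketch queues both (an assumption, not documented behavior).
        for server in ("biorxiv", "medrxiv"):
            plan.append((server, f"https://api.biorxiv.org/details/{server}/{doi_path}"))
    return plan
```

A caller would walk the returned list in order and stop at the first entry that yields a PDF; an empty result after the last entry corresponds to step 6.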

If only a title is given, resolve it to a DOI first via Semantic Scholar search_paper_by_title (asta MCP) or Crossref.
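
For the Crossref path, a minimal sketch of the title lookup might look like the following. It assumes the public Crossref REST endpoint /works with the query.bibliographic and rows parameters, and a response shaped as message.items[].DOI; the helper names crossref_title_query and first_doi are hypothetical, and the network call itself is left out.

```python
from typing import Optional
from urllib.parse import urlencode


def crossref_title_query(title: str, rows: int = 1) -> str:
    """Build a Crossref search URL for a free-text title."""
    params = {"query.bibliographic": title, "rows": rows}
    return "https://api.crossref.org/works?" + urlencode(params)


def first_doi(response: dict) -> Optional[str]:
    """Pull the DOI of the top hit out of a decoded Crossref response."""
    items = response.get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None
```

The top hit is not guaranteed to be the intended paper, so comparing the returned title against the query before fetching is a sensible extra step.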

Usage

python scripts/fetch.py (<DOI> | --batch FILE) [--out DIR] [--dry-run] [--format json|text]

Flags

| Flag | Default | Description |
|---|---|---|
| doi | | DOI to fetch (positional, e.g. 10.1038/s41586-020-2649-2) |
| --batch FILE | | File with one DOI per line for bulk download |
| --out DIR | pdfs | Output directory |
| --dry-run | off | Resolve sources without downloading; preview the PDF URL and filename |
| --format | json | Output format: json (for agents) or text (for humans) |

Output contract

stdout emits a single JSON object (when --format json):

Success (all DOIs resolved):

{
  "ok": true,
  "data": {
    "results": [
      {
        "doi": "10.1038/s41586-020-2649-2",
        "success": true,
        "source": "unpaywall",
        "pdf_url": "https://...",
        "file": "pdfs/Author_2020_Title.pdf",
        "meta": {"title": "...", "year": 2020, "author": "Smith"}
      }
    ],
    "summary": {"total": 1, "succeeded": 1, "failed": 0}
  }
}

Partial failure (batch mode — some DOIs failed, exit code 1):

{
  "ok": true,
  "data": {
    "results": [
      {
        "doi": "10.1038/s41586-020-2649-2",
        "success": true,
        "source": "semantic_scholar",
        "pdf_url": "https://...",
        "file": "pdfs/Harris_2020_Array_programming_with_NumPy.pdf",
        "meta": {"title": "Array programming with NumPy", "year": 2020, "author": "Charles R. Harris"}
      },
      {
        "doi": "10.1234/nonexistent",
        "success": false,
        "source": null,
        "pdf_url": null,
        "file": null,
        "meta": {},
        "error": {"code": "not_found", "message": "No open-access PDF found", "retryable": false}
      }
    ],
    "summary": {"total": 2, "succeeded": 1, "failed": 1}
  }
}

Top-level failure (bad arguments, exit code 3):

{
  "ok": false,
  "error": {
    "code": "validation_error",
    "message": "Provide a DOI or --batch file",
    "retryable": false
  }
}

stderr carries human-readable progress diagnostics (source attempts, download status).
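
An agent consuming this contract could sort the results into downloaded files, DOIs worth retrying, and DOIs to give up on. The sketch below relies only on the fields shown in the samples above; the function name summarize and the three bucket names are hypothetical.

```python
import json
from typing import Dict, List


def summarize(stdout_text: str) -> Dict[str, List[str]]:
    """Split a --format json payload into files, retryable DOIs, and dead ends."""
    payload = json.loads(stdout_text)
    if not payload["ok"]:
        # Top-level failure: there is no per-DOI result list to inspect.
        err = payload["error"]
        raise RuntimeError(f"{err['code']}: {err['message']}")
    buckets: Dict[str, List[str]] = {"files": [], "retry": [], "give_up": []}
    for result in payload["data"]["results"]:
        if result["success"]:
            buckets["files"].append(result["file"])
        elif result["error"]["retryable"]:
            buckets["retry"].append(result["doi"])
        else:
            buckets["give_up"].append(result["doi"])
    return buckets
```

Note that a partial failure still has "ok": true, so the per-result success flag, not the top-level one, decides which bucket a DOI lands in.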

Exit codes

| Code | Meaning |
|---|---|
| 0 | All DOIs resolved successfully |
| 1 | Runtime error (some DOIs failed, network/download issues) |
| 3 | Validation error (bad arguments, missing input) |
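
A caller that shells out to the script can map the exit code to one of the three documented outcomes before parsing stdout. This is a sketch under assumptions: the labels in EXIT_MEANINGS and the function run_and_classify are hypothetical, and the command line is passed in by the caller rather than hard-coded.

```python
import subprocess
from typing import List, Tuple

# Labels are illustrative; only the numeric codes come from the table above.
EXIT_MEANINGS = {0: "all_resolved", 1: "runtime_error", 3: "validation_error"}


def run_and_classify(argv: List[str]) -> Tuple[str, str]:
    """Run a command and return (status label, captured stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    status = EXIT_MEANINGS.get(proc.returncode, "unknown")
    return status, proc.stdout
```

For example, run_and_classify(["python", "scripts/fetch.py", "--batch", "dois.txt"]) would return runtime_error on a partial failure, and stdout would still hold the JSON object with per-DOI results.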

Error codes in JSON

| Code | Meaning | Retryable |
|---|---|---|
| validation_error | Bad arguments or empty input | No |
| not_found | No open-access PDF found | No |
| download_network_error | Network failure during download | Yes |
| download_not_a_pdf | Response was not a PDF (HTML landing page) | No |
| download_host_not_allowed | PDF URL host not in allowlist | No |
| download_size_exceeded | Response exceeded 50 MB limit | No |
| download_io_error | Local filesystem write failed | No |
| internal_error | Unexpected error | No |
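
Three of the download_* codes correspond to checks that can be made on a URL and the bytes it returned. The sketch below shows one way such checks could be ordered; it is not the script's implementation. The hosts in ALLOWED_HOSTS are placeholders (the real allowlist is not reproduced here), the 50 MB limit is assumed to mean 50 × 1024 × 1024 bytes, and the PDF test simply looks for the standard %PDF- header.

```python
from typing import Optional
from urllib.parse import urlparse

MAX_BYTES = 50 * 1024 * 1024  # assumes "50 MB" means binary megabytes
# Placeholder hosts for illustration; not the skill's actual allowlist.
ALLOWED_HOSTS = {"arxiv.org", "www.ncbi.nlm.nih.gov", "www.biorxiv.org", "www.medrxiv.org"}


def check_download(url: str, body: bytes, max_bytes: int = MAX_BYTES) -> Optional[str]:
    """Return a download_* error code, or None if the response looks acceptable."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_HOSTS:
        return "download_host_not_allowed"
    if len(body) > max_bytes:
        return "download_size_exceeded"
    if not body.startswith(b"%PDF-"):
        # Typically an HTML landing page served in place of the file.
        return "download_not_a_pdf"
    return None
```

A real downloader would normally reject the host before making the request and enforce the size limit while streaming, rather than after buffering the whole body.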

Examples

# Single DOI (JSON output for agents)
python scripts/fetch.py 10.1038/s41586-020-2649-2

# Dry-run preview (resolve without downloading)
python scripts/fetch.py 10.1038/s41586-020-2649-2 --dry-run

# Human-readable output
python scripts/fetch.py 10.1038/s41586-020-2649-2 --format text

# Batch download
python scripts/fetch.py --batch dois.txt --out ./papers

# Works without UNPAYWALL_EMAIL (skips Unpaywall, uses remaining 4 sources)
python scripts/fetch.py 10.1038/s41586-020-2649-2

Notes

  • UNPAYWALL_EMAIL is optional but recommended. Set it once: export UNPAYWALL_EMAIL=you@example.com (e.g. in ~/.zshrc). Without it, Unpaywall is skipped and the remaining 4 sources are still tried.
  • Downloads are restricted to a host allowlist of known OA providers, with a 50 MB size limit per PDF.
  • Never attempts to bypass paywalls. If no OA copy exists, the skill reports failure; do not suggest Sci-Hub or similar.
  • Default output directory: ./pdfs/. Filenames: {first_author}_{year}_{short_title}.pdf.
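
The filename pattern can be illustrated with a small helper. This is a guess at the scheme that reproduces the sample name in the output contract (Harris_2020_Array_programming_with_NumPy.pdf); the function name build_filename, the eight-word cap on the title, and the fallbacks for missing metadata are all assumptions.

```python
import re


def build_filename(first_author: str, year: int, title: str, max_words: int = 8) -> str:
    """Compose {first_author}_{year}_{short_title}.pdf from paper metadata."""
    parts = first_author.split()
    # Use the last whitespace-separated token as the surname, minus punctuation.
    surname = re.sub(r"\W+", "", parts[-1]) if parts else ""
    # Keep only word characters from the title and cap its length.
    words = re.findall(r"\w+", title)[:max_words]
    short_title = "_".join(words) or "untitled"
    return f"{surname or 'Unknown'}_{year}_{short_title}.pdf"
```

Stripping everything except word characters also keeps path separators and shell metacharacters out of the name.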

Auto-update

When installed via git clone, the skill keeps itself in sync with upstream automatically. On each invocation, fetch.py spawns a detached background git pull --ff-only in the skill directory:

  • Non-blocking: the current invocation is not delayed; the pull runs in a new session and is fully detached
  • Silent: all output goes to /dev/null, so the JSON contract on stdout is never polluted
  • Throttled: at most once every 24 hours by default (stamped via .git/.paper-fetch-last-update)
  • Safe: --ff-only only fast-forwards; if local commits have diverged from upstream the pull aborts, so merge conflicts never happen
  • Convergence: updates apply on the next invocation, not the current one (because the pull is backgrounded)

Environment variables

| Variable | Default | Purpose |
|---|---|---|
| PAPER_FETCH_NO_AUTO_UPDATE | unset | Set to any value to completely disable auto-update |
| PAPER_FETCH_UPDATE_INTERVAL | 86400 | Cooldown in seconds between update attempts |

Auto-update is a no-op when the skill is not a git checkout (e.g. tarball install), when git is unavailable, or when the cooldown stamp is fresh. Force an immediate check with rm <skill_dir>/.git/.paper-fetch-last-update.
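
The throttle-and-detach behavior described in this section could be approximated as follows. This is a sketch, not the code in fetch.py: the function names should_update and spawn_update are hypothetical, and touching the stamp before spawning the pull (so that failed pulls are throttled too) is an assumption. The start_new_session flag detaches the child on POSIX systems only.

```python
import shutil
import subprocess
import time
from pathlib import Path
from typing import Mapping, Optional

STAMP_NAME = ".paper-fetch-last-update"


def should_update(skill_dir: Path, env: Mapping[str, str], now: Optional[float] = None) -> bool:
    """Decide whether a background pull is due, per the rules above."""
    if "PAPER_FETCH_NO_AUTO_UPDATE" in env:  # any value disables auto-update
        return False
    git_dir = skill_dir / ".git"
    if not git_dir.is_dir():  # e.g. tarball install
        return False
    interval = float(env.get("PAPER_FETCH_UPDATE_INTERVAL", "86400"))
    stamp = git_dir / STAMP_NAME
    now = time.time() if now is None else now
    return not stamp.exists() or now - stamp.stat().st_mtime >= interval


def spawn_update(skill_dir: Path) -> None:
    """Start a detached, silent fast-forward pull and return immediately."""
    if shutil.which("git") is None:  # git unavailable: do nothing
        return
    (skill_dir / ".git" / STAMP_NAME).touch()  # record the attempt time
    subprocess.Popen(
        ["git", "-C", str(skill_dir), "pull", "--ff-only"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,  # keep the JSON contract on stdout clean
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # new session so the pull outlives the caller
    )
```

A caller would run should_update(skill_dir, os.environ) at startup and call spawn_update(skill_dir) only when it returns True; deleting the stamp file makes the next check succeed immediately.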