Medsci-skills fulltext-retrieval
Batch download open-access PDFs by DOI using legitimate OA APIs (Unpaywall, PMC, OpenAlex, Crossref). Optional PDF→Markdown conversion for token-efficient LLM analysis.
git clone https://github.com/Aperivue/medsci-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aperivue/medsci-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/fulltext-retrieval" ~/.claude/skills/aperivue-medsci-skills-fulltext-retrieval && rm -rf "$T"
skills/fulltext-retrieval/SKILL.md
Fulltext Retrieval Skill
Batch download open-access full-text PDFs from a DOI list using legitimate OA APIs only.
Pipeline
DOI list → Unpaywall → PMC (Europe PMC / OA FTP / web) → OpenAlex → Crossref → landing page
Each DOI goes through these sources in order until a valid PDF (≥10 KB, starting with the %PDF- header) is found.
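The ordered fallthrough and the validity check can be sketched as follows. This is a minimal illustration, not the code in fetch_oa.py: the function names, the `Fetcher` type, and the stand-in fetchers in the usage note are hypothetical. Only the two acceptance criteria (at least 10 KB, `%PDF-` header) come from the text above.

```python
from typing import Callable, Optional

# "≥10 KB" from the text; the exact byte threshold in fetch_oa.py is an assumption.
MIN_PDF_BYTES = 10 * 1024

# Hypothetical type: a fetcher takes a DOI and returns raw bytes, or None on failure.
Fetcher = Callable[[str], Optional[bytes]]


def is_valid_pdf(data: Optional[bytes]) -> bool:
    """Accept only payloads that are large enough and start with the PDF magic bytes."""
    if data is None:
        return False
    return len(data) >= MIN_PDF_BYTES and data.startswith(b"%PDF-")


def first_valid_pdf(
    doi: str, fetchers: list[tuple[str, Fetcher]]
) -> Optional[tuple[str, bytes]]:
    """Try each (name, fetcher) pair in order; return the first valid PDF and its source."""
    for name, fetch in fetchers:
        data = fetch(doi)
        if is_valid_pdf(data):
            return name, data
    return None  # caller would record the DOI as needing manual retrieval
```

In use, the list passed as `fetchers` would follow the pipeline order (Unpaywall, PMC, OpenAlex, Crossref, landing page). The size and header checks matter because a blocked request often returns an HTML error page with a 200 status, which would otherwise be saved under a .pdf name.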
Quick Start
    # Prepare a DOI list (one per line)
    cat > dois.txt << 'EOF'
    10.1007/s00330-010-1783-x
    10.1002/mp.12524
    10.1148/radiol.13131265
    EOF

    # Run
    python fetch_oa.py dois.txt --output pdfs/ --email your@email.com

    # Verbose mode for debugging
    python fetch_oa.py dois.txt -o pdfs/ -e your@email.com --verbose
Input Formats
Plain text (one DOI per line):

    10.1007/s00330-010-1783-x
    10.1002/mp.12524

TSV with header (must contain a DOI column; the PMID column is optional):

    ID	Title	DOI	PMID	Year
    1	Some paper	10.1007/s00330-010-1783-x	20628747	2010
When a PMID is available, the PMC lookup is more reliable (PMID → PMCID conversion).
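The two input formats can be told apart by checking the first line for a DOI column. The parser below is a sketch under that assumption; `Entry` and `parse_doi_input` are hypothetical names, and the case-sensitive match on the `DOI` and `PMID` headers is a guess about how fetch_oa.py behaves.

```python
import csv
import io
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    doi: str
    pmid: Optional[str]  # None when the input has no PMID for this row


def parse_doi_input(text: str) -> list[Entry]:
    """Parse either a plain DOI list or a TSV with a DOI column (PMID optional)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = [h.strip() for h in lines[0].split("\t")]
    if "DOI" not in header:
        # Plain text: every non-empty line is a DOI.
        return [Entry(ln.strip(), None) for ln in lines]
    # TSV: read rows by column name; QUOTE_NONE keeps quotes in titles literal.
    reader = csv.DictReader(
        io.StringIO("\n".join(lines)), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    entries = []
    for row in reader:
        doi = (row.get("DOI") or "").strip()
        if doi:
            pmid = (row.get("PMID") or "").strip() or None
            entries.append(Entry(doi, pmid))
    return entries
```

Carrying the PMID alongside the DOI lets a later PMC lookup use whichever identifier is available.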
PMC Download (JS-Challenge Resistant)
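PMC web pages may block automated downloads with JavaScript proof-of-work challenges. To avoid depending on them, this tool tries three methods in order: the Europe PMC REST API, the PMC OA FTP service, and the PMC web page itself. The first two can be reproduced by hand: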
Method A: Europe PMC REST API (most reliable)
    PMCID="PMC9733600"
    curl -sLo output.pdf \
      "https://europepmc.org/backend/ptpmcrender.fcgi?accid=${PMCID}&blobtype=pdf"
Method B: PMC OA FTP Service
    curl -s "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=${PMCID}" | \
      grep -oE 'href="[^"]*\.pdf"' | head -1 | \
      sed 's/href="//;s/"//' | xargs curl -sLo output.pdf
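The same link extraction can be sketched in Python with the standard library, which avoids depending on grep and sed. The function name is hypothetical, and `SAMPLE_OA_XML` is an illustrative response with placeholder paths, not real output. The assumption is that the OA service returns `link` elements carrying `format` and `href` attributes.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Illustrative response only: the hrefs are placeholder paths, not real files.
SAMPLE_OA_XML = """<OA>
  <records returned-count="1" total-count="1">
    <record id="PMC9733600">
      <link format="tgz" href="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/xx/yy/PMC9733600.tar.gz"/>
      <link format="pdf" href="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/xx/yy/example.PMC9733600.pdf"/>
    </record>
  </records>
</OA>"""


def pdf_href_from_oa_xml(xml_text: str) -> Optional[str]:
    """Return the first PDF link in an OA service response, or None if there is none."""
    root = ET.fromstring(xml_text)
    for link in root.iter("link"):
        href = link.get("href", "")
        # Match on the format attribute, falling back to the .pdf suffix
        # that the shell pipeline above relies on.
        if link.get("format") == "pdf" or href.endswith(".pdf"):
            return href
    return None
```

A response without any PDF link (for example, an article that is not in the OA subset, or one available only as a tgz package) yields `None`, so the caller can move on to the next method.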
DOI/PMID → PMCID Conversion
    # Works with both DOI and PMID
    curl -s "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=${DOI}&format=json" | \
      python3 -c "import sys,json; print(json.load(sys.stdin)['records'][0].get('pmcid',''))"
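The one-liner above raises an IndexError when the service returns no records. A slightly more defensive stdlib-only version might look like this. The function names are hypothetical; the endpoint and the `records[0].pmcid` response shape are taken from the command above.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Endpoint as used in the curl command above.
IDCONV_URL = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"


def build_idconv_url(identifier: str) -> str:
    """Build the query URL for a DOI or PMID, percent-encoding the slash in DOIs."""
    query = urllib.parse.urlencode({"ids": identifier, "format": "json"})
    return f"{IDCONV_URL}?{query}"


def pmcid_from_idconv(payload: dict) -> Optional[str]:
    """Extract the PMCID from a decoded response; None if absent or no records."""
    records = payload.get("records") or []
    if not records:
        return None
    return records[0].get("pmcid") or None


def lookup_pmcid(identifier: str, timeout: float = 30.0) -> Optional[str]:
    """Perform the network call; kept separate so the parsing can be tested offline."""
    with urllib.request.urlopen(build_idconv_url(identifier), timeout=timeout) as resp:
        return pmcid_from_idconv(json.load(resp))
```

Returning `None` rather than an empty string makes the "no PMC copy" case explicit, so the pipeline can skip the PMC methods for that DOI.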
Output
- PDFs saved as {DOI_safe}.pdf (slashes replaced with underscores)
- manual_needed.txt: DOIs that could not be retrieved via OA
- Summary with OA/PMC/fail/skip counts
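The file naming rule amounts to a single substitution. The helper below is a sketch; the function name and the default `pdfs` directory are assumptions, and fetch_oa.py may sanitize additional characters.

```python
from pathlib import Path


def doi_to_pdf_path(doi: str, output_dir: str = "pdfs") -> Path:
    """Map a DOI to its output path by replacing slashes with underscores."""
    safe = doi.strip().replace("/", "_")
    return Path(output_dir) / f"{safe}.pdf"
```

For example, the DOI 10.1002/mp.12524 maps to a file named 10.1002_mp.12524.pdf inside the output directory.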
Requirements
- Python 3.10+ (stdlib only, no pip dependencies)
- Contact email (required by Unpaywall Terms of Service)
API Policies
| Source | Rate Limit | Notes |
|---|---|---|
| Unpaywall | 100k req/day | Email required |
| NCBI PMC | 3 req/sec without API key | Add an API key for higher limits |
| OpenAlex | 100k req/day | Polite pool with email in User-Agent |
| Crossref | 50 req/sec with email | Plus service with contact email in UA |
| Europe PMC | No documented limit | Be polite, ≤1 req/sec recommended |
The script uses 0.3–0.5 second delays between requests.
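A delay of this kind, combined with a contact email in the User-Agent header, could be implemented roughly as below. This is an illustrative sketch: the function names and the User-Agent string are made up, and the assumption that the delay is drawn at random from the 0.3 to 0.5 second range is mine, not confirmed by the script.

```python
import random
import time
import urllib.request
from typing import Callable, Iterable, Iterator


def polite_delay_seconds(low: float = 0.3, high: float = 0.5) -> float:
    """Pick a delay in the 0.3-0.5 s range quoted above (random choice is assumed)."""
    return random.uniform(low, high)


def build_request(url: str, email: str) -> urllib.request.Request:
    """Attach a contact email to the User-Agent; the agent name is a placeholder."""
    user_agent = f"fulltext-retrieval-sketch/0.1 (mailto:{email})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def throttled(
    urls: Iterable[str], email: str, sleep: Callable[[float], None] = time.sleep
) -> Iterator[urllib.request.Request]:
    """Yield prepared requests, sleeping between consecutive ones but not before the first."""
    for index, url in enumerate(urls):
        if index:
            sleep(polite_delay_seconds())
        yield build_request(url, email)
```

Passing `sleep` as a parameter keeps the throttle testable without real waiting. Each yielded request can then be opened with `urllib.request.urlopen`.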
PDF → Markdown Conversion (Optional)
After downloading PDFs, convert them to LLM-friendly Markdown for token-efficient repeated analysis. Uses pymupdf4llm — optimized for academic papers with two-column layout handling and table preservation.
Quick Start
    # Install (one-time)
    pip install pymupdf4llm

    # Convert all PDFs in a directory
    python pdf_to_md.py pdfs/

    # Convert with verbose output
    python pdf_to_md.py pdfs/ -v

    # Custom output directory
    python pdf_to_md.py pdfs/ -o markdown/

    # First 10 pages only (useful for long supplements)
    python pdf_to_md.py pdfs/ --pages 0-9

    # Overwrite existing conversions
    python pdf_to_md.py pdfs/ --force
Combined Workflow
    # Step 1: Download PDFs
    python fetch_oa.py dois.txt -o pdfs/ -e your@email.com

    # Step 2: Convert to Markdown (only successful downloads)
    python pdf_to_md.py pdfs/ -v
After conversion, .md files sit alongside .pdf files. Claude Code can then use Read for full content or Grep for targeted extraction, which is significantly more token-efficient than re-reading PDFs.
When to Convert
| Scenario | Recommendation |
|---|---|
| Screening/triage (read once) | Skip — read PDF directly |
| Data extraction from k≥5 studies | Convert — repeated reads save tokens |
| Meta-analysis full pipeline | Convert — papers referenced across multiple phases |
| Single paper deep review | Optional — marginal benefit |
Academic Paper Defaults
- Images: Skipped (saves tokens; figures referenced by caption text)
- Tables: lines_strict strategy (preserves grid-line tables accurately)
- Layout: Two-column academic layout handled automatically
- Headers/footers: Removed by pymupdf4llm
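These defaults correspond to a small set of arguments to `pymupdf4llm.to_markdown`. The sketch below shows one plausible mapping, not the actual contents of pdf_to_md.py: `parse_page_range` and `convert_pdf` are hypothetical names, and the zero-based inclusive reading of `--pages 0-9` is inferred from the "first 10 pages" example.

```python
from pathlib import Path
from typing import Optional


def parse_page_range(spec: Optional[str]) -> Optional[list[int]]:
    """Turn "0-9" into [0, ..., 9] (zero-based, inclusive); None means all pages."""
    if not spec:
        return None
    start, _, end = spec.partition("-")
    first = int(start)
    last = int(end) if end else first
    if first < 0 or last < first:
        raise ValueError(f"invalid page range: {spec!r}")
    return list(range(first, last + 1))


def convert_pdf(pdf_path: Path, pages_spec: Optional[str] = None, force: bool = False) -> Path:
    """Write a .md file next to the PDF, skipping existing output unless force is set."""
    import pymupdf4llm  # optional dependency, imported lazily

    md_path = pdf_path.with_suffix(".md")
    if md_path.exists() and not force:
        return md_path
    markdown = pymupdf4llm.to_markdown(
        str(pdf_path),
        pages=parse_page_range(pages_spec),  # list of zero-based page numbers, or None
        write_images=False,                  # images skipped, as listed above
        table_strategy="lines_strict",       # table strategy named above
    )
    md_path.write_text(markdown, encoding="utf-8")
    return md_path
```

Importing pymupdf4llm inside the function keeps the module importable on systems where the optional dependency is not installed.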
Dependency Note
pdf_to_md.py requires pymupdf4llm (AGPL-3.0). This is an optional dependency — fetch_oa.py remains stdlib-only with zero external dependencies. The AGPL license applies to pymupdf4llm itself, not to this skill.
Limitations
- Only retrieves open-access articles. Paywalled articles require institutional access.
- Landing page scraping may fail on publisher-specific JavaScript-heavy pages.
- Some recent articles may not yet be indexed by OA sources.
- PDF→Markdown quality depends on the PDF's text layer. Scanned-only PDFs may produce poor output.
Anti-Hallucination
- Never fabricate file paths, URLs, DOIs, or package names. Verify existence before recommending.
- Never invent journal metadata, impact factors, or submission policies without verification at the journal's website.
- If a tool, package, or resource does not exist or you are unsure, say so explicitly rather than guessing.