Awesome-Agent-Skills-for-Empirical-Research pmc-ftp-bulk-download
Bulk download PMC Open Access articles via FTP for large-scale mining
install

source · Clone the upstream repo

```bash
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
```

Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/literature/fulltext/pmc-ftp-bulk-download" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-pmc-ftp-bulk-down && rm -rf "$T"
```
manifest:
skills/43-wentorai-research-plugins/skills/literature/fulltext/pmc-ftp-bulk-download/SKILL.md
# PMC FTP Bulk Download

## Overview
The PMC FTP Service provides bulk download access to millions of full-text articles from PubMed Central's Open Access Subset. Unlike the single-article APIs (E-utilities, BioC), the FTP service is designed for large-scale corpus construction — downloading entire collections for text mining, NLP training, systematic reviews, and bibliometric analysis. Free, no authentication required.
Note: PMC is migrating to AWS-based Cloud Service in August 2026. FTP paths may change; check official docs for updates.
## FTP Access Points

### Connection

```bash
# FTP (classic)
ftp ftp.ncbi.nlm.nih.gov
# Navigate to: /pub/pmc

# HTTPS alternative (recommended)
# Base: https://ftp.ncbi.nlm.nih.gov/pub/pmc/
```
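For scripted access, the same anonymous FTP endpoint can be browsed with Python's standard-library `ftplib`; a minimal sketch (the listing is truncated to 20 entries just to keep output short):

```python
# Minimal sketch: browse the PMC tree over anonymous FTP with the
# standard-library ftplib (no credentials required).
from ftplib import FTP

with FTP("ftp.ncbi.nlm.nih.gov") as ftp:
    ftp.login()                   # anonymous login
    ftp.cwd("/pub/pmc")
    for name in ftp.nlst()[:20]:  # first 20 entries only
        print(name)
```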
### Available Datasets

Paths below are relative to `/pub/pmc/`; verify against the server before a bulk run, since the layout may shift with the cloud migration noted above.

| Dataset | Path | Content | Format |
|---|---|---|---|
| OA Commercial | `oa_bulk/oa_comm/` | CC BY/CC0 articles (commercial use OK) | .tar.gz packages |
| OA Non-Commercial | `oa_bulk/oa_noncomm/` | CC BY-NC articles | .tar.gz packages |
| OA Other | `oa_bulk/oa_other/` | Other open licenses | .tar.gz packages |
| Author Manuscripts | `manuscript/` | NIH-funded manuscripts | .tar.gz packages |
| Historical OCR | `historical_ocr/` | Pre-digital scanned articles | .tar.gz |
| File lists | `oa_file_list.csv` | Index of all OA articles | CSV |
## File List Index

Download the master index to plan your downloads:

```bash
# Download the OA file list (CSV, ~200MB)
wget https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv

# CSV columns:
# File, Article Citation, AccessionID, LastUpdated, PMID, License
```
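Since the index is a plain CSV, you can inspect it with the standard library before committing to a bulk download. A sketch that tallies articles per license, using the `License` column named above:

```python
# Sketch: count articles per license in the OA file list
import csv
from collections import Counter

counts = Counter()
with open("oa_file_list.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[row["License"]] += 1

for license_name, n in counts.most_common():
    print(f"{license_name}: {n}")
```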
## Download Strategies

### Strategy 1: Download Specific Articles

```python
import csv
import io
import tarfile

import requests

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc"


def load_file_index(file_list_path: str) -> dict:
    """Index oa_file_list.csv by PMCID (load once, reuse for all lookups)."""
    with open(file_list_path, newline="") as f:
        # "AccessionID" and "File" are columns of oa_file_list.csv
        return {row["AccessionID"]: row["File"] for row in csv.DictReader(f)}


def download_article_package(pmcid: str, index: dict, base_url: str = BASE_URL):
    """Download and extract a specific PMC article package."""
    # Look up the package path, e.g. "oa_package/08/e0/PMC13900.tar.gz"
    file_path = index[pmcid]

    # Download the tar.gz package
    resp = requests.get(f"{base_url}/{file_path}")
    resp.raise_for_status()

    # Extract into a per-article directory
    with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:gz") as tar:
        tar.extractall(path=f"./articles/{pmcid}")
    print(f"Extracted {pmcid}")
```
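A usage sketch, assuming the file list has already been downloaded as above (the PMCID here is purely illustrative):

```python
index = load_file_index("oa_file_list.csv")  # load once, reuse for many downloads
download_article_package("PMC13900", index)  # illustrative PMCID
```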
### Strategy 2: Bulk Download by License

```bash
#!/bin/bash
# Download all commercial-use articles (CC BY / CC0)
# WARNING: This is ~100GB+ compressed
mkdir -p pmc_corpus/commercial
cd pmc_corpus/commercial

# Download the baseline (all current articles); the bulk packages
# live under oa_bulk/ -- verify the path if the layout has changed
wget -r -np -nH --cut-dirs=4 \
  https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/

# Incremental updates (run periodically); -N fetches only newer files
wget -r -np -nH --cut-dirs=4 -N \
  https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/oa_comm/xml/
```
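The baseline arrives as a set of multi-article `.tar.gz` archives. A sketch for unpacking everything that landed in the current directory (the `extracted/` output folder is an arbitrary choice):

```python
# Sketch: extract every downloaded baseline archive into ./extracted/
import tarfile
from pathlib import Path

out = Path("extracted")
out.mkdir(exist_ok=True)
for archive in sorted(Path(".").glob("*.tar.gz")):
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=out)
    print(f"Unpacked {archive.name}")
```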
### Strategy 3: Filtered Download via File List

```python
import csv
import time
from pathlib import Path

import requests


def download_filtered_corpus(file_list_path: str, output_dir: str,
                             license_filter: str = "CC BY",
                             max_articles: int = 1000):
    """Download articles whose License column matches a filter."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    base = "https://ftp.ncbi.nlm.nih.gov/pub/pmc"

    downloaded = 0
    with open(file_list_path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            if license_filter and license_filter not in row.get("License", ""):
                continue
            if downloaded >= max_articles:
                break

            file_path = row["File"]
            url = f"{base}/{file_path}"
            local_path = output / Path(file_path).name
            if local_path.exists():
                continue  # already fetched; makes reruns resumable

            resp = requests.get(url, timeout=60)
            if resp.status_code == 200:
                local_path.write_bytes(resp.content)
                downloaded += 1
                if downloaded % 100 == 0:
                    print(f"Downloaded {downloaded} articles...")
            time.sleep(0.3)  # space requests ~0.3s apart (see Best Practices)

    print(f"Total downloaded: {downloaded}")
```
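For example, to pull a sample of CC BY packages into a local corpus directory (the directory name is an arbitrary choice):

```python
download_filtered_corpus(
    "oa_file_list.csv",
    "pmc_corpus/ccby_sample",  # arbitrary output directory
    license_filter="CC BY",
    max_articles=500,
)
```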
## PMC ID Cross-Referencing

Convert between different article identifiers:

```bash
# PMID → PMCID → DOI conversion
curl "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=29346600&format=json"

# Batch conversion (up to 200 IDs)
curl "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=29346600,30266829,31048553&format=json"
```
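The same endpoint is easy to call from Python. A sketch; the `records`/`pmid`/`pmcid`/`doi` field names are assumed from typical ID Converter responses, so inspect the JSON yourself if the shape differs:

```python
# Sketch: batch ID conversion via the PMC ID Converter API
import requests

ids = ["29346600", "30266829", "31048553"]  # up to 200 per request
resp = requests.get(
    "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/",
    params={"ids": ",".join(ids), "format": "json"},
)
resp.raise_for_status()
for rec in resp.json().get("records", []):  # field names assumed, see note above
    print(rec.get("pmid"), rec.get("pmcid"), rec.get("doi"))
```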
## Package Contents

Each article package (`.tar.gz`) typically contains:

```
PMC1234567/
├── PMC1234567.xml     # Full text in JATS XML
├── PMC1234567.pdf     # PDF (if available)
├── figure1.jpg        # Figures
├── figure2.jpg
├── table1.html        # Tables (sometimes)
└── supplement1.pdf    # Supplementary materials
```
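Once extracted, the JATS XML is the most useful artifact for mining. A sketch that pulls the title and body paragraphs with the standard library; the file path follows the layout above, and JATS element names like `article-title` and `p` are standard, but check your files for namespaced variants:

```python
# Sketch: extract title and body text from a package's JATS XML
import xml.etree.ElementTree as ET

root = ET.parse("articles/PMC1234567/PMC1234567.xml").getroot()

title = root.findtext(".//article-title")
paragraphs = ["".join(p.itertext()) for p in root.iter("p")]

print(title)
print(f"{len(paragraphs)} paragraphs of body text")
```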
## Best Practices

- **Start with the file list**: Download `oa_file_list.csv` first and filter locally
- **Respect rate limits**: Space requests ~0.3s apart for individual downloads
- **Use incremental updates**: After the initial download, use the `-N` flag to only get new/updated files
- **Check licenses**: OA Commercial (CC BY) allows any use; Non-Commercial restricts commercial applications
- **Storage planning**: Full OA Subset is ~500GB+ uncompressed