Awesome-Agent-Skills-for-Empirical-Research bioc-pmc-api

Access PMC Open Access articles in BioC format for text mining

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/literature/fulltext/bioc-pmc-api" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-bioc-pmc-api && rm -rf "$T"
manifest: skills/43-wentorai-research-plugins/skills/literature/fulltext/bioc-pmc-api/SKILL.md

BioC API for PMC Open Access

Overview

The BioC API serves full-text articles from PubMed Central (PMC) in BioC, a simplified XML/JSON format designed for biomedical text mining. Unlike the standard PMC OAI service, which returns JATS XML, BioC pre-segments text into passages annotated with offsets, which suits NLP pipelines such as named entity recognition and relation extraction. The API is free and requires no authentication.

API Endpoints

Base URL

https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{PMCID}/unicode

Retrieve by PMC ID

# JSON format (recommended for programmatic use)
curl "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/PMC6267067/unicode"

# XML format
curl "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/PMC6267067/unicode"

# ASCII encoding (strips non-ASCII characters)
curl "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/PMC6267067/ascii"
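
The three requests above differ only in the format and encoding path segments. A minimal sketch of a URL builder follows; bioc_url is a hypothetical helper name used for illustration, not part of any NCBI library, and the accepted values are limited to the ones shown in this document.

```python
BASE = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi"


def bioc_url(pmcid: str, fmt: str = "json", encoding: str = "unicode") -> str:
    """Build a BioC request URL (hypothetical helper, for illustration only)."""
    # Only the variants shown in the curl examples are accepted here.
    if fmt not in ("json", "xml"):
        raise ValueError(f"fmt must be 'json' or 'xml', got {fmt!r}")
    if encoding not in ("unicode", "ascii"):
        raise ValueError(f"encoding must be 'unicode' or 'ascii', got {encoding!r}")
    return f"{BASE}/BioC_{fmt}/{pmcid}/{encoding}"


print(bioc_url("PMC6267067"))
print(bioc_url("PMC6267067", fmt="xml"))
print(bioc_url("PMC6267067", encoding="ascii"))
```

Validating the two segments up front turns a typo into a local error instead of a failed HTTP request.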

Retrieve by PubMed ID

# Convert PMID to PMCID first, then query
curl "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=29346600&format=json"
# Returns: {"records": [{"pmid": "29346600", "pmcid": "PMC6267067", ...}]}
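
The conversion step can be sketched in Python as below. This is a sketch under assumptions: it relies only on the "records", "pmid", and "pmcid" keys shown in the sample response above, the function names are hypothetical, and it assumes a record simply lacks "pmcid" when no PMC version exists. It uses the standard library's urllib so the block runs without extra packages.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

IDCONV_URL = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"


def pmcid_from_records(payload: dict, pmid: str) -> Optional[str]:
    """Return the PMCID for `pmid` from a parsed converter response, or None."""
    for record in payload.get("records", []):
        if str(record.get("pmid")) == str(pmid):
            # Assumption: "pmcid" is absent when the article has no PMC version.
            return record.get("pmcid")
    return None


def pmid_to_pmcid(pmid: str) -> Optional[str]:
    """Query the converter over HTTP (network call; nothing runs on import)."""
    query = urllib.parse.urlencode({"ids": pmid, "format": "json"})
    with urllib.request.urlopen(f"{IDCONV_URL}?{query}", timeout=30) as resp:
        return pmcid_from_records(json.load(resp), pmid)


# Offline check against the response shape shown above.
sample = {"records": [{"pmid": "29346600", "pmcid": "PMC6267067"}]}
print(pmcid_from_records(sample, "29346600"))  # PMC6267067
```

Keeping the parsing separate from the HTTP call lets the lookup logic be checked without network access.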

BioC JSON Structure

{
  "source": "PMC",
  "date": "2024-01-15",
  "key": "collection.key",
  "documents": [
    {
      "id": "PMC6267067",
      "passages": [
        {
          "infons": {
            "section_type": "TITLE",
            "type": "title"
          },
          "offset": 0,
          "text": "Article Title Here"
        },
        {
          "infons": {
            "section_type": "ABSTRACT",
            "type": "abstract"
          },
          "offset": 25,
          "text": "Background: This study investigates..."
        },
        {
          "infons": {
            "section_type": "INTRO",
            "type": "paragraph"
          },
          "offset": 350,
          "text": "The introduction text..."
        }
      ]
    }
  ]
}

Key fields:

  • passages[].infons.section_type: TITLE, ABSTRACT, INTRO, METHODS, RESULTS, DISCUSS, CONCL, REF, FIG, TABLE
  • passages[].offset: Character offset from document start
  • passages[].text: Plain text content of the passage
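
Because offset counts characters from the start of the document, a match position inside a passage can be mapped to a document-level position by adding the passage offset. A minimal sketch, assuming the JSON layout shown above; find_in_sections is a hypothetical helper name, and the sample data is the placeholder text from the structure example.

```python
from typing import Iterator, Set, Tuple


def find_in_sections(
    collection: dict, term: str, section_types: Set[str]
) -> Iterator[Tuple[str, int]]:
    """Yield (section_type, document_offset) for each match of `term`."""
    for doc in collection.get("documents", []):
        for passage in doc.get("passages", []):
            section = passage.get("infons", {}).get("section_type", "OTHER")
            if section not in section_types:
                continue
            text = passage.get("text", "")
            start = text.find(term)
            while start != -1:
                # Passage-relative index + passage offset = document offset.
                yield section, passage.get("offset", 0) + start
                start = text.find(term, start + 1)


sample = {"documents": [{"id": "PMC6267067", "passages": [
    {"infons": {"section_type": "TITLE", "type": "title"},
     "offset": 0, "text": "Article Title Here"},
    {"infons": {"section_type": "ABSTRACT", "type": "abstract"},
     "offset": 25, "text": "Background: This study investigates..."},
]}]}

print(list(find_in_sections(sample, "study", {"ABSTRACT"})))  # [('ABSTRACT', 42)]
```

Here "study" starts at index 17 of the abstract passage, so its document-level position is 25 + 17 = 42.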

Python Usage

from typing import Union

import requests

def get_bioc_article(pmcid: str, fmt: str = "json") -> Union[dict, str]:
    """Fetch a PMC article in BioC format: parsed JSON, or raw XML text when fmt="xml"."""
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_{fmt}/{pmcid}/unicode"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # raise on HTTP error responses (4xx/5xx)
    return resp.json() if fmt == "json" else resp.text

def extract_sections(bioc_doc: dict) -> dict:
    """Extract text organized by section type."""
    sections = {}
    for doc in bioc_doc.get("documents", []):
        for passage in doc.get("passages", []):
            section = passage.get("infons", {}).get("section_type", "OTHER")
            text = passage.get("text", "")
            sections.setdefault(section, []).append(text)
    return {k: "\n".join(v) for k, v in sections.items()}

# Example: fetch and parse
article = get_bioc_article("PMC6267067")
sections = extract_sections(article)
print(f"Title: {sections.get('TITLE', 'N/A')}")
print(f"Abstract length: {len(sections.get('ABSTRACT', ''))} chars")
print(f"Sections found: {list(sections.keys())}")
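
If a response ever arrives as a list of collections rather than the single collection object shown in the structure section (an assumption made here for defensiveness, not something the source states), a call such as .get("documents") on the payload would fail. A minimal sketch of a normalizer that accepts both shapes; iter_documents is a hypothetical helper name.

```python
from typing import Iterator, Union


def iter_documents(payload: Union[dict, list]) -> Iterator[dict]:
    """Yield every document from one collection or from a list of collections."""
    # Assumption: a list payload holds collection objects of the same shape.
    collections = payload if isinstance(payload, list) else [payload]
    for collection in collections:
        yield from collection.get("documents", [])


single = {"documents": [{"id": "PMC6267067", "passages": []}]}
wrapped = [single]

print([doc["id"] for doc in iter_documents(single)])
print([doc["id"] for doc in iter_documents(wrapped)])
```

Code that consumes documents through a helper like this does not need to know which top-level shape it received.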

Data Coverage

  • PMC Open Access Subset: over 4 million articles with CC licenses
  • Author Manuscript Collection: NIH-funded author manuscripts
  • Updates: New articles added daily

Rate Limits

  • Follow the NCBI standard of at most 3 requests per second
  • For bulk access, use the PMC FTP service instead
  • Add tool=your_tool_name&email=your@email.com to requests for priority queue
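
One way to stay under the 3 requests per second guideline is to space calls at least a third of a second apart. A minimal sketch; RateLimiter is a hypothetical class written for this example, and the injectable clock and sleep parameters exist only so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(
        self,
        max_per_second: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


limiter = RateLimiter(max_per_second=3)
for pmcid in ["PMC6267067"]:
    limiter.wait()  # the first call never sleeps
    # Fetch the article here, e.g. with get_bioc_article(pmcid).
```

This limiter is single-threaded by design; concurrent workers would need a shared lock or a queue in front of it.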

Citation

When using this API in publications, cite:

Comeau DC, Wei CH, Islamaj Dogan R, Lu Z. PMC text mining subset in BioC: about 3 million full text articles and growing. Bioinformatics, btz070, 2019.

References