Awesome-Agent-Skills-for-Empirical-Research repository-harvesting-guide

Harvest metadata from open repositories using OAI-PMH protocol

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/tools/scraping/repository-harvesting-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-repository-harves && rm -rf "$T"
manifest: skills/43-wentorai-research-plugins/skills/tools/scraping/repository-harvesting-guide/SKILL.md
source content

Repository Harvesting Guide

A skill for harvesting metadata from open access repositories using the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) protocol. Covers protocol fundamentals, building harvesters in Python, handling resumption tokens for large collections, metadata format parsing (Dublin Core, MARC, METS), selective harvesting by date and set, and integrating harvested data into research workflows.

OAI-PMH Protocol Fundamentals

What Is OAI-PMH

OAI-PMH is a standardized protocol that allows metadata to be harvested from repository systems. It is the backbone of library interoperability and is supported by virtually every institutional repository, preprint server, and digital library worldwide.

OAI-PMH Architecture:

Data Providers (repositories):
  - Expose metadata through a standardized HTTP interface
  - Must support Dublin Core as minimum metadata format
  - May support additional formats (MARC, MODS, DataCite, etc.)
  - Examples: arXiv, PubMed Central, DSpace repositories,
    EPrints, institutional repositories

Service Providers (harvesters):
  - Send HTTP requests to data providers
  - Collect, aggregate, and index metadata
  - Build search services, union catalogs, analytics
  - Examples: BASE (Bielefeld), CORE, OpenDOAR

Protocol Version: 2.0 (current, since 2002)
Transport: HTTP GET or POST
Response format: XML
Base URL example: https://arxiv.org/oai2

Six OAI-PMH Verbs

OAI-PMH defines exactly six request types (verbs):

1. Identify
   Purpose: Describe the repository
   URL: baseURL?verb=Identify
   Returns: repository name, admin email, earliest datestamp,
            granularity, compression support

2. ListMetadataFormats
   Purpose: List available metadata formats
   URL: baseURL?verb=ListMetadataFormats
   Returns: format prefixes (oai_dc, marc21, datacite, etc.)
   Optional: identifier parameter to check formats for one record

3. ListSets
   Purpose: List available sets (collections/categories)
   URL: baseURL?verb=ListSets
   Returns: set names and specs for selective harvesting
   Example sets: physics:hep-th, cs:AI, math:AG

4. ListIdentifiers
   Purpose: List record identifiers (headers only, no metadata)
   URL: baseURL?verb=ListIdentifiers&metadataPrefix=oai_dc
   Optional: from, until, set parameters
   Returns: identifiers, datestamps, set memberships

5. ListRecords
   Purpose: Harvest full metadata records
   URL: baseURL?verb=ListRecords&metadataPrefix=oai_dc
   Optional: from, until, set parameters
   Returns: complete metadata records in requested format

6. GetRecord
   Purpose: Retrieve a single record by identifier
   URL: baseURL?verb=GetRecord&identifier=oai:arxiv:2301.00001
         &metadataPrefix=oai_dc
   Returns: one complete metadata record

Building a Harvester in Python

Basic Harvester

import requests
import xml.etree.ElementTree as ET
import time

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

def harvest_records(base_url, metadata_prefix="oai_dc",
                    from_date=None, until_date=None,
                    set_spec=None):
    """
    Harvest all records from an OAI-PMH endpoint.
    Handles resumption tokens for paginated results.

    Args:
        base_url: OAI-PMH base URL
        metadata_prefix: metadata format (default: oai_dc)
        from_date: selective harvest start (YYYY-MM-DD)
        until_date: selective harvest end (YYYY-MM-DD)
        set_spec: restrict to a specific set
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
    }

    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec

    all_records = []
    request_count = 0

    while True:
        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()
        request_count += 1

        root = ET.fromstring(response.content)

        # Parse records from this page
        records = root.findall(
            f".//{{{OAI_NS}}}record"
        )

        for record in records:
            parsed = parse_dublin_core(record)
            if parsed:
                all_records.append(parsed)

        # Check for resumption token
        token_elem = root.find(
            f".//{{{OAI_NS}}}resumptionToken"
        )

        if token_elem is not None and token_elem.text:
            params = {
                "verb": "ListRecords",
                "resumptionToken": token_elem.text,
            }
            # Polite delay between requests
            time.sleep(2)
        else:
            break

    print(f"Harvested {len(all_records)} records "
          f"in {request_count} requests")
    return all_records


def parse_dublin_core(record_element):
    """
    Parse a Dublin Core metadata record into a dictionary.
    """
    header = record_element.find(f"{{{OAI_NS}}}header")
    metadata = record_element.find(f"{{{OAI_NS}}}metadata")

    if header is None or metadata is None:
        return None

    # Check if record is deleted
    status = header.get("status", "")
    if status == "deleted":
        return None

    identifier = header.findtext(f"{{{OAI_NS}}}identifier", "")
    datestamp = header.findtext(f"{{{OAI_NS}}}datestamp", "")

    dc = metadata.find(f".//{{{DC_NS}}}../")

    result = {
        "oai_identifier": identifier,
        "datestamp": datestamp,
        "title": find_dc_text(metadata, "title"),
        "creator": find_dc_all(metadata, "creator"),
        "subject": find_dc_all(metadata, "subject"),
        "description": find_dc_text(metadata, "description"),
        "date": find_dc_text(metadata, "date"),
        "type": find_dc_text(metadata, "type"),
        "identifier": find_dc_all(metadata, "identifier"),
        "language": find_dc_text(metadata, "language"),
        "rights": find_dc_text(metadata, "rights"),
    }

    return result


def find_dc_text(metadata, element_name):
    """Find first Dublin Core element text."""
    elem = metadata.find(f".//{{{DC_NS}}}{element_name}")
    return elem.text if elem is not None else ""


def find_dc_all(metadata, element_name):
    """Find all values of a Dublin Core element."""
    elems = metadata.findall(f".//{{{DC_NS}}}{element_name}")
    return [e.text for e in elems if e.text]

Selective Harvesting

By Date Range

Incremental harvesting strategy:

First harvest: Get everything
  from_date = None (or repository's earliestDatestamp)
  until_date = today

Subsequent harvests: Get only new/modified records
  from_date = last_harvest_date
  until_date = today

Date granularity:
  - Day-level: YYYY-MM-DD (most common)
  - Second-level: YYYY-MM-DDThh:mm:ssZ (some repositories)
  - Check the Identify response for supported granularity

Important: OAI-PMH datestamps reflect the date the METADATA
was last modified, not the publication date. A record edited
yesterday to fix a typo will appear in a harvest with
from=yesterday, even if the paper was published in 2015.

By Set (Collection)

Common set structures by repository type:

arXiv:
  physics, physics:hep-th, cs, cs:AI, math, math:AG, etc.

DSpace repositories:
  com_12345_1 (community), col_12345_2 (collection)
  Hierarchical: department -> collection

PubMed Central:
  By journal: pmc-journal-name
  By funder: pmc-funder-name

Strategy:
  1. Call ListSets to see available sets
  2. Identify sets relevant to your research topic
  3. Harvest only those sets to reduce data volume
  4. Store the set membership for each record

Data Quality and Deduplication

Common Quality Issues

Quality problems in harvested metadata:

1. Duplicate records:
   - Same paper in multiple repositories
   - Same paper in multiple sets within one repository
   - Solution: Deduplicate by DOI, then by title similarity

2. Incomplete metadata:
   - Missing abstracts (very common)
   - Missing author identifiers
   - Missing dates or using inconsistent date formats
   - Solution: Enrich with Crossref or OpenAlex lookups

3. Encoding issues:
   - Non-UTF-8 characters in older repositories
   - HTML entities in text fields
   - Solution: Normalize encoding, strip HTML tags

4. Inconsistent formats:
   - Dates as "2023", "2023-01", "2023-01-15", "January 2023"
   - Author names as "Smith, John" vs "John Smith" vs "J. Smith"
   - Solution: Parse and normalize to canonical formats

Notable OAI-PMH Endpoints

Major repositories with OAI-PMH support:

arXiv:          https://export.arxiv.org/oai2
PubMed Central: https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi
Europeana:      https://oai.europeana.eu/oai
HAL (France):   https://api.archives-ouvertes.fr/oai/hal
DBLP:           https://dblp.org/oai
CiteSeerX:      https://citeseerx.ist.psu.edu/oai2

To find more endpoints:
  - OpenDOAR directory: https://v2.sherpa.ac.uk/opendoar/
  - ROAR (Registry of Open Access Repositories)
  - BASE (Bielefeld Academic Search Engine) source list

OAI-PMH harvesting remains the most reliable method for building comprehensive metadata collections from open repositories. While newer APIs like ResourceSync and Signposting offer richer functionality, OAI-PMH's universal adoption and simplicity make it the practical choice for most academic metadata collection tasks.