Awesome-Agent-Skills-for-Empirical-Research repository-harvesting-guide
Harvest metadata from open repositories using OAI-PMH protocol
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/tools/scraping/repository-harvesting-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-repository-harves && rm -rf "$T"
skills/43-wentorai-research-plugins/skills/tools/scraping/repository-harvesting-guide/SKILL.mdRepository Harvesting Guide
A skill for harvesting metadata from open access repositories using the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) protocol. Covers protocol fundamentals, building harvesters in Python, handling resumption tokens for large collections, metadata format parsing (Dublin Core, MARC, METS), selective harvesting by date and set, and integrating harvested data into research workflows.
OAI-PMH Protocol Fundamentals
What Is OAI-PMH
OAI-PMH is a standardized protocol that allows metadata to be harvested from repository systems. It is the backbone of library interoperability and is supported by virtually every institutional repository, preprint server, and digital library worldwide.
OAI-PMH Architecture: Data Providers (repositories): - Expose metadata through a standardized HTTP interface - Must support Dublin Core as minimum metadata format - May support additional formats (MARC, MODS, DataCite, etc.) - Examples: arXiv, PubMed Central, DSpace repositories, EPrints, institutional repositories Service Providers (harvesters): - Send HTTP requests to data providers - Collect, aggregate, and index metadata - Build search services, union catalogs, analytics - Examples: BASE (Bielefeld), CORE, OpenDOAR Protocol Version: 2.0 (current, since 2002) Transport: HTTP GET or POST Response format: XML Base URL example: https://arxiv.org/oai2
Six OAI-PMH Verbs
OAI-PMH defines exactly six request types (verbs): 1. Identify Purpose: Describe the repository URL: baseURL?verb=Identify Returns: repository name, admin email, earliest datestamp, granularity, compression support 2. ListMetadataFormats Purpose: List available metadata formats URL: baseURL?verb=ListMetadataFormats Returns: format prefixes (oai_dc, marc21, datacite, etc.) Optional: identifier parameter to check formats for one record 3. ListSets Purpose: List available sets (collections/categories) URL: baseURL?verb=ListSets Returns: set names and specs for selective harvesting Example sets: physics:hep-th, cs:AI, math:AG 4. ListIdentifiers Purpose: List record identifiers (headers only, no metadata) URL: baseURL?verb=ListIdentifiers&metadataPrefix=oai_dc Optional: from, until, set parameters Returns: identifiers, datestamps, set memberships 5. ListRecords Purpose: Harvest full metadata records URL: baseURL?verb=ListRecords&metadataPrefix=oai_dc Optional: from, until, set parameters Returns: complete metadata records in requested format 6. GetRecord Purpose: Retrieve a single record by identifier URL: baseURL?verb=GetRecord&identifier=oai:arxiv:2301.00001 &metadataPrefix=oai_dc Returns: one complete metadata record
Building a Harvester in Python
Basic Harvester
import requests import xml.etree.ElementTree as ET import time OAI_NS = "http://www.openarchives.org/OAI/2.0/" DC_NS = "http://purl.org/dc/elements/1.1/" def harvest_records(base_url, metadata_prefix="oai_dc", from_date=None, until_date=None, set_spec=None): """ Harvest all records from an OAI-PMH endpoint. Handles resumption tokens for paginated results. Args: base_url: OAI-PMH base URL metadata_prefix: metadata format (default: oai_dc) from_date: selective harvest start (YYYY-MM-DD) until_date: selective harvest end (YYYY-MM-DD) set_spec: restrict to a specific set """ params = { "verb": "ListRecords", "metadataPrefix": metadata_prefix, } if from_date: params["from"] = from_date if until_date: params["until"] = until_date if set_spec: params["set"] = set_spec all_records = [] request_count = 0 while True: response = requests.get(base_url, params=params, timeout=30) response.raise_for_status() request_count += 1 root = ET.fromstring(response.content) # Parse records from this page records = root.findall( f".//{{{OAI_NS}}}record" ) for record in records: parsed = parse_dublin_core(record) if parsed: all_records.append(parsed) # Check for resumption token token_elem = root.find( f".//{{{OAI_NS}}}resumptionToken" ) if token_elem is not None and token_elem.text: params = { "verb": "ListRecords", "resumptionToken": token_elem.text, } # Polite delay between requests time.sleep(2) else: break print(f"Harvested {len(all_records)} records " f"in {request_count} requests") return all_records def parse_dublin_core(record_element): """ Parse a Dublin Core metadata record into a dictionary. """ header = record_element.find(f"{{{OAI_NS}}}header") metadata = record_element.find(f"{{{OAI_NS}}}metadata") if header is None or metadata is None: return None # Check if record is deleted status = header.get("status", "") if status == "deleted": return None identifier = header.findtext(f"{{{OAI_NS}}}identifier", "") datestamp = header.findtext(f"{{{OAI_NS}}}datestamp", "") dc = metadata.find(f".//{{{DC_NS}}}../") result = { "oai_identifier": identifier, "datestamp": datestamp, "title": find_dc_text(metadata, "title"), "creator": find_dc_all(metadata, "creator"), "subject": find_dc_all(metadata, "subject"), "description": find_dc_text(metadata, "description"), "date": find_dc_text(metadata, "date"), "type": find_dc_text(metadata, "type"), "identifier": find_dc_all(metadata, "identifier"), "language": find_dc_text(metadata, "language"), "rights": find_dc_text(metadata, "rights"), } return result def find_dc_text(metadata, element_name): """Find first Dublin Core element text.""" elem = metadata.find(f".//{{{DC_NS}}}{element_name}") return elem.text if elem is not None else "" def find_dc_all(metadata, element_name): """Find all values of a Dublin Core element.""" elems = metadata.findall(f".//{{{DC_NS}}}{element_name}") return [e.text for e in elems if e.text]
Selective Harvesting
By Date Range
Incremental harvesting strategy: First harvest: Get everything from_date = None (or repository's earliestDatestamp) until_date = today Subsequent harvests: Get only new/modified records from_date = last_harvest_date until_date = today Date granularity: - Day-level: YYYY-MM-DD (most common) - Second-level: YYYY-MM-DDThh:mm:ssZ (some repositories) - Check the Identify response for supported granularity Important: OAI-PMH datestamps reflect the date the METADATA was last modified, not the publication date. A record edited yesterday to fix a typo will appear in a harvest with from=yesterday, even if the paper was published in 2015.
By Set (Collection)
Common set structures by repository type: arXiv: physics, physics:hep-th, cs, cs:AI, math, math:AG, etc. DSpace repositories: com_12345_1 (community), col_12345_2 (collection) Hierarchical: department -> collection PubMed Central: By journal: pmc-journal-name By funder: pmc-funder-name Strategy: 1. Call ListSets to see available sets 2. Identify sets relevant to your research topic 3. Harvest only those sets to reduce data volume 4. Store the set membership for each record
Data Quality and Deduplication
Common Quality Issues
Quality problems in harvested metadata: 1. Duplicate records: - Same paper in multiple repositories - Same paper in multiple sets within one repository - Solution: Deduplicate by DOI, then by title similarity 2. Incomplete metadata: - Missing abstracts (very common) - Missing author identifiers - Missing dates or using inconsistent date formats - Solution: Enrich with Crossref or OpenAlex lookups 3. Encoding issues: - Non-UTF-8 characters in older repositories - HTML entities in text fields - Solution: Normalize encoding, strip HTML tags 4. Inconsistent formats: - Dates as "2023", "2023-01", "2023-01-15", "January 2023" - Author names as "Smith, John" vs "John Smith" vs "J. Smith" - Solution: Parse and normalize to canonical formats
Notable OAI-PMH Endpoints
Major repositories with OAI-PMH support: arXiv: https://export.arxiv.org/oai2 PubMed Central: https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi Europeana: https://oai.europeana.eu/oai HAL (France): https://api.archives-ouvertes.fr/oai/hal DBLP: https://dblp.org/oai CiteSeerX: https://citeseerx.ist.psu.edu/oai2 To find more endpoints: - OpenDOAR directory: https://v2.sherpa.ac.uk/opendoar/ - ROAR (Registry of Open Access Repositories) - BASE (Bielefeld Academic Search Engine) source list
OAI-PMH harvesting remains the most reliable method for building comprehensive metadata collections from open repositories. While newer APIs like ResourceSync and Signposting offer richer functionality, OAI-PMH's universal adoption and simplicity make it the practical choice for most academic metadata collection tasks.