Awesome-Agent-Skills-for-Empirical-Research base-academic-search

Search 400M+ open access documents via the BASE search engine API

install

source · Clone the upstream repo

git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/literature/search/base-academic-search" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-base-academic-sea && rm -rf "$T"

manifest: skills/43-wentorai-research-plugins/skills/literature/search/base-academic-search/SKILL.md

source content

BASE (Bielefeld Academic Search Engine) API

Overview

BASE is one of the world's largest search engines for academic open access web resources. Operated by Bielefeld University Library, it indexes 400M+ documents from 11,000+ content providers including institutional repositories, preprint servers, and digital libraries. Unlike Google Scholar, BASE provides structured metadata, license information, and full-text links. The API is free with registration.

API Endpoints

Base URL

https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi

Search

# Basic keyword search (JSON response)
curl "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?\
func=PerformSearch&query=climate+change+adaptation&format=json&hits=20"

# Search with field filters
curl "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?\
func=PerformSearch&query=dctitle:transformer+AND+dcsubject:NLP&format=json"

# Filter by document type and year
curl "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?\
func=PerformSearch&query=deep+learning+AND+dctypenorm:121+AND+dcyear:2024&format=json"

# Open access only
curl "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?\
func=PerformSearch&query=CRISPR+AND+dcrights:open&format=json"
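
The same request URLs can be assembled in Python instead of by hand. The following is a minimal sketch, assuming that field filters are written inside `query` (as in the field-filter example above) and that the endpoint accepts standard form encoding, where spaces become `+` and colons become `%3A`. The helper name `build_search_url` is hypothetical and not part of the BASE API.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi"


def build_search_url(query: str, hits: int = 10, fmt: str = "json") -> str:
    """Return a PerformSearch URL with the query safely percent-encoded.

    Hypothetical helper: only the parameter names come from this document.
    """
    params = {
        "func": "PerformSearch",
        "query": query,   # free text plus any field:value filters
        "format": fmt,
        "hits": hits,
    }
    # urlencode handles spaces, colons, and other reserved characters.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("deep learning AND dcyear:2024"))
```

Encoding the whole query as one parameter value keeps filters such as `dcyear:2024` from being mistaken for separate URL parameters.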

Search Fields

Field	Description	Example
`dctitle`	Title	`dctitle:attention+mechanism`
`dccreator`	Author	`dccreator:vaswani`
`dcsubject`	Subject/keywords	`dcsubject:machine+learning`
`dcdescription`	Abstract	`dcdescription:neural+network`
`dcyear`	Publication year	`dcyear:2024`
`dctype`	Document type text	`dctype:article`
`dctypenorm`	Normalized type code	`dctypenorm:121` (journal article)
`dcrights`	Access rights	`dcrights:open`
`dclang`	Language	`dclang:eng`
`dclink`	Source URL	`dclink:arxiv.org`
`dcoa`	Open access status	`dcoa:1` (OA), `dcoa:2` (restricted)
`dcprovider`	Content provider	`dcprovider:arxiv.org`
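
Field prefixes from the table can be combined programmatically. This is a rough sketch, assuming clauses are joined with `AND` and that multi-word values can be quoted so the whole phrase binds to its field (Lucene-style syntax, which is an assumption and not confirmed here). `build_query` is a hypothetical helper.

```python
def build_query(keywords: str = "", **fields: str) -> str:
    """Join free-text keywords and field:value clauses with AND.

    Hypothetical helper; the phrase-quoting behaviour is an assumption.
    """
    clauses = []
    if keywords.strip():
        clauses.append(keywords.strip())
    for name, value in fields.items():
        value = str(value).strip()
        # Quote multi-word values so both words apply to the same field.
        if " " in value:
            value = f'"{value}"'
        clauses.append(f"{name}:{value}")
    return " AND ".join(clauses)


print(build_query("deep learning", dctypenorm="121", dcyear="2024"))
print(build_query(dctitle="attention mechanism", dccreator="vaswani"))
```

The resulting string is meant to be passed as the value of the `query` parameter.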

Document Type Codes

Code	Type
`121`	Journal article
`122`	Book / monograph
`14`	Conference paper
`15`	Thesis / dissertation
`17`	Report
`18`	Preprint
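
Responses return type codes rather than labels, so a lookup table helps when displaying results. A small sketch using only the codes listed above; `DOC_TYPES` and `describe_types` are hypothetical names, and codes missing from the table are reported as unknown rather than guessed.

```python
# Codes and labels copied from the table above; other codes may exist.
DOC_TYPES = {
    "121": "Journal article",
    "122": "Book / monograph",
    "14": "Conference paper",
    "15": "Thesis / dissertation",
    "17": "Report",
    "18": "Preprint",
}


def describe_types(codes) -> list:
    """Map a list of dctypenorm codes (str or int) to readable labels."""
    return [DOC_TYPES.get(str(c), f"Unknown ({c})") for c in (codes or [])]


print(describe_types(["18"]))
print(describe_types([121, "99"]))
```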

Query Parameters

Parameter	Description	Default
`func`	Must be `PerformSearch`	Required
`query`	Search query with optional field prefixes	Required
`format`	Response format: `json` or `xml`	`xml`
`hits`	Results per page (max 125)	10
`offset`	Pagination offset	0
`sortby`	Sort order, e.g. `dcyear desc` or `score desc`	relevance
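
The `hits` and `offset` parameters together allow paging through large result sets. The following sketch assumes that `offset` counts documents (not pages) and that the response carries `numFound` and `docs` as in the response example in this document. `iter_docs` and its `fetch_page` callback are hypothetical; the callback is expected to perform the actual HTTP request for a given offset and page size.

```python
from typing import Callable, Iterator

MAX_HITS = 125  # per-page ceiling stated in the parameter table


def iter_docs(fetch_page: Callable[[int, int], dict],
              hits: int = 100, limit: int = 500) -> Iterator[dict]:
    """Yield up to `limit` documents, requesting one page at a time."""
    hits = max(1, min(hits, MAX_HITS))
    offset = 0
    yielded = 0
    while yielded < limit:
        body = fetch_page(offset, hits).get("response", {})
        docs = body.get("docs", [])
        if not docs:
            break  # empty page: nothing more to read
        for doc in docs:
            yield doc
            yielded += 1
            if yielded >= limit:
                return
        offset += len(docs)
        if offset >= body.get("numFound", 0):
            break  # reached the reported end of the result set


# Offline demonstration with a fake seven-document result set.
fake_docs = [{"dcdocid": str(i)} for i in range(7)]


def fake_fetch(offset: int, hits: int) -> dict:
    return {"response": {"numFound": len(fake_docs),
                         "docs": fake_docs[offset:offset + hits]}}


print([d["dcdocid"] for d in iter_docs(fake_fetch, hits=3)])
```

Passing the fetch function in as a callback keeps the paging logic testable without network access. In real use it would wrap an HTTP GET that sets `offset` and `hits`.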

Response Structure

{
  "response": {
    "numFound": 45200,
    "start": 0,
    "docs": [
      {
        "dctitle": "Attention Is All You Need",
        "dccreator": ["Ashish Vaswani", "Noam Shazeer"],
        "dcyear": "2017",
        "dcsubject": ["machine learning", "attention mechanism"],
        "dcdescription": "The dominant sequence transduction models...",
        "dcidentifier": "https://arxiv.org/abs/1706.03762",
        "dcsource": "arXiv.org",
        "dcprovider": "arxiv.org",
        "dcdocid": "abc123xyz",
        "dcoa": 1,
        "dctypenorm": ["18"],
        "dclang": ["eng"]
      }
    ]
  }
}
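
A response shaped like the one above can be turned into typed records. This is a sketch under the assumption that list-valued fields may occasionally arrive as a single string and that `dcyear` may be missing or non-numeric. `BaseRecord` and `parse_response` are hypothetical names, not part of the BASE API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BaseRecord:
    title: Optional[str]
    authors: List[str] = field(default_factory=list)
    year: Optional[int] = None
    url: Optional[str] = None
    open_access: bool = False
    type_codes: List[str] = field(default_factory=list)


def _as_list(value) -> list:
    """Normalise None, a scalar, or a list to a list."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def parse_response(payload: dict) -> List[BaseRecord]:
    """Convert the `response.docs` array into BaseRecord objects."""
    records = []
    for doc in payload.get("response", {}).get("docs", []):
        year = str(doc.get("dcyear", ""))
        records.append(BaseRecord(
            title=doc.get("dctitle"),
            authors=_as_list(doc.get("dccreator")),
            year=int(year) if year.isdigit() else None,
            url=doc.get("dcidentifier"),
            open_access=doc.get("dcoa") == 1,
            type_codes=[str(c) for c in _as_list(doc.get("dctypenorm"))],
        ))
    return records


sample = {"response": {"numFound": 1, "start": 0, "docs": [{
    "dctitle": "Attention Is All You Need",
    "dccreator": ["Ashish Vaswani", "Noam Shazeer"],
    "dcyear": "2017",
    "dcidentifier": "https://arxiv.org/abs/1706.03762",
    "dcoa": 1,
    "dctypenorm": ["18"],
}]}}
print(parse_response(sample)[0])
```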

Python Usage

from typing import Optional

import requests

BASE_URL = "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi"


def search_base(query: str, hits: int = 20,
                doc_type: Optional[int] = None, oa_only: bool = False) -> list:
    """Search BASE for academic open access documents."""
    q = query
    if doc_type:
        q += f" AND dctypenorm:{doc_type}"
    if oa_only:
        q += " AND dcoa:1"

    params = {
        "func": "PerformSearch",
        "query": q,
        "format": "json",
        "hits": hits,
        "sortby": "dcyear desc",
    }

    # Set a timeout so a stalled connection cannot hang the caller indefinitely.
    resp = requests.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    results = []
    for doc in data.get("response", {}).get("docs", []):
        results.append({
            "title": doc.get("dctitle"),
            "authors": doc.get("dccreator", []),
            "year": doc.get("dcyear"),
            "source": doc.get("dcsource"),
            "url": doc.get("dcidentifier"),
            "abstract": (doc.get("dcdescription") or "")[:300],
            "open_access": doc.get("dcoa") == 1,
            "type": doc.get("dctypenorm", []),
        })
    return results


def search_dissertations(topic: str, lang: str = "eng") -> list:
    """Find dissertations and theses on a topic."""
    query = f"{topic} AND dctypenorm:15 AND dclang:{lang}"
    return search_base(query, hits=50)


def search_by_provider(query: str, provider: str) -> list:
    """Search within a specific content provider."""
    full_query = f"{query} AND dcprovider:{provider}"
    return search_base(full_query)


# Example: find recent open access ML papers
papers = search_base("transformer self-attention", hits=10, oa_only=True)
for p in papers:
    oa = "OA" if p["open_access"] else "restricted"
    print(f"[{p['year']}] {p['title']} ({oa}) — {p['source']}")

# Example: find dissertations on climate modeling
theses = search_dissertations("climate modeling ocean")
for t in theses:
    print(f"[{t['year']}] {t['title']} — {', '.join(t['authors'][:2])}")

BASE vs Other Search Engines

Feature	BASE	Google Scholar	OpenAlex
Records	400M+	Unknown	250M+
Open access focus	Yes	No	Yes
Structured API	Yes	No official API	Yes
License metadata	Yes	No	Partial
Dissertation coverage	Excellent	Good	Limited
Repository-level filtering	Yes	No	No