Awesome-Agent-Skills-for-Empirical-Research base-academic-search

Search 400M+ open access documents via the BASE search engine API

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/literature/search/base-academic-search" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-base-academic-sea && rm -rf "$T"
manifest: skills/43-wentorai-research-plugins/skills/literature/search/base-academic-search/SKILL.md
source content

BASE (Bielefeld Academic Search Engine) API

Overview

BASE is one of the world's largest search engines for academic open access web resources. Operated by Bielefeld University Library, it indexes 400M+ documents from 11,000+ content providers including institutional repositories, preprint servers, and digital libraries. Unlike Google Scholar, BASE provides structured metadata, license information, and full-text links. The API is free with registration.

API Endpoints

Base URL

https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi

Search

# Basic keyword search (JSON response)
curl "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?\
func=PerformSearch&query=climate+change+adaptation&format=json&hits=20"

# Search with field filters
curl "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?\
func=PerformSearch&query=dctitle:transformer+AND+dcsubject:NLP&format=json"

# Filter by document type and year
curl "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?\
func=PerformSearch&query=deep+learning&dctypenorm=121&dcyear:2024&format=json"

# Open access only
curl "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi?\
func=PerformSearch&query=CRISPR&dcrights:open&format=json"

Search Fields

FieldDescriptionExample
dctitle
Title
dctitle:attention+mechanism
dccreator
Author
dccreator:vaswani
dcsubject
Subject/keywords
dcsubject:machine+learning
dcdescription
Abstract
dcdescription:neural+network
dcyear
Publication year
dcyear:2024
dctype
Document type text
dctype:article
dctypenorm
Normalized type code
121
(journal article)
dcrights
Access rights
dcrights:open
dclang
Language
dclang:eng
dclink
Source URL
dclink:arxiv.org
dcoa
Open access status
dcoa:1
(OA),
dcoa:2
(restricted)
dcprovider
Content provider
dcprovider:arxiv.org

Document Type Codes

CodeType
121
Journal article
122
Book / monograph
14
Conference paper
15
Thesis / dissertation
17
Report
18
Preprint

Query Parameters

ParameterDescriptionDefault
func
Must be
PerformSearch
Required
query
Search query with optional field prefixesRequired
format
Response format:
json
or
xml
xml
hits
Results per page (max 125)10
offset
Pagination offset0
sortby
Sort:
dcyear desc
,
score desc
relevance

Response Structure

{
  "response": {
    "numFound": 45200,
    "start": 0,
    "docs": [
      {
        "dctitle": "Attention Is All You Need",
        "dccreator": ["Ashish Vaswani", "Noam Shazeer"],
        "dcyear": "2017",
        "dcsubject": ["machine learning", "attention mechanism"],
        "dcdescription": "The dominant sequence transduction models...",
        "dcidentifier": "https://arxiv.org/abs/1706.03762",
        "dcsource": "arXiv.org",
        "dcprovider": "arxiv.org",
        "dcdocid": "abc123xyz",
        "dcoa": 1,
        "dctypenorm": ["18"],
        "dclang": ["eng"]
      }
    ]
  }
}

Python Usage

import requests

BASE_URL = "https://api.base-search.net/cgi-bin/BaseHttpSearchInterface.fcgi"


def search_base(query: str, hits: int = 20,
                doc_type: int = None, oa_only: bool = False) -> list:
    """Search BASE for academic open access documents."""
    q = query
    if doc_type:
        q += f" AND dctypenorm:{doc_type}"
    if oa_only:
        q += " AND dcoa:1"

    params = {
        "func": "PerformSearch",
        "query": q,
        "format": "json",
        "hits": hits,
        "sortby": "dcyear desc",
    }

    resp = requests.get(BASE_URL, params=params)
    resp.raise_for_status()
    data = resp.json()

    results = []
    for doc in data.get("response", {}).get("docs", []):
        results.append({
            "title": doc.get("dctitle"),
            "authors": doc.get("dccreator", []),
            "year": doc.get("dcyear"),
            "source": doc.get("dcsource"),
            "url": doc.get("dcidentifier"),
            "abstract": (doc.get("dcdescription") or "")[:300],
            "open_access": doc.get("dcoa") == 1,
            "type": doc.get("dctypenorm", []),
        })
    return results


def search_dissertations(topic: str, lang: str = "eng") -> list:
    """Find dissertations and theses on a topic."""
    query = f"{topic} AND dctypenorm:15 AND dclang:{lang}"
    return search_base(query, hits=50)


def search_by_provider(query: str, provider: str) -> list:
    """Search within a specific content provider."""
    full_query = f"{query} AND dcprovider:{provider}"
    return search_base(full_query)


# Example: find recent open access ML papers
papers = search_base("transformer self-attention", hits=10, oa_only=True)
for p in papers:
    oa = "OA" if p["open_access"] else "restricted"
    print(f"[{p['year']}] {p['title']} ({oa}) — {p['source']}")

# Example: find dissertations on climate modeling
theses = search_dissertations("climate modeling ocean")
for t in theses:
    print(f"[{t['year']}] {t['title']} — {', '.join(t['authors'][:2])}")

BASE vs Other Search Engines

FeatureBASEGoogle ScholarOpenAlex
Records400M+Unknown250M+
Open access focusYesNoYes
Structured APIYesNo official APIYes
License metadataYesNoPartial
Dissertation coverageExcellentGoodLimited
Repository-level filteringYesNoNo

References