SciAgent-Skills openalex-database
Query OpenAlex REST API for scholarly literature — 250M+ works, authors, institutions, journals, and concepts. Search by title/abstract keywords, author, DOI, ORCID, or OpenAlex ID. Filter by year, open access status, citation count, or field. Retrieve citations, references, and author disambiguation. Free, no authentication required. For PubMed biomedical search use pubmed-database; for bioRxiv preprints use biorxiv-database.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scientific-writing/openalex-database" ~/.claude/skills/jaechang-hits-sciagent-skills-openalex-database && rm -rf "$T"
skills/scientific-writing/openalex-database/SKILL.md
OpenAlex Scholarly Database
Overview
OpenAlex is a free, open-access index of 250M+ scholarly works, 90M+ authors, 110,000+ journals, and 10,000+ institutions. It succeeds Microsoft Academic Graph and provides rich metadata: abstracts, open-access URLs, citation counts, referenced works, disambiguated author IDs (linked to ORCID), and concept tags. The REST API requires no authentication for up to 100,000 requests/day; adding an email via the mailto parameter joins the polite pool, which receives priority processing.
When to Use
- Building systematic literature review corpora by searching across all academic disciplines (not just biomedical)
- Retrieving citation networks for bibliometric analysis, co-citation clustering, or reference graph traversal
- Disambiguating author identities across institutions using ORCID/OpenAlex author IDs
- Finding open-access full-text URLs for a set of DOIs to build downloadable paper corpora
- Analyzing publication trends by year, institution, country, or research concept
- Enriching a paper list with metadata (citation count, abstract, venue) from DOIs or titles
- For PubMed-indexed biomedical literature use pubmed-database; for bioRxiv preprints use biorxiv-database
Prerequisites
- Python packages: requests, pandas
- Data requirements: DOIs, OpenAlex Work IDs (W…), author names, ORCID IDs, or search terms
- Environment: internet connection; no API key required
- Rate limits: 10 req/s anonymous; add the mailto=your@email.com query param to join the polite pool (higher priority, same limit); a minimal throttling sketch follows the install command below
pip install requests pandas
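To stay under the roughly 10 requests/second limit and always attach the mailto parameter, a small wrapper like the one below can help. This is an illustrative sketch, not part of the skill: the helper name, delay value, and email are placeholders.

import time
import requests

BASE = "https://api.openalex.org"
MAILTO = "your@email.com"  # placeholder: use your real address

def openalex_get(path, params=None, delay=0.12):
    """Illustrative helper: attach mailto and pause briefly to stay under ~10 req/s."""
    params = dict(params or {})
    params.setdefault("mailto", MAILTO)
    r = requests.get(f"{BASE}/{path.lstrip('/')}", params=params)
    r.raise_for_status()
    time.sleep(delay)  # crude throttle; adjust to your workload
    return r.json()

# Example: data = openalex_get("works", {"search": "CRISPR", "per_page": 5})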
Quick Start
import requests

BASE = "https://api.openalex.org"

# Search for works on CRISPR
r = requests.get(f"{BASE}/works",
                 params={"search": "CRISPR gene editing",
                         "filter": "publication_year:2023",
                         "per_page": 5,
                         "mailto": "your@email.com"})
r.raise_for_status()
data = r.json()
print(f"Total results: {data['meta']['count']}")
for work in data["results"][:3]:
    print(f"  {work['title'][:80]} ({work['publication_year']}) cites={work['cited_by_count']}")
Core API
Query 1: Works Search
Search works by title/abstract keywords with filters.
import requests, pandas as pd

BASE = "https://api.openalex.org"

def search_works(query, filters=None, per_page=25, mailto="your@email.com"):
    params = {"search": query, "per_page": per_page, "mailto": mailto}
    if filters:
        params["filter"] = ",".join(f"{k}:{v}" for k, v in filters.items())
    r = requests.get(f"{BASE}/works", params=params)
    r.raise_for_status()
    return r.json()

# Search with filters
data = search_works("single-cell RNA sequencing",
                    filters={"publication_year": "2020-2024", "open_access.is_oa": "true"},
                    per_page=10)
print(f"Open-access scRNA-seq papers 2020-2024: {data['meta']['count']}")

rows = []
for w in data["results"]:
    rows.append({
        "title": w["title"],
        "year": w["publication_year"],
        "citations": w["cited_by_count"],
        "doi": w.get("doi"),
        "oa_url": w.get("open_access", {}).get("oa_url"),
    })
df = pd.DataFrame(rows)
print(df[["title", "year", "citations"]].head())
# Paginate through all results
def paginate_works(query, filters=None, max_results=200, mailto="your@email.com"):
    """Retrieve up to max_results works, paginating automatically."""
    all_results = []
    cursor = "*"
    while len(all_results) < max_results:
        params = {"search": query, "per_page": 200, "cursor": cursor, "mailto": mailto}
        if filters:
            params["filter"] = ",".join(f"{k}:{v}" for k, v in filters.items())
        r = requests.get(f"{BASE}/works", params=params)
        data = r.json()
        all_results.extend(data["results"])
        cursor = data["meta"].get("next_cursor")
        if not cursor:
            break
    return all_results[:max_results]

papers = paginate_works("transformer protein structure", max_results=100)
print(f"Retrieved {len(papers)} papers")
Query 2: Lookup by DOI or OpenAlex ID
Retrieve a single work by DOI or OpenAlex ID.
import requests

BASE = "https://api.openalex.org"

# By DOI
doi = "10.1038/s41592-019-0458-z"  # Scanpy paper
r = requests.get(f"{BASE}/works/https://doi.org/{doi}", params={"mailto": "your@email.com"})
r.raise_for_status()
work = r.json()
print(f"Title    : {work['title']}")
print(f"Year     : {work['publication_year']}")
print(f"Citations: {work['cited_by_count']}")
print(f"Journal  : {work.get('primary_location', {}).get('source', {}).get('display_name')}")

abstract = work.get("abstract_inverted_index")
if abstract:
    # Reconstruct abstract from inverted index
    words = {pos: word for word, positions in abstract.items() for pos in positions}
    text = " ".join(words[i] for i in sorted(words))
    print(f"Abstract (first 200): {text[:200]}")
Query 3: Author Search and ORCID Lookup
Find author records, resolve ORCID identifiers, retrieve publication lists.
import requests, pandas as pd

BASE = "https://api.openalex.org"

# Search for an author
r = requests.get(f"{BASE}/authors",
                 params={"search": "Jennifer Doudna", "per_page": 5, "mailto": "your@email.com"})
authors = r.json()["results"]
for a in authors[:3]:
    print(f"Author: {a['display_name']}")
    print(f"  OpenAlex ID : {a['id']}")
    print(f"  ORCID       : {a.get('orcid', 'n/a')}")
    print(f"  Institution : {a.get('last_known_institution', {}).get('display_name', 'n/a')}")
    print(f"  Works count : {a['works_count']}")
    print(f"  h-index     : {a['summary_stats'].get('h_index', 'n/a')}")
    print()
# Get all papers by an author (by ORCID)
orcid = "0000-0001-8742-3594"  # Jennifer Doudna
r = requests.get(f"{BASE}/works",
                 params={"filter": f"author.orcid:{orcid}",
                         "sort": "cited_by_count:desc",
                         "per_page": 10,
                         "mailto": "your@email.com"})
papers = r.json()["results"]
for p in papers[:5]:
    print(f"  [{p['publication_year']}] {p['title'][:70]} (cites: {p['cited_by_count']})")
Query 4: Citation Network Retrieval
Get referenced works and citing works for a paper.
import requests, pandas as pd

BASE = "https://api.openalex.org"

work_id = "W2018426904"  # CRISPR paper

# Get what this paper references
r = requests.get(f"{BASE}/works/{work_id}",
                 params={"select": "referenced_works,cited_by_count,title",
                         "mailto": "your@email.com"})
work = r.json()
ref_ids = work.get("referenced_works", [])
print(f"'{work['title']}' cites {len(ref_ids)} papers")
print(f"Total citations: {work['cited_by_count']}")

# Fetch metadata for references (batch)
if ref_ids:
    ids_str = "|".join(ref_id.split("/")[-1] for ref_id in ref_ids[:10])
    r2 = requests.get(f"{BASE}/works",
                      params={"filter": f"openalex_id:{ids_str}",
                              "per_page": 10,
                              "mailto": "your@email.com"})
    refs = r2.json()["results"]
    for ref in refs[:5]:
        print(f"  [{ref['publication_year']}] {ref['title'][:70]}")
Query 5: Concept/Topic Filtering and Trend Analysis
Filter by research concepts and analyze publication trends.
import requests, pandas as pd

BASE = "https://api.openalex.org"

# Get concept ID for "Machine Learning"
r = requests.get(f"{BASE}/concepts",
                 params={"search": "machine learning biology", "per_page": 3, "mailto": "your@email.com"})
concepts = r.json()["results"]
for c in concepts[:3]:
    print(f"Concept: {c['display_name']} (ID: {c['id']}, level: {c['level']})")

# Count papers per year for a concept
concept_id = "C154945302"  # Machine learning (OpenAlex ID)
r2 = requests.get(f"{BASE}/works",
                  params={"filter": f"concepts.id:{concept_id},publication_year:2015-2024",
                          "group_by": "publication_year",
                          "per_page": 200,
                          "mailto": "your@email.com"})
groups = r2.json()["group_by"]
df = pd.DataFrame(groups).rename(columns={"key": "year", "count": "papers"})
df = df.sort_values("year")
print(df.tail(5).to_string(index=False))
Query 6: Institution and Venue Queries
Retrieve papers from a specific institution, journal, or conference.
import requests, pandas as pd

BASE = "https://api.openalex.org"

# Papers from a specific journal in the last year
r = requests.get(f"{BASE}/works", params={
    "filter": "primary_location.source.issn:0028-0836,publication_year:2023",
    "per_page": 10,
    "sort": "cited_by_count:desc",
    "mailto": "your@email.com"
})
data = r.json()
print(f"Nature papers 2023: {data['meta']['count']}")
for w in data["results"][:5]:
    print(f"  [{w['cited_by_count']} cites] {w['title'][:70]}")
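The example above filters by journal ISSN. For a specific institution, one option is to resolve the institution's OpenAlex ID and filter works on authorships.institutions.id. The sketch below is illustrative: the institution name is a placeholder, and you should verify the resolved ID before relying on it.

import requests

BASE = "https://api.openalex.org"

# Resolve the institution ID first (placeholder query)
r = requests.get(f"{BASE}/institutions",
                 params={"search": "Stanford University", "per_page": 1, "mailto": "your@email.com"})
inst = r.json()["results"][0]
inst_id = inst["id"].split("/")[-1]  # short form, e.g. "I..."
print(f"Institution: {inst['display_name']} ({inst_id})")

# Works from that institution in 2023
r2 = requests.get(f"{BASE}/works",
                  params={"filter": f"authorships.institutions.id:{inst_id},publication_year:2023",
                          "per_page": 5,
                          "sort": "cited_by_count:desc",
                          "mailto": "your@email.com"})
for w in r2.json()["results"]:
    print(f"  [{w['cited_by_count']} cites] {w['title'][:70]}")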
Key Concepts
Inverted Index Abstracts
OpenAlex stores abstracts as inverted indexes (word → list of positions) rather than plain text due to copyright restrictions. Reconstruct with:
" ".join(words[i] for i in sorted({pos: w for w, ps in inv.items() for pos in ps})).
Cursor-Based Pagination
OpenAlex uses cursor-based pagination (the cursor parameter) for deep result traversal. Start with cursor="*" and pass the next_cursor value from each response into the next request. Maximum 200 results per page; basic offset paging stops at 10,000 results, while cursor pagination can go beyond that.
Common Workflows
Workflow 1: Systematic Literature Search
Goal: Download all papers matching a topic query with metadata for systematic review.
import requests, time, pandas as pd

BASE = "https://api.openalex.org"
MAILTO = "your@email.com"

def systematic_search(query, year_from, year_to, max_results=500):
    """Paginate through results and return a DataFrame."""
    all_results = []
    cursor = "*"
    filters = f"publication_year:{year_from}-{year_to}"
    while len(all_results) < max_results:
        r = requests.get(f"{BASE}/works",
                         params={"search": query,
                                 "filter": filters,
                                 "per_page": 200,
                                 "cursor": cursor,
                                 "mailto": MAILTO,
                                 "select": "id,doi,title,publication_year,cited_by_count,open_access"})
        r.raise_for_status()
        data = r.json()
        all_results.extend(data["results"])
        cursor = data["meta"].get("next_cursor")
        if not cursor:
            break
        time.sleep(0.1)
    rows = []
    for w in all_results[:max_results]:
        rows.append({
            "openalex_id": w["id"],
            "doi": w.get("doi"),
            "title": w.get("title"),
            "year": w.get("publication_year"),
            "citations": w.get("cited_by_count"),
            "is_oa": w.get("open_access", {}).get("is_oa"),
            "oa_url": w.get("open_access", {}).get("oa_url"),
        })
    return pd.DataFrame(rows)

# Example: papers on drug repurposing 2019-2024
df = systematic_search("drug repurposing machine learning", 2019, 2024, max_results=200)
df.to_csv("drug_repurposing_literature.csv", index=False)
print(f"Retrieved {len(df)} papers")
print(df[["title", "year", "citations", "is_oa"]].head(5).to_string(index=False))
Workflow 2: Author Collaboration Network
Goal: Map co-authors for a researcher to analyze their collaboration network.
import requests, time, pandas as pd
from collections import defaultdict

BASE = "https://api.openalex.org"
MAILTO = "your@email.com"

def get_author_works(orcid, max_papers=50):
    r = requests.get(f"{BASE}/works",
                     params={"filter": f"author.orcid:{orcid}",
                             "sort": "cited_by_count:desc",
                             "per_page": min(max_papers, 200),
                             "mailto": MAILTO})
    r.raise_for_status()
    return r.json()["results"]

def extract_collaborators(works):
    collab_count = defaultdict(int)
    for work in works:
        for authorship in work.get("authorships", []):
            author = authorship.get("author", {})
            name = author.get("display_name")
            if name:
                collab_count[name] += 1
    return collab_count

# Map collaborators for a researcher
orcid = "0000-0001-8742-3594"
works = get_author_works(orcid, max_papers=50)
collabs = extract_collaborators(works)
top_collabs = sorted(collabs.items(), key=lambda x: -x[1])
df = pd.DataFrame(top_collabs, columns=["collaborator", "papers_together"])
df = df[df["collaborator"] != "Jennifer A. Doudna"]  # exclude self
print("Top collaborators:")
print(df.head(10).to_string(index=False))
df.to_csv("collaboration_network.csv", index=False)
Key Parameters
| Parameter | Applies to | Default | Range / Options | Effect |
|---|---|---|---|---|
| search | All | — | text string | Full-text search across title+abstract |
| filter | All | — | key:value,key:value | Structured filters (AND logic) |
| per_page | All | 25 | 1–200 | Results per page |
| cursor | Pagination | — | cursor string | Cursor for pagination |
| sort | Works | — | cited_by_count:desc, publication_date:desc | Result ordering |
| select | All | all fields | comma-separated field names | Limit response fields (faster) |
| group_by | Works | — | field name | Aggregate counts by field |
| mailto | All | — | email address | Polite pool access (prioritized) |
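A rough illustration of how these parameters combine in a single request; the query strings and field list are placeholders, not recommendations:

import requests

r = requests.get(
    "https://api.openalex.org/works",
    params={
        "search": "graph neural network drug discovery",                 # full-text search
        "filter": "publication_year:2022-2024,open_access.is_oa:true",   # AND-combined filters
        "sort": "cited_by_count:desc",                                   # result ordering
        "select": "id,doi,title,cited_by_count",                         # trim response fields
        "per_page": 50,                                                  # up to 200
        "mailto": "your@email.com",                                      # polite pool
    },
)
print(r.json()["meta"]["count"])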
Best Practices
- Always include mailto: Add mailto=your@email.com to all requests to join the polite pool and receive priority processing without rate throttling.
- Use select for large paginations: When paginating through thousands of results, specify only the needed fields (e.g. select=id,doi,title,cited_by_count) to reduce response size and speed up parsing.
- Use cursor pagination, not offset: OpenAlex does not support offset pagination beyond 10,000 results. Use cursor-based pagination (the cursor parameter) for deep traversals.
- Reconstruct abstracts from inverted index: Not all works have abstracts; check that abstract_inverted_index is not None before reconstructing to avoid a KeyError.
- Cache by work ID: OpenAlex Work IDs (W…) are stable identifiers. Cache retrieved work metadata to avoid re-fetching within a project (see the sketch after this list).
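A minimal in-memory cache keyed by Work ID might look like this; it is a sketch only, and a persistent cache (a local JSON file or sqlite) would follow the same pattern:

import requests

BASE = "https://api.openalex.org"
_work_cache = {}  # work_id -> metadata dict

def get_work_cached(work_id, mailto="your@email.com"):
    """Fetch a work by its stable OpenAlex ID, reusing a previously fetched result if present."""
    if work_id not in _work_cache:
        r = requests.get(f"{BASE}/works/{work_id}", params={"mailto": mailto})
        r.raise_for_status()
        _work_cache[work_id] = r.json()
    return _work_cache[work_id]

# Example: w = get_work_cached("W2018426904")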
Common Recipes
Recipe: DOI to Metadata Batch Lookup
When to use: Enrich a list of DOIs with citation counts, open-access URLs, and abstracts.
import requests, pandas as pd, time

BASE = "https://api.openalex.org"

dois = [
    "10.1038/s41592-019-0458-z",
    "10.1186/s13059-021-02519-4",
    "10.1038/s41587-019-0071-9",
]

rows = []
for doi in dois:
    r = requests.get(f"{BASE}/works/https://doi.org/{doi}",
                     params={"select": "title,publication_year,cited_by_count,open_access",
                             "mailto": "your@email.com"})
    if r.ok:
        w = r.json()
        rows.append({
            "doi": doi,
            "title": w.get("title"),
            "year": w.get("publication_year"),
            "citations": w.get("cited_by_count"),
            "is_oa": w.get("open_access", {}).get("is_oa"),
        })
    time.sleep(0.1)

df = pd.DataFrame(rows)
print(df.to_string(index=False))
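As an alternative to one request per DOI, the works filter accepts pipe-separated values, so several DOIs can often be fetched in a single call. A sketch under that assumption; verify the doi filter behavior against the API docs for your DOI list size:

import requests

dois = ["10.1038/s41592-019-0458-z", "10.1186/s13059-021-02519-4"]
r = requests.get(
    "https://api.openalex.org/works",
    params={"filter": "doi:" + "|".join(dois),          # pipe-separated values act as OR
            "select": "doi,title,publication_year,cited_by_count",
            "per_page": len(dois),
            "mailto": "your@email.com"},
)
for w in r.json()["results"]:
    print(w["doi"], "->", w["cited_by_count"], "citations")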
Recipe: Count Papers by Country
When to use: Geographic analysis of research output on a topic.
import requests, pandas as pd

r = requests.get(
    "https://api.openalex.org/works",
    params={"search": "CRISPR therapeutics",
            "filter": "publication_year:2023",
            "group_by": "authorships.institutions.country_code",
            "per_page": 200,
            "mailto": "your@email.com"}
)
df = pd.DataFrame(r.json()["group_by"]).rename(columns={"key": "country", "count": "papers"})
print(df.sort_values("papers", ascending=False).head(10).to_string(index=False))
Recipe: Find Most-Cited Papers in a Field
When to use: Identify landmark papers on a topic for background reading.
import requests, pandas as pd

r = requests.get(
    "https://api.openalex.org/works",
    params={"search": "protein language model",
            "sort": "cited_by_count:desc",
            "per_page": 10,
            "mailto": "your@email.com"}
)
for w in r.json()["results"]:
    print(f"[{w['cited_by_count']:5d} cites] ({w['publication_year']}) {w['title'][:70]}")
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Rate limit exceeded | More than ~10 requests/second | Add short delays (time.sleep) between requests; use the polite pool (mailto) |
| Empty abstract | No abstract available for that work | Check abstract_inverted_index for None before reconstructing; not all works have abstracts |
| Cursor pagination returns duplicates | Cursor expired | Restart pagination with cursor="*" |
| DOI lookup returns 404 | DOI not indexed in OpenAlex | Try title search instead; OpenAlex indexes 250M+ works but not 100% of the literature |
| Filter returns 0 results | Wrong field name or filter syntax error | Check filter syntax: key:value,key:value with no spaces; verify field names in the API docs |
| cited_by_count is stale | Citation counts update periodically | Counts are refreshed regularly but may lag by days; use them for trends, not exact figures |
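For the rate-limit case, a simple retry-with-backoff wrapper can make long-running scripts more robust. This is illustrative only; the status codes and backoff timing are assumptions about typical behavior, not documented guarantees.

import time
import requests

def get_with_retry(url, params=None, max_retries=3):
    """Retry a GET with increasing delay on rate-limit (429) or transient server errors."""
    for attempt in range(max_retries):
        r = requests.get(url, params=params)
        if r.status_code == 429 or r.status_code >= 500:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s backoff
            continue
        r.raise_for_status()
        return r.json()
    r.raise_for_status()  # raise if every attempt was rejected
    return r.json()

# Example: data = get_with_retry("https://api.openalex.org/works",
#                                {"search": "CRISPR", "mailto": "your@email.com"})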
Related Skills
- pubmed-database — Biomedical literature with MeSH controlled vocabulary; better for clinical and life sciences
- biorxiv-database — Biomedical preprints not yet indexed in OpenAlex
- scientific-brainstorming — Hypothesis generation workflows using literature as input
- literature-review — Guide for designing systematic literature reviews using OpenAlex
References
- OpenAlex documentation — Full API reference and data model
- OpenAlex API endpoint — Interactive API explorer
- OpenAlex paper (Priem et al. 2022) — Description of the OpenAlex data system
- OpenAlex entity types — Works, Authors, Sources, Institutions, Concepts documentation