Commonly-used-high-value-skills arxiv

Search and retrieve academic papers from arXiv using their free REST API. No API key needed. Search by keyword, author, category, or ID. Combine with web_extract or the ocr-and-documents skill to read full paper content.

install
source · Clone the upstream repo
git clone https://github.com/seaworld008/Commonly-used-high-value-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/seaworld008/Commonly-used-high-value-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/knowledge-and-pm-integrations/arxiv" ~/.claude/skills/seaworld008-commonly-used-high-value-skills-arxiv-2cd73a && rm -rf "$T"
manifest: skills/knowledge-and-pm-integrations/arxiv/SKILL.md
source content

arXiv Research

Search and retrieve academic papers from arXiv via their free REST API. No API key, no dependencies — just curl.

When to Use

Use this skill when the user wants to:

  • search arXiv by topic, author, category, or paper ID
  • pull abstracts, PDF links, or BibTeX-style metadata
  • build a paper shortlist before deeper reading
  • combine lightweight paper discovery with a later reading or synthesis workflow

Usage

Recommended flow:

search papers
-> shortlist candidates
-> inspect metadata and citations
-> fetch abstract or full PDF
-> carry best papers into a deeper research workflow

Quick Reference

| Action | Command |
| --- | --- |
| Search papers | curl "https://export.arxiv.org/api/query?search_query=all:QUERY&max_results=5" |
| Get specific paper | curl "https://export.arxiv.org/api/query?id_list=2402.03300" |
| Read abstract (web) | web_extract(urls=["https://arxiv.org/abs/2402.03300"]) |
| Read full paper (PDF) | web_extract(urls=["https://arxiv.org/pdf/2402.03300"]) |

Searching Papers

The API returns Atom XML. Parse it with grep/sed, or pipe it through python3 for clean output.

Basic search

curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5"
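
Hand-writing the + separators becomes error-prone once a query contains spaces or quotes. As a minimal sketch, the same request URL could be assembled in Python with the standard library. The name build_search_url and its defaults are illustrative assumptions, not part of this skill or of the arXiv API; only the endpoint and parameter names are taken from the examples in this document.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_search_url(search_query, start=0, max_results=5,
                     sort_by=None, sort_order=None):
    """Return an arXiv API URL for a query string such as 'all:GRPO'.

    Hypothetical helper: the name and defaults are assumptions.
    """
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    if sort_by:
        params["sortBy"] = sort_by
    if sort_order:
        params["sortOrder"] = sort_order
    # urlencode turns spaces into '+' and escapes quotes as %22;
    # safe=':' keeps the field prefix (all:, ti:, ...) readable.
    return f"{ARXIV_API}?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    print(build_search_url("all:GRPO reinforcement learning"))
```

The printed URL can be passed to curl -s, or opened with urllib.request.urlopen, to retrieve an Atom feed like the one returned by the command above.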

Clean output (parse XML to readable format)

curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5&sortBy=submittedDate&sortOrder=descending" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom'}
root = ET.parse(sys.stdin).getroot()
for i, entry in enumerate(root.findall('a:entry', ns)):
    title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
    arxiv_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
    published = entry.find('a:published', ns).text[:10]
    authors = ', '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
    summary = entry.find('a:summary', ns).text.strip()[:200]
    cats = ', '.join(c.get('term') for c in entry.findall('a:category', ns))
    print(f'{i+1}. [{arxiv_id}] {title}')
    print(f'   Authors: {authors}')
    print(f'   Published: {published} | Categories: {cats}')
    print(f'   Abstract: {summary}...')
    print(f'   PDF: https://arxiv.org/pdf/{arxiv_id}')
    print()
"

Search Query Syntax

| Prefix | Searches | Example |
| --- | --- | --- |
| all: | All fields | all:transformer+attention |
| ti: | Title | ti:large+language+models |
| au: | Author | au:vaswani |
| abs: | Abstract | abs:reinforcement+learning |
| cat: | Category | cat:cs.AI |
| co: | Comment | co:accepted+NeurIPS |

Boolean operators

# AND (default when using +)
search_query=all:transformer+attention

# OR
search_query=all:GPT+OR+all:BERT

# AND NOT
search_query=all:language+model+ANDNOT+all:vision

# Exact phrase (URL-encode the double quotes as %22 so they survive shell quoting)
search_query=ti:%22chain+of+thought%22

# Combined
search_query=au:hinton+AND+cat:cs.LG
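
The operators above can also be composed programmatically rather than typed by hand. The sketch below is one possible approach, assuming hypothetical helper names (field, combine, encode) that do not exist in this skill. It makes one design choice of its own: any multi-word value is wrapped in double quotes so it is sent as an exact phrase.

```python
from urllib.parse import quote_plus

OPERATORS = {"AND", "OR", "ANDNOT"}


def field(prefix, text):
    """Format one fielded term; multi-word text becomes a quoted phrase."""
    text = text.strip()
    if " " in text:
        text = f'"{text}"'
    return f"{prefix}:{text}"


def combine(operator, *terms):
    """Join already-formatted terms with AND, OR, or ANDNOT."""
    if operator not in OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


def encode(query):
    """URL-encode a query for the search_query parameter."""
    # Spaces become '+', double quotes become %22, ':' is left as-is.
    return quote_plus(query, safe=":")


if __name__ == "__main__":
    query = combine("AND", field("au", "hinton"), field("cat", "cs.LG"))
    print("search_query=" + encode(query))
```

Keeping composition and encoding separate means the readable form ("au:hinton AND cat:cs.LG") can be logged or tested before it is turned into a URL fragment.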

Sort and Pagination

| Parameter | Options |
| --- | --- |
| sortBy | relevance, lastUpdatedDate, submittedDate |
| sortOrder | ascending, descending |
| start | Result offset (0-based) |
| max_results | Number of results (default 10, max 30000 overall, fetched in slices of at most 2000 per request) |

# Latest 10 papers in cs.AI
curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.AI&sortBy=submittedDate&sortOrder=descending&max_results=10"
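
Larger result sets are read by advancing start in steps of max_results. The following sketch only computes the (start, max_results) pairs for each request; the function name page_offsets and the page size of 100 are assumptions chosen for illustration. Each pair would be sent as one request, with a pause between requests that respects the limits in the Rate Limits section.

```python
def page_offsets(total, page_size=100):
    """Yield (start, max_results) pairs that cover `total` results.

    Hypothetical helper: `start` is 0-based, as in the parameter table.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, total, page_size):
        # The last page may be shorter than page_size.
        yield start, min(page_size, total - start)


if __name__ == "__main__":
    for start, size in page_offsets(250):
        print(f"start={start}&max_results={size}")
```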

Fetching Specific Papers

# By arXiv ID
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300"

# Multiple papers
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300,2401.12345,2403.00001"

BibTeX Generation

After fetching metadata for a paper, generate a BibTeX entry:

curl -s "https://export.arxiv.org/api/query?id_list=1706.03762" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
root = ET.parse(sys.stdin).getroot()
entry = root.find('a:entry', ns)
if entry is None: sys.exit('Paper not found')
title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
authors = ' and '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
year = entry.find('a:published', ns).text[:4]
raw_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
cat = entry.find('arxiv:primary_category', ns)
primary = cat.get('term') if cat is not None else 'cs.LG'
last_name = entry.find('a:author', ns).find('a:name', ns).text.split()[-1]
print(f'@article{{{last_name}{year}_{raw_id.replace(\".\", \"\")},')
print(f'  title     = {{{title}}},')
print(f'  author    = {{{authors}}},')
print(f'  year      = {{{year}}},')
print(f'  eprint    = {{{raw_id}}},')
print(f'  archivePrefix = {{arXiv}},')
print(f'  primaryClass  = {{{primary}}},')
print(f'  url       = {{https://arxiv.org/abs/{raw_id}}}')
print('}')
"

Reading Paper Content

After finding a paper, read it:

# Abstract page (fast, metadata + abstract)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper (PDF → markdown via Firecrawl)
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

For local PDF processing, see the ocr-and-documents skill.

Common Categories

| Category | Field |
| --- | --- |
| cs.AI | Artificial Intelligence |
| cs.CL | Computation and Language (NLP) |
| cs.CV | Computer Vision |
| cs.LG | Machine Learning |
| cs.CR | Cryptography and Security |
| stat.ML | Machine Learning (Statistics) |
| math.OC | Optimization and Control |
| physics.comp-ph | Computational Physics |

Full list: https://arxiv.org/category_taxonomy

Helper Script

The scripts/search_arxiv.py script handles XML parsing and provides clean output:

python scripts/search_arxiv.py "GRPO reinforcement learning"
python scripts/search_arxiv.py "transformer attention" --max 10 --sort date
python scripts/search_arxiv.py --author "Yann LeCun" --max 5
python scripts/search_arxiv.py --category cs.AI --sort date
python scripts/search_arxiv.py --id 2402.03300
python scripts/search_arxiv.py --id 2402.03300,2401.12345

No dependencies — uses only Python stdlib.


Semantic Scholar (Citations, Related Papers, Author Profiles)

arXiv doesn't provide citation data or recommendations. Use the Semantic Scholar API for that — free, no key needed for basic use (1 req/sec), returns JSON.

Get paper details + citations

# By arXiv ID
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300?fields=title,authors,citationCount,referenceCount,influentialCitationCount,year,abstract" | python3 -m json.tool

# By Semantic Scholar paper ID or DOI
curl -s "https://api.semanticscholar.org/graph/v1/paper/DOI:10.1234/example?fields=title,citationCount"

Get citations OF a paper (who cited it)

curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/citations?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool

Get references FROM a paper (what it cites)

curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/references?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool

Search papers (alternative to arXiv search, returns JSON)

curl -s "https://api.semanticscholar.org/graph/v1/paper/search?query=GRPO+reinforcement+learning&limit=5&fields=title,authors,year,citationCount,externalIds" | python3 -m json.tool

Get richer paper metadata (practical alternative to the deprecated recommendations endpoint)

curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300?fields=title,authors,year,referenceCount,citationCount,influentialCitationCount" | python3 -m json.tool

Author profile

curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=Yann+LeCun&fields=name,hIndex,citationCount,paperCount" | python3 -m json.tool

Useful Semantic Scholar fields

title, authors, year, abstract, citationCount, referenceCount, influentialCitationCount, isOpenAccess, openAccessPdf, fieldsOfStudy, publicationVenue, externalIds (contains arXiv ID, DOI, etc.)
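
Every Semantic Scholar request above follows the same pattern: a paper path, an optional /citations or /references suffix, and a comma-separated fields parameter. A small sketch of that pattern follows. The helper name s2_paper_url is an assumption, and it only covers new-style arXiv IDs such as 2402.03300.

```python
from urllib.parse import urlencode

S2_GRAPH = "https://api.semanticscholar.org/graph/v1"


def s2_paper_url(arxiv_id, fields, relation=None, limit=None):
    """Build a Semantic Scholar Graph API URL for an arXiv paper.

    Hypothetical helper. `relation` may be None, 'citations',
    or 'references', matching the curl examples in this section.
    """
    if relation not in (None, "citations", "references"):
        raise ValueError(f"unsupported relation: {relation}")
    url = f"{S2_GRAPH}/paper/arXiv:{arxiv_id}"
    if relation:
        url += f"/{relation}"
    params = {"fields": ",".join(fields)}
    if limit is not None:
        params["limit"] = limit
    # safe=',' keeps the field list readable instead of %2C-escaped.
    return f"{url}?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    print(s2_paper_url("2402.03300", ["title", "citationCount"]))
```

Requesting only the fields that are needed keeps responses small, which matters when working within the unauthenticated rate limit.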


Complete Research Workflow

  1. Discover:
    python scripts/search_arxiv.py "your topic" --sort date --max 10
  2. Assess impact:
    curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID?fields=citationCount,influentialCitationCount"
  3. Read abstract:
    web_extract(urls=["https://arxiv.org/abs/ID"])
  4. Read full paper:
    web_extract(urls=["https://arxiv.org/pdf/ID"])
  5. Find related work:
    curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID/references?fields=title,citationCount&limit=20"
  6. Get recommendations: POST to Semantic Scholar recommendations endpoint
  7. Track authors:
    curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=NAME"

Rate Limits

| API | Rate | Auth |
| --- | --- | --- |
| arXiv | ~1 req / 3 seconds | None needed |
| Semantic Scholar | 1 req / second | None (100/sec with API key) |
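
When several requests run in a loop, these limits are easiest to respect with a small throttle that enforces a minimum gap between calls. The class below is a sketch under that assumption; MinIntervalLimiter is a hypothetical name, and the 3.0 and 1.0 second intervals simply mirror the table above. The clock and sleep arguments are injectable so the behaviour can be checked without real waiting.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum number of seconds between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep if the previous call was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Intervals taken from the rate limit table (assumed, unauthenticated use).
arxiv_limiter = MinIntervalLimiter(3.0)
semantic_scholar_limiter = MinIntervalLimiter(1.0)
```

Calling arxiv_limiter.wait() immediately before each arXiv request keeps a paging loop at roughly one request every three seconds.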

Notes

  • arXiv returns Atom XML — use the helper script or parsing snippet for clean output
  • Semantic Scholar returns JSON; pipe through python3 -m json.tool for readability
  • arXiv IDs: old format (hep-th/0601001) vs new (2402.03300)
  • PDF: https://arxiv.org/pdf/{id}; Abstract: https://arxiv.org/abs/{id}
  • HTML (when available): https://arxiv.org/html/{id}
  • For local PDF processing, see the ocr-and-documents skill

ID Versioning

  • arxiv.org/abs/1706.03762 always resolves to the latest version
  • arxiv.org/abs/1706.03762v1 points to a specific immutable version
  • When generating citations, preserve the version suffix you actually read to prevent citation drift (a later version may substantially change content)
  • The API <id> field returns the versioned URL (e.g., http://arxiv.org/abs/1706.03762v7)
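
To keep the version that was actually read, the versioned value from the <id> field can be split into a base ID and a version number. The sketch below shows one way to do that; split_arxiv_id is a hypothetical helper name, and the regular expression assumes the version suffix is always a trailing "v" followed by digits.

```python
import re

# Lazy base ID followed by an optional trailing version suffix (v1, v2, ...).
_ID_PATTERN = re.compile(r"(.+?)(?:v(\d+))?")


def split_arxiv_id(id_or_url):
    """Return (base_id, version) from an arXiv ID or abs URL.

    `version` is an int, or None when no suffix is present.
    """
    ident = id_or_url.strip()
    if "/abs/" in ident:
        # Drop the URL prefix, e.g. http://arxiv.org/abs/
        ident = ident.split("/abs/", 1)[1]
    match = _ID_PATTERN.fullmatch(ident)
    if match is None:
        raise ValueError(f"not an arXiv identifier: {id_or_url!r}")
    base, version = match.groups()
    return base, int(version) if version is not None else None


if __name__ == "__main__":
    print(split_arxiv_id("http://arxiv.org/abs/1706.03762v7"))
```

Storing the pair rather than the bare ID makes it possible to cite the exact version while still linking to the unversioned URL for the latest one.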

Withdrawn Papers

Papers can be withdrawn after submission. When this happens:

  • The <summary> field contains a withdrawal notice (look for "withdrawn" or "retracted")
  • Metadata fields may be incomplete
  • Always check the summary before treating a result as a valid paper
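
The summary check above can be applied while parsing a feed. The following is a rough sketch, assuming a plain keyword match is good enough; the names looks_withdrawn and usable_entries are hypothetical. A keyword match will also flag legitimate papers that merely discuss retractions, so flagged entries are better reviewed than silently discarded.

```python
import xml.etree.ElementTree as ET

ATOM = {"a": "http://www.w3.org/2005/Atom"}
WITHDRAWAL_MARKERS = ("withdrawn", "retracted")


def looks_withdrawn(summary):
    """Heuristic: does the summary text mention a withdrawal or retraction?"""
    text = (summary or "").lower()
    return any(marker in text for marker in WITHDRAWAL_MARKERS)


def usable_entries(atom_xml):
    """Return the <entry> elements whose summary does not look withdrawn."""
    root = ET.fromstring(atom_xml)
    kept = []
    for entry in root.findall("a:entry", ATOM):
        summary = entry.findtext("a:summary", default="", namespaces=ATOM)
        if not looks_withdrawn(summary):
            kept.append(entry)
    return kept
```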