Awesome-Agent-Skills-for-Empirical-Research software-heritage-api
Archive and retrieve source code history via Software Heritage API
install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/cs/software-heritage-api" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-software-heritage && rm -rf "$T"
manifest:
skills/43-wentorai-research-plugins/skills/domains/cs/software-heritage-api/SKILL.md
Software Heritage API
Overview
Software Heritage is a universal archive of software source code, preserving 18B+ source files from 280M+ software origins (GitHub, GitLab, Bitbucket, CRAN, PyPI, Debian, etc.). Each archived artifact receives a persistent SWHID (SoftWare Heritage IDentifier) that can be cited reliably in research. The API supports origin search, provenance tracking, and historical analysis. Read access is free and requires no authentication, subject to the rate limits listed under Rate Limits.
API Endpoints
Base URL
https://archive.softwareheritage.org/api/1
Search Origins (Repositories)
    # Search for repositories
    curl "https://archive.softwareheritage.org/api/1/origin/search/scikit-learn/?limit=10"

    # Get origin metadata
    curl "https://archive.softwareheritage.org/api/1/origin/https://github.com/scikit-learn/scikit-learn/get/"

    # List visits (snapshots) of an origin
    curl "https://archive.softwareheritage.org/api/1/origin/https://github.com/scikit-learn/scikit-learn/visits/"
Retrieve Content
    # Get metadata for a specific file by its SHA1 hash
    # (the hash prefix may be sha1, sha1_git, sha256, or blake2s256)
    curl "https://archive.softwareheritage.org/api/1/content/sha1:adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/"

    # Get raw file content
    curl "https://archive.softwareheritage.org/api/1/content/sha1:adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/raw/"

    # Get directory listing ({sha1_git} is a placeholder for the directory identifier)
    curl "https://archive.softwareheritage.org/api/1/directory/{sha1_git}/"
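The content endpoints above accept more than one hash type, and the plain `sha1` of a file differs from its `sha1_git` (the Git blob hash). The following is a minimal sketch of how both could be computed locally with `hashlib` before querying the archive. The function names are illustrative and not part of any Software Heritage client library.

```python
import hashlib

BASE_URL = "https://archive.softwareheritage.org/api/1"


def sha1_hex(data: bytes) -> str:
    """Plain SHA-1 of the raw bytes (the `sha1:` form)."""
    return hashlib.sha1(data).hexdigest()


def sha1_git_hex(data: bytes) -> str:
    """Git blob hash: SHA-1 over a 'blob <size>' header, a NUL byte, then the bytes."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def content_url(data: bytes, hash_type: str = "sha1") -> str:
    """Build the content lookup URL for local bytes (illustrative helper)."""
    if hash_type == "sha1":
        digest = sha1_hex(data)
    elif hash_type == "sha1_git":
        digest = sha1_git_hex(data)
    else:
        raise ValueError(f"unsupported hash type in this sketch: {hash_type}")
    return f"{BASE_URL}/content/{hash_type}:{digest}/"


if __name__ == "__main__":
    sample = b"print('hello')\n"
    print(content_url(sample, "sha1"))
    print(content_url(sample, "sha1_git"))
```

A lookup for a file that is not in the archive is expected to fail, so callers would normally check the HTTP status before using the result.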
Resolve SWHIDs
    # Resolve a SWHID to its object
    # (the hash in a SWHID is a sha1_git value, not a plain sha1)
    curl "https://archive.softwareheritage.org/api/1/resolve/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2/"

    # Get snapshot
    curl "https://archive.softwareheritage.org/api/1/snapshot/{sha1_git}/"

    # Get revision (commit)
    curl "https://archive.softwareheritage.org/api/1/revision/{sha1_git}/"
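As an alternative to calling the resolve endpoint, the object type in the third field of a SWHID can be mapped directly to the matching object endpoint. A minimal sketch follows; `swhid_to_api_url` and the `_ENDPOINTS` table are hypothetical names introduced here, and the mapping assumes the content endpoint accepts a `sha1_git:` prefix.

```python
BASE_URL = "https://archive.softwareheritage.org/api/1"

# SWHID object type -> API path template (assumed mapping for this sketch).
_ENDPOINTS = {
    "cnt": "content/sha1_git:{hash}/",
    "dir": "directory/{hash}/",
    "rev": "revision/{hash}/",
    "rel": "release/{hash}/",
    "snp": "snapshot/{hash}/",
}


def swhid_to_api_url(swhid: str) -> str:
    """Translate a SWHID into the URL of the corresponding API object."""
    core = swhid.split(";", 1)[0]  # drop any qualifiers after the first ';'
    parts = core.split(":")
    if len(parts) != 4 or parts[0] != "swh" or parts[1] != "1":
        raise ValueError(f"not a version 1 SWHID: {swhid!r}")
    obj_type, obj_hash = parts[2], parts[3]
    if obj_type not in _ENDPOINTS:
        raise ValueError(f"unknown SWHID object type: {obj_type!r}")
    return f"{BASE_URL}/{_ENDPOINTS[obj_type].format(hash=obj_hash)}"


if __name__ == "__main__":
    print(swhid_to_api_url("swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d"))
```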
Save Code Now
    # Request archival of a repository
    curl -X POST "https://archive.softwareheritage.org/api/1/origin/save/git/url/https://github.com/user/repo/"
SWHID Format
SoftWare Heritage persistent IDentifiers:
    swh:1:{type}:{hash}

    Types:
      cnt → Content (file)
      dir → Directory
      rev → Revision (commit)
      rel → Release (tag)
      snp → Snapshot

    Examples:
      swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2  (file)
      swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d  (commit)
      swh:1:snp:c7c108084bc0bf3d81436bf980b46e98571c7b17  (snapshot)

    With qualifiers:
      swh:1:cnt:{hash};origin=https://github.com/user/repo;visit=swh:1:snp:{hash};path=/src/main.py
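The format above can be checked mechanically. The following is a small sketch of a parser that splits a SWHID into its object type, hash, and qualifiers. The `Swhid` dataclass and `parse_swhid` function are hypothetical names, and the sketch assumes a 40-character lowercase hex hash and does not validate qualifier names or percent-decoding.

```python
import re
from dataclasses import dataclass, field

# Core identifier: swh:1:<type>:<40 lowercase hex characters>
_CORE_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})$")


@dataclass
class Swhid:
    object_type: str
    object_hash: str
    qualifiers: dict = field(default_factory=dict)


def parse_swhid(text: str) -> Swhid:
    """Split a SWHID into its core parts and its ';key=value' qualifiers."""
    core, *raw_qualifiers = text.split(";")
    match = _CORE_RE.match(core)
    if match is None:
        raise ValueError(f"invalid core SWHID: {core!r}")
    qualifiers = {}
    for item in raw_qualifiers:
        key, sep, value = item.partition("=")  # split on the first '=' only
        if not key or not sep:
            raise ValueError(f"invalid qualifier: {item!r}")
        qualifiers[key] = value
    return Swhid(match.group(1), match.group(2), qualifiers)


if __name__ == "__main__":
    parsed = parse_swhid(
        "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
        ";origin=https://github.com/user/repo;path=/src/main.py"
    )
    print(parsed.object_type, parsed.qualifiers)
```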
Python Usage
    import requests

    BASE_URL = "https://archive.softwareheritage.org/api/1"


    def search_origins(query: str, limit: int = 10) -> list:
        """Search Software Heritage for archived repositories."""
        resp = requests.get(
            f"{BASE_URL}/origin/search/{query}/",
            params={"limit": limit},
            timeout=30,
        )
        resp.raise_for_status()
        return [
            {
                "url": o.get("url"),
                "has_visits": o.get("has_visits"),
            }
            for o in resp.json()
        ]


    def get_origin_visits(origin_url: str) -> list:
        """Get archival visits (snapshots) for a repository."""
        resp = requests.get(
            f"{BASE_URL}/origin/{origin_url}/visits/",
            timeout=30,
        )
        resp.raise_for_status()
        return [
            {
                "date": v.get("date"),
                "status": v.get("status"),
                "snapshot": v.get("snapshot"),
                "type": v.get("type"),
            }
            for v in resp.json()
        ]


    def get_directory(sha1_git: str) -> list:
        """List entries in an archived directory."""
        resp = requests.get(
            f"{BASE_URL}/directory/{sha1_git}/",
            timeout=30,
        )
        resp.raise_for_status()
        return [
            {
                "name": entry.get("name"),
                "type": entry.get("type"),
                "target": entry.get("target"),
            }
            for entry in resp.json()
        ]


    def save_code_now(repo_url: str) -> dict:
        """Request Software Heritage to archive a Git repository."""
        resp = requests.post(
            f"{BASE_URL}/origin/save/git/url/{repo_url}/",
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()


    # Example: find archived ML frameworks
    origins = search_origins("pytorch", limit=5)
    for o in origins:
        print(f"Archived: {o['url']}")

    # Example: get snapshot history
    visits = get_origin_visits("https://github.com/pytorch/pytorch")
    for v in visits[:5]:
        # Guard against a missing date before slicing out YYYY-MM-DD
        print(f"[{(v['date'] or '')[:10]}] {v['status']} - {v['type']}")

    # Example: request archival
    # result = save_code_now("https://github.com/my-org/my-research-code")
    # print(f"Save request: {result['save_request_status']}")
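A single call to the visits endpoint may return only part of an origin's history. The following is a sketch of how further pages could be followed, assuming the API advertises the next page in an HTTP `Link` header with `rel="next"` (which `requests` exposes as `resp.links`) and accepts a `per_page` parameter. `iter_visits` and `max_pages` are hypothetical names, and the session is passed in so that a `requests.Session()` or a test double can be used.

```python
BASE_URL = "https://archive.softwareheritage.org/api/1"


def iter_visits(session, origin_url: str, per_page: int = 10, max_pages: int = 5):
    """Yield visit records page by page, following rel="next" links.

    `session` is any object with a requests-style `get` method, for
    example `requests.Session()`. `max_pages` caps the number of requests
    so a long history does not exhaust the hourly rate limit.
    """
    url = f"{BASE_URL}/origin/{origin_url}/visits/"
    params = {"per_page": per_page}  # assumed query parameter
    for _ in range(max_pages):
        resp = session.get(url, params=params, timeout=30)
        resp.raise_for_status()
        yield from resp.json()
        next_link = resp.links.get("next")
        if not next_link:
            return
        # The next-page URL already carries its own query string.
        url, params = next_link["url"], None


# Usage (requires network access and the `requests` package):
#   import requests
#   for visit in iter_visits(requests.Session(), "https://github.com/pytorch/pytorch"):
#       print(visit.get("date"), visit.get("status"))
```

Capping the page count is a deliberate choice here: each page costs one request against the rate limit.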
Use Cases
- Code citation: Cite specific code versions in papers using SWHIDs
- Reproducibility: Archive the exact code used in experiments
- Provenance tracking: Trace code evolution and authorship
- Software archaeology: Study historical codebases
- Compliance: Ensure open-source license compliance through archival
Rate Limits
- Unauthenticated: 120 requests/hour
- Authenticated (free token): 1200 requests/hour
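These quotas translate into a minimum spacing between requests: 3600 / 120 = 30 seconds unauthenticated, and 3600 / 1200 = 3 seconds with a token. The following is a hedged sketch of client-side pacing. It assumes the API reports quota state in `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers, with the reset given as a Unix timestamp; both helper names are hypothetical.

```python
import time


def min_interval_seconds(requests_per_hour: int) -> float:
    """Average spacing between requests that stays within an hourly quota."""
    return 3600.0 / requests_per_hour


def seconds_until_allowed(headers, now=None) -> float:
    """Return how long to pause before the next request.

    Assumes `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers
    (reset as epoch seconds). Returns 0.0 when quota remains or when
    the headers are absent.
    """
    now = time.time() if now is None else now
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return 0.0
    if int(remaining) > 0:
        return 0.0
    return max(0.0, float(reset) - now)


if __name__ == "__main__":
    print(min_interval_seconds(120))   # unauthenticated spacing
    print(min_interval_seconds(1200))  # authenticated spacing
```

In a polling loop, a caller could sleep for `seconds_until_allowed(resp.headers)` after each response instead of using a fixed delay.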
References
- Software Heritage
- API Documentation
- SWHID Specification
- Di Cosmo, R. & Zacchiroli, S. (2017). "Software Heritage: Why and How to Preserve Software Source Code." iPRES 2017.