Awesome-Agent-Skills-for-Empirical-Research pangaea-data-api

Access earth and environmental science datasets via the PANGAEA API

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/geoscience/pangaea-data-api" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-pangaea-data-api && rm -rf "$T"
manifest: skills/43-wentorai-research-plugins/skills/domains/geoscience/pangaea-data-api/SKILL.md

PANGAEA Data Repository API

Overview

PANGAEA is a data repository for earth and environmental sciences that hosts 400K+ datasets with 20B+ data points. It archives research data from oceanography, paleoclimatology, geology, ecology, and atmospheric science. Each dataset has a DOI and is linked to the originating publication. The API provides search, metadata retrieval, and data download. It is free to use and requires no authentication.

API Endpoints

Search API

# Search datasets by keyword
curl "https://www.pangaea.de/advanced/search.php?q=ocean+temperature&count=20&type=json"

# Search with geographic bounding box
curl "https://www.pangaea.de/advanced/search.php?\
q=sediment+core&minlat=-60&maxlat=-30&minlon=-180&maxlon=180&type=json"

# Filter by parameter (measurement type)
curl "https://www.pangaea.de/advanced/search.php?\
q=carbon+dioxide&param=Atmospheric+CO2&type=json"

# Filter by date range
curl "https://www.pangaea.de/advanced/search.php?\
q=Arctic+ice&mindate=2020-01-01&maxdate=2026-12-31&type=json"
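
The query strings above can also be assembled programmatically. The following is a minimal sketch, assuming only the parameter names shown in the curl examples (`q`, `count`, `mindate`, `maxdate`, `param`, `type`). `build_search_url` is a hypothetical helper for illustration, not part of any PANGAEA client library.

```python
from typing import Optional
from urllib.parse import urlencode

SEARCH_URL = "https://www.pangaea.de/advanced/search.php"


def build_search_url(query: str, count: int = 20,
                     mindate: Optional[str] = None,
                     maxdate: Optional[str] = None,
                     param: Optional[str] = None) -> str:
    """Assemble a search URL (hypothetical helper, for illustration only)."""
    params = {"q": query, "count": count}
    # Include only the optional filters that were actually supplied.
    if mindate:
        params["mindate"] = mindate
    if maxdate:
        params["maxdate"] = maxdate
    if param:
        params["param"] = param
    params["type"] = "json"
    # urlencode escapes spaces as "+", matching the curl examples.
    return f"{SEARCH_URL}?{urlencode(params)}"


print(build_search_url("Arctic ice", mindate="2020-01-01", maxdate="2026-12-31"))
```

Building the URL through `urlencode` avoids hand-escaping spaces and special characters in queries and parameter names.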

Elasticsearch API

# Full-text search via Elasticsearch
curl -X POST "https://ws.pangaea.de/es/pangaea/panmd/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "bool": {
        "must": [
          {"match": {"citation.title": "ocean temperature"}}
        ],
        "filter": [
          {"range": {"citation.year": {"gte": 2020}}}
        ]
      }
    },
    "size": 20
  }'
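
The same request body can be built as a Python dictionary, which is easier to parameterize than an inline JSON string. This is a sketch that reuses the field names from the curl example (`citation.title`, `citation.year`); `build_es_query` is a hypothetical helper, and the index mapping is assumed to match that example.

```python
import json
from typing import Optional


def build_es_query(title: str, min_year: Optional[int] = None,
                   size: int = 20) -> dict:
    """Build a bool query body like the curl example (hypothetical helper)."""
    bool_query = {"must": [{"match": {"citation.title": title}}]}
    # Add the year filter only when a lower bound is requested.
    if min_year is not None:
        bool_query["filter"] = [
            {"range": {"citation.year": {"gte": min_year}}}
        ]
    return {"query": {"bool": bool_query}, "size": size}


body = build_es_query("ocean temperature", min_year=2020)
print(json.dumps(body, indent=2))
```

The resulting dictionary could then be sent with `requests.post(url, json=body, timeout=30)`, where `url` is the Elasticsearch endpoint shown above.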

Dataset Access

# Get dataset metadata
curl "https://doi.pangaea.de/10.1594/PANGAEA.123456?format=metainfo_json"

# Download dataset as tab-delimited text
curl "https://doi.pangaea.de/10.1594/PANGAEA.123456?format=textfile"

# Download as CSV
curl "https://doi.pangaea.de/10.1594/PANGAEA.123456?format=csv"
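
DOIs often arrive in different spellings (a bare `10.1594/...` string, a `doi:` prefix, or a full resolver URL). The sketch below normalizes them before building an access URL. It assumes only the three `format` values shown in the curl examples; `dataset_url` and `KNOWN_FORMATS` are hypothetical names introduced for illustration.

```python
DOI_RESOLVER = "https://doi.pangaea.de"
# Formats taken from the curl examples above; other values may exist.
KNOWN_FORMATS = {"metainfo_json", "textfile", "csv"}


def dataset_url(doi: str, fmt: str = "textfile") -> str:
    """Return the access URL for a DOI (hypothetical helper)."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    doi = doi.strip()
    # Reduce common DOI spellings to the bare "10.1594/..." form.
    for prefix in ("https://doi.pangaea.de/", "https://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return f"{DOI_RESOLVER}/{doi}?format={fmt}"


print(dataset_url("doi:10.1594/PANGAEA.123456", "csv"))
```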

OAI-PMH Harvesting

# List records
curl "https://ws.pangaea.de/oai/provider?verb=ListRecords&metadataPrefix=oai_dc"

# Get specific record
curl "https://ws.pangaea.de/oai/provider?verb=GetRecord&identifier=oai:pangaea.de:doi:10.1594/PANGAEA.123456&metadataPrefix=oai_dc"
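
OAI-PMH responses are XML, and long `ListRecords` results are split into pages linked by a `resumptionToken`. The sketch below parses one such response with the standard library. The namespaces come from the OAI-PMH 2.0 and Dublin Core specifications; `parse_list_records` is a hypothetical helper, and `SAMPLE` is a synthetic document with placeholder values, not real PANGAEA output.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Synthetic response with placeholder values, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:pangaea.de:doi:10.1594/PANGAEA.123456</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Placeholder title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>placeholder-token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text: str) -> Tuple[List[dict], Optional[str]]:
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
        })
    # An empty or missing token means there are no further pages.
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


records, token = parse_list_records(SAMPLE)
print(records, token)
```

Per the OAI-PMH specification, a follow-up request passes `verb=ListRecords&resumptionToken=<token>` and omits `metadataPrefix`, since `resumptionToken` is an exclusive argument.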

Query Parameters (Search API)

| Parameter | Description | Example |
|-----------|-------------|---------|
| q | Search query | q=coral+reef+bleaching |
| count | Results per page | count=50 |
| offset | Pagination offset | offset=20 |
| minlat/maxlat | Latitude bounds | -90 to 90 |
| minlon/maxlon | Longitude bounds | -180 to 180 |
| mindate/maxdate | Temporal filter | 2020-01-01 |
| param | Parameter/measurement | Temperature |
| topic | Topic filter | Atmosphere, Biosphere |
| type | Response format | json, xml |
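
The `count` and `offset` parameters combine to page through a large result set. The sketch below generates the parameter dictionary for each page. It assumes `offset` counts results from zero rather than pages, which is not confirmed here; `page_params` and the `total` argument are hypothetical and introduced for illustration.

```python
from typing import Iterator


def page_params(query: str, total: int, count: int = 50) -> Iterator[dict]:
    """Yield one parameter dict per page (hypothetical helper)."""
    if count <= 0:
        raise ValueError("count must be positive")
    # Assumption: offset is a zero-based result index, not a page number.
    for offset in range(0, total, count):
        yield {"q": query, "count": count, "offset": offset, "type": "json"}


pages = list(page_params("coral reef bleaching", total=120, count=50))
print([p["offset"] for p in pages])  # [0, 50, 100]
```

Each dictionary can be passed as `params=` to `requests.get` against the search URL.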

Python Usage

import requests
import pandas as pd
from io import StringIO
from typing import Optional

# Endpoints used by the helpers below.
SEARCH_URL = "https://www.pangaea.de/advanced/search.php"
ES_URL = "https://ws.pangaea.de/es/pangaea/panmd/_search"


def search_pangaea(query: str, count: int = 20,
                   bbox: Optional[dict] = None) -> list:
    """Search PANGAEA for earth science datasets."""
    params = {"q": query, "count": count, "type": "json"}
    if bbox:
        params.update({
            "minlat": bbox.get("south", -90),
            "maxlat": bbox.get("north", 90),
            "minlon": bbox.get("west", -180),
            "maxlon": bbox.get("east", 180),
        })

    resp = requests.get(SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    results = []
    for item in data.get("results", []):
        results.append({
            "doi": item.get("URI", ""),
            "title": item.get("citation", ""),
            "year": item.get("year"),
            "size": item.get("size"),
            "parameters": item.get("params", []),
            "score": item.get("score"),
        })
    return results


def download_dataset(doi: str) -> pd.DataFrame:
    """Download a PANGAEA dataset as a pandas DataFrame."""
    url = f"https://doi.pangaea.de/{doi}?format=textfile"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()

    lines = resp.text.split("\n")
    header_end = next(
        (i for i, line in enumerate(lines) if line.startswith("*/")),
        -1,
    )
    data_text = "\n".join(lines[header_end + 1:])
    return pd.read_csv(StringIO(data_text), sep="\t")


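def search_by_location(query: str, lat: float, lon: float,
                       radius_deg: float = 5.0) -> list:
    """Search datasets inside a square box of +/- radius_deg around a point."""
    # Clamp to valid coordinate ranges so points near the poles or the
    # antimeridian do not produce out-of-range bounds. Longitudes are
    # truncated at +/-180 rather than wrapped around.
    bbox = {
        "south": max(lat - radius_deg, -90.0),
        "north": min(lat + radius_deg, 90.0),
        "west": max(lon - radius_deg, -180.0),
        "east": min(lon + radius_deg, 180.0),
    }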
    return search_pangaea(query, bbox=bbox)


# Example: find ocean temperature datasets
datasets = search_pangaea("sea surface temperature", count=5)
for ds in datasets:
    print(f"[{ds['year']}] {ds['title'][:80]}...")
    print(f"  DOI: {ds['doi']} | Size: {ds['size']}")

# Example: download a specific dataset
# df = download_dataset("10.1594/PANGAEA.123456")
# print(df.head())

# Example: find Arctic research data
arctic = search_by_location("permafrost", lat=70, lon=25)
for ds in arctic[:3]:
    print(f"{ds['title'][:80]}...")
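
The header-stripping step in `download_dataset` can be checked offline. The sketch below applies the same logic to an in-memory string using only the standard library. `SAMPLE` is a synthetic file that assumes the `/* ... */` metadata comment implied by the `startswith("*/")` test above; its column names and values are placeholders, and `split_header` is a hypothetical helper.

```python
import csv
from io import StringIO

# Synthetic export with placeholder values, for illustration only.
SAMPLE = (
    "/* DATA DESCRIPTION:\n"
    "Citation:\tplaceholder citation\n"
    "*/\n"
    "Depth water [m]\tTemperature [deg C]\n"
    "0\t12.5\n"
    "10\t11.8\n"
)


def split_header(text: str):
    """Split a tab-delimited export into (header_comment, data_rows)."""
    lines = text.splitlines()
    # Index of the line closing the "/* ... */" comment, or -1 if absent.
    end = next(
        (i for i, line in enumerate(lines) if line.startswith("*/")),
        -1,
    )
    header = "\n".join(lines[:end + 1])
    data = "\n".join(lines[end + 1:])
    rows = list(csv.reader(StringIO(data), delimiter="\t"))
    return header, rows


header, rows = split_header(SAMPLE)
print(rows[0])   # ['Depth water [m]', 'Temperature [deg C]']
print(rows[1:])  # [['0', '12.5'], ['10', '11.8']]
```

When no closing `*/` line is found, `end` stays at -1, so the header is empty and every line is treated as data. This matches the fallback in `download_dataset`.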

Data Topics

| Topic | Coverage |
|-------|----------|
| Oceans | Temperature, salinity, currents, chemistry |
| Paleoclimate | Ice cores, sediment cores, tree rings |
| Atmosphere | CO2, aerosols, weather observations |
| Lithosphere | Geology, tectonics, geochemistry |
| Biosphere | Biodiversity, ecology, marine biology |
| Cryosphere | Sea ice, glaciers, permafrost |

References