Awesome-claude-skills: odp-data-ingest

Ingest data into the Ocean Data Platform (ODP) by HUB Ocean using the ODP Python SDK. Covers datasets, file uploads, tabular data, and spatial data.

Install

Source (clone the upstream repo):

git clone https://github.com/joevstaas/awesome-claude-skills

Claude Code (install into ~/.claude/skills/):

T=$(mktemp -d) && git clone --depth=1 https://github.com/joevstaas/awesome-claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/odp-data-ingest" ~/.claude/skills/joevstaas-awesome-claude-skills-odp-data-ingest && rm -rf "$T"

Manifest: skills/odp-data-ingest/SKILL.md

Source content

ODP Data Ingest Skill

Use this skill when the user wants to ingest, upload, or manage datasets in the Ocean Data Platform (ODP) by Hub Ocean.

Prerequisites

Python Dependencies

pip install odp-sdk pyarrow shapely python-dotenv
Package         Purpose
odp-sdk         ODP client library (authentication, catalog, dataset operations)
pyarrow         Define table schemas and serialize tabular data
shapely         Convert geometries between GeoJSON and WKT (for spatial data)
python-dotenv   Load environment variables (e.g., the API key) from a .env file

Authentication

The ODP SDK authenticates via API key:

from odp.client import Client

client = Client(api_key="your-api-key")

Store the key in an environment variable (ODP_API_KEY) and load it via python-dotenv or similar.
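
For example, a minimal loading sketch, assuming the key lives in a local .env file:

import os
from dotenv import load_dotenv
from odp.client import Client

load_dotenv()  # reads ODP_API_KEY from .env into the environment
client = Client(api_key=os.environ["ODP_API_KEY"])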

Core Concepts

Data Collections and Datasets

ODP organizes data hierarchically:

  • Data Collection — a logical grouping of related datasets (identified by UUID)
  • Dataset — a single data entity within a collection, containing files and/or tabular data

Data Storage Options

Each dataset can hold two types of data:

Type      Use case                                                           API
Files     Raw file storage (GeoJSON, CSV, images, etc.)                      ds.files.upload() / ds.files.download()
Tabular   Structured rows with a PyArrow schema; supports spatial queries    ds.table.create() / ds.insert()

You can use both in the same dataset (e.g., store the raw GeoJSON file and a queryable table).
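
For instance, a sketch that keeps both on one dataset handle (each call is covered in the workflow steps below; schema and rows are defined as in Step 3):

ds = client.dataset(dataset_id)

# Raw file copy, kept for provenance
with open("stations.geojson", "rb") as f:
    ds.files.upload("stations.geojson", f)

# Queryable table built from the same data
ds.table.create(schema)
with ds as tx:
    tx.insert(rows)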

Ingest Workflow

Step 1: Create a Dataset

Check if a dataset already exists by name, then create if needed:

import requests
from odp.catalog_v2 import get_dataset_meta_by_name

# Check for existing dataset
# Returns a DatasetMeta dataclass (not a dict) — access fields via .id, .name, .description
existing = get_dataset_meta_by_name(client, "My Dataset Name")

if not existing:
    # Create new dataset
    res = client._request(
        requests.Request(
            method="POST",
            url=client.base_url + "/api/catalog/v2/datasets",
            json={
                "name": "My Dataset Name",
                "description": "Description of the dataset",
            },
        ),
        retry=False,
    )
    res.raise_for_status()
    dataset_id = res.json()["id"]

    # Add to a data collection (collection_uid is the UUID of an existing data collection)
    res2 = client._request(
        requests.Request(
            method="POST",
            url=client.base_url + f"/api/catalog/v2/data-collections/{collection_uid}/datasets/{dataset_id}",
        ),
        retry=False,
    )
    res2.raise_for_status()
else:
    dataset_id = existing.id  # DatasetMeta is a dataclass, use attribute access

Step 2: Upload Raw Files

Upload any file to the dataset's file storage:

ds = client.dataset(dataset_id)

with open("data.geojson", "rb") as f:
    file_id = ds.files.upload("data.geojson", f)

Upload also accepts raw bytes:

ds.files.upload("hello.txt", b"Hello World!")

Note: ds.files.update_meta() has limited field support. The only supported fields are name and format; setting "description" will raise an error.

Step 3: Upload Tabular Data

Define a PyArrow schema and insert rows:

import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("name", pa.string(), nullable=True,
             metadata={"description": "Station name"}),
    pa.field("value", pa.float64(), nullable=True,
             metadata={"description": "Measured value"}),
    pa.field("geometry", pa.string(), nullable=True,
             metadata={"isGeometry": "1", "index": "1",
                        "description": "Point geometry in WKT format"}),
])

# Create the table (idempotent if schema matches)
ds.table.create(schema)

# Insert rows within a transaction
rows = [
    {"id": "uuid-1", "name": "Station A", "value": 12.5, "geometry": "POINT (10.7 59.9)"},
    {"id": "uuid-2", "name": "Station B", "value": 8.3, "geometry": "POINT (10.8 59.8)"},
]

with ds as tx:
    tx.insert(rows)

Step 4: Delete a Dataset (for re-ingestion)

res = client._request(
    requests.Request(
        method="DELETE",
        url=client.base_url + f"/api/catalog/v2/datasets/{dataset_id}",
    ),
    retry=False,
)
res.raise_for_status()

Column-Level Metadata

PyArrow field metadata can set column descriptions, classifications, and aggregation hints that appear in the ODP portal.

Supported Metadata Keys

KeyValuesEffect in ODP
isGeometry
"1"
Marks column as the geometry column
index
"1"
Creates a spatial index on the column
description
Any stringSets the column's Description in the portal
class
"geometry"
,
"latitude"
,
"longitude"
Sets the column's Classification dropdown
aggr
"sum"
,
"mean"
,
"min"
,
"max"
,
"count"
Aggregation hint for the column

Example: Schema with Column Descriptions

schema = pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("water_location_id", pa.int64(), nullable=True,
             metadata={"description": "Vannlokasjon-ID fra Vannmiljø"}),
    pa.field("station_name", pa.string(), nullable=True,
             metadata={"description": "Navn på målestasjonen"}),
    pa.field("value", pa.float64(), nullable=True,
             metadata={"description": "Målt verdi", "aggr": "mean"}),
    pa.field("geometry", pa.string(), nullable=True,
             metadata={
                 "isGeometry": "1",
                 "index": "1",
                 "description": "Punkt-geometri i WKT-format",
             }),
])

All metadata values must be strings. The metadata dict is passed to pa.field(..., metadata={...}) and propagated to ODP when ds.table.create(schema) is called.
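
For instance, PyArrow itself rejects non-string metadata values:

# OK: keys and values are all strings
pa.field("value", pa.float64(), metadata={"aggr": "mean", "index": "1"})

# Raises an error: metadata values must be strings, not ints
# pa.field("value", pa.float64(), metadata={"index": 1})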

Working with Spatial Data (GeoJSON → ODP)

Geometry Conversion

ODP tabular storage expects geometry in WKT (Well-Known Text) format. Convert from GeoJSON using Shapely:

from shapely.geometry import shape
from shapely import wkt

def geojson_geometry_to_wkt(geometry: dict) -> str | None:
    if not geometry:
        return None
    geom = shape(geometry)
    return wkt.dumps(geom)
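
For example, converting a point geometry taken from a feature's "geometry" field:

point = {"type": "Point", "coordinates": [10.7, 59.9]}
geojson_geometry_to_wkt(point)
# -> 'POINT (10.7... 59.9...)' (decimal precision depends on shapely's WKT writer)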

Geometry Column Metadata

Mark the geometry column with special PyArrow field metadata so ODP recognizes it as spatial:

pa.field(
    "geometry",
    pa.string(),
    nullable=True,
    metadata={"isGeometry": "1", "index": "1"}
)

Schema Inference from GeoJSON Features

When ingesting GeoJSON with varying properties, infer the schema dynamically:

def infer_pyarrow_type(value):
    if value is None:
        return pa.string()
    elif isinstance(value, bool):
        return pa.bool_()
    elif isinstance(value, int):
        return pa.int64()
    elif isinstance(value, float):
        return pa.float64()
    elif isinstance(value, (list, dict)):
        return pa.string()  # Serialize complex types as JSON strings
    else:
        return pa.string()

Scan all features, collect the observed types per property, and fall back to pa.string() when mixed types are detected.
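
A sketch of that scan, assuming features is a list of GeoJSON feature dicts and infer_pyarrow_type is the helper above:

def infer_schema(features):
    # Collect the set of observed PyArrow types per property name
    observed = {}
    for feat in features:
        for key, value in (feat.get("properties") or {}).items():
            if value is not None:  # None carries no type information
                observed.setdefault(key, set()).add(infer_pyarrow_type(value))
    fields = []
    for key, types in sorted(observed.items()):
        # One consistent type -> use it; mixed or never seen -> fall back to string
        dtype = types.pop() if len(types) == 1 else pa.string()
        fields.append(pa.field(key, dtype, nullable=True))
    return pa.schema(fields)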

Field Name Sanitization

ODP has restrictions on column names. Sanitize before creating the schema:

Pattern     Replacement    Reason
id          source_id      Avoids collision with the primary key column
_prefix     meta_prefix    Underscore-prefixed names not allowed
. and -     _              Special characters not allowed in column names
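
A sketch applying those rules (the function name is illustrative):

import re

def sanitize_column_name(name: str) -> str:
    # Special characters are not allowed in column names
    name = re.sub(r"[.\-]", "_", name)
    # Underscore-prefixed names are not allowed
    if name.startswith("_"):
        name = "meta" + name
    # Avoid colliding with the primary key column
    if name == "id":
        name = "source_id"
    return name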

Recommended Standard Columns

Add provenance columns to every row for traceability:

import uuid

row = {
    "id": str(uuid.uuid4()),        # Primary key
    "source_dataset": "dataset_key", # Which dataset this came from
    "source_name": "Display Name",   # Human-readable source
    "geometry": wkt_string,          # WKT geometry
    # ... remaining properties from the source data
}

Working with Non-Spatial Data

For structured data without geometry (e.g., action plans, reports, measurements):

  1. Define a fixed PyArrow schema manually (no inference needed)
  2. Omit the geometry column
  3. Use the same upload pattern: ds.table.create(schema), then insert rows in a transaction (with ds as tx: tx.insert(rows)); see the sketch below
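
For instance, a minimal non-spatial sketch, assuming ds is the dataset handle from Step 2 (the column names here are hypothetical):

import uuid
import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("report_title", pa.string(), nullable=True),
    pa.field("year", pa.int64(), nullable=True),
    pa.field("measured_value", pa.float64(), nullable=True,
             metadata={"aggr": "mean", "description": "Measured value"}),
])

ds.table.create(schema)
with ds as tx:
    tx.insert([{"id": str(uuid.uuid4()), "report_title": "Annual report",
                "year": 2025, "measured_value": 4.2}])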

Idempotent Ingestion Pattern

A robust ingest script should support:

--file <key>     # Ingest a single dataset
--all            # Ingest all datasets
--clean          # Delete existing and re-ingest
--list           # List available datasets

The clean/re-ingest pattern:

  1. Look up the existing dataset by name via get_dataset_meta_by_name()
  2. If found and --clean: delete it, then create fresh
  3. If found and not --clean: skip (already ingested)
  4. If not found: create new
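
A sketch of that decision logic; create_dataset and delete_dataset are hypothetical wrappers around the Step 1 POST and Step 4 DELETE calls:

def ensure_dataset(client, name: str, clean: bool = False) -> str | None:
    # Returns a dataset_id to ingest into, or None when ingestion should be skipped
    existing = get_dataset_meta_by_name(client, name)
    if existing and not clean:
        return None                          # already ingested: skip
    if existing and clean:
        delete_dataset(client, existing.id)  # DELETE /api/catalog/v2/datasets/{id}
    return create_dataset(client, name)      # POST /api/catalog/v2/datasets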

Dataset Metadata Management

The SDK does not have built-in methods for updating dataset metadata beyond name and description at creation time. Use the REST API directly via client._request() to update metadata after creation.

Metadata PATCH Endpoints

Each metadata facet has its own endpoint under /api/catalog/v2/datasets/{datasetId}/metadata/...:

Endpoint              Method   Payload
.../general           PATCH    {name: str, description: str, tags: str[]}
.../provider          PATCH    {provider_id: str} — use an existing provider UUID
.../license           PATCH    {license_enum: str} — e.g. "CC-BY-4.0"
.../citation          PATCH    {text: str, link: str}
.../additional-info   PATCH    Free-form JSON object (any keys)
.../constraints       PATCH    {constraints: [{text: str}]}
.../documentation     PATCH    {documentation: str[]}

Example: Update All Metadata

import requests

base = f"{client.base_url}/api/catalog/v2/datasets/{dataset_id}/metadata"

# General: name, description, tags
client._request(requests.Request(
    method="PATCH",
    url=f"{base}/general",
    json={
        "name": "My Dataset",
        "description": "A detailed description of the dataset.",
        "tags": ["ocean", "marine", "monitoring"],
    },
), retry=False).raise_for_status()

# Provider (use an existing provider UUID — find via GET /api/catalog/v2/providers or the portal)
client._request(requests.Request(
    method="PATCH",
    url=f"{base}/provider",
    json={"provider_id": "ec54f2cd-e56c-4ac0-8d29-82654090e658"},  # HUB Ocean
), retry=False).raise_for_status()

# License
client._request(requests.Request(
    method="PATCH",
    url=f"{base}/license",
    json={"license_enum": "CC-BY-4.0"},
), retry=False).raise_for_status()

# Citation
client._request(requests.Request(
    method="PATCH",
    url=f"{base}/citation",
    json={
        "text": "My Project (2026). Dataset Name. Ocean Data Platform.",
        "link": "https://github.com/org/repo",
    },
), retry=False).raise_for_status()

# Additional info (free-form — use for geographic coverage, update frequency, etc.)
client._request(requests.Request(
    method="PATCH",
    url=f"{base}/additional-info",
    json={
        "geographic_coverage": "Global — Atlantic, Pacific, Indian oceans",
        "temporal_coverage": "2025-01-01 to present",
        "update_frequency": "Daily (automated)",
    },
), retry=False).raise_for_status()

Important Notes on Metadata

  • Provider: the custom_provider field with {name, description, website, kind} exists, but the kind enum validation is strict and undocumented. Using an existing provider_id is more reliable.
  • Geographic coverage: there is no dedicated spatial-extent endpoint for datasets. The spatial bounds shown in the portal are auto-computed from the geometry column in the tabular data. Use additional-info for descriptive geographic coverage text.
  • Full PUT: PUT /api/catalog/v2/datasets/{datasetId} replaces all metadata at once but requires every field — prefer the granular PATCH endpoints.

Reading Data Back from ODP

Download a raw file

ds = client.dataset(dataset_id)
content = ds.files.download(file_id)

Query tabular data

Use the STAC API for spatial/temporal queries — see the odp-stac-api skill.

Tips and Gotchas

  • Table schema is immutable — once created, you cannot change column types. You can use ds.table.alter(new_schema), but this triggers a full data re-ingestion; for simple metadata changes, it's often easier to delete and recreate the dataset.
  • Transactions are required — always use with ds as tx: tx.insert(rows) for tabular inserts.
  • Large datasets — for datasets with many features (>10k rows), consider batching inserts (see the sketch after this list).
  • Mixed types — if a GeoJSON property has mixed types across features (e.g., sometimes int, sometimes string), fall back to pa.string() for that column.
  • Complex values — lists and dicts in GeoJSON properties should be serialized to JSON strings before insertion.
  • UUID primary keys — always generate UUIDs for the id column; do not reuse source IDs as the primary key.
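
A minimal batching sketch; the batch size is an arbitrary choice, and whether each batch should get its own transaction depends on the SDK's transaction semantics (this version commits per batch):

BATCH_SIZE = 1000  # arbitrary; tune to row size and memory

for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    with ds as tx:
        tx.insert(batch)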