Awesome-claude-skills odp-data-ingest
Ingest data into Ocean Data Platform (ODP) / HUB Ocean using the ODP Python SDK — covers datasets, file uploads, tabular data, and spatial data
git clone https://github.com/joevstaas/awesome-claude-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/joevstaas/awesome-claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/odp-data-ingest" ~/.claude/skills/joevstaas-awesome-claude-skills-odp-data-ingest && rm -rf "$T"
skills/odp-data-ingest/SKILL.md
ODP Data Ingest Skill
Use this skill when the user wants to ingest, upload, or manage datasets in the Ocean Data Platform (ODP) by Hub Ocean.
Prerequisites
Python Dependencies
```bash
pip install odp-sdk pyarrow shapely python-dotenv
```
| Package | Purpose |
|---|---|
| `odp-sdk` | ODP client library (authentication, catalog, dataset operations) |
| `pyarrow` | Define table schemas and serialize tabular data |
| `shapely` | Convert geometries between GeoJSON and WKT (for spatial data) |
Authentication
The ODP SDK authenticates via API key:
```python
from odp.client import Client

client = Client(api_key="your-api-key")
```
Store the key in an environment variable (`ODP_API_KEY`) and load it via `python-dotenv` or similar.
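For example, a minimal loading pattern (a sketch assuming a `.env` file containing `ODP_API_KEY=...` in the working directory):

```python
import os

from dotenv import load_dotenv
from odp.client import Client

load_dotenv()  # reads .env from the current directory into os.environ
client = Client(api_key=os.environ["ODP_API_KEY"])
```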
Core Concepts
Data Collections and Datasets
ODP organizes data hierarchically:
- Data Collection — a logical grouping of related datasets (identified by UUID)
- Dataset — a single data entity within a collection, containing files and/or tabular data
Data Storage Options
Each dataset can hold two types of data:
| Type | Use case | API |
|---|---|---|
| Files | Raw file storage (GeoJSON, CSV, images, etc.) | `ds.files.upload()` / `ds.files.download()` |
| Tabular | Structured rows with a PyArrow schema, supports spatial queries | `ds.table.create()` / `tx.insert()` |
You can use both in the same dataset (e.g., store the raw GeoJSON file and a queryable table).
Ingest Workflow
Step 1: Create a Dataset
Check if a dataset already exists by name, then create if needed:
```python
import requests

from odp.catalog_v2 import get_dataset_meta_by_name

# Check for existing dataset.
# Returns a DatasetMeta dataclass (not a dict) — access fields via .id, .name, .description
existing = get_dataset_meta_by_name(client, "My Dataset Name")

if not existing:
    # Create new dataset
    res = client._request(
        requests.Request(
            method="POST",
            url=client.base_url + "/api/catalog/v2/datasets",
            json={
                "name": "My Dataset Name",
                "description": "Description of the dataset",
            },
        ),
        retry=False,
    )
    res.raise_for_status()
    dataset_id = res.json()["id"]

    # Add to a data collection (collection_uid is the UUID of the target data collection)
    res2 = client._request(
        requests.Request(
            method="POST",
            url=client.base_url
            + f"/api/catalog/v2/data-collections/{collection_uid}/datasets/{dataset_id}",
        ),
        retry=False,
    )
    res2.raise_for_status()
else:
    dataset_id = existing.id  # DatasetMeta is a dataclass, use attribute access
```
Step 2: Upload Raw Files
Upload any file to the dataset's file storage:
```python
ds = client.dataset(dataset_id)

with open("data.geojson", "rb") as f:
    file_id = ds.files.upload("data.geojson", f)
```
Upload also accepts raw bytes:
ds.files.upload("hello.txt", b"Hello World!")
Note: `ds.files.update_meta()` has limited field support. Supported fields: `name`, `format`. Setting `"description"` will raise an error.
Step 3: Upload Tabular Data
Define a PyArrow schema and insert rows:
```python
import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("name", pa.string(), nullable=True, metadata={"description": "Station name"}),
    pa.field("value", pa.float64(), nullable=True, metadata={"description": "Measured value"}),
    pa.field(
        "geometry",
        pa.string(),
        nullable=True,
        metadata={"isGeometry": "1", "index": "1", "description": "Point geometry in WKT format"},
    ),
])

# Create the table (idempotent if schema matches)
ds.table.create(schema)

# Insert rows within a transaction
rows = [
    {"id": "uuid-1", "name": "Station A", "value": 12.5, "geometry": "POINT (10.7 59.9)"},
    {"id": "uuid-2", "name": "Station B", "value": 8.3, "geometry": "POINT (10.8 59.8)"},
]

with ds as tx:
    tx.insert(rows)
```
Step 4: Delete a Dataset (for re-ingestion)
```python
res = client._request(
    requests.Request(
        method="DELETE",
        url=client.base_url + f"/api/catalog/v2/datasets/{dataset_id}",
    ),
    retry=False,
)
res.raise_for_status()
```
Column-Level Metadata
PyArrow field metadata can set column descriptions, classifications, and aggregation hints that appear in the ODP portal.
Supported Metadata Keys
| Key | Values | Effect in ODP |
|---|---|---|
| `isGeometry` | `"1"` | Marks column as the geometry column |
| `index` | `"1"` | Creates a spatial index on the column |
| `description` | Any string | Sets the column's Description in the portal |
|  |  | Sets the column's Classification dropdown |
| `aggr` | e.g. `"mean"` | Aggregation hint for the column |
Example: Schema with Column Descriptions
```python
schema = pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("water_location_id", pa.int64(), nullable=True,
             metadata={"description": "Water location ID from Vannmiljø"}),
    pa.field("station_name", pa.string(), nullable=True,
             metadata={"description": "Name of the measurement station"}),
    pa.field("value", pa.float64(), nullable=True,
             metadata={"description": "Measured value", "aggr": "mean"}),
    pa.field("geometry", pa.string(), nullable=True, metadata={
        "isGeometry": "1",
        "index": "1",
        "description": "Point geometry in WKT format",
    }),
])
```
All metadata values must be strings. The metadata dict is passed to `pa.field(..., metadata={...})` and propagated to ODP when `ds.table.create(schema)` is called.
Working with Spatial Data (GeoJSON → ODP)
Geometry Conversion
ODP tabular storage expects geometry in WKT (Well-Known Text) format. Convert from GeoJSON using Shapely:
```python
from shapely.geometry import shape
from shapely import wkt


def geojson_geometry_to_wkt(geometry: dict) -> str | None:
    if not geometry:
        return None
    geom = shape(geometry)
    return wkt.dumps(geom)
```
Geometry Column Metadata
Mark the geometry column with special PyArrow field metadata so ODP recognizes it as spatial:
```python
pa.field(
    "geometry",
    pa.string(),
    nullable=True,
    metadata={"isGeometry": "1", "index": "1"},
)
```
Schema Inference from GeoJSON Features
When ingesting GeoJSON with varying properties, infer the schema dynamically:
```python
def infer_pyarrow_type(value):
    if value is None:
        return pa.string()
    elif isinstance(value, bool):
        return pa.bool_()
    elif isinstance(value, int):
        return pa.int64()
    elif isinstance(value, float):
        return pa.float64()
    elif isinstance(value, (list, dict)):
        return pa.string()  # Serialize complex types as JSON strings
    else:
        return pa.string()
```
Scan all features, collect types per property, and fall back to `pa.string()` when mixed types are detected.
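A possible sketch of that scan, building on `infer_pyarrow_type` above (the helper name, the added `id` and `geometry` columns, and the fallback behaviour are illustrative, not SDK-provided):

```python
def infer_schema_from_features(features: list[dict]) -> pa.Schema:
    """Scan all GeoJSON features and infer one PyArrow type per property."""
    types_per_prop: dict[str, set] = {}
    for feature in features:
        for key, value in (feature.get("properties") or {}).items():
            if value is None:
                continue  # None carries no type information
            types_per_prop.setdefault(key, set()).add(infer_pyarrow_type(value))

    fields = [pa.field("id", pa.string(), nullable=False)]  # UUID primary key
    for key, seen in sorted(types_per_prop.items()):
        # Mixed types across features: fall back to string
        dtype = seen.pop() if len(seen) == 1 else pa.string()
        fields.append(pa.field(key, dtype, nullable=True))

    # Geometry column in WKT, marked so ODP treats it as spatial
    fields.append(pa.field("geometry", pa.string(), nullable=True,
                           metadata={"isGeometry": "1", "index": "1"}))
    return pa.schema(fields)
```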
Field Name Sanitization
ODP has restrictions on column names. Sanitize before creating the schema:
| Pattern | Replacement | Reason |
|---|---|---|
| `id` |  | Avoids collision with the primary key column |
| Leading `_` |  | Underscore-prefixed names not allowed |
|  |  | Special characters not allowed in column names |
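A hedged sketch of such a sanitizer; the concrete rules below (suffixing a colliding `id`, stripping the leading underscore, replacing other disallowed characters with `_`) are assumptions based on the table above, not confirmed ODP behaviour:

```python
import re


def sanitize_field_name(name: str) -> str:
    """Make a source property name safe to use as an ODP column name (assumed rules)."""
    if name == "id":
        name = "source_id"  # assumed rename to avoid colliding with the primary key column
    name = name.lstrip("_")  # underscore-prefixed names are not allowed
    name = re.sub(r"[^0-9A-Za-z_]", "_", name)  # replace special characters with underscores
    return name
```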
Recommended Standard Columns
Add provenance columns to every row for traceability:
```python
row = {
    "id": str(uuid.uuid4()),          # Primary key
    "source_dataset": "dataset_key",  # Which dataset this came from
    "source_name": "Display Name",    # Human-readable source
    "geometry": wkt_string,           # WKT geometry
    # ... remaining properties from the source data
}
```
Working with Non-Spatial Data
For structured data without geometry (e.g., action plans, reports, measurements):
- Define a fixed PyArrow schema manually (no inference needed)
- Omit the geometry column
- Same upload pattern: `ds.table.create(schema)`, then `tx.insert(rows)` inside a `with ds as tx:` transaction (see the sketch below)
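A minimal non-spatial sketch, reusing `client` and `dataset_id` from Step 1 (the columns and values here are illustrative):

```python
import uuid

import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("report_year", pa.int64(), nullable=True,
             metadata={"description": "Year the report covers"}),
    pa.field("title", pa.string(), nullable=True,
             metadata={"description": "Report title"}),
])

ds = client.dataset(dataset_id)
ds.table.create(schema)

rows = [{"id": str(uuid.uuid4()), "report_year": 2024, "title": "Annual action plan"}]
with ds as tx:
    tx.insert(rows)
```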
Idempotent Ingestion Pattern
A robust ingest script should support:
```
--file <key>   # Ingest a single dataset
--all          # Ingest all datasets
--clean        # Delete existing and re-ingest
--list         # List available datasets
```
The clean/re-ingest pattern:
- Look up existing dataset by name via `get_dataset_meta_by_name()`
- If found and `--clean`: delete it, then create fresh
- If found and not `--clean`: skip (already ingested)
- If not found: create new (see the sketch below)
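Putting the pattern together, a sketch of the lookup/clean/create flow reusing the calls from Step 1 (the helper name `ensure_dataset` and its return convention are illustrative):

```python
import requests

from odp.catalog_v2 import get_dataset_meta_by_name


def ensure_dataset(client, name: str, description: str, clean: bool = False) -> str | None:
    """Return the dataset id to ingest into, or None if ingestion should be skipped."""
    existing = get_dataset_meta_by_name(client, name)

    if existing and not clean:
        return None  # already ingested, skip

    if existing and clean:
        # Delete the existing dataset before re-ingesting
        res = client._request(requests.Request(
            method="DELETE",
            url=client.base_url + f"/api/catalog/v2/datasets/{existing.id}",
        ), retry=False)
        res.raise_for_status()

    # Create a fresh dataset
    res = client._request(requests.Request(
        method="POST",
        url=client.base_url + "/api/catalog/v2/datasets",
        json={"name": name, "description": description},
    ), retry=False)
    res.raise_for_status()
    return res.json()["id"]
```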
Dataset Metadata Management
The SDK does not have built-in methods for updating dataset metadata beyond name and description at creation time. Use the REST API directly via `client._request()` to update metadata after creation.
Metadata PATCH Endpoints
Each metadata facet has its own endpoint at `/api/catalog/v2/datasets/{datasetId}/metadata/...`:
| Endpoint | Method | Payload |
|---|---|---|
| `/general` | PATCH | `{"name", "description", "tags"}` |
| `/provider` | PATCH | `{"provider_id"}` — use an existing provider UUID |
| `/license` | PATCH | `{"license_enum"}` — e.g. `"CC-BY-4.0"` |
| `/citation` | PATCH | `{"text", "link"}` |
| `/additional-info` | PATCH | Free-form JSON object (any keys) |
|  | PATCH |  |
|  | PATCH |  |
Example: Update All Metadata
```python
import requests

base = f"{client.base_url}/api/catalog/v2/datasets/{dataset_id}/metadata"

# General: name, description, tags
client._request(requests.Request(
    method="PATCH",
    url=f"{base}/general",
    json={
        "name": "My Dataset",
        "description": "A detailed description of the dataset.",
        "tags": ["ocean", "marine", "monitoring"],
    },
), retry=False).raise_for_status()

# Provider (use an existing provider UUID — find via GET /api/catalog/v2/providers or the portal)
client._request(requests.Request(
    method="PATCH",
    url=f"{base}/provider",
    json={"provider_id": "ec54f2cd-e56c-4ac0-8d29-82654090e658"},  # HUB Ocean
), retry=False).raise_for_status()

# License
client._request(requests.Request(
    method="PATCH",
    url=f"{base}/license",
    json={"license_enum": "CC-BY-4.0"},
), retry=False).raise_for_status()

# Citation
client._request(requests.Request(
    method="PATCH",
    url=f"{base}/citation",
    json={
        "text": "My Project (2026). Dataset Name. Ocean Data Platform.",
        "link": "https://github.com/org/repo",
    },
), retry=False).raise_for_status()

# Additional info (free-form — use for geographic coverage, update frequency, etc.)
client._request(requests.Request(
    method="PATCH",
    url=f"{base}/additional-info",
    json={
        "geographic_coverage": "Global — Atlantic, Pacific, Indian oceans",
        "temporal_coverage": "2025-01-01 to present",
        "update_frequency": "Daily (automated)",
    },
), retry=False).raise_for_status()
```
Important Notes on Metadata
- Provider: The `custom_provider` field with `{name, description, website, kind}` exists, but the `kind` enum validation is strict and undocumented. Using an existing `provider_id` is more reliable.
- Geographic coverage: There is no dedicated spatial extent endpoint for datasets. The spatial bounds shown in the portal are auto-computed from the geometry column in the tabular data. Use `additional-info` for descriptive geographic coverage text.
- Full PUT: `PUT /api/catalog/v2/datasets/{datasetId}` replaces all metadata at once but requires every field — prefer the granular PATCH endpoints.
Reading Data Back from ODP
Download a raw file
```python
ds = client.dataset(dataset_id)
content = ds.files.download(file_id)
```
Query tabular data
Use the STAC API for spatial/temporal queries — see the `odp-stac-api` skill.
Tips and Gotchas
- Table schema is immutable — once created, you cannot change column types. You can use `ds.table.alter(new_schema)`, but this triggers a full data re-ingestion. For simple metadata changes, it's often easier to delete and recreate the dataset.
- Transactions are required — always use `with ds as tx: tx.insert(rows)` for tabular inserts.
- Large datasets — for datasets with many features (>10k rows), consider batching inserts (a sketch follows this list).
- Mixed types — if a GeoJSON property has mixed types across features (e.g., sometimes `int`, sometimes `string`), fall back to `pa.string()` for that column.
- Complex values — lists and dicts in GeoJSON properties should be serialized to JSON strings before insertion.
- UUID primary keys — always generate UUIDs for the `id` column; do not reuse source IDs as the primary key.
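A sketch of batched inserts with complex values serialized to JSON first (the helper name and batch size are arbitrary):

```python
import json


def insert_in_batches(ds, rows: list[dict], batch_size: int = 1000) -> None:
    """Insert rows in chunks, serializing lists/dicts to JSON strings first."""
    def clean(row: dict) -> dict:
        return {k: json.dumps(v) if isinstance(v, (list, dict)) else v
                for k, v in row.items()}

    for start in range(0, len(rows), batch_size):
        batch = [clean(r) for r in rows[start:start + batch_size]]
        with ds as tx:  # each batch gets its own transaction
            tx.insert(batch)
```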