Claude-skill-registry gcs-data-catalog

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/gcs-data-catalog" ~/.claude/skills/majiayu000-claude-skill-registry-gcs-data-catalog && rm -rf "$T"
manifest: skills/data/gcs-data-catalog/SKILL.md

GCS Data Catalog - Master Index

This skill provides immediate access to Landbruget.dk's GCS data lake containing 18+ Danish agricultural datasets.

Quick Access

GCS Bucket: Set via the GCS_BUCKET environment variable (see .env).

Medallion Architecture:

  • bronze/ - Raw data exactly as received
  • silver/ - Cleaned, validated, standardized
  • gold/ - Analysis-ready, joined datasets

Setup Code

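import fnmatch
import io
import os

import pandas as pd
import pyarrow.parquet as pq
from google.cloud import storage

# Initialize GCS client
client = storage.Client()
bucket_name = os.environ['GCS_BUCKET']  # Set in .env; raises KeyError if unset
bucket = client.bucket(bucket_name)

# Read parquet from GCS
def read_gcs_parquet(gcs_path: str) -> pd.DataFrame:
    """Read a parquet file from a GCS path like 'silver/subsidies/*/data.parquet'.

    bucket.blob() does not expand wildcards, so a pattern containing '*' is
    matched against the listed object names. Assumption: when several
    snapshots match, the lexicographically last (newest timestamp) is read.
    """
    if '*' in gcs_path:
        prefix = gcs_path.split('*', 1)[0]
        names = [b.name for b in bucket.list_blobs(prefix=prefix)]
        matches = sorted(fnmatch.filter(names, gcs_path))
        if not matches:
            raise FileNotFoundError(f'No objects match {gcs_path!r}')
        gcs_path = matches[-1]
    buffer = io.BytesIO()
    bucket.blob(gcs_path).download_to_file(buffer)
    buffer.seek(0)
    return pq.read_table(buffer).to_pandas()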

Data Categories (Frontend-Aligned)

| Category | Danish Name | Skill Path | Key Join | Metrics |
|---|---|---|---|---|
| Finance | Økonomi | gcs-data-catalog/okonomi/ | cvr_number | 3 |
| Agricultural Land | Landbrugsareal | gcs-data-catalog/landbrugsareal/ | field_id, cvr_number | 4 |
| Environment | Miljø | gcs-data-catalog/miljo/ | geometry, field_id | 8 |
| Livestock | Husdyr | gcs-data-catalog/husdyr/ | chr_number | 6 |
| Employees | Medarbejdere | gcs-data-catalog/medarbejdere/ | cvr_number | 5 |

Key Identifiers

| Identifier | Format | Description | Validation |
|---|---|---|---|
| CVR | 8 digits | Company registration number | ^\d{8}$ |
| CHR | 6 digits | Central Husbandry Register (herd ID) | ^\d{6}$ |
| BFE | Variable | Cadastral parcel number | varies |
| field_id | String | Field identifier from FVM | varies |
| field_uuid | UUID | Unique field identifier | UUID format |
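
The validation patterns in the table can be applied before any join or lookup. A minimal sketch, assuming identifiers may arrive as integers or as strings with stray whitespace; the helper names `normalize_id`, `is_valid_cvr` and `is_valid_chr` are hypothetical and not part of the skill:

```python
import re

CVR_PATTERN = re.compile(r'^\d{8}$')  # from the table: 8 digits
CHR_PATTERN = re.compile(r'^\d{6}$')  # from the table: 6 digits

def normalize_id(value) -> str:
    """Turn an int or str identifier into a stripped string (hypothetical helper)."""
    return str(value).strip()

def is_valid_cvr(value) -> bool:
    """True if the value is exactly 8 digits after normalization."""
    return CVR_PATTERN.fullmatch(normalize_id(value)) is not None

def is_valid_chr(value) -> bool:
    """True if the value is exactly 6 digits after normalization."""
    return CHR_PATTERN.fullmatch(normalize_id(value)) is not None
```

Normalizing to strings first also avoids silent mismatches when one dataset stores CVR numbers as integers and another as text.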

Dataset Quick Reference

Økonomi (Finance)

| Dataset | Path | Rows | Key Columns |
|---|---|---|---|
| Subsidies | silver/subsidies/ | 554K | cvr_number, tilskudsberetigt |
| CVR Enrichment | gold/cvr_enrichment/*/ | varies | cvr_number, company data |
| Property Owners | silver/property_owners/ | 8.2M | CVRNummer, owner info |

Landbrugsareal (Agricultural Land)

| Dataset | Path | Rows | Key Columns |
|---|---|---|---|
| FVM Marker (fields) | silver/fvm_marker_{year}/ | 617K/year | field_id, cvr_number, crop_code, geometry |
| Field Production | gold/field_production_{year}/ | 617K/year | field_id, yield_estimate, crop_type |
| Agricultural Blocks | silver/agricultural_blocks_{year}/ | varies | block_id, geometry |
| Cadastral | silver/cadastral/ | 2.16M | bfe_number, geometry |
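
Several paths in this catalog carry a `{year}` placeholder. A small sketch of expanding one, assuming the placeholder is always spelled `{year}` and a four-digit year is wanted; `dataset_path` is a hypothetical helper:

```python
def dataset_path(template: str, year: int) -> str:
    """Fill the {year} placeholder in a catalog path template (hypothetical helper)."""
    if '{year}' not in template:
        raise ValueError(f'template has no {{year}} placeholder: {template!r}')
    if not 1000 <= year <= 9999:
        raise ValueError(f'expected a four-digit year, got {year!r}')
    return template.replace('{year}', str(year))

print(dataset_path('silver/fvm_marker_{year}/', 2024))  # silver/fvm_marker_2024/
```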

Miljø (Environment)

| Dataset | Path | Rows | Key Columns |
|---|---|---|---|
| Pesticide Disaggregation | gold/pesticide_disaggregation_{year}/ | 1.52M | cvr_number, PesticideName, DosageQuantity |
| NLES5 Nitrogen | gold/nles5_nitrogen_*/ | 500K | field_id, nitrogen_washout_kg_ha |
| BNBO Status | silver/bnbo_status/ | 5.4K | geometry, status_bnbo |
| Wetlands | silver/wetlands/ | 1.7M | geometry, toerv_pct |

Husdyr (Livestock)

| Dataset | Path | Rows | Key Columns |
|---|---|---|---|
| Svineflytning | silver/svineflytning/*/movements.parquet | 1.27M | sender_chr_number, receiver_chr_number, total_animals |
| CHR Movements | bronze/chr/*/chr_dyr_movement_summaries.parquet | 124K | reporting_herd_number, animal_count |
| Animal Welfare | silver/animal welfare/ | varies | chr_number |

Medarbejdere (Employees)

| Dataset | Path | Rows | Key Columns |
|---|---|---|---|
| Arbejdstilsynet | gold/arbejdstilsynet_inspections/ | 536 | cvr_number, decision, severity_score |
| Work Permits | silver/work permits/ | varies | cvr_number |
| Worker Safety | silver/worker safety/ | varies | cvr_number |

Common Queries

List Available Years for a Dataset

# -d lists the matching folders themselves instead of their contents
gsutil ls -d "gs://$GCS_BUCKET/silver/fvm_marker_*"
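
The listing can be post-processed in Python to get just the years. A sketch, assuming folder names follow the `fvm_marker_{year}` pattern from the dataset reference and that the names have already been listed (for example with `gsutil ls` or the storage client); `extract_years` is a hypothetical helper:

```python
import re

def extract_years(object_names, dataset='fvm_marker'):
    """Return the sorted, de-duplicated years found in names like 'silver/fvm_marker_2024/...'."""
    # Match '<dataset>_<4 digits>' as a whole path segment.
    pattern = re.compile(rf'(?:^|/){re.escape(dataset)}_(\d{{4}})(?:/|$)')
    years = set()
    for name in object_names:
        match = pattern.search(name)
        if match:
            years.add(int(match.group(1)))
    return sorted(years)
```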

Check Dataset Schema

import io
import os

import pyarrow.parquet as pq
from google.cloud import storage

client = storage.Client()
bucket_name = os.environ['GCS_BUCKET']  # raises KeyError if unset
bucket = client.bucket(bucket_name)

# Download one snapshot and read only its schema (no row data is decoded)
blob = bucket.blob('silver/subsidies/2025-01-10T00:00:26.377177/data.parquet')
buffer = io.BytesIO()
blob.download_to_file(buffer)
buffer.seek(0)
schema = pq.read_schema(buffer)
print(schema)
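
The path above hard-codes one snapshot folder. A sketch of picking the newest snapshot instead, assuming every snapshot folder is named with an ISO 8601 timestamp like the one shown; `latest_snapshot` is a hypothetical helper and works on object names that have already been listed:

```python
from datetime import datetime

def latest_snapshot(object_names, dataset_prefix):
    """Return the object name under dataset_prefix whose snapshot folder has the newest timestamp."""
    best_name, best_time = None, None
    for name in object_names:
        if not name.startswith(dataset_prefix):
            continue
        # The first path segment after the prefix is assumed to be the snapshot timestamp.
        folder = name[len(dataset_prefix):].split('/', 1)[0]
        try:
            stamp = datetime.fromisoformat(folder)
        except ValueError:
            continue  # not a timestamped folder; skip it
        if best_time is None or stamp > best_time:
            best_name, best_time = name, stamp
    return best_name
```

Parsing the timestamp rather than sorting strings keeps the choice correct even if folder names ever differ in precision.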

Query Specific CVR

df = read_gcs_parquet('silver/subsidies/2025-01-10T00:00:26.377177/data.parquet')
company_data = df[df['cvr_number'] == '31373077']

Cross-Dataset Joins

CVR-based joins (most common)

# Join subsidies with pesticides on CVR
subsidies = read_gcs_parquet('silver/subsidies/*/data.parquet')
pesticides = read_gcs_parquet('gold/pesticide_disaggregation_2024/*/data.parquet')
merged = subsidies.merge(pesticides, on='cvr_number', how='inner')
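
If both tables hold several rows per CVR (likely, given the row counts), the inner join above pairs every subsidy row with every pesticide row of the same company. A sketch of aggregating each side to one row per company first, on toy data; the values are made up, and `n_subsidy_rows` and `total_dosage` are hypothetical output names:

```python
import pandas as pd

# Toy stand-ins for the two datasets; real column sets will differ.
subsidies = pd.DataFrame({'cvr_number': ['11111111', '11111111', '22222222']})
pesticides = pd.DataFrame({
    'cvr_number': ['11111111', '11111111', '33333333'],
    'DosageQuantity': [1.5, 2.5, 4.0],
})

# Collapse each side to one row per company before joining.
subsidy_counts = (
    subsidies.groupby('cvr_number').size().reset_index(name='n_subsidy_rows')
)
dosage_totals = (
    pesticides.groupby('cvr_number', as_index=False)
    .agg(total_dosage=('DosageQuantity', 'sum'))
)

# validate= raises pandas.errors.MergeError if either side still has duplicate keys.
merged = subsidy_counts.merge(
    dosage_totals, on='cvr_number', how='inner', validate='one_to_one'
)
print(merged)
```

Only the company present in both toy tables survives the inner join, with one row instead of four.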

Field-based joins

# Join field production with nitrogen estimates
field_prod = read_gcs_parquet('gold/field_production_2024/*/data.parquet')
nitrogen = read_gcs_parquet('gold/nles5_nitrogen_2024/*/data.parquet')
merged = field_prod.merge(nitrogen, on=['field_id', 'cvr_number'], how='inner')

CHR-based joins

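# Join movements with animal welfare
movements = read_gcs_parquet('silver/svineflytning/*/movements.parquet')
welfare = read_gcs_parquet('silver/animal welfare/*/data.parquet')
# Join on sender or receiver CHR; here welfare records are attached to the sending herd
merged = movements.merge(welfare, left_on='sender_chr_number', right_on='chr_number', how='left')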
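
Because each movement references two herds, welfare data can be attached to either side. A sketch on toy data that joins both, assuming the welfare table has one row per `chr_number`; the `welfare_score` column and all values are made up for illustration:

```python
import pandas as pd

movements = pd.DataFrame({
    'sender_chr_number': ['100001', '100002'],
    'receiver_chr_number': ['100002', '100003'],
    'total_animals': [40, 25],
})
welfare = pd.DataFrame({
    'chr_number': ['100001', '100002'],
    'welfare_score': [0.9, 0.7],  # hypothetical column
})

def attach_welfare(frame, side):
    """Left-join welfare onto the given side ('sender' or 'receiver') with prefixed columns."""
    renamed = welfare.rename(columns={
        'chr_number': f'{side}_chr_number',
        'welfare_score': f'{side}_welfare_score',
    })
    # many_to_one: many movements per herd, at most one welfare row per herd.
    return frame.merge(renamed, on=f'{side}_chr_number', how='left', validate='many_to_one')

merged = attach_welfare(attach_welfare(movements, 'sender'), 'receiver')
print(merged)
```

Herds without a welfare record keep their movement row and get a missing value in the corresponding score column.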

Data Update Schedule

| Layer | Frequency | Notes |
|---|---|---|
| Bronze | Weekly (Mondays 2AM UTC) | Immutable, timestamped |
| Silver | After bronze update | Cleaned, validated |
| Gold | After silver update | Analysis-ready |
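
The weekly bronze schedule makes it possible to estimate when fresh data will start landing. A sketch that computes the next Monday 02:00 UTC after a given moment; `next_bronze_run` is a hypothetical helper, and the actual pipeline may start late or take time to finish:

```python
from datetime import datetime, timedelta, timezone

def next_bronze_run(now: datetime) -> datetime:
    """Return the first Monday 02:00 UTC strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    days_ahead = (0 - now.weekday()) % 7  # Monday is weekday 0
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=2, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate

# Friday 2025-01-10 12:00 UTC -> Monday 2025-01-13 02:00 UTC
print(next_bronze_run(datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)))
```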

Related Skills

  • okonomi/ - Financial data: subsidies, property values
  • landbrugsareal/ - Field and crop data: FVM marker, production
  • miljo/ - Environmental data: pesticides, nitrogen, BNBO
  • husdyr/ - Livestock data: CHR, movements, welfare
  • medarbejdere/ - Employee data: inspections, safety

Troubleshooting

Authentication

# Set up Application Default Credentials (used by the Python storage client)
gcloud auth application-default login
# Check GCS access
gsutil ls "gs://$GCS_BUCKET/"

Large Files

For datasets > 1GB, use DuckDB or chunked reading:

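import os

import duckdb

# "$GCS_BUCKET" is not expanded inside a Python string, so read it explicitly
bucket_name = os.environ['GCS_BUCKET']
# Query directly without loading the full dataset into memory.
# Reading gs:// paths typically requires DuckDB's httpfs extension and GCS credentials configured in DuckDB.
result = duckdb.query(f"""
    SELECT cvr_number, SUM(area_ha) AS total_area
    FROM 'gs://{bucket_name}/gold/field_production_2024/*/data.parquet'
    GROUP BY cvr_number
""").df()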

CRS Conversion

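All geometry is stored in EPSG:4326 (WGS84). To work in the Danish projected system (EPSG:25832, ETRS89 / UTM zone 32N), for example to measure areas or distances in metres, reproject:

import geopandas as gpd
# gdf is assumed to be an existing GeoDataFrame whose CRS is set to EPSG:4326
gdf = gdf.to_crs('EPSG:25832')  # Convert to ETRS89 / UTM zone 32N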