Claude-skill-registry data-lake-platform

Data lake and lakehouse platform patterns: ingestion/CDC, transformations, open table formats (Iceberg/Delta/Hudi), query and serving engines (Trino/ClickHouse/DuckDB), orchestration, governance/lineage, cost and operations. Self-hosted and cloud options.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-lake-platform" ~/.claude/skills/majiayu000-claude-skill-registry-data-lake-platform && rm -rf "$T"
manifest: skills/data/data-lake-platform/SKILL.md
source content

Data Lake Platform

Build and operate production data lakes and lakehouses: ingest, transform, store in open formats, and serve analytics reliably.

When to Use

  • Design data lake/lakehouse architecture
  • Set up ingestion pipelines (batch, incremental, CDC)
  • Build SQL transformation layers (SQLMesh, dbt)
  • Choose table formats and catalogs (Iceberg, Delta, Hudi)
  • Deploy query/serving engines (Trino, ClickHouse, DuckDB)
  • Implement streaming pipelines (Kafka, Flink)
  • Set up orchestration (Dagster, Airflow, Prefect)
  • Add governance, lineage, data quality, and cost controls

Triage Questions

  1. Batch, streaming, or hybrid? What is the freshness SLO?
  2. Append-only vs upserts/deletes (CDC)? Is time travel required?
  3. Primary query pattern: BI dashboards (high concurrency), ad-hoc joins, embedded analytics?
  4. PII/compliance: row/column-level access, retention, audit logging?
  5. Platform constraints: self-hosted vs cloud, preferred engines, team strengths?

Default Baseline (Good Starting Point)

  • Storage: object storage + open table format (usually Iceberg)
  • Catalog: REST/Hive/Glue/Nessie/Unity (match your platform); see the sketch after this list
  • Transforms: SQLMesh or dbt (pick one and standardize)
  • Lake query: Trino (or Spark for heavy compute/ML workloads)
  • Serving (optional): ClickHouse/StarRocks/Doris for low-latency BI
  • Governance: DataHub/OpenMetadata + OpenLineage
  • Orchestration: Dagster/Airflow/Prefect
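
A minimal sketch of the storage + catalog baseline using PyIceberg against a REST catalog. The endpoint, warehouse bucket, and table layout below are illustrative assumptions, not part of this skill:

from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, NestedField, StringType, TimestamptzType

# Connect to an Iceberg REST catalog; endpoint and warehouse bucket are hypothetical.
catalog = load_catalog(
    "lake",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",
        "warehouse": "s3://lake-warehouse",
    },
)

# Bronze-tier events table; Iceberg schemas carry explicit field IDs.
schema = Schema(
    NestedField(1, "event_id", StringType(), required=True),
    NestedField(2, "event_ts", TimestamptzType(), required=True),
    NestedField(3, "amount", DoubleType(), required=False),
)

catalog.create_namespace("bronze")
table = catalog.create_table("bronze.events", schema=schema)
print(table.location())

Glue and Hive catalogs are configured through the same load_catalog properties, so matching the catalog to your platform is a config change, not a code change.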

Workflow

  1. Pick table format + catalog: references/storage-formats.md (use assets/cross-platform/template-schema-evolution.md and assets/cross-platform/template-partitioning-strategy.md)
  2. Design ingestion (batch/incremental/CDC): references/ingestion-patterns.md (use assets/cross-platform/template-ingestion-governance-checklist.md and assets/cross-platform/template-incremental-loading.md)
  3. Design transformations (bronze/silver/gold or data products): references/transformation-patterns.md (use assets/cross-platform/template-data-pipeline.md)
  4. Choose lake query vs serving engines: references/query-engine-patterns.md
  5. Add governance, lineage, and quality gates: references/governance-catalog.md (use assets/cross-platform/template-data-quality-governance.md and assets/cross-platform/template-data-quality.md); see the quality-gate sketch after this list
  6. Plan operations + cost controls: references/operational-playbook.md and references/cost-optimization.md (use assets/cross-platform/template-data-quality-backfill-runbook.md and assets/cross-platform/template-cost-optimization.md)
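
For step 5, a hedged sketch of two quality gates (freshness and volume) run as a post-load check. The warehouse file, table name, and SLO thresholds are hypothetical, shown on DuckDB:

import duckdb
from datetime import datetime, timedelta

con = duckdb.connect("lake.duckdb")  # hypothetical local warehouse file

# Freshness gate: newest row must be within a 2-hour SLO (event_ts assumed UTC).
latest = con.execute("SELECT max(event_ts) FROM silver.events").fetchone()[0]
assert latest is not None, "freshness gate failed: table is empty"
assert datetime.utcnow() - latest < timedelta(hours=2), f"freshness SLO violated: latest={latest}"

# Volume gate: today's row count must be at least half the trailing 7-day daily average.
today, avg7 = con.execute("""
    SELECT
        count(*) FILTER (WHERE event_ts >= current_date),
        count(*) FILTER (WHERE event_ts >= current_date - 7) / 7.0
    FROM silver.events
""").fetchone()
assert today >= 0.5 * avg7, f"volume anomaly: today={today}, 7-day avg={avg7:.0f}"

print("quality gates passed")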

Architecture Patterns

  • Medallion (bronze/silver/gold): references/architecture-patterns.md
  • Data mesh (domain-owned data products): references/architecture-patterns.md
  • Streaming-first (Kappa): references/streaming-patterns.md
  • Diagrams/mermaid snippets: references/overview.md

Quick Start

dlt + ClickHouse

pip install "dlt[clickhouse]"
dlt init rest_api clickhouse
python pipeline.py
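
The init command scaffolds a REST API pipeline script. A minimal hand-written pipeline.py in the same spirit might look like the following; the API URL, cursor field, and primary key are hypothetical:

import dlt
from dlt.sources.helpers import requests  # dlt's retrying requests wrapper

@dlt.resource(primary_key="id", write_disposition="merge")
def orders(updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")):
    # Fetch only rows changed since the last run; dlt persists the cursor between runs.
    resp = requests.get(
        "https://api.example.com/orders",  # hypothetical source API
        params={"updated_since": updated_at.last_value},
    )
    resp.raise_for_status()
    yield resp.json()["results"]

pipeline = dlt.pipeline(pipeline_name="orders", destination="clickhouse", dataset_name="bronze")
print(pipeline.run(orders))

ClickHouse credentials belong in .dlt/secrets.toml, not in the pipeline code.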

SQLMesh + DuckDB

pip install sqlmesh
sqlmesh init duckdb
sqlmesh plan && sqlmesh run
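
SQLMesh transforms are usually plain SQL model files; to keep a single example language, here is the equivalent Python model API. The model, upstream table, and columns are hypothetical:

import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model

@model(
    "silver.daily_orders",  # hypothetical model name
    columns={"order_date": "date", "order_count": "int"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # Aggregate only the interval SQLMesh is (back)filling; bronze.orders is hypothetical.
    return context.fetchdf(
        f"SELECT order_date, count(*) AS order_count "
        f"FROM bronze.orders "
        f"WHERE order_date BETWEEN '{start.date()}' AND '{end.date()}' "
        f"GROUP BY order_date"
    )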

Reliability and Safety

Do

  • Define data contracts and owners up front
  • Add quality gates (freshness, volume, schema, distribution) per tier
  • Make every pipeline idempotent and re-runnable (backfills are normal); see the sketch after this list
  • Treat access control and audit logging as first-class requirements
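
One concrete way to satisfy the idempotency rule above: overwrite an entire partition per run inside a transaction, so retries and backfills converge to the same final state. The table, schema, and warehouse file are hypothetical; the pattern is shown on DuckDB:

import duckdb

con = duckdb.connect("lake.duckdb")  # hypothetical warehouse file

def load_partition(ds: str, rows: list[tuple]) -> None:
    # Delete-then-insert for one partition inside a transaction: re-running the
    # same day (retry or backfill) always produces the same final state.
    con.execute("BEGIN TRANSACTION")
    try:
        con.execute("DELETE FROM silver.orders WHERE ds = ?", [ds])
        con.executemany("INSERT INTO silver.orders VALUES (?, ?, ?)", rows)
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise

load_partition("2024-06-01", [("2024-06-01", "ord-1", 42.0)])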

Avoid

  • Skipping validation to "move fast"
  • Storing PII without access controls
  • Pipelines that can't be re-run safely
  • Manual schema changes without version control

Resources

Resource | Purpose
references/overview.md | Diagrams and decision flows
references/architecture-patterns.md | Medallion, data mesh
references/ingestion-patterns.md | dlt vs Airbyte, CDC
references/transformation-patterns.md | SQLMesh vs dbt
references/storage-formats.md | Iceberg vs Delta
references/query-engine-patterns.md | ClickHouse, DuckDB
references/streaming-patterns.md | Kafka, Flink
references/orchestration-patterns.md | Dagster, Airflow
references/bi-visualization-patterns.md | Metabase, Superset
references/cost-optimization.md | Cost levers and maintenance
references/operational-playbook.md | Monitoring and incident response
references/governance-catalog.md | Catalog, lineage, access control

Templates

Template | Purpose
assets/cross-platform/template-medallion-architecture.md | Baseline bronze/silver/gold plan
assets/cross-platform/template-data-pipeline.md | End-to-end pipeline skeleton
assets/cross-platform/template-ingestion-governance-checklist.md | Source onboarding checklist
assets/cross-platform/template-incremental-loading.md | Incremental + backfill plan
assets/cross-platform/template-schema-evolution.md | Schema change rules
assets/cross-platform/template-cost-optimization.md | Cost control checklist
assets/cross-platform/template-data-quality-governance.md | Quality contracts + SLOs
assets/cross-platform/template-data-quality-backfill-runbook.md | Backfill incident/runbook

Related Skills

Skill | Purpose
ai-mlops | ML deployment
ai-ml-data-science | Feature engineering
data-sql-optimization | OLTP optimization