Agent-skills motherduck-build-data-pipeline
Design an end-to-end MotherDuck pipeline. Use when choosing raw, staging, and analytics boundaries, bulk ingestion paths, transformation sequencing, publication targets, or whether DuckLake is actually required.
git clone https://github.com/motherduckdb/agent-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/motherduckdb/agent-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/motherduck-skills-claude/skills/motherduck-build-data-pipeline" ~/.claude/skills/motherduckdb-agent-skills-motherduck-build-data-pipeline && rm -rf "$T"
plugins/motherduck-skills-claude/skills/motherduck-build-data-pipeline/SKILL.md
Build a Data Pipeline with MotherDuck
Use this skill when the user needs an ingestion-to-serving workflow, not just a single load step.
This is a use-case skill. It orchestrates motherduck-connect, motherduck-load-data, motherduck-model-data, motherduck-query, motherduck-share-data, and motherduck-ducklake.
Start Here: Is a MotherDuck Server Active?
Always determine this first.
- If a remote MotherDuck MCP server or local MotherDuck server is active, use it.
- If the user already knows the destination database, confirm it before designing stages.
- Explore the live environment (a discovery sketch follows this section):
- current databases and schemas
- raw, staging, and analytics boundaries if they already exist
- source tables, target tables, and table grain
- key columns, date fields, and join keys
Use that discovery to decide whether the pipeline is:
- landing into an empty workspace
- extending an existing warehouse layout
- publishing into an existing analytics model
If no server is active, ask for source shape and target shape before drafting the pipeline.
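A minimal discovery sketch, assuming the duckdb Python client and a MOTHERDUCK_TOKEN in the environment; the table name at the end is hypothetical, not something the skill defines:

```python
import duckdb

# Connect to MotherDuck; the token is read from MOTHERDUCK_TOKEN.
con = duckdb.connect("md:")

# List the databases visible in this workspace.
print(con.sql("SHOW DATABASES"))

# Find existing raw/staging/analytics boundaries and candidate source tables.
print(con.sql("""
    SELECT table_catalog, table_schema, table_name
    FROM information_schema.tables
    ORDER BY table_catalog, table_schema, table_name
"""))

# Inspect key columns and grain of a candidate source table (hypothetical name).
print(con.sql("DESCRIBE my_db.raw.orders"))
```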
Use This Skill When
- The user needs ingestion plus transformation plus serving output.
- The work spans raw landing, curation, and publication.
- The user needs a stage-by-stage pipeline pattern rather than one command.
- The problem is bigger than a single import step or one ad hoc transformation.
Pipeline Defaults
- batch over streaming
- raw landing before curation
- explicit raw -> staging -> analytics boundaries (sketched below)
- bulk ingest paths over row-by-row writes
- DuckDB 1.5+ clients for write/delete-heavy jobs that depend on concurrent checkpoint behavior
- native MotherDuck storage unless DuckLake is explicitly required
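A minimal sketch of these defaults with the duckdb Python client; the database name, schema layout, and Parquet path are placeholders for illustration, not part of the skill itself:

```python
import duckdb

# Connect to MotherDuck (MOTHERDUCK_TOKEN read from the environment).
con = duckdb.connect("md:")

# Explicit raw -> staging -> analytics boundaries in a hypothetical pipeline database.
con.execute("CREATE DATABASE IF NOT EXISTS my_pipeline")
for schema in ("raw", "staging", "analytics"):
    con.execute(f"CREATE SCHEMA IF NOT EXISTS my_pipeline.{schema}")

# Bulk ingest: land the whole extract in one set-based statement, not row-by-row inserts.
con.execute("""
    CREATE OR REPLACE TABLE my_pipeline.raw.orders AS
    SELECT *, now() AS loaded_at
    FROM read_parquet('extracts/orders-*.parquet')
""")
```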
Workflow
- Confirm whether live MotherDuck discovery is available.
- Inspect the current workspace and target data model.
- Define raw, staging, and analytics boundaries.
- Ingest raw data.
- Deduplicate, type, and promote into staging.
- Materialize analytics-ready outputs.
- Validate counts and business metrics before publishing downstream assets.
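A minimal sketch of the staging, analytics, and validation steps above, continuing the connection and tables from the previous sketch; the column names are invented for illustration:

```python
# Deduplicate, type, and promote raw rows into staging (keep the latest copy per order).
con.execute("""
    CREATE OR REPLACE TABLE my_pipeline.staging.orders AS
    SELECT DISTINCT ON (order_id)
        order_id,
        CAST(order_date AS DATE)       AS order_date,
        CAST(amount AS DECIMAL(12, 2)) AS amount
    FROM my_pipeline.raw.orders
    ORDER BY order_id, loaded_at DESC
""")

# Materialize an analytics-ready output.
con.execute("""
    CREATE OR REPLACE TABLE my_pipeline.analytics.daily_revenue AS
    SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM my_pipeline.staging.orders
    GROUP BY order_date
""")

# Validate counts and a business metric before publishing downstream assets.
raw_rows = con.execute("SELECT count(*) FROM my_pipeline.raw.orders").fetchone()[0]
staged_rows = con.execute("SELECT count(*) FROM my_pipeline.staging.orders").fetchone()[0]
assert staged_rows <= raw_rows, "staging should never hold more rows than raw"
```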
When this skill produces a native DuckDB (md:) connection, watermark it with custom_user_agent=agent-skills/<latest-available-skills-version>(harness-<harness>;llm-<llm>). If metadata is missing, fall back to harness-unknown and llm-unknown.
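A minimal sketch using DuckDB's custom_user_agent setting; the version string and the harness/LLM values below are placeholders, not the real skills version:

```python
import duckdb

# Watermark the connection; replace the placeholder version and metadata values.
con = duckdb.connect(
    "md:my_pipeline",
    config={"custom_user_agent": "agent-skills/0.0.0(harness-unknown;llm-unknown)"},
)
```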
Output
The output of this skill should be:
- the stage layout
- the ingestion method
- the transformation sequence
- the serving tables or views
- the validation checks
If the caller explicitly asks for structured JSON, return raw JSON only with no Markdown fences or prose before/after it. This is mainly for automated tests, regression checks, or downstream tooling that needs a stable machine-readable shape. Normal human-facing use of the skill can stay in prose unless JSON is explicitly requested.
Use this exact top-level shape when JSON is requested:
{ "summary": {}, "assumptions": [], "implementation_plan": [], "validation_plan": [], "risks": [] }
References
- references/dlt-dbt-motherduck-project/ -- fully runnable MotherDuck reference project using dlt, dbt-duckdb, and validation queries
- references/PIPELINE_IMPLEMENTATION_GUIDE.md -- preserved detailed pipeline guidance that used to live in this skill
- ../load-data/references/INGESTION_PATTERNS.md -- lower-level ingestion patterns
Runnable Artifact
- artifacts/pipeline_stage_example.py -- MotherDuck-backed Python example that stages a Parquet extract, lands it into raw, deduplicates it, and publishes analytics output across raw/staging/analytics databases
- artifacts/pipeline_stage_example.ts -- TypeScript companion artifact with the same stage layout and output contract
- references/dlt-dbt-motherduck-project/ -- end-to-end MotherDuck example that bootstraps the target database, lands raw data with dlt, builds staging and analytics models with dbt, and validates the final mart
Run it with:
uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
Run the same stage pattern against temporary MotherDuck databases:
MOTHERDUCK_ARTIFACT_USE_MOTHERDUCK=1 \
  uv run --with duckdb python skills/motherduck-build-data-pipeline/artifacts/pipeline_stage_example.py
Validate the TypeScript companion artifact:
uv run scripts/test_typescript_artifacts.py
For the full MotherDuck project:
cd skills/motherduck-build-data-pipeline/references/dlt-dbt-motherduck-project
export MOTHERDUCK_TOKEN=...
export MOTHERDUCK_PIPELINE_DB=md_skills_pipeline_demo
uv sync --python 3.12
uv run python pipeline/run_all.py
uv run python pipeline/cleanup.py
Verified Notes
- Bootstrap the target MotherDuck database before running dlt. The motherduck destination does not create the database for you.
- Keep this stack on Python 3.11 or 3.12 for now. The tested dbt-duckdb path here was not reliable on Python 3.14.
- If you want exact schema names like raw, staging, and analytics in dbt, override generate_schema_name.
- When a long-lived Python process loads data and a separate dbt subprocess builds models, run post-build validation in a fresh process or refresh database state before reading new relations.
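A minimal bootstrap sketch for the first note above, assuming the MOTHERDUCK_PIPELINE_DB variable used by the reference project; run something like this before the dlt load:

```python
import os
import duckdb

# The dlt motherduck destination will not create the database, so create it up front.
db_name = os.environ.get("MOTHERDUCK_PIPELINE_DB", "md_skills_pipeline_demo")
con = duckdb.connect("md:")
con.execute(f"CREATE DATABASE IF NOT EXISTS {db_name}")
con.close()
```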
Related Skills
- motherduck-connect -- choose the right connection path
- motherduck-load-data -- ingestion mechanics
- motherduck-model-data -- shape the analytics layer
- motherduck-query -- write transformations and validations
- motherduck-share-data -- publish curated outputs
- motherduck-ducklake -- only when open-table-format storage is a real requirement