Claude-skill-registry dbt Patterns
Comprehensive guide to dbt (data build tool) patterns, modeling best practices, testing strategies, and production workflows for modern data transformation
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/dbt-patterns" ~/.claude/skills/majiayu000-claude-skill-registry-dbt-patterns && rm -rf "$T"
skills/data/dbt-patterns/SKILL.mddbt Patterns
What is dbt?
dbt (data build tool): SQL-first transformation tool that enables analytics engineers to transform data in the warehouse using SELECT statements.
Core Concept
Raw Data (Extract & Load) → dbt (Transform) → Analytics-Ready Data Traditional ETL: Extract → Transform → Load Modern ELT: Extract → Load → Transform (with dbt)
Why dbt?
- Version control: SQL as code (Git)
- Testing: Built-in data quality tests
- Documentation: Auto-generated docs
- Modularity: Reusable models
- Lineage: Visual data lineage
- Collaboration: Team workflows
dbt Project Structure
Standard Layout
my_dbt_project/ ├── dbt_project.yml # Project configuration ├── profiles.yml # Connection profiles ├── models/ # SQL models │ ├── staging/ # Raw data cleaning │ ├── intermediate/ # Business logic │ └── marts/ # Final analytics tables ├── tests/ # Custom tests ├── macros/ # Reusable SQL ├── seeds/ # CSV reference data ├── snapshots/ # SCD Type 2 └── analyses/ # Ad-hoc queries
dbt_project.yml
name: 'my_project' version: '1.0.0' config-version: 2 profile: 'my_profile' model-paths: ["models"] test-paths: ["tests"] seed-paths: ["seeds"] macro-paths: ["macros"] snapshot-paths: ["snapshots"] models: my_project: staging: +materialized: view +schema: staging intermediate: +materialized: ephemeral marts: +materialized: table +schema: analytics
Model Layers (Staging → Intermediate → Marts)
Staging Layer
Purpose: Clean and standardize raw data
Pattern:
-- models/staging/stg_orders.sql with source as ( select * from {{ source('raw', 'orders') }} ), renamed as ( select id as order_id, user_id, created_at as order_created_at, status as order_status, total_amount from source ) select * from renamed
Best Practices:
- One staging model per source table
- Rename columns to consistent naming
- Cast data types
- No business logic
- Materialized as views (lightweight)
Intermediate Layer
Purpose: Business logic and transformations
Pattern:
-- models/intermediate/int_orders_with_customer.sql with orders as ( select * from {{ ref('stg_orders') }} ), customers as ( select * from {{ ref('stg_customers') }} ), joined as ( select orders.*, customers.customer_name, customers.customer_segment from orders left join customers on orders.user_id = customers.customer_id ) select * from joined
Best Practices:
- Complex joins
- Business logic
- Calculations
- Materialized as ephemeral (not persisted)
Marts Layer
Purpose: Final analytics-ready tables
Pattern:
-- models/marts/fct_orders.sql with orders as ( select * from {{ ref('int_orders_with_customer') }} ), aggregated as ( select order_id, user_id, customer_name, order_created_at, order_status, total_amount, case when order_status = 'completed' then total_amount else 0 end as completed_revenue from orders ) select * from aggregated
Best Practices:
- Business-friendly naming
- Denormalized for analytics
- Materialized as tables (fast queries)
- Documented
Naming Conventions
Model Naming
Staging: stg_<source>_<table> stg_salesforce_accounts stg_stripe_payments Intermediate: int_<entity>_<verb> int_orders_joined int_customers_enriched Facts: fct_<entity> fct_orders fct_revenue Dimensions: dim_<entity> dim_customers dim_products
Column Naming
IDs: <entity>_id customer_id, order_id Dates: <entity>_<verb>_at order_created_at, customer_updated_at Booleans: is_<condition> or has_<condition> is_active, has_subscription Amounts: <entity>_amount order_amount, refund_amount
Materializations
View
{{ config(materialized='view') }} select * from {{ ref('stg_orders') }}
- Pros: Always fresh, no storage
- Cons: Slow queries (recomputed each time)
- Use for: Staging models, rarely queried
Table
{{ config(materialized='table') }} select * from {{ ref('fct_orders') }}
- Pros: Fast queries
- Cons: Storage cost, stale data
- Use for: Marts, frequently queried
Incremental
{{ config( materialized='incremental', unique_key='order_id' ) }} select * from {{ ref('stg_orders') }} {% if is_incremental() %} where order_created_at > (select max(order_created_at) from {{ this }}) {% endif %}
- Pros: Fast builds, handles large data
- Cons: Complex logic
- Use for: Large fact tables
Ephemeral
{{ config(materialized='ephemeral') }} select * from {{ ref('stg_orders') }}
- Pros: No storage, CTE in downstream models
- Cons: Recomputed in each downstream model
- Use for: Intermediate models
Testing
Schema Tests
# models/schema.yml version: 2 models: - name: fct_orders description: "Order facts table" columns: - name: order_id description: "Unique order identifier" tests: - unique - not_null - name: user_id description: "Customer who placed order" tests: - not_null - relationships: to: ref('dim_customers') field: customer_id - name: order_status tests: - accepted_values: values: ['pending', 'completed', 'cancelled'] - name: total_amount tests: - not_null - dbt_utils.expression_is_true: expression: ">= 0"
Custom Tests
-- tests/assert_positive_revenue.sql select * from {{ ref('fct_orders') }} where total_amount < 0
Test Types
- unique: No duplicates
- not_null: No nulls
- accepted_values: Value in list
- relationships: Foreign key check
- custom: SQL query returns 0 rows
Macros
Reusable SQL
-- macros/cents_to_dollars.sql {% macro cents_to_dollars(column_name) %} ({{ column_name }} / 100.0)::decimal(10,2) {% endmacro %}
Usage:
select order_id, {{ cents_to_dollars('total_amount_cents') }} as total_amount_dollars from {{ ref('stg_orders') }}
Common Macros
-- macros/generate_schema_name.sql {% macro generate_schema_name(custom_schema_name, node) %} {%- if custom_schema_name is none -%} {{ target.schema }} {%- else -%} {{ custom_schema_name | trim }} {%- endif -%} {% endmacro %}
Sources
Defining Sources
# models/staging/sources.yml version: 2 sources: - name: raw database: analytics_db schema: raw_data tables: - name: orders description: "Raw orders from production DB" columns: - name: id tests: - unique - not_null freshness: warn_after: {count: 12, period: hour} error_after: {count: 24, period: hour} loaded_at_field: _loaded_at
Using Sources
select * from {{ source('raw', 'orders') }}
Source Freshness
dbt source freshness
Documentation
Model Documentation
# models/schema.yml version: 2 models: - name: fct_orders description: | Order facts table containing all orders with customer information. This table is updated daily at 2 AM UTC. **Business Rules:** - Only includes orders with status 'completed' or 'pending' - Cancelled orders are excluded columns: - name: order_id description: "Unique identifier for each order" - name: total_amount description: "Total order amount in dollars"
Generate Docs
dbt docs generate dbt docs serve
Output: Interactive documentation website with:
- Model descriptions
- Column descriptions
- Data lineage (DAG)
- Source freshness
Snapshots (SCD Type 2)
Snapshot Configuration
-- snapshots/customers_snapshot.sql {% snapshot customers_snapshot %} {{ config( target_schema='snapshots', unique_key='customer_id', strategy='timestamp', updated_at='updated_at' ) }} select * from {{ source('raw', 'customers') }} {% endsnapshot %}
Run Snapshots
dbt snapshot
Output
customer_id | name | updated_at | dbt_valid_from | dbt_valid_to | dbt_scd_id 1 | John | 2024-01-01 | 2024-01-01 | 2024-01-15 | abc123 1 | John | 2024-01-15 | 2024-01-15 | null | def456
Incremental Models
Basic Incremental
{{ config( materialized='incremental', unique_key='order_id' ) }} select order_id, user_id, order_created_at, total_amount from {{ ref('stg_orders') }} {% if is_incremental() %} -- Only process new/updated records where order_created_at > (select max(order_created_at) from {{ this }}) {% endif %}
Incremental with Delete+Insert
{{ config( materialized='incremental', unique_key='order_id', incremental_strategy='delete+insert' ) }}
Incremental with Merge
{{ config( materialized='incremental', unique_key='order_id', incremental_strategy='merge', merge_update_columns=['order_status', 'total_amount'] ) }}
Packages
Installing Packages
# packages.yml packages: - package: dbt-labs/dbt_utils version: 1.1.0 - package: calogica/dbt_expectations version: 0.9.0 - package: dbt-labs/codegen version: 0.11.0
dbt deps
Using Packages
-- dbt_utils select {{ dbt_utils.surrogate_key(['order_id', 'user_id']) }} as unique_key, * from {{ ref('stg_orders') }} -- dbt_expectations tests: - dbt_expectations.expect_column_values_to_be_between: min_value: 0 max_value: 1000000
Production Workflows
Development Workflow
# 1. Create feature branch git checkout -b feature/new-model # 2. Develop model # Edit models/marts/fct_new_model.sql # 3. Run model dbt run --select fct_new_model # 4. Test model dbt test --select fct_new_model # 5. Document model # Edit models/schema.yml # 6. Commit and push git add . git commit -m "Add new model" git push origin feature/new-model # 7. Create PR # Review, approve, merge
CI/CD Pipeline
# .github/workflows/dbt_ci.yml name: dbt CI on: [pull_request] jobs: dbt_run: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - name: Setup Python uses: actions/setup-python@v2 with: python-version: '3.9' - name: Install dbt run: pip install dbt-snowflake - name: Run dbt run: | dbt deps dbt run --select state:modified+ --defer --state ./prod_manifest dbt test --select state:modified+ --defer --state ./prod_manifest
Production Deployment
# Daily production run dbt run --target prod dbt test --target prod dbt source freshness --target prod
Performance Optimization
Optimize Queries
-- Bad: Multiple CTEs with same source with orders_1 as ( select * from {{ ref('stg_orders') }} where status = 'completed' ), orders_2 as ( select * from {{ ref('stg_orders') }} where status = 'pending' ) -- Good: Single CTE, filter later with orders as ( select * from {{ ref('stg_orders') }} where status in ('completed', 'pending') ), completed_orders as ( select * from orders where status = 'completed' ), pending_orders as ( select * from orders where status = 'pending' )
Use Incremental Models
-- For large tables (millions of rows) {{ config(materialized='incremental') }}
Partition Tables
{{ config( materialized='table', partition_by={ "field": "order_date", "data_type": "date", "granularity": "day" } ) }}
Best Practices
1. One Model Per File
✓ models/staging/stg_orders.sql ✗ models/staging/all_staging_models.sql
2. Use CTEs, Not Subqueries
-- Good with orders as ( select * from {{ ref('stg_orders') }} ) select * from orders -- Bad select * from ( select * from {{ ref('stg_orders') }} ) as orders
3. Explicit Column Selection
-- Good select order_id, user_id, total_amount from {{ ref('stg_orders') }} -- Bad select * from {{ ref('stg_orders') }}
4. Test Everything
# Every model should have tests models: - name: fct_orders columns: - name: order_id tests: [unique, not_null]
5. Document Everything
# Every model and column should have description models: - name: fct_orders description: "Order facts table" columns: - name: order_id description: "Unique order ID"
Common Patterns
Slowly Changing Dimensions (SCD)
-- Type 1: Overwrite {{ config(materialized='table') }} select * from {{ ref('stg_customers') }} -- Type 2: Historical tracking {% snapshot customers_snapshot %} {{ config(strategy='timestamp', updated_at='updated_at') }} select * from {{ ref('stg_customers') }} {% endsnapshot %}
Fact Tables
-- Transactional facts {{ config(materialized='incremental', unique_key='order_id') }} select order_id, customer_id, product_id, order_date, quantity, amount from {{ ref('stg_orders') }}
Dimension Tables
-- Dimension table {{ config(materialized='table') }} select customer_id, customer_name, customer_segment, customer_region from {{ ref('stg_customers') }}
Troubleshooting
Common Errors
Compilation Error:
Compilation Error in model fct_orders Model 'stg_orders' not found
Fix: Check model name, ensure it exists
Test Failure:
Failure in test unique_fct_orders_order_id Got 5 results, expected 0
Fix: Investigate duplicate order_ids
Freshness Error:
Source 'raw.orders' is stale (loaded 25 hours ago)
Fix: Check ETL pipeline, data loading
Summary
dbt: SQL-first transformation tool
Layers:
- Staging: Clean raw data
- Intermediate: Business logic
- Marts: Analytics-ready
Materializations:
- View: Always fresh, slow
- Table: Fast, stale
- Incremental: Large data
- Ephemeral: No storage
Testing:
- unique, not_null, accepted_values, relationships
- Custom SQL tests
Best Practices:
- One model per file
- Use CTEs
- Test everything
- Document everything
- Version control
Workflow:
- Develop → Test → Document → PR → Merge → Deploy