# dbt Incremental Model Development

`skills/dbt/developing-incremental-models/SKILL.md`

Install the full skills repository:

```shell
git clone https://github.com/AltimateAI/data-engineering-skills
```

Or copy only this skill into your local skills directory:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/AltimateAI/data-engineering-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/dbt/developing-incremental-models" ~/.claude/skills/altimateai-data-engineering-skills-developing-incremental-models && rm -rf "$T"
```

Choose the right strategy. Design the unique_key carefully. Handle edge cases.
## When to Use Incremental

| Scenario | Recommendation |
|---|---|
| Source data < 10M rows | Use `table` (simpler, full refresh is fast) |
| Source data > 10M rows | Consider `incremental` |
| Source data updated in place | Use `incremental` with the `merge` strategy |
| Append-only source (logs, events) | Use `incremental` with the `append` strategy |
| Partitioned warehouse data | Use `insert_overwrite` if supported |

Default to `table` unless you have a clear performance reason for incremental.
## Critical Rules

- ALWAYS test with `--full-refresh` first before relying on incremental logic
- ALWAYS verify unique_key is truly unique in both source and target
- If merge fails 3+ times, check unique_key for duplicates
- Run full refresh periodically to prevent data drift
## Workflow
### 1. Confirm Incremental is Needed

```shell
# Check source table size
dbt show --inline "select count(*) from {{ source('schema', 'table') }}"
```

If the count is under 10 million, consider using `table` instead. Incremental adds complexity.
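For small sources, the simpler alternative looks like this; a minimal sketch, with placeholder source names:

```sql
-- Plain table materialization: full rebuild on every run,
-- no incremental logic or unique_key to maintain
{{ config(materialized='table') }}

select * from {{ source('schema', 'table') }}
```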
### 2. Understand the Source Data Pattern
Before choosing a strategy, answer:
- Is data append-only? (new rows added, never updated)
- Are existing rows updated? (need merge/upsert)
- Is there a reliable timestamp? (for filtering new data)
- What's the unique identifier? (for merge matching)
```shell
# Check for timestamp column
dbt show --inline "
  select
    min(updated_at) as earliest,
    max(updated_at) as latest,
    count(distinct date(updated_at)) as days_of_data
  from {{ source('schema', 'table') }}
"
```
### 3. Choose the Right Strategy
| Strategy | Use When | How It Works |
|---|---|---|
| `append` | Data is append-only, no updates | INSERT only, no deduplication |
| `merge` | Data can be updated | MERGE/UPSERT by unique_key |
| `delete+insert` | Data updated in batches | DELETE matching rows, then INSERT |
| `insert_overwrite` | Partitioned tables (BigQuery, Spark) | Replace entire partitions |

Default: `merge` is the safest choice for most use cases.
Note: Strategy availability varies by adapter. Check the dbt incremental strategy docs for your specific warehouse.
### 4. Design the Unique Key
CRITICAL: unique_key must be truly unique in your data.
```shell
# Verify uniqueness BEFORE creating the model
dbt show --inline "
  select {{ unique_key_column }}, count(*)
  from {{ source('schema', 'table') }}
  group by 1
  having count(*) > 1
  limit 10
"
```
If duplicates exist:

- Add more columns to make a composite key
- Add deduplication logic in the model
- Use `delete+insert` instead of `merge`
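dbt accepts a list of columns as a composite unique_key. A sketch, assuming a hypothetical line-items source where only the pair (order_id, line_number) is unique:

```sql
-- Composite key: order_id alone repeats across line items,
-- but (order_id, line_number) identifies a row uniquely
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=['order_id', 'line_number']
    )
}}

select * from {{ source('schema', 'order_lines') }}

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```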
### 5. Write the Incremental Model
```sql
-- incremental_strategy: 'merge' here; 'append' and 'delete+insert' are alternatives
-- unique_key MUST be unique; on_schema_change handles new source columns
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='id',
        on_schema_change='append_new_columns'
    )
}}

select
    id,
    column_a,
    column_b,
    updated_at
from {{ source('schema', 'table') }}

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

(SQL-style `--` comments are not valid inside the `{{ config() }}` Jinja expression, so annotations go outside it.)
### 6. Build with Full Refresh First
ALWAYS verify with full refresh before trusting incremental logic.
```shell
# First run: full refresh to establish a baseline
dbt build --select <model_name> --full-refresh

# Verify output
dbt show --select <model_name> --limit 10
dbt show --inline "select count(*) from {{ ref('model_name') }}"
```
### 7. Test Incremental Logic
```shell
# Run incrementally (no --full-refresh)
dbt build --select <model_name>

# Verify row count changed appropriately
dbt show --inline "select count(*) from {{ ref('model_name') }}"
```
### 8. Handle Schema Changes

Set `on_schema_change` based on your needs:

| Setting | Behavior |
|---|---|
| `ignore` (default) | New columns in source are ignored |
| `append_new_columns` | New columns added to target |
| `sync_all_columns` | Target schema matches source exactly |
| `fail` | Error if schema changes |
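For example, to pick up new source columns automatically (model body omitted):

```sql
{{
    config(
        materialized='incremental',
        unique_key='id',
        on_schema_change='append_new_columns'
    )
}}
```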
## Common Incremental Problems
### Problem: Merge Fails with Duplicate Key
Symptom: "Cannot MERGE with duplicate values"
Cause: Multiple rows with same unique_key in source or target.
Fix:
```sql
-- Add deduplication using a CTE (cross-database compatible)
with deduplicated as (

    select
        *,
        row_number() over (partition by id order by updated_at desc) as rn
    from {{ source('schema', 'table') }}
    {% if is_incremental() %}
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}

)

select * from deduplicated
where rn = 1
```
### Problem: No Partition Pruning (Full Table Scan)
Symptom: Incremental runs take as long as full refresh.
Cause: Dynamic date filter prevents partition pruning.
Fix:
```sql
{% if is_incremental() %}
-- Use a static date bound (not only a subquery) so the warehouse can prune partitions
where updated_at >= {{ dbt.dateadd('day', -3, dbt.current_timestamp()) }}
  and updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```
### Problem: Late-Arriving Data is Missed
Symptom: Some records never appear in incremental model.
Cause: Filtering by max(updated_at) misses late arrivals.
Fix: Use a lookback window with a fixed offset from current date:
```sql
{% if is_incremental() %}
-- Look back 3 days to catch late-arriving data
where updated_at >= {{ dbt.dateadd('day', -3, dbt.current_timestamp()) }}
{% endif %}
```
Alternatively, use a variable for the lookback period:
```sql
{% set lookback_days = 3 %}

{% if is_incremental() %}
where updated_at >= {{ dbt.dateadd('day', -lookback_days, dbt.current_timestamp()) }}
{% endif %}
```
### Problem: Schema Drift Causes Errors
Symptom: "Column X not found" after source adds column.
Fix: Set `on_schema_change='append_new_columns'` in the config.
### Problem: Data Drift Over Time
Symptom: Counts diverge between incremental and full refresh.
Fix: Schedule periodic full refresh:
```shell
# Weekly full refresh
dbt build --select <model_name> --full-refresh
```
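One way to schedule this, assuming a cron-based host with dbt on the PATH and a configured profile (paths and model name are placeholders):

```shell
# crontab entry: full refresh every Sunday at 03:00
0 3 * * 0 cd /path/to/dbt_project && dbt build --select my_model --full-refresh
```

Orchestrators such as dbt Cloud or Airflow can express the same weekly job; the point is that the full refresh runs on a fixed cadence, not manually.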
## Incremental Strategy Reference
### Append (Simplest)
```sql
{{ config(materialized='incremental', incremental_strategy='append') }}

select * from {{ source('events', 'raw') }}

{% if is_incremental() %}
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```
- No unique_key needed
- Fastest performance
- Only use for append-only data (logs, events, immutable records)
### Merge (Default)
```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='id'
    )
}}

select * from {{ source('crm', 'contacts') }}

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```
- Requires unique_key
- Handles updates and inserts
- Most common strategy
### Delete+Insert (Batch Updates)
```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='id'
    )
}}

select * from {{ source('orders', 'raw') }}

{% if is_incremental() %}
where order_date >= {{ dbt.dateadd('day', -7, dbt.current_timestamp()) }}
{% endif %}
```
- Deletes all matching rows first
- Good for reprocessing batches
- Use when merge has duplicate key issues
### Insert Overwrite (Partitioned)
```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'event_date', 'data_type': 'date'}
    )
}}

select * from {{ source('events', 'raw') }}

{% if is_incremental() %}
where event_date >= {{ dbt.dateadd('day', -3, dbt.current_timestamp()) }}
{% endif %}
```
- Replaces entire partitions
- Best for partitioned tables in BigQuery/Spark
- No unique_key needed (operates on partitions)
## Anti-Patterns
- Using incremental for small tables (< 10M rows)
- Not testing with full-refresh first
- Using append strategy when data can be updated
- Not verifying unique_key uniqueness
- Relying on exact timestamp match without lookback
- Never running full refresh (causes data drift)
- Using merge with non-unique keys
## Testing Checklist

- Model runs with `--full-refresh`
- Model runs incrementally (without the flag)
- unique_key verified as truly unique
- Row counts reasonable after incremental run
- Late-arriving data handled (lookback window)
- Schema changes handled (on_schema_change set)
- Periodic full refresh scheduled
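The uniqueness item on this checklist can be automated as a singular dbt test; a sketch assuming the model is named my_incremental_model (file path and names are hypothetical):

```sql
-- tests/assert_unique_key_is_unique.sql
-- dbt treats any returned rows as a test failure,
-- so this fails whenever an id appears more than once in the built model
select
    id,
    count(*) as n
from {{ ref('my_incremental_model') }}
group by 1
having count(*) > 1
```

Run it alongside the model with `dbt build --select my_incremental_model` so a duplicate-key regression is caught before the next merge fails.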