Data-engineering-skills refactoring-dbt-models

install
source · Clone the upstream repo
git clone https://github.com/AltimateAI/data-engineering-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/AltimateAI/data-engineering-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/dbt/refactoring-dbt-models" ~/.claude/skills/altimateai-data-engineering-skills-refactoring-dbt-models && rm -rf "$T"
manifest: skills/dbt/refactoring-dbt-models/SKILL.md
source content

dbt Refactoring

Find ALL downstream dependencies before changing. Refactor in small steps. Verify output after each change.

Workflow

1. Analyze Current Model

cat models/<path>/<model_name>.sql

Identify refactoring opportunities:

  • CTEs longer than 50 lines → extract to intermediate model
  • Logic repeated across models → extract to macro
  • Multiple joins in sequence → split into steps
  • Complex WHERE clauses → extract to staging filter

2. Find All Downstream Dependencies

CRITICAL: Never refactor without knowing impact.

# Get full dependency tree (model and all its children)
dbt ls --select model_name+ --output list

# Find all models referencing this one
grep -r "ref('model_name')" models/ --include="*.sql"

Report to user: "Found X downstream models: [list]. These will be affected by changes."

3. Check What Columns Downstream Models Use

BEFORE changing any columns, check what downstream models reference:

# For each downstream model, check what columns it uses
cat models/<path>/<downstream_model>.sql | grep -E "model_name\.\w+|alias\.\w+"

If downstream models reference specific columns, you MUST ensure those columns remain available after refactoring.

4. Plan Refactoring Strategy

OpportunityStrategy
Long CTEExtract to intermediate model
Repeated logicCreate macro in
macros/
Complex joinSplit into intermediate models
Multiple concernsSeparate into focused models

5. Execute Refactoring

Pattern: Extract CTE to Model

Before:

-- orders.sql (200 lines)
with customer_metrics as (
    -- 50 lines of complex logic
),
order_enriched as (
    select ...
    from orders
    join customer_metrics on ...
)
select * from order_enriched

After:

-- customer_metrics.sql (new file)
select
    customer_id,
    -- complex logic here
from {{ ref('customers') }}

-- orders.sql (simplified)
with order_enriched as (
    select ...
    from {{ ref('raw_orders') }} orders
    join {{ ref('customer_metrics') }} cm on ...
)
select * from order_enriched

Pattern: Extract to Macro

Before (repeated in multiple models):

case
    when amount < 0 then 'refund'
    when amount = 0 then 'zero'
    else 'positive'
end as amount_category

After:

-- macros/categorize_amount.sql
{% macro categorize_amount(column_name) %}
case
    when {{ column_name }} < 0 then 'refund'
    when {{ column_name }} = 0 then 'zero'
    else 'positive'
end
{% endmacro %}

-- In models:
{{ categorize_amount('amount') }} as amount_category

6. Validate Changes

# Compile to check syntax
dbt compile --select +model_name+

# Build entire lineage
dbt build --select +model_name+

# Check row counts (manual)
# Before: Record expected counts
# After: Verify counts match

7. Verify Output Matches Original

CRITICAL: Refactoring should not change output.

# Compare row counts before and after
dbt show --inline "select count(*) from {{ ref('model_name') }}"

# Spot check key values
dbt show --select <model_name> --limit 10

8. Update Downstream Models

If changing output columns:

  1. Update all downstream refs
  2. Update schema.yml documentation
  3. Re-run downstream tests

Refactoring Checklist

  • All downstream dependencies identified
  • User informed of impact scope
  • One change at a time
  • Compile passes after each change
  • Build passes after each change
  • Output validated (row counts match)
  • Documentation updated
  • Tests still pass

Common Refactoring Triggers

SymptomRefactoring
Model > 200 linesExtract CTEs to models
Same logic in 3+ modelsExtract to macro
5+ joins in one modelCreate intermediate models
Hard to understandAdd CTEs with clear names
Slow performanceSplit to allow parallelization

Anti-Patterns

  • Refactoring without checking downstream impact
  • Making multiple changes at once
  • Not validating output matches after refactoring
  • Extracting prematurely (wait for 3+ uses)
  • Breaking existing tests without updating them