Claude-skill-registry ELT Modeling
Comprehensive guide to ELT (Extract, Load, Transform) modeling patterns, dimensional modeling, fact and dimension tables, and data warehouse design
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/elt-modeling" ~/.claude/skills/majiayu000-claude-skill-registry-elt-modeling && rm -rf "$T"
skills/data/elt-modeling/SKILL.mdELT Modeling
ELT vs ETL
ETL (Traditional)
Extract → Transform → Load 1. Extract from source 2. Transform in ETL tool (Informatica, Talend) 3. Load to warehouse Pros: Clean data before loading Cons: Slow, expensive transformation servers
ELT (Modern)
Extract → Load → Transform 1. Extract from source 2. Load raw data to warehouse 3. Transform in warehouse (SQL, dbt) Pros: Fast loading, leverage warehouse power Cons: Raw data in warehouse (need governance)
Why ELT?
- Cloud warehouses: Snowflake, BigQuery, Redshift (powerful, scalable)
- Separation of concerns: EL (Fivetran, Airbyte) + T (dbt)
- Flexibility: Transform multiple ways from same raw data
- Speed: Parallel processing in warehouse
Dimensional Modeling
Star Schema
dim_customers | dim_products | fct_orders ← Central fact table | dim_dates | dim_locations
Characteristics:
- One fact table (center)
- Multiple dimension tables (points)
- Denormalized (fast queries)
Snowflake Schema
dim_customers | dim_customer_segments | fct_orders | dim_products | dim_product_categories
Characteristics:
- Normalized dimensions
- Less redundancy
- More joins (slower queries)
Recommendation: Use star schema for analytics (faster)
Fact Tables
Definition
Fact table: Stores measurable events (transactions, orders, clicks)
Characteristics
- Large: Millions to billions of rows
- Numeric measures: Amounts, quantities, counts
- Foreign keys: Links to dimensions
- Grain: Level of detail (one row per order, per day, etc.)
Types of Facts
Transaction Facts:
-- One row per transaction fct_orders: order_id (PK) customer_id (FK) product_id (FK) order_date_id (FK) quantity amount
Periodic Snapshot Facts:
-- One row per period (daily, monthly) fct_inventory_daily: date_id (PK) product_id (PK) warehouse_id (PK) quantity_on_hand quantity_sold
Accumulating Snapshot Facts:
-- One row per process (order lifecycle) fct_order_lifecycle: order_id (PK) order_date payment_date shipment_date delivery_date days_to_ship days_to_deliver
Fact Table Design
CREATE TABLE fct_orders ( -- Surrogate key order_key BIGINT PRIMARY KEY, -- Natural key order_id VARCHAR(50) NOT NULL, -- Foreign keys (dimensions) customer_key BIGINT NOT NULL, product_key BIGINT NOT NULL, date_key INT NOT NULL, -- Degenerate dimensions (no separate dim table) order_number VARCHAR(50), -- Measures quantity INT, unit_price DECIMAL(10,2), total_amount DECIMAL(10,2), discount_amount DECIMAL(10,2), -- Audit columns created_at TIMESTAMP, updated_at TIMESTAMP );
Dimension Tables
Definition
Dimension table: Stores descriptive attributes (who, what, where, when)
Characteristics
- Small: Thousands to millions of rows
- Descriptive: Text, categories, hierarchies
- Slowly changing: Updates over time (SCD)
Dimension Design
CREATE TABLE dim_customers ( -- Surrogate key customer_key BIGINT PRIMARY KEY, -- Natural key customer_id VARCHAR(50) NOT NULL, -- Attributes customer_name VARCHAR(255), email VARCHAR(255), phone VARCHAR(50), -- Hierarchies city VARCHAR(100), state VARCHAR(100), country VARCHAR(100), region VARCHAR(100), -- Segments customer_segment VARCHAR(50), customer_tier VARCHAR(50), -- SCD Type 2 columns effective_date DATE, expiration_date DATE, is_current BOOLEAN, -- Audit columns created_at TIMESTAMP, updated_at TIMESTAMP );
Slowly Changing Dimensions (SCD)
Type 1: Overwrite
-- Customer moves, update address UPDATE dim_customers SET city = 'New York', state = 'NY' WHERE customer_id = '123'; -- No history preserved
Type 2: Add New Row (Historical)
-- Customer moves, add new row INSERT INTO dim_customers ( customer_id, city, state, effective_date, is_current ) VALUES ( '123', 'New York', 'NY', '2024-01-15', TRUE ); -- Mark old row as not current UPDATE dim_customers SET is_current = FALSE, expiration_date = '2024-01-14' WHERE customer_id = '123' AND is_current = TRUE; -- History preserved
Type 3: Add New Column
-- Track previous value ALTER TABLE dim_customers ADD COLUMN previous_city VARCHAR(100); UPDATE dim_customers SET previous_city = city, city = 'New York' WHERE customer_id = '123';
Grain Definition
What is Grain?
Grain: Level of detail in fact table (one row represents what?)
Examples
Order grain: One row per order Order line grain: One row per product in order Daily grain: One row per day per product Hourly grain: One row per hour per product
Choosing Grain
Too fine (order line): - More rows - More storage - More flexibility Too coarse (monthly): - Fewer rows - Less storage - Less flexibility Balance: Choose finest grain needed for analysis
Kimball vs Inmon
Kimball (Bottom-Up, Dimensional)
Data Marts (Star Schemas) → Enterprise Data Warehouse Pros: - Fast to implement - Business-friendly (denormalized) - Query performance Cons: - Data redundancy - Hard to change
Inmon (Top-Down, Normalized)
Enterprise Data Warehouse (3NF) → Data Marts Pros: - Single source of truth - Flexible - Less redundancy Cons: - Slow to implement - Complex queries (many joins)
Modern Approach: Hybrid (normalized staging, dimensional marts)
Data Vault
Components
Hubs: Business keys (customers, products) Links: Relationships (orders link customers + products) Satellites: Descriptive attributes (customer details)
Example
-- Hub: Customer CREATE TABLE hub_customer ( customer_hub_key BIGINT PRIMARY KEY, customer_id VARCHAR(50) UNIQUE, load_date TIMESTAMP, record_source VARCHAR(50) ); -- Satellite: Customer Details CREATE TABLE sat_customer_details ( customer_hub_key BIGINT, load_date TIMESTAMP, customer_name VARCHAR(255), email VARCHAR(255), phone VARCHAR(50), PRIMARY KEY (customer_hub_key, load_date) ); -- Link: Order CREATE TABLE link_order ( order_link_key BIGINT PRIMARY KEY, customer_hub_key BIGINT, product_hub_key BIGINT, load_date TIMESTAMP );
Pros:
- Audit trail
- Flexible
- Handles source changes
Cons:
- Complex
- Many joins
- Steep learning curve
Conformed Dimensions
Definition
Conformed dimension: Shared across multiple fact tables
Example
-- dim_date used by multiple facts dim_date ← fct_orders ← fct_shipments ← fct_returns -- Ensures consistent date attributes
Benefits
- Consistency: Same date logic everywhere
- Drill-across: Compare metrics across facts
- Reusability: Build once, use many times
Bridge Tables
Many-to-Many Relationships
-- Customer can have multiple accounts -- Account can have multiple customers CREATE TABLE bridge_customer_account ( customer_key BIGINT, account_key BIGINT, allocation_percentage DECIMAL(5,2), PRIMARY KEY (customer_key, account_key) );
Junk Dimensions
Definition
Junk dimension: Collection of low-cardinality flags
Example
-- Instead of many boolean columns in fact CREATE TABLE dim_order_flags ( order_flag_key INT PRIMARY KEY, is_gift BOOLEAN, is_express_shipping BOOLEAN, is_first_order BOOLEAN, has_discount BOOLEAN ); -- Fact table references junk dimension fct_orders: order_key customer_key order_flag_key ← Reference to junk dimension amount
Role-Playing Dimensions
Definition
Role-playing dimension: Same dimension used multiple times with different meanings
Example
-- dim_date used multiple times fct_orders: order_key order_date_key ← dim_date (order date) ship_date_key ← dim_date (ship date) delivery_date_key ← dim_date (delivery date) amount
Surrogate Keys
Natural Key vs Surrogate Key
-- Natural key: Business identifier customer_id = 'CUST-12345' -- Surrogate key: System-generated customer_key = 1001 (auto-increment or hash)
Why Surrogate Keys?
- Performance: Integer joins faster than string joins
- SCD Type 2: Multiple rows for same natural key
- Independence: Source system can change natural key
- Consistency: Same format across all tables
Generating Surrogate Keys
-- Auto-increment customer_key BIGINT AUTO_INCREMENT PRIMARY KEY -- Hash (dbt) {{ dbt_utils.surrogate_key(['customer_id', 'effective_date']) }} -- Sequence (PostgreSQL) customer_key BIGINT DEFAULT nextval('customer_key_seq')
Data Warehouse Layers
Bronze (Raw)
Purpose: Exact copy of source data Format: As-is from source Schema: Source schema Example: raw_salesforce_accounts
Silver (Cleaned)
Purpose: Cleaned, standardized Format: Consistent types, naming Schema: Staging schema Example: stg_salesforce_accounts
Gold (Analytics)
Purpose: Business-ready Format: Dimensional models Schema: Analytics schema Example: dim_customers, fct_orders
Best Practices
1. Define Grain Clearly
✓ "One row per order" ✗ "Order data"
2. Use Surrogate Keys
✓ customer_key BIGINT ✗ customer_id VARCHAR(50)
3. Denormalize Dimensions
✓ dim_customers includes city, state, country ✗ Separate dim_cities, dim_states, dim_countries
4. Keep Facts Narrow
✓ fct_orders: Keys + measures only ✗ fct_orders: Keys + measures + customer name, product name
5. Use Conformed Dimensions
✓ dim_date shared across all facts ✗ Each fact has own date dimension
Common Patterns
Daily Snapshot
-- Capture state once per day INSERT INTO fct_inventory_daily SELECT CURRENT_DATE as snapshot_date, product_id, warehouse_id, quantity_on_hand, quantity_reserved FROM current_inventory;
Cumulative Metrics
-- Running totals SELECT order_date, SUM(amount) OVER ( ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) as cumulative_revenue FROM fct_orders;
Summary
ELT: Extract → Load → Transform (modern approach)
Dimensional Modeling:
- Star schema (recommended)
- Fact tables (measures)
- Dimension tables (attributes)
Fact Types:
- Transaction (one per event)
- Periodic snapshot (one per period)
- Accumulating snapshot (one per process)
SCD Types:
- Type 1: Overwrite
- Type 2: Historical (recommended)
- Type 3: Previous value
Grain: Level of detail (define clearly!)
Surrogate Keys: Use for performance and SCD
Layers:
- Bronze: Raw
- Silver: Cleaned
- Gold: Analytics-ready
Best Practices:
- Define grain
- Use surrogate keys
- Denormalize dimensions
- Keep facts narrow
- Use conformed dimensions