# Claude-Skills analytics-engineer

```bash
git clone https://github.com/borghei/Claude-Skills
```

To install only this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/data-analytics/analytics-engineer" ~/.claude/skills/borghei-claude-skills-analytics-engineer \
  && rm -rf "$T"
```

Source: `data-analytics/analytics-engineer/SKILL.md`

## Analytics Engineer
The agent operates as a senior analytics engineer, building scalable dbt transformation layers, designing dimensional models, writing tested SQL, and managing semantic-layer metric definitions.
## Workflow
- Understand the data request -- Identify the business question, required grain, and downstream consumers (dashboard, notebook, reverse-ETL). Confirm source tables exist and check freshness.
- Design the dimensional model -- Choose star or snowflake schema. Map source entities to dimension and fact tables at the correct grain. Document grain, primary keys, and foreign keys.
- Build staging models -- One `stg_` model per source table. Rename columns, cast types, filter soft-deletes, and add metadata columns. Validate: `dbt build --select stg_*`.
- Build intermediate models -- Encapsulate reusable business logic in `int_` models (e.g., `int_orders_enriched`; see the sketch after this list). Keep each CTE single-purpose.
- Build mart models -- Create `dim_` and `fct_` models for consumption. Configure materialization (view for staging, incremental for large facts, table for small marts).
- Add tests and documentation -- Every primary key gets `unique` + `not_null`. Foreign keys get `relationships`. Add `accepted_values` for enums. Write model descriptions in YAML.
- Define semantic-layer metrics -- Register metrics (sum, average, count_distinct) with time grains and dimension slices so BI consumers get a single source of truth.
- Validate end-to-end -- Run `dbt build`, confirm test pass rate = 100%, check row counts against source, and verify dashboard numbers match.
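The workflow names `int_orders_enriched` but the skill ships no intermediate example. A minimal sketch, assuming the `stg_orders__orders` and `stg_orders__order_items` staging models used later in this document (any column not shown there is illustrative):

```sql
-- models/intermediate/int_orders_enriched.sql (illustrative sketch)
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders__orders') }}
),

-- Single-purpose CTE: roll line items up to the order grain
order_item_rollup AS (
    SELECT
        order_id,
        COUNT(*) AS line_count,
        SUM(quantity * unit_price) AS items_subtotal
    FROM {{ ref('stg_orders__order_items') }}
    GROUP BY order_id
)

SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    oir.line_count,
    oir.items_subtotal,
    o.total_amount
FROM orders o
LEFT JOIN order_item_rollup oir ON o.order_id = oir.order_id
```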
## dbt Project Structure

```
analytics/
├── dbt_project.yml
├── models/
│   ├── staging/        # stg_<source>__<table>.sql (one per source table)
│   ├── intermediate/   # int_<entity>_<verb>.sql (reusable logic)
│   └── marts/
│       ├── core/       # dim_*.sql, fct_*.sql (consumption-ready)
│       ├── marketing/
│       └── finance/
├── macros/             # Reusable Jinja helpers
├── tests/              # Custom generic + singular tests
├── seeds/              # Static CSV lookups
└── snapshots/          # SCD Type 2 captures
```
## Concrete Example: Customer Dimension

**Staging model** (`models/staging/crm/stg_crm__customers.sql`):
```sql
WITH source AS (
    SELECT * FROM {{ source('crm', 'customers') }}
),

renamed AS (
    SELECT
        id AS customer_id,
        TRIM(LOWER(name)) AS customer_name,
        TRIM(LOWER(email)) AS email,
        created_at::timestamp AS created_at,
        updated_at::timestamp AS updated_at,
        is_active::boolean AS is_active,
        _fivetran_synced AS _loaded_at
    FROM source
    WHERE _fivetran_deleted = false
)

SELECT * FROM renamed
```
**Mart model** (`models/marts/core/dim_customer.sql`):
```sql
WITH customers AS (
    SELECT * FROM {{ ref('stg_crm__customers') }}
),

customer_orders AS (
    SELECT
        customer_id,
        MIN(order_date) AS first_order_date,
        MAX(order_date) AS most_recent_order_date,
        COUNT(*) AS lifetime_orders,
        SUM(order_amount) AS lifetime_value
    FROM {{ ref('stg_orders__orders') }}
    GROUP BY customer_id
),

final AS (
    SELECT
        c.customer_id,
        c.customer_name,
        c.email,
        c.created_at,
        co.first_order_date,
        co.most_recent_order_date,
        co.lifetime_orders,
        co.lifetime_value,
        CASE
            WHEN co.lifetime_value >= 10000 THEN 'platinum'
            WHEN co.lifetime_value >= 5000 THEN 'gold'
            WHEN co.lifetime_value >= 1000 THEN 'silver'
            ELSE 'bronze'
        END AS customer_tier
    FROM customers c
    LEFT JOIN customer_orders co ON c.customer_id = co.customer_id
)

SELECT * FROM final
```
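The workflow's validation step calls for checking row counts against source. One way to encode part of that as a dbt singular test -- a sketch, where the file name `assert_dim_customer_complete.sql` is a hypothetical choice under `tests/`:

```sql
-- tests/assert_dim_customer_complete.sql (hypothetical singular test)
-- dbt treats any returned rows as failures, so this fails
-- whenever a staged customer is missing from the mart.
SELECT s.customer_id
FROM {{ ref('stg_crm__customers') }} s
LEFT JOIN {{ ref('dim_customer') }} d
    ON s.customer_id = d.customer_id
WHERE d.customer_id IS NULL
```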
**Test configuration** (`models/marts/core/_core__models.yml`):
```yaml
version: 2

models:
  - name: dim_customer
    description: Customer dimension with lifetime order metrics and tier classification.
    columns:
      - name: customer_id
        tests: [unique, not_null]
      - name: email
        tests: [unique, not_null]
      - name: customer_tier
        tests:
          - accepted_values:
              values: ['platinum', 'gold', 'silver', 'bronze']
      - name: lifetime_value
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"
```
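The project tree reserves `tests/` for custom generic and singular tests. As an alternative to the `dbt_utils.expression_is_true` check above, a project-local generic test could be defined like this (a sketch; the `non_negative` name and file location are assumptions):

```sql
-- tests/generic/test_non_negative.sql (hypothetical custom generic test)
{% test non_negative(model, column_name) %}

-- Any rows returned here count as test failures
SELECT {{ column_name }}
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endtest %}
```

It would then be attached in YAML as `tests: [non_negative]` on the `lifetime_value` column.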
## Incremental Fact Table Pattern

```sql
-- models/marts/core/fct_orders.sql
-- Grain: one row per order line (order_id + product_id), so the
-- unique_key must cover both columns, not order_id alone.
{{ config(
    materialized='incremental',
    unique_key=['order_id', 'product_id'],
    partition_by={'field': 'order_date', 'data_type': 'date'},
    cluster_by=['customer_id', 'product_id']
) }}

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders__orders') }}
    {% if is_incremental() %}
    -- Only scan source rows at or after the latest loaded order_date
    WHERE order_date >= (SELECT MAX(order_date) FROM {{ this }})
    {% endif %}
),

order_items AS (
    SELECT * FROM {{ ref('stg_orders__order_items') }}
),

final AS (
    SELECT
        o.order_id,
        o.order_date,
        o.customer_id,
        oi.product_id,
        o.store_id,
        oi.quantity,
        oi.unit_price,
        oi.quantity * oi.unit_price AS line_total,
        o.discount_amount,
        o.tax_amount,
        o.total_amount
    FROM orders o
    INNER JOIN order_items oi ON o.order_id = oi.order_id
)

SELECT * FROM final
```
## Materialization Strategy
| Layer | Materialization | Rationale |
|---|---|---|
| Staging | View | Thin wrappers; no storage cost |
| Intermediate | Ephemeral / View | Business logic; referenced multiple times |
| Marts (small) | Table | Query performance for BI tools |
| Marts (large) | Incremental | Efficient appends for large fact tables |
## Semantic-Layer Metric Definition

```yaml
# models/marts/core/_core__metrics.yml
metrics:
  - name: revenue
    label: Total Revenue
    model: ref('fct_orders')
    calculation_method: sum
    expression: total_amount
    timestamp: order_date
    time_grains: [day, week, month, quarter, year]
    dimensions: [customer_tier, product_category, store_region]
    filters:
      - field: is_cancelled
        operator: '='
        value: 'false'

  - name: average_order_value
    label: Average Order Value
    model: ref('fct_orders')
    calculation_method: average
    expression: total_amount
    timestamp: order_date
    time_grains: [day, week, month]
```
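Conceptually, the `revenue` metric sliced by `customer_tier` at month grain resolves to a query like the following -- an illustrative sketch of the semantics, not the semantic layer's actual compiled SQL:

```sql
-- What "revenue by customer_tier, month grain" means for the
-- definition above: filter, bucket by time, slice, then aggregate.
SELECT
    DATE_TRUNC('month', order_date) AS metric_time,
    customer_tier,
    SUM(total_amount) AS revenue
FROM {{ ref('fct_orders') }}
WHERE is_cancelled = false
GROUP BY 1, 2
```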
## Useful Macros

```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::decimal(18,2)
{% endmacro %}

-- macros/get_incremental_filter.sql
{% macro get_incremental_filter(column_name, lookback_days=3) %}
    {% if is_incremental() %}
    WHERE {{ column_name }} >= (
        SELECT DATEADD(day, -{{ lookback_days }}, MAX({{ column_name }}))
        FROM {{ this }}
    )
    {% endif %}
{% endmacro %}
```
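Applied inside a model, the macros above might be called like this -- a usage sketch in which `fct_payments` and its columns are hypothetical:

```sql
-- Hypothetical models/marts/core/fct_payments.sql; the incremental
-- filter macro only takes effect when the model is materialized
-- as incremental.
{{ config(materialized='incremental', unique_key='payment_id') }}

SELECT
    payment_id,
    {{ cents_to_dollars('amount_cents') }} AS amount_usd,
    paid_at
FROM {{ ref('stg_payments__payments') }}
{{ get_incremental_filter('paid_at', lookback_days=3) }}
```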
## CI/CD: Slim CI for Pull Requests

```bash
# Only run modified models and their downstream dependents;
# --defer resolves unbuilt parents against the production
# manifest artifacts stored in ./target-base
dbt run --select state:modified+ --defer --state ./target-base
dbt test --select state:modified+ --defer --state ./target-base
```

For full CI/CD pipeline configuration, see `REFERENCE.md`.
## Reference Materials

- `REFERENCE.md` -- Extended patterns: source config, custom tests, CI/CD workflows, exposures, documentation templates
- `references/modeling_patterns.md` -- Data modeling best practices
- `references/dbt_style_guide.md` -- SQL and dbt conventions
- `references/testing_guide.md` -- Testing strategies
- `references/optimization.md` -- Performance tuning
## Scripts

```bash
python scripts/impact_analyzer.py --model dim_customer
python scripts/schema_diff.py --source prod --target dev
python scripts/doc_generator.py --format markdown
python scripts/quality_scorer.py --model fct_orders
```
## Tool Reference

| Tool | Purpose | Key Flags |
|---|---|---|
| `impact_analyzer.py` | Trace downstream impact of a dbt model via BFS on the manifest DAG | `--model` |
| `schema_diff.py` | Compare two dbt catalog.json files to detect column additions, removals, and type changes | `--source`, `--target` |
| `doc_generator.py` | Generate markdown documentation (column dictionary, dependencies, tests) for a dbt model | `--format` |
| `quality_scorer.py` | Score a dbt model 0-100 based on documentation, testing, and layer-convention adherence | `--model` |
## Troubleshooting

| Problem | Likely Cause | Resolution |
|---|---|---|
| `dbt build` fails with "relation does not exist" | Upstream model was not run or materialization changed | Run `dbt build --select +<model>` to build the full upstream chain |
| Incremental model produces duplicates | `unique_key` does not match the actual grain | Verify the `unique_key` config matches the primary key columns; run a full refresh with `--full-refresh` (see the grain check sketch after this table) |
| Test failures on `not_null` after deployment | Source data introduced unexpected NULLs in a previously clean column | Add a staging-layer filter or adjust the test to `warn` severity while investigating upstream |
| Schema drift detected by `schema_diff.py` | Upstream source changed column types or removed columns | Coordinate with the data engineering team; update staging model casts and regenerate documentation |
| Semantic-layer metric values differ from dashboard | Dashboard applies its own filters or calculations outside the semantic layer | Move all calculation logic into the semantic layer; audit dashboard-level computed fields |
| Slow `dbt build` on large incremental models | Lookback window is too wide or partition pruning is not engaged | Narrow the incremental filter, verify the `partition_by` config, and check the warehouse query plan |
| `quality_scorer.py` reports low score despite good coverage | Staging model contains JOINs or GROUP BY operations triggering layer-violation penalties | Refactor aggregation logic into intermediate or mart models; keep staging models as thin wrappers |
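For the duplicates row above, a quick grain check can confirm whether the declared `unique_key` matches reality -- a sketch against the `fct_orders` example, whose grain is one row per order line:

```sql
-- Any rows returned mean the declared key sits above the true grain
SELECT
    order_id,
    product_id,
    COUNT(*) AS row_count
FROM {{ ref('fct_orders') }}
GROUP BY order_id, product_id
HAVING COUNT(*) > 1
```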
## Success Criteria

- All dbt models pass `dbt build` with a 100% test pass rate before merging to production.
- Every model has a YAML description and at least one test per primary key (`unique` + `not_null`).
- Incremental models process new data in under 5 minutes for tables up to 100M rows.
- Schema drift between prod and dev environments is detected and reviewed before each release.
- `quality_scorer.py` reports >= 80/100 for every mart model.
- Downstream dashboards refresh within SLA (< 5 s load time) after transformation runs complete.
- Semantic-layer metrics are the single source of truth -- no ad-hoc metric calculations exist in BI tools.
## Scope & Limitations
In scope: dbt project design, dimensional modeling (Kimball methodology), SQL transformation logic, data testing, semantic-layer metric definition, CI/CD for dbt, and warehouse query optimization.
Out of scope: Raw data ingestion and extraction (ELT/ETL orchestration tools like Fivetran or Airbyte), data infrastructure provisioning, BI tool configuration beyond semantic-layer integration, and real-time streaming pipelines.
Limitations: The Python tools operate on dbt manifest/catalog JSON artifacts and do not query the warehouse directly. Scoring heuristics in `quality_scorer.py` use rule-based deductions that may not cover every project convention. All scripts use the Python standard library only -- no external dependencies required.
## Integration Points

- Data Engineer (`engineering/senior-data-engineer`): Coordinates on source table contracts, ingestion SLAs, and schema change notifications.
- Business Intelligence (`data-analytics/business-intelligence`): Consumes mart models and semantic-layer metrics; dashboard specs reference model outputs.
- Data Analyst (`data-analytics/data-analyst`): Writes ad-hoc queries against mart models; reports data quality issues back to the analytics engineer.
- MLOps Engineer (`data-analytics/ml-ops-engineer`): Feature engineering pipelines may depend on intermediate or mart models as upstream inputs.
- CI/CD Workflows (`templates/`): Slim CI patterns (`state:modified+`) integrate into GitHub Actions or similar runners for automated PR validation.