# Dotfiles databricks-dbsql

## Install

Clone the upstream repo:

```bash
git clone https://github.com/msbaek/dotfiles
```

Claude Code: install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/msbaek/dotfiles "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/.claude/skills/databricks-dbsql" ~/.claude/skills/msbaek-dotfiles-databricks-dbsql \
  && rm -rf "$T"
```

Manifest:

- `.claude/skills/databricks-dbsql/SKILL.md` (source content reproduced below)
# Databricks SQL (DBSQL) - Advanced Features

## Quick Reference
| Feature | Key Syntax | Since | Reference |
|---|---|---|---|
| SQL Scripting | `BEGIN...END`, `DECLARE`, `SET` | DBR 16.3+ | sql-scripting.md |
| Stored Procedures | `CREATE PROCEDURE`, `CALL` | DBR 17.0+ | sql-scripting.md |
| Recursive CTEs | `WITH RECURSIVE` | DBR 17.0+ | sql-scripting.md |
| Transactions | | Preview | sql-scripting.md |
| Materialized Views | `CREATE MATERIALIZED VIEW` | Pro/Serverless | materialized-views-pipes.md |
| Temp Tables | | All | materialized-views-pipes.md |
| Pipe Syntax | `\|>` operator | DBR 16.1+ | materialized-views-pipes.md |
| Geospatial (H3) | `h3_longlatash3`, `h3_toparent` | DBR 11.2+ | geospatial-collations.md |
| Geospatial (ST) | `ST_Point`, `ST_Distance`, 80+ funcs | DBR 16.0+ | geospatial-collations.md |
| Collations | `COLLATE`, `UTF8_LCASE`, locale-aware | DBR 16.1+ | geospatial-collations.md |
| AI Functions | `ai_query`, `ai_classify`, 11+ funcs | DBR 15.1+ | ai-functions.md |
| http_request | `http_request()` | Pro/Serverless | ai-functions.md |
| remote_query | `remote_query()` | Pro/Serverless | ai-functions.md |
| read_files | `read_files()` | All | ai-functions.md |
| Data Modeling | Star schema, Liquid Clustering | All | best-practices.md |
## Common Patterns

### SQL Scripting - Procedural ETL

```sql
BEGIN
  DECLARE v_count INT;
  DECLARE v_status STRING DEFAULT 'pending';

  SET v_count = (SELECT COUNT(*) FROM catalog.schema.raw_orders WHERE status = 'new');

  IF v_count > 0 THEN
    INSERT INTO catalog.schema.processed_orders
    SELECT *, current_timestamp() AS processed_at
    FROM catalog.schema.raw_orders
    WHERE status = 'new';
    SET v_status = 'completed';
  ELSE
    SET v_status = 'skipped';
  END IF;

  SELECT v_status AS result, v_count AS rows_processed;
END
```
### Stored Procedure with Error Handling

```sql
CREATE OR REPLACE PROCEDURE catalog.schema.upsert_customers(
  IN p_source STRING,
  OUT p_rows_affected INT
)
LANGUAGE SQL
SQL SECURITY INVOKER
BEGIN
  DECLARE EXIT HANDLER FOR SQLEXCEPTION
  BEGIN
    SET p_rows_affected = -1;
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = concat('Upsert failed for source: ', p_source);
  END;

  MERGE INTO catalog.schema.dim_customer AS t
  USING (SELECT * FROM identifier(p_source)) AS s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *;

  SET p_rows_affected = (SELECT COUNT(*) FROM identifier(p_source));
END;

-- Invoke, passing a session variable for the OUT parameter:
-- DECLARE v_rows INT;
-- CALL catalog.schema.upsert_customers('catalog.schema.staging_customers', v_rows);
```
### Materialized View with Scheduled Refresh

```sql
CREATE OR REPLACE MATERIALIZED VIEW catalog.schema.daily_revenue
CLUSTER BY (order_date)
SCHEDULE EVERY 1 HOUR
COMMENT 'Hourly-refreshed daily revenue by region'
AS
SELECT
  order_date,
  region,
  SUM(amount) AS total_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM catalog.schema.fact_orders
JOIN catalog.schema.dim_store USING (store_id)
GROUP BY order_date, region;
```
### Pipe Syntax - Readable Transformations

```sql
-- Traditional SQL rewritten with pipe syntax
FROM catalog.schema.fact_orders
|> WHERE order_date >= current_date() - INTERVAL 30 DAYS
|> AGGREGATE SUM(amount) AS total, COUNT(*) AS cnt
   GROUP BY region, product_category
|> WHERE total > 10000
|> ORDER BY total DESC
|> LIMIT 20;
```
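For comparison, here is a sketch of how the same query could be written in traditional syntax. Note how the filter on the aggregate must move into a `HAVING` clause, and the clause order no longer mirrors the data flow:

```sql
-- Traditional equivalent (illustrative): aggregate filter goes into HAVING
SELECT
  region,
  product_category,
  SUM(amount) AS total,
  COUNT(*) AS cnt
FROM catalog.schema.fact_orders
WHERE order_date >= current_date() - INTERVAL 30 DAYS
GROUP BY region, product_category
HAVING SUM(amount) > 10000
ORDER BY total DESC
LIMIT 20;
```

With pipe syntax, each `|>` step applies to the result of the previous step, so post-aggregation filtering is just another `WHERE`.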
### AI Functions - Enrich Data with LLMs

```sql
-- Classify support tickets
SELECT
  ticket_id,
  description,
  ai_classify(description, ARRAY('billing', 'technical', 'account', 'feature_request')) AS category,
  ai_analyze_sentiment(description) AS sentiment
FROM catalog.schema.support_tickets
LIMIT 100;

-- Extract entities from text
SELECT
  doc_id,
  ai_extract(content, ARRAY('person_name', 'company', 'dollar_amount')) AS entities
FROM catalog.schema.contracts;

-- General-purpose AI query with structured output
SELECT
  ai_query(
    'databricks-meta-llama-3-3-70b-instruct',
    concat('Summarize this customer feedback in JSON with keys: topic, sentiment, action_items. Feedback: ', feedback),
    returnType => 'STRUCT<topic STRING, sentiment STRING, action_items ARRAY<STRING>>'
  ) AS analysis
FROM catalog.schema.customer_feedback
LIMIT 50;
```
### Geospatial - Proximity Search with H3

```sql
-- Find stores within 5km of each customer using H3 indexing
WITH customer_h3 AS (
  SELECT *, h3_longlatash3(longitude, latitude, 7) AS h3_cell
  FROM catalog.schema.customers
),
store_h3 AS (
  SELECT *, h3_longlatash3(longitude, latitude, 7) AS h3_cell
  FROM catalog.schema.stores
)
SELECT
  c.customer_id,
  s.store_id,
  ST_Distance(
    ST_Point(c.longitude, c.latitude),
    ST_Point(s.longitude, s.latitude)
  ) AS distance_m
FROM customer_h3 c
JOIN store_h3 s
  ON h3_ischildof(c.h3_cell, h3_toparent(s.h3_cell, 5))
WHERE ST_Distance(
    ST_Point(c.longitude, c.latitude),
    ST_Point(s.longitude, s.latitude)
  ) < 5000;
```
### Collation - Case-Insensitive Search

```sql
-- Create table with case-insensitive collation
CREATE TABLE catalog.schema.products (
  product_id BIGINT GENERATED ALWAYS AS IDENTITY,
  name STRING COLLATE UTF8_LCASE,
  category STRING COLLATE UTF8_LCASE,
  price DECIMAL(10, 2)
);

-- Queries are automatically case-insensitive (no LOWER() needed)
SELECT * FROM catalog.schema.products
WHERE name = 'MacBook Pro';  -- matches 'macbook pro', 'MACBOOK PRO', etc.
```
### http_request - Call External APIs

```sql
-- Set up connection first (one-time)
CREATE CONNECTION my_api_conn TYPE HTTP
OPTIONS (host 'https://api.example.com', bearer_token secret('scope', 'token'));

-- Call API from SQL
SELECT
  order_id,
  http_request(
    conn => 'my_api_conn',
    method => 'POST',
    path => '/v1/validate',
    json => to_json(named_struct('order_id', order_id, 'amount', amount))
  ).text AS api_response
FROM catalog.schema.orders
WHERE needs_validation = true;
```
### read_files - Ingest Raw Files

```sql
-- Read JSON files from a Volume with schema hints
SELECT * FROM read_files(
  '/Volumes/catalog/schema/raw/events/',
  format => 'json',
  schemaHints => 'event_id STRING, timestamp TIMESTAMP, payload MAP<STRING, STRING>',
  pathGlobFilter => '*.json',
  recursiveFileLookup => true
);

-- Read CSV with options
SELECT * FROM read_files(
  '/Volumes/catalog/schema/raw/sales/',
  format => 'csv',
  header => true,
  delimiter => '|',
  dateFormat => 'yyyy-MM-dd',
  schema => 'sale_id INT, sale_date DATE, amount DECIMAL(10,2), store STRING'
);
```
### Recursive CTE - Hierarchy Traversal

```sql
WITH RECURSIVE org_chart AS (
  -- Anchor: top-level managers
  SELECT employee_id, name, manager_id, 0 AS depth, ARRAY(name) AS path
  FROM catalog.schema.employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive step: direct reports
  SELECT e.employee_id, e.name, e.manager_id, o.depth + 1, array_append(o.path, e.name)
  FROM catalog.schema.employees e
  JOIN org_chart o ON e.manager_id = o.employee_id
  WHERE o.depth < 10  -- safety limit
)
SELECT * FROM org_chart
ORDER BY depth, name;
```
### remote_query - Federated Queries

```sql
-- Query PostgreSQL via Lakehouse Federation
SELECT * FROM remote_query(
  'my_postgres_connection',
  database => 'my_database',
  query => 'SELECT customer_id, email, created_at FROM customers WHERE active = true'
);
```
## Reference Files
Load these for detailed syntax, full parameter lists, and advanced patterns:
| File | Contents | When to Read |
|---|---|---|
| sql-scripting.md | SQL Scripting, Stored Procedures, Recursive CTEs, Transactions | User needs procedural SQL, error handling, loops, dynamic SQL |
| materialized-views-pipes.md | Materialized Views, Temp Tables/Views, Pipe Syntax | User needs MVs, refresh scheduling, temp objects, pipe operator |
| geospatial-collations.md | 39 H3 functions, 80+ ST functions, Collation types and hierarchy | User needs spatial analysis, H3 indexing, case/accent handling |
| ai-functions.md | 13 AI functions, http_request, remote_query, read_files (all options) | User needs AI enrichment, API calls, federation, file ingestion |
| best-practices.md | Data modeling, performance, Liquid Clustering, anti-patterns | User needs architecture guidance, optimization, or modeling advice |
## Key Guidelines

- Always use Serverless SQL warehouses for AI functions, MVs, and `http_request`
- Use `LIMIT` during development with AI functions to control costs
- Prefer Liquid Clustering over partitioning for new tables (1-4 keys max)
- Use `CLUSTER BY AUTO` when unsure about clustering keys
- Star schema in Gold layer for BI; OBT acceptable in Silver
- Define PK/FK constraints on dimensional models for query optimization
- Use `COLLATE UTF8_LCASE` for user-facing string columns that need case-insensitive search
- Use MCP tools (`execute_sql`, `execute_sql_multi`) to test and validate all SQL before deploying
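Several of these guidelines can be combined in one DDL sketch. The schema below is illustrative (table and column names are hypothetical), showing Liquid Clustering, `CLUSTER BY AUTO`, collation, and PK/FK constraints together:

```sql
-- Illustrative Gold-layer dimension and fact tables.
-- Note: PK/FK constraints in Databricks are informational (not enforced,
-- except NOT NULL), but they aid the optimizer and BI tools.
CREATE TABLE catalog.gold.dim_customer (
  customer_id BIGINT NOT NULL,
  name STRING COLLATE UTF8_LCASE,  -- case-insensitive user-facing search
  region STRING,
  CONSTRAINT pk_dim_customer PRIMARY KEY (customer_id)
)
CLUSTER BY AUTO;  -- let Databricks choose clustering keys

CREATE TABLE catalog.gold.fact_orders (
  order_id BIGINT NOT NULL,
  customer_id BIGINT,
  order_date DATE,
  amount DECIMAL(10, 2),
  CONSTRAINT pk_fact_orders PRIMARY KEY (order_id),
  CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id)
    REFERENCES catalog.gold.dim_customer (customer_id)
)
CLUSTER BY (order_date);  -- explicit key when the access pattern is known
```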