# Dotfiles: databricks-ai-functions
Use Databricks built-in AI Functions (ai_classify, ai_extract, ai_summarize, ai_mask, ai_translate, ai_fix_grammar, ai_gen, ai_analyze_sentiment, ai_similarity, ai_parse_document, ai_query, ai_forecast) to add AI capabilities directly to SQL and PySpark pipelines without managing model endpoints. Also covers document parsing and building custom RAG pipelines (parse → chunk → index → query).
```bash
# Clone the full dotfiles repo
git clone https://github.com/msbaek/dotfiles

# Or copy only this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/msbaek/dotfiles "$T" && \
  mkdir -p ~/.claude/skills && \
  cp -r "$T/.claude/skills/databricks-ai-functions" ~/.claude/skills/msbaek-dotfiles-databricks-ai-functions && \
  rm -rf "$T"
```
`.claude/skills/databricks-ai-functions/SKILL.md`

# Databricks AI Functions

Official docs: https://docs.databricks.com/aws/en/large-language-models/ai-functions
Individual function reference: https://docs.databricks.com/aws/en/sql/language-manual/functions/
## Overview
Databricks AI Functions are built-in SQL and PySpark functions that call Foundation Model APIs directly from your data pipelines — no model endpoint setup, no API keys, no boilerplate. They operate on table columns as naturally as `UPPER()` or `LENGTH()`, and are optimized for batch inference at scale.
There are three categories:
| Category | Functions | Use when |
|---|---|---|
| Task-specific | `ai_analyze_sentiment`, `ai_classify`, `ai_extract`, `ai_fix_grammar`, `ai_gen`, `ai_mask`, `ai_parse_document`, `ai_similarity`, `ai_summarize`, `ai_translate` | The task is well-defined — prefer these always |
| General-purpose | `ai_query` | Complex nested JSON, custom endpoints, multimodal — last resort only |
| Table-valued | `ai_forecast` | Time series forecasting |
**Function selection rule** — always prefer a task-specific function over `ai_query`:
| Task | Use this | Fall back to `ai_query` when... |
|---|---|---|
| Sentiment scoring | `ai_analyze_sentiment` | Never |
| Fixed-label routing | `ai_classify` (2–500 labels; add descriptions for accuracy) | Never |
| Entity / field extraction | `ai_extract` | Never |
| Summarization | `ai_summarize` | Never — use `ai_gen` for uncapped length |
| Grammar correction | `ai_fix_grammar` | Never |
| Translation | `ai_translate` | Target language not in the supported list |
| PII redaction | `ai_mask` | Never |
| Free-form generation | `ai_gen` | Need structured JSON output |
| Semantic similarity | `ai_similarity` | Never |
| PDF / document parsing | `ai_parse_document` | Need image-level reasoning |
| Complex JSON / reasoning | `ai_query` | — (this is the intended use case for `ai_query`) |
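To make the rule concrete, here is a sketch of the same routing task done both ways; the `tickets` table and its `ticket_text` column are hypothetical:

```sql
-- Preferred: task-specific, no prompt engineering, no model choice
SELECT ai_classify(ticket_text, ARRAY('urgent', 'not urgent')) AS priority
FROM tickets;

-- Last resort: the same task forced through ai_query
SELECT ai_query(
  'databricks-claude-sonnet-4',
  concat('Answer with exactly "urgent" or "not urgent": ', ticket_text)
) AS priority
FROM tickets;
```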
## Prerequisites
- Databricks SQL warehouse (not Classic) or cluster with DBR 15.1+
- DBR 15.4 ML LTS recommended for batch workloads
- DBR 17.1+ required for `ai_parse_document`
- `ai_forecast` requires a Pro or Serverless SQL warehouse
- Workspace in a supported AWS/Azure region for batch AI inference
- Models run under the Apache 2.0 or Llama 3.3 Community License — customers are responsible for compliance
## Quick Start
Classify, extract, and score sentiment from a text column in a single query:
```sql
SELECT
  ticket_id,
  ticket_text,
  ai_classify(ticket_text, ARRAY('urgent', 'not urgent', 'spam')) AS priority,
  ai_extract(ticket_text, ARRAY('product', 'error_code', 'date')) AS entities,
  ai_analyze_sentiment(ticket_text) AS sentiment
FROM support_tickets;
```
```python
from pyspark.sql.functions import expr

df = spark.table("support_tickets")
df = (
    df.withColumn("priority", expr("ai_classify(ticket_text, array('urgent', 'not urgent', 'spam'))"))
      .withColumn("entities", expr("ai_extract(ticket_text, array('product', 'error_code', 'date'))"))
      .withColumn("sentiment", expr("ai_analyze_sentiment(ticket_text)"))
)

# Access nested STRUCT fields from ai_extract
df.select(
    "ticket_id", "priority", "sentiment",
    "entities.product", "entities.error_code", "entities.date",
).display()
```
## Common Patterns

### Pattern 1: Text Analysis Pipeline
Chain multiple task-specific functions to enrich a text column in one pass:
```sql
SELECT
  id,
  content,
  ai_analyze_sentiment(content) AS sentiment,
  ai_summarize(content, 30) AS summary,
  ai_classify(content, ARRAY('technical', 'billing', 'other')) AS category,
  ai_fix_grammar(content) AS content_clean
FROM raw_feedback;
```
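The same pipeline as a minimal PySpark sketch, assuming the same `raw_feedback` table:

```python
from pyspark.sql.functions import expr

enriched = (
    spark.table("raw_feedback")
    .withColumn("sentiment", expr("ai_analyze_sentiment(content)"))
    .withColumn("summary", expr("ai_summarize(content, 30)"))
    .withColumn("category", expr("ai_classify(content, array('technical', 'billing', 'other'))"))
    .withColumn("content_clean", expr("ai_fix_grammar(content)"))
)
```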
### Pattern 2: PII Redaction Before Storage
```python
from pyspark.sql.functions import expr

df_clean = (
    spark.table("raw_messages")
    .withColumn(
        "message_safe",
        expr("ai_mask(message, array('person', 'email', 'phone', 'address'))"),
    )
)
df_clean.write.format("delta").mode("append").saveAsTable("catalog.schema.messages_safe")
```
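For ad-hoc redaction the same call works in plain SQL; a sketch against the same `raw_messages` table:

```sql
SELECT ai_mask(message, ARRAY('person', 'email', 'phone', 'address')) AS message_safe
FROM raw_messages;
```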
### Pattern 3: Document Ingestion from a Unity Catalog Volume
Parse PDFs/Office docs, then enrich with task-specific functions:
```python
from pyspark.sql.functions import expr

df = (
    spark.read.format("binaryFile")
    .load("/Volumes/catalog/schema/landing/documents/")
    .withColumn("parsed", expr("ai_parse_document(content)"))
    .selectExpr(
        "path",
        "parsed:pages[*].elements[*].content AS text_blocks",
        "parsed:error AS parse_error",
    )
    .filter("parse_error IS NULL")
    .withColumn("summary", expr("ai_summarize(text_blocks, 50)"))
    .withColumn("entities", expr("ai_extract(text_blocks, array('date', 'amount', 'vendor'))"))
)
```
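For the parse → chunk → index flow mentioned in the overview, one more step is needed before indexing: the `:` path operator returns `text_blocks` as a JSON-array string, so a chunking pass typically parses and explodes it. A minimal sketch; the `doc_chunks` target table name is an assumption:

```python
from pyspark.sql.functions import col, explode, from_json

# Parse the JSON-array string and explode into one row per block,
# so each chunk can be embedded and indexed by Vector Search.
chunks = (
    df.withColumn("chunk", explode(from_json(col("text_blocks"), "ARRAY<STRING>")))
      .select("path", "chunk")
)
chunks.write.format("delta").mode("overwrite").saveAsTable("catalog.schema.doc_chunks")
```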
### Pattern 4: Semantic Matching / Deduplication
```sql
-- Find near-duplicate company names
SELECT a.id, b.id, ai_similarity(a.name, b.name) AS score
FROM companies a
JOIN companies b ON a.id < b.id
WHERE ai_similarity(a.name, b.name) > 0.85;
```
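The query above evaluates `ai_similarity` twice per pair. A common way to avoid recomputing it is to score in a subquery and filter afterward:

```sql
SELECT id_a, id_b, score
FROM (
  SELECT a.id AS id_a, b.id AS id_b, ai_similarity(a.name, b.name) AS score
  FROM companies a
  JOIN companies b ON a.id < b.id
) pairs
WHERE score > 0.85;
```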
### Pattern 5: Complex JSON Extraction with ai_query (last resort)

Use `ai_query` only when the output schema has nested arrays or requires multi-step reasoning that no task-specific function handles:
```python
from pyspark.sql.functions import expr, from_json, col

df = (
    spark.table("parsed_documents")
    .withColumn("ai_response", expr("""
        ai_query(
          'databricks-claude-sonnet-4',
          concat('Extract invoice as JSON with nested itens array: ', text_blocks),
          responseFormat => '{"type":"json_object"}',
          failOnError => false
        )
    """))
    .withColumn("invoice", from_json(
        col("ai_response.response"),
        "STRUCT<numero:STRING, total:DOUBLE, "
        "itens:ARRAY<STRUCT<codigo:STRING, descricao:STRING, qtde:DOUBLE, vlrUnit:DOUBLE>>>",
    ))
)
```
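A typical follow-up flattens the nested line items into tabular form. A sketch that keeps only rows where `ai_query` succeeded (`errorMessage` is the field populated on failure when `failOnError => false`):

```python
from pyspark.sql.functions import col, explode

line_items = (
    df.filter(col("ai_response.errorMessage").isNull())   # drop rows that failed
      .withColumn("item", explode(col("invoice.itens")))  # one row per line item
      .select(
          col("invoice.numero").alias("invoice_number"),
          col("item.codigo").alias("item_code"),
          col("item.qtde").alias("quantity"),
          col("item.vlrUnit").alias("unit_price"),
      )
)
```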
### Pattern 6: Time Series Forecasting
```sql
SELECT *
FROM ai_forecast(
  observed  => TABLE(SELECT date, sales FROM daily_sales),
  horizon   => '2026-12-31',
  time_col  => 'date',
  value_col => 'sales'
);
-- Returns: date, sales_forecast, sales_upper, sales_lower
```
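For one forecast per series, 3-ai-forecast.md documents multi-group patterns; a sketch assuming the table has a `store_id` column to pass as `group_col`:

```sql
SELECT *
FROM ai_forecast(
  observed  => TABLE(SELECT store_id, date, sales FROM daily_sales),
  horizon   => '2026-12-31',
  time_col  => 'date',
  value_col => 'sales',
  group_col => 'store_id'
);
```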
## Reference Files

- 1-task-functions.md — Full syntax, parameters, and SQL + PySpark examples for all 9 task-specific functions (`ai_analyze_sentiment`, `ai_classify`, `ai_extract`, `ai_fix_grammar`, `ai_gen`, `ai_mask`, `ai_similarity`, `ai_summarize`, `ai_translate`) and `ai_parse_document`
- 2-ai-query.md — Complete `ai_query` reference: all parameters, structured output with `responseFormat`, multimodal `files =>`, UDF patterns, and error handling
- 3-ai-forecast.md — `ai_forecast` parameters; single-metric, multi-group, multi-metric, and confidence interval patterns
- 4-document-processing-pipeline.md — End-to-end batch document processing with AI Functions in a Lakeflow Declarative Pipeline; includes `config.yml` centralization, function selection logic, a custom RAG pipeline (parse → chunk → Vector Search), and DSPy/LangChain guidance for near-real-time variants
## Common Issues
| Issue | Solution |
|---|---|
| `ai_parse_document` not found | Requires DBR 17.1+. Check the cluster runtime. |
| `ai_forecast` fails | Requires a Pro or Serverless SQL warehouse — not available on Classic or Starter. |
| All functions return NULL | The input column is NULL. Filter with `WHERE column IS NOT NULL` before calling. |
| `ai_translate` fails for a language | Supported: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai. Use `ai_query` with a multilingual model for others. |
| `ai_classify` returns unexpected labels | Use clear, mutually exclusive label names. Fewer labels (2–5) produce more reliable results. |
| `ai_query` raises on some rows in a batch job | Add `failOnError => false` — returns a STRUCT with `response` and `errorMessage` instead of raising; see the sketch after this table. |
| Batch job runs slowly | Use a DBR 15.4 ML LTS cluster (not serverless or interactive) for optimized batch inference throughput. |
| Want to swap models without editing pipeline code | Store all model names and prompts in `config.yml` — see 4-document-processing-pipeline.md for the pattern. |
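A minimal sketch of the `failOnError` row above, combined with the NULL filter; the `docs` table and its `text` column are hypothetical:

```sql
-- Inspect failures instead of letting one bad row abort the batch
SELECT
  text,
  result.response     AS summary,
  result.errorMessage AS error
FROM (
  SELECT
    text,
    ai_query(
      'databricks-claude-sonnet-4',
      concat('Summarize in one sentence: ', text),
      failOnError => false
    ) AS result
  FROM docs
  WHERE text IS NOT NULL
) q;
```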