Skillshub databricks-observability
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/databricks-observability" ~/.claude/skills/comeonoliver-skillshub-databricks-observability && rm -rf "$T"
manifest:
skills/jeremylongshore/claude-code-plugins-plus-skills/databricks-observability/SKILL.md
Databricks Observability
Overview
Monitor Databricks jobs, clusters, SQL warehouses, and costs using system tables in the system catalog. System tables provide queryable observability data: system.lakeflow (job runs), system.billing (costs), system.query (SQL history), system.access (audit logs), and system.compute (cluster metrics). Data updates throughout the day and is not real-time.
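Before building the monitoring queries below, it helps to confirm the system tables are actually visible from your workspace. A minimal sanity check, assuming an account admin has already enabled the system schemas:

```sql
-- List the system schemas visible to you; billing, lakeflow,
-- query, and access should appear once enabled
SHOW SCHEMAS IN system;

-- Spot-check that billing data is flowing (can lag by up to ~24h)
SELECT MAX(usage_date) AS latest_usage_date
FROM system.billing.usage;
```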
Prerequisites
- Databricks Premium or Enterprise with Unity Catalog enabled
- Access to the system.billing, system.lakeflow, system.query, and system.access schemas
- SQL warehouse for running monitoring queries
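If any of these schemas are not accessible, a metastore admin can grant read access. A hedged sketch, where `data_engineers` is a hypothetical account group and the exact privilege set on the system catalog may differ in your environment:

```sql
-- 'data_engineers' is a placeholder group; substitute your own
GRANT USE CATALOG ON CATALOG system TO `data_engineers`;
GRANT USE SCHEMA ON SCHEMA system.billing TO `data_engineers`;
GRANT SELECT ON SCHEMA system.billing TO `data_engineers`;
-- repeat for system.lakeflow, system.query, and system.access
```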
Instructions
Step 1: Job Health Monitoring
```sql
-- Job success/failure over last 24 hours
SELECT
  COUNT(CASE WHEN result_state = 'SUCCESS' THEN 1 END) AS succeeded,
  COUNT(CASE WHEN result_state = 'FAILED' THEN 1 END) AS failed,
  COUNT(CASE WHEN result_state = 'TIMED_OUT' THEN 1 END) AS timed_out,
  ROUND(100.0 * COUNT(CASE WHEN result_state = 'SUCCESS' THEN 1 END) / COUNT(*), 1) AS success_rate_pct,
  ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) AS avg_duration_min
FROM system.lakeflow.job_run_timeline
WHERE start_time > current_timestamp() - INTERVAL 24 HOURS;

-- Failed jobs with error details
SELECT
  job_id,
  run_name,
  result_state,
  start_time,
  end_time,
  TIMESTAMPDIFF(MINUTE, start_time, end_time) AS duration_min,
  error_message
FROM system.lakeflow.job_run_timeline
WHERE result_state = 'FAILED'
  AND start_time > current_timestamp() - INTERVAL 24 HOURS
ORDER BY start_time DESC;
```
Step 2: Cluster Utilization and Costs
```sql
-- DBU consumption by cluster (last 7 days)
SELECT
  usage_metadata.cluster_id,
  COALESCE(usage_metadata.cluster_name, 'unnamed') AS cluster_name,
  u.sku_name,
  SUM(usage_quantity) AS total_dbus,
  ROUND(SUM(usage_quantity * p.pricing.default), 2) AS cost_usd
FROM system.billing.usage u
LEFT JOIN system.billing.list_prices p
  ON u.sku_name = p.sku_name
WHERE u.usage_date >= current_date() - INTERVAL 7 DAYS
GROUP BY usage_metadata.cluster_id, cluster_name, u.sku_name
ORDER BY cost_usd DESC
LIMIT 20;
```
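If your clusters and jobs carry cost-attribution tags, the same billing data can be rolled up by team instead of by cluster. A sketch that assumes a custom tag literally named `team` is applied to compute resources; adjust the key to whatever tagging convention you use:

```sql
-- Cost by team tag (last 7 days); untagged usage is grouped separately
SELECT
  COALESCE(custom_tags['team'], 'untagged') AS team,
  u.sku_name,
  SUM(usage_quantity) AS total_dbus,
  ROUND(SUM(usage_quantity * p.pricing.default), 2) AS cost_usd
FROM system.billing.usage u
LEFT JOIN system.billing.list_prices p
  ON u.sku_name = p.sku_name
WHERE u.usage_date >= current_date() - INTERVAL 7 DAYS
GROUP BY COALESCE(custom_tags['team'], 'untagged'), u.sku_name
ORDER BY cost_usd DESC;
```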
Step 3: SQL Warehouse Performance
```sql
-- Slow queries (>30s) on SQL warehouses
SELECT
  warehouse_id,
  statement_id,
  executed_by,
  ROUND(total_duration_ms / 1000, 1) AS duration_sec,
  rows_produced,
  ROUND(bytes_scanned / 1048576, 1) AS scanned_mb,
  LEFT(statement_text, 200) AS query_preview
FROM system.query.history
WHERE total_duration_ms > 30000
  AND start_time > current_timestamp() - INTERVAL 24 HOURS
ORDER BY total_duration_ms DESC
LIMIT 50;

-- Warehouse queue times (right-sizing indicator)
SELECT
  warehouse_id,
  warehouse_name,
  COUNT(*) AS query_count,
  ROUND(AVG(total_duration_ms) / 1000, 1) AS avg_sec,
  ROUND(MAX(queue_duration_ms) / 1000, 1) AS max_queue_sec
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY warehouse_id, warehouse_name;
```
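When a warehouse looks undersized, or one team dominates it, a per-user breakdown of the same history table helps decide whether to resize or split it. A sketch reusing only columns already referenced above:

```sql
-- Heaviest warehouse users over the last 7 days
SELECT
  executed_by,
  COUNT(*) AS query_count,
  ROUND(SUM(total_duration_ms) / 60000, 1) AS total_exec_min,
  ROUND(SUM(bytes_scanned) / 1073741824, 1) AS scanned_gb
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY executed_by
ORDER BY total_exec_min DESC
LIMIT 20;
```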
Step 4: Cost-per-Job Analysis
```sql
SELECT
  j.name AS job_name,
  COUNT(DISTINCT r.run_id) AS run_count,
  ROUND(AVG(TIMESTAMPDIFF(MINUTE, r.start_time, r.end_time)), 1) AS avg_min,
  ROUND(SUM(b.usage_quantity), 1) AS total_dbus,
  ROUND(SUM(b.usage_quantity * p.pricing.default), 2) AS total_cost_usd
FROM system.lakeflow.job_run_timeline r
JOIN system.lakeflow.jobs j
  ON r.job_id = j.job_id
LEFT JOIN system.billing.usage b
  ON r.run_id = b.usage_metadata.job_run_id
LEFT JOIN system.billing.list_prices p
  ON b.sku_name = p.sku_name
WHERE r.start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY j.name
ORDER BY total_cost_usd DESC
LIMIT 15;
```
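Cost usually tracks runtime, so a companion view of per-job duration drift can flag regressions before they show up in the bill. A sketch built only on columns already used above:

```sql
-- Average daily runtime per job over the last 14 days (successful runs only)
SELECT
  job_id,
  DATE(start_time) AS run_date,
  COUNT(*) AS runs,
  ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) AS avg_min
FROM system.lakeflow.job_run_timeline
WHERE start_time > current_timestamp() - INTERVAL 14 DAYS
  AND result_state = 'SUCCESS'
GROUP BY job_id, DATE(start_time)
ORDER BY job_id, run_date;
```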
Step 5: SQL Alerts for Automated Notifications
```sql
-- Create as a SQL Alert: trigger when failure_count > 3
-- Schedule: every 15 minutes
-- Notification destination: Slack/email
SELECT COUNT(*) AS failure_count
FROM system.lakeflow.job_run_timeline
WHERE result_state = 'FAILED'
  AND start_time > current_timestamp() - INTERVAL 1 HOUR;
```
```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create SQL alert programmatically
alert = w.alerts.create(
    name="Hourly Job Failure Alert",
    query_id="<saved-query-id>",
    options={"column": "failure_count", "op": ">", "value": "3"},
    rearm=900,  # re-alert after 15 min if still triggered
)
```
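The same alerting pattern works for cost guardrails. A hedged sketch of an alert query that compares yesterday's spend against the trailing 7-day daily average; attach the alert condition to cost_yesterday_usd with whatever threshold fits your budget:

```sql
-- Save as a query and alert on cost_yesterday_usd
SELECT
  ROUND(SUM(CASE WHEN u.usage_date = current_date() - 1
                 THEN usage_quantity * p.pricing.default END), 2) AS cost_yesterday_usd,
  ROUND(SUM(usage_quantity * p.pricing.default) / 7, 2) AS avg_daily_cost_7d_usd
FROM system.billing.usage u
LEFT JOIN system.billing.list_prices p
  ON u.sku_name = p.sku_name
WHERE u.usage_date >= current_date() - INTERVAL 7 DAYS;
```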
Step 6: Export Metrics to External Systems
```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Export cluster state metrics for Prometheus/Datadog
for cluster in w.clusters.list():
    if cluster.state and cluster.state.value == "RUNNING":
        print(f'databricks_cluster_workers{{name="{cluster.cluster_name}"}} '
              f"{cluster.num_workers}")
        print(f'databricks_cluster_running{{name="{cluster.cluster_name}"}} 1')

# Export job success rate for Grafana
runs = list(w.jobs.list_runs(limit=100, completed_only=True))
success = sum(
    1 for r in runs
    if r.state and r.state.result_state and r.state.result_state.value == "SUCCESS"
)
if runs:
    print(f"databricks_job_success_rate {success / len(runs):.2f}")
```
Step 7: Audit Log Monitoring
```sql
-- Security: who accessed what in the last 7 days
SELECT
  event_time,
  user_identity.email,
  action_name,
  request_params,
  response.status_code
FROM system.access.audit
WHERE service_name IN ('unityCatalog', 'jobs', 'clusters')
  AND event_date >= current_date() - 7
  AND action_name NOT IN ('getStatus', 'list')  -- exclude noisy reads
ORDER BY event_time DESC
LIMIT 100;
```
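Beyond general access review, the same audit table can surface denied or failing requests, which often point to permission misconfigurations or probing. A sketch using only fields already referenced above; status-code semantics vary somewhat by service:

```sql
-- Requests that returned an error status in the last 7 days
SELECT
  event_time,
  user_identity.email,
  service_name,
  action_name,
  response.status_code
FROM system.access.audit
WHERE event_date >= current_date() - 7
  AND response.status_code >= 400
ORDER BY event_time DESC
LIMIT 100;
```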
Output
- Job health dashboard (success rate, duration, failures)
- Cluster cost breakdown by team and SKU
- SQL warehouse performance report (slow queries, queue times)
- Per-job cost analysis
- Automated SQL alerts with Slack/email notifications
- External metric export for Prometheus/Grafana
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| System tables empty | Unity Catalog not enabled | Enable in Account Console > Settings |
| Schema not accessible | System table access not granted | Ask a metastore admin to grant access (see Prerequisites) |
| Billing data delayed | System table refresh lag (up to 24h) | Use for trends and alerts, not real-time |
| Query history missing | Serverless queries not tracked | Use classic SQL warehouse or check retention |
Examples
Daily Standup Dashboard
```sql
-- Single query for daily pipeline health
SELECT
  'Last 24h' AS period,
  COUNT(*) AS total_runs,
  COUNT(CASE WHEN result_state = 'SUCCESS' THEN 1 END) AS ok,
  COUNT(CASE WHEN result_state = 'FAILED' THEN 1 END) AS failed,
  ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) AS avg_min
FROM system.lakeflow.job_run_timeline
WHERE start_time > current_timestamp() - INTERVAL 24 HOURS;
```