Skillshub databricks-cost-tuning
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/databricks-cost-tuning" ~/.claude/skills/comeonoliver-skillshub-databricks-cost-tuning && rm -rf "$T"
manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/databricks-cost-tuning/SKILL.md
Source content
Databricks Cost Tuning
Overview
Reduce Databricks spending through cluster policies, spot instances, SQL warehouse right-sizing, and cost governance. Databricks charges per DBU (Databricks Unit), with rates varying by compute type: Jobs Compute ($0.15/DBU), All-Purpose Compute ($0.40/DBU), SQL Compute ($0.22/DBU), Serverless ($0.07/DBU). The system tables system.billing.usage and system.billing.list_prices provide cost visibility.
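As a back-of-envelope illustration of the pricing model (the rates above are list prices; actual rates vary by cloud, region, and contract tier), a short sketch:

```python
# Back-of-envelope DBU cost estimate. Rates are the list prices quoted
# in the overview; real rates depend on cloud, region, and contract tier.
DBU_RATES_USD = {
    "jobs": 0.15,
    "all_purpose": 0.40,
    "sql": 0.22,
    "serverless": 0.07,
}

def estimated_cost_usd(dbus: float, compute_type: str) -> float:
    """Estimate USD cost for a given DBU consumption and compute type."""
    return dbus * DBU_RATES_USD[compute_type]

# The same 160-DBU/day workload costs ~2.7x more on an all-purpose
# cluster than as a scheduled job:
print(estimated_cost_usd(160, "all_purpose"))  # 64.0
print(estimated_cost_usd(160, "jobs"))         # 24.0
```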
Prerequisites
- Databricks Premium or Enterprise workspace
- Access to the system.billing.usage and system.billing.list_prices tables
- Workspace admin for cluster policy creation
Instructions
Step 1: Identify Top Cost Drivers
```sql
-- Top 10 most expensive clusters this month
-- (join system.compute.clusters on cluster_id if you need cluster names)
SELECT
  u.usage_metadata.cluster_id AS cluster_id,
  u.sku_name,
  SUM(u.usage_quantity) AS total_dbus,
  ROUND(SUM(u.usage_quantity * p.pricing.default), 2) AS estimated_cost_usd
FROM system.billing.usage u
LEFT JOIN system.billing.list_prices p
  ON u.sku_name = p.sku_name
  AND u.usage_end_time >= p.price_start_time
  AND (p.price_end_time IS NULL OR u.usage_end_time < p.price_end_time)
WHERE u.usage_date >= date_trunc('month', current_date())
  AND u.usage_metadata.cluster_id IS NOT NULL
GROUP BY cluster_id, u.sku_name
ORDER BY estimated_cost_usd DESC
LIMIT 10;

-- Cost by team (requires a Team tag on clusters; tags surface in custom_tags)
SELECT
  u.custom_tags['Team'] AS team,
  u.sku_name,
  ROUND(SUM(u.usage_quantity), 1) AS total_dbus,
  ROUND(SUM(u.usage_quantity * p.pricing.default), 2) AS cost_usd
FROM system.billing.usage u
LEFT JOIN system.billing.list_prices p
  ON u.sku_name = p.sku_name
  AND u.usage_end_time >= p.price_start_time
  AND (p.price_end_time IS NULL OR u.usage_end_time < p.price_end_time)
WHERE u.usage_date >= date_trunc('month', current_date())
GROUP BY team, u.sku_name
ORDER BY cost_usd DESC;
```
Step 2: Enforce Cluster Policies
Cluster policies restrict what users can configure, preventing runaway costs.
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# Create a cost-optimized policy for interactive clusters
policy = w.cluster_policies.create(
    name="cost-optimized-interactive",
    definition="""{
      "autotermination_minutes": {
        "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 20
      },
      "num_workers": {"type": "range", "minValue": 0, "maxValue": 8},
      "node_type_id": {
        "type": "allowlist",
        "values": ["m5.xlarge", "m5.2xlarge", "c5.xlarge", "c5.2xlarge"],
        "defaultValue": "m5.xlarge"
      },
      "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
      "custom_tags.CostCenter": {"type": "fixed", "value": "engineering"}
    }""",
)

# Assign the policy to a group
w.cluster_policies.set_permissions(
    cluster_policy_id=policy.policy_id,
    access_control_list=[
        compute.ClusterPolicyAccessControlRequest(
            group_name="data-analysts",
            permission_level=compute.ClusterPolicyPermissionLevel.CAN_USE,
        )
    ],
)
```
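From the analyst's side, creating a cluster under this policy might look like the sketch below (the cluster name and sizes are illustrative): requests outside the policy's ranges or allowlist are rejected by the API.

```python
# Illustrative: an analyst creates a cluster governed by the policy above.
# Values outside the policy's ranges/allowlist are rejected.
cluster = w.clusters.create(
    cluster_name="analyst-sandbox",    # hypothetical name
    spark_version="14.3.x-scala2.12",
    node_type_id="m5.xlarge",          # must be in the allowlist
    num_workers=2,                     # must be within 0-8
    autotermination_minutes=20,        # must be within 10-60
    policy_id=policy.policy_id,
).result()                             # waits until the cluster is RUNNING
```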
Step 3: Spot Instances for Batch Jobs
Spot instances save 60-90% on worker nodes. Always keep the driver on-demand.
```python
# Job cluster config with spot instances
job_cluster_config = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",
        "first_on_demand": 1,           # Driver is on-demand
        "spot_bid_price_percent": 100,  # Pay up to on-demand price
        "zone_id": "auto",              # Let Databricks pick the cheapest AZ
    },
}
# Workers use spot, driver uses on-demand.
# If spot capacity is unavailable, falls back to on-demand automatically.
```
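To wire this config into an actual job through the Python SDK, one option is to pass a typed ClusterSpec on the task. A sketch, with the job name and notebook path as illustrative assumptions:

```python
from databricks.sdk.service import compute, jobs

# Sketch: a scheduled job whose task runs on the spot-backed cluster above.
# The job name and notebook path are illustrative.
job = w.jobs.create(
    name="nightly-etl",
    tasks=[
        jobs.Task(
            task_key="etl",
            notebook_task=jobs.NotebookTask(notebook_path="/ETL/nightly"),
            new_cluster=compute.ClusterSpec(
                spark_version="14.3.x-scala2.12",
                node_type_id="i3.xlarge",
                num_workers=4,
                aws_attributes=compute.AwsAttributes(
                    availability=compute.AwsAvailability.SPOT_WITH_FALLBACK,
                    first_on_demand=1,  # keep the driver on-demand
                    spot_bid_price_percent=100,
                    zone_id="auto",
                ),
            ),
        )
    ],
)
```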
Step 4: Right-Size SQL Warehouses
```sql
-- Check warehouse utilization over the last 7 days
-- (warehouse ids live in the compute struct of system.query.history)
SELECT
  compute.warehouse_id AS warehouse_id,
  COUNT(*) AS query_count,
  ROUND(AVG(total_duration_ms) / 1000, 1) AS avg_duration_sec,
  ROUND(MAX(waiting_at_capacity_duration_ms) / 1000, 1) AS max_queue_sec,
  ROUND(AVG(waiting_at_capacity_duration_ms) / 1000, 1) AS avg_queue_sec
FROM system.query.history
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
  AND compute.warehouse_id IS NOT NULL
GROUP BY warehouse_id;

-- If avg_queue_sec is near 0 → warehouse is oversized; reduce cluster_size
-- If avg_queue_sec > 30 → warehouse needs more capacity or auto-scaling
```
```python
# Survey existing warehouses as candidates for serverless migration
# (cheaper per-second billing for bursty workloads)
for wh in w.warehouses.list():
    print(f"{wh.name}: type={wh.warehouse_type}, size={wh.cluster_size}, "
          f"auto_stop={wh.auto_stop_mins}min, state={wh.state}")
# Serverless (~$0.07/DBU) vs Classic (~$0.22/DBU) for bursty workloads
```
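To actually flip a warehouse to serverless and tighten auto-stop via the SDK, something like the following sketch should work (the warehouse id is a placeholder, and serverless must be enabled for your workspace):

```python
# Sketch: convert a warehouse to serverless and tighten auto-stop.
# The id is a hypothetical placeholder.
w.warehouses.edit(
    id="abc123def456",              # hypothetical warehouse id
    enable_serverless_compute=True,
    auto_stop_mins=5,               # serverless supports aggressive auto-stop
)
```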
Step 5: Auto-Terminate Idle Development Clusters
```bash
# Find dev/sandbox/test clusters and set aggressive auto-termination.
# Note: the clusters edit endpoint replaces the entire cluster spec, so we
# fetch the current config and patch autotermination_minutes rather than
# sending the single field on its own.
databricks clusters list --output json |
  jq -r '.[] | select(.cluster_name | test("dev|sandbox|test")) | .cluster_id' |
  while read -r CID; do
    echo "Setting 15min auto-stop on cluster $CID"
    databricks clusters get "$CID" --output json |
      jq '.autotermination_minutes = 15' > "/tmp/cluster-$CID.json"
    databricks clusters edit --json "@/tmp/cluster-$CID.json"
  done
```
```python
from datetime import datetime

# Find idle running clusters wasting money
for c in w.clusters.list():
    if c.state and c.state.value == "RUNNING" and c.last_activity_time:
        idle_minutes = (datetime.now().timestamp() * 1000 - c.last_activity_time) / 60000
        if idle_minutes > 60:
            print(f"IDLE {idle_minutes:.0f}min: {c.cluster_name} "
                  f"({c.num_workers} x {c.node_type_id})")
```
Step 6: Instance Pools for Faster + Cheaper Startup
```python
from databricks.sdk.service import compute

# Create a pool of pre-allocated instances
pool = w.instance_pools.create(
    instance_pool_name="etl-pool",
    node_type_id="i3.xlarge",
    min_idle_instances=2,   # Keep 2 warm for instant startup
    max_capacity=10,
    idle_instance_autotermination_minutes=15,
    # Pools support SPOT or ON_DEMAND (no spot-with-fallback at pool level)
    aws_attributes=compute.InstancePoolAwsAttributes(
        availability=compute.InstancePoolAwsAttributesAvailability.SPOT
    ),
)

# Reference the pool in a job cluster config
job_cluster = {
    "instance_pool_id": pool.instance_pool_id,
    "num_workers": 4,
    # No node_type_id needed; it is inherited from the pool
}
```
Output
- Cost breakdown by cluster, team, and SKU
- Cluster policies enforcing auto-termination and instance limits
- Spot instance config for 60-90% savings on batch workers
- SQL warehouse utilization report for right-sizing
- Instance pools for faster cluster startup
- Idle cluster detection script
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Spot interruption | Cloud provider reclaiming capacity | Databricks replaces lost workers automatically; checkpoint long-running jobs |
| Policy too restrictive | Workers can't handle the workload | Raise the policy's limits or add larger instance types to the allowlist |
| SQL warehouse idle but running | Auto-stop not configured | Set auto_stop_mins (5-10 minutes works well for serverless) |
| Billing data not available | System tables not enabled | Enable in Account Console > System Tables |
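For the spot-interruption row, long-running jobs can additionally be made self-healing with task-level retries. A sketch using Jobs API task settings (the task name, notebook path, and intervals are illustrative):

```python
# Sketch: task-level retries so a job survives spot reclamation mid-run.
job_task = {
    "task_key": "nightly_etl",                 # illustrative name
    "notebook_task": {"notebook_path": "/ETL/nightly"},
    "new_cluster": job_cluster_config,         # spot config from Step 3
    "max_retries": 2,                          # re-run if workers are reclaimed
    "min_retry_interval_millis": 60_000,       # wait a minute before retrying
    "retry_on_timeout": True,
}
```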
Examples
Monthly Cost Report
```sql
SELECT
  date_trunc('week', usage_date) AS week,
  sku_name,
  ROUND(SUM(usage_quantity), 0) AS total_dbus
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY week, sku_name
ORDER BY week, total_dbus DESC;
```
Cost Savings Checklist
- Auto-termination enabled on ALL interactive clusters (15-30 min)
- Spot instances enabled for all batch job workers
- Instance pools for frequently-used cluster configs
- Serverless SQL warehouses for bursty query workloads
- Cluster policies assigned to all non-admin groups
- Team tags on all clusters for cost attribution (see the sketch below)
- Weekly cost review using system.billing.usage
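Tags set in a cluster's custom_tags propagate into system.billing.usage, which is what the per-team query in Step 1 groups on. A minimal sketch of a tagged cluster config, with illustrative tag keys and values:

```python
# Minimal sketch: tag a cluster so its DBUs are attributable per team.
# Tag keys/values are illustrative; they must match what your billing
# queries group on (e.g., custom_tags['Team'] in Step 1).
tagged_cluster_config = {
    "cluster_name": "analytics-dev",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "m5.xlarge",
    "num_workers": 2,
    "autotermination_minutes": 20,
    "custom_tags": {
        "Team": "analytics",
        "CostCenter": "engineering",
    },
}
```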