Skillshub databricks-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/databricks-incident-runbook" ~/.claude/skills/comeonoliver-skillshub-databricks-incident-runbook && rm -rf "$T"
manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/databricks-incident-runbook/SKILL.md
Databricks Incident Runbook
Overview
Rapid incident response for Databricks: triage script, decision tree, immediate actions by error type, communication templates, evidence collection, and postmortem template. Designed for on-call engineers to follow during live incidents.
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Production pipeline down | < 15 min | Critical ETL failed, data not updating |
| P2 | Degraded performance | < 1 hour | Slow queries, partial failures, stale data |
| P3 | Non-critical issues | < 4 hours | Dev cluster issues, non-critical job delays |
| P4 | No user impact | Next business day | Monitoring gaps, cleanup needed |
Instructions
Step 1: Quick Triage (Run First)
#!/bin/bash
set -euo pipefail

echo "=== DATABRICKS TRIAGE $(date -u '+%H:%M:%S UTC') ==="

# 1. Is Databricks itself down?
echo "--- Platform Status ---"
curl -s https://status.databricks.com/api/v2/status.json | \
  jq -r '.status.description // "UNKNOWN"' || \
  echo "Could not fetch platform status"

# 2. Can we reach the workspace?
echo "--- Workspace ---"
if databricks current-user me --output json 2>/dev/null | jq -r .userName; then
  echo "API: CONNECTED"
else
  echo "API: UNREACHABLE (check VPN/firewall/token)"
fi

# 3. Recent failures (bounded by count, not by time)
echo "--- Failed Runs (among the 20 most recent) ---"
databricks runs list --limit 20 --output json 2>/dev/null | \
  jq -r '.runs[]? | select(.state.result_state == "FAILED") | "\(.run_id): \(.run_name // "unnamed") - \(.state.state_message // "no message")"' || \
  echo "Could not fetch runs"

# 4. Cluster health
echo "--- Clusters in ERROR state ---"
databricks clusters list --output json 2>/dev/null | \
  jq -r '.[]? | select(.state == "ERROR") | "\(.cluster_id): \(.cluster_name) - \(.termination_reason.code // "unknown")"' || \
  echo "Could not fetch clusters"
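The run listing in the triage script is limited by count (`--limit 20`), not by time. To restrict it to a time window, filter on each run's `start_time`, which the Jobs API reports in epoch milliseconds. The following is a minimal sketch. The function names `cutoff_ms` and `failed_since` are hypothetical, and `failed_since` assumes `jq` is installed and that the JSON on stdin has the same `.runs[]` shape the triage script already relies on.

```shell
# Hypothetical helper: epoch-millisecond cutoff for "N minutes ago".
# $1: window in minutes; $2: optional "now" in epoch seconds (defaults to the current time).
cutoff_ms() {
  local minutes="$1" now="${2:-$(date +%s)}"
  echo $(( (now - minutes * 60) * 1000 ))
}

# Hypothetical helper: read a runs listing on stdin and print FAILED runs
# whose start_time is at or after the cutoff passed as $1.
failed_since() {
  jq -r --argjson cutoff "$1" \
    '.runs[]? | select(.state.result_state == "FAILED" and .start_time >= $cutoff) | "\(.run_id): \(.run_name // "unnamed")"'
}

# Example wiring, using the listing command from the triage script:
#   databricks runs list --limit 20 --output json | failed_since "$(cutoff_ms 60)"
```

A larger `--limit` may be needed on busy workspaces, since runs older than the 20 most recent are never seen by the filter.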
Step 2: Decision Tree
Is the issue affecting production data pipelines?
├─ YES: Is it a single job or multiple?
│   ├─ SINGLE JOB
│   │   ├─ Cluster failed to start → Step 3a
│   │   ├─ Code/logic error → Step 3b
│   │   ├─ Data quality issue → Step 3c
│   │   └─ Permission error → Step 3d
│   │
│   └─ MULTIPLE JOBS → Likely infrastructure
│       ├─ Check platform status (status.databricks.com)
│       ├─ Check workspace quotas (Admin Console)
│       └─ Check network/VPN connectivity
│
└─ NO: Is it performance?
    ├─ Slow queries → Check query plan, warehouse sizing
    ├─ Slow cluster startup → Check instance availability
    └─ Data freshness → Check upstream dependencies
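The production branch of the decision tree can be sketched as a small router that prints the next step. This is an illustration only: the function name `next_step` and the symptom keywords (`cluster-start`, `code-error`, `data-quality`, `permission`) are hypothetical labels chosen for this example, not values returned by any Databricks API.

```shell
# Hypothetical router for the production branch of the decision tree.
# $1: scope ("single" or "multiple"); $2: symptom keyword (illustrative labels only).
next_step() {
  local scope="$1" symptom="${2:-}"
  if [ "$scope" = "multiple" ]; then
    echo "Likely infrastructure: check platform status, workspace quotas, network/VPN"
    return 0
  fi
  case "$symptom" in
    cluster-start) echo "Step 3a" ;;
    code-error)    echo "Step 3b" ;;
    data-quality)  echo "Step 3c" ;;
    permission)    echo "Step 3d" ;;
    *)             echo "No matching branch for: $symptom" >&2; return 1 ;;
  esac
}
```

For example, `next_step single permission` prints `Step 3d`, while any call with `multiple` as the scope points to the infrastructure checks regardless of symptom.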
Step 3a: Cluster Failed to Start
CLUSTER_ID="your-cluster-id"

# Get termination reason
databricks clusters get --cluster-id "$CLUSTER_ID" | \
  jq '{state, termination_reason}'

# Check recent events
databricks clusters events --cluster-id "$CLUSTER_ID" --limit 10 | \
  jq -r '.events[] | "\(.timestamp): \(.type) - \(.details // "none")"'

# Common fixes:
#   QUOTA_EXCEEDED → Terminate idle clusters
#   CLOUD_PROVIDER_LAUNCH_FAILURE → Check instance availability in region
#   DRIVER_UNREACHABLE → Network/security group issue

# Quick fix: restart
databricks clusters start --cluster-id "$CLUSTER_ID"
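The "Common fixes" mapping above can be kept as a lookup so the suggestion is printed next to the termination code. A minimal sketch follows. `suggest_fix` is a hypothetical name, and only the three codes listed in this runbook are covered; any other code falls through to a generic hint.

```shell
# Hypothetical lookup: termination code → first action, taken from the list above.
suggest_fix() {
  case "$1" in
    QUOTA_EXCEEDED)                echo "Terminate idle clusters" ;;
    CLOUD_PROVIDER_LAUNCH_FAILURE) echo "Check instance availability in region" ;;
    DRIVER_UNREACHABLE)            echo "Check network/security group configuration" ;;
    *)                             echo "No known fix for '$1'; inspect cluster events" ;;
  esac
}

# Example wiring (assumes jq and the `clusters get` output shape used above):
#   code=$(databricks clusters get --cluster-id "$CLUSTER_ID" | jq -r '.termination_reason.code // "unknown"')
#   suggest_fix "$code"
```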
Step 3b: Code/Logic Error
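RUN_ID="your-run-id"

# Get run details and error
databricks runs get --run-id "$RUN_ID" | jq '{
  state: .state,
  tasks: [.tasks[]? | {key: .task_key, result: .state.result_state, error: .state.state_message}]
}'

# Get task output for failed tasks
# (for a multi-task job, pass the failed task's own run_id, not the parent run's)
databricks runs get-output --run-id "$RUN_ID" | jq '{
  error: .error,
  trace: (.error_trace // "" | .[0:1000])
}'

# Repair failed tasks only (skip successful ones)
# Confirm the flag against your CLI version: in the Jobs API, rerun_tasks takes
# task keys, while rerun_all_failed_tasks reruns every failed task.
databricks runs repair --run-id "$RUN_ID" --rerun-tasks FAILED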
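After a repair is submitted, the run has to reach a terminal life-cycle state before the result can be judged. The sketch below polls for that. `wait_for_run` is a hypothetical helper, and the state lookup is passed in as a command so it can be wired to whichever CLI syntax is in use. The terminal states checked (`TERMINATED`, `SKIPPED`, `INTERNAL_ERROR`) are the Jobs API life-cycle states that end a run.

```shell
# Hypothetical helper: poll until a run reaches a terminal life-cycle state.
# $1: name of a command that prints the current life_cycle_state
# $2: maximum number of polls (default 30); $3: seconds between polls (default 10)
wait_for_run() {
  local get_state="$1" max_polls="${2:-30}" interval="${3:-10}" state i
  for (( i = 1; i <= max_polls; i++ )); do
    state="$("$get_state")" || return 2   # state lookup itself failed
    case "$state" in
      TERMINATED|SKIPPED|INTERNAL_ERROR)
        echo "$state"
        return 0 ;;
    esac
    sleep "$interval"
  done
  echo "TIMEOUT" >&2
  return 1
}

# Example wiring (assumes jq and the `runs get` command used in this runbook):
#   run_state() { databricks runs get --run-id "$RUN_ID" | jq -r '.state.life_cycle_state'; }
#   wait_for_run run_state 30 10
```

A terminal state only says the run has stopped; `result_state` still has to be checked to see whether the repair succeeded.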
Step 3c: Data Quality Issue
-- Quick data sanity check
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT id) AS unique_ids,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts,
  MIN(created_at) AS oldest,
  MAX(created_at) AS newest
FROM prod_catalog.silver.orders
WHERE created_at > current_timestamp() - INTERVAL 1 DAY;

-- Check recent table changes
DESCRIBE HISTORY prod_catalog.silver.orders LIMIT 10;

-- Restore to previous version if corrupted
-- (5 is a placeholder: use the last known-good version from DESCRIBE HISTORY)
RESTORE TABLE prod_catalog.silver.orders TO VERSION AS OF 5;
Step 3d: Permission Error
JOB_ID="your-job-id"

# Check current user
databricks current-user me

# Check job permissions
databricks permissions get jobs --job-id "$JOB_ID"

# Fix permissions
databricks permissions update jobs --job-id "$JOB_ID" --json '{
  "access_control_list": [{
    "user_name": "service-principal@company.com",
    "permission_level": "CAN_MANAGE_RUN"
  }]
}'
Step 4: Communication
Internal (Slack)
:red_circle: **P1 INCIDENT: [Brief Description]**
**Status:** INVESTIGATING
**Impact:** [What data/users are affected]
**Started:** [Time UTC]
**Current Action:** [What you're doing now]
**Next Update:** [+30 min]
**IC:** @[your-name]
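Filling the internal template by hand under pressure tends to drop fields. The sketch below renders it from arguments so every update has the same shape. `render_slack_update` is a hypothetical helper; it only prints the message and does not post to Slack.

```shell
# Hypothetical renderer for the internal Slack template above.
# Arguments: severity, description, impact, start time, current action, IC handle.
render_slack_update() {
  local severity="$1" description="$2" impact="$3" started="$4" action="$5" ic="$6"
  cat <<EOF
:red_circle: **${severity} INCIDENT: ${description}**
**Status:** INVESTIGATING
**Impact:** ${impact}
**Started:** ${started}
**Current Action:** ${action}
**Next Update:** +30 min
**IC:** @${ic}
EOF
}
```

Example call: `render_slack_update P1 "Orders ETL failed" "Finance dashboards stale" "14:05 UTC" "Restarting cluster" alice`. The values shown are placeholders.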
External (Status Page)
**Data Pipeline Delay**

We are experiencing delays in data processing.
Dashboard data may be up to [X] hours stale.

Started: [Time] UTC
Status: Actively investigating
Next update: [Time] UTC
Step 5: Evidence Collection
#!/bin/bash
INCIDENT_ID=${1:?usage: $0 INCIDENT_ID RUN_ID [CLUSTER_ID]}
RUN_ID=${2:?usage: $0 INCIDENT_ID RUN_ID [CLUSTER_ID]}
CLUSTER_ID=${3:-}   # optional

mkdir -p "incident-$INCIDENT_ID"

# Collect everything (stderr is captured too, so CLI errors land in the evidence files)
databricks runs get --run-id "$RUN_ID" --output json > "incident-$INCIDENT_ID/run.json" 2>&1
databricks runs get-output --run-id "$RUN_ID" --output json > "incident-$INCIDENT_ID/output.json" 2>&1

if [ -n "$CLUSTER_ID" ]; then
  databricks clusters get --cluster-id "$CLUSTER_ID" --output json > "incident-$INCIDENT_ID/cluster.json" 2>&1
  databricks clusters events --cluster-id "$CLUSTER_ID" --limit 50 --output json > "incident-$INCIDENT_ID/events.json" 2>&1
fi

tar -czf "incident-$INCIDENT_ID.tar.gz" "incident-$INCIDENT_ID"
echo "Evidence: incident-$INCIDENT_ID.tar.gz"
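Because the collection commands redirect stderr into the evidence files, a failed CLI call produces a file that looks like evidence but holds only an error message. One way to make that visible is to record each command's exit status alongside its output. The following is a sketch under that assumption. `capture` and the `collection-status.tsv` file name are hypothetical and not part of the original script.

```shell
# Hypothetical wrapper: run a command, save stdout+stderr to a file, and append
# "<exit status> <TAB> <file> <TAB> <command>" to a status log in the same directory.
# Always returns 0 so one failed collection does not stop the rest.
capture() {
  local outfile="$1" status=0
  shift
  "$@" > "$outfile" 2>&1 || status=$?
  printf '%s\t%s\t%s\n' "$status" "$outfile" "$*" >> "$(dirname "$outfile")/collection-status.tsv"
  return 0
}

# Example wiring, replacing a direct redirect in the script above:
#   capture "incident-$INCIDENT_ID/run.json" databricks runs get --run-id "$RUN_ID" --output json
```

Any row in the status log with a non-zero first column marks a file that should not be trusted as evidence.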
Step 6: Postmortem Template
## Incident: [Title]

**Date:** YYYY-MM-DD | **Duration:** Xh Ym | **Severity:** P[1-4]
**IC:** [Name]

### Summary
[1-2 sentences: what happened and what was the impact]

### Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | Alert fired / issue detected |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Resolved |

### Root Cause
[Technical explanation]

### Impact
- Tables affected: [list]
- Data staleness: [hours]
- Users affected: [count/teams]

### Action Items
| Priority | Action | Owner | Due |
|----------|--------|-------|-----|
| P1 | [Preventive fix] | [Name] | [Date] |
| P2 | [Monitoring gap] | [Name] | [Date] |
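The **Duration** field uses an `Xh Ym` format. A small sketch for producing it from a number of seconds is shown below. `format_duration` is a hypothetical helper, and the commented example for deriving the seconds relies on GNU `date -d`, which is not available on every platform.

```shell
# Hypothetical helper: format a duration in seconds as "Xh Ym" for the postmortem header.
format_duration() {
  local seconds="$1"
  printf '%dh %dm\n' $(( seconds / 3600 )) $(( (seconds % 3600) / 60 ))
}

# Example wiring (GNU date only; timestamps are placeholders):
#   start=$(date -u -d '2024-05-01 14:05' +%s)
#   end=$(date -u -d '2024-05-01 16:40' +%s)
#   format_duration $(( end - start ))
```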
Output
- Issue triaged and severity assigned
- Root cause identified via decision tree
- Immediate remediation applied
- Stakeholders notified with structured updates
- Evidence collected for postmortem
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Can't reach API | Token expired or VPN down | Re-authenticate the CLI; confirm VPN connectivity |
| Run repair fails | Run too old for repair | Create new run with same config |
| Table restore fails | VACUUM already cleaned old versions | Restore from backup or replay pipeline |
| Cluster restart loops | Init script failing | Check cluster events for init script errors |
Examples
One-Line Health Checks
# Last 5 runs for a job
databricks runs list --job-id "$JID" --limit 5 | jq -r '.runs[]? | "\(.state.result_state): \(.run_name)"'

# Quick cluster restart
databricks clusters restart --cluster-id "$CID" && echo "Restart initiated"

# Cancel all active runs for a job
databricks runs list --job-id "$JID" --active-only | jq -r '.runs[]?.run_id' | \
  xargs -I{} databricks runs cancel --run-id {}
Next Steps
For data handling and compliance, see databricks-data-handling.