dbr-logs
This skill should be used when the user asks to fetch, search, or analyze Databricks job logs. Trigger phrases include "job logs", "databricks logs", "executor logs", "driver logs", "spark job failed", "check logs for", "why did my job fail", "OOM error in job", "check run logs", or requests to debug a Databricks job failure. Not applicable to general Spark code questions or Databricks cluster configuration.
git clone https://github.com/zencity/databricks-logs-reader
T=$(mktemp -d) && git clone --depth=1 https://github.com/zencity/databricks-logs-reader "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/dbr-logs" ~/.claude/skills/zencity-databricks-logs-reader-dbr-logs && rm -rf "$T"
dbr-logs: Fetch and Analyze Databricks Job Logs
Follow these steps to fetch, analyze, and explain Databricks job logs.
Step 0: Ensure CLI is available
Check if the `dbr-logs` CLI is accessible. Try each tier in order:

`which dbr-logs`

- Found -> use `dbr-logs` directly
- Not found -> check for `uvx`: `which uvx`
  - If `uvx` available -> use `uvx --from dbr-logs dbr-logs <args>` for all commands below
  - If `uvx` not available -> ask the user: "`dbr-logs` CLI not found. Install options: `uv tool install dbr-logs` or `pip install dbr-logs`. Want me to install it?"
    - If the user declines -> fall back to raw `databricks fs ls` / `databricks fs cat` commands. Warn: "Using raw Databricks CLI (no log merging or filtering). Install dbr-logs for a better experience." Load `references/log-structure.md` for directory layout guidance.
For the rest of these instructions, `DBR_LOGS` refers to whichever invocation method was resolved above (`dbr-logs`, `uvx --from dbr-logs dbr-logs`, etc.).
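As a rough illustration, the tier check above can be scripted; this is a minimal sketch using only standard shell tools and the commands named above, not something the skill requires:

```bash
# Minimal sketch of the tier resolution described above.
if command -v dbr-logs >/dev/null 2>&1; then
  DBR_LOGS="dbr-logs"                      # tier 1: CLI installed directly
elif command -v uvx >/dev/null 2>&1; then
  DBR_LOGS="uvx --from dbr-logs dbr-logs"  # tier 2: run via uvx without installing
else
  DBR_LOGS=""                              # tier 3: ask the user, or fall back to raw databricks fs commands
fi
```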
Step 1: Resolve the target job
- If the user provides a job name -> use it directly
- If the user provides a Databricks URL -> pass the full URL as the positional argument (the CLI parses job/run from it)
- If the user describes a failure without naming a job -> ask which job to investigate
- If the user specifies a source (e.g. "check executor logs", "look at the driver") -> use `--source` accordingly
- Default environment is `prod`. Only add `--env <env>` if the user specifies a different environment (see the example after this list).
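For instance (the workspace URL and job name below are placeholders, not real resources):

```bash
# A full Databricks URL can be passed as the positional argument; the CLI parses job/run from it.
DBR_LOGS "https://adb-1234567890123456.7.azuredatabricks.net/jobs/123/runs/456" --focus --format jsonl

# A job name with a non-default environment.
DBR_LOGS nightly-etl --env staging --focus --format jsonl
```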
Step 2: Fetch logs via CLI
Run `DBR_LOGS` with appropriate flags. Always use `--format jsonl` when you (Claude) are consuming the output — structured data is easier to analyze. Use `--format text` only when the user wants raw output displayed directly.
Priority: match the user's intent. If the user asks to search for a specific string or pattern, pipe the output to `grep` rather than adding `--level` filtering — the match may appear at any log level (INFO, DEBUG, etc.). Only default to `--level ERROR,WARN` when the user asks about failures/errors without specifying what to search for. Similarly, if the user specifies a source (e.g. "executor logs"), honor that with `--source` rather than fetching all sources.
Always use `--focus` unless the user explicitly asks for raw/unfiltered output. This suppresses Spark/JVM noise (thread dumps, shuffle lifecycle, task assignments) that buries application logs.
Common patterns
```bash
# User asks about errors/failures (no specific search term)
DBR_LOGS <job-name> --level ERROR,WARN --focus --format jsonl

# Specific run
DBR_LOGS <job-name> --run-id <run-id> --level ERROR,WARN --focus --format jsonl

# User says "check executor logs" (honor the source, fetch all levels)
DBR_LOGS <job-name> --source executor --focus --format jsonl

# Executor errors specifically
DBR_LOGS <job-name> --source executor --level ERROR,WARN --focus --format jsonl

# Single executor deep dive
DBR_LOGS <job-name> --source executor:3 --focus --format jsonl

# User asks to search for a specific string (pipe to grep, no --level)
DBR_LOGS <job-name> --focus --format jsonl | grep "partition count"

# Search for a specific error pattern
DBR_LOGS <job-name> --focus --format jsonl | grep "OutOfMemoryError"

# Driver only
DBR_LOGS <job-name> --source driver --focus --format jsonl

# Include log4j or stacktrace files
DBR_LOGS <job-name> --include-log4j --include-stacktrace --focus --format jsonl

# Logs from the last hour
DBR_LOGS <job-name> --since 1h --focus --format jsonl

# Staging environment
DBR_LOGS <job-name> --env staging --focus --format jsonl

# Raw unfiltered output (no noise suppression)
DBR_LOGS <job-name> --format jsonl
```
CLI reference
| Option | Short | Description |
|---|---|---|
| positional | | Job name or Databricks workspace URL |
| `--run-id` | | Run ID. Omit for latest run. |
| `--env` | | Environment: `prod` (default), `staging`, ... |
| `--profile` | | Databricks CLI profile name |
| `--source` | | `driver`, `executor`, `executor:N`; defaults to all sources |
| `--level` | | Exact match, comma-separated: `ERROR`, `WARN`, `INFO`, `DEBUG` |
| `--include-log4j` | | Include driver log4j files |
| `--include-stacktrace` | | Include driver stacktrace files |
| `--format` | | `jsonl` or `text` |
| | | Show only last N lines |
| `--since` | | Logs since time (e.g. `1h`, ISO datetime) |
| `--focus` | | Suppress Spark/JVM noise (thread dumps, shuffle, task lifecycle) |
Step 3: Analyze the output
Parse the JSONL output and look for these root cause patterns (a quick scan sketch follows the table):
| Pattern | Likely cause | Key fields to check |
|---|---|---|
| `OutOfMemoryError` / `GC overhead limit exceeded` | Executor or driver memory too small | Which source (driver vs executor), heap vs off-heap |
| `FetchFailedException` / `Connection refused` / `ExecutorLostFailure` | Network or shuffle issues, node went unhealthy | Target IP, timeout duration, which executor |
| `Table or view not found` | Missing table, view, or path | Resource name in error message |
| `AnalysisException` | SQL/schema issues (column not found, type mismatch) | SQL statement or column name |
| Tasks running far longer than others | Data skew or stuck tasks | Task IDs, duration, which executor |
| `ConcurrentAppendException` / `ConcurrentModificationException` | Concurrent writes or stale metadata | File path |
| `Job aborted due to stage failure` | Upstream task failure cascade | Root cause in "caused by" chain |
| `PythonException` | Python-side error propagated to JVM | Python traceback in the message |
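One quick, optional way to surface these signatures in output you have already fetched is a pattern scan; the pattern list below is illustrative, not exhaustive:

```bash
# Illustrative only: scan fetched JSONL for common failure signatures in one pass.
DBR_LOGS <job-name> --focus --format jsonl \
  | grep -E "OutOfMemoryError|FetchFailedException|ExecutorLostFailure|AnalysisException|PythonException" \
  | head -50
```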
When analyzing:
- Group errors by source (driver vs specific executors); one way to do this is sketched after this list
- Identify the root cause: the first error chronologically is usually the root, and later errors are cascading failures
- Note the timeline — when errors started, how long the job ran before failing
- Check for patterns across executors — same error on all executors suggests a systemic issue; one executor suggests data skew or node problem
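A sketch of grouping by source, assuming the JSONL records carry fields along the lines of `source`, `level`, and `message` (the exact field names are an assumption, not documented here):

```bash
# Assumption: each JSONL record has "source", "level", and "message" fields.
# Adjust the field names to whatever the actual output contains.
DBR_LOGS <job-name> --level ERROR,WARN --focus --format jsonl > errors.jsonl

# Count errors per source to see whether the failure is driver-side, one executor, or systemic.
jq -r '.source' errors.jsonl | sort | uniq -c | sort -rn

# First few error records in order (output is assumed to be time-ordered after merging).
jq -r '[.source, .level, .message] | @tsv' errors.jsonl | head -20
```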
Step 4: Present findings and suggest next steps
Structure your response as:
- Summary: What happened, which run, when
- Errors found: Grouped by source, with key log lines quoted
- Root cause assessment: Best determination of why the job failed
- Suggested actions based on error type:
| Error type | Suggested actions |
|---|---|
| OOM | Increase executor/driver memory, check for data skew, reduce partition size |
| Shuffle/network | Enable shuffle retry settings, check cluster health, increase shuffle partitions |
| Missing resource | Verify table/path exists, check permissions, check if upstream job ran |
| Schema/SQL | Fix column references, check for schema evolution, verify data types |
| Hanging tasks | Increase shuffle partitions, check for data skew, salt join keys |
| Concurrent write | Check for overlapping job schedules, enable Delta conflict resolution |
If the error is unclear, suggest:
- Checking a specific executor's full logs (`--source executor:N`)
- Looking at driver log4j for more context (`--include-log4j`)
- Comparing with a previous successful run (a sketch follows this list)
- Widening the log level to include WARN or INFO
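For the run comparison, a rough sketch using only flags documented above (the run IDs are placeholders):

```bash
# Placeholder run IDs: fetch the failed run and a known-good run side by side.
DBR_LOGS <job-name> --run-id <failed-run-id> --level ERROR,WARN --focus --format jsonl > failed.jsonl
DBR_LOGS <job-name> --run-id <good-run-id> --level ERROR,WARN --focus --format jsonl > good.jsonl

# Lines present only in the failed run are usually the interesting ones.
diff good.jsonl failed.jsonl | grep '^>' | head -30
```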