dbr-logs

This skill should be used when the user asks to fetch, search, or analyze Databricks job logs. Trigger phrases include "job logs", "databricks logs", "executor logs", "driver logs", "spark job failed", "check logs for", "why did my job fail", "OOM error in job", "check run logs", or requests to debug a Databricks job failure. Not applicable to general Spark code questions or Databricks cluster configuration.

Install

Source · Clone the upstream repo

git clone https://github.com/zencity/databricks-logs-reader

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/zencity/databricks-logs-reader "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/dbr-logs" ~/.claude/skills/zencity-databricks-logs-reader-dbr-logs && rm -rf "$T"

Manifest · skills/dbr-logs/SKILL.md

Source content

dbr-logs: Fetch and Analyze Databricks Job Logs

Follow these steps to fetch, analyze, and explain Databricks job logs.

Step 0: Ensure CLI is available

Check if the dbr-logs CLI is accessible. Try each tier in order:

which dbr-logs

  1. Found -> use dbr-logs directly
  2. Not found -> check for uvx: which uvx
    • If uvx is available -> use uvx --from dbr-logs dbr-logs <args> for all commands below
    • If uvx is not available -> ask the user:

      dbr-logs CLI not found. Install options:
      • uv tool install dbr-logs
      • pip install dbr-logs

      Want me to install it?

    • If the user declines -> fall back to raw databricks fs ls / databricks fs cat commands. Warn: "Using raw Databricks CLI (no log merging or filtering). Install dbr-logs for a better experience." Load references/log-structure.md for directory layout guidance.

For the rest of these instructions, DBR_LOGS refers to whichever invocation method was resolved above (dbr-logs, uvx --from dbr-logs dbr-logs, etc.).
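
A minimal shell sketch of the tier resolution above; DBR_LOGS here is just the placeholder name reused in the later examples, not part of the CLI itself:

# Resolve the invocation method once, then reuse it for all later commands (sketch)
if command -v dbr-logs >/dev/null 2>&1; then
  DBR_LOGS="dbr-logs"
elif command -v uvx >/dev/null 2>&1; then
  DBR_LOGS="uvx --from dbr-logs dbr-logs"
else
  echo "dbr-logs not found: ask the user before installing or falling back to raw 'databricks fs' commands"
fi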

Step 1: Resolve the target job

  • If the user provides a job name -> use it directly
  • If the user provides a Databricks URL -> pass the full URL as the positional argument (the CLI parses job/run from it)
  • If the user describes a failure without naming a job -> ask which job to investigate
  • If the user specifies a source (e.g. "check executor logs", "look at the driver") -> use --source accordingly
  • Default environment is prod. Only add --env <env> if the user specifies a different environment.
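
For illustration, both forms of the positional argument (the URL shape below is a hypothetical placeholder; substitute the real workspace host and IDs):

# By job name
DBR_LOGS <job-name> --focus --format jsonl

# By full workspace URL (the CLI parses the job/run from it)
DBR_LOGS "https://<workspace>.cloud.databricks.com/jobs/<job-id>/runs/<run-id>" --focus --format jsonl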

Step 2: Fetch logs via CLI

Run DBR_LOGS with appropriate flags. Always use --format jsonl when you (Claude) are consuming the output; structured data is easier to analyze. Use --format text only when the user wants raw output displayed directly.

Priority: match the user's intent. If the user asks to search for a specific string or pattern, pipe the output to grep rather than adding --level filtering, since the match may appear at any log level (INFO, DEBUG, etc.). Only default to --level ERROR,WARN when the user asks about failures or errors without specifying what to search for. Similarly, if the user specifies a source (e.g. "executor logs"), honor that with --source rather than fetching all sources.

Always use --focus unless the user explicitly asks for raw/unfiltered output. This suppresses Spark/JVM noise (thread dumps, shuffle lifecycle, task assignments) that buries application logs.

Common patterns

# User asks about errors/failures (no specific search term)
DBR_LOGS <job-name> --level ERROR,WARN --focus --format jsonl

# Specific run
DBR_LOGS <job-name> --run-id <run-id> --level ERROR,WARN --focus --format jsonl

# User says "check executor logs" (honor the source, fetch all levels)
DBR_LOGS <job-name> --source executor --focus --format jsonl

# Executor errors specifically
DBR_LOGS <job-name> --source executor --level ERROR,WARN --focus --format jsonl

# Single executor deep dive
DBR_LOGS <job-name> --source executor:3 --focus --format jsonl

# User asks to search for a specific string (pipe to grep, no --level)
DBR_LOGS <job-name> --focus --format jsonl | grep "partition count"

# Search for a specific error pattern
DBR_LOGS <job-name> --focus --format jsonl | grep "OutOfMemoryError"

# Driver only
DBR_LOGS <job-name> --source driver --focus --format jsonl

# Include log4j or stacktrace files
DBR_LOGS <job-name> --include-log4j --include-stacktrace --focus --format jsonl

# Logs from the last hour
DBR_LOGS <job-name> --since 1h --focus --format jsonl

# Staging environment
DBR_LOGS <job-name> --env staging --focus --format jsonl

# Raw unfiltered output (no noise suppression)
DBR_LOGS <job-name> --format jsonl

CLI reference

| Option | Short | Description |
| --- | --- | --- |
| <job> | positional | Job name or Databricks workspace URL |
| --run-id | -r | Run ID. Omit for latest run. |
| --env | -e | prod (default), staging, ondemand |
| --dbr-profile | -p | Databricks CLI profile name |
| --source | -s | driver, executor, executor:N, all (default) |
| --stream | | stderr, stdout, all (default) |
| --level | -l | Exact match, comma-separated: ERROR, WARN, INFO, DEBUG |
| --include-log4j | | Include driver log4j files |
| --include-stacktrace | | Include driver stacktrace files |
| --format | -f | text or jsonl |
| --tail | -n | Show only last N lines |
| --since | | Logs since time (e.g. 1h, 30m, ISO datetime) |
| --focus | | Suppress Spark/JVM noise (thread dumps, shuffle, task lifecycle) |

Step 3: Analyze the output

Parse the JSONL output and look for these root cause patterns:

| Pattern | Likely cause | Key fields to check |
| --- | --- | --- |
| OutOfMemoryError / java.lang.OutOfMemoryError | Executor or driver memory too small | Which source (driver vs executor), heap vs off-heap |
| Connection refused / ShuffleBlockFetcher / TransportChannelHandler | Network or shuffle issues, node went unhealthy | Target IP, timeout duration, which executor |
| RESOURCE_DOES_NOT_EXIST | Missing table, view, or path | Resource name in error message |
| AnalysisException | SQL/schema issues (column not found, type mismatch) | SQL statement or column name |
| HangingTaskDetector | Data skew or stuck tasks | Task IDs, duration, which executor |
| FileNotFoundException / FileAlreadyExistsException | Concurrent writes or stale metadata | File path |
| SparkException: Job aborted | Upstream task failure cascade | Root cause in "caused by" chain |
| Py4JJavaError | Python-side error propagated to JVM | Python traceback in the message |

When analyzing:

  1. Group errors by source (driver vs specific executors); a jq sketch for this follows the list
  2. Identify the root cause: the first error chronologically is often the real failure, and later errors are cascading from it
  3. Note the timeline — when errors started, how long the job ran before failing
  4. Check for patterns across executors — same error on all executors suggests a systemic issue; one executor suggests data skew or node problem
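
A minimal jq sketch for the grouping and timeline steps above, assuming each JSONL record carries source, timestamp, level, and message fields; the field names are an assumption, so check one real record before relying on them:

# Group errors by source and show the earliest error per group (field names assumed)
DBR_LOGS <job-name> --level ERROR,WARN --focus --format jsonl \
  | jq -s 'group_by(.source)
           | map({source: .[0].source,
                  count: length,
                  first_error: (sort_by(.timestamp) | .[0] | {timestamp, message})})'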

Step 4: Present findings and suggest next steps

Structure your response as:

  1. Summary: What happened, which run, when
  2. Errors found: Grouped by source, with key log lines quoted
  3. Root cause assessment: Best determination of why the job failed
  4. Suggested actions based on error type:
| Error type | Suggested actions |
| --- | --- |
| OOM | Increase executor/driver memory, check for data skew, reduce partition size |
| Shuffle/network | Enable shuffle retry settings, check cluster health, increase shuffle partitions |
| Missing resource | Verify table/path exists, check permissions, check if upstream job ran |
| Schema/SQL | Fix column references, check for schema evolution, verify data types |
| Hanging tasks | Increase shuffle partitions, check for data skew, salt join keys |
| Concurrent write | Check for overlapping job schedules, enable Delta conflict resolution |

If the error is unclear, suggest:

  • Checking a specific executor's full logs (--source executor:N)
  • Looking at driver log4j for more context (--include-log4j)
  • Comparing with a previous successful run
  • Widening the log level to include WARN or INFO
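
Example follow-up commands built from the flags documented above (job name, run ID, and executor number are placeholders):

# Deep dive on one executor
DBR_LOGS <job-name> --source executor:3 --focus --format jsonl

# Driver log4j for more context
DBR_LOGS <job-name> --source driver --include-log4j --focus --format jsonl

# Compare with a previous successful run
DBR_LOGS <job-name> --run-id <previous-run-id> --level ERROR,WARN --focus --format jsonl

# Widen the level to include WARN and INFO
DBR_LOGS <job-name> --level ERROR,WARN,INFO --focus --format jsonl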