Dotfiles databricks-zerobus-ingest
Build Zerobus Ingest clients for near real-time data ingestion into Databricks Delta tables via gRPC. Use when creating producers that write directly to Unity Catalog tables without a message bus, working with the Zerobus Ingest SDK in Python/Java/Go/TypeScript/Rust, generating Protobuf schemas from UC tables, or implementing stream-based ingestion with ACK handling and retry logic.
git clone https://github.com/msbaek/dotfiles
T=$(mktemp -d) && git clone --depth=1 https://github.com/msbaek/dotfiles "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/databricks-zerobus-ingest" ~/.claude/skills/msbaek-dotfiles-databricks-zerobus-ingest && rm -rf "$T"
.claude/skills/databricks-zerobus-ingest/SKILL.md
Zerobus Ingest
Build clients that ingest data directly into Databricks Delta tables via the Zerobus gRPC API.
Status: GA (Generally Available since February 2026; billed under Lakeflow Jobs Serverless SKU)
Documentation:
What Is Zerobus Ingest?
Zerobus Ingest is a serverless connector that enables direct, record-by-record data ingestion into Delta tables via gRPC. It eliminates the need for message bus infrastructure (Kafka, Kinesis, Event Hub) for lakehouse-bound data. The service validates schemas, materializes data to target tables, and sends durability acknowledgments back to the client.
Core pattern: SDK init -> create stream -> ingest records -> handle ACKs -> flush -> close
Quick Decision: What Are You Building?
| Scenario | Language | Serialization | Reference |
|---|---|---|---|
| Quick prototype / test harness | Python | JSON | 2-python-client.md |
| Production Python producer | Python | Protobuf | 2-python-client.md + 4-protobuf-schema.md |
| JVM microservice | Java | Protobuf | 3-multilanguage-clients.md |
| Go service | Go | JSON or Protobuf | 3-multilanguage-clients.md |
| Node.js / TypeScript app | TypeScript | JSON | 3-multilanguage-clients.md |
| High-performance system service | Rust | JSON or Protobuf | 3-multilanguage-clients.md |
| Schema generation from UC table | Any | Protobuf | 4-protobuf-schema.md |
| Retry / reconnection logic | Any | Any | 5-operations-and-limits.md |
If the language is not specified, default to Python.
Common Libraries
These libraries are essential for Zerobus data ingestion:
- databricks-sdk>=0.85.0: Databricks workspace client for authentication and metadata
- databricks-zerobus-ingest-sdk>=1.0.0: ZeroBus SDK for high-performance streaming ingestion
- grpcio-tools
These are typically NOT pre-installed on Databricks. Install them using the `execute_code` tool:
- `code`: `"%pip install databricks-sdk>=VERSION databricks-zerobus-ingest-sdk>=VERSION"`

Save the returned `cluster_id` and `context_id` for subsequent calls.
Smart Installation Approach
Check the protobuf runtime version first, then install a compatible `grpcio-tools`:

```python
import google.protobuf

runtime_version = google.protobuf.__version__
print(f"Runtime protobuf version: {runtime_version}")

if runtime_version.startswith("5.26") or runtime_version.startswith("5.29"):
    %pip install grpcio-tools==1.62.0
else:
    %pip install grpcio-tools  # use latest for newer protobuf versions
```
Prerequisites
Never execute the skill without confirming the following prerequisites are valid:
- A Unity Catalog managed Delta table to ingest into
- A service principal client ID and secret with `SELECT` and `MODIFY` on the target table
- The Zerobus server endpoint for your workspace region
- The Zerobus Ingest SDK installed for your target language
See 1-setup-and-authentication.md for complete setup instructions.
Minimal Python Example (JSON)
```python
import json

from zerobus.sdk.sync import ZerobusSdk
from zerobus.sdk.shared import RecordType, StreamConfigurationOptions, TableProperties

sdk = ZerobusSdk(server_endpoint, workspace_url)
options = StreamConfigurationOptions(record_type=RecordType.JSON)
table_props = TableProperties(table_name)
stream = sdk.create_stream(client_id, client_secret, table_props, options)

try:
    record = {"device_name": "sensor-1", "temp": 22, "humidity": 55}
    stream.ingest_record(json.dumps(record))
    stream.flush()
finally:
    stream.close()
```
Detailed Guides
| Topic | File | When to Read |
|---|---|---|
| Setup & Auth | 1-setup-and-authentication.md | Endpoint formats, service principals, SDK install |
| Python Client | 2-python-client.md | Sync/async Python, JSON and Protobuf flows, reusable client class |
| Multi-Language | 3-multilanguage-clients.md | Java, Go, TypeScript, Rust SDK examples |
| Protobuf Schema | 4-protobuf-schema.md | Generate .proto from UC table, compile, type mappings |
| Operations & Limits | 5-operations-and-limits.md | ACK handling, retries, reconnection, throughput limits, constraints |
Always follow every step in the Workflow below.
Workflow
- Display the plan of your execution
- Determine the type of client
- Get the schema: always use 4-protobuf-schema.md. Execute using the `execute_code` MCP tool
- Write Python code to a local file in the project (e.g., `scripts/zerobus_ingest.py`), following the instructions in the relevant guide
- Execute on Databricks using the `execute_code` MCP tool (with the `file_path` parameter)
- If execution fails: edit the local file to fix the error, then re-execute
- Reuse the context for follow-up executions by passing the returned `cluster_id` and `context_id`
Important
- Never install local packages
- Always validate the MCP server requirement before execution
- Serverless limitation: The Zerobus SDK cannot pip-install on serverless compute. Use classic compute clusters, or use the Zerobus REST API (Beta) for notebook-based ingestion without the SDK.
- Explicit table grants: Service principals need explicit `SELECT` and `MODIFY` grants on the target table. Schema-level inherited permissions may not be sufficient for the `authorization_details` OAuth flow.
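The explicit-grant requirement above can be scripted. The sketch below builds the GRANT statements for a service principal; the table and principal names are illustrative placeholders, and the `USE CATALOG` / `USE SCHEMA` grants are included because Unity Catalog typically requires them for any table access (an assumption, not something this doc states). In a notebook you would run each statement with `spark.sql(stmt)`.

```python
def zerobus_grants(table: str, principal: str) -> list[str]:
    """Build the explicit grants a Zerobus service principal needs on a target table.

    `table` is a three-part Unity Catalog name, e.g. "catalog.schema.table";
    `principal` is the service principal's application ID.
    """
    catalog = table.split(".")[0]
    schema = ".".join(table.split(".")[:2])
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {table} TO `{principal}`",
        f"GRANT MODIFY ON TABLE {table} TO `{principal}`",
    ]

# Illustrative names only; substitute your own table and principal.
for stmt in zerobus_grants("main.iot.sensor_readings", "sp-application-id"):
    print(stmt)
```

Granting directly on the table (rather than relying on schema-level inheritance) avoids the `authorization_details` OAuth failure described above.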
Context Reuse Pattern
The first execution auto-selects a running cluster and creates an execution context. Reuse this context for follow-up calls - it's much faster (~1s vs ~15s) and shares variables/imports:
First execution, using the `execute_code` tool:
- `file_path`: `"scripts/zerobus_ingest.py"`

Returns:
`{ success, output, error, cluster_id, context_id, ... }`

Save `cluster_id` and `context_id` for follow-up calls.
If execution fails:
- Read the error from the result
- Edit the local Python file to fix the issue
- Re-execute with the same context using the `execute_code` tool:
  - `file_path`: `"scripts/zerobus_ingest.py"`
  - `cluster_id`: `"<saved_cluster_id>"`
  - `context_id`: `"<saved_context_id>"`

Follow-up executions reuse the context (faster, shares state):
- `file_path`: `"scripts/validate_ingestion.py"`
- `cluster_id`: `"<saved_cluster_id>"`
- `context_id`: `"<saved_context_id>"`
Handling Failures
When execution fails:
- Read the error from the result
- Edit the local Python file to fix the issue
- Re-execute using the same `cluster_id` and `context_id` (faster, keeps installed libraries)
- If the context is corrupted, omit `context_id` to create a fresh one
Installing Libraries
Databricks provides Spark, pandas, numpy, and common data libraries by default. Only install a library if you get an import error.
Use the `execute_code` tool:
- `code`: `"%pip install databricks-zerobus-ingest-sdk>=1.0.0"`
- `cluster_id`: `"<cluster_id>"`
- `context_id`: `"<context_id>"`

The library is immediately available in the same context.
Note: keeping the same `context_id` means installed libraries persist across calls.
🚨 Critical Learning: Timestamp Format Fix
BREAKTHROUGH: Zerobus requires timestamp fields as Unix integer timestamps, NOT string timestamps. For Databricks, generate the value in microseconds since the Unix epoch.
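A minimal sketch of the conversion: the helper name `to_micros` and the field it feeds are illustrative, not part of the Zerobus API. The point is to send an integer microsecond count, never an ISO-8601 string.

```python
from datetime import datetime, timezone

def to_micros(dt: datetime) -> int:
    """Convert a timezone-aware datetime to integer microseconds since the
    Unix epoch, the form Databricks TIMESTAMP columns expect from Zerobus."""
    return int(dt.timestamp() * 1_000_000)

# Correct: integer microseconds.
record = {"device_name": "sensor-1", "event_time": to_micros(datetime.now(timezone.utc))}

# Wrong: a string timestamp like datetime.now().isoformat() will be rejected
# or mis-ingested for a TIMESTAMP column.
print(to_micros(datetime(2024, 1, 1, tzinfo=timezone.utc)))  # 1704067200000000
```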
Key Concepts
- gRPC + Protobuf: Zerobus uses gRPC as its transport protocol. Any application that can communicate via gRPC and construct Protobuf messages can produce to Zerobus.
- JSON or Protobuf serialization: JSON for quick starts; Protobuf for type safety, forward compatibility, and performance.
- At-least-once delivery: The connector provides at-least-once guarantees. Design consumers to handle duplicates.
- Durability ACKs: Each ingested record returns a `RecordAcknowledgment`. Use `flush()` to ensure all buffered records are durably written, or use `wait_for_offset(offset)` for offset-based tracking.
- No table management: Zerobus does not create or alter tables. You must pre-create your target table and manage schema evolution yourself.
- Single-AZ durability: The service runs in a single availability zone. Plan for potential zone outages.
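Because delivery is at-least-once, downstream readers should tolerate redelivered records. A minimal sketch of consumer-side deduplication, assuming each record carries a client-assigned unique key (the `event_id` field below is illustrative, not part of the Zerobus API):

```python
def dedupe(records, key="event_id"):
    """Yield each record once, keyed by a client-assigned unique id.

    In-memory dedup is only a sketch; production pipelines typically
    deduplicate in Delta itself (e.g. with MERGE on the key).
    """
    seen = set()
    for rec in records:
        if rec[key] in seen:
            continue  # drop the redelivered duplicate
        seen.add(rec[key])
        yield rec

events = [
    {"event_id": 1, "temp": 22},
    {"event_id": 1, "temp": 22},  # redelivered under at-least-once semantics
    {"event_id": 2, "temp": 23},
]
print(list(dedupe(events)))  # only the two unique records remain
```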
Common Issues
| Issue | Solution |
|---|---|
| Connection refused | Verify server endpoint format matches your cloud (AWS vs Azure). Check firewall allowlists. |
| Authentication failed | Confirm service principal client_id/secret. Verify GRANT statements on the target table. |
| Schema mismatch | Ensure record fields match the target table schema exactly. Regenerate .proto if table changed. |
| Stream closed unexpectedly | Implement retry with exponential backoff and stream reinitialization. See 5-operations-and-limits.md. |
| Throughput limits hit | Max 100 MB/s and 15,000 rows/s per stream. Open multiple streams or contact Databricks. |
| Region not supported | Check supported regions in 5-operations-and-limits.md. |
| Table not found | Ensure table is a managed Delta table in a supported region with correct three-part name. |
| SDK install fails on serverless | The Zerobus SDK cannot be pip-installed on serverless compute. Use classic compute clusters or the REST API (Beta) from notebooks. |
| Error 4024 / authorization_details | Service principal lacks explicit table-level grants. Grant `SELECT` and `MODIFY` directly on the target table; schema-level inherited grants may be insufficient. |
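For the "Stream closed unexpectedly" case, a reconnect loop with jittered exponential backoff is the usual pattern. A minimal sketch of the delay schedule; the defaults (5 attempts, 1 s base, 30 s cap) are illustrative choices, and the commented reconnect loop reuses the names from the minimal example above:

```python
import random

def backoff_delays(max_attempts=5, base=1.0, cap=30.0):
    """Yield jittered exponential delays (in seconds) between reconnection
    attempts: base * 2**attempt, capped, then scaled by random jitter to
    avoid thundering-herd reconnects."""
    for attempt in range(max_attempts):
        delay = min(cap, base * (2 ** attempt))
        yield delay * random.uniform(0.5, 1.0)

# A reconnect loop would sleep on each delay and recreate the stream, e.g.:
# for delay in backoff_delays():
#     time.sleep(delay)
#     try:
#         stream = sdk.create_stream(client_id, client_secret, table_props, options)
#         break
#     except Exception:
#         continue
```

See 5-operations-and-limits.md for the full retry and reconnection guidance.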
Related Skills
- databricks-python-sdk - General SDK patterns and WorkspaceClient for table/schema management
- databricks-spark-declarative-pipelines - Downstream pipeline processing of ingested data
- databricks-unity-catalog - Managing catalogs, schemas, and tables that Zerobus writes to
- databricks-synthetic-data-gen - Generate test data to feed into Zerobus producers
- databricks-config - Profile and authentication setup