Claude-skill-registry databricks-asset-bundles
Modern deployment with Databricks Asset Bundles (DAB), supporting multi-environment configurations and CI/CD integration.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/databricks-asset-bundles" ~/.claude/skills/majiayu000-claude-skill-registry-databricks-asset-bundles && rm -rf "$T"
manifest: skills/data/databricks-asset-bundles/SKILL.md
Databricks Asset Bundles Skill
Overview
Databricks Asset Bundles (DAB) is a modern deployment framework that packages notebooks, DLT pipelines, jobs, and configurations into versioned, environment-aware bundles. It enables Infrastructure as Code for Databricks.
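For orientation, a bundle can be as small as a single databricks.yml with a name and one target; the sketch below is a minimal assumption-laden example (the workspace URL is a placeholder), not a template from this skill.

```yaml
# Minimal sketch of a bundle definition; the host value is a placeholder.
bundle:
  name: hello-bundle

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://your-workspace.databricks.com
```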
Key Benefits:
- Infrastructure as Code
- Multi-environment support (dev, staging, prod)
- Version control for all artifacts
- Automated deployment
- Environment-specific configurations
- Integrated with CI/CD
When to Use This Skill
Use Databricks Asset Bundles when you need to:
- Deploy pipelines across multiple environments
- Implement Infrastructure as Code
- Automate deployment workflows
- Manage environment-specific configurations
- Version control Databricks artifacts
- Enable collaborative development
- Standardize deployment processes
Core Concepts
1. Bundle Structure
Standard Bundle Layout:
```
my-bundle/
├── databricks.yml              # Main configuration
├── environments/
│   ├── dev.yml                 # Development overrides
│   ├── staging.yml             # Staging overrides
│   └── prod.yml                # Production overrides
├── src/
│   ├── notebooks/
│   │   ├── bronze_ingestion.py
│   │   └── silver_transformation.py
│   └── pipelines/
│       └── dlt_pipeline.py
├── resources/
│   ├── jobs.yml
│   ├── pipelines.yml
│   └── clusters.yml
└── tests/
    └── test_transformations.py
```
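If you prefer not to build this layout by hand, the Databricks CLI can scaffold a starter bundle; `default-python` is one of the built-in templates (available template names may vary by CLI version), and the generated layout can then be adapted to the structure above.

```bash
# Generate a starter bundle interactively, then adjust it to match the layout above.
databricks bundle init default-python
```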
2. Main Configuration
databricks.yml:
```yaml
bundle:
  name: data-platform-bundle

  # Optional git configuration
  git:
    branch: main
    origin_url: https://github.com/org/repo.git

workspace:
  host: https://your-workspace.databricks.com
  root_path: /Workspace/bundles/${bundle.name}

# Define variables
variables:
  catalog_name:
    description: "Unity Catalog name"
    default: "dev_catalog"
  storage_path:
    description: "Base storage path"
    default: "/mnt/dev/data"
  cluster_size:
    description: "Cluster size"
    default: "small"

# Include other configuration files
include:
  - resources/*.yml

# Define resources
resources:
  jobs:
    daily_pipeline:
      name: "[${bundle.environment}] Daily Pipeline"
      tasks:
        - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./src/notebooks/bronze_ingestion
            source: WORKSPACE
            base_parameters:
              catalog: ${var.catalog_name}
              storage: ${var.storage_path}
          new_cluster:
            num_workers: 2
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge
            spark_conf:
              spark.databricks.delta.preview.enabled: "true"
        - task_key: silver_transformation
          depends_on:
            - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./src/notebooks/silver_transformation
            source: WORKSPACE
          job_cluster_key: shared_cluster
      job_clusters:
        - job_cluster_key: shared_cluster
          new_cluster:
            num_workers: "${var.cluster_size == 'small' ? 2 : 8}"
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge
      schedule:
        quartz_cron_expression: "0 0 1 * * ?"  # Daily at 1 AM
        timezone_id: "America/New_York"
      email_notifications:
        on_failure:
          - data-team@company.com

  pipelines:
    bronze_to_gold:
      name: "[${bundle.environment}] Bronze to Gold Pipeline"
      target: ${var.catalog_name}
      storage: ${var.storage_path}/dlt
      libraries:
        - notebook:
            path: ./src/pipelines/dlt_pipeline.py
      clusters:
        - label: default
          num_workers: 4
          node_type_id: i3.xlarge
      configuration:
        source_path: ${var.storage_path}/landing
        checkpoint_path: ${var.storage_path}/checkpoints
      development: false
      continuous: false

targets:
  dev:
    mode: development
    workspace:
      host: https://dev-workspace.databricks.com
      root_path: /Workspace/dev/${bundle.name}
    variables:
      catalog_name: dev_catalog
      storage_path: /mnt/dev/data
      cluster_size: small

  staging:
    mode: production
    workspace:
      host: https://staging-workspace.databricks.com
      root_path: /Workspace/staging/${bundle.name}
    variables:
      catalog_name: staging_catalog
      storage_path: /mnt/staging/data
      cluster_size: medium

  prod:
    mode: production
    workspace:
      host: https://prod-workspace.databricks.com
      root_path: /Workspace/prod/${bundle.name}
    variables:
      catalog_name: prod_catalog
      storage_path: /mnt/prod/data
      cluster_size: large
```
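Declared variables can also be overridden at deploy time without editing any target, via the CLI's `--var` flag; the value below is purely illustrative.

```bash
# One-off override of a declared variable for a single deployment (illustrative value).
databricks bundle deploy -t dev --var="catalog_name=feature_catalog"
```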
3. Environment-Specific Configuration
environments/prod.yml:
```yaml
# Production-specific overrides
variables:
  catalog_name: prod_catalog
  storage_path: /mnt/prod/data
  cluster_size: large

resources:
  jobs:
    daily_pipeline:
      # Production-specific settings
      max_concurrent_runs: 1
      timeout_seconds: 7200
      job_clusters:
        - job_cluster_key: shared_cluster
          new_cluster:
            num_workers: 8
            node_type_id: i3.2xlarge
            autoscale:
              min_workers: 4
              max_workers: 16
      email_notifications:
        on_start:
          - data-team@company.com
        on_success:
          - data-team@company.com
        on_failure:
          - data-team@company.com
          - oncall@company.com

  pipelines:
    bronze_to_gold:
      development: false
      continuous: true  # Continuous processing in prod
      clusters:
        - label: default
          num_workers: 8
          node_type_id: i3.2xlarge
          autoscale:
            min_workers: 4
            max_workers: 16
      notifications:
        - email_recipients:
            - data-team@company.com
          on_failure: true
          on_success: false
```
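Note that a file with top-level `variables:` and `resources:` blocks applies to every target it is merged into. One way to scope the overrides above to production only (a sketch, under the assumption that the file is listed in `include:` in databricks.yml) is to wrap them under a `targets: prod:` block:

```yaml
# environments/prod.yml — same overrides, scoped to the prod target (sketch).
targets:
  prod:
    variables:
      catalog_name: prod_catalog
      storage_path: /mnt/prod/data
    resources:
      jobs:
        daily_pipeline:
          max_concurrent_runs: 1
          timeout_seconds: 7200
```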
4. Deployment Workflow
CLI Commands:
```bash
# Install Databricks CLI
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Authenticate
databricks auth login --host https://your-workspace.databricks.com

# Validate bundle
databricks bundle validate -t dev

# Deploy to development
databricks bundle deploy -t dev

# Run a job
databricks bundle run -t dev daily_pipeline

# Deploy to production
databricks bundle deploy -t prod

# Destroy bundle (cleanup)
databricks bundle destroy -t dev
```
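In CI, the interactive `databricks auth login` step is typically replaced by environment variables; the CLI picks up `DATABRICKS_HOST` together with a token. A minimal sketch (the token value would come from your secret store):

```bash
# Non-interactive authentication for CI runners (values are placeholders).
export DATABRICKS_HOST="https://your-workspace.databricks.com"
export DATABRICKS_TOKEN="<token-from-secret-store>"

databricks bundle validate -t dev
databricks bundle deploy -t dev
```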
Implementation Patterns
Pattern 1: Multi-Environment Pipeline
Complete Bundle with Environment Variations:
```yaml
# databricks.yml
bundle:
  name: customer-analytics

variables:
  environment:
    description: "Deployment environment"
  catalog:
    description: "Unity Catalog"
  min_workers:
    description: "Minimum cluster workers"
    default: 2
  max_workers:
    description: "Maximum cluster workers"
    default: 8

resources:
  jobs:
    customer_pipeline:
      name: "[${var.environment}] Customer Analytics Pipeline"
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/ingest_customers
          new_cluster:
            num_workers: ${var.min_workers}
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge
        - task_key: transform
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/transform_customers
          new_cluster:
            autoscale:
              min_workers: ${var.min_workers}
              max_workers: ${var.max_workers}
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge
        - task_key: aggregate
          depends_on:
            - task_key: transform
          notebook_task:
            notebook_path: ./notebooks/aggregate_metrics
          new_cluster:
            num_workers: ${var.min_workers}
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge

targets:
  dev:
    variables:
      environment: dev
      catalog: dev_catalog
      min_workers: 2
      max_workers: 4

  prod:
    variables:
      environment: prod
      catalog: prod_catalog
      min_workers: 4
      max_workers: 16
```
Pattern 2: Modular Configuration
Split Configuration Across Files:
```yaml
# databricks.yml
bundle:
  name: data-platform

include:
  - resources/jobs/*.yml
  - resources/pipelines/*.yml
  - resources/clusters/*.yml
```

```yaml
# resources/jobs/ingestion_jobs.yml
resources:
  jobs:
    ingest_customers:
      name: "[${bundle.environment}] Ingest Customers"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebooks/ingest_customers

    ingest_orders:
      name: "[${bundle.environment}] Ingest Orders"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebooks/ingest_orders
```

```yaml
# resources/pipelines/dlt_pipelines.yml
resources:
  pipelines:
    customer_pipeline:
      name: "[${bundle.environment}] Customer DLT Pipeline"
      target: ${var.catalog}.customer
      libraries:
        - notebook:
            path: ./pipelines/customer_dlt

    order_pipeline:
      name: "[${bundle.environment}] Order DLT Pipeline"
      target: ${var.catalog}.orders
      libraries:
        - notebook:
            path: ./pipelines/order_dlt
```
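All included files are merged into a single bundle configuration at deploy time. A quick way to confirm that the resources from every included file resolved as expected is to validate the merged bundle; on newer CLI versions a summary command is also available (shown here as a sketch).

```bash
# Confirm that resources from all included files merged and variables resolve.
databricks bundle validate -t dev

# On newer CLI versions: list the resources the bundle will manage for this target.
databricks bundle summary -t dev
```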
Pattern 3: Python Deployment Script
Automated Deployment:
""" Automated bundle deployment script. """ import subprocess import sys from typing import Dict, Any class BundleDeployer: """Deploy Databricks Asset Bundles.""" def __init__(self, bundle_path: str): self.bundle_path = bundle_path def validate(self, target: str) -> bool: """Validate bundle configuration.""" print(f"Validating bundle for target: {target}") result = subprocess.run( ["databricks", "bundle", "validate", "-t", target], cwd=self.bundle_path, capture_output=True, text=True ) if result.returncode != 0: print(f"Validation failed: {result.stderr}") return False print("Validation successful") return True def deploy(self, target: str, force: bool = False) -> bool: """Deploy bundle to target environment.""" if not self.validate(target): return False print(f"Deploying bundle to {target}") cmd = ["databricks", "bundle", "deploy", "-t", target] if force: cmd.append("--force") result = subprocess.run( cmd, cwd=self.bundle_path, capture_output=True, text=True ) if result.returncode != 0: print(f"Deployment failed: {result.stderr}") return False print(f"Deployment successful: {result.stdout}") return True def run_job(self, target: str, job_key: str) -> bool: """Run a specific job from bundle.""" print(f"Running job: {job_key} on {target}") result = subprocess.run( ["databricks", "bundle", "run", "-t", target, job_key], cwd=self.bundle_path, capture_output=True, text=True ) if result.returncode != 0: print(f"Job run failed: {result.stderr}") return False print(f"Job started: {result.stdout}") return True def destroy(self, target: str, auto_approve: bool = False) -> bool: """Destroy bundle resources.""" print(f"WARNING: Destroying bundle resources in {target}") cmd = ["databricks", "bundle", "destroy", "-t", target] if auto_approve: cmd.append("--auto-approve") result = subprocess.run( cmd, cwd=self.bundle_path, capture_output=True, text=True ) if result.returncode != 0: print(f"Destroy failed: {result.stderr}") return False print("Bundle resources destroyed") return True # Usage if __name__ == "__main__": deployer = BundleDeployer("./my-bundle") # Deploy to development if deployer.deploy("dev"): deployer.run_job("dev", "daily_pipeline") # Deploy to production (requires approval) if len(sys.argv) > 1 and sys.argv[1] == "--prod": deployer.deploy("prod")
Pattern 4: GitOps Integration
GitHub Actions Workflow:
```yaml
# .github/workflows/bundle-deploy.yml
name: Deploy Databricks Bundle

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        type: choice
        options:
          - dev
          - staging
          - prod

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Validate Bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: |
          cd bundle/
          databricks bundle validate -t dev

  deploy-dev:
    needs: validate
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    environment: development
    steps:
      - uses: actions/checkout@v3

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to Development
        env:
          DATABRICKS_HOST: ${{ secrets.DEV_DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DEV_DATABRICKS_TOKEN }}
        run: |
          cd bundle/
          databricks bundle deploy -t dev

  deploy-prod:
    needs: validate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v3

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to Production
        env:
          DATABRICKS_HOST: ${{ secrets.PROD_DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.PROD_DATABRICKS_TOKEN }}
        run: |
          cd bundle/
          databricks bundle deploy -t prod
```
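A common extension of this workflow is a post-deploy smoke test in the dev job. The step below is a sketch: `test_job` is a hypothetical job key that would need to exist in the bundle.

```yaml
      # Additional step for deploy-dev, after the deploy step (sketch; test_job is hypothetical).
      - name: Run smoke test
        env:
          DATABRICKS_HOST: ${{ secrets.DEV_DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DEV_DATABRICKS_TOKEN }}
        run: |
          cd bundle/
          databricks bundle run -t dev test_job
```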
Best Practices
1. Bundle Organization
- Keep bundle files under version control
- Use environment-specific overrides
- Separate resources into logical files
- Document variable purposes
- Include README for bundle usage
2. Environment Management
```yaml
# Use consistent naming
targets:
  dev:
    mode: development   # Enables faster iterations
  staging:
    mode: production    # Production-like behavior
  prod:
    mode: production    # Full production settings
```
3. Variable Usage
```yaml
# Define reusable variables
variables:
  project_name:
    description: "Project identifier"
    default: "customer-analytics"

# Use variables consistently (resource keys stay static; interpolate in values)
resources:
  jobs:
    analytics_job:
      name: "[${bundle.environment}] ${var.project_name}"
```
4. Testing Strategy
```bash
# Test bundle locally
databricks bundle validate -t dev

# Deploy to dev for testing
databricks bundle deploy -t dev

# Run integration tests
databricks bundle run -t dev test_job

# Deploy to prod after validation
databricks bundle deploy -t prod
```
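For the unit-test side (the tests/ directory in the bundle layout), plain pytest against the transformation logic runs before any deployment. The sketch below assumes pyspark and pytest are installed locally; `add_full_name` is a hypothetical stand-in for a real transformation imported from src/.

```python
# tests/test_transformations.py — minimal sketch; add_full_name is a hypothetical helper.
import pytest
from pyspark.sql import SparkSession, functions as F


@pytest.fixture(scope="session")
def spark():
    # Local Spark session so tests run without a Databricks cluster.
    return SparkSession.builder.master("local[1]").appName("bundle-tests").getOrCreate()


def add_full_name(df):
    # Stand-in for the real transformation; replace with an import from src/.
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))


def test_add_full_name(spark):
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    result = add_full_name(df).collect()[0]
    assert result["full_name"] == "Ada Lovelace"
```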
Common Pitfalls to Avoid
Don't:
- Hard-code environment-specific values
- Skip validation before deployment
- Modify resources outside of bundles
- Use development mode in production
- Deploy without testing
Do:
- Use variables for environment differences
- Always validate before deploying
- Manage all resources through bundles
- Use production mode for prod
- Test in lower environments first
Complete Examples
See the /examples/ directory for:
- complete_bundle_project/: Full bundle structure
- multi_workspace_deployment/: Cross-workspace deployment
Related Skills
- delta-live-tables: Deploy DLT pipelines
- cicd-workflows: Automate deployments
- testing-patterns: Test before deploy
- data-products: Deploy data products