Trending-skills openduck-distributed-duckdb
OpenDuck — open-source distributed DuckDB with differential storage, hybrid dual execution, and transparent remote database attach
git clone https://github.com/Aradotso/trending-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/openduck-distributed-duckdb" ~/.claude/skills/aradotso-trending-skills-openduck-distributed-duckdb && rm -rf "$T"
skills/openduck-distributed-duckdb/SKILL.mdOpenDuck Distributed DuckDB
Skill by ara.so — Daily 2026 Skills collection
OpenDuck is an open-source implementation of distributed DuckDB featuring differential storage (append-only immutable layers via Postgres + object store), hybrid dual execution (single queries split across local and remote workers), and transparent remote database attach via
ATTACH 'openduck:mydb'. It is architecturally inspired by MotherDuck but fully open protocol (gRPC + Arrow IPC).
Architecture Overview
DuckDB client (openduck extension) └─ ATTACH 'openduck:mydb?endpoint=...' AS cloud └─ gRPC + Arrow IPC └─ Gateway (Rust) ├─ auth / routing / plan splitting ├─ Worker 1 (embedded DuckDB) └─ Worker N (embedded DuckDB) ├─ Postgres (metadata) └─ Object store (sealed layers)
Key concepts:
/OpenDuckCatalog
— remote tables appear as first-class DuckDB catalog entriesOpenDuckTableEntry- Hybrid execution — gateway labels operators
orLOCAL
, insertsREMOTE
operators at boundariesBridge - Differential storage — immutable sealed layers, snapshot isolation, one write path, many readers
- Protocol — only 2 gRPC RPCs defined in
proto/openduck/v1/execution.proto
Repository Layout
crates/ exec-gateway/ # auth, routing, hybrid plan splitting exec-worker/ # embedded DuckDB, Arrow IPC streaming exec-proto/ # protobuf/tonic codegen openduck-cli/ # unified CLI (serve|gateway|worker) diff-*/ # differential storage (layers, metadata, FUSE) extensions/ openduck/ # DuckDB C++ extension (StorageExtension + Catalog) clients/ python/ # pip-installable openduck wrapper proto/ openduck/v1/ # execution.proto
Installation & Build
Prerequisites
- Rust toolchain (stable)
- C++ build tools,
,vcpkg
(macOS:bison
)brew install bison - DuckDB development headers (handled by the extension Makefile)
1. Build the Rust backend
git clone https://github.com/CITGuru/openduck cd openduck cargo build --workspace
2. Build the DuckDB C++ extension
cd extensions/openduck make # Output: # extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension
3. Install the Python client (optional)
pip install -e clients/python
Running the Server
# Required env vars export OPENDUCK_TOKEN=your-secret-token # Start all-in-one (gateway + worker) cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN # Or run gateway and worker separately cargo run -p openduck-cli -- gateway --port 7878 cargo run -p openduck-cli -- worker --gateway http://localhost:7878
Connecting from Python
Via openduck wrapper (recommended)
import openduck # auto-detects extension from build tree or OPENDUCK_EXTENSION_PATH con = openduck.connect("mydb") # uses OPENDUCK_TOKEN env var con.sql("SELECT 1 AS x").show() con.sql("SELECT * FROM cloud.users LIMIT 10").show()
Via raw DuckDB SDK
import duckdb import os ext_path = os.environ["OPENDUCK_EXTENSION_PATH"] # e.g. extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension con = duckdb.connect(config={"allow_unsigned_extensions": "true"}) con.execute(f"LOAD '{ext_path}';") con.execute( "ATTACH 'openduck:mydb" "?endpoint=http://localhost:7878" f"&token={os.environ[\"OPENDUCK_TOKEN\"]}' AS cloud;" ) # Query remote table con.sql("SELECT * FROM cloud.users LIMIT 10").show() # Hybrid query — local table joined with remote table con.sql(""" SELECT l.product_id, l.name, r.total_sales FROM local.products l JOIN cloud.sales r ON l.product_id = r.product_id WHERE r.total_sales > 1000 """).show()
Environment variables
export OPENDUCK_TOKEN=your-secret-token export OPENDUCK_EXTENSION_PATH=extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension export OPENDUCK_ENDPOINT=http://localhost:7878 # default
Connecting from the CLI
duckdb -unsigned -c " LOAD 'extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension'; ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token=${OPENDUCK_TOKEN}' AS cloud; SHOW ALL TABLES; SELECT * FROM cloud.users LIMIT 5; "
Connecting from Rust
use duckdb::{Connection, Result}; fn main() -> Result<()> { let conn = Connection::open_in_memory()?; let ext_path = std::env::var("OPENDUCK_EXTENSION_PATH").unwrap(); let token = std::env::var("OPENDUCK_TOKEN").unwrap(); conn.execute_batch(&format!(r#" SET allow_unsigned_extensions = true; LOAD '{ext_path}'; ATTACH 'openduck:mydb?endpoint=http://localhost:7878&token={token}' AS cloud; "#))?; let mut stmt = conn.prepare("SELECT * FROM cloud.users LIMIT 10")?; let rows = stmt.query_map([], |row| { Ok(row.get::<_, String>(0)?) })?; for row in rows { println!("{}", row?); } Ok(()) }
Hybrid Execution Pattern
Hybrid execution happens automatically — the gateway splits the logical plan:
[LOCAL] HashJoin(l.id = r.id) [LOCAL] Scan(products) ← runs on your machine [LOCAL] Bridge(REMOTE→LOCAL) [REMOTE] Scan(sales) ← runs on remote worker
Write queries naturally — the extension handles routing:
# This single query runs across two engines transparently con.sql(""" SELECT p.category, SUM(s.amount) AS revenue FROM local.products p -- local table JOIN cloud.sales s -- remote table ON p.id = s.product_id GROUP BY p.category ORDER BY revenue DESC """).show()
Differential Storage
Differential storage is managed server-side. Key properties:
- Append-only sealed layers stored in object storage (S3-compatible)
- Postgres stores layer metadata and snapshot pointers
- Snapshot isolation — readers always see a consistent view
- One serialized write path — many concurrent readers
From a client perspective it is fully transparent. DuckDB sees normal table semantics.
ATTACH URL Reference
openduck:<database_name>?endpoint=<url>&token=<token>
| Parameter | Default | Description |
|---|---|---|
| | Gateway URL |
| env var | Auth token matching server config |
Examples:
-- Local dev ATTACH 'openduck:mydb?token=dev-token' AS cloud; -- Remote server, explicit endpoint ATTACH 'openduck:mydb?endpoint=https://my-server.example.com&token=prod-token' AS cloud; -- Alias: od: also works ATTACH 'od:mydb?endpoint=http://localhost:7878&token=dev-token' AS cloud;
DuckLake Integration
OpenDuck and DuckLake operate at different layers and complement each other:
import duckdb, os ext_path = os.environ["OPENDUCK_EXTENSION_PATH"] token = os.environ["OPENDUCK_TOKEN"] con = duckdb.connect(config={"allow_unsigned_extensions": "true"}) con.execute(f"LOAD '{ext_path}';") # Attach DuckLake catalog (managed by remote worker backed by DuckLake) con.execute(f"ATTACH 'openduck:lakehouse?endpoint=http://localhost:7878&token={token}' AS lh;") # Query DuckLake tables transparently via OpenDuck transport con.sql("SELECT * FROM lh.events WHERE event_date = today()").show() # Hybrid: local scratch data joined with remote DuckLake table con.sql(""" SELECT l.session_id, r.user_email FROM memory.sessions l JOIN lh.users r ON l.user_id = r.id """).show()
Protocol Reference
The wire protocol is intentionally minimal. See
proto/openduck/v1/execution.proto:
— send SQL, receive a query handleExecuteQuery
— stream Arrow IPC record batches back to clientStreamResults
Any gRPC service implementing these two RPCs is a valid OpenDuck backend. You can replace the Rust gateway with a custom implementation in any language.
Common Patterns
Check which tables are available remotely
-- After ATTACH ... AS cloud SHOW ALL TABLES; SELECT table_name FROM information_schema.tables WHERE table_schema = 'cloud';
Write to a remote table
INSERT INTO cloud.events SELECT * FROM read_parquet('local_dump.parquet');
Create a remote table from local data
CREATE TABLE cloud.new_table AS SELECT * FROM local_csv LIMIT 0; INSERT INTO cloud.new_table SELECT * FROM local_csv;
Export remote query result to local Parquet
con.sql("SELECT * FROM cloud.large_table WHERE region = 'us-east'") \ .write_parquet("output/us_east.parquet")
Troubleshooting
Extension is not trusted
/ signature error
Extension is not trusted# Always set allow_unsigned_extensions before loading con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
Or in CLI:
duckdb -unsigned
LOAD
fails — extension not found
LOAD# Set the env var to the exact built path export OPENDUCK_EXTENSION_PATH=$(pwd)/extensions/openduck/build/release/extension/openduck/openduck.duckdb_extension ls -la $OPENDUCK_EXTENSION_PATH # confirm it exists
Connection refused to gateway
# Verify server is running cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN # Default port is 7878 — check firewall / port binding curl http://localhost:7878/health
Token mismatch / auth failure
# Server token and client token must match exactly export OPENDUCK_TOKEN=same-value-on-both-sides # Server: cargo run -p openduck-cli -- serve -d mydb -t $OPENDUCK_TOKEN # Client: ATTACH '...&token=same-value-on-both-sides' AS cloud;
Build fails on macOS — bison version
brew install bison export PATH="$(brew --prefix bison)/bin:$PATH" cd extensions/openduck && make
Extension version mismatch with DuckDB
The extension must be built against the same DuckDB version as the Python package:
python -c "import duckdb; print(duckdb.__version__)" # Ensure the extension Makefile targets the same version # Check extensions/openduck/Makefile for DUCKDB_VERSION
OpenDuck vs Alternatives
| Feature | OpenDuck | Arrow Flight SQL | DuckLake |
|---|---|---|---|
| Remote attach UX | | Separate driver | |
| Hybrid execution | ✅ split plan | ❌ full remote | ❌ |
| DuckDB catalog integration | ✅ native | ❌ | ✅ |
| Protocol RPCs | 2 | ~15 | N/A |
| Differential storage | ✅ | ❌ | via Parquet layers |
| Self-hosted | ✅ | ✅ | ✅ |