Agent-skills chdb-datastore
install
source · Clone the upstream repo
git clone https://github.com/ClickHouse/agent-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ClickHouse/agent-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/chdb-datastore" ~/.claude/skills/clickhouse-agent-skills-chdb-datastore && rm -rf "$T"
manifest:
skills/chdb-datastore/SKILL.mdsource content
chdb DataStore — It's Just Faster Pandas
The Key Insight
# Change this: import pandas as pd # To this: import chdb.datastore as pd # Everything else stays the same.
DataStore is a lazy, ClickHouse-backed pandas replacement. Your existing pandas code works unchanged — but operations compile to optimized SQL and execute only when results are needed (e.g.,
print(), len(), iteration).
pip install chdb
Decision Tree: Pick the Right Approach
1. "I have a file/database and want to analyze it with pandas" → DataStore.from_file() / from_mysql() / from_s3() etc. → See references/connectors.md 2. "I need to join data from different sources" → Create DataStores from each source, use .join() → See examples/examples.md #3-5 3. "My pandas code is too slow" → import chdb.datastore as pd — change one line, keep the rest 4. "I need raw SQL queries" → Use the chdb-sql skill instead
Connect to Any Data Source — One Pattern
from datastore import DataStore # Local file (auto-detects .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml) ds = DataStore.from_file("sales.parquet") # Database ds = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass") # Cloud storage ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True) # URI shorthand — auto-detects source type ds = DataStore.uri("mysql://root:pass@db:3306/shop/orders")
All 16+ sources and URI schemes → connectors.md
After Connecting — Full Pandas API
result = ds[ds["age"] > 25] # filter result = ds[["name", "city"]] # select columns result = ds.sort_values("revenue", ascending=False) # sort result = ds.groupby("dept")["salary"].mean() # groupby result = ds.assign(margin=lambda x: x["profit"] / x["revenue"]) # computed column ds["name"].str.upper() # string accessor ds["date"].dt.year # datetime accessor result = ds1.join(ds2, on="id") # join result = ds.head(10) # preview print(ds.to_sql()) # see generated SQL
209 DataFrame methods supported. Full API → api-reference.md
Cross-Source Join — The Killer Feature
from datastore import DataStore customers = DataStore.from_mysql(host="db:3306", database="crm", table="customers", user="root", password="pass") orders = DataStore.from_file("orders.parquet") result = (orders .join(customers, left_on="customer_id", right_on="id") .groupby("country") .agg({"amount": "sum", "rating": "mean"}) .sort_values("sum", ascending=False)) print(result)
More join examples → examples.md
Writing Data
source = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass") target = DataStore("file", path="summary.parquet", format="Parquet") target.insert_into("category", "total", "count").select_from( source.groupby("category").select("category", "sum(amount) AS total", "count() AS count") ).execute()
Troubleshooting
| Problem | Fix |
|---|---|
| |
| Use or |
| Database connection timeout | Include port in host: not |
| Join returns empty result | Check key types match (both int or both string); use to inspect |
| Unexpected results | Call to see the generated SQL and debug |
| Environment check | Run (from skill directory) |
References
- API Reference — Full DataStore method signatures
- Connectors — All 16+ data source connection methods
- Examples — 10+ runnable examples with expected output
- Verify Install — Environment verification script
- Official Docs
Note: This skill teaches how to use chdb DataStore. For raw SQL queries, use the
skill. For contributing to chdb source code, see CLAUDE.md in the project root.chdb-sql