Claude-skill-registry clickhouse-architect
ClickHouse schema design and optimization. TRIGGERS - ClickHouse schema, compression codecs, MergeTree, ORDER BY tuning, partition key.
```bash
git clone https://github.com/majiayu000/claude-skill-registry

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/clickhouse-architect" ~/.claude/skills/majiayu000-claude-skill-registry-clickhouse-architect && rm -rf "$T"
```
skills/data/clickhouse-architect/SKILL.md

ClickHouse Architect
<!-- ADR: 2025-12-09-clickhouse-architect-skill -->
Prescriptive schema design, compression selection, and performance optimization for ClickHouse (v24.4+). Covers both ClickHouse Cloud (SharedMergeTree) and self-hosted (ReplicatedMergeTree) deployments.
Core Methodology
Schema Design Workflow
Follow this sequence when designing or reviewing ClickHouse schemas:
- Define ORDER BY key (3-5 columns, lowest cardinality first)
- Select compression codecs per column type
- Configure PARTITION BY for data lifecycle management
- Add performance accelerators (projections, indexes)
- Validate with audit queries (see scripts/)
- Document with COMMENT statements (see references/schema-documentation.md and the sketch below)
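A minimal sketch of step 6, assuming a hypothetical trades table (column names and comment text are illustrative):

```sql
-- COMMENT metadata lives in the schema itself and is visible via
-- system.tables and system.columns
CREATE TABLE trades
(
    exchange LowCardinality(String) COMMENT 'Venue code, e.g. BINANCE',
    price Float64 COMMENT 'Trade price in quote currency'
)
ENGINE = MergeTree
ORDER BY exchange
COMMENT 'Raw trade events; documented per references/schema-documentation.md';
```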
ORDER BY Key Selection
The ORDER BY clause is the most critical decision in ClickHouse schema design.
Rules:
- Limit to 3-5 columns maximum (each additional column has diminishing returns)
- Place lowest cardinality columns first (e.g., tenant_id before timestamp)
- Include all columns used in WHERE clauses for range queries
- PRIMARY KEY must be a prefix of ORDER BY (or omit to use full ORDER BY)
Example:
```sql
-- Correct: Low cardinality first, 4 columns
CREATE TABLE trades
(
    exchange LowCardinality(String),
    symbol LowCardinality(String),
    timestamp DateTime64(3),
    trade_id UInt64,
    price Float64,
    quantity Float64
)
ENGINE = MergeTree()
ORDER BY (exchange, symbol, timestamp, trade_id);

-- Wrong: High cardinality first (10x slower queries)
ORDER BY (trade_id, timestamp, symbol, exchange);
```
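To confirm the key actually prunes granules for your query shape, a quick check with EXPLAIN (the filter values are illustrative):

```sql
-- indexes = 1 shows how many granules the primary key selects
EXPLAIN indexes = 1
SELECT count()
FROM trades
WHERE exchange = 'BINANCE'
  AND symbol = 'BTCUSDT'
  AND timestamp >= now() - INTERVAL 1 HOUR;
-- Expect the PrimaryKey step to select far fewer granules than the total
```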
Compression Codec Quick Reference
| Column Type | Default Codec | Read-Heavy Alternative | Example |
|---|---|---|---|
| DateTime/DateTime64 | `CODEC(DoubleDelta, ZSTD)` | `CODEC(DoubleDelta, LZ4)` | `timestamp DateTime64(3) CODEC(DoubleDelta, ZSTD)` |
| Float prices/gauges | `CODEC(Gorilla, ZSTD)` | `CODEC(Gorilla, LZ4)` | `price Float64 CODEC(Gorilla, ZSTD)` |
| Integer counters | `CODEC(Delta, ZSTD)` | — | `trade_id UInt64 CODEC(Delta, ZSTD)` |
| Slowly changing integers | `CODEC(T64, ZSTD)` | `CODEC(T64, LZ4)` | `status UInt8 CODEC(T64, ZSTD)` |
| String (low cardinality) | `LowCardinality(String)` | — | `exchange LowCardinality(String)` |
| General data | `CODEC(ZSTD(3))` | `CODEC(LZ4)` | Default compression level 3 |
When to use LZ4 over ZSTD: LZ4 provides 1.76x faster decompression. Use LZ4 for read-heavy workloads with monotonic sequences (timestamps, counters). Use ZSTD (default) when compression ratio matters or data patterns are unknown.
Note on codec combinations:
Delta/DoubleDelta + Gorilla combinations are blocked by default (the allow_suspicious_codecs setting) because Gorilla already performs implicit delta compression internally; combining them is redundant, not dangerous. A historical corruption bug (PR #45615, Jan 2023) was fixed, but the blocking remains as a best-practice guardrail.
Use each codec family independently for its intended data type:
```sql
-- Correct usage
price Float64 CODEC(Gorilla, ZSTD)            -- Floats: use Gorilla
timestamp DateTime64 CODEC(DoubleDelta, ZSTD) -- Timestamps: use DoubleDelta
timestamp DateTime64 CODEC(DoubleDelta, LZ4)  -- Read-heavy: use LZ4
```
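To verify a codec choice after loading representative data, achieved compression ratios can be read from system.columns; a minimal sketch:

```sql
-- Compression achieved per column for the trades table
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'trades'
ORDER BY data_compressed_bytes DESC;
```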
PARTITION BY Guidelines
PARTITION BY is for data lifecycle management, NOT query optimization.
Rules:
- Partition by time units (month, week) for TTL and data management
- Keep partition count under 1000 total across all tables
- Each partition should contain 1-300 parts maximum
- Never partition by high-cardinality columns
Example:
```sql
-- Correct: Monthly partitions for TTL management
PARTITION BY toYYYYMM(timestamp)

-- Wrong: Daily partitions (too many parts)
PARTITION BY toYYYYMMDD(timestamp)

-- Wrong: High-cardinality partition key
PARTITION BY user_id
```
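Since time-based partitions exist for lifecycle management, they pair naturally with TTL; a minimal sketch (the table name and 12-month retention are illustrative):

```sql
CREATE TABLE trades_raw
(
    exchange LowCardinality(String),
    symbol LowCardinality(String),
    timestamp DateTime64(3),
    price Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (exchange, symbol, timestamp)
TTL toDateTime(timestamp) + INTERVAL 12 MONTH DELETE;
-- With monthly partitions, expired data ages out a partition at a time
```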
Anti-Patterns Checklist (v24.4+)
| Pattern | Severity | Modern Status | Fix |
|---|---|---|---|
| Too many parts (>300/partition) | Critical | Still critical | Reduce partition granularity |
| Small batch inserts (<1000) | Critical | Still critical | Batch to 10k-100k rows |
| High-cardinality first ORDER BY | Critical | Still critical | Reorder: lowest cardinality first |
| No memory limits | High | Still critical | Set `max_memory_usage` |
| Denormalization overuse | High | Still critical | Use dictionaries + materialized views |
| Large JOINs | Medium | 180x improved | Still avoid for ultra-low-latency |
| Mutations (UPDATE/DELETE) | Medium | 1700x improved | Use lightweight updates (v24.4+); see the sketch below |
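For the mutations row, as one illustration of the improvement: ClickHouse supports a lightweight DELETE syntax that masks rows instead of rewriting parts (the predicate below is illustrative):

```sql
-- Lightweight: rows are masked immediately, cleaned up by later merges
DELETE FROM trades WHERE trade_id = 42;

-- Classic mutation: rewrites every affected part, far more expensive
ALTER TABLE trades DELETE WHERE trade_id = 42;
```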
Table Engine Selection
| Deployment | Engine | Use Case |
|---|---|---|
| ClickHouse Cloud | SharedMergeTree | Default for cloud deployments |
| Self-hosted cluster | ReplicatedMergeTree | Multi-node with replication |
| Self-hosted single | MergeTree | Single-node development/testing |
Cloud (SharedMergeTree):
```sql
CREATE TABLE trades (...)
ENGINE = SharedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
ORDER BY (exchange, symbol, timestamp);
```
Self-hosted (ReplicatedMergeTree):
```sql
CREATE TABLE trades (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trades', '{replica}')
ORDER BY (exchange, symbol, timestamp);
```
Skill Delegation Guide
<!-- ADR: 2025-12-10-clickhouse-skill-delegation -->
This skill is the hub for ClickHouse-related tasks. When the user's needs extend beyond schema design, invoke the related skills below.
Delegation Decision Matrix
| User Need | Invoke Skill | Trigger Phrases |
|---|---|---|
| Create database users, manage permissions | clickhouse-cloud-management | "create user", "GRANT", "permissions", "credentials" |
| Configure DBeaver, generate connection JSON | clickhouse-pydantic-config | "DBeaver", "client config", "connection setup" |
| Validate schema contracts against live database | schema-e2e-validation | "validate schema", "Earthly E2E", "schema contract" |
Typical Workflow Sequence
- Schema Design (THIS SKILL) → Design ORDER BY, compression, partitioning
- User Setup → clickhouse-cloud-management (if cloud credentials needed)
- Client Config → clickhouse-pydantic-config (generate DBeaver JSON)
- Validation → schema-e2e-validation (CI/CD schema contracts)
Example: Full Stack Request
User: "I need to design a trades table for ClickHouse Cloud and set up DBeaver to query it."
Expected behavior:
- Use THIS skill for schema design
- Invoke clickhouse-cloud-management for creating the database user
- Invoke clickhouse-pydantic-config for DBeaver configuration
Performance Accelerators
Projections
Create alternative sort orders that ClickHouse automatically selects:
```sql
ALTER TABLE trades ADD PROJECTION trades_by_symbol
(
    SELECT * ORDER BY symbol, timestamp
);

ALTER TABLE trades MATERIALIZE PROJECTION trades_by_symbol;
```
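Once materialized, queries that filter by symbol alone can read the projection instead of scanning the base (exchange, symbol, ...) order; an illustrative check:

```sql
-- Served from the projection's (symbol, timestamp) sort order
SELECT count()
FROM trades
WHERE symbol = 'BTCUSDT'
  AND timestamp >= now() - INTERVAL 1 DAY;

-- The chosen projection appears in the query plan
EXPLAIN indexes = 1
SELECT count() FROM trades WHERE symbol = 'BTCUSDT';
```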
Materialized Views
Pre-compute aggregations for dashboard queries:
```sql
CREATE MATERIALIZED VIEW trades_hourly_mv
ENGINE = SummingMergeTree()
ORDER BY (exchange, symbol, hour)
AS SELECT
    exchange,
    symbol,
    toStartOfHour(timestamp) AS hour,
    sum(quantity) AS total_volume,
    count() AS trade_count
FROM trades
GROUP BY exchange, symbol, hour;
```
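One caveat worth showing: SummingMergeTree collapses duplicate keys only during background merges, so dashboard queries should still re-aggregate on read:

```sql
-- Re-aggregate in case some rows have not been merged yet
SELECT
    exchange,
    symbol,
    hour,
    sum(total_volume) AS total_volume,
    sum(trade_count)  AS trade_count
FROM trades_hourly_mv
GROUP BY exchange, symbol, hour
ORDER BY hour DESC;
```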
Dictionaries
Replace JOINs with O(1) dictionary lookups for large-scale star schemas:
When to use dictionaries (v24.4+):
- Fact tables with 100M+ rows joining dimension tables
- Dimension tables 1k-500k rows with monotonic keys
- LEFT ANY JOIN semantics required
When JOINs are sufficient (v24.4+):
- Dimension tables <500 rows (JOIN overhead negligible)
- v24.4+ predicate pushdown provides 8-180x improvements
- Complex JOIN types (FULL, RIGHT, multi-condition)
Benchmark context: 6.6x speedup measured on Star Schema Benchmark (1.4B rows).
```sql
CREATE DICTIONARY symbol_info
(
    symbol String,
    name String,
    sector String
)
PRIMARY KEY symbol
SOURCE(CLICKHOUSE(TABLE 'symbols'))
LAYOUT(COMPLEX_KEY_HASHED())  -- String keys need a complex_key layout;
                              -- FLAT() is best for <500k monotonic UInt64 keys
LIFETIME(3600);

-- Use in queries (O(1) lookup)
SELECT
    symbol,
    dictGet('symbol_info', 'name', tuple(symbol)) AS symbol_name
FROM trades;
```
Scripts
Execute comprehensive schema audit:
```bash
clickhouse-client --multiquery < scripts/schema-audit.sql
```
The audit script checks:
- Part count per partition (threshold: 300; see the sketch after this list)
- Compression ratios by column
- Query performance patterns
- Replication lag (if applicable)
- Memory usage patterns
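As a standalone version of the first check, part counts per partition can be read from system.parts; a minimal sketch:

```sql
-- Partitions at or past the 300-part threshold need attention
SELECT
    database,
    table,
    partition,
    count() AS part_count
FROM system.parts
WHERE active
GROUP BY database, table, partition
HAVING part_count > 300
ORDER BY part_count DESC;
```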
Additional Resources
Reference Files
| Reference | Content |
|---|---|
|  | Complete workflow with examples |
|  | Decision tree + benchmarks |
|  | 13 deadly sins + v24.4+ status |
|  | Query interpretation guide |
|  | Parameterized views, dictionaries, dedup |
| references/schema-documentation.md | COMMENT patterns + naming for AI understanding |
Python Driver Policy
<!-- ADR: 2025-12-10-clickhouse-python-driver-policy -->
Use clickhouse-connect (official) for all Python integrations.
```python
# ✅ RECOMMENDED: clickhouse-connect (official, HTTP)
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='localhost',
    port=8123,  # HTTP port
    username='default',
    password=''
)

result = client.query("SELECT * FROM trades LIMIT 1000")
df = client.query_df("SELECT * FROM trades")  # Pandas integration
```
Why NOT clickhouse-driver
| Factor | clickhouse-connect | clickhouse-driver |
|---|---|---|
| Maintainer | ClickHouse Inc. | Solo developer |
| Weekly commits | Yes (active) | Sparse (months) |
| Open issues | 41 (addressed) | 76 (accumulating) |
| Downloads/week | 2.7M | 1.5M |
| Bus factor risk | Low (company) | High (1 person) |
Do NOT use clickhouse-driver despite its ~26% speed advantage for large exports. The maintenance risk outweighs performance gains:
- Single maintainer (mymarilyn) with no succession plan
- Issues accumulating without response
- Risk of abandonment breaks production code
Exception: Only consider clickhouse-driver if you have extreme performance requirements (exporting millions of rows) AND accept the maintenance risk.
Related Skills
| Skill | Purpose |
|---|---|
| clickhouse-cloud-management | User/permission management |
| clickhouse-pydantic-config | DBeaver connection generation |
| schema-e2e-validation | YAML schema contracts |
|  | Database migration validation |