Claude-code-plugins-plus-skills clickhouse-core-workflow-a

install

source · Clone the upstream repo

git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/clickhouse-pack/skills/clickhouse-core-workflow-a" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-clickhouse-core-workflow-a && rm -rf "$T"

manifest: plugins/saas-packs/clickhouse-pack/skills/clickhouse-core-workflow-a/SKILL.md

ClickHouse Schema Design (Core Workflow A)

Overview

Design ClickHouse tables with correct engine selection, ORDER BY keys, partitioning, and codec choices for analytical workloads.

Prerequisites

@clickhouse/client

connected (see

clickhouse-install-auth

)

Understanding of your query patterns (what you filter and group on)

Instructions

Step 1: Choose the Right Engine

Engine	Best For	Dedup?	Example
`MergeTree`	General analytics, append-only logs	No	Clickstream, IoT
`ReplacingMergeTree`	Mutable rows (upserts)	Yes (on merge)	User profiles, state
`SummingMergeTree`	Pre-aggregated counters	Sums numerics	Page view counts
`AggregatingMergeTree`	Materialized view targets	Merges states	Dashboards
`CollapsingMergeTree`	Stateful row updates	Collapses +-1	Shopping carts

ClickHouse Cloud uses

SharedMergeTree

— it is a drop-in replacement for

MergeTree

on Cloud. You do not need to change your DDL.

Step 2: Design the ORDER BY (Sort Key)

The

ORDER BY

clause is the single most important schema decision. It defines:

Primary index — sparse index over sort-key granules (8192 rows default)
Data layout on disk — rows sorted physically by these columns
Query speed — queries filtering on ORDER BY prefix columns hit fewer granules

Rules of thumb:

Put low-cardinality filter columns first (
```
event_type
```
,
```
status
```
)
Then high-cardinality columns you filter on (
```
user_id
```
,
```
tenant_id
```
)
End with a time column if you use range filters (
```
created_at
```
)
Do NOT put high-cardinality columns you never filter on in ORDER BY

-- Good: filter by tenant, then by time ranges
ORDER BY (tenant_id, event_type, created_at)

-- Bad: UUID first means every query scans the full index
ORDER BY (event_id, created_at)  -- event_id is random UUID

Step 3: Schema Examples

Event Analytics Table

CREATE TABLE analytics.events (
    event_id     UUID DEFAULT generateUUIDv4(),
    tenant_id    UInt32,
    event_type   LowCardinality(String),
    user_id      UInt64,
    session_id   String,
    properties   String CODEC(ZSTD(3)),  -- JSON blob, compress well
    url          String CODEC(ZSTD(1)),
    ip_address   IPv4,
    country      LowCardinality(FixedString(2)),
    created_at   DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
ORDER BY (tenant_id, event_type, toDate(created_at), user_id)
PARTITION BY toYYYYMM(created_at)
TTL created_at + INTERVAL 1 YEAR
SETTINGS index_granularity = 8192;

User Profile Table (Upserts)

CREATE TABLE analytics.users (
    user_id      UInt64,
    email        String,
    plan         LowCardinality(String),
    mrr_cents    UInt32,
    properties   String CODEC(ZSTD(3)),
    updated_at   DateTime DEFAULT now()
)
ENGINE = ReplacingMergeTree(updated_at)   -- keeps latest row per ORDER BY key
ORDER BY user_id;

-- Query with FINAL to get deduplicated results
SELECT * FROM analytics.users FINAL WHERE user_id = 42;

Daily Aggregation Table

CREATE TABLE analytics.daily_stats (
    date         Date,
    tenant_id    UInt32,
    event_type   LowCardinality(String),
    event_count  UInt64,
    unique_users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (tenant_id, event_type, date);

Step 4: Partitioning Guidelines

Partition Expression	Typical Use	Parts Per Partition
`toYYYYMM(date)`	Most common — monthly	Target 10-1000
`toMonday(date)`	Weekly rollups	More parts, finer drops
`toYYYYMMDD(date)`	Daily TTL drops	Many parts — use carefully
None	Small tables (<1M rows)	Fine

Warning: Each partition creates separate parts on disk. Over-partitioning (e.g., by

user_id

) creates millions of tiny parts and kills performance.

Step 5: Codecs and Compression

-- Column-level compression codecs
column1  UInt64 CODEC(Delta, ZSTD(3)),      -- Time series / sequential IDs
column2  Float64 CODEC(Gorilla, ZSTD(1)),   -- Floating point (similar values)
column3  String CODEC(ZSTD(3)),              -- General text / JSON
column4  DateTime CODEC(DoubleDelta, ZSTD),  -- Timestamps (near-sequential)

Applying Schema via Node.js

import { createClient } from '@clickhouse/client';

const client = createClient({ url: process.env.CLICKHOUSE_HOST! });

async function applySchema() {
  await client.command({ query: 'CREATE DATABASE IF NOT EXISTS analytics' });

  await client.command({
    query: `
      CREATE TABLE IF NOT EXISTS analytics.events (
        event_id   UUID DEFAULT generateUUIDv4(),
        tenant_id  UInt32,
        event_type LowCardinality(String),
        user_id    UInt64,
        payload    String CODEC(ZSTD(3)),
        created_at DateTime DEFAULT now()
      )
      ENGINE = MergeTree()
      ORDER BY (tenant_id, event_type, created_at)
      PARTITION BY toYYYYMM(created_at)
    `,
  });

  console.log('Schema applied.');
}

Error Handling

Error	Cause	Solution
`ORDER BY expression not in primary key`	PRIMARY KEY != ORDER BY	Remove explicit PRIMARY KEY or align
`Too many parts (300+)`	Over-partitioning	Use coarser partition expression
`Cannot convert String to UInt64`	Wrong data type	Match insert types to schema
`TTL expression type mismatch`	TTL on non-date column	TTL must reference DateTime column

Resources

Next Steps

For inserting and querying data, see

clickhouse-core-workflow-b