Claude-skill-registry clickhouse-patterns

Use when designing ClickHouse table schemas, choosing MergeTree engines, optimizing slow queries (execution time > 1 second), or implementing batch INSERT patterns. Does NOT cover real-time streaming from message queues or database replication/CDC - see related skills for those topics.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/clickhouse-patterns" ~/.claude/skills/majiayu000-claude-skill-registry-clickhouse-patterns && rm -rf "$T"
manifest: skills/data/clickhouse-patterns/SKILL.md
source content

ClickHouse Analytics Patterns

Overview

ClickHouse is a column-oriented DBMS for OLAP workloads. Core principle: Design tables for your query patterns, batch writes, leverage pre-aggregation.

Key insight: ClickHouse trades write complexity for read speed. If you're doing OLTP-style operations, you're using the wrong tool.

Scope

This skill covers:

  • Table schema design and MergeTree engine selection
  • Query optimization and performance tuning
  • Batch INSERT patterns (scheduled ETL jobs)
  • Materialized views for pre-aggregation

This skill does NOT cover:

  • Real-time streaming from Kafka/RabbitMQ → See
    clickhouse-streaming
  • Database replication (PostgreSQL/MySQL sync) → See
    clickhouse-cdc

If unsure: Start here for fundamentals, then check specialized skills.

When to Use

Symptoms:

  • Analytical queries time out on datasets > 1M rows
  • Need to choose between MergeTree variants
  • Seeing "too many partitions" or "too many parts" errors
  • Small frequent inserts causing degradation
  • Confused about ORDER BY vs PRIMARY KEY

When NOT to use:

  • OLTP workloads (use PostgreSQL/MySQL)
  • Small datasets < 100k rows
  • Need ACID transactions across tables
  • Real-time individual row updates

Quick Reference

Engine Selection

digraph choose_engine {
    rankdir=TB;
    node [shape=box, style=rounded];

    start [label="Choose MergeTree Engine", shape=ellipse];
    dedup [label="Need deduplication?", shape=diamond];
    preagg [label="Need pre-aggregation?", shape=diamond];

    replacing [label="ReplacingMergeTree\n(auto-dedup on merge)"];
    aggregating [label="AggregatingMergeTree\n(maintain aggregates)"];
    basic [label="MergeTree\n(default, most common)"];

    start -> dedup;
    dedup -> replacing [label="yes"];
    dedup -> preagg [label="no"];
    preagg -> aggregating [label="yes"];
    preagg -> basic [label="no"];
}

Engine               | Use When                     | Example
MergeTree            | Default (90% of cases)       | Event logs, metrics
ReplacingMergeTree   | Duplicate data from sources  | User events, CDC
AggregatingMergeTree | Real-time dashboard stats    | Hourly/daily aggregations

Critical Rules

Rule               | Why                                   | Example
ORDER BY first     | Primary index only works on a prefix  | ORDER BY (date, user_id)
Batch inserts      | Each insert = new part                | Minimum 1000 rows
Specify columns    | Column storage reads ALL columns      | Never SELECT *
Monthly partitions | Too many partitions = slow scans      | PARTITION BY toYYYYMM(date)

Table Design Patterns

MergeTree (Default)

CREATE TABLE events (
    date Date,
    timestamp DateTime,
    user_id String,
    event_type LowCardinality(String),
    properties String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, user_id, timestamp)
SETTINGS index_granularity = 8192;

Key decisions:

  • PARTITION BY toYYYYMM(date) - monthly partitions (the sweet spot)
  • ORDER BY (date, user_id, timestamp) - put filter columns FIRST (see the EXPLAIN sketch below)
  • LowCardinality(String) - for repeated values (< 10k unique)
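
To sanity-check that a filter actually uses the primary index prefix, EXPLAIN with index analysis shows how many granules survive pruning. A minimal sketch against the events table above (exact output varies by ClickHouse version):

-- Check primary-key pruning for a filter on the ORDER BY prefix
EXPLAIN indexes = 1
SELECT user_id, timestamp
FROM events
WHERE date >= '2025-01-01'
  AND user_id = 'user-123';
-- A PrimaryKey entry selecting far fewer granules than the total means the index is used;
-- a filter only on properties would select (nearly) all granules.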

ReplacingMergeTree (Deduplication)

CREATE TABLE user_events (
    event_id String,
    user_id String,
    timestamp DateTime,
    version UInt64
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, event_id, timestamp);

-- ⚠️ Dedup happens during MERGE, not on read
-- Use FINAL for guaranteed dedup (slower):
SELECT * FROM user_events FINAL WHERE user_id = 'user-123';
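
If FINAL is too slow, a common alternative is to deduplicate at read time with argMax over the version column. A sketch against the table above (it collapses duplicates per query, it does not replace the eventual background merges):

-- Query-time dedup without FINAL: keep the row with the highest version per (user_id, event_id)
SELECT
    user_id,
    event_id,
    argMax(timestamp, version) AS latest_timestamp
FROM user_events
WHERE user_id = 'user-123'
GROUP BY user_id, event_id;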

AggregatingMergeTree (Pre-aggregation)

-- 1. Target table
CREATE TABLE stats_hourly (
    hour DateTime,
    market_id String,
    total_volume AggregateFunction(sum, UInt64),
    total_trades AggregateFunction(count)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, market_id);

-- 2. Materialized view (auto-populates)
CREATE MATERIALIZED VIEW stats_hourly_mv TO stats_hourly
AS SELECT
    toStartOfHour(timestamp) AS hour,
    market_id,
    sumState(amount) AS total_volume,
    countState() AS total_trades
FROM trades
GROUP BY hour, market_id;

-- 3. Query with merge functions
SELECT
    hour,
    market_id,
    sumMerge(total_volume) AS volume,
    countMerge(total_trades) AS trades
FROM stats_hourly
WHERE hour >= now() - INTERVAL 24 HOUR
GROUP BY hour, market_id;
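
For aggregates like sum, SimpleAggregateFunction is a lighter alternative: the column stores a plain value, the materialized view writes ordinary aggregates, and reads use a normal GROUP BY instead of -Merge functions. A sketch using hypothetical stats_hourly_simple names; the trade count is stored as a plain number and re-summed on read:

-- Sketch: SimpleAggregateFunction variant of the same rollup
CREATE TABLE stats_hourly_simple (
    hour DateTime,
    market_id String,
    total_volume SimpleAggregateFunction(sum, UInt64),
    total_trades SimpleAggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, market_id);

CREATE MATERIALIZED VIEW stats_hourly_simple_mv TO stats_hourly_simple
AS SELECT
    toStartOfHour(timestamp) AS hour,
    market_id,
    sum(amount) AS total_volume,
    count() AS total_trades        -- stored as a plain number, merged by sum
FROM trades
GROUP BY hour, market_id;

-- Reads need only plain aggregates, no -Merge functions
SELECT hour, market_id, sum(total_volume) AS volume, sum(total_trades) AS trades
FROM stats_hourly_simple
WHERE hour >= now() - INTERVAL 24 HOUR
GROUP BY hour, market_id;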

Query Optimization

Leverage ORDER BY

-- ✅ GOOD: Filter on ORDER BY columns
SELECT user_id, event_type, timestamp
FROM events
WHERE date >= '2025-01-01'      -- First in ORDER BY
  AND user_id = 'user-123'       -- Second in ORDER BY
LIMIT 100;

-- ❌ BAD: Filter on non-indexed columns
SELECT *
FROM events
WHERE event_type = 'purchase'    -- Not in ORDER BY
  AND properties LIKE '%premium%';

Use ClickHouse Functions

SELECT
    toStartOfDay(timestamp) AS day,
    sum(volume) AS total,
    count() AS trades,              -- Not count(*)
    uniq(user_id) AS users,         -- Not COUNT(DISTINCT)
    quantile(0.95)(size) AS p95     -- Not percentile
FROM trades
WHERE timestamp >= today() - INTERVAL 7 DAY
GROUP BY day
HAVING total > 10000
ORDER BY day DESC;

Data Insertion

Bulk Insert (Required)

import { ClickHouse } from 'clickhouse';

// Client for the 'clickhouse' npm package (HTTP interface)
const clickhouse = new ClickHouse({ url: 'http://localhost', port: 8123 });

interface Trade {
  id: string;
  market_id: string;
  amount: number;
  timestamp: Date;
}

// ✅ GOOD: Batch 1000+ rows per INSERT statement
async function bulkInsert(trades: Trade[]) {
  const BATCH_SIZE = 5000;

  for (let i = 0; i < trades.length; i += BATCH_SIZE) {
    const batch = trades.slice(i, i + BATCH_SIZE);
    // NOTE: string-built VALUES is acceptable for trusted internal data;
    // use parameterized inserts for anything user-supplied.
    const values = batch.map(t =>
      `('${t.id}', '${t.market_id}', ${t.amount}, '${t.timestamp.toISOString()}')`
    ).join(',');

    await clickhouse.query(`
      INSERT INTO trades (id, market_id, amount, timestamp) VALUES ${values}
    `).toPromise();
  }
}

// ❌ BAD: one INSERT per row creates a new part each time ("merge hell")
async function slowInsert(trade: Trade) {
  await clickhouse.query(`INSERT INTO trades VALUES (...)`).toPromise();
}

Why batching matters: Each insert creates a new "part". Too many parts = slow merges = degraded performance.
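
If client-side batching is impractical (for example, many small producers), recent ClickHouse versions can batch on the server with async_insert: rows from small INSERTs are buffered and flushed as a single part. A hedged sketch; the setting names are current ClickHouse settings, but defaults and availability depend on version:

-- Sketch: let the server batch small INSERTs into parts
INSERT INTO trades (id, market_id, amount, timestamp)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('t-1', 'm-1', 100, now());
-- async_insert = 1: buffer rows server-side instead of creating a part per INSERT
-- wait_for_async_insert = 1: acknowledge only after the buffer is flushed (safer, slower)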

Common Mistakes

Mistake                | Why It Fails                 | Fix
SELECT *               | Reads ALL columns            | Specify: SELECT id, name
Small frequent inserts | Too many parts → merge hell  | Batch 1000-5000 rows
Too many partitions    | Scans all partitions         | Monthly/daily, not hourly
Wrong ORDER BY         | Can't use primary index      | Filtered columns FIRST
FINAL everywhere       | Forces merge per query       | Only when truly needed
JOINs on large tables  | Not optimized for joins      | Denormalize or use dictionaries (see sketch below)
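
One way to replace a hot JOIN is a dictionary kept in memory and probed with dictGet. A sketch assuming a hypothetical users source table with user_id (UInt64) and name columns:

-- Sketch: dictionary lookup instead of a JOIN
CREATE DICTIONARY user_names (
    user_id UInt64,
    name String
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(TABLE 'users'))
LIFETIME(MIN 300 MAX 600)      -- refresh every 5-10 minutes
LAYOUT(HASHED());

SELECT
    event_type,
    dictGet('user_names', 'name', toUInt64OrZero(user_id)) AS user_name,
    count() AS events
FROM events
WHERE date = today()
GROUP BY event_type, user_name;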

Performance Monitoring

-- Find slow queries (> 1s in last hour)
SELECT
    query_duration_ms / 1000 AS seconds,
    read_rows,
    formatReadableSize(memory_usage) AS memory,
    substring(query, 1, 100) AS query_preview
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 1000
  AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Check table health
SELECT
    table,
    formatReadableSize(sum(bytes)) AS size,
    sum(rows) AS rows,
    count() AS parts  -- If > 100, investigate merges
FROM system.parts
WHERE active
GROUP BY table
ORDER BY sum(bytes) DESC;

Analytics Patterns

-- Daily Active Users
SELECT
    toDate(timestamp) AS date,
    uniq(user_id) AS dau
FROM events
WHERE timestamp >= today() - INTERVAL 30 DAY
GROUP BY date;

-- Retention Cohort
WITH cohorts AS (
    SELECT user_id, min(toDate(timestamp)) AS signup_date
    FROM events GROUP BY user_id
)
SELECT
    toStartOfMonth(signup_date) AS cohort,
    dateDiff('day', signup_date, toDate(e.timestamp)) AS days,
    uniq(e.user_id) AS active_users
FROM events e
JOIN cohorts c ON e.user_id = c.user_id
WHERE e.timestamp >= today() - INTERVAL 90 DAY
GROUP BY cohort, days;

-- Conversion Funnel
SELECT
    countIf(event_type = 'view') AS step1,
    countIf(event_type = 'cart') AS step2,
    countIf(event_type = 'purchase') AS step3,
    round(step2 / step1 * 100, 2) AS view_to_cart_rate
FROM events
WHERE date = today();
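
ClickHouse also ships a dedicated windowFunnel aggregate that counts how far each user progresses through an ordered sequence inside a time window, something the countIf version above cannot express. A sketch on the same events table:

-- Per-user funnel depth with windowFunnel (1-hour window)
SELECT
    level,
    count() AS users
FROM (
    SELECT
        user_id,
        windowFunnel(3600)(
            timestamp,
            event_type = 'view',
            event_type = 'cart',
            event_type = 'purchase'
        ) AS level
    FROM events
    WHERE date = today()
    GROUP BY user_id
)
GROUP BY level
ORDER BY level;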

Best Practices

Design:

  • Choose engine via flowchart
  • Partition by time (monthly for most)
  • ORDER BY = query filter columns

Query:

  • Specify columns (never SELECT *)
  • Filter on ORDER BY columns
  • Use ClickHouse functions (uniq, quantile)

Insert:

  • Batch 1000-5000 rows minimum
  • Use materialized views for aggregations
  • Monitor parts count

Red Flags

Stop if you're saying:

  • ❌ "I'll add a B-tree index" → ClickHouse uses ORDER BY
  • ❌ "SELECT * then filter in app" → Extremely slow
  • ❌ "Small inserts for prototype" → Technical debt
  • ❌ "JOIN like PostgreSQL" → Denormalize instead
  • ❌ "FINAL everywhere" → Kills performance

When to Escalate

Symptoms:

  • Query > 10s on < 10M rows
  • "Too many parts" despite batching
  • Memory exhaustion (need distributed)
  • Need ACID transactions (wrong tool)

Where to look: the Performance Monitoring queries above (system.query_log for slow queries, system.parts for part counts).

Related Skills

For specialized use cases:

  • Database replication: PostgreSQL/MySQL → ClickHouse sync → See
    clickhouse-cdc
  • Streaming ingestion: Kafka/RabbitMQ → ClickHouse → See
    clickhouse-streaming

Remember: Design for query patterns, batch writes, and leverage materialized views. If you're constantly fighting the tool, reconsider whether OLAP is what you need.