Claude-code-plugins-plus-skills clickhouse-prod-checklist
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/clickhouse-pack/skills/clickhouse-prod-checklist" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-clickhouse-prod-checklist && rm -rf "$T"
manifest:
plugins/saas-packs/clickhouse-pack/skills/clickhouse-prod-checklist/SKILL.mdsource content
ClickHouse Production Checklist
Overview
Comprehensive go-live checklist for ClickHouse covering server tuning, schema design, backup configuration, monitoring, and operational readiness.
Prerequisites
- ClickHouse instance provisioned (Cloud or self-hosted)
- Application integration code tested in staging
Checklist
1. Schema & Engine Design
- Tables use
family engines (notMergeTree
,Memory
, orLog
)TinyLog -
columns match primary filter/group patternsORDER BY -
is coarse (monthly or weekly, never by ID)PARTITION BY -
configured for data retention policyTTL -
used for low-cardinality columnsLowCardinality(String) -
applied to large String/JSON columnsCODEC(ZSTD) - ReplacingMergeTree used with
or dedup logic if upserts neededFINAL
2. Server Configuration (Self-Hosted)
<!-- Key production settings in config.xml / users.xml --> <!-- Memory: set to ~80% of available RAM --> <max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio> <!-- Query limits --> <max_concurrent_queries>150</max_concurrent_queries> <max_memory_usage>10000000000</max_memory_usage> <!-- 10GB per query --> <max_execution_time>300</max_execution_time> <!-- 5 min timeout --> <!-- Merge settings --> <background_pool_size>16</background_pool_size> <background_schedule_pool_size>16</background_schedule_pool_size> <!-- Logging --> <query_log> <database>system</database> <table>query_log</table> <flush_interval_milliseconds>7500</flush_interval_milliseconds> </query_log>
3. Backup Configuration
-- ClickHouse native BACKUP to S3 BACKUP TABLE analytics.events TO S3( 'https://my-bucket.s3.us-east-1.amazonaws.com/backups/events', 'ACCESS_KEY', 'SECRET_KEY' ) SETTINGS compression_method = 'zstd'; -- Incremental backup (base + delta) BACKUP TABLE analytics.events TO S3('s3://my-bucket/backups/events-incremental') SETTINGS base_backup = S3('s3://my-bucket/backups/events-base');
ClickHouse Cloud: Backups are automatic. Configure retention and frequency in the Cloud console under Service Settings.
- Backup schedule configured (daily minimum)
- Backup restore tested and documented
- Point-in-time recovery possible (incremental backups)
- Backup stored in different region/account from primary
4. Monitoring & Alerting
-- Key metrics to monitor -- 1. Query latency (p50/p95/p99) -- 2. Active parts per table (alert > 300) -- 3. Merge queue depth -- 4. Memory usage -- 5. Disk space free -- 6. Replication lag (if clustered) -- Health check query (use for load balancer probes) SELECT 1; -- More thorough health check SELECT (SELECT count() FROM system.processes) AS running_queries, (SELECT sum(value) FROM system.metrics WHERE metric = 'MemoryTracking') AS memory_bytes, (SELECT count() FROM system.merges) AS active_merges;
- Prometheus endpoint configured (
via ClickHouse Exporter or Cloud endpoint)/metrics - Grafana dashboard with key panels (QPS, latency, memory, parts, merges)
- Alerts on: error rate > 5%, p95 latency > 5s, parts > 300, disk < 20%
- Query log monitoring for slow/failed queries
5. Security
- Default user password changed or user disabled
- Application user has minimal privileges (see
)clickhouse-security-basics - TLS enabled (HTTPS on port 8443)
- IP allowlist configured
- Secrets in environment variables or secret manager (not code)
6. Application Integration
- Connection pooling configured (
)max_open_connections - Graceful shutdown calls
client.close() - Insert batching in place (10K+ rows per INSERT)
- Retry logic for transient errors (see
)clickhouse-rate-limits - Health check endpoint includes ClickHouse ping
// Health check endpoint app.get('/health', async (req, res) => { try { const { success } = await client.ping(); res.json({ status: success ? 'healthy' : 'degraded', clickhouse: success }); } catch { res.status(503).json({ status: 'unhealthy', clickhouse: false }); } });
7. Operational Readiness
- Incident runbook documented (see
)clickhouse-incident-runbook - On-call escalation path defined
- Key rotation procedure documented
- Schema migration process in place (see
)clickhouse-migration-deep-dive - Load testing completed at expected peak traffic
8. Verification Queries
-- Verify table sizes and part health SELECT database, table, count() AS parts, sum(rows) AS rows, formatReadableSize(sum(bytes_on_disk)) AS size FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC; -- Verify no pending mutations SELECT * FROM system.mutations WHERE NOT is_done; -- Verify no replication lag SELECT database, table, queue_size, inserts_in_queue FROM system.replicas WHERE queue_size > 0;
Error Handling
| Issue | Detection | Action |
|---|---|---|
| Parts > 300 | Monitoring alert | Review insert patterns, wait for merges |
| Disk > 80% | Disk alert | Add storage, drop old partitions |
| Query p95 > 5s | Latency alert | Check for slow queries |
| Replication lag | Replica check | Investigate network, merge backlog |
Resources
Next Steps
For SDK version upgrades, see
clickhouse-upgrade-migration.