Claude-code-plugins-plus-skills clickhouse-incident-runbook
Install

Clone the upstream repo:

```bash
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
```

Or install into `~/.claude/skills/` for Claude Code:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/clickhouse-pack/skills/clickhouse-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-clickhouse-incident-runbook && rm -rf "$T"
```

Manifest: plugins/saas-packs/clickhouse-pack/skills/clickhouse-incident-runbook/SKILL.md
ClickHouse Incident Runbook
Overview
Step-by-step procedures for triaging and resolving ClickHouse incidents using built-in system tables and SQL commands.
Severity Levels
| Level | Definition | Response | Examples |
|---|---|---|---|
| P1 | ClickHouse unreachable / all queries failing | < 15 min | Server down, OOM, disk full |
| P2 | Degraded performance / partial failures | < 1 hour | Slow queries, merge backlog |
| P3 | Minor impact / non-critical errors | < 4 hours | Single table issue, warnings |
| P4 | No user impact | Next business day | Monitoring gaps, optimization |
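The severity table above can be kept alongside alerting code as data; a minimal sketch (the helper name and the choice of minutes for P4 are assumptions, not part of any ClickHouse tooling):

```python
# Response SLAs from the severity table, expressed as minutes so paging
# code can compute deadlines. P4's "next business day" is approximated
# here as 24 hours; adjust to your on-call policy.
RESPONSE_SLA_MINUTES = {
    "P1": 15,       # ClickHouse unreachable / all queries failing
    "P2": 60,       # degraded performance / partial failures
    "P3": 240,      # minor impact / non-critical errors
    "P4": 24 * 60,  # no user impact; next business day
}

def response_deadline_minutes(severity: str) -> int:
    """Return the response SLA in minutes for a severity level."""
    return RESPONSE_SLA_MINUTES[severity]
```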
Quick Triage (Run First)
```bash
# 1. Is ClickHouse alive?
curl -sf 'http://localhost:8123/ping' && echo "UP" || echo "DOWN"

# 2. Can it answer a query?
curl -sf 'http://localhost:8123/?query=SELECT+1' && echo "OK" || echo "QUERY FAILED"

# 3. Check ClickHouse Cloud status
curl -sf 'https://status.clickhouse.cloud' | head -5
```
```sql
-- 4. Server health snapshot (run if server responds)
SELECT
    version() AS version,
    formatReadableTimeDelta(uptime()) AS uptime,
    (SELECT count() FROM system.processes) AS running_queries,
    (SELECT value FROM system.metrics WHERE metric = 'MemoryTracking') AS memory_bytes,
    (SELECT count() FROM system.merges) AS active_merges;

-- 5. Recent errors
SELECT
    event_time,
    exception_code,
    exception,
    substring(query, 1, 200) AS q
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
  AND event_time >= now() - INTERVAL 10 MINUTE
ORDER BY event_time DESC
LIMIT 10;
```
Decision Tree
```
Server responds to ping?
├─ NO → Check process/container status, disk space, OOM killer logs
│   ├─ Container/process dead → Restart, check logs
│   ├─ Disk full → Emergency: drop old partitions, expand disk
│   └─ OOM killed → Reduce max_memory_usage, add RAM
└─ YES → Queries succeeding?
    ├─ NO → Check error codes below
    │   ├─ Auth errors (516) → Verify credentials, check user exists
    │   ├─ Too many queries (202) → Kill stuck queries, reduce concurrency
    │   └─ Memory exceeded (241) → Kill large queries, reduce max_threads
    └─ YES but slow → Performance triage below
```
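The decision tree above can be sketched as a pure classifier that maps the quick-triage results onto a branch. This is an illustrative sketch; the function name and return strings are assumptions, not part of ClickHouse:

```python
def classify_incident(ping_ok: bool, query_ok: bool, queries_slow: bool) -> str:
    """Map quick-triage results (ping, SELECT 1, latency) to a triage branch."""
    if not ping_ok:
        # Server unreachable: check process, disk, and OOM killer logs
        return "P1: server down"
    if not query_ok:
        # Server up but queries failing: inspect error codes (516/202/241)
        return "P1/P2: queries failing"
    if queries_slow:
        # Responding but degraded: run the performance triage sections
        return "P2: degraded performance"
    return "healthy"
```

Feeding this the results of steps 1 and 2 from the quick triage tells you which remediation section to open first.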
Remediation Procedures
P1: Server Down / OOM
```bash
# Check if process was OOM-killed
dmesg | grep -i "out of memory" | tail -5
journalctl -u clickhouse-server --since "10 minutes ago" | tail -20

# Restart
sudo systemctl restart clickhouse-server
# or for Docker:
docker restart clickhouse

# Verify recovery
curl 'http://localhost:8123/?query=SELECT+version()'
```
P1: Disk Full
```sql
-- Find largest tables
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    sum(rows) AS rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 10;

-- Emergency: drop old partitions
ALTER TABLE analytics.events DROP PARTITION '202301';
ALTER TABLE analytics.events DROP PARTITION '202302';

-- Check free space
SELECT
    name,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total
FROM system.disks;
```
P2: Stuck / Long-Running Queries
```sql
-- Find stuck queries
SELECT
    query_id,
    user,
    elapsed,
    formatReadableSize(memory_usage) AS memory,
    substring(query, 1, 200) AS query_preview
FROM system.processes
ORDER BY elapsed DESC;

-- Kill a specific query
KILL QUERY WHERE query_id = 'abc-123-def';

-- Kill all queries from a user
KILL QUERY WHERE user = 'runaway_user';

-- Kill all queries running longer than 5 minutes
KILL QUERY WHERE elapsed > 300;
```
P2: Too Many Parts (Merge Backlog)
```sql
-- Check part counts
SELECT database, table, count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table
HAVING parts > 200
ORDER BY parts DESC;

-- Check active merges
SELECT
    database,
    table,
    progress,
    elapsed,
    formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges;

-- Temporary: raise the limit to prevent INSERT failures
ALTER TABLE analytics.events MODIFY SETTING parts_to_throw_insert = 1000;
-- Wait for merges to catch up, then lower back
-- Root cause: too many small inserts — batch them
```
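The root-cause fix (batching inserts) can be sketched as a client-side buffer that flushes rows in large batches, producing one part per batch instead of one per row. A minimal sketch, assuming the `analytics.events` table from above and the default HTTP endpoint; names and batch size are illustrative:

```python
import json
import urllib.request

def chunks(rows, size):
    """Split rows into batches of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def insert_batched(rows, size=10000, url="http://localhost:8123"):
    """Send each batch as a single INSERT via the ClickHouse HTTP interface.

    One INSERT per batch creates one part per batch, instead of one part
    per row, which is what causes the merge backlog.
    """
    query = "INSERT%20INTO%20analytics.events%20FORMAT%20JSONEachRow"
    for batch in chunks(rows, size):
        body = "\n".join(json.dumps(r) for r in batch).encode()
        req = urllib.request.Request(f"{url}/?query={query}", data=body)
        urllib.request.urlopen(req)
```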
P2: Memory Pressure
```sql
-- Who's using the most memory?
SELECT
    user,
    query_id,
    elapsed,
    formatReadableSize(memory_usage) AS memory,
    substring(query, 1, 200) AS q
FROM system.processes
ORDER BY memory_usage DESC;

-- Kill the largest query
KILL QUERY WHERE query_id = '<largest_query_id>';

-- Reduce per-query memory for a specific user
ALTER USER app_writer SETTINGS max_memory_usage = 5000000000; -- 5 GB
```
P3: Replication Lag (Clustered/Cloud)
```sql
-- Check replica status
SELECT
    database,
    table,
    is_leader,
    total_replicas,
    active_replicas,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    log_pointer,
    last_queue_update
FROM system.replicas
WHERE active_replicas < total_replicas OR queue_size > 0;
```
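If you export the query above with `FORMAT JSONEachRow`, a small script can flag unhealthy replicas for alerting. A sketch under those assumptions; the queue-size threshold is an illustrative default to tune per deployment:

```python
def lagging_replicas(rows, max_queue=100):
    """Return (database, table) pairs whose replicas look unhealthy.

    `rows` is a list of dicts with the system.replicas columns selected
    in the replica-status query (active_replicas, total_replicas,
    queue_size, database, table).
    """
    flagged = []
    for r in rows:
        missing = r["active_replicas"] < r["total_replicas"]   # replica offline
        backlog = r["queue_size"] > max_queue                  # queue not draining
        if missing or backlog:
            flagged.append((r["database"], r["table"]))
    return flagged
```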
Post-Incident Evidence Collection
```sql
-- Export error window from query log
SELECT *
FROM system.query_log
WHERE event_time BETWEEN '2025-01-15 14:00:00' AND '2025-01-15 15:00:00'
  AND (type = 'ExceptionWhileProcessing' OR query_duration_ms > 10000)
INTO OUTFILE '/tmp/incident-queries.json'
FORMAT JSONEachRow;

-- Metrics snapshot during incident window
SELECT metric, value
FROM system.metrics
INTO OUTFILE '/tmp/incident-metrics.tsv'
FORMAT TabSeparatedWithNames;
```
Communication Templates
Internal (Slack):
```
[P1] INCIDENT: ClickHouse [Issue Type]
Status: INVESTIGATING / MITIGATING / RESOLVED
Impact: [What users see]
Root cause: [If known]
Actions taken: [What you did]
Next update: [Time]
Commander: @[name]
```
Postmortem Template:
```markdown
## ClickHouse Incident: [Title]

- Date: YYYY-MM-DD
- Duration: X hours Y minutes
- Severity: P[1-4]

### Timeline
- HH:MM — [Event/action]

### Root Cause
[Technical explanation]

### Resolution
[What fixed it]

### Action Items
- [ ] [Preventive measure] — Owner — Due date
```
Error Handling
| Symptom | Likely Cause | First Action |
|---|---|---|
| All queries fail | Server down | Check process, restart |
| Inserts fail | Too many parts | Wait for merges to catch up, raise limit |
| Selects slow | Memory pressure | Kill large queries, add filters |
| Disk alerts | No TTL / no cleanup | Drop old partitions |
| Replication lag | Network / merge backlog | Check `system.replicas` |
Next Steps
For data compliance, see clickhouse-data-handling.