Datastoria diagnose-clickhouse-clusters
Diagnose ClickHouse cluster health and provide concrete remediation.
install
source · Clone the upstream repo
git clone https://github.com/FrankChen021/datastoria
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FrankChen021/datastoria "$T" && mkdir -p ~/.claude/skills && cp -r "$T/resources/skills/diagnose-clickhouse-clusters" ~/.claude/skills/frankchen021-datastoria-diagnose-clickhouse-clusters && rm -rf "$T"
manifest: resources/skills/diagnose-clickhouse-clusters/SKILL.md
Tool Usage Rules
- Call `collect_cluster_status` before health conclusions about current cluster health.
- For RCA questions, call `collect_rca_evidence` directly when the symptom and target are already clear. Use `collect_cluster_status` first only when you need current health context, severity/outliers, or help choosing the RCA symptom/scope.
- Use only supported Phase 1 RCA symptoms: `high_part_count` and `unknown`.
- For bounded-time questions, use `status_analysis_mode="windowed"` and reuse the same time window in follow-up calls.
- If user asks for a chart, use the `visualization` skill. Do not emit chart specs directly from this skill.
- Do not invent custom health-check SQL. Use tool outputs as the source of truth.
Workflow (MANDATORY)
- Determine whether the user asks for status only, or root cause ("why", "root cause", "reason", "caused by", "explain").
- For RCA questions, pick one supported canonical symptom key based on user wording, explicit target details, and, when needed, status findings.
- Explain from tool output only: top candidates, support score, evidence lists, gaps, and prioritized actions.
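The first workflow step (status-only versus root-cause) can be sketched as a simple keyword check. This is only an illustration: the keyword list comes from the workflow text above, but the function name `classify_question` and the whole-word matching strategy are assumptions, not part of the skill.

```python
import re

# Trigger phrases listed in the workflow step above.
RCA_KEYWORDS = ("why", "root cause", "reason", "caused by", "explain")


def classify_question(question: str) -> str:
    """Return "rca" if the question contains an RCA trigger phrase, else "status".

    Hypothetical helper: naive whole-word matching, so inflected forms
    such as "reasons" are not detected.
    """
    text = question.lower()
    for keyword in RCA_KEYWORDS:
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            return "rca"
    return "status"
```

A question such as "Why is replication lagging?" would route to the RCA path, while "Is the cluster healthy?" would stay on the status-only path.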
Severity Thresholds (Guidance)
- CRITICAL: replication lag > 300s, disk usage > 90%
- WARNING: replication lag > 60s, disk usage > 80%
- OK: metrics within normal ranges
Do not hardcode parts thresholds in responses. Use the thresholds and severities returned by `collect_cluster_status`.
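As a minimal sketch, the guidance thresholds for replication lag and disk usage map to a severity as follows. The function name and parameters are hypothetical, and the severities returned by `collect_cluster_status` remain the source of truth; this only illustrates how the guidance values combine.

```python
def classify_severity(replication_lag_s: float, disk_usage_pct: float) -> str:
    """Map the two guidance metrics to OK / WARNING / CRITICAL.

    Hypothetical helper using the guidance thresholds only; the worst
    of the two metrics determines the result.
    """
    if replication_lag_s > 300 or disk_usage_pct > 90:
        return "CRITICAL"
    if replication_lag_s > 60 or disk_usage_pct > 80:
        return "WARNING"
    return "OK"
```

The comparisons are strict, so a lag of exactly 300s is still WARNING and a disk usage of exactly 80% is still OK.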
Output Format (MANDATORY)
Use one of these two formats:
A) Status-only question
- Summary table: Always print a table title line exactly before the table: `### Summary`.

  | Status | Nodes with Issues | Checks Run | Timestamp |
  | --- | --- | --- | --- |
  | 🟢 OK / 🟠 WARNING / 🔴 CRITICAL | N | categories | ISO8601 |

- Findings by category: Always print a table title line exactly before the table: `### Findings by Category`. Use a markdown table (not bullets) with one row per category. Required columns:

  | Category | Status | Key Metrics | Top Outlier / Scope | Notes |
  | --- | --- | --- | --- | --- |
  | parts / errors / replication / ... | 🟢 OK / 🟠 WARNING / 🔴 CRITICAL | concise metric values with thresholds | node/table if present, else `-` | one short phrase |

  Table rules:
  - Include all categories returned by `collect_cluster_status` in stable order.
  - Status must include both emoji and text (for example `🟠 WARNING`), never emoji-only.
  - Markdown table cells do not reliably support line breaks in this UI. Do not try to render multi-line bullets in a cell.
  - In `Key Metrics`, put the 1-2 most important metrics only (single-line, semicolon-separated if needed).
  - Put additional metrics in `Notes` as compact key/value items (single-line).
  - Put numeric values first (for example `max_parts_per_table=533 (>500)`), avoid prose-heavy sentences.
  - Always wrap database/table identifiers in backticks (for example `` `db.table` `` or `` `db` ``) in all table cells.
  - If category has sub-findings (for example top errors), keep them in `Notes` as compact comma-separated items.
  - If no outlier exists, set `Top Outlier / Scope` to `-`.
- Recommendations (max 3 items; each item = title + why + concrete SQL/command if needed).
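The table rules for a findings row can be sketched as a small formatter. The function name, its parameters, and the shape of the input are assumptions for illustration; the real field names come from whatever `collect_cluster_status` returns.

```python
# Status labels always pair the emoji with text, never emoji-only.
STATUS_LABELS = {"OK": "🟢 OK", "WARNING": "🟠 WARNING", "CRITICAL": "🔴 CRITICAL"}


def findings_row(category, status, key_metrics, outlier_table=None, notes=""):
    """Render one single-line markdown row for the Findings by Category table.

    Hypothetical helper: keeps at most two key metrics (semicolon-separated),
    wraps a table identifier in backticks, and uses "-" when no outlier exists.
    """
    metrics = "; ".join(key_metrics[:2])
    scope = f"`{outlier_table}`" if outlier_table else "-"
    return f"| {category} | {STATUS_LABELS[status]} | {metrics} | {scope} | {notes} |"
```

Any metrics beyond the first two would go into `notes` as compact key/value items rather than into the `Key Metrics` cell.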
B) RCA question ("why", "cause", "reason", "explain")
Use compact structure only:
- RCA Verdict: one sentence, max 30 words.
- Top Candidates: markdown table with max 3 rows: `cause | support_score | evidence`. In `evidence`, render up to 3 `evidence_for` items prefixed with `✓` and up to 2 `evidence_against` items prefixed with `✗`, separated by `<br/>`. When `excluded_candidates` is non-empty, include at least one excluded reason as a `✗` item for the most relevant row. Evidence fidelity rules:
  - Use only `candidate.evidence_for` and `candidate.evidence_against` from `collect_rca_evidence` for that row.
  - Do not pull extra lines from top-level `observations`, other candidates, or status output into the evidence cell.
  - Do not restate raw metrics unless they already appear inside `candidate.evidence_for` or `candidate.evidence_against`.
  - Preserve the candidate/tool counts: if helpful, you may mention `indicators_matched/indicators_checked`, but never imply more matched checks than the tool returned.
- Possible Actions: max 3 numbered items, sorted by impact.
  Formatting rule: print the line `3. **Possible Actions**`, then a blank line, then an indented nested numbered list using exactly `1.`, `2.`, `3.`. Do not continue the outer top-level numbering for action items.
- Gaps / Next Checks: max 2 bullets.
  Formatting rule: print the line `4. **Gaps / Next Checks**`, then a blank line, then indented bullets using exactly `-`.
RCA brevity limits:
- Keep total RCA response under 220 words (excluding SQL command blocks).
- Do not add long background/theory paragraphs.
- Use direct statements and numeric evidence.
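The evidence cell for one Top Candidates row can be sketched as below. The dictionary keys `evidence_for` and `evidence_against` follow the field names in the rules above; the function name and the assumption that each field is a list of strings are illustrative only.

```python
def evidence_cell(candidate: dict) -> str:
    """Build the evidence cell for a single candidate row.

    Hypothetical helper: uses only this candidate's own evidence lists,
    capped at 3 supporting and 2 opposing items, joined with <br/>.
    """
    lines = [f"✓ {item}" for item in candidate.get("evidence_for", [])[:3]]
    lines += [f"✗ {item}" for item in candidate.get("evidence_against", [])[:2]]
    return "<br/>".join(lines)
```

Because the function reads nothing but the candidate passed in, evidence from other candidates or from status output cannot leak into the cell.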
Critical Rules
- ALWAYS call `collect_cluster_status` before giving any opinion on current health.
- Use `status_analysis_mode="windowed"` when user asks for a bounded time window or historical context.
- For RCA questions, MUST call `collect_rca_evidence`. `collect_cluster_status` is optional unless current health context is needed.
- Do NOT state root causes without RCA evidence output.
- If `gaps[]` is non-empty, explicitly state what evidence is missing.
- If all candidates have `support_score < 0.3`, state that the RCA is inconclusive and use candidate `next_checks` plus `gaps` to explain what to inspect next.
- If best candidate is weak (`0.30-0.39`), present it as a possibility with caveats and emphasize candidate `next_checks`.
- Never fabricate or merge evidence lines across candidates. Candidate rows must be traceable directly to that candidate's `evidence_for` and `evidence_against`.
- If `collect_rca_evidence.related_symptoms` is non-empty, include a line `Related symptoms:` and list them.
- When follow-up questions omit time range, reuse the most recent explicit time window/range from prior turns.
- Never assume schema or table names; use only what tools return.
- Do not invent custom health-check SQL; use tool outputs as source of truth.
- Be concise and focus on remediation, not theory.
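The support-score rules above can be summarised in a short sketch. The function name and the returned labels are hypothetical; the rules only define the "inconclusive" (all below 0.3) and "weak" (0.30-0.39) bands, so treating 0.40 and above as "supported" and an empty candidate list as "inconclusive" are assumptions.

```python
def verdict_confidence(candidates: list) -> str:
    """Classify how firmly an RCA verdict may be stated.

    Hypothetical helper: each candidate is assumed to be a dict with a
    numeric "support_score" field.
    """
    if not candidates:
        return "inconclusive"  # assumption: no candidates means no verdict
    best = max(c["support_score"] for c in candidates)
    if best < 0.30:
        return "inconclusive"  # explain via next_checks and gaps
    if best < 0.40:
        return "weak"  # present as a possibility with caveats
    return "supported"  # assumption: band not defined in the rules
```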