Datastoria diagnose-clickhouse-clusters

Diagnose ClickHouse cluster health and provide concrete remediation.

install
source · Clone the upstream repo
git clone https://github.com/FrankChen021/datastoria
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FrankChen021/datastoria "$T" && mkdir -p ~/.claude/skills && cp -r "$T/resources/skills/diagnose-clickhouse-clusters" ~/.claude/skills/frankchen021-datastoria-diagnose-clickhouse-clusters && rm -rf "$T"
manifest: resources/skills/diagnose-clickhouse-clusters/SKILL.md
source content

Tool Usage Rules

  • Call
    collect_cluster_status
    before health conclusions about current cluster health.
  • For RCA questions, call
    collect_rca_evidence
    directly when the symptom and target are already clear. Use
    collect_cluster_status
    first only when you need current health context, severity/outliers, or help choosing the RCA symptom/scope.
  • Use only supported Phase 1 RCA symptoms:
    high_part_count
    and
    unknown
    .
  • For bounded-time questions, use
    status_analysis_mode="windowed"
    and reuse the same time window in follow-up calls.
  • If user asks for a chart, use the
    visualization
    skill. Do not emit chart specs directly from this skill.
  • Do not invent custom health-check SQL. Use tool outputs as the source of truth.

Workflow (MANDATORY)

  1. Determine whether the user asks for status only, or root cause ("why", "root cause", "reason", "caused by", "explain").
  2. For RCA questions, pick one supported canonical symptom key based on user wording, explicit target details, and, when needed, status findings.
  3. Explain from tool output only: top candidates, support score, evidence lists, gaps, and prioritized actions.

Severity Thresholds (Guidance)

  • CRITICAL: replication lag > 300s, disk usage > 90%
  • WARNING: replication lag > 60s, disk usage > 80%
  • OK: metrics within normal ranges

Do not hardcode parts thresholds in responses. Use the thresholds and severities returned by

collect_cluster_status
.

Output Format (MANDATORY)

Use one of these two formats:

A) Status-only question

  1. Summary table: Always print a table title line exactly before the table:

    ### Summary
    .

    StatusNodes with IssuesChecks RunTimestamp
    🟢 OK / 🟠 WARNING / 🔴 CRITICALNcategoriesISO8601
  2. Findings by category: Always print a table title line exactly before the table:

    ### Findings by Category
    . Use a markdown table (not bullets) with one row per category. Required columns:

    CategoryStatusKey MetricsTop Outlier / ScopeNotes
    parts / errors / replication / ...🟢 OK / 🟠 WARNING / 🔴 CRITICALconcise metric values with thresholdsnode/table if present, else
    -
    one short phrase

    Table rules:

    • Include all categories returned by
      collect_cluster_status
      in stable order.
    • Status must include both emoji and text (for example
      🟠 WARNING
      ), never emoji-only.
    • Markdown table cells do not reliably support line breaks in this UI. Do not try to render multi-line bullets in a cell.
    • In
      Key Metrics
      , put the 1-2 most important metrics only (single-line, semicolon-separated if needed).
    • Put additional metrics in
      Notes
      as compact key/value items (single-line).
    • Put numeric values first (for example
      max_parts_per_table=533 (>500)
      ), avoid prose-heavy sentences.
    • Always wrap database/table identifiers in backticks (for example
      `db.table`
      or
      `db`
      ) in all table cells.
    • If category has sub-findings (for example top errors), keep them in
      Notes
      as compact comma-separated items.
    • If no outlier exists, set
      Top Outlier / Scope
      to
      -
      .
  3. Recommendations (max 3 items; each item = title + why + concrete SQL/command if needed).

B) RCA question ("why", "cause", "reason", "explain")

Use compact structure only:

  1. RCA Verdict: one sentence, max 30 words.
  2. Top Candidates: markdown table with max 3 rows:
    cause | support_score | evidence
    . In
    evidence
    , render up to 3
    evidence_for
    items prefixed with
    and up to 2
    evidence_against
    items prefixed with
    , separated by
    <br/>
    . When
    excluded_candidates
    is non-empty, include at least one excluded reason as a
    item for the most relevant row. Evidence fidelity rules:
    • Use only
      candidate.evidence_for
      and
      candidate.evidence_against
      from
      collect_rca_evidence
      for that row.
    • Do not pull extra lines from top-level
      observations
      , other candidates, or status output into the evidence cell.
    • Do not restate raw metrics unless they already appear inside
      candidate.evidence_for
      or
      candidate.evidence_against
      .
    • Preserve the candidate/tool counts: if helpful, you may mention
      indicators_matched/indicators_checked
      , but never imply more matched checks than the tool returned.
  3. Possible Actions: max 3 numbered items, sorted by impact. Formatting rule: print the line
    3. **Possible Actions**
    , then a blank line, then an indented nested numbered list using exactly
       1.
    ,
       2.
    ,
       3.
    . Do not continue the outer top-level numbering for action items.
  4. Gaps / Next Checks: max 2 bullets. Formatting rule: print the line
    4. **Gaps / Next Checks**
    , then a blank line, then indented bullets using exactly
       -
    .

RCA brevity limits:

  • Keep total RCA response under 220 words (excluding SQL command blocks).
  • Do not add long background/theory paragraphs.
  • Use direct statements and numeric evidence.

Critical Rules

  • ALWAYS call
    collect_cluster_status
    before giving any opinion on current health.
  • Use
    status_analysis_mode="windowed"
    when user asks for a bounded time window or historical context.
  • For RCA questions, MUST call
    collect_rca_evidence
    .
    collect_cluster_status
    is optional unless current health context is needed.
  • Do NOT state root causes without RCA evidence output.
  • If
    gaps[]
    is non-empty, explicitly state what evidence is missing.
  • If all candidates have
    support_score < 0.3
    , state that the RCA is inconclusive and use candidate
    next_checks
    plus
    gaps
    to explain what to inspect next.
  • If best candidate is weak (
    0.30-0.39
    ), present it as a possibility with caveats and emphasize candidate
    next_checks
    .
  • Never fabricate or merge evidence lines across candidates. Candidate rows must be traceable directly to that candidate's
    evidence_for
    and
    evidence_against
    .
  • If
    collect_rca_evidence.related_symptoms
    is non-empty, include a line
    Related symptoms:
    and list them.
  • When follow-up questions omit time range, reuse the most recent explicit time window/range from prior turns.
  • Never assume schema or table names; use only what tools return.
  • Do not invent custom health-check SQL; use tool outputs as source of truth.
  • Be concise and focus on remediation, not theory.