Datastoria sql-expert
Expert system for generating, validating, and optimizing ClickHouse SQL. Use this when the user needs data, queries, or analysis.
install
source · Clone the upstream repo

```bash
git clone https://github.com/FrankChen021/datastoria
```

Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/FrankChen021/datastoria "$T" && mkdir -p ~/.claude/skills && cp -r "$T/resources/skills/sql-expert" ~/.claude/skills/frankchen021-datastoria-sql-expert && rm -rf "$T"
```
manifest: resources/skills/sql-expert/SKILL.md
🚨 CRITICAL RULE: MANDATORY VALIDATION
You MUST call `validate_sql(sql)` for every new query you generate. Context Note: Historical validation steps are pruned to save tokens, but this does NOT excuse you from validating new queries in the current turn. Always validate before executing.
1. Schema Discovery & Context
- Missing Schema: If you do not have the table schema, you MUST use `get_tables` and `explore_schema` first.
- Optimization: If the user already mentioned exact field names, pass them in the `columns` argument of `explore_schema` instead of loading the full table schema.
- Exact Identifier Rule: Treat exact identifier-like tokens from the user's question as candidate columns on the first schema lookup. This especially applies to ClickHouse metric names such as `ProfileEvent_*`, `CurrentMetric_*`, and flattened event columns on `system.*` tables.
- Missing Columns: If you don't see the expected column, retry `explore_schema` with a narrower `columns` list based on the user-mentioned identifier or the closest confirmed column names.
- Schema Fidelity: Only use columns confirmed to exist in the table schema returned by `explore_schema`. Do not assume standard columns exist if they are not in the tool output.
- User Context: If the user asks about "my data", use `WHERE user = '<clickHouseUser>'` (see the sketch after this list).
- System Tables: For queries on `system.*` tables (e.g., `system.query_log`, `system.parts`, `system.merges`), defer to the `clickhouse-system-queries` skill - it contains table-specific patterns, predicates, and resource metrics that this skill does not cover. For `system.query_log`, do not generate SQL until `references/system-query-log.md` has been loaded via `skill_resource`, and do not call `search_query_log` for chart/time-series requests.
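To make the User Context rule concrete, here is a minimal sketch; the table `my_db.events` and its columns are hypothetical stand-ins for whatever `explore_schema` confirms, and `<clickHouseUser>` is the placeholder for the authenticated ClickHouse user:

```sql
-- Hypothetical table and columns; substitute the schema confirmed via explore_schema.
-- '<clickHouseUser>' is a placeholder for the authenticated ClickHouse user.
SELECT event_date, event_type, count() AS events
FROM my_db.events
WHERE user = '<clickHouseUser>'
  AND event_date >= today() - 7
GROUP BY event_date, event_type
ORDER BY event_date
LIMIT 100
```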
2. Syntax Rules (The Grammar)
- Tables: ALWAYS use fully qualified names (e.g., `database.table`).
- Semicolons: NEVER include a trailing semicolon (`;`).
- Enums: Use exact string literals for Enum columns.
- Safety: ALWAYS use `LIMIT` for data exploration queries.
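A minimal exploration query that satisfies all four rules at once (the table `my_db.orders` and its `status` Enum are hypothetical examples, not part of any real schema):

```sql
-- Fully qualified table name, exact Enum literal, LIMIT, and no trailing semicolon.
SELECT order_id, status, created_at
FROM my_db.orders
WHERE status = 'shipped'
ORDER BY created_at DESC
LIMIT 50
```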
3. Optimization Rules (Best Practices)
- Time Filters: Always filter by the partition key (usually `event_date` or `timestamp`) first. Use bounded time windows (e.g., last 24h, 7 days) unless the user asks for all history.
- Primary Keys (CRITICAL): ClickHouse indexes are sparse. You MUST filter on the leading column of the Primary Key if you filter on any secondary column.
  - Bad: `WHERE event_time > now() - 1h` (if the PK is `event_date, event_time`, this scans everything).
  - Good: `WHERE event_date >= toDate(now() - 1h) AND event_time > now() - 1h` (uses the index and handles midnight crossover).
- Approximation: Use `uniq()` instead of `uniqExact()` unless precision is explicitly requested.
- Joins: Put the smaller table on the RIGHT. Use `GLOBAL IN` only for distributed queries.
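Putting the time-filter, primary-key, and approximation rules together, here is a sketch of a bounded aggregation, assuming a hypothetical table `my_db.page_views` whose primary key is `(event_date, event_time)`:

```sql
-- Hypothetical table, assuming ORDER BY (event_date, event_time).
-- The leading PK column (event_date) is filtered first so the sparse index
-- can prune granules; uniq() returns an approximate distinct count.
SELECT
    toStartOfHour(event_time) AS hour,
    uniq(user_id) AS approx_users
FROM my_db.page_views
WHERE event_date >= toDate(now() - INTERVAL 24 HOUR)
  AND event_time >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour
LIMIT 100
```

The `event_date` predicate looks redundant next to the `event_time` filter, but it is what lets the sparse primary index prune whole parts rather than scanning everything.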
4. Execution Workflow
- Generate: Create the SQL following the rules above.
- Validate (MANDATORY): Call `validate_sql(sql)`.
  - If invalid: Read the error, fix the SQL, and retry (max 3 attempts).
- Decide Action:
  - Visualization: IF the user wants a chart, DO NOT execute. Pass the SQL to the visualization skill logic.
  - Data: IF the user wants answers (lists, counts), call `execute_sql(sql)`.
  - Code Only: IF the user asks to "write SQL", just output the code block.