Data-engineering-skills optimizing-query-by-id
Install

Source · Clone the upstream repo:

    git clone https://github.com/AltimateAI/data-engineering-skills

Claude Code · Install into ~/.claude/skills/:

    T=$(mktemp -d) && git clone --depth=1 https://github.com/AltimateAI/data-engineering-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/snowflake/optimizing-query-by-id" ~/.claude/skills/altimateai-data-engineering-skills-optimizing-query-by-id && rm -rf "$T"

Manifest: skills/snowflake/optimizing-query-by-id/SKILL.md
Optimize Query from Query ID
Fetch query → Get profile → Apply best practices → Verify improvement → Return optimized query
Workflow
1. Fetch Query Details from Query ID
    SELECT query_id,
           query_text,
           total_elapsed_time / 1000 AS seconds,
           bytes_scanned / 1e9 AS gb_scanned,
           bytes_spilled_to_local_storage / 1e9 AS gb_spilled_local,
           bytes_spilled_to_remote_storage / 1e9 AS gb_spilled_remote,
           partitions_scanned,
           partitions_total,
           rows_produced
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
    WHERE query_id = '<query_id>';
Note the key metrics:
- seconds: Total execution time
- gb_scanned: Data read (lower is better)
- gb_spilled_local / gb_spilled_remote: Spillage indicates memory pressure
- partitions_scanned / partitions_total: Partition pruning effectiveness
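Once a row like this has been fetched (for example via the Snowflake Python connector), the workflow's key signals can be derived mechanically. A minimal sketch, assuming the row is a dict keyed by the column aliases above (`summarize_metrics` is a hypothetical helper, not part of any API):

```python
def summarize_metrics(row):
    """Derive the workflow's key signals from a QUERY_HISTORY metrics row.

    `row` is assumed to be a dict keyed by the column aliases in the
    query above (seconds, gb_scanned, gb_spilled_*, partitions_*).
    """
    return {
        "seconds": row["seconds"],
        "gb_scanned": row["gb_scanned"],
        # 1.0 means every partition was scanned, i.e. no pruning at all
        "pruning_ratio": row["partitions_scanned"] / row["partitions_total"],
        # Any spillage at all indicates memory pressure
        "spilled_gb": row["gb_spilled_local"] + row["gb_spilled_remote"],
    }

row = {"seconds": 45, "gb_scanned": 12.3,
       "gb_spilled_local": 2.1, "gb_spilled_remote": 0.0,
       "partitions_scanned": 500, "partitions_total": 500}
print(summarize_metrics(row))
```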
2. Get Query Profile Details
    -- Get operator-level statistics
    SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS('<query_id>'));
Look for:
- Operators with high output_rows vs input_rows (explosions)
- TableScan operators with high bytes
- Sort/Aggregate operators with spillage
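These checks can be sketched in Python. Note that GET_QUERY_OPERATOR_STATS actually returns columns such as OPERATOR_TYPE with nested JSON statistics; the flat dicts and the 10x row-explosion cutoff below are simplifying assumptions for illustration:

```python
def flag_operators(stats):
    """Flag suspicious operators in simplified operator-stats rows."""
    findings = []
    for op in stats:
        kind = op["operator_type"]
        in_rows = op.get("input_rows", 0)
        # Row explosion: output far exceeds input (10x is an arbitrary cutoff)
        if in_rows and op.get("output_rows", 0) > 10 * in_rows:
            findings.append((kind, "row explosion"))
        # Any spillage indicates the operator ran out of memory
        if op.get("bytes_spilled", 0) > 0:
            findings.append((kind, "spillage"))
        # Large scans: a TableScan reading more than ~1 GB
        if kind == "TableScan" and op.get("bytes_scanned", 0) > 1e9:
            findings.append((kind, "large scan"))
    return findings

stats = [
    {"operator_type": "Join", "input_rows": 100, "output_rows": 5_000},
    {"operator_type": "Sort", "input_rows": 5_000, "output_rows": 5_000,
     "bytes_spilled": 2_000_000},
    {"operator_type": "TableScan", "output_rows": 1_000_000,
     "bytes_scanned": 5e9},
]
print(flag_operators(stats))
```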
3. Identify Optimization Opportunities
Based on profile, look for:
| Metric | Issue | Fix |
|---|---|---|
| partitions_scanned = partitions_total | No pruning | Add filter on cluster key |
| gb_spilled > 0 | Memory pressure | Simplify query, increase warehouse |
| High bytes_scanned | Full scan | Add selective filters, reduce columns |
| Join explosion | Cartesian or bad key | Fix join condition, filter before join |
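The table's metric-based rules can be expressed as a small diagnostic function. A sketch assuming a flat metrics dict like step 1's output; the 10 GB "high scan" cutoff is an arbitrary illustration, and join explosions are omitted because they are detected from the operator stats in step 2:

```python
def diagnose(m):
    """Map query metrics to (issue, fix) pairs per the table above."""
    issues = []
    if m["partitions_scanned"] == m["partitions_total"]:
        issues.append(("No pruning", "Add filter on cluster key"))
    if m["gb_spilled"] > 0:
        issues.append(("Memory pressure", "Simplify query, increase warehouse"))
    if m["gb_scanned"] > 10:  # assumed "high" threshold, in GB
        issues.append(("Full scan", "Add selective filters, reduce columns"))
    return issues

m = {"partitions_scanned": 500, "partitions_total": 500,
     "gb_spilled": 2.1, "gb_scanned": 12.3}
print(diagnose(m))
```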
4. Apply Optimizations
Rewrite the query:
- Select only needed columns
- Filter early (before joins)
- Use CTEs to avoid repeated scans
- Ensure filters align with clustering keys
- Add LIMIT if full result not needed
5. Get Explain Plan for Optimized Query
    EXPLAIN USING JSON <optimized_query>;
6. Compare Plans
Compare original vs optimized:
- Fewer partitions scanned?
- Fewer intermediate rows?
- Better join order?
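One way to make the comparison concrete is to reduce each plan to a summary dict and diff it; a sketch with illustrative key names (these are not Snowflake's EXPLAIN JSON field names):

```python
def plan_deltas(before, after):
    """Per-metric reduction between two plan summaries (positive = better)."""
    return {k: before[k] - after[k] for k in before}

def improved(before, after):
    """True if no metric got worse and at least one got better."""
    deltas = plan_deltas(before, after).values()
    return all(d >= 0 for d in deltas) and any(d > 0 for d in deltas)

original  = {"partitions_scanned": 500, "intermediate_rows": 2_000_000}
optimized = {"partitions_scanned": 15, "intermediate_rows": 40_000}
print(plan_deltas(original, optimized))
print(improved(original, optimized))
```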
7. Return Results
Provide:
- Original query metrics (time, data scanned, spillage)
- Identified issues
- The optimized query
- Summary of changes made
- Expected improvement
Example Output
Original Query Metrics:
- Execution time: 45 seconds
- Data scanned: 12.3 GB
- Partitions: 500/500 (no pruning)
- Spillage: 2.1 GB
Issues Found:
- No partition pruning - filtering on non-cluster column
- SELECT * scanning unnecessary columns
- Large table joined without pre-filtering
Optimized Query:
    WITH filtered_events AS (
        SELECT event_id, user_id, event_type, created_at
        FROM events
        WHERE created_at >= '2024-01-01'
          AND created_at < '2024-02-01'
          AND event_type = 'purchase'
    )
    SELECT fe.event_id, fe.created_at, u.name
    FROM filtered_events fe
    JOIN users u ON fe.user_id = u.id;
Changes:
- Added date range filter matching cluster key
- Replaced SELECT * with specific columns
- Pre-filtered in CTE before join
Expected Improvement:
- Partitions: 500 → ~15 (97% reduction)
- Data scanned: 12.3 GB → ~0.4 GB
- Estimated time: 45s → ~3s
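The headline percentages above are plain arithmetic; a quick sanity check of the example's numbers (`pct_reduction` is just an illustrative helper):

```python
def pct_reduction(before, after):
    """Percentage reduction from `before` to `after`, rounded to a whole percent."""
    return round(100 * (before - after) / before)

print(pct_reduction(500, 15))    # partitions: 97
print(pct_reduction(12.3, 0.4))  # GB scanned: 97
print(pct_reduction(45, 3))      # seconds: 93
```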