Data-engineering-skills optimizing-query-by-id
Install

Source · Clone the upstream repo:

    git clone https://github.com/AltimateAI/data-engineering-skills

Claude Code · Install into ~/.claude/skills/:

    T=$(mktemp -d) && git clone --depth=1 https://github.com/AltimateAI/data-engineering-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/snowflake/optimizing-query-by-id" ~/.claude/skills/altimateai-data-engineering-skills-optimizing-query-by-id && rm -rf "$T"

Manifest: skills/snowflake/optimizing-query-by-id/SKILL.md
Optimize Query from Query ID
Fetch query → Get profile → Apply best practices → Verify improvement → Return optimized query
Workflow
1. Fetch Query Details from Query ID
    SELECT query_id,
           query_text,
           total_elapsed_time / 1000 AS seconds,
           bytes_scanned / 1e9 AS gb_scanned,
           bytes_spilled_to_local_storage / 1e9 AS gb_spilled_local,
           bytes_spilled_to_remote_storage / 1e9 AS gb_spilled_remote,
           partitions_scanned,
           partitions_total,
           rows_produced
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
    WHERE query_id = '<query_id>';
Note the key metrics:
- seconds: Total execution time
- gb_scanned: Data read (lower is better)
- gb_spilled_local / gb_spilled_remote: Spillage indicates memory pressure
- partitions_scanned / partitions_total: Partition pruning effectiveness
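Once a row like this has been fetched (for example via the Snowflake Python connector), the workflow's key signals can be derived mechanically. A minimal sketch, assuming the row is a dict keyed by the column aliases above (`summarize_metrics` is a hypothetical helper, not part of any API):

```python
def summarize_metrics(row):
    """Derive the workflow's key signals from a QUERY_HISTORY metrics row.

    `row` is assumed to be a dict keyed by the column aliases in the
    query above (seconds, gb_scanned, gb_spilled_*, partitions_*).
    """
    return {
        "seconds": row["seconds"],
        "gb_scanned": row["gb_scanned"],
        # 1.0 means every partition was scanned, i.e. no pruning at all
        "pruning_ratio": row["partitions_scanned"] / row["partitions_total"],
        # Any spillage at all indicates memory pressure
        "spilled_gb": row["gb_spilled_local"] + row["gb_spilled_remote"],
    }

row = {"seconds": 45, "gb_scanned": 12.3,
       "gb_spilled_local": 2.1, "gb_spilled_remote": 0.0,
       "partitions_scanned": 500, "partitions_total": 500}
print(summarize_metrics(row))
```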
2. Get Query Profile Details
    -- Get operator-level statistics
    SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS('<query_id>'));
Look for:
- Operators with high output_rows vs input_rows (explosions)
- TableScan operators with high bytes
- Sort/Aggregate operators with spillage
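These checks can be sketched in Python. Note that GET_QUERY_OPERATOR_STATS actually returns columns such as OPERATOR_TYPE with nested JSON statistics; the flat dicts and the 10x row-explosion cutoff below are simplifying assumptions for illustration:

```python
def flag_operators(stats):
    """Flag suspicious operators in simplified operator-stats rows."""
    findings = []
    for op in stats:
        kind = op["operator_type"]
        in_rows = op.get("input_rows", 0)
        # Row explosion: output far exceeds input (10x is an arbitrary cutoff)
        if in_rows and op.get("output_rows", 0) > 10 * in_rows:
            findings.append((kind, "row explosion"))
        # Any spillage indicates the operator ran out of memory
        if op.get("bytes_spilled", 0) > 0:
            findings.append((kind, "spillage"))
        # Large scans: a TableScan reading more than ~1 GB
        if kind == "TableScan" and op.get("bytes_scanned", 0) > 1e9:
            findings.append((kind, "large scan"))
    return findings

stats = [
    {"operator_type": "Join", "input_rows": 100, "output_rows": 5_000},
    {"operator_type": "Sort", "input_rows": 5_000, "output_rows": 5_000,
     "bytes_spilled": 2_000_000},
    {"operator_type": "TableScan", "output_rows": 1_000_000,
     "bytes_scanned": 5e9},
]
print(flag_operators(stats))
```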
3. Identify Optimization Opportunities
Based on profile, look for:
| Metric | Issue | Fix |
|---|---|---|
| partitions_scanned = partitions_total | No pruning | Add filter on cluster key |
| gb_spilled > 0 | Memory pressure | Simplify query, increase warehouse |
| High bytes_scanned | Full scan | Add selective filters, reduce columns |
| Join explosion | Cartesian or bad key | Fix join condition, filter before join |
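The table's metric-based rules can be expressed as a small diagnostic function. A sketch assuming a flat metrics dict like step 1's output; the 10 GB "high scan" cutoff is an arbitrary illustration, and join explosions are omitted because they are detected from the operator stats in step 2:

```python
def diagnose(m):
    """Map query metrics to (issue, fix) pairs per the table above."""
    issues = []
    if m["partitions_scanned"] == m["partitions_total"]:
        issues.append(("No pruning", "Add filter on cluster key"))
    if m["gb_spilled"] > 0:
        issues.append(("Memory pressure", "Simplify query, increase warehouse"))
    if m["gb_scanned"] > 10:  # assumed "high" threshold, in GB
        issues.append(("Full scan", "Add selective filters, reduce columns"))
    return issues

m = {"partitions_scanned": 500, "partitions_total": 500,
     "gb_spilled": 2.1, "gb_scanned": 12.3}
print(diagnose(m))
```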
4. Apply Optimizations
Rewrite the query:
- Select only needed columns
- Filter early (before joins)
- Use CTEs to avoid repeated scans
- Ensure filters align with clustering keys
- Add LIMIT if full result not needed
5. Get Explain Plan for Optimized Query
    EXPLAIN USING JSON <optimized_query>;
6. Compare Plans
Compare original vs optimized:
- Fewer partitions scanned?
- Fewer intermediate rows?
- Better join order?
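One way to make the comparison concrete is to reduce each plan to a summary dict and diff it; a sketch with illustrative key names (these are not Snowflake's EXPLAIN JSON field names):

```python
def plan_deltas(before, after):
    """Per-metric reduction between two plan summaries (positive = better)."""
    return {k: before[k] - after[k] for k in before}

def improved(before, after):
    """True if no metric got worse and at least one got better."""
    deltas = plan_deltas(before, after).values()
    return all(d >= 0 for d in deltas) and any(d > 0 for d in deltas)

original  = {"partitions_scanned": 500, "intermediate_rows": 2_000_000}
optimized = {"partitions_scanned": 15, "intermediate_rows": 40_000}
print(plan_deltas(original, optimized))
print(improved(original, optimized))
```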
7. Return Results
Provide:
- Original query metrics (time, data scanned, spillage)
- Identified issues
- The optimized query
- Summary of changes made
- Expected improvement
Example Output
Original Query Metrics:
- Execution time: 45 seconds
- Data scanned: 12.3 GB
- Partitions: 500/500 (no pruning)
- Spillage: 2.1 GB
Issues Found:
- No partition pruning - filtering on non-cluster column
- SELECT * scanning unnecessary columns
- Large table joined without pre-filtering
Optimized Query:
    WITH filtered_events AS (
        SELECT event_id, user_id, event_type, created_at
        FROM events
        WHERE created_at >= '2024-01-01'
          AND created_at < '2024-02-01'
          AND event_type = 'purchase'
    )
    SELECT fe.event_id, fe.created_at, u.name
    FROM filtered_events fe
    JOIN users u ON fe.user_id = u.id;
Changes:
- Added date range filter matching cluster key
- Replaced SELECT * with specific columns
- Pre-filtered in CTE before join
Expected Improvement:
- Partitions: 500 → ~15 (97% reduction)
- Data scanned: 12.3 GB → ~0.4 GB
- Estimated time: 45s → ~3s
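The headline percentages above are plain arithmetic; a quick sanity check of the example's numbers (`pct_reduction` is just an illustrative helper):

```python
def pct_reduction(before, after):
    """Percentage reduction from `before` to `after`, rounded to a whole percent."""
    return round(100 * (before - after) / before)

print(pct_reduction(500, 15))    # partitions: 97
print(pct_reduction(12.3, 0.4))  # GB scanned: 97
print(pct_reduction(45, 3))      # seconds: 93
```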