Claude-skill-registry bq-query-optimization
Use when writing BigQuery queries, optimizing query performance, analyzing execution plans, or avoiding common SQL gotchas. Covers parameterized queries, UDFs, scripting, window functions (QUALIFY, ROW_NUMBER, RANK, LEAD/LAG), JSON functions, ARRAY/STRUCT operations, BigQuery-specific features (EXCEPT, REPLACE, SAFE_*), CTE re-execution issues, NOT IN with NULLs, DML performance, Standard vs Legacy SQL, and performance best practices.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/bq-query-optimization" ~/.claude/skills/majiayu000-claude-skill-registry-bq-query-optimization && rm -rf "$T"
skills/data/bq-query-optimization/SKILL.md

BigQuery Query Optimization
Use this skill when writing, debugging, or optimizing BigQuery SQL queries for performance and efficiency.
Query Execution Analysis
Inspecting the Query Plan
BigQuery has no EXPLAIN statement. To estimate a query's cost before running it, use a dry run (bq query --dry_run, or dryRun in the API), which reports the bytes that would be processed. To analyze a query after it runs, use the "Execution details" tab in the console, the Jobs API (statistics.query.queryPlan), or the INFORMATION_SCHEMA.JOBS views.
What the plan shows:
- Stages of execution
- Bytes read per stage
- Slot time consumed
- Potential bottlenecks
Performance Best Practices
1. Avoid SELECT *
❌ Bad (scans all columns):
SELECT * FROM `project.dataset.large_table`
✅ Good (only needed columns):
SELECT customer_id, amount, date FROM `project.dataset.large_table`
Impact: Full table scan vs targeted column read. Can reduce data scanned by 90%+.
2. Filter Early and Often
❌ Bad (filter after aggregation):
SELECT customer_id, SUM(amount) AS total
FROM `project.dataset.orders`
GROUP BY customer_id
HAVING SUM(amount) > 1000
✅ Good (filter before aggregation):
SELECT customer_id, SUM(amount) AS total
FROM `project.dataset.orders`
WHERE amount > 100  -- Filter early
GROUP BY customer_id
HAVING SUM(amount) > 1000
Note: the added WHERE changes the semantics (orders of 100 or less no longer count toward the totals), so only filter early on conditions that don't change the answer you need.
3. Use Partitioned Tables
Without partition filter:
-- Scans the entire table (filter is not on the partition column)
SELECT *
FROM `project.dataset.orders`
WHERE customer_region = 'US'
With partition filter:
-- Only scans relevant partitions (order_date is the partition column)
SELECT *
FROM `project.dataset.orders`
WHERE order_date >= '2024-01-01'
Key: Filter on the partition column for automatic partition pruning.
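Pruning presupposes the table was created partitioned. A minimal sketch (the schema is illustrative); require_partition_filter makes BigQuery reject queries that forget the filter:
CREATE TABLE `project.dataset.orders` (
  order_id STRING,
  customer_id STRING,
  amount FLOAT64,
  order_date DATE
)
PARTITION BY order_date
OPTIONS (require_partition_filter = TRUE);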
4. Break Up Complex Queries
❌ Anti-pattern (one huge query):
SELECT ...
FROM (
  SELECT ...
  FROM (
    SELECT ...  -- Deeply nested
  )
)
WHERE ...
✅ Good (use CTEs):
WITH base_data AS (
  SELECT customer_id, amount, date
  FROM `project.dataset.orders`
  WHERE date >= '2024-01-01'
),
aggregated AS (
  SELECT customer_id, SUM(amount) AS total
  FROM base_data
  GROUP BY customer_id
)
SELECT * FROM aggregated WHERE total > 1000
✅ Better (multi-statement with temp tables):
CREATE TEMP TABLE base_data AS
SELECT customer_id, amount, date
FROM `project.dataset.orders`
WHERE date >= '2024-01-01';

CREATE TEMP TABLE aggregated AS
SELECT customer_id, SUM(amount) AS total
FROM base_data
GROUP BY customer_id;

SELECT * FROM aggregated WHERE total > 1000;
5. JOIN Optimization
Put the largest table first (BigQuery's optimizer usually reorders joins itself, but this convention keeps intent clear):
-- ✅ Large table first
SELECT l.*, s.detail
FROM `project.dataset.large_table` l
JOIN `project.dataset.small_table` s
  ON l.id = s.id
Use clustering on JOIN columns:
- Cluster tables on frequently joined columns
- BigQuery can prune data blocks more effectively
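For example, rebuilding a lookup table clustered on its join key (a sketch; table names are illustrative):
CREATE TABLE `project.dataset.small_table_clustered`
CLUSTER BY id AS
SELECT * FROM `project.dataset.small_table`;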
Consider ARRAY/STRUCT for 1:many:
-- Instead of JOIN for 1:many relationships
SELECT
  order_id,
  ARRAY_AGG(STRUCT(product_id, quantity, price)) AS items
FROM `project.dataset.order_items`
GROUP BY order_id
6. Leverage Automatic Features
BigQuery automatically performs:
- Query rewrites - Optimizes query structure
- Partition pruning - With proper filters
- Dynamic filtering - Reduces data scanned
Ensure your queries enable these:
- Filter on partition columns
- Use simple, clear predicates
- Avoid functions on partition columns in WHERE clause
Parameterized Queries
CLI Syntax
bq query \
  --use_legacy_sql=false \
  --parameter=start_date:DATE:2024-01-01 \
  --parameter=end_date:DATE:2024-12-31 \
  --parameter=min_amount:FLOAT64:100.0 \
  'SELECT *
   FROM `project.dataset.orders`
   WHERE order_date BETWEEN @start_date AND @end_date
     AND amount >= @min_amount'
Python Syntax
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT customer_id, SUM(amount) AS total
    FROM `project.dataset.orders`
    WHERE order_date >= @start_date
    GROUP BY customer_id
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start_date", "DATE", "2024-01-01")
    ]
)
results = client.query_and_wait(query, job_config=job_config)
Key points:
- Use the @ prefix for named parameters
- Syntax: name:TYPE:value, or name::value (type defaults to STRING)
- Parameters cannot be used as column or table names
- Only works with standard SQL
User-Defined Functions (UDFs)
SQL UDF (Simple)
CREATE TEMP FUNCTION CleanEmail(email STRING)
RETURNS STRING AS (
  LOWER(TRIM(email))
);

SELECT CleanEmail(customer_email) AS email
FROM `project.dataset.customers`;
JavaScript UDF (Complex Logic)
CREATE TEMP FUNCTION ParseUserAgent(ua STRING)
RETURNS STRUCT<browser STRING, version STRING>
LANGUAGE js AS r"""
  var match = ua.match(/(Chrome|Firefox|Safari)\/(\d+)/);
  return {
    browser: match ? match[1] : 'Unknown',
    version: match ? match[2] : '0'
  };
""";

SELECT ParseUserAgent(user_agent).browser AS browser
FROM `project.dataset.sessions`;
Limitations:
- JavaScript has no native 64-bit integers, so INT64 is unsupported in JavaScript UDFs (use FLOAT64 or STRING instead)
Persistent UDFs
-- Create once, use many times
CREATE FUNCTION `project.dataset.clean_email`(email STRING)
RETURNS STRING AS (LOWER(TRIM(email)));

-- Use anywhere
SELECT `project.dataset.clean_email`(email) FROM ...
Scripting & Procedural Language
Variables
DECLARE total_orders INT64;
SET total_orders = (SELECT COUNT(*) FROM `project.dataset.orders`);
SELECT total_orders;
Loops
LOOP:
DECLARE x INT64 DEFAULT 0;
LOOP
  SET x = x + 1;
  IF x >= 10 THEN
    LEAVE;
  END IF;
END LOOP;
WHILE:
DECLARE x INT64 DEFAULT 0;
WHILE x < 10 DO
  SET x = x + 1;
END WHILE;
FOR with arrays:
DECLARE ids ARRAY<STRING>;
SET ids = ['id1', 'id2', 'id3'];

FOR item IN (SELECT * FROM UNNEST(ids) AS id) DO
  -- Process each id (fields are accessed through the loop variable)
  SELECT item.id;
END FOR;
Query Caching
Automatic caching (24 hours):
- Identical queries are served from cache at no additional cost
- Cached results return almost instantly
- Only deterministic queries are cached (queries using CURRENT_TIMESTAMP(), RAND(), etc. are not)
To bypass cache:
bq query --use_cache=false 'SELECT...'
Common Anti-Patterns
❌ Using LIMIT to reduce cost
-- LIMIT doesn't reduce data scanned or cost!
SELECT * FROM `project.dataset.huge_table` LIMIT 10
Impact: Still scans entire table. Use WHERE filters instead.
❌ Functions on partition columns
-- Prevents partition pruning
WHERE CAST(date_column AS STRING) = '2024-01-01'
✅ Better:
WHERE date_column = DATE('2024-01-01')
❌ Cross joins without filters
-- Cartesian product = huge result
SELECT * FROM table1 CROSS JOIN table2
Impact: Can generate millions/billions of rows.
❌ Correlated subqueries
-- Runs the subquery for each row
SELECT *
FROM orders o
WHERE amount > (
  SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id
)
✅ Better (use window functions):
SELECT *
FROM (
  SELECT *, AVG(amount) OVER (PARTITION BY customer_id) AS avg_amount
  FROM orders
)
WHERE amount > avg_amount
Common SQL Gotchas
CTE Re-execution (CRITICAL COST ISSUE)
Problem: When a CTE is referenced multiple times, BigQuery typically inlines it and re-executes it once per reference, billing you for each execution.
❌ Bad (CTE runs 3 times - billed 3x):
WITH expensive_cte AS (
  SELECT * FROM `project.dataset.huge_table`
  WHERE complex_conditions AND lots_of_joins
)
SELECT CAST(COUNT(*) AS STRING) FROM expensive_cte    -- CAST so the UNION ALL branches share one type
UNION ALL
SELECT CAST(SUM(amount) AS STRING) FROM expensive_cte
UNION ALL
SELECT CAST(MAX(date) AS STRING) FROM expensive_cte;
Impact: If the CTE scans 10 TB, you're billed for 30 TB (10 TB × 3).
✅ Good (use temp table - billed 1x):
CREATE TEMP TABLE expensive_data AS
SELECT * FROM `project.dataset.huge_table`
WHERE complex_conditions AND lots_of_joins;

SELECT CAST(COUNT(*) AS STRING) FROM expensive_data
UNION ALL
SELECT CAST(SUM(amount) AS STRING) FROM expensive_data
UNION ALL
SELECT CAST(MAX(date) AS STRING) FROM expensive_data;
When CTEs are OK:
- Referenced only once
- Very small result set
- Part of larger query (BigQuery may optimize)
When to use temp tables:
- CTE referenced 2+ times
- Large data volumes
- Complex/expensive CTE query
NOT IN with NULL Values (SILENT FAILURE)
Problem: NOT IN returns NOTHING (an empty result) if ANY NULL exists in the subquery.
❌ Broken (returns empty if blocked_customers has any NULL):
SELECT *
FROM customers
WHERE customer_id NOT IN (
  SELECT customer_id FROM blocked_customers  -- If ANY NULL, returns 0 rows!
);
Why it fails:
- SQL uses three-valued logic: TRUE, FALSE, UNKNOWN
- NULL IN (...) evaluates to UNKNOWN
- NOT UNKNOWN is still UNKNOWN
- Rows with UNKNOWN are filtered out
✅ Solution 1: Use NOT EXISTS (safest):
SELECT *
FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM blocked_customers b
  WHERE b.customer_id = c.customer_id
);
✅ Solution 2: Filter NULLs explicitly:
SELECT *
FROM customers
WHERE customer_id NOT IN (
  SELECT customer_id
  FROM blocked_customers
  WHERE customer_id IS NOT NULL  -- Explicit NULL filter
);
✅ Solution 3: Use LEFT JOIN:
SELECT c.*
FROM customers c
LEFT JOIN blocked_customers b
  ON c.customer_id = b.customer_id
WHERE b.customer_id IS NULL;
Best practice: Prefer NOT EXISTS - it's clearer, safer, and often faster.
DML Statement Performance
Problem: BigQuery is optimized for analytics (OLAP), not transactional updates (OLTP). DML statements are slow and expensive.
Why DML is slow in BigQuery:
- Columnar storage (not row-based)
- Designed for bulk reads, not individual updates
- No indexes for fast row lookups
- Every update rewrites affected partitions
❌ Very slow (row-by-row updates):
-- Don't do this - takes minutes/hours
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_id = '12345';

-- This is even worse - runs once per row
FOR record IN (SELECT order_id FROM orders_to_update) DO
  UPDATE orders SET status = 'processed'
  WHERE order_id = record.order_id;
END FOR;
⚠️ Better (batch updates):
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_id IN (SELECT order_id FROM orders_to_update);
✅ Best (recreate table - fastest):
CREATE OR REPLACE TABLE `project.dataset.orders` AS
SELECT
  * EXCEPT(status),
  CASE
    WHEN order_id IN (SELECT order_id FROM orders_to_update) THEN 'processed'
    ELSE status
  END AS status
FROM `project.dataset.orders`;
For INSERT/UPSERT - use MERGE:
MERGE `project.dataset.customers` AS target
USING `project.dataset.customer_updates` AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET name = source.name, updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, created_at)
  VALUES (source.customer_id, source.name, CURRENT_TIMESTAMP());
Best practices:
- Batch updates instead of row-by-row
- Use MERGE for upserts
- Consider recreating table for large updates
- Partition tables to limit update scope (sketch after this list)
- Avoid frequent small DML operations
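For the partition-scope item, a sketch (assuming orders is partitioned on order_date, as in the earlier DDL example); the partition filter confines the rewrite to the touched partitions:
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_date = '2024-01-15'  -- partition filter limits the rewrite
  AND status = 'pending';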
Data Type Gotchas
INT64 is the only integer type:
-- All of these are the same type: INT64
CREATE TABLE example (
  col1 INT64,     -- ✅ Explicit
  col2 INTEGER,   -- Alias for INT64
  col3 INT,       -- Alias for INT64
  col4 SMALLINT,  -- Alias for INT64
  col5 BIGINT     -- Alias for INT64
);
No UUID type - use STRING:
-- PostgreSQL
CREATE TABLE users (id UUID);

-- BigQuery
CREATE TABLE users (id STRING);  -- Store UUID as string
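To generate IDs on the BigQuery side, GENERATE_UUID() returns a random version-4 UUID as a STRING:
SELECT GENERATE_UUID() AS id;
-- e.g. '4192bff0-e1e0-43ce-a4db-912808c32493' (different on every call)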
NUMERIC precision limits:
-- NUMERIC: 38 digits of precision, 9 decimal places
NUMERIC(38, 9)

-- BIGNUMERIC: 76 digits of precision, 38 decimal places
BIGNUMERIC(76, 38)

-- Example
SELECT
  CAST('12345678901234567890.123456789' AS NUMERIC) AS num,
  CAST('12345678901234567890.123456789' AS BIGNUMERIC) AS bignum;
TIMESTAMP vs DATETIME vs DATE:
-- TIMESTAMP: UTC, timezone-aware
SELECT CURRENT_TIMESTAMP();  -- 2024-01-15 10:30:45.123456 UTC

-- DATETIME: No timezone
SELECT CURRENT_DATETIME();   -- 2024-01-15 10:30:45.123456

-- DATE: Date only
SELECT CURRENT_DATE();       -- 2024-01-15

-- Conversion
SELECT
  TIMESTAMP('2024-01-15 10:30:45'),           -- Assumes UTC
  DATETIME(TIMESTAMP '2024-01-15 10:30:45'),  -- Loses timezone
  DATE(TIMESTAMP '2024-01-15 10:30:45');      -- Loses time
Type coercion in JOINs:
-- ❌ Implicit cast can prevent optimization
SELECT *
FROM table1 t1
JOIN table2 t2
  ON t1.id = CAST(t2.id AS STRING);  -- Prevents clustering optimization

-- ✅ Match types explicitly
SELECT *
FROM table1 t1
JOIN table2 t2
  ON t1.id = t2.id;  -- Both STRING
Window Functions
Window functions perform calculations across a set of rows related to the current row, without collapsing the result set.
Basic Syntax
<function> OVER (
  [PARTITION BY partition_expression]
  [ORDER BY sort_expression]
  [window_frame_clause]
)
Ranking Functions
ROW_NUMBER() - Sequential numbering:
SELECT
  customer_id,
  order_date,
  amount,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
FROM orders;
Common use: Deduplication
SELECT * EXCEPT(row_num)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS row_num
  FROM customers
)
WHERE row_num = 1;
RANK() - Rank with gaps:
SELECT product_name, revenue,
  RANK() OVER (ORDER BY revenue DESC) AS rank
FROM products;

-- Results:
-- Product A: $1000, rank 1
-- Product B: $900, rank 2
-- Product C: $900, rank 2 (tie)
-- Product D: $800, rank 4 (gap after tie)
DENSE_RANK() - Rank without gaps:
SELECT product_name, revenue,
  DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rank
FROM products;

-- Results:
-- Product A: $1000, rank 1
-- Product B: $900, rank 2
-- Product C: $900, rank 2 (tie)
-- Product D: $800, rank 3 (no gap)
NTILE() - Divide into N buckets:
SELECT customer_id, total_spend,
  NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile
FROM customer_totals;
Analytical Functions
LEAD() and LAG() - Access rows before/after:
-- Time series analysis
SELECT
  date,
  revenue,
  LAG(revenue) OVER (ORDER BY date) AS prev_day_revenue,
  LEAD(revenue) OVER (ORDER BY date) AS next_day_revenue,
  revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change,
  ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY date))
        / LAG(revenue) OVER (ORDER BY date), 2) AS pct_change
FROM daily_sales
ORDER BY date;
With PARTITION BY:
-- Per-customer analysis
SELECT
  customer_id,
  order_date,
  amount,
  LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_amount,
  amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS amount_diff
FROM orders;
FIRST_VALUE() and LAST_VALUE():
SELECT
  date,
  revenue,
  FIRST_VALUE(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS first_day_revenue,
  LAST_VALUE(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last_day_revenue
FROM daily_sales;
NTH_VALUE() - Get Nth value:
SELECT
  product_id,
  date,
  sales,
  NTH_VALUE(sales, 2) OVER (PARTITION BY product_id ORDER BY date) AS second_day_sales
FROM product_sales;
Aggregate Window Functions
SUM/AVG/COUNT as window functions:
SELECT
  date,
  revenue,
  SUM(revenue) OVER (ORDER BY date) AS cumulative_revenue,
  AVG(revenue) OVER (
    ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7day,
  COUNT(*) OVER (
    ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
  ) AS rolling_30day_count
FROM daily_sales;
Running totals:
SELECT
  customer_id,
  order_date,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_lifetime_value
FROM orders;
QUALIFY Clause
QUALIFY filters on window function results - no subquery needed! (It's a GoogleSQL extension rather than ANSI SQL; a few other warehouses such as Snowflake and Teradata support it too.)
❌ Standard SQL (verbose):
SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
  FROM orders
)
WHERE row_num = 1;
✅ BigQuery with QUALIFY (clean):
SELECT customer_id, order_date, amount
FROM orders
WHERE TRUE  -- see note below
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;
Note: BigQuery has historically required QUALIFY to be accompanied by a WHERE, GROUP BY, or HAVING clause; a no-op WHERE TRUE satisfies this and is harmless either way.
More QUALIFY examples:
-- Get top 3 products per category
SELECT category, product_name, revenue
FROM products
WHERE TRUE
QUALIFY RANK() OVER (PARTITION BY category ORDER BY revenue DESC) <= 3;

-- Filter outliers (keep middle 80%)
SELECT *
FROM measurements
WHERE TRUE
QUALIFY NTILE(10) OVER (ORDER BY value) BETWEEN 2 AND 9;

-- Get first order per customer
SELECT customer_id, order_date, amount
FROM orders
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) = 1;
Window Frame Clauses
Control which rows are included in the window:
-- ROWS: Physical row count
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW  -- Last 4 rows including current

-- RANGE: Logical range based on the ORDER BY value.
-- BigQuery's RANGE frames require a numeric ORDER BY key
-- (INTERVAL bounds are not supported), so convert dates with UNIX_DATE.
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW  -- Last 7 days when ordering by UNIX_DATE(date)

-- Examples:
SELECT
  date,
  sales,
  -- Last 7 rows
  AVG(sales) OVER (
    ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_7rows,
  -- Last 7 days (logical)
  AVG(sales) OVER (
    ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_7days,
  -- All preceding rows
  SUM(sales) OVER (
    ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_sum,
  -- Centered window (3 before, 3 after)
  AVG(sales) OVER (
    ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
  ) AS centered_avg
FROM daily_sales;
Window Function Performance Tips
1. Partition appropriately:
-- ✅ Good: Partition filter reduces data scanned
SELECT *
FROM events
WHERE date >= '2024-01-01'  -- Partition filter
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) = 1;
2. Avoid window functions in WHERE:
-- ❌ Wrong: Can't use window functions in WHERE
SELECT *
FROM orders
WHERE ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;  -- ERROR

-- ✅ Use QUALIFY instead
SELECT *
FROM orders
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;
3. Reuse window definitions:
SELECT
  date,
  revenue,
  ROW_NUMBER() OVER w AS row_num,
  RANK() OVER w AS rank,
  AVG(revenue) OVER w AS avg_revenue
FROM sales
WINDOW w AS (PARTITION BY category ORDER BY revenue DESC);
JSON Functions
BigQuery provides rich functions for parsing, extracting, and manipulating JSON data.
Extracting JSON Values
JSON_VALUE() - Extract scalar values (Standard SQL):
SELECT
  JSON_VALUE(json_column, '$.user.name') AS user_name,
  JSON_VALUE(json_column, '$.user.email') AS email,
  CAST(JSON_VALUE(json_column, '$.amount') AS FLOAT64) AS amount,
  CAST(JSON_VALUE(json_column, '$.quantity') AS INT64) AS quantity
FROM events;
JSON_QUERY() - Extract objects or arrays:
SELECT
  JSON_QUERY(json_column, '$.user') AS user_object,
  JSON_QUERY(json_column, '$.items') AS items_array
FROM events;
JSON_EXTRACT() - Legacy, still widely used:
SELECT
  JSON_EXTRACT(json_column, '$.user.name') AS user_name,
  JSON_EXTRACT_SCALAR(json_column, '$.user.email') AS email  -- Returns STRING
FROM events;
JSONPath syntax (BigQuery supports a limited subset):
-- Dot notation
'$.user.name'

-- Array index
'$.items[0].product_id'
Array slices ('$.items[0:3]'), wildcards ('$.users[*].name'), and recursive descent ('$..name') are not supported by BigQuery's JSON functions; restructure with JSON_EXTRACT_ARRAY and UNNEST instead.
Working with JSON Arrays
JSON_EXTRACT_ARRAY() - Extract array elements:
SELECT
  event_id,
  tag
FROM events,
  UNNEST(JSON_EXTRACT_ARRAY(tags_json, '$')) AS tag;
JSON_VALUE_ARRAY() - Extract array of scalars:
SELECT
  product_id,
  tag
FROM products,
  UNNEST(JSON_VALUE_ARRAY(tags_json, '$')) AS tag;
Complete example:
-- JSON: {"product_id": "A123", "tags": ["electronics", "sale", "featured"]} SELECT JSON_VALUE(product_json, '$.product_id') AS product_id, tag FROM products, UNNEST(JSON_VALUE_ARRAY(product_json, '$.tags')) AS tag; -- Results: -- A123, electronics -- A123, sale -- A123, featured
Creating JSON
TO_JSON_STRING() - Convert to JSON:
SELECT
  customer_id,
  TO_JSON_STRING(STRUCT(
    name,
    email,
    created_at
  )) AS customer_json
FROM customers;
Create JSON objects:
SELECT TO_JSON_STRING(STRUCT(
  'John' AS name,
  30 AS age,
  ['reading', 'hiking'] AS hobbies,
  STRUCT('123 Main St' AS street, 'Boston' AS city) AS address
)) AS person_json;

-- Result:
-- {"name":"John","age":30,"hobbies":["reading","hiking"],"address":{"street":"123 Main St","city":"Boston"}}
Aggregate into JSON:
SELECT
  customer_id,
  TO_JSON_STRING(
    ARRAY_AGG(
      STRUCT(order_id, amount, date)
      ORDER BY date DESC
      LIMIT 5
    )
  ) AS recent_orders_json
FROM orders
GROUP BY customer_id;
Parsing JSON Strings
PARSE_JSON() - Convert string to JSON:
SELECT JSON_VALUE(PARSE_JSON('{"name":"Alice","age":25}'), '$.name') AS name;
Safe JSON parsing (avoid errors):
SELECT
  SAFE.JSON_VALUE(invalid_json, '$.name') AS name  -- Returns NULL on error
FROM events;
Complex JSON Examples
Nested JSON extraction:
-- JSON structure:
-- {
--   "order": {
--     "id": "ORD123",
--     "items": [
--       {"product": "A", "qty": 2, "price": 10.50},
--       {"product": "B", "qty": 1, "price": 25.00}
--     ]
--   }
-- }
SELECT
  JSON_VALUE(data, '$.order.id') AS order_id,
  JSON_VALUE(item, '$.product') AS product,
  CAST(JSON_VALUE(item, '$.qty') AS INT64) AS quantity,
  CAST(JSON_VALUE(item, '$.price') AS FLOAT64) AS price
FROM events,
  UNNEST(JSON_EXTRACT_ARRAY(data, '$.order.items')) AS item;
Transform relational data to JSON:
SELECT
  category,
  TO_JSON_STRING(
    STRUCT(
      category AS category_name,
      COUNT(*) AS product_count,
      ROUND(AVG(price), 2) AS avg_price,
      ARRAY_AGG(
        STRUCT(product_name, price)
        ORDER BY price DESC
        LIMIT 3
      ) AS top_products
    )
  ) AS category_summary
FROM products
GROUP BY category;
JSON Performance Tips
1. Extract once, reuse:
-- ❌ Bad: Multiple extractions from the full document
SELECT
  JSON_VALUE(data, '$.user.id'),
  JSON_VALUE(data, '$.user.name'),
  JSON_VALUE(data, '$.user.email')
FROM events;

-- ✅ Better: Extract the object once
WITH extracted AS (
  SELECT JSON_QUERY(data, '$.user') AS user_json
  FROM events
)
SELECT
  JSON_VALUE(user_json, '$.id'),
  JSON_VALUE(user_json, '$.name'),
  JSON_VALUE(user_json, '$.email')
FROM extracted;
2. Consider STRUCT columns instead of JSON:
-- If schema is known and stable, use STRUCT
CREATE TABLE events (
  user STRUCT<id STRING, name STRING, email STRING>,
  timestamp TIMESTAMP
);

-- Query with dot notation (faster than JSON extraction)
SELECT user.id, user.name, user.email
FROM events;
3. Materialize frequently accessed JSON fields:
-- Add a real column and backfill it (BigQuery has no computed/generated columns)
ALTER TABLE events ADD COLUMN user_id STRING;

UPDATE events
SET user_id = JSON_VALUE(data, '$.user.id')
WHERE TRUE;

-- Now queries can filter efficiently
SELECT * FROM events WHERE user_id = 'U123';
ARRAY and STRUCT
BigQuery's native support for nested and repeated data allows for powerful denormalization and performance optimization.
ARRAY Basics
Creating arrays:
SELECT
  [1, 2, 3, 4, 5] AS numbers,
  ['apple', 'banana', 'cherry'] AS fruits,
  [DATE '2024-01-01', DATE '2024-01-02'] AS dates;
ARRAY_AGG() - Aggregate into array:
SELECT
  customer_id,
  ARRAY_AGG(order_id ORDER BY order_date DESC) AS order_ids,
  ARRAY_AGG(amount) AS order_amounts
FROM orders
GROUP BY customer_id;
With LIMIT:
SELECT
  customer_id,
  ARRAY_AGG(order_id ORDER BY order_date DESC LIMIT 5) AS recent_order_ids
FROM orders
GROUP BY customer_id;
UNNEST - Flattening Arrays
Basic UNNEST:
SELECT element
FROM UNNEST(['a', 'b', 'c']) AS element;

-- Results:
-- a
-- b
-- c
UNNEST with table:
-- Table: customers
-- customer_id | order_ids
-- 1           | [101, 102, 103]
-- 2           | [201, 202]

SELECT customer_id, order_id
FROM customers,
  UNNEST(order_ids) AS order_id;

-- Results:
-- 1, 101
-- 1, 102
-- 1, 103
-- 2, 201
-- 2, 202
UNNEST with OFFSET (get array index):
SELECT item, idx
FROM UNNEST(['first', 'second', 'third']) AS item WITH OFFSET AS idx;

-- Results:
-- first, 0
-- second, 1
-- third, 2
STRUCT - Nested Records
Creating structs:
SELECT
  STRUCT('John' AS name, 30 AS age, 'Engineer' AS role) AS person,
  STRUCT('123 Main St' AS street, 'Boston' AS city, '02101' AS zip) AS address;
Querying struct fields:
SELECT
  person.name,
  person.age,
  address.city
FROM (
  SELECT
    STRUCT('John' AS name, 30 AS age) AS person,
    STRUCT('Boston' AS city) AS address
);
ARRAY of STRUCT (Most Powerful Pattern)
Create:
SELECT
  customer_id,
  ARRAY_AGG(
    STRUCT(
      order_id,
      amount,
      order_date,
      status
    )
    ORDER BY order_date DESC
  ) AS orders
FROM orders
GROUP BY customer_id;
Query:
-- Flatten array of struct ("order" is a reserved word, so use a different alias)
SELECT
  customer_id,
  o.order_id,
  o.amount,
  o.order_date
FROM customers,
  UNNEST(orders) AS o
WHERE o.status = 'completed';
Filter array elements:
SELECT
  customer_id,
  ARRAY(
    SELECT AS STRUCT order_id, amount
    FROM UNNEST(orders) AS o
    WHERE o.status = 'completed'
    ORDER BY amount DESC
    LIMIT 3
  ) AS top_completed_orders
FROM customers;
ARRAY Functions
ARRAY_LENGTH():
SELECT customer_id, ARRAY_LENGTH(order_ids) AS total_orders FROM customers;
ARRAY_CONCAT():
SELECT ARRAY_CONCAT([1, 2], [3, 4], [5]) AS combined; -- Result: [1, 2, 3, 4, 5]
ARRAY_TO_STRING():
SELECT ARRAY_TO_STRING(['apple', 'banana', 'cherry'], ', ') AS fruits; -- Result: 'apple, banana, cherry'
GENERATE_ARRAY():
SELECT GENERATE_ARRAY(1, 10) AS numbers;
-- Result: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

SELECT GENERATE_ARRAY(0, 100, 10) AS multiples_of_10;
-- Result: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ARRAY_REVERSE():
SELECT ARRAY_REVERSE([1, 2, 3, 4, 5]) AS reversed; -- Result: [5, 4, 3, 2, 1]
Performance: ARRAY vs JOIN
Traditional approach (2 tables, JOIN):
-- Table 1: customers (1M rows)
-- Table 2: orders (10M rows)
SELECT
  c.customer_id,
  c.name,
  o.order_id,
  o.amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id = '12345';

-- Scans: customers (1M) + orders (10M) = 11M rows
-- Join cost: High
Array approach (1 table with ARRAY):
-- Table: customers (1M rows with nested orders array)
SELECT
  customer_id,
  name,
  o.order_id,
  o.amount
FROM customers,
  UNNEST(orders) AS o
WHERE customer_id = '12345';

-- Scans: customers (1M) only
-- No join cost
-- Can be 50-80% faster for 1:many access patterns
When to use ARRAY:
- 1:many relationships (orders per customer)
- Moderate array size (< 1000 elements)
- Frequent filtering by parent entity
- Want to reduce JOINs
When NOT to use ARRAY:
- Many:many relationships
- Very large arrays (> 10,000 elements)
- Need to query array elements independently
- Array elements frequently updated
Complete Example: Denormalized Design
Traditional normalized:
-- 3 tables, 2 JOINs
SELECT
  c.customer_id,
  c.name,
  o.order_id,
  oi.product_id,
  oi.quantity
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id;
Denormalized with ARRAY/STRUCT:
-- 1 table, no JOINs
CREATE TABLE customers_denormalized AS
SELECT
  c.customer_id,
  c.name,
  ARRAY_AGG(
    STRUCT(
      o.order_id,
      o.order_date,
      o.status,
      ARRAY(
        SELECT AS STRUCT product_id, quantity, price
        FROM order_items
        WHERE order_id = o.order_id
      ) AS items
    )
    ORDER BY o.order_date DESC
  ) AS orders
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;

-- Query (no JOINs!)
SELECT
  customer_id,
  name,
  ord.order_id,
  item.product_id,
  item.quantity
FROM customers_denormalized,
  UNNEST(orders) AS ord,
  UNNEST(ord.items) AS item
WHERE customer_id = '12345';
Performance improvement: can be 3-5x faster for typical lookup queries.
BigQuery-Specific Features
EXCEPT and REPLACE in SELECT
EXCEPT - Exclude columns:
-- Select all except sensitive columns
SELECT * EXCEPT(ssn, password, credit_card)
FROM customers;

-- Combine with WHERE
SELECT * EXCEPT(internal_notes)
FROM orders
WHERE status = 'shipped';
REPLACE - Modify columns:
-- Replace column values
SELECT * REPLACE(UPPER(name) AS name, ROUND(price, 2) AS price)
FROM products;

-- Anonymize data
SELECT * REPLACE('***' AS ssn, '***' AS credit_card)
FROM customers;
Combine EXCEPT and REPLACE:
SELECT * EXCEPT(password) REPLACE(LOWER(email) AS email) FROM users;
SAFE Functions (NULL Instead of Errors)
SAFE_CAST() - Returns NULL on error:
-- Regular CAST throws an error on invalid input
SELECT CAST('invalid' AS INT64);  -- ERROR

-- SAFE_CAST returns NULL
SELECT SAFE_CAST('invalid' AS INT64) AS result;  -- NULL
SAFE_DIVIDE() - Returns NULL on division by zero:
SELECT
  revenue,
  orders,
  SAFE_DIVIDE(revenue, orders) AS avg_order_value  -- NULL if orders = 0
FROM daily_metrics;
Other SAFE functions:
-- SAFE_ADD
SELECT SAFE_ADD(9223372036854775807, 1);  -- NULL (INT64 overflow)

-- SAFE_SUBTRACT
SELECT SAFE_SUBTRACT(-9223372036854775807, 2);  -- NULL (INT64 overflow)

-- SAFE_MULTIPLY
SELECT SAFE_MULTIPLY(9223372036854775807, 2);  -- NULL (INT64 overflow)
Use case: Data quality checks:
SELECT
  COUNT(*) AS total_rows,
  COUNT(SAFE_CAST(amount AS FLOAT64)) AS valid_amounts,
  COUNT(*) - COUNT(SAFE_CAST(amount AS FLOAT64)) AS invalid_amounts
FROM transactions;
GROUP BY with Column Numbers
-- ✅ Valid: Group by column position
SELECT
  customer_id,
  DATE(order_date) AS order_date,
  SUM(amount) AS total
FROM orders
GROUP BY 1, 2;
-- Same as: GROUP BY customer_id, DATE(order_date)

-- ✅ Also valid: Mix names and numbers
SELECT
  customer_id,
  DATE(order_date) AS order_date,
  SUM(amount) AS total
FROM orders
GROUP BY customer_id, 2;
When useful:
- Long expressions in SELECT
- Complex CASE statements
- Simplifies GROUP BY clause
TABLESAMPLE - Random Sampling
System sampling (fast, approximate):
-- Sample ~10% of data (by storage blocks)
SELECT *
FROM large_table
TABLESAMPLE SYSTEM (10 PERCENT);
Use cases:
- Quick data exploration
- Testing queries on large tables
- Statistical sampling
Note: SYSTEM sampling is block-based, not truly random. For exact percentages, use ROW_NUMBER() with RAND().
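Sketches of the RAND()-based alternatives mentioned in the note: an exact row count via ROW_NUMBER(), and per-row (rather than per-block) percentage sampling. Be aware that ORDER BY RAND() sorts the whole table, so the exact-count version is itself not cheap:
-- Exactly 1000 random rows
SELECT *
FROM large_table
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (ORDER BY RAND()) <= 1000;

-- Roughly 10% of rows, sampled per row
SELECT *
FROM large_table
WHERE RAND() < 0.1;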
PIVOT and UNPIVOT
PIVOT - Rows to columns:
SELECT *
FROM (
  SELECT product, quarter, sales
  FROM quarterly_sales
)
PIVOT (
  SUM(sales) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);

-- Before:
-- product | quarter | sales
-- A       | Q1      | 100
-- A       | Q2      | 150

-- After:
-- product | Q1  | Q2  | Q3  | Q4
-- A       | 100 | 150 | ... | ...
UNPIVOT - Columns to rows:
SELECT *
FROM quarterly_totals
UNPIVOT (
  sales FOR quarter IN (Q1, Q2, Q3, Q4)
);

-- Before:
-- product | Q1  | Q2  | Q3  | Q4
-- A       | 100 | 150 | 200 | 250

-- After:
-- product | quarter | sales
-- A       | Q1      | 100
-- A       | Q2      | 150
Standard SQL vs Legacy SQL
BigQuery has two SQL dialects:
| Feature | Standard SQL | Legacy SQL |
|---|---|---|
| ANSI Compliance | ✅ Yes | ❌ No |
| Recommended | ✅ Yes | ❌ Deprecated |
| Table Reference | `project.dataset.table` | [project:dataset.table] |
| CTEs (WITH) | ✅ Yes | ❌ No |
| Window Functions | ✅ Full support | ⚠️ Limited |
| ARRAY/STRUCT | ✅ Native | ⚠️ Limited |
| Portability | ✅ High | ❌ BigQuery-only |
How to detect Legacy SQL:
-- Legacy SQL indicators:

-- 1. Square brackets for tables
SELECT * FROM [project:dataset.table]

-- 2. GROUP EACH BY
SELECT customer_id, COUNT(*)
FROM orders
GROUP EACH BY customer_id

-- 3. FLATTEN
SELECT * FROM FLATTEN([project:dataset.table], field)

-- 4. TABLE_DATE_RANGE
SELECT * FROM TABLE_DATE_RANGE([dataset.table_],
  TIMESTAMP('2024-01-01'), TIMESTAMP('2024-12-31'))
Standard SQL equivalent:
-- 1. Backticks
SELECT * FROM `project.dataset.table`

-- 2. Regular GROUP BY
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id

-- 3. UNNEST
SELECT * FROM `project.dataset.table`, UNNEST(field)

-- 4. Wildcard tables with _TABLE_SUFFIX (for date-sharded tables)
SELECT * FROM `project.dataset.table_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20241231'
Migration:
# Set default to Standard SQL
bq query --use_legacy_sql=false 'SELECT ...'

# Or in Python
job_config = bigquery.QueryJobConfig(use_legacy_sql=False)
Performance Checklist
Before running expensive queries:
- ☐ Use --dry_run to estimate cost
- ☐ Check if partition pruning is active
- ☐ Verify clustering on JOIN/WHERE columns
- ☐ Remove SELECT *
- ☐ Filter early with WHERE
- ☐ Review the query plan (console Execution details or INFORMATION_SCHEMA.JOBS)
- ☐ Consider materialized views for repeated queries (sketch after this list)
- ☐ Prototype on a sample first (TABLESAMPLE or a narrow date range; LIMIT alone doesn't cut cost)
- ☐ Avoid CTE re-execution (use temp tables if referenced 2+ times)
- ☐ Use NOT EXISTS instead of NOT IN
- ☐ Batch DML operations (avoid row-by-row updates)
- ☐ Consider ARRAY/STRUCT for 1:many relationships
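For the materialized-view item above, a minimal sketch (names illustrative); BigQuery maintains the view incrementally and can rewrite matching aggregate queries to use it:
CREATE MATERIALIZED VIEW `project.dataset.daily_revenue_mv` AS
SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(amount) AS revenue
FROM `project.dataset.orders`
GROUP BY order_date;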
Monitoring Query Performance
-- Check query statistics
SELECT
  job_id,
  user_email,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;
Quick Wins
Immediate improvements (typical ranges; actual savings depend on the workload):
- Add partition filter → 50-90% cost reduction
- Remove SELECT * → 30-70% cost reduction
- Use clustering → 20-50% performance improvement
- Break complex queries → 2-5x faster execution
- Enable query cache → Free repeated queries
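Putting several of these together, a before/after sketch on the running orders example (assuming order_date is the partition column):
-- Before: scans every column of every partition
SELECT * FROM `project.dataset.orders`;

-- After: partition pruning + column pruning + QUALIFY dedup
SELECT customer_id, amount, order_date
FROM `project.dataset.orders`
WHERE order_date >= '2024-01-01'  -- partition filter
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id ORDER BY order_date DESC
) = 1;  -- latest order per customer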