Claude-skill-registry bq-query-optimization
Use when writing BigQuery queries, optimizing query performance, analyzing execution plans, or avoiding common SQL gotchas. Covers parameterized queries, UDFs, scripting, window functions (QUALIFY, ROW_NUMBER, RANK, LEAD/LAG), JSON functions, ARRAY/STRUCT operations, BigQuery-specific features (EXCEPT, REPLACE, SAFE_*), CTE re-execution issues, NOT IN with NULLs, DML performance, Standard vs Legacy SQL, and performance best practices.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/bq-query-optimization" ~/.claude/skills/majiayu000-claude-skill-registry-bq-query-optimization && rm -rf "$T"
skills/data/bq-query-optimization/SKILL.md

BigQuery Query Optimization
Use this skill when writing, debugging, or optimizing BigQuery SQL queries for performance and efficiency.
Query Execution Analysis
Inspecting the Query Plan
BigQuery has no EXPLAIN statement. To estimate a query's cost before running it, use a dry run (bq query --dry_run, or dryRun in the API), which reports the bytes that would be processed. To analyze a query after it runs, use the "Execution details" tab in the console, the Jobs API (statistics.query.queryPlan), or the INFORMATION_SCHEMA.JOBS views.
What the plan shows:
- Stages of execution
- Bytes read per stage
- Slot time consumed
- Potential bottlenecks
Performance Best Practices
1. Avoid SELECT *
❌ Bad (scans all columns):
SELECT * FROM `project.dataset.large_table`
✅ Good (only needed columns):
SELECT customer_id, amount, date FROM `project.dataset.large_table`
Impact: Full table scan vs targeted column read. Can reduce data scanned by 90%+.
2. Filter Early and Often
❌ Bad (filter after aggregation):
SELECT customer_id, SUM(amount) AS total
FROM `project.dataset.orders`
GROUP BY customer_id
HAVING SUM(amount) > 1000
✅ Good (filter before aggregation):
SELECT customer_id, SUM(amount) AS total
FROM `project.dataset.orders`
WHERE amount > 100  -- Filter early
GROUP BY customer_id
HAVING SUM(amount) > 1000
Note: the added WHERE changes the semantics (orders of 100 or less no longer count toward the totals), so only filter early on conditions that don't change the answer you need.
3. Use Partitioned Tables
Without partition filter:
-- Scans the entire table (filter is not on the partition column)
SELECT *
FROM `project.dataset.orders`
WHERE customer_region = 'US'
With partition filter:
-- Only scans relevant partitions (order_date is the partition column)
SELECT *
FROM `project.dataset.orders`
WHERE order_date >= '2024-01-01'
Key: Filter on the partition column for automatic partition pruning.
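Pruning presupposes the table was created partitioned. A minimal sketch (the schema is illustrative); require_partition_filter makes BigQuery reject queries that forget the filter:
CREATE TABLE `project.dataset.orders` (
  order_id STRING,
  customer_id STRING,
  amount FLOAT64,
  order_date DATE
)
PARTITION BY order_date
OPTIONS (require_partition_filter = TRUE);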
4. Break Up Complex Queries
❌ Anti-pattern (one huge query):
SELECT ...
FROM (
  SELECT ...
  FROM (
    SELECT ...  -- Deeply nested
  )
)
WHERE ...
✅ Good (use CTEs):
WITH base_data AS (
  SELECT customer_id, amount, date
  FROM `project.dataset.orders`
  WHERE date >= '2024-01-01'
),
aggregated AS (
  SELECT customer_id, SUM(amount) AS total
  FROM base_data
  GROUP BY customer_id
)
SELECT * FROM aggregated WHERE total > 1000
✅ Better (multi-statement with temp tables):
CREATE TEMP TABLE base_data AS
SELECT customer_id, amount, date
FROM `project.dataset.orders`
WHERE date >= '2024-01-01';

CREATE TEMP TABLE aggregated AS
SELECT customer_id, SUM(amount) AS total
FROM base_data
GROUP BY customer_id;

SELECT * FROM aggregated WHERE total > 1000;
5. JOIN Optimization
Put the largest table first (BigQuery's optimizer usually reorders joins itself, but this convention keeps intent clear):
-- ✅ Large table first
SELECT l.*, s.detail
FROM `project.dataset.large_table` l
JOIN `project.dataset.small_table` s
  ON l.id = s.id
Use clustering on JOIN columns:
- Cluster tables on frequently joined columns
- BigQuery can prune data blocks more effectively
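For example, rebuilding a lookup table clustered on its join key (a sketch; table names are illustrative):
CREATE TABLE `project.dataset.small_table_clustered`
CLUSTER BY id AS
SELECT * FROM `project.dataset.small_table`;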
Consider ARRAY/STRUCT for 1:many:
-- Instead of JOIN for 1:many relationships
SELECT
  order_id,
  ARRAY_AGG(STRUCT(product_id, quantity, price)) AS items
FROM `project.dataset.order_items`
GROUP BY order_id
6. Leverage Automatic Features
BigQuery automatically performs:
- Query rewrites - Optimizes query structure
- Partition pruning - With proper filters
- Dynamic filtering - Reduces data scanned
Ensure your queries enable these:
- Filter on partition columns
- Use simple, clear predicates
- Avoid functions on partition columns in WHERE clause
Parameterized Queries
CLI Syntax
bq query \
  --use_legacy_sql=false \
  --parameter=start_date:DATE:2024-01-01 \
  --parameter=end_date:DATE:2024-12-31 \
  --parameter=min_amount:FLOAT64:100.0 \
  'SELECT *
   FROM `project.dataset.orders`
   WHERE order_date BETWEEN @start_date AND @end_date
     AND amount >= @min_amount'
Python Syntax
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT customer_id, SUM(amount) AS total
    FROM `project.dataset.orders`
    WHERE order_date >= @start_date
    GROUP BY customer_id
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start_date", "DATE", "2024-01-01")
    ]
)
results = client.query_and_wait(query, job_config=job_config)
Key points:
- Use the @ prefix for named parameters
- Syntax: name:TYPE:value, or name::value (type defaults to STRING)
- Parameters cannot be used as column or table names
- Only works with standard SQL
User-Defined Functions (UDFs)
SQL UDF (Simple)
CREATE TEMP FUNCTION CleanEmail(email STRING)
RETURNS STRING AS (
  LOWER(TRIM(email))
);

SELECT CleanEmail(customer_email) AS email
FROM `project.dataset.customers`;
JavaScript UDF (Complex Logic)
CREATE TEMP FUNCTION ParseUserAgent(ua STRING)
RETURNS STRUCT<browser STRING, version STRING>
LANGUAGE js AS r"""
  var match = ua.match(/(Chrome|Firefox|Safari)\/(\d+)/);
  return {
    browser: match ? match[1] : 'Unknown',
    version: match ? match[2] : '0'
  };
""";

SELECT ParseUserAgent(user_agent).browser AS browser
FROM `project.dataset.sessions`;
Limitations:
- JavaScript has no native 64-bit integers, so INT64 is unsupported in JavaScript UDFs (use FLOAT64 or STRING instead)
Persistent UDFs
-- Create once, use many times
CREATE FUNCTION `project.dataset.clean_email`(email STRING)
RETURNS STRING AS (LOWER(TRIM(email)));

-- Use anywhere
SELECT `project.dataset.clean_email`(email) FROM ...
Scripting & Procedural Language
Variables
DECLARE total_orders INT64;
SET total_orders = (SELECT COUNT(*) FROM `project.dataset.orders`);
SELECT total_orders;
Loops
LOOP:
DECLARE x INT64 DEFAULT 0;
LOOP
  SET x = x + 1;
  IF x >= 10 THEN
    LEAVE;
  END IF;
END LOOP;
WHILE:
DECLARE x INT64 DEFAULT 0;
WHILE x < 10 DO
  SET x = x + 1;
END WHILE;
FOR with arrays:
DECLARE ids ARRAY<STRING>;
SET ids = ['id1', 'id2', 'id3'];

FOR item IN (SELECT * FROM UNNEST(ids) AS id) DO
  -- Process each id (fields are accessed through the loop variable)
  SELECT item.id;
END FOR;
Query Caching
Automatic caching (24 hours):
- Identical queries are served from cache at no additional cost
- Cached results return almost instantly
- Only deterministic queries are cached (queries using CURRENT_TIMESTAMP(), RAND(), etc. are not)
To bypass cache:
bq query --use_cache=false 'SELECT...'
Common Anti-Patterns
❌ Using LIMIT to reduce cost
-- LIMIT doesn't reduce data scanned or cost!
SELECT * FROM `project.dataset.huge_table` LIMIT 10
Impact: Still scans entire table. Use WHERE filters instead.
❌ Functions on partition columns
-- Prevents partition pruning
WHERE CAST(date_column AS STRING) = '2024-01-01'
✅ Better:
WHERE date_column = DATE('2024-01-01')
❌ Cross joins without filters
-- Cartesian product = huge result
SELECT * FROM table1 CROSS JOIN table2
Impact: Can generate millions/billions of rows.
❌ Correlated subqueries
-- Runs the subquery for each row
SELECT *
FROM orders o
WHERE amount > (
  SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id
)
✅ Better (use window functions):
SELECT *
FROM (
  SELECT *, AVG(amount) OVER (PARTITION BY customer_id) AS avg_amount
  FROM orders
)
WHERE amount > avg_amount
Common SQL Gotchas
CTE Re-execution (CRITICAL COST ISSUE)
Problem: When a CTE is referenced multiple times, BigQuery typically inlines it and re-executes it once per reference, billing you for each execution.
❌ Bad (CTE runs 3 times - billed 3x):
WITH expensive_cte AS (
  SELECT * FROM `project.dataset.huge_table`
  WHERE complex_conditions AND lots_of_joins
)
SELECT CAST(COUNT(*) AS STRING) FROM expensive_cte    -- CAST so the UNION ALL branches share one type
UNION ALL
SELECT CAST(SUM(amount) AS STRING) FROM expensive_cte
UNION ALL
SELECT CAST(MAX(date) AS STRING) FROM expensive_cte;
Impact: If the CTE scans 10 TB, you're billed for 30 TB (10 TB × 3).
✅ Good (use temp table - billed 1x):
CREATE TEMP TABLE expensive_data AS
SELECT * FROM `project.dataset.huge_table`
WHERE complex_conditions AND lots_of_joins;

SELECT CAST(COUNT(*) AS STRING) FROM expensive_data
UNION ALL
SELECT CAST(SUM(amount) AS STRING) FROM expensive_data
UNION ALL
SELECT CAST(MAX(date) AS STRING) FROM expensive_data;
When CTEs are OK:
- Referenced only once
- Very small result set
- Part of larger query (BigQuery may optimize)
When to use temp tables:
- CTE referenced 2+ times
- Large data volumes
- Complex/expensive CTE query
NOT IN with NULL Values (SILENT FAILURE)
Problem: NOT IN returns NOTHING (an empty result) if ANY NULL exists in the subquery.
❌ Broken (returns empty if blocked_customers has any NULL):
SELECT *
FROM customers
WHERE customer_id NOT IN (
  SELECT customer_id FROM blocked_customers  -- If ANY NULL, returns 0 rows!
);
Why it fails:
- SQL uses three-valued logic: TRUE, FALSE, UNKNOWN
- NULL IN (...) evaluates to UNKNOWN
- NOT UNKNOWN is still UNKNOWN
- Rows with UNKNOWN are filtered out
✅ Solution 1: Use NOT EXISTS (safest):
SELECT *
FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM blocked_customers b
  WHERE b.customer_id = c.customer_id
);
✅ Solution 2: Filter NULLs explicitly:
SELECT *
FROM customers
WHERE customer_id NOT IN (
  SELECT customer_id
  FROM blocked_customers
  WHERE customer_id IS NOT NULL  -- Explicit NULL filter
);
✅ Solution 3: Use LEFT JOIN:
SELECT c.*
FROM customers c
LEFT JOIN blocked_customers b
  ON c.customer_id = b.customer_id
WHERE b.customer_id IS NULL;
Best practice: Prefer NOT EXISTS - it's clearer, safer, and often faster.
DML Statement Performance
Problem: BigQuery is optimized for analytics (OLAP), not transactional updates (OLTP). DML statements are slow and expensive.
Why DML is slow in BigQuery:
- Columnar storage (not row-based)
- Designed for bulk reads, not individual updates
- No indexes for fast row lookups
- Every update rewrites affected partitions
❌ Very slow (row-by-row updates):
-- Don't do this - takes minutes/hours
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_id = '12345';

-- This is even worse - runs once per row
FOR record IN (SELECT order_id FROM orders_to_update) DO
  UPDATE orders SET status = 'processed'
  WHERE order_id = record.order_id;
END FOR;
⚠️ Better (batch updates):
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_id IN (SELECT order_id FROM orders_to_update);
✅ Best (recreate table - fastest):
CREATE OR REPLACE TABLE `project.dataset.orders` AS
SELECT
  * EXCEPT(status),
  CASE
    WHEN order_id IN (SELECT order_id FROM orders_to_update) THEN 'processed'
    ELSE status
  END AS status
FROM `project.dataset.orders`;
For INSERT/UPSERT - use MERGE:
MERGE `project.dataset.customers` AS target
USING `project.dataset.customer_updates` AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET name = source.name, updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, created_at)
  VALUES (source.customer_id, source.name, CURRENT_TIMESTAMP());
Best practices:
- Batch updates instead of row-by-row
- Use MERGE for upserts
- Consider recreating table for large updates
- Partition tables to limit update scope (sketch after this list)
- Avoid frequent small DML operations
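For the partition-scope item, a sketch (assuming orders is partitioned on order_date, as in the earlier DDL example); the partition filter confines the rewrite to the touched partitions:
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_date = '2024-01-15'  -- partition filter limits the rewrite
  AND status = 'pending';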
Data Type Gotchas
INT64 is the only integer type:
-- All of these are the same type: INT64
CREATE TABLE example (
  col1 INT64,     -- ✅ Explicit
  col2 INTEGER,   -- Alias for INT64
  col3 INT,       -- Alias for INT64
  col4 SMALLINT,  -- Alias for INT64
  col5 BIGINT     -- Alias for INT64
);
No UUID type - use STRING:
-- PostgreSQL
CREATE TABLE users (id UUID);

-- BigQuery
CREATE TABLE users (id STRING);  -- Store UUID as string
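To generate IDs on the BigQuery side, GENERATE_UUID() returns a random version-4 UUID as a STRING:
SELECT GENERATE_UUID() AS id;
-- e.g. '4192bff0-e1e0-43ce-a4db-912808c32493' (different on every call)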
NUMERIC precision limits:
-- NUMERIC: 38 digits of precision, 9 decimal places
NUMERIC(38, 9)

-- BIGNUMERIC: 76 digits of precision, 38 decimal places
BIGNUMERIC(76, 38)

-- Example
SELECT
  CAST('12345678901234567890.123456789' AS NUMERIC) AS num,
  CAST('12345678901234567890.123456789' AS BIGNUMERIC) AS bignum;
TIMESTAMP vs DATETIME vs DATE:
-- TIMESTAMP: UTC, timezone-aware
SELECT CURRENT_TIMESTAMP();  -- 2024-01-15 10:30:45.123456 UTC

-- DATETIME: No timezone
SELECT CURRENT_DATETIME();   -- 2024-01-15 10:30:45.123456

-- DATE: Date only
SELECT CURRENT_DATE();       -- 2024-01-15

-- Conversion
SELECT
  TIMESTAMP('2024-01-15 10:30:45'),           -- Assumes UTC
  DATETIME(TIMESTAMP '2024-01-15 10:30:45'),  -- Loses timezone
  DATE(TIMESTAMP '2024-01-15 10:30:45');      -- Loses time
Type coercion in JOINs:
-- ❌ Implicit cast can prevent optimization
SELECT *
FROM table1 t1
JOIN table2 t2
  ON t1.id = CAST(t2.id AS STRING);  -- Prevents clustering optimization

-- ✅ Match types explicitly
SELECT *
FROM table1 t1
JOIN table2 t2
  ON t1.id = t2.id;  -- Both STRING
Window Functions
Window functions perform calculations across a set of rows related to the current row, without collapsing the result set.
Basic Syntax
<function> OVER (
  [PARTITION BY partition_expression]
  [ORDER BY sort_expression]
  [window_frame_clause]
)
Ranking Functions
ROW_NUMBER() - Sequential numbering:
SELECT
  customer_id,
  order_date,
  amount,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
FROM orders;
Common use: Deduplication
SELECT * EXCEPT(row_num)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS row_num
  FROM customers
)
WHERE row_num = 1;
RANK() - Rank with gaps:
SELECT product_name, revenue,
  RANK() OVER (ORDER BY revenue DESC) AS rank
FROM products;

-- Results:
-- Product A: $1000, rank 1
-- Product B: $900, rank 2
-- Product C: $900, rank 2 (tie)
-- Product D: $800, rank 4 (gap after tie)
DENSE_RANK() - Rank without gaps:
SELECT product_name, revenue,
  DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rank
FROM products;

-- Results:
-- Product A: $1000, rank 1
-- Product B: $900, rank 2
-- Product C: $900, rank 2 (tie)
-- Product D: $800, rank 3 (no gap)
NTILE() - Divide into N buckets:
SELECT customer_id, total_spend,
  NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile
FROM customer_totals;
Analytical Functions
LEAD() and LAG() - Access rows before/after:
-- Time series analysis
SELECT
  date,
  revenue,
  LAG(revenue) OVER (ORDER BY date) AS prev_day_revenue,
  LEAD(revenue) OVER (ORDER BY date) AS next_day_revenue,
  revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change,
  ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY date))
        / LAG(revenue) OVER (ORDER BY date), 2) AS pct_change
FROM daily_sales
ORDER BY date;
With PARTITION BY:
-- Per-customer analysis
SELECT
  customer_id,
  order_date,
  amount,
  LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_amount,
  amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS amount_diff
FROM orders;
FIRST_VALUE() and LAST_VALUE():
SELECT
  date,
  revenue,
  FIRST_VALUE(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS first_day_revenue,
  LAST_VALUE(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last_day_revenue
FROM daily_sales;
NTH_VALUE() - Get Nth value:
SELECT
  product_id,
  date,
  sales,
  NTH_VALUE(sales, 2) OVER (PARTITION BY product_id ORDER BY date) AS second_day_sales
FROM product_sales;
Aggregate Window Functions
SUM/AVG/COUNT as window functions:
SELECT
  date,
  revenue,
  SUM(revenue) OVER (ORDER BY date) AS cumulative_revenue,
  AVG(revenue) OVER (
    ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7day,
  COUNT(*) OVER (
    ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
  ) AS rolling_30day_count
FROM daily_sales;
Running totals:
SELECT
  customer_id,
  order_date,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_lifetime_value
FROM orders;
QUALIFY Clause
QUALIFY filters on window function results - no subquery needed! (It's a GoogleSQL extension rather than ANSI SQL; a few other warehouses such as Snowflake and Teradata support it too.)
❌ Standard SQL (verbose):
SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
  FROM orders
)
WHERE row_num = 1;
✅ BigQuery with QUALIFY (clean):
SELECT customer_id, order_date, amount
FROM orders
WHERE TRUE  -- see note below
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;
Note: BigQuery has historically required QUALIFY to be accompanied by a WHERE, GROUP BY, or HAVING clause; a no-op WHERE TRUE satisfies this and is harmless either way.
More QUALIFY examples:
-- Get top 3 products per category
SELECT category, product_name, revenue
FROM products
WHERE TRUE
QUALIFY RANK() OVER (PARTITION BY category ORDER BY revenue DESC) <= 3;

-- Filter outliers (keep middle 80%)
SELECT *
FROM measurements
WHERE TRUE
QUALIFY NTILE(10) OVER (ORDER BY value) BETWEEN 2 AND 9;

-- Get first order per customer
SELECT customer_id, order_date, amount
FROM orders
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) = 1;
Window Frame Clauses
Control which rows are included in the window:
-- ROWS: Physical row count
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW  -- Last 4 rows including current

-- RANGE: Logical range based on the ORDER BY value.
-- BigQuery's RANGE frames require a numeric ORDER BY key
-- (INTERVAL bounds are not supported), so convert dates with UNIX_DATE.
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW  -- Last 7 days when ordering by UNIX_DATE(date)

-- Examples:
SELECT
  date,
  sales,
  -- Last 7 rows
  AVG(sales) OVER (
    ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_7rows,
  -- Last 7 days (logical)
  AVG(sales) OVER (
    ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_7days,
  -- All preceding rows
  SUM(sales) OVER (
    ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_sum,
  -- Centered window (3 before, 3 after)
  AVG(sales) OVER (
    ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
  ) AS centered_avg
FROM daily_sales;
Window Function Performance Tips
1. Partition appropriately:
-- ✅ Good: Partition filter reduces data scanned
SELECT *
FROM events
WHERE date >= '2024-01-01'  -- Partition filter
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) = 1;
2. Avoid window functions in WHERE:
-- ❌ Wrong: Can't use window functions in WHERE
SELECT *
FROM orders
WHERE ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;  -- ERROR

-- ✅ Use QUALIFY instead
SELECT *
FROM orders
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;
3. Reuse window definitions:
SELECT
  date,
  revenue,
  ROW_NUMBER() OVER w AS row_num,
  RANK() OVER w AS rank,
  AVG(revenue) OVER w AS avg_revenue
FROM sales
WINDOW w AS (PARTITION BY category ORDER BY revenue DESC);
JSON Functions
BigQuery provides rich functions for parsing, extracting, and manipulating JSON data.
Extracting JSON Values
JSON_VALUE() - Extract scalar values (Standard SQL):
SELECT
  JSON_VALUE(json_column, '$.user.name') AS user_name,
  JSON_VALUE(json_column, '$.user.email') AS email,
  CAST(JSON_VALUE(json_column, '$.amount') AS FLOAT64) AS amount,
  CAST(JSON_VALUE(json_column, '$.quantity') AS INT64) AS quantity
FROM events;
JSON_QUERY() - Extract objects or arrays:
SELECT
  JSON_QUERY(json_column, '$.user') AS user_object,
  JSON_QUERY(json_column, '$.items') AS items_array
FROM events;
JSON_EXTRACT() - Legacy, still widely used:
SELECT
  JSON_EXTRACT(json_column, '$.user.name') AS user_name,
  JSON_EXTRACT_SCALAR(json_column, '$.user.email') AS email  -- Returns STRING
FROM events;
JSONPath syntax (BigQuery supports a limited subset):
-- Dot notation
'$.user.name'

-- Array index
'$.items[0].product_id'
Array slices ('$.items[0:3]'), wildcards ('$.users[*].name'), and recursive descent ('$..name') are not supported by BigQuery's JSON functions; restructure with JSON_EXTRACT_ARRAY and UNNEST instead.
Working with JSON Arrays
JSON_EXTRACT_ARRAY() - Extract array elements:
SELECT
  event_id,
  tag
FROM events,
  UNNEST(JSON_EXTRACT_ARRAY(tags_json, '$')) AS tag;
JSON_VALUE_ARRAY() - Extract array of scalars:
SELECT
  product_id,
  tag
FROM products,
  UNNEST(JSON_VALUE_ARRAY(tags_json, '$')) AS tag;
Complete example:
-- JSON: {"product_id": "A123", "tags": ["electronics", "sale", "featured"]} SELECT JSON_VALUE(product_json, '$.product_id') AS product_id, tag FROM products, UNNEST(JSON_VALUE_ARRAY(product_json, '$.tags')) AS tag; -- Results: -- A123, electronics -- A123, sale -- A123, featured
Creating JSON
TO_JSON_STRING() - Convert to JSON:
SELECT
  customer_id,
  TO_JSON_STRING(STRUCT(
    name,
    email,
    created_at
  )) AS customer_json
FROM customers;
Create JSON objects:
SELECT TO_JSON_STRING(STRUCT(
  'John' AS name,
  30 AS age,
  ['reading', 'hiking'] AS hobbies,
  STRUCT('123 Main St' AS street, 'Boston' AS city) AS address
)) AS person_json;

-- Result:
-- {"name":"John","age":30,"hobbies":["reading","hiking"],"address":{"street":"123 Main St","city":"Boston"}}
Aggregate into JSON:
SELECT
  customer_id,
  TO_JSON_STRING(
    ARRAY_AGG(
      STRUCT(order_id, amount, date)
      ORDER BY date DESC
      LIMIT 5
    )
  ) AS recent_orders_json
FROM orders
GROUP BY customer_id;
Parsing JSON Strings
PARSE_JSON() - Convert string to JSON:
SELECT JSON_VALUE(PARSE_JSON('{"name":"Alice","age":25}'), '$.name') AS name;
Safe JSON parsing (avoid errors):
SELECT
  SAFE.JSON_VALUE(invalid_json, '$.name') AS name  -- Returns NULL on error
FROM events;
Complex JSON Examples
Nested JSON extraction:
-- JSON structure:
-- {
--   "order": {
--     "id": "ORD123",
--     "items": [
--       {"product": "A", "qty": 2, "price": 10.50},
--       {"product": "B", "qty": 1, "price": 25.00}
--     ]
--   }
-- }
SELECT
  JSON_VALUE(data, '$.order.id') AS order_id,
  JSON_VALUE(item, '$.product') AS product,
  CAST(JSON_VALUE(item, '$.qty') AS INT64) AS quantity,
  CAST(JSON_VALUE(item, '$.price') AS FLOAT64) AS price
FROM events,
  UNNEST(JSON_EXTRACT_ARRAY(data, '$.order.items')) AS item;
Transform relational data to JSON:
SELECT
  category,
  TO_JSON_STRING(
    STRUCT(
      category AS category_name,
      COUNT(*) AS product_count,
      ROUND(AVG(price), 2) AS avg_price,
      ARRAY_AGG(
        STRUCT(product_name, price)
        ORDER BY price DESC
        LIMIT 3
      ) AS top_products
    )
  ) AS category_summary
FROM products
GROUP BY category;
JSON Performance Tips
1. Extract once, reuse:
-- ❌ Bad: Multiple extractions from the full document
SELECT
  JSON_VALUE(data, '$.user.id'),
  JSON_VALUE(data, '$.user.name'),
  JSON_VALUE(data, '$.user.email')
FROM events;

-- ✅ Better: Extract the object once
WITH extracted AS (
  SELECT JSON_QUERY(data, '$.user') AS user_json
  FROM events
)
SELECT
  JSON_VALUE(user_json, '$.id'),
  JSON_VALUE(user_json, '$.name'),
  JSON_VALUE(user_json, '$.email')
FROM extracted;
2. Consider STRUCT columns instead of JSON:
-- If schema is known and stable, use STRUCT
CREATE TABLE events (
  user STRUCT<id STRING, name STRING, email STRING>,
  timestamp TIMESTAMP
);

-- Query with dot notation (faster than JSON extraction)
SELECT user.id, user.name, user.email
FROM events;
3. Materialize frequently accessed JSON fields:
-- Add a real column and backfill it (BigQuery has no computed/generated columns)
ALTER TABLE events ADD COLUMN user_id STRING;

UPDATE events
SET user_id = JSON_VALUE(data, '$.user.id')
WHERE TRUE;

-- Now queries can filter efficiently
SELECT * FROM events WHERE user_id = 'U123';
ARRAY and STRUCT
BigQuery's native support for nested and repeated data allows for powerful denormalization and performance optimization.
ARRAY Basics
Creating arrays:
SELECT
  [1, 2, 3, 4, 5] AS numbers,
  ['apple', 'banana', 'cherry'] AS fruits,
  [DATE '2024-01-01', DATE '2024-01-02'] AS dates;
ARRAY_AGG() - Aggregate into array:
SELECT
  customer_id,
  ARRAY_AGG(order_id ORDER BY order_date DESC) AS order_ids,
  ARRAY_AGG(amount) AS order_amounts
FROM orders
GROUP BY customer_id;
With LIMIT:
SELECT
  customer_id,
  ARRAY_AGG(order_id ORDER BY order_date DESC LIMIT 5) AS recent_order_ids
FROM orders
GROUP BY customer_id;
UNNEST - Flattening Arrays
Basic UNNEST:
SELECT element
FROM UNNEST(['a', 'b', 'c']) AS element;

-- Results:
-- a
-- b
-- c
UNNEST with table:
-- Table: customers
-- customer_id | order_ids
-- 1           | [101, 102, 103]
-- 2           | [201, 202]

SELECT customer_id, order_id
FROM customers,
  UNNEST(order_ids) AS order_id;

-- Results:
-- 1, 101
-- 1, 102
-- 1, 103
-- 2, 201
-- 2, 202
UNNEST with OFFSET (get array index):
SELECT item, idx
FROM UNNEST(['first', 'second', 'third']) AS item WITH OFFSET AS idx;

-- Results:
-- first, 0
-- second, 1
-- third, 2
STRUCT - Nested Records
Creating structs:
SELECT
  STRUCT('John' AS name, 30 AS age, 'Engineer' AS role) AS person,
  STRUCT('123 Main St' AS street, 'Boston' AS city, '02101' AS zip) AS address;
Querying struct fields:
SELECT
  person.name,
  person.age,
  address.city
FROM (
  SELECT
    STRUCT('John' AS name, 30 AS age) AS person,
    STRUCT('Boston' AS city) AS address
);
ARRAY of STRUCT (Most Powerful Pattern)
Create:
SELECT
  customer_id,
  ARRAY_AGG(
    STRUCT(
      order_id,
      amount,
      order_date,
      status
    )
    ORDER BY order_date DESC
  ) AS orders
FROM orders
GROUP BY customer_id;
Query:
-- Flatten array of struct ("order" is a reserved word, so use a different alias)
SELECT
  customer_id,
  o.order_id,
  o.amount,
  o.order_date
FROM customers,
  UNNEST(orders) AS o
WHERE o.status = 'completed';
Filter array elements:
SELECT
  customer_id,
  ARRAY(
    SELECT AS STRUCT order_id, amount
    FROM UNNEST(orders) AS o
    WHERE o.status = 'completed'
    ORDER BY amount DESC
    LIMIT 3
  ) AS top_completed_orders
FROM customers;
ARRAY Functions
ARRAY_LENGTH():
SELECT customer_id, ARRAY_LENGTH(order_ids) AS total_orders FROM customers;
ARRAY_CONCAT():
SELECT ARRAY_CONCAT([1, 2], [3, 4], [5]) AS combined; -- Result: [1, 2, 3, 4, 5]
ARRAY_TO_STRING():
SELECT ARRAY_TO_STRING(['apple', 'banana', 'cherry'], ', ') AS fruits; -- Result: 'apple, banana, cherry'
GENERATE_ARRAY():
SELECT GENERATE_ARRAY(1, 10) AS numbers;
-- Result: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

SELECT GENERATE_ARRAY(0, 100, 10) AS multiples_of_10;
-- Result: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ARRAY_REVERSE():
SELECT ARRAY_REVERSE([1, 2, 3, 4, 5]) AS reversed; -- Result: [5, 4, 3, 2, 1]
Performance: ARRAY vs JOIN
Traditional approach (2 tables, JOIN):
-- Table 1: customers (1M rows)
-- Table 2: orders (10M rows)
SELECT
  c.customer_id,
  c.name,
  o.order_id,
  o.amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id = '12345';

-- Scans: customers (1M) + orders (10M) = 11M rows
-- Join cost: High
Array approach (1 table with ARRAY):
-- Table: customers (1M rows with nested orders array)
SELECT
  customer_id,
  name,
  o.order_id,
  o.amount
FROM customers,
  UNNEST(orders) AS o
WHERE customer_id = '12345';

-- Scans: customers (1M) only
-- No join cost
-- Can be 50-80% faster for 1:many access patterns
When to use ARRAY:
- 1:many relationships (orders per customer)
- Moderate array size (< 1000 elements)
- Frequent filtering by parent entity
- Want to reduce JOINs
When NOT to use ARRAY:
- Many:many relationships
- Very large arrays (> 10,000 elements)
- Need to query array elements independently
- Array elements frequently updated
Complete Example: Denormalized Design
Traditional normalized:
-- 3 tables, 2 JOINs
SELECT
  c.customer_id,
  c.name,
  o.order_id,
  oi.product_id,
  oi.quantity
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id;
Denormalized with ARRAY/STRUCT:
-- 1 table, no JOINs
CREATE TABLE customers_denormalized AS
SELECT
  c.customer_id,
  c.name,
  ARRAY_AGG(
    STRUCT(
      o.order_id,
      o.order_date,
      o.status,
      ARRAY(
        SELECT AS STRUCT product_id, quantity, price
        FROM order_items
        WHERE order_id = o.order_id
      ) AS items
    )
    ORDER BY o.order_date DESC
  ) AS orders
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;

-- Query (no JOINs!)
SELECT
  customer_id,
  name,
  ord.order_id,
  item.product_id,
  item.quantity
FROM customers_denormalized,
  UNNEST(orders) AS ord,
  UNNEST(ord.items) AS item
WHERE customer_id = '12345';
Performance improvement: can be 3-5x faster for typical lookup queries.
BigQuery-Specific Features
EXCEPT and REPLACE in SELECT
EXCEPT - Exclude columns:
-- Select all except sensitive columns
SELECT * EXCEPT(ssn, password, credit_card)
FROM customers;

-- Combine with WHERE
SELECT * EXCEPT(internal_notes)
FROM orders
WHERE status = 'shipped';
REPLACE - Modify columns:
-- Replace column values
SELECT * REPLACE(UPPER(name) AS name, ROUND(price, 2) AS price)
FROM products;

-- Anonymize data
SELECT * REPLACE('***' AS ssn, '***' AS credit_card)
FROM customers;
Combine EXCEPT and REPLACE:
SELECT * EXCEPT(password) REPLACE(LOWER(email) AS email) FROM users;
SAFE Functions (NULL Instead of Errors)
SAFE_CAST() - Returns NULL on error:
-- Regular CAST throws an error on invalid input
SELECT CAST('invalid' AS INT64);  -- ERROR

-- SAFE_CAST returns NULL
SELECT SAFE_CAST('invalid' AS INT64) AS result;  -- NULL
SAFE_DIVIDE() - Returns NULL on division by zero:
SELECT
  revenue,
  orders,
  SAFE_DIVIDE(revenue, orders) AS avg_order_value  -- NULL if orders = 0
FROM daily_metrics;
Other SAFE functions:
-- SAFE_ADD
SELECT SAFE_ADD(9223372036854775807, 1);  -- NULL (INT64 overflow)

-- SAFE_SUBTRACT
SELECT SAFE_SUBTRACT(-9223372036854775807, 2);  -- NULL (INT64 overflow)

-- SAFE_MULTIPLY
SELECT SAFE_MULTIPLY(9223372036854775807, 2);  -- NULL (INT64 overflow)
Use case: Data quality checks:
SELECT
  COUNT(*) AS total_rows,
  COUNT(SAFE_CAST(amount AS FLOAT64)) AS valid_amounts,
  COUNT(*) - COUNT(SAFE_CAST(amount AS FLOAT64)) AS invalid_amounts
FROM transactions;
GROUP BY with Column Numbers
-- ✅ Valid: Group by column position
SELECT
  customer_id,
  DATE(order_date) AS order_date,
  SUM(amount) AS total
FROM orders
GROUP BY 1, 2;
-- Same as: GROUP BY customer_id, DATE(order_date)

-- ✅ Also valid: Mix names and numbers
SELECT
  customer_id,
  DATE(order_date) AS order_date,
  SUM(amount) AS total
FROM orders
GROUP BY customer_id, 2;
When useful:
- Long expressions in SELECT
- Complex CASE statements
- Simplifies GROUP BY clause
TABLESAMPLE - Random Sampling
System sampling (fast, approximate):
-- Sample ~10% of data (by storage blocks)
SELECT *
FROM large_table
TABLESAMPLE SYSTEM (10 PERCENT);
Use cases:
- Quick data exploration
- Testing queries on large tables
- Statistical sampling
Note: SYSTEM sampling is block-based, not truly random. For exact percentages, use ROW_NUMBER() with RAND().
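Sketches of the RAND()-based alternatives mentioned in the note: an exact row count via ROW_NUMBER(), and per-row (rather than per-block) percentage sampling. Be aware that ORDER BY RAND() sorts the whole table, so the exact-count version is itself not cheap:
-- Exactly 1000 random rows
SELECT *
FROM large_table
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (ORDER BY RAND()) <= 1000;

-- Roughly 10% of rows, sampled per row
SELECT *
FROM large_table
WHERE RAND() < 0.1;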
PIVOT and UNPIVOT
PIVOT - Rows to columns:
SELECT *
FROM (
  SELECT product, quarter, sales
  FROM quarterly_sales
)
PIVOT (
  SUM(sales) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);

-- Before:
-- product | quarter | sales
-- A       | Q1      | 100
-- A       | Q2      | 150

-- After:
-- product | Q1  | Q2  | Q3  | Q4
-- A       | 100 | 150 | ... | ...
UNPIVOT - Columns to rows:
SELECT *
FROM quarterly_totals
UNPIVOT (
  sales FOR quarter IN (Q1, Q2, Q3, Q4)
);

-- Before:
-- product | Q1  | Q2  | Q3  | Q4
-- A       | 100 | 150 | 200 | 250

-- After:
-- product | quarter | sales
-- A       | Q1      | 100
-- A       | Q2      | 150
Standard SQL vs Legacy SQL
BigQuery has two SQL dialects:
| Feature | Standard SQL | Legacy SQL |
|---|---|---|
| ANSI Compliance | ✅ Yes | ❌ No |
| Recommended | ✅ Yes | ❌ Deprecated |
| Table Reference | `project.dataset.table` | [project:dataset.table] |
| CTEs (WITH) | ✅ Yes | ❌ No |
| Window Functions | ✅ Full support | ⚠️ Limited |
| ARRAY/STRUCT | ✅ Native | ⚠️ Limited |
| Portability | ✅ High | ❌ BigQuery-only |
How to detect Legacy SQL:
-- Legacy SQL indicators:

-- 1. Square brackets for tables
SELECT * FROM [project:dataset.table]

-- 2. GROUP EACH BY
SELECT customer_id, COUNT(*)
FROM orders
GROUP EACH BY customer_id

-- 3. FLATTEN
SELECT * FROM FLATTEN([project:dataset.table], field)

-- 4. TABLE_DATE_RANGE
SELECT * FROM TABLE_DATE_RANGE([dataset.table_],
  TIMESTAMP('2024-01-01'), TIMESTAMP('2024-12-31'))
Standard SQL equivalent:
-- 1. Backticks
SELECT * FROM `project.dataset.table`

-- 2. Regular GROUP BY
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id

-- 3. UNNEST
SELECT * FROM `project.dataset.table`, UNNEST(field)

-- 4. Wildcard tables with _TABLE_SUFFIX (for date-sharded tables)
SELECT * FROM `project.dataset.table_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20241231'
Migration:
# Set default to Standard SQL
bq query --use_legacy_sql=false 'SELECT ...'

# Or in Python
job_config = bigquery.QueryJobConfig(use_legacy_sql=False)
Performance Checklist
Before running expensive queries:
- ☐ Use --dry_run to estimate cost
- ☐ Check if partition pruning is active
- ☐ Verify clustering on JOIN/WHERE columns
- ☐ Remove SELECT *
- ☐ Filter early with WHERE
- ☐ Review the query plan (console Execution details or INFORMATION_SCHEMA.JOBS)
- ☐ Consider materialized views for repeated queries (sketch after this list)
- ☐ Prototype on a sample first (TABLESAMPLE or a narrow date range; LIMIT alone doesn't cut cost)
- ☐ Avoid CTE re-execution (use temp tables if referenced 2+ times)
- ☐ Use NOT EXISTS instead of NOT IN
- ☐ Batch DML operations (avoid row-by-row updates)
- ☐ Consider ARRAY/STRUCT for 1:many relationships
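For the materialized-view item above, a minimal sketch (names illustrative); BigQuery maintains the view incrementally and can rewrite matching aggregate queries to use it:
CREATE MATERIALIZED VIEW `project.dataset.daily_revenue_mv` AS
SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(amount) AS revenue
FROM `project.dataset.orders`
GROUP BY order_date;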
Monitoring Query Performance
-- Check query statistics
SELECT
  job_id,
  user_email,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;
Quick Wins
Immediate improvements (typical ranges; actual savings depend on the workload):
- Add partition filter → 50-90% cost reduction
- Remove SELECT * → 30-70% cost reduction
- Use clustering → 20-50% performance improvement
- Break complex queries → 2-5x faster execution
- Enable query cache → Free repeated queries
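Putting several of these together, a before/after sketch on the running orders example (assuming order_date is the partition column):
-- Before: scans every column of every partition
SELECT * FROM `project.dataset.orders`;

-- After: partition pruning + column pruning + QUALIFY dedup
SELECT customer_id, amount, order_date
FROM `project.dataset.orders`
WHERE order_date >= '2024-01-01'  -- partition filter
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id ORDER BY order_date DESC
) = 1;  -- latest order per customer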