skilllibrary · bigquery
Install
Source · Clone the upstream repo

```bash
git clone https://github.com/merceralex397-collab/skilllibrary
```

Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/09-backend-api-and-data/bigquery" ~/.claude/skills/merceralex397-collab-skilllibrary-bigquery && rm -rf "$T"
```
Manifest: `09-backend-api-and-data/bigquery/SKILL.md`
Purpose
Write efficient BigQuery SQL with proper partitioning, clustering, cost estimation, and query optimization.
When to use this skill
- writing or optimizing BigQuery SQL queries
- designing tables with time-based partitioning and clustering
- estimating query cost before running with `--dry_run`
- debugging slow queries or slot contention in BigQuery
Do not use this skill when
- working with PostgreSQL/MySQL — prefer `orm-patterns`
- building real-time OLTP systems — BigQuery is for analytics
- managing GCP infrastructure beyond BigQuery (Terraform, etc.)
Procedure
- Estimate cost first — run `bq query --dry_run --use_legacy_sql=false 'SELECT ...'` to see bytes scanned.
- Prune partitions — always filter on the partition column (usually `_PARTITIONTIME` or a `DATE`/`TIMESTAMP` column) in `WHERE` (see the sketch after this list).
- Use clustering — cluster by high-cardinality filter columns (e.g., `user_id`, `event_name`) after partitioning.
- Select only needed columns — BigQuery is columnar; `SELECT *` scans all columns and inflates cost.
- Avoid cross joins — use `JOIN` with explicit keys. Check `INFORMATION_SCHEMA.JOBS` for bytes billed.
- Use `APPROX_COUNT_DISTINCT` — for cardinality estimates on large tables, ~2% error but 10x faster (illustrated in the first sketch below).
- Materialize CTEs — BigQuery evaluates CTEs multiple times; use temp tables for repeated subqueries (see the temp-table sketch below).
- Monitor slots — check `INFORMATION_SCHEMA.JOBS_BY_PROJECT` for `total_slot_ms` to find expensive queries.
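A minimal sketch of the pruning, column-selection, and approximation steps above, assuming the example `project.dataset.events` table defined under Table design below:

```sql
-- Filter on the partition column so only the January 2024 partitions are scanned,
-- select only the columns the query needs, and use APPROX_COUNT_DISTINCT instead
-- of COUNT(DISTINCT ...) for a fast cardinality estimate.
SELECT
  event_name,
  APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM project.dataset.events
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'  -- partition pruning
GROUP BY event_name
ORDER BY approx_users DESC;
```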
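And a sketch of the temp-table step, again against the assumed `events` table: the expensive subquery is written once inside a script or session and reused, instead of being re-evaluated as a CTE in each reference.

```sql
-- Materialize the expensive subquery once as a session-scoped temp table...
CREATE TEMP TABLE daily_active AS
SELECT event_date, user_id
FROM project.dataset.events
WHERE event_date >= '2024-01-01'
GROUP BY event_date, user_id;

-- ...then reuse it from several queries without rescanning the base table.
SELECT event_date, COUNT(*) AS daily_users FROM daily_active GROUP BY event_date;
SELECT COUNT(DISTINCT user_id) AS monthly_users FROM daily_active;
```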
Table design
```sql
CREATE TABLE project.dataset.events (
  event_date DATE NOT NULL,
  event_name STRING NOT NULL,
  user_id STRING,
  properties JSON,
  created_at TIMESTAMP NOT NULL
)
PARTITION BY event_date
CLUSTER BY event_name, user_id
OPTIONS (
  partition_expiration_days = 365,
  require_partition_filter = true
);
```
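Because `require_partition_filter = true` is set, BigQuery rejects any query against this table that does not constrain `event_date`; a small illustration:

```sql
-- Rejected: no filter on the partition column event_date.
-- SELECT COUNT(*) FROM project.dataset.events;

-- Accepted: the partition column is constrained, so only two partitions are scanned.
SELECT COUNT(*) AS events_scanned
FROM project.dataset.events
WHERE event_date BETWEEN '2024-03-01' AND '2024-03-02';
```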
Cost estimation
```bash
# Dry run to check bytes scanned (cost = bytes * $6.25/TB on-demand)
bq query --dry_run --use_legacy_sql=false \
  'SELECT user_id, COUNT(*) FROM project.dataset.events
   WHERE event_date BETWEEN "2024-01-01" AND "2024-01-31" GROUP BY 1'
```

```sql
-- Check actual cost of recent queries
SELECT
  job_id,
  total_bytes_billed / POW(1024, 4) AS tb_billed,
  total_slot_ms / 1000 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 10;
```
Decision rules
- Set `require_partition_filter = true` on large tables — prevents full-table scans.
- Partition by date for time-series data; by integer range for non-temporal data (see the range-partitioning sketch after this list).
- Cluster by up to 4 columns in order of filter frequency.
- Use `APPROX_` functions for dashboards; exact aggregates for financial data.
- Prefer `MERGE` over `DELETE + INSERT` for upserts — single-pass atomic operation (see the MERGE sketch below).
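A hedged sketch of integer-range partitioning for the non-temporal case; the `customer_scores` table and its columns are illustrative, not part of the schema above:

```sql
-- Partition by customer_id into buckets of 1,000 IDs covering 0..999,999;
-- rows with IDs outside that range land in the __UNPARTITIONED__ partition.
CREATE TABLE project.dataset.customer_scores (
  customer_id INT64 NOT NULL,
  score FLOAT64,
  updated_at TIMESTAMP
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 1000))
CLUSTER BY customer_id;
```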
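And a sketch of the upsert rule, merging one day of rows from an assumed `events_staging` table into `events` in a single atomic statement; the constant `event_date` predicate also satisfies the partition-filter requirement:

```sql
MERGE project.dataset.events AS t
USING project.dataset.events_staging AS s
ON  t.event_date = '2024-01-31'          -- constant partition filter for pruning
AND t.event_date = s.event_date
AND t.event_name = s.event_name
AND t.user_id    = s.user_id
WHEN MATCHED THEN
  UPDATE SET properties = s.properties, created_at = s.created_at
WHEN NOT MATCHED THEN
  INSERT (event_date, event_name, user_id, properties, created_at)
  VALUES (s.event_date, s.event_name, s.user_id, s.properties, s.created_at);
```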
References
- https://cloud.google.com/bigquery/docs/best-practices-performance-overview
- https://cloud.google.com/bigquery/pricing
Related skills
- `orm-patterns` — OLTP database patterns