skilllibrary · bigquery

install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/09-backend-api-and-data/bigquery" ~/.claude/skills/merceralex397-collab-skilllibrary-bigquery && rm -rf "$T"
manifest: 09-backend-api-and-data/bigquery/SKILL.md
source content

Purpose

Write efficient BigQuery SQL with proper partitioning, clustering, cost estimation, and query optimization.

When to use this skill

  • writing or optimizing BigQuery SQL queries
  • designing tables with time-based partitioning and clustering
  • estimating query cost before running with --dry_run
  • debugging slow queries or slot contention in BigQuery

Do not use this skill when

  • working with PostgreSQL/MySQL — prefer orm-patterns
  • building real-time OLTP systems — BigQuery is for analytics
  • managing GCP infrastructure beyond BigQuery (Terraform, etc.)

Procedure

  1. Estimate cost first — run bq query --dry_run --use_legacy_sql=false 'SELECT ...' to see bytes scanned before anything is billed.
  2. Prune partitions — always filter on the partition column (usually _PARTITIONTIME or a DATE/TIMESTAMP column) in WHERE.
  3. Use clustering — cluster by high-cardinality filter columns (e.g., user_id, event_name) after partitioning.
  4. Select only needed columns — BigQuery is columnar; SELECT * scans all columns and inflates cost.
  5. Avoid cross joins — use JOIN with explicit keys. Check INFORMATION_SCHEMA.JOBS for bytes billed.
  6. Use APPROX_COUNT_DISTINCT — for cardinality estimates on large tables; ~2% error but roughly 10x faster than exact COUNT(DISTINCT ...). Steps 2–6 are combined in the query sketch after this list.
  7. Materialize CTEs — BigQuery re-evaluates a CTE each time it is referenced; stage repeated subqueries in temp tables instead (see the second sketch below).
  8. Monitor slots — check INFORMATION_SCHEMA.JOBS_BY_PROJECT for total_slot_ms to find expensive queries.
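
To make steps 2–6 concrete, a minimal sketch combining them in one query. It assumes the events table defined under Table design below plus a hypothetical project.dataset.users table:

SELECT
  e.event_name,
  APPROX_COUNT_DISTINCT(e.user_id) AS approx_users         -- step 6: cheap cardinality estimate
FROM `project.dataset.events` AS e
JOIN `project.dataset.users` AS u                          -- step 5: explicit join key, no cross join
  ON u.user_id = e.user_id
WHERE e.event_date BETWEEN '2024-01-01' AND '2024-01-31'   -- step 2: prunes partitions
  AND e.event_name IN ('purchase', 'refund')               -- step 3: clustered column narrows blocks scanned
GROUP BY e.event_name;                                     -- step 4: only the needed columns, no SELECT *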

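A sketch of step 7: staging a repeated subquery in a temp table inside a multi-statement query (table and date range are illustrative):

-- Stage the expensive aggregate once instead of re-running a CTE per reference
CREATE TEMP TABLE daily_active AS
SELECT event_date, COUNT(DISTINCT user_id) AS dau
FROM `project.dataset.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY event_date;

-- Each reuse reads the staged rows; the base table is not rescanned
SELECT AVG(dau) AS avg_dau FROM daily_active;
SELECT MAX(dau) AS peak_dau FROM daily_active;
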
Table design

CREATE TABLE `project.dataset.events` (
  event_date    DATE NOT NULL,
  event_name    STRING NOT NULL,
  user_id       STRING,
  properties    JSON,
  created_at    TIMESTAMP NOT NULL
)
PARTITION BY event_date                 -- daily partitions; filter on this column to prune
CLUSTER BY event_name, user_id          -- most frequently filtered columns first
OPTIONS (
  partition_expiration_days = 365,      -- drop partitions older than a year
  require_partition_filter = true       -- reject queries that would scan every partition
);
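
The same options can be retrofitted onto an existing table; a sketch using ALTER TABLE ... SET OPTIONS:

ALTER TABLE `project.dataset.events`
SET OPTIONS (
  partition_expiration_days = 365,
  require_partition_filter = true
);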

Cost estimation

# Dry run to check bytes scanned (on-demand cost ≈ bytes scanned × $6.25/TiB)
bq query --dry_run --use_legacy_sql=false \
  'SELECT user_id, COUNT(*) FROM `project.dataset.events`
   WHERE event_date BETWEEN "2024-01-01" AND "2024-01-31"
   GROUP BY 1'
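
The dry run reports the bytes a query would process; cost then follows directly. A worked example with illustrative numbers:

# suppose the dry run reports 1319413953331 bytes
# 1319413953331 / 2^40 ≈ 1.2 TiB
# 1.2 TiB × $6.25/TiB ≈ $7.50 estimated on-demand cost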

# Check actual cost of recent queries
SELECT
  job_id,
  total_bytes_billed / POW(1024, 4) AS tib_billed,
  total_slot_ms / 1000 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 10;
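
To see who is driving slot contention rather than which single job, aggregate the same view by user; a sketch using standard JOBS_BY_PROJECT columns:

# Slot usage and bytes billed per user over the last day
SELECT
  user_email,
  SUM(total_slot_ms) / 1000 / 3600 AS slot_hours,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY user_email
ORDER BY slot_hours DESC;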

Decision rules

  • Set require_partition_filter = true on large tables — prevents accidental full-table scans.
  • Partition by date for time-series data; by integer range for non-temporal data (see the range-partitioning sketch after this list).
  • Cluster by up to 4 columns, in order of filter frequency.
  • Use APPROX_ functions for dashboards; exact aggregates for financial data.
  • Prefer MERGE over DELETE + INSERT for upserts — a single-pass atomic operation (see the MERGE sketch below).
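
A minimal MERGE upsert sketch; the target and staging tables (and their columns) are hypothetical:

MERGE `project.dataset.users` AS t
USING `project.dataset.user_updates` AS s
ON t.user_id = s.user_id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (user_id, email, updated_at)
  VALUES (s.user_id, s.email, s.updated_at);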

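Integer range partitioning for non-temporal data, sketched against a hypothetical accounts table:

CREATE TABLE `project.dataset.accounts` (
  account_id  INT64 NOT NULL,
  balance     NUMERIC
)
PARTITION BY RANGE_BUCKET(account_id, GENERATE_ARRAY(0, 1000000, 10000));
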
References

Related skills

  • orm-patterns — OLTP database patterns