Skilllibrary bigquery-skill

Google BigQuery specialist for query authoring, schema design, cost optimization, and data pipeline configuration. Triggers: 'BigQuery query', 'BQ cost estimate', 'partition strategy', 'optimize this BQ query', 'BigQuery schema design', 'scheduled query', 'data loading into BigQuery', 'UNNEST', 'wildcard tables'. Do NOT use for general SQL unrelated to BigQuery, Postgres/MySQL tuning, or data warehouse platforms other than BigQuery (Snowflake, Redshift, Databricks).

Install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/17-external-reference-seeds/bigquery-skill" ~/.claude/skills/merceralex397-collab-skilllibrary-bigquery-skill && rm -rf "$T"
manifest: 17-external-reference-seeds/bigquery-skill/SKILL.md

Purpose

Domain skill for Google BigQuery — covering query authoring, schema design, cost management, performance optimization, data loading strategies, and operational configuration. Produces production-ready SQL, schema recommendations, and cost-aware optimization guidance grounded in BigQuery's specific architecture (Dremel execution engine, Colossus storage, slot-based processing).

When to use this skill

Use this skill when:

  • writing or reviewing BigQuery Standard SQL queries
  • designing or evolving BigQuery table schemas (partitioning, clustering, nested/repeated fields)
  • estimating or reducing query costs (on-demand bytes billed or slot utilization)
  • configuring data loading (batch load, streaming inserts, Storage Write API)
  • setting up scheduled queries, data transfers, authorized views, or materialized views
  • troubleshooting BigQuery performance (slot contention, shuffle limits, query plan analysis)
  • migrating to BigQuery from other warehouses or from BigQuery Legacy SQL to Standard SQL

Do not use this skill when

  • the task involves a different data warehouse (Snowflake, Redshift, Databricks, Synapse) — use the appropriate platform skill
  • the SQL is for PostgreSQL, MySQL, or SQLite with no BigQuery involvement
  • the task is purely about GCP IAM or networking without BigQuery-specific configuration
  • a quick format conversion or one-liner would suffice — use misc-helper instead

Operating procedure

1. Understand the data context

  • Identify the dataset, table(s), and their schemas. Check for partitioning (_PARTITIONTIME, _PARTITIONDATE, or column-based), clustering columns, and nested/repeated fields (RECORD/ARRAY types).
  • Determine the pricing model in use: on-demand (per-TB scanned) or flat-rate/editions (slot-based).
  • Note the data volume: row count, total storage size, and typical query scan size.
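
A minimal sketch of this check, assuming a hypothetical dataset mydataset and table events (both names are placeholders); the dataset-level INFORMATION_SCHEMA.PARTITIONS view reports per-partition row counts and sizes:

    -- Per-partition volume for a placeholder table
    SELECT
      partition_id,
      total_rows,
      ROUND(total_logical_bytes / POW(1024, 3), 2) AS logical_gib
    FROM mydataset.INFORMATION_SCHEMA.PARTITIONS
    WHERE table_name = 'events'
    ORDER BY partition_id DESC;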

2. Author or optimize the query

Query patterns to apply:

  • Always filter on the partition column first to enable partition pruning. Use WHERE _PARTITIONTIME BETWEEN ... or filter the partitioning column directly.
  • Use UNNEST() to flatten ARRAY/STRUCT fields before filtering or aggregating.
  • For wildcard tables (project.dataset.events_*), add _TABLE_SUFFIX filters to limit the tables scanned.
  • Prefer APPROX_COUNT_DISTINCT() over COUNT(DISTINCT ...) for high-cardinality columns (typically 1-2% error, up to 50x faster).
  • Select only needed columns — never SELECT * on wide or large tables.
  • Place the largest table first in JOINs so the BigQuery optimizer can broadcast smaller tables.
  • Use CTEs (WITH clauses) for readability; BigQuery optimizes them into the execution plan.
  • Use SAFE_DIVIDE(), SAFE_CAST(), and IFNULL()/COALESCE() to handle nulls and division-by-zero without query failure.
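
A minimal sketch combining several of these patterns, assuming a hypothetical daily-partitioned mydataset.events table with a repeated items RECORD (all names are placeholders):

    -- Partition pruning, UNNEST, approximate aggregation, and safe division in one query
    SELECT
      event_date,
      country,
      APPROX_COUNT_DISTINCT(user_id) AS approx_users,          -- cheaper than COUNT(DISTINCT ...)
      SAFE_DIVIDE(SUM(item.price), COUNT(*)) AS avg_item_price  -- avoids division-by-zero errors
    FROM mydataset.events,
         UNNEST(items) AS item                                  -- flatten the repeated RECORD
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'      -- prunes to 31 daily partitions
    GROUP BY event_date, country;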

Cost estimation:

  • Run a dry run before execution: bq query --dry_run --use_legacy_sql=false 'SELECT ...' to get the bytes-processed estimate.
  • Convert bytes to cost: on-demand pricing is $6.25/TB (first 1 TB/month free).
  • For flat-rate, cost is fixed per slot-hour; focus on slot utilization and concurrency instead.
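
As a rough worked example (the 450 GB figure is illustrative, not from a real dry run), the conversion from a dry-run estimate to on-demand dollars is simple arithmetic:

    -- 450 GB scanned at $6.25/TB on-demand
    SELECT ROUND(450 / 1024 * 6.25, 2) AS estimated_usd;  -- ≈ 2.75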

3. Design or review schema

  • Partitioning strategy: use time-based partitioning (DAY, HOUR, MONTH, YEAR) for event/log data; integer-range partitioning for ID-sharded data. Max 4,000 partitions per table.
  • Clustering: choose up to 4 columns, ordered by filter frequency (most filtered first). Clustering is free and improves both performance and cost.
  • Nested/repeated fields: prefer denormalized schemas with STRUCT and ARRAY over normalized multi-table joins. BigQuery is optimized for wide, denormalized tables.
  • Materialized views: use for frequently repeated aggregations. BigQuery auto-refreshes them and the optimizer can transparently rewrite queries to use them.
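
A sketch of DDL reflecting these recommendations; the dataset, table, and column names are hypothetical:

    -- Daily time partitioning, clustering on common filter columns, nested line items
    CREATE TABLE mydataset.events (
      event_ts TIMESTAMP NOT NULL,
      user_id  STRING,
      country  STRING,
      items    ARRAY<STRUCT<sku STRING, qty INT64, price NUMERIC>>  -- denormalized 1:N detail
    )
    PARTITION BY DATE(event_ts)
    CLUSTER BY country, user_id;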

4. Configure data loading

  • Batch load (recommended for most cases): use bq load or the Storage Write API in batch mode. Prefer Avro format (fastest load, self-describing schema), then Parquet (good compression, columnar), then CSV (slowest, requires a schema definition).
  • Streaming inserts (tabledata.insertAll): use only for real-time ingestion with sub-second latency requirements. Costs $0.05/GB. Data is queryable immediately, but rows can remain in the streaming buffer, where they are unavailable for copy or export jobs, for up to 90 minutes.
  • Storage Write API (recommended over streaming): lower cost, exactly-once semantics, higher throughput. Use the default stream for low-latency or committed streams for transactional guarantees.
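
For batch loads, the SQL LOAD DATA statement is one option alongside bq load; a minimal sketch, assuming a placeholder table and Cloud Storage path:

    -- Append Avro files from Cloud Storage into a table (names are placeholders)
    LOAD DATA INTO mydataset.events
    FROM FILES (
      format = 'AVRO',
      uris = ['gs://my-bucket/exports/*.avro']
    );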

5. Validate and deliver

  • Verify partition pruning in the query plan (INFORMATION_SCHEMA.JOBS or the BQ console execution details).
  • Confirm bytes billed matches the dry run estimate (±10%).
  • Check for slot contention in INFORMATION_SCHEMA.JOBS_BY_PROJECT if using flat-rate.
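
A sketch for checking actual bytes billed and slot consumption after a run, assuming the data lives in the US multi-region (adjust the region qualifier to your location):

    -- Most expensive query jobs in the last 24 hours
    SELECT
      job_id,
      total_bytes_billed,
      total_slot_ms,
      creation_time
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      AND job_type = 'QUERY'
    ORDER BY total_bytes_billed DESC
    LIMIT 20;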

Decision rules

  • Always partition: no production table over 1 GB should be unpartitioned. Default to daily time partitioning unless the access pattern clearly favors another strategy (an enforcement sketch follows this list).
  • Always cluster: if a table has common filter/group-by columns, add clustering. There is no cost, only benefit.
  • Dry run first: never execute an ad-hoc query on a table >10 GB without a dry run cost estimate.
  • Denormalize over join: prefer nested/repeated fields over multi-table joins when the data relationship is 1:N and the nested data is always accessed with the parent.
  • Approximate over exact: for exploratory analytics on >1 billion rows, prefer APPROX_ functions unless exact counts are a business requirement.
  • Storage Write API over streaming inserts: for new pipelines, always recommend the Storage Write API unless the team has an existing streaming inserts integration that works.
  • Avoid cross-region queries: if data is stored in US and compute or downstream consumers are in EU, flag the cost and latency implications; a query cannot directly reference datasets in different locations, so the data has to be copied or transferred first.
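
One way to enforce the "always partition" rule on an existing table is to require a partition filter at the table level; a minimal sketch with a placeholder table name:

    -- Queries that omit a filter on the partition column are rejected
    ALTER TABLE mydataset.events
    SET OPTIONS (require_partition_filter = TRUE);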

Output requirements

Every response must include the applicable sections:

  1. Query — the complete, runnable BigQuery Standard SQL, with comments explaining partition pruning, clustering usage, and any cost-relevant choices.
  2. Cost Estimate — estimated bytes scanned and dollar cost (on-demand) or slot-seconds (flat-rate). Include the dry run command to verify.
  3. Schema Recommendation (if schema changes are involved) — DDL statement with partitioning, clustering, and field type choices annotated.
  4. Optimization Notes — bullet list of what was optimized and why (e.g., "Added _PARTITIONDATE filter to reduce scan from 2.1 TB to 45 GB").
  5. Warnings (if any) — cost risks, missing partition filters, or deprecated patterns detected.

Anti-patterns

  • SELECT * on large tables: scans every column; always specify the needed columns explicitly. On a 10-column, 5 TB table, selecting 2 columns reduces cost by ~80% (a before/after sketch follows this list).
  • Unpartitioned date-filtered queries: filtering on a DATE column without partitioning forces a full table scan. Partition the table or use a partitioned view.
  • Streaming inserts for batch workloads: using tabledata.insertAll for nightly ETL jobs wastes money ($0.05/GB vs. free for batch loads) and lacks exactly-once guarantees.
  • JOINing before filtering: apply WHERE clauses before JOINs to reduce shuffle volume. BigQuery's optimizer helps, but explicit pre-filtering is more reliable.
  • Ignoring clustering order: clustering columns are hierarchical (like a compound index). Putting a high-cardinality column first when the low-cardinality column is filtered more often wastes the clustering benefit.
  • Overusing ORDER BY without LIMIT: sorting the full result set of a multi-TB query is expensive and usually unnecessary. Always pair ORDER BY with LIMIT unless the consumer truly needs sorted output.
  • Legacy SQL: any query starting with #legacySQL or using bracket syntax ([project:dataset.table]) should be migrated to Standard SQL.
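
A before/after sketch of the first anti-pattern, using the same placeholder table as the earlier examples:

    -- Anti-pattern: scans every column in every partition
    SELECT *
    FROM mydataset.events;

    -- Better: explicit columns plus a direct range filter on the partitioning column
    SELECT user_id, country, event_ts
    FROM mydataset.events
    WHERE event_ts >= TIMESTAMP '2024-06-01'
      AND event_ts <  TIMESTAMP '2024-06-02';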

Related skills

  • fastapi-patterns — for building API layers that query BigQuery
  • misc-helper — for quick data format conversions before/after BigQuery operations
  • tauri-solidjs — for desktop dashboards consuming BigQuery data

Failure handling

  • If the table schema is unknown, query INFORMATION_SCHEMA.COLUMNS or INFORMATION_SCHEMA.TABLE_OPTIONS to discover partitioning, clustering, and field types before proceeding (a discovery sketch follows this list).
  • If a dry run returns an unexpectedly high bytes estimate, investigate: check for missing partition filters, unnecessary columns, or missing _TABLE_SUFFIX filters on wildcard tables.
  • If a query fails with "Resources exceeded", simplify the query (break it into stages with temp tables), reduce JOIN fan-out, or add LIMIT for iterative development.
  • If the pricing model is unknown, assume on-demand and flag that slot-based pricing should be verified with the team.
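
A minimal discovery sketch for an unknown schema, again with placeholder dataset and table names:

    -- Table-level options (partition expiration, require_partition_filter, etc.)
    SELECT option_name, option_value
    FROM mydataset.INFORMATION_SCHEMA.TABLE_OPTIONS
    WHERE table_name = 'events';

    -- Column types plus partitioning/clustering flags
    SELECT column_name, data_type, is_partitioning_column, clustering_ordinal_position
    FROM mydataset.INFORMATION_SCHEMA.COLUMNS
    WHERE table_name = 'events';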