Claude-skill-registry bigquery-cli

Use when working with BigQuery from command line using bq tool, including querying data, loading/exporting tables, managing datasets, cost estimation with dry-run, or partitioning strategies

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/bigquery-cli" ~/.claude/skills/majiayu000-claude-skill-registry-bigquery-cli && rm -rf "$T"
manifest: skills/data/bigquery-cli/SKILL.md
source content

BigQuery CLI (bq)

Overview

The BigQuery CLI (bq) provides command-line access to Google BigQuery for data warehousing and analytics. Core principle: always estimate costs with a dry run, use proper authentication, and leverage partitioning for scale.

When to Use

  • Running SQL queries on BigQuery
  • Loading data from CSV/JSON/Avro/Parquet
  • Exporting table data to GCS
  • Managing datasets, tables, views
  • Cost estimation and optimization
  • Creating partitioned/clustered tables
  • Viewing schemas and data samples

When NOT to use:

  • When BigQuery UI is more appropriate (ad-hoc exploration)
  • When using BigQuery client libraries (Python, Java, etc.)

Authentication and Project Setup

Before ANY bq operation:

# Check authentication
gcloud auth list

# Login if needed
gcloud auth login

# Set default project
gcloud config set project PROJECT_ID

# Verify current project
gcloud config get-value project

Command Structure

Format:

bq [--global_flags] <command> [--command_flags] [args]

Important global flags:

  • --project_id=PROJECT - Override the default project
  • --dataset_id=DATASET - Default dataset
  • --location=LOCATION - Geographic location (us, eu, asia-northeast1)
  • --format=FORMAT - Output format: pretty, sparse, prettyjson, json, csv
  • --quiet / -q - Suppress status updates
  • --dry_run - Validate without executing (queries only)
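
A minimal sketch combining global flags with a command (my-project and analytics.events are hypothetical names); note that global flags go before the command:

# Query an EU dataset in a specific project, emitting readable JSON
bq --project_id=my-project --location=eu --format=prettyjson \
  query --use_legacy_sql=false \
  'SELECT COUNT(*) AS n FROM analytics.events'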

Quick Reference

Query Operations

Task | Command | Key Flags
--- | --- | ---
Run query | bq query 'SELECT ...' | --dry_run, --use_legacy_sql=false, --destination_table, --max_rows
Estimate cost | bq query --dry_run 'SELECT ...' | Shows bytes to be processed
Save results | bq query --destination_table=ds.table 'SELECT ...' | --append_table, --replace
Parameterized query | bq query --parameter='name:STRING:value' 'SELECT ...' | Repeat --parameter for multiple
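
A sketch of a parameterized query (the dataset, table, and columns are hypothetical); each --parameter takes name:type:value and is referenced as @name in the SQL. Parameterized queries require standard SQL:

# Two typed parameters, passed as repeated --parameter flags
bq query --use_legacy_sql=false \
  --parameter='min_date:DATE:2025-01-01' \
  --parameter='country:STRING:US' \
  'SELECT COUNT(*) AS n FROM ds.events
   WHERE event_date >= @min_date AND country = @country'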

Data Loading

Task | Command | Key Flags
--- | --- | ---
Load CSV | bq load ds.table gs://bucket/file.csv schema | --skip_leading_rows=1, --autodetect, --field_delimiter
Load JSON | bq load --source_format=NEWLINE_DELIMITED_JSON ds.table file.json | --autodetect, --schema
Load Parquet/Avro | bq load --source_format=PARQUET ds.table gs://bucket/*.parquet | Schema auto-detected
Replace table | bq load --replace ds.table source schema | Overwrites existing data
Append to table | bq load --noreplace ds.table source schema | Adds to existing data (default)
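
A production-style load sketch with an explicit inline schema instead of --autodetect (dataset, table, and bucket names are hypothetical):

# Explicit schema: predictable types, no autodetect surprises
bq load --source_format=CSV --skip_leading_rows=1 \
  my_dataset.sales \
  gs://my-bucket/sales_2025*.csv \
  order_id:STRING,amount:NUMERIC,order_date:DATE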

Data Export

Task | Command | Key Flags
--- | --- | ---
Export to CSV | bq extract --destination_format=CSV ds.table gs://bucket/file.csv | --field_delimiter, --print_header
Export to JSON | bq extract --destination_format=NEWLINE_DELIMITED_JSON ds.table gs://bucket/*.json | Use wildcard for large files
Compressed export | bq extract --compression=GZIP ds.table gs://bucket/*.json.gz | GZIP, SNAPPY, or NONE
Export model | bq extract -m ds.model gs://bucket/model | For ML models
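
bq extract also supports PARQUET as a destination format; a sketch with hypothetical names:

# Columnar export for analytics workflows
bq extract --destination_format=PARQUET \
  my_dataset.sales \
  'gs://my-bucket/sales_*.parquet'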

Resource Management

Task | Command | Key Flags
--- | --- | ---
List datasets | bq ls or bq ls PROJECT: | -a for all (including hidden)
List tables | bq ls DATASET | -m for models, -a for all
List jobs | bq ls -j PROJECT | --filter='state:RUNNING,PENDING', --max_results
Show table | bq show ds.table | --schema, --format=prettyjson
Show schema only | bq show --schema ds.table | Faster than full show
Preview data | bq head ds.table | -n 100, -s 10 for offset
Show job | bq show -j JOB_ID | Check job status and details

Creating Resources

Task | Command | Key Flags
--- | --- | ---
Create dataset | bq mk DATASET | --location=us, --description
Create table | bq mk -t ds.table schema | See schema format below
Create view | bq mk --view='SELECT ...' ds.view | SQL definition
Create materialized view | bq mk --materialized_view='SELECT ...' ds.mview | Auto-refreshed
Create partitioned table | bq mk --table --time_partitioning_type=DAY ds.table schema | See partitioning below

Deleting Resources

Task | Command | Key Flags
--- | --- | ---
Delete table | bq rm ds.table | -f to skip confirmation
Delete dataset | bq rm -r DATASET | -r removes all tables, -f to force
Delete model | bq rm -m ds.model | For ML models
Cancel job | bq cancel JOB_ID | Stops a running query

Cost Estimation - ALWAYS USE DRY RUN

# ALWAYS check cost before running expensive queries
bq query --dry_run --use_legacy_sql=false 'SELECT * FROM project.dataset.huge_table'

# Output shows bytes to be processed
# Cost = (Bytes / 1TB) × $6.25 (US, on-demand pricing)
# First 1 TB per month is free

# Add safety limit (in bytes)
bq query --maximum_bytes_billed=5000000000000 'SELECT ...'  # ~5TB limit
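
A minimal sketch that turns the dry-run message into a dollar estimate. It assumes the human-readable "will process N bytes" output, which may vary across bq versions, and a hypothetical table name:

# Grab the byte count from the dry-run output and price it at $6.25/TiB
BYTES=$(bq query --dry_run --use_legacy_sql=false \
  'SELECT * FROM ds.big_table' 2>&1 | grep -Eo '[0-9]+ bytes' | grep -Eo '[0-9]+')
echo "$BYTES" | awk '{ printf "Estimated cost: $%.2f\n", $1 / 1099511627776 * 6.25 }'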

Schema Format

Inline (comma-separated):

field1:STRING,field2:INTEGER,field3:FLOAT,field4:BOOLEAN,field5:TIMESTAMP

Modes: an inline schema cannot set a column mode; every inline column defaults to NULLABLE. To declare REQUIRED or REPEATED fields, use a JSON schema file:

JSON schema file:

[
  {"name": "field1", "type": "STRING", "mode": "REQUIRED"},
  {"name": "field2", "type": "INTEGER", "mode": "NULLABLE"},
  {
    "name": "nested",
    "type": "RECORD",
    "fields": [
      {"name": "subfield", "type": "STRING"}
    ]
  }
]

Common types: STRING, INTEGER, FLOAT, BOOLEAN, TIMESTAMP, DATE, TIME, DATETIME, NUMERIC, BIGNUMERIC, BYTES, GEOGRAPHY, JSON, RECORD (nested). Arrays are expressed with mode REPEATED rather than a distinct type.
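
To use the JSON form, save it to a file and pass the path wherever a schema argument is expected (file, dataset, and table names here are hypothetical):

# Create a table from a JSON schema file
bq mk --table my_dataset.users ./schema.json

# The same file works as the schema argument for loads
bq load --source_format=NEWLINE_DELIMITED_JSON \
  my_dataset.users gs://my-bucket/users.json ./schema.json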

Partitioning and Clustering

Time-based Partitioning

# Partition by DATE/TIMESTAMP column
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=event_date \
  --time_partitioning_expiration=2592000 \
  ds.events \
  event_id:STRING,event_date:DATE,data:STRING

# Require partition filter (prevents expensive full scans)
bq mk --table \
  --time_partitioning_type=DAY \
  --require_partition_filter=true \
  ds.events \
  schema.json

Partition types: DAY, HOUR, MONTH, YEAR

Clustering

# Add clustering (up to 4 columns)
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=event_date \
  --clustering_fields=user_id,region \
  ds.events \
  schema.json

Benefits: Faster queries, lower costs when filtering/grouping by clustered columns
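
A sketch of a query shape that benefits from both (table and columns are hypothetical): the event_date filter prunes partitions, and the user_id filter leverages clustering:

# Partition pruning (event_date) plus cluster pruning (user_id)
bq query --use_legacy_sql=false \
  'SELECT event_type, COUNT(*) AS n
   FROM ds.events
   WHERE event_date BETWEEN "2025-01-01" AND "2025-01-07"
     AND user_id = "u123"
   GROUP BY event_type'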

Output Formats

# Pretty table (default)
bq query --format=pretty 'SELECT ...'

# JSON (compact)
bq query --format=json 'SELECT ...'

# Pretty JSON (readable)
bq query --format=prettyjson 'SELECT ...'

# CSV with header
bq query --format=csv 'SELECT ...'

# Sparse table (simpler)
bq query --format=sparse 'SELECT ...'
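
The json format pairs well with jq for scripting (assuming jq is installed; the query is a trivial example):

# Rows come back as a JSON array of objects
bq query --format=json --use_legacy_sql=false \
  'SELECT 1 AS x, "a" AS y' | jq '.[0].x'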

Common Workflows

Cost Estimation → Query

# 1. Estimate cost
bq query --dry_run 'SELECT ...'

# 2. Review bytes to be processed
# Calculate: (bytes / 1099511627776) × $6.25

# 3. Run query if acceptable
bq query 'SELECT ...'

# OR add safety limit
bq query --maximum_bytes_billed=10000000000000 'SELECT ...'

Load Data Pipeline

# 1. Check authentication and project
gcloud auth list
gcloud config get-value project

# 2. Create dataset if needed
bq ls | grep my_dataset || bq mk my_dataset

# 3. Load with autodetect (fast) or explicit schema (production)
bq load --autodetect --skip_leading_rows=1 \
  my_dataset.table \
  gs://bucket/data.csv

# 4. Verify
bq show my_dataset.table
bq head -n 10 my_dataset.table

Export Large Table

# 1. Export with wildcard (required for >1GB)
bq extract \
  --compression=GZIP \
  --destination_format=NEWLINE_DELIMITED_JSON \
  dataset.large_table \
  'gs://bucket/export_*.json.gz'

# 2. Verify export
gsutil ls -lh gs://bucket/export_*

# 3. Check total size
gsutil du -sh gs://bucket/

Create Partitioned Table for Scale

# 1. Create with partitioning and clustering
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=event_date \
  --clustering_fields=user_id,event_type \
  --require_partition_filter=true \
  --time_partitioning_expiration=31536000 \
  dataset.events \
  event_id:STRING,user_id:STRING,event_type:STRING,event_date:DATE,data:JSON

# 2. Load initial data
bq load --source_format=NEWLINE_DELIMITED_JSON dataset.events gs://bucket/events_*.json

# 3. Query with partition filter (required)
bq query --use_legacy_sql=false 'SELECT * FROM dataset.events WHERE event_date = "2025-01-20"'

Best Practices

Query Cost Optimization

  1. Always use --dry_run first for queries scanning >1TB
  2. Add WHERE clauses on partition columns
  3. SELECT specific columns, not SELECT *
  4. Use --maximum_bytes_billed as a safety net
  5. Partition large tables by date/timestamp
  6. Cluster by common filter columns

Loading Data

  1. Use --autodetect for quick loads, explicit schema for production
  2. Load from GCS (not local files) for large data
  3. Use appropriate source format: Parquet/Avro > CSV/JSON
  4. Batch small files into larger files before loading
  5. Set expiration on staging tables (see the sketch below)
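
A minimal sketch of expiring a staging table (the table name is hypothetical); --expiration is in seconds:

# Auto-delete the staging table after 24 hours
bq update --expiration 86400 my_dataset.staging_events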

Exporting Data

  1. Use wildcards for large exports (>1GB limit per file)
  2. Compress exports with GZIP or SNAPPY
  3. Use columnar formats (Parquet, Avro) for analytics workflows
  4. Match regions (BigQuery dataset and GCS bucket)

Partitioning

  1. Partition by date/timestamp for time-series data
  2. Use require_partition_filter to prevent expensive scans
  3. Set partition expiration to auto-delete old data
  4. Combine with clustering for additional optimization
  5. Partition types: DAY for most use cases, HOUR for high-volume

Schema Design

  1. Use REQUIRED for mandatory fields
  2. Use TIMESTAMP over STRING for timestamps
  3. Use NUMERIC for financial data (exact precision)
  4. Use JSON type for flexible nested data (Standard SQL only)
  5. Nested records better than wide tables with many columns

Common Mistakes

Mistake | Why It's Wrong | Correct Approach
--- | --- | ---
No dry run for large queries | Unexpected costs | Always bq query --dry_run first
Skipping authentication check | Commands fail | Run gcloud auth list before bq commands
SELECT * on huge tables | Scans all columns | Select only the needed columns
Loading without --skip_leading_rows | Header row becomes data | Use --skip_leading_rows=1 for CSVs with headers
Single file for large exports | 1GB limit per file | Use a wildcard: gs://bucket/file_*.json.gz
No partitioning on large tables | Expensive full scans | Use --time_partitioning_type=DAY
Relying on the legacy SQL default | Different syntax | Pass --use_legacy_sql=false (bq query still defaults to legacy SQL)
Wrong dataset reference | Ambiguous table | Use fully qualified: project.dataset.table
No compression on exports | Larger GCS storage costs | Use --compression=GZIP
Forgetting location parameter | Cross-region errors and costs | Match --location to the dataset location

Red Flags - CHECK BEFORE RUNNING

  • Running a query without --dry_run on production data
  • Using SELECT * on tables with >1TB
  • Loading data without schema validation
  • Exporting a large table without a wildcard
  • Creating an unpartitioned table for billions of rows
  • No authentication check before commands
  • Using the wrong project (check gcloud config get-value project)
  • Not matching dataset and GCS bucket regions

All of these mean: Stop, review the command, fix the issue.

Dataset and Table References

Fully qualified:

project:dataset.table
# or
project.dataset.table

Without project (uses default):

dataset.table

Partitioned table (specific date):

dataset.table$20250120

Legacy SQL (bracket notation):

[project:dataset.table]
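
The partition decorator also works as a load or delete target; a sketch with hypothetical names (quote the reference so the shell does not expand $):

# Load directly into one day's partition
bq load --source_format=CSV 'my_dataset.events$20250120' \
  gs://my-bucket/2025-01-20.csv

# Or delete just that partition
bq rm -f 'my_dataset.events$20250120'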

Job Management

# List running jobs
bq ls -j --filter='state:RUNNING'

# Show job details
bq show -j JOB_ID

# Cancel long-running job
bq cancel JOB_ID

# Wait for job completion
bq wait JOB_ID
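
A sketch of scripting around jobs: --nosync starts the job and returns immediately, and bq wait blocks until it finishes. Parsing the job ID out of the status line is an assumption that may break across bq versions:

# Start a query asynchronously, capture the job ID, then wait on it
JOB_ID=$(bq query --nosync --use_legacy_sql=false \
  'SELECT COUNT(*) FROM ds.events' | grep -Eo 'bqjob_[a-zA-Z0-9_]+')
bq wait "$JOB_ID"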

Getting Help

# General help
bq help

# Command-specific help
bq help query
bq help load
bq help extract

# Show all flags
bq --helpfull

When in doubt, check bq help <command> for exact flags and syntax.

Real-World Impact

Cost savings:

  • Dry-run prevents accidental multi-thousand dollar queries
  • Partitioning reduces scan costs by 10-100x
  • Clustering adds 20-40% additional savings

Performance:

  • Partitioning + clustering: queries 10-100x faster
  • Proper schema: faster loads and queries
  • Columnar formats (Parquet): 5-10x faster loads than CSV

Common scenarios:

  • Ad-hoc analysis: bq query --dry_run → review → bq query
  • Data pipelines: bq load from GCS → process → bq extract
  • Cost monitoring: Always dry-run before production queries
  • Scale: Partition + cluster tables with >100M rows