Claude-skill-registry bigquery-cli

Use when working with BigQuery from command line using bq tool, including querying data, loading/exporting tables, managing datasets, cost estimation with dry-run, or partitioning strategies

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/bigquery-cli" ~/.claude/skills/majiayu000-claude-skill-registry-bigquery-cli && rm -rf "$T"
manifest: skills/data/bigquery-cli/SKILL.md
source content

BigQuery CLI (bq)

Overview

The BigQuery CLI (bq) provides command-line access to Google BigQuery for data warehousing and analytics. Core principle: always estimate costs with a dry run, use proper authentication, and leverage partitioning for scale.

When to Use

  • Running SQL queries on BigQuery
  • Loading data from CSV/JSON/Avro/Parquet
  • Exporting table data to GCS
  • Managing datasets, tables, views
  • Cost estimation and optimization
  • Creating partitioned/clustered tables
  • Viewing schemas and data samples

When NOT to use:

  • When BigQuery UI is more appropriate (ad-hoc exploration)
  • When using BigQuery client libraries (Python, Java, etc.)

Authentication and Project Setup

Before ANY bq operation:

# Check authentication
gcloud auth list

# Login if needed
gcloud auth login

# Set default project
gcloud config set project PROJECT_ID

# Verify current project
gcloud config get-value project

Command Structure

Format:

bq [--global_flags] <command> [--command_flags] [args]

Important global flags:

  • --project_id=PROJECT - Override the default project
  • --dataset_id=DATASET - Default dataset
  • --location=LOCATION - Geographic location (us, eu, asia-northeast1)
  • --format=FORMAT - Output format: pretty, sparse, prettyjson, json, csv
  • --quiet / -q - Suppress status updates
  • --dry_run - Validate without executing (queries only)
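
A minimal sketch combining global flags with a command (my-project and analytics.events are hypothetical names); note that global flags go before the command:

# Query an EU dataset in a specific project, emitting readable JSON
bq --project_id=my-project --location=eu --format=prettyjson \
  query --use_legacy_sql=false \
  'SELECT COUNT(*) AS n FROM analytics.events'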

Quick Reference

Query Operations

Task | Command | Key Flags
--- | --- | ---
Run query | bq query 'SELECT ...' | --dry_run, --use_legacy_sql=false, --destination_table, --max_rows
Estimate cost | bq query --dry_run 'SELECT ...' | Shows bytes to be processed
Save results | bq query --destination_table=ds.table 'SELECT ...' | --append_table, --replace
Parameterized query | bq query --parameter='name:STRING:value' 'SELECT ...' | Repeat --parameter for multiple
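
A sketch of a parameterized query (the dataset, table, and columns are hypothetical); each --parameter takes name:type:value and is referenced as @name in the SQL. Parameterized queries require standard SQL:

# Two typed parameters, passed as repeated --parameter flags
bq query --use_legacy_sql=false \
  --parameter='min_date:DATE:2025-01-01' \
  --parameter='country:STRING:US' \
  'SELECT COUNT(*) AS n FROM ds.events
   WHERE event_date >= @min_date AND country = @country'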

Data Loading

Task | Command | Key Flags
--- | --- | ---
Load CSV | bq load ds.table gs://bucket/file.csv schema | --skip_leading_rows=1, --autodetect, --field_delimiter
Load JSON | bq load --source_format=NEWLINE_DELIMITED_JSON ds.table file.json | --autodetect, --schema
Load Parquet/Avro | bq load --source_format=PARQUET ds.table gs://bucket/*.parquet | Schema auto-detected
Replace table | bq load --replace ds.table source schema | Overwrites existing data
Append to table | bq load --noreplace ds.table source schema | Adds to existing data (default)
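
A production-style load sketch with an explicit inline schema instead of --autodetect (dataset, table, and bucket names are hypothetical):

# Explicit schema: predictable types, no autodetect surprises
bq load --source_format=CSV --skip_leading_rows=1 \
  my_dataset.sales \
  gs://my-bucket/sales_2025*.csv \
  order_id:STRING,amount:NUMERIC,order_date:DATE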

Data Export

Task | Command | Key Flags
--- | --- | ---
Export to CSV | bq extract --destination_format=CSV ds.table gs://bucket/file.csv | --field_delimiter, --print_header
Export to JSON | bq extract --destination_format=NEWLINE_DELIMITED_JSON ds.table gs://bucket/*.json | Use wildcard for large files
Compressed export | bq extract --compression=GZIP ds.table gs://bucket/*.json.gz | GZIP, SNAPPY, or NONE
Export model | bq extract -m ds.model gs://bucket/model | For ML models
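
bq extract also supports PARQUET as a destination format; a sketch with hypothetical names:

# Columnar export for analytics workflows
bq extract --destination_format=PARQUET \
  my_dataset.sales \
  'gs://my-bucket/sales_*.parquet'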

Resource Management

Task | Command | Key Flags
--- | --- | ---
List datasets | bq ls or bq ls PROJECT: | -a for all (including hidden)
List tables | bq ls DATASET | -m for models, -a for all
List jobs | bq ls -j PROJECT | --filter='state:RUNNING,PENDING', --max_results
Show table | bq show ds.table | --schema, --format=prettyjson
Show schema only | bq show --schema ds.table | Faster than full show
Preview data | bq head ds.table | -n 100, -s 10 for offset
Show job | bq show -j JOB_ID | Check job status and details

Creating Resources

Task | Command | Key Flags
--- | --- | ---
Create dataset | bq mk DATASET | --location=us, --description
Create table | bq mk -t ds.table schema | See schema format below
Create view | bq mk --view='SELECT ...' ds.view | SQL definition
Create materialized view | bq mk --materialized_view='SELECT ...' ds.mview | Auto-refreshed
Create partitioned table | bq mk --table --time_partitioning_type=DAY ds.table schema | See partitioning below

Deleting Resources

Task | Command | Key Flags
--- | --- | ---
Delete table | bq rm ds.table | -f to skip confirmation
Delete dataset | bq rm -r DATASET | -r removes all tables, -f to force
Delete model | bq rm -m ds.model | For ML models
Cancel job | bq cancel JOB_ID | Stops a running query

Cost Estimation - ALWAYS USE DRY RUN

# ALWAYS check cost before running expensive queries
bq query --dry_run --use_legacy_sql=false 'SELECT * FROM project.dataset.huge_table'

# Output shows bytes to be processed
# Cost = (Bytes / 1TB) × $6.25 (US, on-demand pricing)
# First 1 TB per month is free

# Add safety limit (in bytes)
bq query --maximum_bytes_billed=5000000000000 'SELECT ...'  # ~5TB limit
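
A minimal sketch that turns the dry-run message into a dollar estimate. It assumes the human-readable "will process N bytes" output, which may vary across bq versions, and a hypothetical table name:

# Grab the byte count from the dry-run output and price it at $6.25/TiB
BYTES=$(bq query --dry_run --use_legacy_sql=false \
  'SELECT * FROM ds.big_table' 2>&1 | grep -Eo '[0-9]+ bytes' | grep -Eo '[0-9]+')
echo "$BYTES" | awk '{ printf "Estimated cost: $%.2f\n", $1 / 1099511627776 * 6.25 }'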

Schema Format

Inline (comma-separated):

field1:STRING,field2:INTEGER,field3:FLOAT,field4:BOOLEAN,field5:TIMESTAMP

Modes: an inline schema cannot set a column mode; every inline column defaults to NULLABLE. To declare REQUIRED or REPEATED fields, use a JSON schema file:

JSON schema file:

[
  {"name": "field1", "type": "STRING", "mode": "REQUIRED"},
  {"name": "field2", "type": "INTEGER", "mode": "NULLABLE"},
  {
    "name": "nested",
    "type": "RECORD",
    "fields": [
      {"name": "subfield", "type": "STRING"}
    ]
  }
]

Common types: STRING, INTEGER, FLOAT, BOOLEAN, TIMESTAMP, DATE, TIME, DATETIME, NUMERIC, BIGNUMERIC, BYTES, GEOGRAPHY, JSON, RECORD (nested). Arrays are expressed with mode REPEATED rather than a distinct type.
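
To use the JSON form, save it to a file and pass the path wherever a schema argument is expected (file, dataset, and table names here are hypothetical):

# Create a table from a JSON schema file
bq mk --table my_dataset.users ./schema.json

# The same file works as the schema argument for loads
bq load --source_format=NEWLINE_DELIMITED_JSON \
  my_dataset.users gs://my-bucket/users.json ./schema.json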

Partitioning and Clustering

Time-based Partitioning

# Partition by DATE/TIMESTAMP column
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=event_date \
  --time_partitioning_expiration=2592000 \
  ds.events \
  event_id:STRING,event_date:DATE,data:STRING

# Require partition filter (prevents expensive full scans)
bq mk --table \
  --time_partitioning_type=DAY \
  --require_partition_filter=true \
  ds.events \
  schema.json

Partition types: DAY, HOUR, MONTH, YEAR

Clustering

# Add clustering (up to 4 columns)
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=event_date \
  --clustering_fields=user_id,region \
  ds.events \
  schema.json

Benefits: Faster queries, lower costs when filtering/grouping by clustered columns
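
A sketch of a query shape that benefits from both (table and columns are hypothetical): the event_date filter prunes partitions, and the user_id filter leverages clustering:

# Partition pruning (event_date) plus cluster pruning (user_id)
bq query --use_legacy_sql=false \
  'SELECT event_type, COUNT(*) AS n
   FROM ds.events
   WHERE event_date BETWEEN "2025-01-01" AND "2025-01-07"
     AND user_id = "u123"
   GROUP BY event_type'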

Output Formats

# Pretty table (default)
bq query --format=pretty 'SELECT ...'

# JSON (compact)
bq query --format=json 'SELECT ...'

# Pretty JSON (readable)
bq query --format=prettyjson 'SELECT ...'

# CSV with header
bq query --format=csv 'SELECT ...'

# Sparse table (simpler)
bq query --format=sparse 'SELECT ...'
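
The json format pairs well with jq for scripting (assuming jq is installed; the query is a trivial example):

# Rows come back as a JSON array of objects
bq query --format=json --use_legacy_sql=false \
  'SELECT 1 AS x, "a" AS y' | jq '.[0].x'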

Common Workflows

Cost Estimation → Query

# 1. Estimate cost
bq query --dry_run 'SELECT ...'

# 2. Review bytes to be processed
# Calculate: (bytes / 1099511627776) × $6.25

# 3. Run query if acceptable
bq query 'SELECT ...'

# OR add safety limit
bq query --maximum_bytes_billed=10000000000000 'SELECT ...'

Load Data Pipeline

# 1. Check authentication and project
gcloud auth list
gcloud config get-value project

# 2. Create dataset if needed
bq ls | grep my_dataset || bq mk my_dataset

# 3. Load with autodetect (fast) or explicit schema (production)
bq load --autodetect --skip_leading_rows=1 \
  my_dataset.table \
  gs://bucket/data.csv

# 4. Verify
bq show my_dataset.table
bq head -n 10 my_dataset.table

Export Large Table

# 1. Export with wildcard (required for >1GB)
bq extract \
  --compression=GZIP \
  --destination_format=NEWLINE_DELIMITED_JSON \
  dataset.large_table \
  'gs://bucket/export_*.json.gz'

# 2. Verify export
gsutil ls -lh gs://bucket/export_*

# 3. Check total size
gsutil du -sh gs://bucket/

Create Partitioned Table for Scale

# 1. Create with partitioning and clustering
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=event_date \
  --clustering_fields=user_id,event_type \
  --require_partition_filter=true \
  --time_partitioning_expiration=31536000 \
  dataset.events \
  event_id:STRING,user_id:STRING,event_type:STRING,event_date:DATE,data:JSON

# 2. Load initial data
bq load --source_format=NEWLINE_DELIMITED_JSON dataset.events gs://bucket/events_*.json

# 3. Query with partition filter (required)
bq query --use_legacy_sql=false 'SELECT * FROM dataset.events WHERE event_date = "2025-01-20"'

Best Practices

Query Cost Optimization

  1. Always use --dry_run first for queries scanning >1TB
  2. Add WHERE clauses on partition columns
  3. SELECT specific columns, not SELECT *
  4. Use --maximum_bytes_billed as a safety net
  5. Partition large tables by date/timestamp
  6. Cluster by common filter columns

Loading Data

  1. Use --autodetect for quick loads, explicit schema for production
  2. Load from GCS (not local files) for large data
  3. Use appropriate source format: Parquet/Avro > CSV/JSON
  4. Batch small files into larger files before loading
  5. Set expiration on staging tables (see the sketch below)
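
A minimal sketch of expiring a staging table (the table name is hypothetical); --expiration is in seconds:

# Auto-delete the staging table after 24 hours
bq update --expiration 86400 my_dataset.staging_events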

Exporting Data

  1. Use wildcards for large exports (>1GB limit per file)
  2. Compress exports with GZIP or SNAPPY
  3. Use columnar formats (Parquet, Avro) for analytics workflows
  4. Match regions (BigQuery dataset and GCS bucket)

Partitioning

  1. Partition by date/timestamp for time-series data
  2. Use require_partition_filter to prevent expensive scans
  3. Set partition expiration to auto-delete old data
  4. Combine with clustering for additional optimization
  5. Partition types: DAY for most use cases, HOUR for high-volume

Schema Design

  1. Use REQUIRED for mandatory fields
  2. Use TIMESTAMP over STRING for timestamps
  3. Use NUMERIC for financial data (exact precision)
  4. Use JSON type for flexible nested data (Standard SQL only)
  5. Nested records better than wide tables with many columns

Common Mistakes

Mistake | Why It's Wrong | Correct Approach
--- | --- | ---
No dry run for large queries | Unexpected costs | Always bq query --dry_run first
Skipping authentication check | Commands fail | Run gcloud auth list before bq commands
SELECT * on huge tables | Scans all columns | Select only the needed columns
Loading without --skip_leading_rows | Header row becomes data | Use --skip_leading_rows=1 for CSVs with headers
Single file for large exports | 1GB limit per file | Use a wildcard: gs://bucket/file_*.json.gz
No partitioning on large tables | Expensive full scans | Use --time_partitioning_type=DAY
Relying on the legacy SQL default | Different syntax | Pass --use_legacy_sql=false (bq query still defaults to legacy SQL)
Wrong dataset reference | Ambiguous table | Use fully qualified: project.dataset.table
No compression on exports | Larger GCS storage costs | Use --compression=GZIP
Forgetting location parameter | Cross-region errors and costs | Match --location to the dataset location

Red Flags - CHECK BEFORE RUNNING

  • Running a query without --dry_run on production data
  • Using SELECT * on tables with >1TB
  • Loading data without schema validation
  • Exporting a large table without a wildcard
  • Creating an unpartitioned table for billions of rows
  • No authentication check before commands
  • Using the wrong project (check gcloud config get-value project)
  • Not matching dataset and GCS bucket regions

All of these mean: Stop, review the command, fix the issue.

Dataset and Table References

Fully qualified:

project:dataset.table
# or
project.dataset.table

Without project (uses default):

dataset.table

Partitioned table (specific date):

dataset.table$20250120

Legacy SQL (bracket notation):

[project:dataset.table]
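
The partition decorator also works as a load or delete target; a sketch with hypothetical names (quote the reference so the shell does not expand $):

# Load directly into one day's partition
bq load --source_format=CSV 'my_dataset.events$20250120' \
  gs://my-bucket/2025-01-20.csv

# Or delete just that partition
bq rm -f 'my_dataset.events$20250120'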

Job Management

# List running jobs
bq ls -j --filter='state:RUNNING'

# Show job details
bq show -j JOB_ID

# Cancel long-running job
bq cancel JOB_ID

# Wait for job completion
bq wait JOB_ID
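
A sketch of scripting around jobs: --nosync starts the job and returns immediately, and bq wait blocks until it finishes. Parsing the job ID out of the status line is an assumption that may break across bq versions:

# Start a query asynchronously, capture the job ID, then wait on it
JOB_ID=$(bq query --nosync --use_legacy_sql=false \
  'SELECT COUNT(*) FROM ds.events' | grep -Eo 'bqjob_[a-zA-Z0-9_]+')
bq wait "$JOB_ID"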

Getting Help

# General help
bq help

# Command-specific help
bq help query
bq help load
bq help extract

# Show all flags
bq --helpfull

When in doubt, check bq help <command> for exact flags and syntax.

Real-World Impact

Cost savings:

  • Dry-run prevents accidental multi-thousand dollar queries
  • Partitioning reduces scan costs by 10-100x
  • Clustering adds 20-40% additional savings

Performance:

  • Partitioning + clustering: queries 10-100x faster
  • Proper schema: faster loads and queries
  • Columnar formats (Parquet): 5-10x faster loads than CSV

Common scenarios:

  • Ad-hoc analysis: bq query --dry_run → review → bq query
  • Data pipelines: bq load from GCS → process → bq extract
  • Cost monitoring: Always dry-run before production queries
  • Scale: Partition + cluster tables with >100M rows