Claude-skill-registry bigquery-cli
Use when working with BigQuery from command line using bq tool, including querying data, loading/exporting tables, managing datasets, cost estimation with dry-run, or partitioning strategies
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/bigquery-cli" ~/.claude/skills/majiayu000-claude-skill-registry-bigquery-cli && rm -rf "$T"
skills/data/bigquery-cli/SKILL.md

BigQuery CLI (bq)
Overview
The BigQuery CLI (bq) provides command-line access to Google BigQuery for data warehousing and analytics. Core principle: always estimate costs with a dry run, use proper authentication, and leverage partitioning for scale.
When to Use
- Running SQL queries on BigQuery
- Loading data from CSV/JSON/Avro/Parquet
- Exporting table data to GCS
- Managing datasets, tables, views
- Cost estimation and optimization
- Creating partitioned/clustered tables
- Viewing schemas and data samples
When NOT to use:
- When the BigQuery UI is more appropriate (ad-hoc visual exploration)
- When using BigQuery client libraries (Python, Java, etc.)
Authentication and Project Setup
Before ANY bq operation:
```bash
# Check authentication
gcloud auth list

# Login if needed
gcloud auth login

# Set default project
gcloud config set project PROJECT_ID

# Verify current project
gcloud config get-value project
```
Command Structure
Format:
bq [--global_flags] <command> [--command_flags] [args]
Important global flags:
- `--project_id=PROJECT` - Override default project
- `--dataset_id=DATASET` - Default dataset
- `--location=LOCATION` - Geographic location (us, eu, asia-northeast1)
- `--format=FORMAT` - Output format: pretty, sparse, prettyjson, json, csv
- `--quiet` / `-q` - Suppress status updates
- `--dry_run` - Validate without executing (queries only)
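For example, global flags combine with a command like this (a minimal sketch; the project, dataset, and column names are placeholders):

```bash
# Query a specific project and location, returning CSV to stdout
bq --project_id=my-project --location=US --format=csv query \
  --use_legacy_sql=false \
  'SELECT name, value FROM my_dataset.my_table LIMIT 10'
```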
Quick Reference
Query Operations
| Task | Command | Key Flags |
|---|---|---|
| Run query | `bq query 'SELECT ...'` | `--use_legacy_sql=false`, `--format`, `--maximum_bytes_billed`, `--destination_table` |
| Estimate cost | `bq query --dry_run 'SELECT ...'` | Shows bytes to be processed |
| Save results | `bq query --destination_table=ds.table 'SELECT ...'` | `--replace`, `--append_table` |
| Parameterized query | `bq query --parameter=name:TYPE:value 'SELECT ...'` | Repeat `--parameter` for multiple |
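A sketch of a parameterized query (the dataset, column, and parameter names are illustrative):

```bash
# Named parameters use name:type:value and require standard SQL
bq query --use_legacy_sql=false \
  --parameter=min_date:DATE:2025-01-01 \
  'SELECT event_id FROM my_dataset.events WHERE event_date >= @min_date'
```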
Data Loading
| Task | Command | Key Flags |
|---|---|---|
| Load CSV | `bq load ds.table gs://bucket/file.csv` | `--skip_leading_rows=1`, `--autodetect`, `--source_format=CSV` |
| Load JSON | `bq load --source_format=NEWLINE_DELIMITED_JSON ds.table gs://bucket/file.json` | `--autodetect`, `--max_bad_records` |
| Load Parquet/Avro | `bq load --source_format=PARQUET ds.table gs://bucket/file.parquet` | Schema auto-detected |
| Replace table | `bq load --replace ds.table gs://bucket/file.csv` | Overwrites existing data |
| Append to table | `bq load --noreplace ds.table gs://bucket/file.csv` | Adds to existing data (default) |
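A minimal load sketch, assuming a CSV with a header row (bucket and table names are placeholders):

```bash
# Replace a staging table with the latest CSV export; the header row is skipped
bq load --replace --autodetect \
  --source_format=CSV --skip_leading_rows=1 \
  my_dataset.staging_orders \
  gs://my-bucket/exports/orders.csv
```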
Data Export
| Task | Command | Key Flags |
|---|---|---|
| Export to CSV | `bq extract ds.table gs://bucket/file.csv` | `--destination_format=CSV`, `--field_delimiter` |
| Export to JSON | `bq extract --destination_format=NEWLINE_DELIMITED_JSON ds.table 'gs://bucket/file_*.json'` | Use wildcard for large files |
| Compressed export | `bq extract --compression=GZIP ds.table gs://bucket/file.csv.gz` | GZIP, SNAPPY, or NONE |
| Export model | `bq extract --model ds.model gs://bucket/model/` | For ML models |
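A minimal export sketch (bucket and table names are placeholders):

```bash
# Export a table to compressed CSV shards in GCS; the wildcard allows >1GB of output
bq extract \
  --destination_format=CSV \
  --compression=GZIP \
  my_dataset.results \
  'gs://my-bucket/results/part_*.csv.gz'
```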
Resource Management
| Task | Command | Key Flags |
|---|---|---|
| List datasets | `bq ls` or `bq ls --project_id=PROJECT` | `--all` for all (including hidden) |
| List tables | `bq ls dataset` | `--models` for models, `--all` for all |
| List jobs | `bq ls -j` | `--max_results`, `--all` |
| Show table | `bq show ds.table` | `--format=prettyjson`, `--schema` |
| Show schema only | `bq show --schema ds.table` | Faster than full show |
| Preview data | `bq head -n 10 ds.table` | `-n` for row count, `-s` for offset |
| Show job | `bq show -j JOB_ID` | Check job status and details |
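A quick inspection sketch before querying an unfamiliar table (dataset and table names are placeholders):

```bash
# Print the schema as JSON, then sample a few rows
bq show --schema --format=prettyjson my_dataset.events
bq head -n 5 my_dataset.events
```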
Creating Resources
| Task | Command | Key Flags |
|---|---|---|
| Create dataset | `bq mk dataset` | `--location`, `--default_table_expiration` |
| Create table | `bq mk --table ds.table SCHEMA` | See schema format below |
| Create view | `bq mk --view='SELECT ...' ds.view` | SQL definition |
| Create materialized view | `bq mk --materialized_view='SELECT ...' ds.mv` | Auto-refreshed |
| Create partitioned table | `bq mk --table --time_partitioning_field=col ds.table SCHEMA` | See partitioning below |
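A sketch of creating a view (dataset, view, and column names are illustrative):

```bash
# Logical view over a base table, defined with standard SQL
bq mk --use_legacy_sql=false \
  --view='SELECT user_id, COUNT(*) AS events FROM my_dataset.events GROUP BY user_id' \
  my_dataset.user_event_counts
```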
Deleting Resources
| Task | Command | Key Flags |
|---|---|---|
| Delete table | `bq rm ds.table` | `-f` to skip confirmation |
| Delete dataset | `bq rm -r ds` | `-r` removes all tables, `-f` to force |
| Delete model | `bq rm --model ds.model` | For ML models |
| Cancel job | `bq cancel JOB_ID` | Stops running query |
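A deletion sketch (table and dataset names are placeholders; both commands are destructive):

```bash
# Remove a staging table without an interactive prompt
bq rm -f -t my_dataset.staging_orders

# Remove an entire dataset and everything in it (double-check the name first)
bq rm -r -f my_dataset_tmp
```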
Cost Estimation - ALWAYS USE DRY RUN
```bash
# ALWAYS check cost before running expensive queries
bq query --dry_run 'SELECT * FROM project.dataset.huge_table'

# Output shows bytes to be processed
# Cost = (Bytes / 1TB) × $6.25 (US, on-demand pricing)
# First 1 TB per month is free

# Add safety limit (in bytes)
bq query --maximum_bytes_billed=5000000000000 'SELECT ...'   # ~5TB limit
```
Schema Format
Inline (comma-separated):
field1:STRING,field2:INTEGER,field3:FLOAT,field4:BOOLEAN,field5:TIMESTAMP
With mode:
field1:STRING:REQUIRED,field2:INTEGER:NULLABLE,field3:RECORD:REPEATED
JSON file:
[ {"name": "field1", "type": "STRING", "mode": "REQUIRED"}, {"name": "field2", "type": "INTEGER", "mode": "NULLABLE"}, { "name": "nested", "type": "RECORD", "fields": [ {"name": "subfield", "type": "STRING"} ] } ]
Common types: STRING, INTEGER, FLOAT, BOOLEAN, TIMESTAMP, DATE, TIME, DATETIME, NUMERIC, BIGNUMERIC, BYTES, GEOGRAPHY, JSON, RECORD (nested), ARRAY
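Assuming the JSON schema above is saved as schema.json (the path, dataset, and table names are placeholders), it can be passed to bq mk or bq load:

```bash
# Create a table from the schema file
bq mk --table my_dataset.my_table ./schema.json

# Or load with an explicit schema instead of --autodetect
bq load --source_format=CSV --skip_leading_rows=1 \
  my_dataset.my_table gs://my-bucket/data.csv ./schema.json
```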
Partitioning and Clustering
Time-based Partitioning
```bash
# Partition by DATE/TIMESTAMP column
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=event_date \
  --time_partitioning_expiration=2592000 \
  ds.events \
  event_id:STRING,event_date:DATE,data:STRING

# Require partition filter (prevents expensive full scans)
bq mk --table \
  --time_partitioning_type=DAY \
  --require_partition_filter=true \
  ds.events \
  schema.json
```
Partition types: DAY, HOUR, MONTH, YEAR
Clustering
```bash
# Add clustering (up to 4 columns)
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=event_date \
  --clustering_fields=user_id,region \
  ds.events \
  schema.json
```
Benefits: Faster queries, lower costs when filtering/grouping by clustered columns
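A query sketch that benefits from this layout, filtering on the partition column and a clustered column (table and column names follow the examples above; the byte cap is arbitrary):

```bash
# Partition pruning (event_date) plus cluster pruning (user_id) keep scanned bytes low
bq query --use_legacy_sql=false --maximum_bytes_billed=100000000000 \
  'SELECT event_id, data
   FROM ds.events
   WHERE event_date BETWEEN "2025-01-01" AND "2025-01-07"
     AND user_id = "user_123"'
```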
Output Formats
```bash
# Pretty table (default)
bq query --format=pretty 'SELECT ...'

# JSON (compact)
bq query --format=json 'SELECT ...'

# Pretty JSON (readable)
bq query --format=prettyjson 'SELECT ...'

# CSV with header
bq query --format=csv 'SELECT ...'

# Sparse table (simpler)
bq query --format=sparse 'SELECT ...'
```
Common Workflows
Cost Estimation → Query
```bash
# 1. Estimate cost
bq query --dry_run 'SELECT ...'

# 2. Review bytes to be processed
#    Calculate: (bytes / 1099511627776) × $6.25

# 3. Run query if acceptable
bq query 'SELECT ...'

# OR add safety limit
bq query --maximum_bytes_billed=10000000000000 'SELECT ...'
```
Load Data Pipeline
```bash
# 1. Check authentication and project
gcloud auth list
gcloud config get-value project

# 2. Create dataset if needed
bq ls | grep my_dataset || bq mk my_dataset

# 3. Load with autodetect (fast) or explicit schema (production)
bq load --autodetect --skip_leading_rows=1 \
  my_dataset.table \
  gs://bucket/data.csv

# 4. Verify
bq show my_dataset.table
bq head -n 10 my_dataset.table
```
Export Large Table
```bash
# 1. Export with wildcard (required for >1GB)
bq extract \
  --compression=GZIP \
  --destination_format=NEWLINE_DELIMITED_JSON \
  dataset.large_table \
  'gs://bucket/export_*.json.gz'

# 2. Verify export
gsutil ls -lh gs://bucket/export_*

# 3. Check total size
gsutil du -sh gs://bucket/
```
Create Partitioned Table for Scale
```bash
# 1. Create with partitioning and clustering
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=event_date \
  --clustering_fields=user_id,event_type \
  --require_partition_filter=true \
  --time_partitioning_expiration=31536000 \
  dataset.events \
  event_id:STRING,user_id:STRING,event_type:STRING,event_date:DATE,data:JSON

# 2. Load initial data (newline-delimited JSON, so set the source format)
bq load --source_format=NEWLINE_DELIMITED_JSON \
  dataset.events 'gs://bucket/events_*.json'

# 3. Query with partition filter (required)
bq query 'SELECT * FROM dataset.events WHERE event_date = "2025-01-20"'
```
Best Practices
Query Cost Optimization
- Always use --dry_run first for queries scanning >1TB
- Add WHERE clauses on partition columns
- SELECT specific columns, not `SELECT *`
- Use `--maximum_bytes_billed` as a safety net (see the sketch after this list)
- Partition large tables by date/timestamp
- Cluster by common filter columns
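A before/after sketch under these guidelines (table and column names are placeholders; the byte cap is arbitrary):

```bash
# Wasteful: scans every column of every partition
bq query --dry_run 'SELECT * FROM my_dataset.events'

# Cheaper: named columns, a partition filter, and a billing cap
bq query --use_legacy_sql=false --maximum_bytes_billed=1000000000000 \
  'SELECT user_id, event_type
   FROM my_dataset.events
   WHERE event_date >= "2025-01-01"'
```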
Loading Data
- Use --autodetect for quick loads, explicit schema for production
- Load from GCS (not local files) for large data
- Use appropriate source format: Parquet/Avro > CSV/JSON
- Batch small files into larger files before loading
- Set expiration on staging tables
Exporting Data
- Use wildcards for large exports (>1GB limit per file)
- Compress exports with GZIP or SNAPPY
- Use columnar formats (Parquet, Avro) for analytics workflows
- Match regions (BigQuery dataset and GCS bucket)
Partitioning
- Partition by date/timestamp for time-series data
- Use require_partition_filter to prevent expensive scans
- Set partition expiration to auto-delete old data
- Combine with clustering for additional optimization
- Partition types: DAY for most use cases, HOUR for high-volume
Schema Design
- Use REQUIRED for mandatory fields
- Use TIMESTAMP over STRING for timestamps
- Use NUMERIC for financial data (exact precision)
- Use JSON type for flexible nested data (Standard SQL only)
- Nested records are better than wide tables with many columns (see the schema sketch below)
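A schema sketch reflecting these choices (field, dataset, and table names are illustrative):

```bash
# Write a schema following the guidelines above, then create the table with it
cat > schema.json <<'EOF'
[
  {"name": "order_id",   "type": "STRING",    "mode": "REQUIRED"},
  {"name": "created_at", "type": "TIMESTAMP", "mode": "REQUIRED"},
  {"name": "amount",     "type": "NUMERIC",   "mode": "NULLABLE"},
  {"name": "attributes", "type": "JSON",      "mode": "NULLABLE"},
  {"name": "items", "type": "RECORD", "mode": "REPEATED",
   "fields": [{"name": "sku", "type": "STRING"}, {"name": "qty", "type": "INTEGER"}]}
]
EOF
bq mk --table my_dataset.orders ./schema.json
```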
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| No dry-run for large queries | Unexpected costs | Always `--dry_run` first |
| Skipping authentication check | Commands fail | Run `gcloud auth list` before bq commands |
| SELECT * on huge tables | Scans all columns | Select only needed columns |
| Loading without --skip_leading_rows | Header becomes data | Use `--skip_leading_rows=1` for CSVs with headers |
| Single file for large exports | 1GB limit per file | Use wildcard: `gs://bucket/export_*.csv` |
| No partitioning on large tables | Expensive full scans | Use `--time_partitioning_field` |
| Legacy SQL (default in old versions) | Different syntax | Use `--use_legacy_sql=false` (or omit, it's the default now) |
| Wrong dataset reference | Ambiguous table | Use fully qualified: `project.dataset.table` |
| No compression on exports | Larger GCS storage costs | Use `--compression=GZIP` |
| Forgetting location parameter | Cross-region costs | Match `--location` to dataset location |
Red Flags - CHECK BEFORE RUNNING
- Running a query without `--dry_run` on production data
- Using `SELECT *` on tables with >1TB
- Loading data without schema validation
- Exporting a large table without a wildcard
- Creating an unpartitioned table for billions of rows
- No authentication check before commands
- Using the wrong project (check `gcloud config get-value project`)
- Not matching dataset and GCS bucket regions
All of these mean: Stop, review the command, fix the issue.
Dataset and Table References
Fully qualified:
project:dataset.table # or project.dataset.table
Without project (uses default):
dataset.table
Partitioned table (specific date):
dataset.table$20250120
Legacy SQL (bracket notation):
[project:dataset.table]
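A sketch of the partition decorator in practice; single quotes keep the shell from expanding `$` (dataset, table, and bucket names are placeholders):

```bash
# Overwrite a single day's partition during a backfill
bq load --replace 'my_dataset.events$20250120' gs://my-bucket/events_20250120.csv

# Delete just that partition
bq rm 'my_dataset.events$20250120'
```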
Job Management
```bash
# List running jobs
bq ls -j --filter='state:RUNNING'

# Show job details
bq show -j JOB_ID

# Cancel long-running job
bq cancel JOB_ID

# Wait for job completion
bq wait JOB_ID
```
Getting Help
```bash
# General help
bq help

# Command-specific help
bq help query
bq help load
bq help extract

# Show all flags
bq --helpfull
```
When in doubt, check `bq help <command>` for exact flags and syntax.
Real-World Impact
Cost savings:
- Dry-run prevents accidental multi-thousand dollar queries
- Partitioning reduces scan costs by 10-100x
- Clustering adds 20-40% additional savings
Performance:
- Partitioning + clustering: queries 10-100x faster
- Proper schema: faster loads and queries
- Columnar formats (Parquet): 5-10x faster loads than CSV
Common scenarios:
- Ad-hoc analysis: `bq query --dry_run` → review → `bq query`
- Data pipelines: `bq load` from GCS → process → `bq extract`
- Cost monitoring: Always dry-run before production queries
- Scale: Partition + cluster tables with >100M rows