Claude-skill-registry gcp-bq-table-management
Use when creating BigQuery tables, implementing partitioning or clustering, managing table schemas, or optimizing table structure. Covers time-based partitioning, range partitioning, clustering strategies, DDL commands, and table configuration.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/gcp-bq-table-management" ~/.claude/skills/majiayu000-claude-skill-registry-gcp-bq-table-management && rm -rf "$T"
skills/data/gcp-bq-table-management/SKILL.md

BigQuery Table Management
Use this skill when creating, modifying, or optimizing BigQuery table structures with partitioning and clustering.
Creating Tables
Basic Table Creation
Using bq mk:
bq mk -t \
  --schema 'customer_id:STRING,amount:FLOAT,date:DATE' \
  --description "Customer orders table" \
  project:dataset.orders
Using SQL DDL:
CREATE TABLE `project.dataset.orders` (
  customer_id STRING,
  amount FLOAT64,
  date DATE,
  created_at TIMESTAMP
);
Partitioning Strategies
Time-Based Partitioning
Create time-partitioned table:
bq mk -t \
  --schema 'timestamp:TIMESTAMP,customer_id:STRING,amount:FLOAT' \
  --time_partitioning_field timestamp \
  --time_partitioning_type DAY \
  project:dataset.orders
SQL DDL version:
CREATE TABLE `project.dataset.orders` (
  timestamp TIMESTAMP,
  customer_id STRING,
  amount FLOAT64
)
PARTITION BY DATE(timestamp);
Partitioning options:
- DAY - Daily partitions (most common)
- HOUR - Hourly partitions (for high-volume data)
- MONTH - Monthly partitions (for historical data)
- YEAR - Yearly partitions (for very old data)
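Granularities other than DAY are chosen in DDL with TIMESTAMP_TRUNC (or DATE_TRUNC for DATE columns). A minimal sketch using monthly partitions on the same illustrative orders table:

CREATE TABLE `project.dataset.orders_monthly` (
  timestamp TIMESTAMP,
  customer_id STRING,
  amount FLOAT64
)
-- One partition per calendar month instead of per day
PARTITION BY TIMESTAMP_TRUNC(timestamp, MONTH);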
Ingestion-Time Partitioning
Create table with automatic _PARTITIONTIME:
bq mk -t \
  --schema 'customer_id:STRING,amount:FLOAT' \
  --time_partitioning_type DAY \
  project:dataset.orders
Query with ingestion-time partition:
SELECT *
FROM `project.dataset.orders`
WHERE _PARTITIONTIME >= '2024-01-01'
Range Partitioning
Create range-partitioned table:
bq mk -t \
  --schema 'customer_id:INTEGER,region:STRING,sales:FLOAT' \
  --range_partitioning=customer_id,0,100,10 \
  project:dataset.sales
Parameters:
field,start,end,interval
- Creates partitions: [0,10), [10,20), [20,30), ..., [90,100)
SQL DDL version:
CREATE TABLE `project.dataset.sales` (
  customer_id INT64,
  region STRING,
  sales FLOAT64
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10));
Clustering
Basic Clustering
Create clustered table:
bq mk -t \
  --schema 'timestamp:TIMESTAMP,customer_id:STRING,product_id:STRING,amount:FLOAT' \
  --clustering_fields customer_id,product_id \
  project:dataset.orders
SQL DDL version:
CREATE TABLE `project.dataset.orders` (
  timestamp TIMESTAMP,
  customer_id STRING,
  product_id STRING,
  amount FLOAT64
)
CLUSTER BY customer_id, product_id;
Clustering rules:
- Up to 4 clustering columns
- Order matters (first column has most impact)
- Works best with WHERE, GROUP BY, JOIN filters
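Because pruning works left to right through the clustering columns, queries should filter on the leading column first. A sketch against the clustered orders table above:

-- Filtering on the 1st clustering column lets BigQuery skip
-- non-matching blocks; adding the 2nd narrows further
SELECT SUM(amount)
FROM `project.dataset.orders`
WHERE customer_id = 'CUST123'    -- 1st clustering column
  AND product_id = 'PROD456';    -- 2nd clustering column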
Partitioning + Clustering (Recommended)
Combined approach:
bq mk -t \
  --schema 'timestamp:TIMESTAMP,customer_id:STRING,transaction_amount:FLOAT' \
  --time_partitioning_field timestamp \
  --clustering_fields customer_id \
  --description "Partitioned by day, clustered by customer" \
  project:dataset.transactions
SQL DDL version:
CREATE TABLE `project.dataset.transactions` (
  timestamp TIMESTAMP,
  customer_id STRING,
  transaction_amount FLOAT64
)
PARTITION BY DATE(timestamp)
CLUSTER BY customer_id;
Query benefits:
-- Partition pruning + clustering optimization
SELECT *
FROM `project.dataset.transactions`
WHERE DATE(timestamp) = '2024-01-15'  -- Partition filter
  AND customer_id = 'CUST123'         -- Cluster filter
Table Configuration Options
Expiration
Set table expiration:
bq mk -t \
  --expiration 2592000 \
  --schema 'field:TYPE' \
  project:dataset.temp_table
Expiration in seconds: 2592000 = 30 days
Update existing table:
bq update --expiration 604800 project:dataset.table
Remove expiration:
bq update --expiration 0 project:dataset.table
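Expiration can also be set in DDL; the expiration_timestamp table option takes an absolute time, and setting it to NULL removes it. A sketch:

-- Expire this table 7 days from now
ALTER TABLE `project.dataset.table`
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);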
Labels
Add labels:
bq mk -t \
  --schema 'field:TYPE' \
  --label environment:production \
  --label team:analytics \
  project:dataset.table
Update labels:
bq update --set_label environment:staging project:dataset.table
Description
Set description:
bq update \
  --description "Customer transaction history with daily partitioning" \
  project:dataset.transactions
Schema Management
Adding Columns
You cannot add REQUIRED columns to an existing table; new columns must be NULLABLE (or REPEATED):

# Add a NULLABLE column
bq query --use_legacy_sql=false \
  'ALTER TABLE `project.dataset.table` ADD COLUMN new_field STRING'
Changing Column Modes
REQUIRED → NULLABLE (allowed):
ALTER TABLE `project.dataset.table`
ALTER COLUMN field_name DROP NOT NULL;
NULLABLE → REQUIRED (NOT allowed if data exists)
Relaxing Column Types
Allowed type changes:
- INT64 → FLOAT64 ✅
- INT64 → NUMERIC ✅
- INT64 → BIGNUMERIC ✅
- NUMERIC → BIGNUMERIC, FLOAT64 ✅

INT64 → STRING is not a supported conversion; that change requires rebuilding the table.
Example:
ALTER TABLE `project.dataset.table`
ALTER COLUMN amount SET DATA TYPE FLOAT64;
External Tables
Create External Table (GCS)
CSV in GCS:
bq mk \
  --external_table_definition='customer_id:STRING,amount:FLOAT@CSV=gs://bucket/*.csv' \
  project:dataset.external_orders
Parquet in GCS (schema auto-detected):
bq mk \
  --external_table_definition='@PARQUET=gs://bucket/*.parquet' \
  project:dataset.external_data
Supported formats: CSV, JSON, AVRO, PARQUET, ORC
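External tables can also be defined in SQL DDL, which sidesteps the flag syntax entirely (bucket path is illustrative):

CREATE EXTERNAL TABLE `project.dataset.external_orders` (
  customer_id STRING,
  amount FLOAT64
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket/*.csv']
);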
External Table Limitations
- No DML operations (INSERT, UPDATE, DELETE)
- No guaranteed performance SLAs
- Data must be in GCS, Drive, or Bigtable
- Cannot be partitioned (but can use hive partitioning)
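Hive partitioning means the GCS layout itself encodes partitions, e.g. gs://bucket/data/dt=2024-01-15/file.parquet. A sketch of a hive-partitioned external table (bucket layout is illustrative):

CREATE EXTERNAL TABLE `project.dataset.external_hive`
WITH PARTITION COLUMNS (dt DATE)  -- inferred from the dt= path segments
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/data/*'],
  hive_partition_uri_prefix = 'gs://bucket/data'
);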
Snapshots and Clones
Table Snapshots
Create snapshot:
CREATE SNAPSHOT TABLE `project.dataset.orders_snapshot`
CLONE `project.dataset.orders`;
Restore from snapshot:
CREATE OR REPLACE TABLE `project.dataset.orders`
CLONE `project.dataset.orders_snapshot`;
Snapshot retention: snapshots persist until you delete them or until an expiration you set (or the dataset's default table expiration) is reached
Table Clones
Create table clone:
CREATE TABLE `project.dataset.orders_clone`
CLONE `project.dataset.orders`;
Difference from snapshot:
- Clone = new independent table
- Snapshot = point-in-time reference
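Both forms accept FOR SYSTEM_TIME AS OF, so a clone or snapshot can capture a past state within the time travel window. A sketch:

-- Clone the table as it existed 24 hours ago
CREATE TABLE `project.dataset.orders_clone_yesterday`
CLONE `project.dataset.orders`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);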
Time Travel
Query historical data:
-- Query table as it was 1 hour ago
SELECT *
FROM `project.dataset.orders`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- Query table at a specific time
SELECT *
FROM `project.dataset.orders`
FOR SYSTEM_TIME AS OF '2024-01-15 10:00:00 UTC';
Time travel window: 7 days (168 hours) by default, configurable per dataset between 2 and 7 days
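Time travel also works from the command line via bq snapshot decorators, where a negative @ value is an offset in milliseconds before now (destination table name is illustrative):

# Copy the table as it was one hour ago (3,600,000 ms)
bq cp project:dataset.orders@-3600000 project:dataset.orders_restored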
Best Practices
Partition Selection
Use time-based partitioning when:
- Data has timestamp/date column
- Queries filter by time ranges
- Data arrives chronologically
- Want automatic partition management
Use range partitioning when:
- Partitioning on integer column (ID, age, etc.)
- Predictable value distribution
- Fixed range boundaries
Use ingestion-time partitioning when:
- No natural timestamp column
- Loading data from streaming sources
- Want simple partition management
Clustering Selection
Cluster on columns that are:
- Frequently used in WHERE clauses
- Used in JOIN conditions
- Used in GROUP BY
- High cardinality (many distinct values)
Order matters:
- Most filtered column first
- Then second most filtered
- Up to 4 columns total
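To verify that a filter actually prunes, a dry run reports how many bytes the query would scan without executing it:

bq query --use_legacy_sql=false --dry_run \
  'SELECT SUM(amount) FROM `project.dataset.orders` WHERE customer_id = "CUST123"'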
Partitioning + Clustering Strategy
Optimal pattern:
CREATE TABLE `project.dataset.optimized` (
  event_timestamp TIMESTAMP,   -- Partition on this
  customer_id STRING,          -- Cluster on this (1st)
  product_category STRING,     -- Cluster on this (2nd)
  amount FLOAT64
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY customer_id, product_category;
Query pattern:
-- Both partition and cluster filters apply
SELECT product_category, SUM(amount) AS total
FROM `project.dataset.optimized`
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
  AND customer_id = 'CUST123'
GROUP BY product_category;
Checking Table Metadata
Get table information:
bq show --format=prettyjson project:dataset.table
Check partition info:
SELECT
  partition_id,
  total_rows,
  total_logical_bytes,
  last_modified_time
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'orders'
ORDER BY partition_id DESC
LIMIT 10;
Check clustering info:
SELECT
  table_name,
  clustering_ordinal_position,
  column_name
FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE clustering_ordinal_position IS NOT NULL
ORDER BY table_name, clustering_ordinal_position;
Common Pitfalls
❌ Too many partitions
Problem: Creating 100,000+ partitions
Limit: 10,000 partitions per table
Solution: Use larger partition granularity (MONTH instead of DAY)
❌ Wrong partition column
Problem: Partitioning on a column not used in queries
Solution: Partition on columns used in WHERE clauses
❌ Not filtering on partition
Problem: Query scans all partitions
Solution: Always include a partition filter in WHERE
❌ Clustering too many columns
Problem: Clustering on 5+ columns
Limit: 4 columns maximum
Solution: Choose the most selective columns
❌ Wrong cluster order
Problem: Least selective column first
Solution: Put the most selective column first
Table Maintenance
Update partition expiration
Set partition expiration:
bq update \
  --time_partitioning_expiration 2592000 \
  project:dataset.partitioned_table
With this setting, partitions older than 30 days (2,592,000 seconds) are deleted automatically.
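The DDL equivalent takes days rather than seconds:

ALTER TABLE `project.dataset.partitioned_table`
SET OPTIONS (partition_expiration_days = 30);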
Optimize table storage
No manual optimization step is needed:

-- BigQuery automatically optimizes storage
-- No manual VACUUM or OPTIMIZE needed
BigQuery automatically:
- Compacts data
- Sorts by clustering columns
- Removes deleted rows
- Optimizes storage format
Access Control & Security
Row-Level Security
Row-level access policies filter data based on user/group membership. They coexist with column-level security.
Creating row-level policies:
CREATE ROW ACCESS POLICY policy_name
ON dataset.table
GRANT TO ("user:[email protected]")
FILTER USING (region = "US");
Multiple policies:
-- Policy for US users
CREATE ROW ACCESS POLICY us_users_policy
ON dataset.orders
GRANT TO ("group:[email protected]")
FILTER USING (region = "US");

-- Policy for managers (see all regions)
CREATE ROW ACCESS POLICY managers_policy
ON dataset.orders
GRANT TO ("group:[email protected]")
FILTER USING (TRUE);
Viewing policies:
SELECT *
FROM dataset.INFORMATION_SCHEMA.ROW_ACCESS_POLICIES
WHERE table_name = 'orders';
Dropping policies:
DROP ROW ACCESS POLICY policy_name ON dataset.table;
Column-Level Security
Use policy tags from Data Catalog to restrict access to sensitive columns:
Creating table with policy tags:
CREATE TABLE dataset.customers (
  customer_id STRING,
  name STRING,
  email STRING,
  ssn STRING OPTIONS(
    policy_tags=("projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/PII_TAG")
  ),
  credit_score INT64 OPTIONS(
    policy_tags=("projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/SENSITIVE_TAG")
  )
);
Adding policy tags to existing columns:
ALTER TABLE dataset.customers
ALTER COLUMN ssn SET OPTIONS(
  policy_tags=("projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/PII_TAG")
);
How it works:
- Create taxonomy and policy tags in Data Catalog
- Apply policy tags to table columns
- Grant IAM roles on policy tags (datacatalog.categoryFineGrainedReader)
- Users without permission cannot query those columns
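The effect on queries, sketched against the illustrative customers table above:

-- Succeeds for any user with table access: no policy tags on these columns
SELECT customer_id, name FROM dataset.customers;

-- Denied unless the user holds Fine-Grained Reader on the PII tag
SELECT ssn FROM dataset.customers;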
Authorized Views
Views that allow users to query data without direct table access:
Use cases:
- Sharing specific columns/rows without full table access
- Implementing business logic in access control
- Best performance for row/column filtering
Setup process:
-- 1. Create the view in a separate dataset
CREATE VIEW datasetB.public_orders AS
SELECT order_id, customer_id, amount, order_date
FROM datasetA.orders
WHERE status = 'completed';

-- 2. Authorize the view on the source dataset (datasetA > Sharing >
--    Authorized views in the Cloud Console), or edit the dataset ACL
--    with the CLI:
--    bq show --format=prettyjson datasetA > /tmp/acl.json
--    (add the view under "access", then)
--    bq update --source /tmp/acl.json datasetA
Example authorized view:
-- View in shared_views dataset
CREATE VIEW shared_views.customer_summary AS
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(amount) AS total_spent
FROM private_data.orders
GROUP BY customer_id;

-- Grant access to the view (not the underlying table):
-- users can query shared_views.customer_summary
-- but cannot access private_data.orders
Benefits:
- Row/column filtering without policy overhead
- Business logic in SQL (e.g., only show completed orders)
- Best query performance
- Centralized access control
Security Best Practices
1. Layered security:
- Use row-level policies for user-based filtering
- Use column-level security for sensitive data (PII, PHI)
- Use authorized views for complex access patterns
2. Performance:
- Authorized views: Best performance (compiled into query)
- Row-level policies: Slight overhead (filter applied)
- Column-level: No performance impact
3. Combining approaches:
-- Table with column-level security AND row-level policy
CREATE TABLE dataset.sensitive_data (
  user_id STRING,
  region STRING,
  created_at TIMESTAMP,   -- needed for the PARTITION BY below
  ssn STRING OPTIONS(policy_tags=("...")),
  data JSON
)
PARTITION BY DATE(created_at);

-- Row-level policy
CREATE ROW ACCESS POLICY regional_access
ON dataset.sensitive_data
GRANT TO ("group:[email protected]")
FILTER USING (region = "US");
4. Auditing: Monitor access with Cloud Audit Logs (assumes a log sink exporting data-access logs to BigQuery):
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS principal_email,
  protopayload_auditlog.resourceName AS resource_name,
  protopayload_auditlog.methodName AS method_name
FROM `PROJECT.DATASET.cloudaudit_googleapis_com_data_access_*`
ORDER BY timestamp DESC;
Quick Reference
Partition types:
- Time-based: HOUR, DAY, MONTH, YEAR
- Ingestion-time: Automatic _PARTITIONTIME
- Range: Integer column ranges
Clustering:
- Max 4 columns
- Order matters
- Works with or without partitioning
Security:
- Row-level: Filter by user/group
- Column-level: Policy tags for sensitive data
- Authorized views: Business logic filtering
Limits:
- 10,000 partitions per table
- 4 clustering columns
- 7-day time travel window
- 10,000 columns per table
- 100 row-level policies per table