# BigQuery ETL Core

`skills/data/bigquery-etl-core/SKILL.md` from the claude-skill-registry.

The core skill for working within the bigquery-etl repository. Use this skill to understand project structure, conventions, and common patterns.

Install:

```shell
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry

# Or install only this skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/bigquery-etl-core" ~/.claude/skills/majiayu000-claude-skill-registry-bigquery-etl-core && rm -rf "$T"
```

**Composable**: Foundation skill that works with the model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.

**When to use**: Understanding project structure, conventions, common patterns, and finding schema descriptions for construction.
## Project Overview

The bigquery-etl project manages BigQuery table definitions, queries, and associated metadata for Mozilla. Similar to dbt, the repository maintains query definitions with associated metadata and schemas.

Each table/query typically consists of three files:

- `query.sql` OR `query.py` - The query definition (SQL or Python)
- `metadata.yaml` - Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)
- `schema.yaml` - BigQuery schema definition with field types and descriptions (see metadata-manager skill)

Note: Most tables use `query.sql` (~95%). Use `query.py` for API calls, multi-project queries, or complex Python operations. See the query-writer skill for details.
## 🚨 REQUIRED READING - Start Here

When starting work in bigquery-etl, READ these foundational references:

- **Naming Conventions**: READ `references/naming_conventions.md`
  - Table naming patterns
  - Dataset organization
  - Version suffix conventions
- **Dataset Organization**: READ `references/dataset_naming_conventions.md`
  - Common dataset suffixes (`_derived`, `_stable`, `_live`)
  - When to use each dataset type
  - Dataset naming rules
- **Schema Resources**: READ `references/discovery_resources.md`
  - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
  - Priority order for schema lookup during construction
  - Common mozfun UDFs
- **Privacy Guidelines**: READ `references/privacy_guidelines.md`
  - Data handling requirements
  - PII considerations
  - Workgroup access patterns
## Directory Structure

```
sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml
```

See `assets/directory_structure_example.txt` for detailed examples.

Key principles:

- Always flat: `sql/{project}/{dataset}/{table_name}/` - never use subdirectories within table directories
- Table names always include a version suffix (`_v1`, `_v2`, etc.)
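The layout rules above can be checked mechanically. A minimal sketch, assuming the flat layout and `_vN` suffix described here; the regex and function name are illustrative, not part of the repo's tooling:

```python
import re

# Flat layout: sql/{project}/{dataset}/{table_name}/, where the table
# name must end in a version suffix like _v1, _v2, ...
TABLE_PATH = re.compile(
    r"^sql/(?P<project>[\w-]+)/(?P<dataset>\w+)/(?P<table>\w+_v\d+)/$"
)

def is_valid_table_path(path: str) -> bool:
    """Return True if `path` follows the flat sql/{project}/{dataset}/{table}_vN/ layout."""
    return TABLE_PATH.match(path) is not None

print(is_valid_table_path("sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/"))  # True
print(is_valid_table_path("sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily/"))     # False: no version suffix
print(is_valid_table_path("sql/proj/dataset/sub/dir/table_v1/"))                               # False: nested
```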
## Schema & Description Resources for Construction

### Finding Schema Descriptions

Priority order for schema lookup during construction:

1. **Local files first**: Check `schema.yaml` and `metadata.yaml` files under `sql/*/`
   - Most reliable and up-to-date source
   - Contains field descriptions written by table owners
2. **Glean Dictionary**: For `_live` and `_stable` tables
   - URL: https://dictionary.telemetry.mozilla.org/
   - Contains metric descriptions from Glean schema definitions
   - Use WebFetch with targeted prompts to extract specific field descriptions
3. **ProbeInfo API**: For Glean metric metadata
   - Endpoints: `https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics`
   - Provides metric definitions and descriptions programmatically
   - Use for validating metric references in queries
4. **DataHub MCP**: Only as a last resort
   - MUST READ `references/datahub_best_practices.md` BEFORE any DataHub queries
   - Use for schema lookup when not available in local files or the Glean Dictionary
   - Extract ONLY necessary fields (column names, types, descriptions)
   - Use for downstream impact analysis when modifying tables

See `references/discovery_resources.md` for:

- Detailed guidance on each schema source
- ProbeInfo API endpoints and usage patterns
- Glean Dictionary URL patterns for different products
- DataHub MCP best practices for construction
- Common mozfun UDFs
- Key documentation links
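The priority order above is a simple fall-through: try each source in turn and stop at the first hit. A sketch of that logic, where the lookup dictionaries are hypothetical stand-ins for the real sources (local files, Glean Dictionary, ProbeInfo, DataHub):

```python
from typing import Callable, Optional

def resolve_description(
    field: str,
    sources: list[tuple[str, Callable[[str], Optional[str]]]],
) -> tuple[str, str]:
    """Try each (name, lookup) pair in priority order; return (source, description)."""
    for name, lookup in sources:
        desc = lookup(field)
        if desc is not None:
            return name, desc
    return "none", ""

# Hypothetical lookups standing in for the real sources.
local = {"client_id": "Unique client identifier."}
glean = {"client_id": "...", "app_version": "Application version string."}

sources = [
    ("local schema.yaml", local.get),   # 1. local files first
    ("Glean Dictionary", glean.get),    # 2. then Glean Dictionary
    ("ProbeInfo API", lambda f: None),  # 3. then ProbeInfo (stubbed out here)
    ("DataHub MCP", lambda f: None),    # 4. DataHub only as a last resort
]

print(resolve_description("client_id", sources))    # ('local schema.yaml', 'Unique client identifier.')
print(resolve_description("app_version", sources))  # ('Glean Dictionary', 'Application version string.')
```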
## Naming Conventions

Table Names:

- Use snake_case with a version suffix: `clients_daily_event_v1`
- Common suffixes: `_daily`, `_hourly`, `_aggregates`, `_summary`

Field Names:

- Use snake_case: `submission_date`, `client_id`, `n_total_events`
- Prefix counts with `n_`: `n_events`, `n_sessions`
- Standard Mozilla fields: `submission_date`, `client_id`, `sample_id`, `normalized_channel`, `normalized_country_code`, `app_version`

See `references/naming_conventions.md` for:

- Complete naming patterns and conventions
- Reserved/common patterns to avoid
- BigQuery project naming conventions
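A quick lint for the field-name rules above (snake_case, `n_` prefix for counts). The heuristics and function name are illustrative only, not repo tooling:

```python
import re

# Lowercase letter first, then lowercase letters, digits, or underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def field_name_issues(name: str, is_count: bool = False) -> list[str]:
    """Return a list of convention violations for a proposed field name."""
    issues = []
    if not SNAKE_CASE.match(name):
        issues.append("not snake_case")
    if is_count and not name.startswith("n_"):
        issues.append("count fields should be prefixed with n_")
    return issues

print(field_name_issues("submission_date"))             # []
print(field_name_issues("SubmissionDate"))              # ['not snake_case']
print(field_name_issues("total_events", is_count=True)) # ['count fields should be prefixed with n_']
```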
## Dataset Organization

See `references/dataset_naming_conventions.md` for:

- Dataset naming patterns by suffix (`_derived`, `_external`, etc.)
- Common dataset prefixes by product/source
- Table versioning patterns
- Incremental vs full refresh query patterns
## Privacy & Data Handling

Mozilla follows strict data privacy policies:

- No PII in derived tables
- Use client-level identifiers (`client_id`), not individual identifiers
- Respect data retention policies (~2 years for client-level data)
- Label client-level tables with `table_type: client_level` in metadata.yaml

See `references/privacy_guidelines.md` for:

- Key principles from Mozilla's data platform
- Geo IP lookup and user agent parsing policies
- Best practices for data handling
- Deletion request support
- Sample ID usage for sampling
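`sample_id` enables deterministic sampling: it is a bucket in 0-99 derived from `client_id`, so filtering on a single bucket yields a stable ~1% sample of clients. A sketch of the idea; using CRC32 here is an assumption for illustration, so check the privacy guidelines for the exact derivation:

```python
import zlib

def sample_bucket(client_id: str) -> int:
    """Map a client_id to a stable bucket in 0..99 (illustrative hashing)."""
    return zlib.crc32(client_id.encode("utf-8")) % 100

# Filtering WHERE sample_id = 0 keeps a deterministic ~1% of clients:
clients = [f"client-{i}" for i in range(10_000)]
one_percent = [c for c in clients if sample_bucket(c) == 0]
print(len(one_percent) / len(clients))  # roughly 0.01
```

Because the bucket is a pure function of `client_id`, the same clients land in the sample every day, which keeps longitudinal analyses consistent.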
## BigQuery & Mozilla Conventions

### Partitioning & Clustering

- Most tables use day partitioning on `submission_date`
- Clustering improves query performance for filtered/joined fields
- See the metadata-manager skill for detailed partitioning and clustering configuration

### Common UDFs (mozfun)

Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

Common functions:

- `mozfun.map.get_key()` - Extract values from key-value maps
- `mozfun.norm.truncate_version()` - Normalize version strings
- `mozfun.stats.mode_last()` - Statistical mode calculation

UDF source code lives in the `sql/mozfun/` directory.
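Map-like fields in these tables are arrays of `(key, value)` structs, and `mozfun.map.get_key(map, k)` extracts the value for a key, or NULL if absent. A Python model of that behavior (a sketch of the semantics, not the UDF itself; consult the mozfun docs for edge cases like duplicate keys):

```python
from typing import Optional

def map_get_key(kv_pairs: list[tuple[str, str]], key: str) -> Optional[str]:
    """Model of mozfun.map.get_key: value for `key` in an array of (key, value) structs, else None."""
    for k, v in kv_pairs:
        if k == key:
            return v
    return None  # corresponds to SQL NULL

experiments = [("search-test", "treatment"), ("newtab-test", "control")]
print(map_get_key(experiments, "search-test"))  # treatment
print(map_get_key(experiments, "missing"))      # None
```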
## Glean Overview

Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

Key concepts:

- Metric types: counter, boolean, string, event, etc.
- Pings: Collections of metrics (e.g., `baseline`, `events`, `metrics`)
- Applications: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

Common Glean datasets in BigQuery:

- Pattern: `{app_id}.{ping_name}` (e.g., `org_mozilla_fenix.baseline`)
- All have auto-generated schemas based on metric definitions

See `references/glean_overview.md` for:

- What Glean is and how it differs from Firefox Desktop Telemetry
- Glean SDK and metric type details
- Common Glean datasets in BigQuery
- When to use the Glean Dictionary
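The `{app_id}.{ping_name}` pattern follows from the Glean application id, with dots and hyphens normalized to underscores for the BigQuery dataset name. A sketch of that mapping (the normalization rule here is an assumption for illustration; check the generated datasets for authoritative names):

```python
def glean_table(app_id: str, ping_name: str) -> str:
    """Derive the BigQuery dataset.table name for a Glean ping (illustrative normalization)."""
    dataset = app_id.replace(".", "_").replace("-", "_")
    return f"{dataset}.{ping_name}"

print(glean_table("org.mozilla.fenix", "baseline"))  # org_mozilla_fenix.baseline
```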
## bigquery-etl CLI Commands

See `references/bqetl_cli_commands.md` for:

- Key bqetl CLI commands for query creation, validation, and schema updates
- How to find the right DAG for scheduling
- Backfill creation commands
## Best Practices

General principles:

- Always include field descriptions in schema.yaml (see metadata-manager skill)
- Add header comments explaining query purpose (see query-writer skill)
- Reference bug/ticket numbers for context
- Document any data exclusions or filtering logic

See `assets/query_structure_example.sql` for the standard query structure.

Version migration:

- Create a new `_v2` table when making breaking schema changes
- Keep `_v1` running during the migration period
- Update views to point to the new version
- Coordinate with downstream consumers before deprecating the old version

For detailed best practices, see:

- Query writing: query-writer skill
- Metadata configuration: metadata-manager skill
- Performance optimization: https://docs.telemetry.mozilla.org/cookbooks/bigquery/optimization.html
- Recommended practices: https://mozilla.github.io/bigquery-etl/reference/recommended_practices/
## Integration with Other Skills

bigquery-etl-core serves as the foundation skill that other skills build upon:

### Works with model-requirements

- Provides naming conventions for new tables and datasets
- Supplies common field naming patterns for requirements gathering
- Offers privacy guidelines for data model planning

### Works with query-writer

- Provides project structure and naming conventions
- Supplies common patterns and mozfun UDF references
- Offers schema description lookup guidance for construction

### Works with metadata-manager

- Provides DAG naming patterns and scheduling conventions
- Supplies partitioning and clustering best practices
- Offers ownership and labeling patterns

### Works with sql-test-generator

- Provides test structure and fixture naming conventions
- Supplies common table patterns for test creation
- Offers query parameter conventions

### Works with bigconfig-generator

- Provides table naming conventions for Bigeye monitoring configuration
- Supplies dataset organization patterns
- Offers field naming standards for data quality checks
This skill is always available and does not need to be explicitly invoked - it provides foundational knowledge that other skills reference.
## Reference Examples

Real query examples in the repository:

- Simple query: `sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql`
- Aggregation with GROUP BY: `sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql`
- Complex query with CTEs: `sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql`
- Python ETL (INFORMATION_SCHEMA): `sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py`
- Python ETL (External API): `sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py`

For more examples, explore the `sql/moz-fx-data-shared-prod/` directory.
## Bundled Resources

### References

- `references/discovery_resources.md` - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
- `references/naming_conventions.md` - Complete naming patterns for tables, fields, and projects
- `references/dataset_naming_conventions.md` - Dataset organization and versioning patterns
- `references/privacy_guidelines.md` - Mozilla data privacy policies and best practices
- `references/glean_overview.md` - Glean SDK concepts and BigQuery dataset structures
- `references/bqetl_cli_commands.md` - Key CLI commands and DAG discovery

### DataHub Usage (CRITICAL for Token Efficiency)

BEFORE using any DataHub MCP tools (`mcp__datahub-cloud__*`), you MUST:

1. READ `references/datahub_best_practices.md` - Comprehensive token optimization strategies
2. Follow the priority order: local files → documentation → DataHub (only as a last resort)
3. Use search-first patterns and extract minimal fields from responses

### Assets

- `assets/query_structure_example.sql` - Standard query.sql structure with common patterns
- `assets/directory_structure_example.txt` - File organization examples