Claude-skill-registry · bigquery-etl-core

The core skill for working within the bigquery-etl repository. Use this skill to understand project structure, conventions, and common patterns. Works with the model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.

Install

Source · Clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/bigquery-etl-core" ~/.claude/skills/majiayu000-claude-skill-registry-bigquery-etl-core && rm -rf "$T"

Manifest: skills/data/bigquery-etl-core/SKILL.md

Source content

BigQuery ETL Core

Composable: Foundation skill that works with the model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.

When to use: Understanding project structure, conventions, and common patterns, and finding schema descriptions for construction.

Project Overview

The bigquery-etl project manages BigQuery table definitions, queries, and associated metadata for Mozilla. Similar to dbt, the repository maintains query definitions with associated metadata and schemas.

Each table/query typically consists of three files:

  • query.sql OR query.py - The query definition (SQL or Python)
  • metadata.yaml - Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)
  • schema.yaml - BigQuery schema definition with field types and descriptions (see metadata-manager skill)

Note: Most tables (~95%) use query.sql. Use query.py for API calls, multi-project queries, or complex Python operations. See the query-writer skill for details.

🚨 REQUIRED READING - Start Here

When starting work in bigquery-etl, READ these foundational references:

  1. Naming Conventions: READ references/naming_conventions.md
    • Table naming patterns
    • Dataset organization
    • Version suffix conventions
  2. Dataset Organization: READ references/dataset_naming_conventions.md
    • Common dataset suffixes (_derived, _stable, _live)
    • When to use each dataset type
    • Dataset naming rules
  3. Schema Resources: READ references/discovery_resources.md
    • Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
    • Priority order for schema lookup during construction
    • Common mozfun UDFs
  4. Privacy Guidelines: READ references/privacy_guidelines.md
    • Data handling requirements
    • PII considerations
    • Workgroup access patterns

Directory Structure

sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml

See assets/directory_structure_example.txt for detailed examples.

Key principles:

  • Always flat: sql/{project}/{dataset}/{table_name}/
  • Never use subdirectories within table directories
  • Table names always include a version suffix (_v1, _v2, etc.)

Schema & Description Resources for Construction

Finding Schema Descriptions

Priority order for schema lookup during construction:

  1. Local files first: Check sql/*/schema.yaml and metadata.yaml files
    • Most reliable and up-to-date source
    • Contains field descriptions written by table owners
  2. Glean Dictionary: For _live and _stable tables
  3. ProbeInfo API: For Glean metric metadata
    • Endpoint: https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics
    • Provides metric definitions and descriptions programmatically
    • Use for validating metric references in queries
  4. DataHub MCP: Only as a last resort
    • MUST READ references/datahub_best_practices.md BEFORE any DataHub queries
    • Use for schema lookup when a schema is not available in local files or the Glean Dictionary
    • Extract ONLY necessary fields (column names, types, descriptions)
    • Use for downstream impact analysis when modifying tables

See references/discovery_resources.md for:

  • Detailed guidance on each schema source
  • ProbeInfo API endpoints and usage patterns
  • Glean Dictionary URL patterns for different products
  • DataHub MCP best practices for construction
  • Common mozfun UDFs
  • Key documentation links

Naming Conventions

Table Names:

  • Use snake_case with a version suffix: clients_daily_event_v1
  • Common suffixes: _daily, _hourly, _aggregates, _summary

Field Names:

  • Use snake_case: submission_date, client_id, n_total_events
  • Prefix counts with n_: n_events, n_sessions
  • Standard Mozilla fields: submission_date, client_id, sample_id, normalized_channel, normalized_country_code, app_version

See references/naming_conventions.md for:

  • Complete naming patterns and conventions
  • Reserved/common patterns to avoid
  • BigQuery project naming conventions
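Applied together, these field conventions look like the following schema.yaml sketch (the field names, types, and descriptions are illustrative, not taken from a real table):

```yaml
fields:
- name: submission_date
  type: DATE
  mode: NULLABLE
  description: Date the ping was received by the pipeline.
- name: client_id
  type: STRING
  mode: NULLABLE
  description: Unique identifier for the client installation.
- name: n_events
  type: INTEGER
  mode: NULLABLE
  description: Count of events recorded for the client on submission_date.
```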

Dataset Organization

See references/dataset_naming_conventions.md for:

  • Dataset naming patterns by suffix (_derived, _external, etc.)
  • Common dataset prefixes by product/source
  • Table versioning patterns
  • Incremental vs full refresh query patterns

Privacy & Data Handling

Mozilla follows strict data privacy policies:

  • No PII in derived tables
  • Use client-level identifiers (client_id), not individual identifiers
  • Respect data retention policies (~2 years for client-level data)
  • Label client-level tables with table_type: client_level in metadata.yaml

See references/privacy_guidelines.md for:

  • Key principles from Mozilla's data platform
  • Geo IP lookup and user agent parsing policies
  • Best practices for data handling
  • Deletion request support
  • Sample ID usage for sampling
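As a sketch of the sample ID guidance above, assuming sample_id is a stable per-client bucket in the range 0-99 (so one bucket is roughly a 1% sample; the table name is illustrative):

```sql
-- Deterministic ~1% client sample: sample_id is derived from client_id,
-- so the same clients fall into the same bucket on every run.
SELECT
  submission_date,
  client_id
FROM
  telemetry_derived.example_clients_daily_v1
WHERE
  submission_date = @submission_date
  AND sample_id = 0
```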

BigQuery & Mozilla Conventions

Partitioning & Clustering

  • Most tables use day partitioning on submission_date
  • Clustering improves query performance for filtered/joined fields
  • See metadata-manager skill for detailed partitioning and clustering configuration
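As a rough sketch, day partitioning and clustering are typically declared in metadata.yaml along these lines (the metadata-manager skill is authoritative; the exact keys and field choices below are assumptions):

```yaml
bigquery:
  time_partitioning:
    type: day
    field: submission_date
    require_partition_filter: true
  clustering:
    fields:
    - normalized_channel
    - sample_id
```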

Common UDFs (mozfun)

Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

Common functions:

  • mozfun.map.get_key() - Extract values from key-value maps
  • mozfun.norm.truncate_version() - Normalize version strings
  • mozfun.stats.mode_last() - Statistical mode calculation

UDF source code lives in the sql/mozfun/ directory.
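A hedged sketch of the three functions above in use (the table and column names are illustrative assumptions; check the mozfun docs for exact signatures):

```sql
SELECT
  client_id,
  -- Truncate a version string, e.g. '91.0.1' at the 'major' level
  mozfun.norm.truncate_version(app_version, 'major') AS major_version,
  -- Look up one value in an ARRAY<STRUCT<key, value>> map column
  mozfun.map.get_key(experiments, 'my-experiment') AS experiment_branch,
  -- Most frequent channel per client, ties broken by the last value seen
  mozfun.stats.mode_last(
    ARRAY_AGG(normalized_channel) OVER (PARTITION BY client_id)
  ) AS modal_channel
FROM
  telemetry.example_table
```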

Glean Overview

Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

Key concepts:

  • Metric types: Counter, boolean, string, event, etc.
  • Pings: Collections of metrics (e.g., baseline, events, metrics)
  • Applications: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

Common Glean datasets in BigQuery:

  • Pattern: {app_id}.{ping_name} (e.g., org_mozilla_fenix.baseline)
  • All have auto-generated schemas based on metric definitions
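Following that pattern, a query against a Glean ping table might look like this sketch (the metric path is an illustrative assumption; look up real metric names in the Glean Dictionary):

```sql
SELECT
  DATE(submission_timestamp) AS submission_date,
  client_info.client_id,
  normalized_channel,
  -- Glean metrics are nested under the metrics column, grouped by type
  metrics.counter.example_metric AS example_count
FROM
  org_mozilla_fenix.baseline
WHERE
  DATE(submission_timestamp) = @submission_date
```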

See references/glean_overview.md for:

  • What is Glean and how it differs from Firefox Desktop Telemetry
  • Glean SDK and metric type details
  • Common Glean datasets in BigQuery
  • When to use Glean Dictionary

bigquery-etl CLI Commands

See references/bqetl_cli_commands.md for:

  • Key bqetl CLI commands for query creation, validation, schema updates
  • How to find the right DAG for scheduling
  • Backfill creation commands

Best Practices

General principles:

  • Always include field descriptions in schema.yaml (see metadata-manager skill)
  • Add header comments explaining query purpose (see query-writer skill)
  • Reference bug/ticket numbers for context
  • Document any data exclusions or filtering logic

See assets/query_structure_example.sql for standard query structure.
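A minimal sketch of those practices in a query header (the bug reference, table, and filter rationale are placeholders):

```sql
-- Purpose: daily event counts per client.
-- Context: bug/ticket number goes here.
-- Exclusions: rows with NULL client_id are dropped; they cannot be
-- attributed to a client.
SELECT
  submission_date,
  client_id,
  COUNT(*) AS n_events
FROM
  telemetry_derived.example_events_v1
WHERE
  submission_date = @submission_date
  AND client_id IS NOT NULL
GROUP BY
  submission_date,
  client_id
```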

Version migration:

  • Create a new _v2 table when making breaking schema changes
  • Keep _v1 running during the migration period
  • Update views to point to the new version
  • Coordinate with downstream consumers before deprecating the old version
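The "update views" step can be sketched as follows (the project, dataset, and table names are illustrative):

```sql
-- Repoint the user-facing view from _v1 to _v2 once _v2 is backfilled
-- and validated; downstream consumers keep querying the same view name.
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.example_table`
AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_table_v2`
```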


Integration with Other Skills

bigquery-etl-core serves as the foundation skill that other skills build upon:

Works with model-requirements

  • Provides naming conventions for new tables and datasets
  • Supplies common field naming patterns for requirements gathering
  • Offers privacy guidelines for data model planning

Works with query-writer

  • Provides project structure and naming conventions
  • Supplies common patterns and mozfun UDF references
  • Offers schema description lookup guidance for construction

Works with metadata-manager

  • Provides DAG naming patterns and scheduling conventions
  • Supplies partitioning and clustering best practices
  • Offers ownership and labeling patterns

Works with sql-test-generator

  • Provides test structure and fixture naming conventions
  • Supplies common table patterns for test creation
  • Offers query parameter conventions

Works with bigconfig-generator

  • Provides table naming conventions for Bigeye monitoring configuration
  • Supplies dataset organization patterns
  • Offers field naming standards for data quality checks

This skill is always available and does not need to be explicitly invoked - it provides foundational knowledge that other skills reference.

Reference Examples

Real query examples in the repository:

  • Simple query: sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql
  • Aggregation with GROUP BY: sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql
  • Complex query with CTEs: sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql
  • Python ETL (INFORMATION_SCHEMA): sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py
  • Python ETL (External API): sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py

For more examples, explore the sql/moz-fx-data-shared-prod/ directory.

Bundled Resources

References

  • references/discovery_resources.md - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
  • references/naming_conventions.md - Complete naming patterns for tables, fields, and projects
  • references/dataset_naming_conventions.md - Dataset organization and versioning patterns
  • references/privacy_guidelines.md - Mozilla data privacy policies and best practices
  • references/glean_overview.md - Glean SDK concepts and BigQuery dataset structures
  • references/bqetl_cli_commands.md - Key CLI commands and DAG discovery

DataHub Usage (CRITICAL for Token Efficiency)

BEFORE using any DataHub MCP tools (mcp__datahub-cloud__*), you MUST:

  • READ references/datahub_best_practices.md - Comprehensive token optimization strategies
  • Follow the priority order: local files → documentation → DataHub (only as a last resort)
  • Use search-first patterns and extract minimal fields from responses

Assets

  • assets/query_structure_example.sql - Standard query.sql structure with common patterns
  • assets/directory_structure_example.txt - File organization examples