Claude-skill-registry · bigquery-etl-core

The core skill for working within the bigquery-etl repository. Use this skill to understand project structure, conventions, and common patterns. Works with the model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.

Install

Source · Clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/bigquery-etl-core" ~/.claude/skills/majiayu000-claude-skill-registry-bigquery-etl-core && rm -rf "$T"

Manifest: skills/data/bigquery-etl-core/SKILL.md

Source content

BigQuery ETL Core

Composable: Foundation skill that works with the model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.

When to use: Understanding project structure, conventions, and common patterns, and finding schema descriptions for construction.

Project Overview

The bigquery-etl project manages BigQuery table definitions, queries, and associated metadata for Mozilla. Similar to dbt, the repository maintains query definitions with associated metadata and schemas.

Each table/query typically consists of three files:

  • query.sql OR query.py - The query definition (SQL or Python)
  • metadata.yaml - Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)
  • schema.yaml - BigQuery schema definition with field types and descriptions (see metadata-manager skill)

Note: Most tables (~95%) use query.sql. Use query.py for API calls, multi-project queries, or complex Python operations. See the query-writer skill for details.

🚨 REQUIRED READING - Start Here

When starting work in bigquery-etl, READ these foundational references:

  1. Naming Conventions: READ references/naming_conventions.md
    • Table naming patterns
    • Dataset organization
    • Version suffix conventions
  2. Dataset Organization: READ references/dataset_naming_conventions.md
    • Common dataset suffixes (_derived, _stable, _live)
    • When to use each dataset type
    • Dataset naming rules
  3. Schema Resources: READ references/discovery_resources.md
    • Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
    • Priority order for schema lookup during construction
    • Common mozfun UDFs
  4. Privacy Guidelines: READ references/privacy_guidelines.md
    • Data handling requirements
    • PII considerations
    • Workgroup access patterns

Directory Structure

sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml

See assets/directory_structure_example.txt for detailed examples.

Key principles:

  • Always flat: sql/{project}/{dataset}/{table_name}/
  • Never use subdirectories within table directories
  • Table names always include a version suffix (_v1, _v2, etc.)

Schema & Description Resources for Construction

Finding Schema Descriptions

Priority order for schema lookup during construction:

  1. Local files first: Check sql/*/schema.yaml and metadata.yaml files
    • Most reliable and up-to-date source
    • Contains field descriptions written by table owners
  2. Glean Dictionary: For _live and _stable tables
  3. ProbeInfo API: For Glean metric metadata
    • Endpoint: https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics
    • Provides metric definitions and descriptions programmatically
    • Use for validating metric references in queries
  4. DataHub MCP: Only as a last resort
    • MUST READ references/datahub_best_practices.md BEFORE any DataHub queries
    • Use for schema lookup when a schema is not available in local files or the Glean Dictionary
    • Extract ONLY necessary fields (column names, types, descriptions)
    • Use for downstream impact analysis when modifying tables

See references/discovery_resources.md for:

  • Detailed guidance on each schema source
  • ProbeInfo API endpoints and usage patterns
  • Glean Dictionary URL patterns for different products
  • DataHub MCP best practices for construction
  • Common mozfun UDFs
  • Key documentation links

Naming Conventions

Table Names:

  • Use snake_case with a version suffix: clients_daily_event_v1
  • Common suffixes: _daily, _hourly, _aggregates, _summary

Field Names:

  • Use snake_case: submission_date, client_id, n_total_events
  • Prefix counts with n_: n_events, n_sessions
  • Standard Mozilla fields: submission_date, client_id, sample_id, normalized_channel, normalized_country_code, app_version

See references/naming_conventions.md for:

  • Complete naming patterns and conventions
  • Reserved/common patterns to avoid
  • BigQuery project naming conventions
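Applied together, these field conventions look like the following schema.yaml sketch (the field names, types, and descriptions are illustrative, not taken from a real table):

```yaml
fields:
- name: submission_date
  type: DATE
  mode: NULLABLE
  description: Date the ping was received by the pipeline.
- name: client_id
  type: STRING
  mode: NULLABLE
  description: Unique identifier for the client installation.
- name: n_events
  type: INTEGER
  mode: NULLABLE
  description: Count of events recorded for the client on submission_date.
```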

Dataset Organization

See references/dataset_naming_conventions.md for:

  • Dataset naming patterns by suffix (_derived, _external, etc.)
  • Common dataset prefixes by product/source
  • Table versioning patterns
  • Incremental vs full refresh query patterns

Privacy & Data Handling

Mozilla follows strict data privacy policies:

  • No PII in derived tables
  • Use client-level identifiers (client_id), not individual identifiers
  • Respect data retention policies (~2 years for client-level data)
  • Label client-level tables with table_type: client_level in metadata.yaml

See references/privacy_guidelines.md for:

  • Key principles from Mozilla's data platform
  • Geo IP lookup and user agent parsing policies
  • Best practices for data handling
  • Deletion request support
  • Sample ID usage for sampling
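As a sketch of the sample ID guidance above, assuming sample_id is a stable per-client bucket in the range 0-99 (so one bucket is roughly a 1% sample; the table name is illustrative):

```sql
-- Deterministic ~1% client sample: sample_id is derived from client_id,
-- so the same clients fall into the same bucket on every run.
SELECT
  submission_date,
  client_id
FROM
  telemetry_derived.example_clients_daily_v1
WHERE
  submission_date = @submission_date
  AND sample_id = 0
```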

BigQuery & Mozilla Conventions

Partitioning & Clustering

  • Most tables use day partitioning on submission_date
  • Clustering improves query performance for filtered/joined fields
  • See metadata-manager skill for detailed partitioning and clustering configuration
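As a rough sketch, day partitioning and clustering are typically declared in metadata.yaml along these lines (the metadata-manager skill is authoritative; the exact keys and field choices below are assumptions):

```yaml
bigquery:
  time_partitioning:
    type: day
    field: submission_date
    require_partition_filter: true
  clustering:
    fields:
    - normalized_channel
    - sample_id
```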

Common UDFs (mozfun)

Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

Common functions:

  • mozfun.map.get_key() - Extract values from key-value maps
  • mozfun.norm.truncate_version() - Normalize version strings
  • mozfun.stats.mode_last() - Statistical mode calculation

UDF source code lives in the sql/mozfun/ directory.
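A hedged sketch of the three functions above in use (the table and column names are illustrative assumptions; check the mozfun docs for exact signatures):

```sql
SELECT
  client_id,
  -- Truncate a version string, e.g. '91.0.1' at the 'major' level
  mozfun.norm.truncate_version(app_version, 'major') AS major_version,
  -- Look up one value in an ARRAY<STRUCT<key, value>> map column
  mozfun.map.get_key(experiments, 'my-experiment') AS experiment_branch,
  -- Most frequent channel per client, ties broken by the last value seen
  mozfun.stats.mode_last(
    ARRAY_AGG(normalized_channel) OVER (PARTITION BY client_id)
  ) AS modal_channel
FROM
  telemetry.example_table
```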

Glean Overview

Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

Key concepts:

  • Metric types: Counter, boolean, string, event, etc.
  • Pings: Collections of metrics (e.g., baseline, events, metrics)
  • Applications: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

Common Glean datasets in BigQuery:

  • Pattern: {app_id}.{ping_name} (e.g., org_mozilla_fenix.baseline)
  • All have auto-generated schemas based on metric definitions
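Following that pattern, a query against a Glean ping table might look like this sketch (the metric path is an illustrative assumption; look up real metric names in the Glean Dictionary):

```sql
SELECT
  DATE(submission_timestamp) AS submission_date,
  client_info.client_id,
  normalized_channel,
  -- Glean metrics are nested under the metrics column, grouped by type
  metrics.counter.example_metric AS example_count
FROM
  org_mozilla_fenix.baseline
WHERE
  DATE(submission_timestamp) = @submission_date
```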

See references/glean_overview.md for:

  • What is Glean and how it differs from Firefox Desktop Telemetry
  • Glean SDK and metric type details
  • Common Glean datasets in BigQuery
  • When to use Glean Dictionary

bigquery-etl CLI Commands

See references/bqetl_cli_commands.md for:

  • Key bqetl CLI commands for query creation, validation, schema updates
  • How to find the right DAG for scheduling
  • Backfill creation commands

Best Practices

General principles:

  • Always include field descriptions in schema.yaml (see metadata-manager skill)
  • Add header comments explaining query purpose (see query-writer skill)
  • Reference bug/ticket numbers for context
  • Document any data exclusions or filtering logic

See assets/query_structure_example.sql for standard query structure.
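A minimal sketch of those practices in a query header (the bug reference, table, and filter rationale are placeholders):

```sql
-- Purpose: daily event counts per client.
-- Context: bug/ticket number goes here.
-- Exclusions: rows with NULL client_id are dropped; they cannot be
-- attributed to a client.
SELECT
  submission_date,
  client_id,
  COUNT(*) AS n_events
FROM
  telemetry_derived.example_events_v1
WHERE
  submission_date = @submission_date
  AND client_id IS NOT NULL
GROUP BY
  submission_date,
  client_id
```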

Version migration:

  • Create a new _v2 table when making breaking schema changes
  • Keep _v1 running during the migration period
  • Update views to point to the new version
  • Coordinate with downstream consumers before deprecating the old version
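The "update views" step can be sketched as follows (the project, dataset, and table names are illustrative):

```sql
-- Repoint the user-facing view from _v1 to _v2 once _v2 is backfilled
-- and validated; downstream consumers keep querying the same view name.
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.example_table`
AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_table_v2`
```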


Integration with Other Skills

bigquery-etl-core serves as the foundation skill that other skills build upon:

Works with model-requirements

  • Provides naming conventions for new tables and datasets
  • Supplies common field naming patterns for requirements gathering
  • Offers privacy guidelines for data model planning

Works with query-writer

  • Provides project structure and naming conventions
  • Supplies common patterns and mozfun UDF references
  • Offers schema description lookup guidance for construction

Works with metadata-manager

  • Provides DAG naming patterns and scheduling conventions
  • Supplies partitioning and clustering best practices
  • Offers ownership and labeling patterns

Works with sql-test-generator

  • Provides test structure and fixture naming conventions
  • Supplies common table patterns for test creation
  • Offers query parameter conventions

Works with bigconfig-generator

  • Provides table naming conventions for Bigeye monitoring configuration
  • Supplies dataset organization patterns
  • Offers field naming standards for data quality checks

This skill is always available and does not need to be explicitly invoked - it provides foundational knowledge that other skills reference.

Reference Examples

Real query examples in the repository:

  • Simple query: sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql
  • Aggregation with GROUP BY: sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql
  • Complex query with CTEs: sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql
  • Python ETL (INFORMATION_SCHEMA): sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py
  • Python ETL (External API): sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py

For more examples, explore the sql/moz-fx-data-shared-prod/ directory.

Bundled Resources

References

  • references/discovery_resources.md - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
  • references/naming_conventions.md - Complete naming patterns for tables, fields, and projects
  • references/dataset_naming_conventions.md - Dataset organization and versioning patterns
  • references/privacy_guidelines.md - Mozilla data privacy policies and best practices
  • references/glean_overview.md - Glean SDK concepts and BigQuery dataset structures
  • references/bqetl_cli_commands.md - Key CLI commands and DAG discovery

DataHub Usage (CRITICAL for Token Efficiency)

BEFORE using any DataHub MCP tools (mcp__datahub-cloud__*), you MUST:

  • READ references/datahub_best_practices.md - Comprehensive token optimization strategies
  • Follow the priority order: local files → documentation → DataHub (only as a last resort)
  • Use search-first patterns and extract minimal fields from responses

Assets

  • assets/query_structure_example.sql - Standard query.sql structure with common patterns
  • assets/directory_structure_example.txt - File organization examples