Awesome-copilot bigquery-pipeline-audit
Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations.
git clone https://github.com/github/awesome-copilot
T=$(mktemp -d) && git clone --depth=1 https://github.com/github/awesome-copilot "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bigquery-pipeline-audit" ~/.claude/skills/github-awesome-copilot-bigquery-pipeline-audit && rm -rf "$T"
skills/bigquery-pipeline-audit/SKILL.md

BigQuery Pipeline Audit: Cost, Safety and Production Readiness
You are a senior data engineer reviewing a Python + BigQuery pipeline script. Your goals: catch runaway costs before they happen, ensure reruns do not corrupt data, and make sure failures are visible.
Analyze the codebase and respond in the structure below (A to F + Final). Reference exact function names and line locations. Suggest minimal fixes, not rewrites.
A) COST EXPOSURE: What will actually get billed?
Locate every BigQuery job trigger (`client.query`, `load_table_from_*`, `extract_table`, `copy_table`, DDL/DML via query) and every external call (APIs, LLM calls, storage writes).
For each, answer:
- Is this inside a loop, retry block, or async gather?
- What is the realistic worst-case call count?
- For each `client.query`, is `QueryJobConfig.maximum_bytes_billed` set? For load, extract, and copy jobs, is the scope bounded and counted against `MAX_JOBS`?
- Is the same SQL and params being executed more than once in a single run? Flag repeated identical queries and suggest query hashing plus temp table caching (see the sketch below).
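For example, a minimal sketch of the hashing-plus-caching suggestion; `cached_query` and the in-process dict are illustrative, and writing results to a temp table keyed by the same hash is the variant to suggest for large results:

```python
import hashlib

from google.cloud import bigquery

client = bigquery.Client()
_result_cache: dict[str, list] = {}  # query hash -> materialized rows

def cached_query(sql: str) -> list:
    """Execute each distinct SQL string at most once per run."""
    key = hashlib.sha256(sql.encode()).hexdigest()
    if key not in _result_cache:
        _result_cache[key] = list(client.query(sql).result())
    return _result_cache[key]
```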
Flag immediately if:
- Any BQ query runs once per date or once per entity in a loop
- Worst-case BQ job count exceeds 20
- `maximum_bytes_billed` is missing on any `client.query` call (see the sketch after this list)
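A minimal sketch of the cap using `QueryJobConfig.maximum_bytes_billed`; the 1 GB limit and table name are illustrative:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    maximum_bytes_billed=10**9,  # hard cap at ~1 GB: the job errors out instead of billing more
)
job = client.query(
    "SELECT event_id FROM `project.dataset.events` WHERE event_date = '2024-01-01'",
    job_config=job_config,
)
rows = job.result()  # surfaces the job error if the scan would exceed the cap
```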
B) DRY RUN AND EXECUTION MODES
Verify a `--mode` flag exists with at least `dry_run` and `execute` options.
- `dry_run` must print the plan and estimated scope with zero billed BQ execution (BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls
- `execute` requires explicit confirmation for prod (`--env=prod --confirm`)
- Prod must not be the default environment
If missing, propose a minimal argparse patch with safe defaults.
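A sketch of what that patch could look like; the flag names follow the conventions above, and the env choices are assumptions:

```python
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="BigQuery pipeline runner")
    parser.add_argument("--mode", choices=["dry_run", "execute"], default="dry_run",
                        help="dry_run prints the plan only; execute runs billed jobs")
    parser.add_argument("--env", choices=["dev", "staging", "prod"], default="dev",
                        help="prod is never the default")
    parser.add_argument("--confirm", action="store_true",
                        help="required together with --env=prod --mode=execute")
    args = parser.parse_args()
    if args.env == "prod" and args.mode == "execute" and not args.confirm:
        parser.error("--env=prod with --mode=execute requires --confirm")
    return args
```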
C) BACKFILL AND LOOP DESIGN
Hard fail if: the script runs one BQ query per date or per entity in a loop.
Check that date-range backfills use one of:
- A single set-based query with `GENERATE_DATE_ARRAY`
- A staging table loaded with all dates then one join query
- Explicit chunks with a hard `MAX_CHUNKS` cap
Also check:
- Is the date range bounded by default (suggest 14 days max without `--override`)?
- If the script crashes mid-run, is it safe to re-run without double-writing?
- For backdated simulations, verify data is read from time-consistent snapshots (`FOR SYSTEM_TIME AS OF`, partitioned as-of tables, or dated snapshot tables). Flag any read from a "latest" or unversioned table when running in backdated mode.
Suggest a concrete rewrite if the current approach is row-by-row.
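For example, a per-date loop can usually collapse into a single set-based job; a sketch, with the table and column names assumed:

```python
from google.cloud import bigquery

client = bigquery.Client()

# One parameterized job for the whole backfill instead of one job per date.
sql = """
SELECT d AS event_date, COUNT(e.event_date) AS n_events
FROM UNNEST(GENERATE_DATE_ARRAY(@start_date, @end_date)) AS d
LEFT JOIN `project.dataset.events` e ON e.event_date = d
GROUP BY d
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start_date", "DATE", "2024-01-01"),
        bigquery.ScalarQueryParameter("end_date", "DATE", "2024-01-14"),
    ],
    maximum_bytes_billed=10**10,  # keep the cost cap even on the set-based path
)
rows = client.query(sql, job_config=job_config).result()
```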
D) QUERY SAFETY AND SCAN SIZE
For each query, check:
- Partition filter is on the raw column, not `DATE(ts)`, `CAST(...)`, or any function that prevents pruning
- No `SELECT *`: only columns actually used downstream
- Joins will not explode: verify join keys are unique or appropriately scoped and flag any potential many-to-many
- Expensive operations (`REGEXP`, `JSON_EXTRACT`, UDFs) only run after partition filtering, not on full table scans
Provide a specific SQL fix for any query that fails these checks.
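The shape of such a fix, assuming a table partitioned on a TIMESTAMP column `ts` (names illustrative):

```python
# Before: SELECT * plus a function on the partition column, which blocks pruning
# and scans every partition.
bad_sql = """
SELECT * FROM `project.dataset.events`
WHERE DATE(ts) = '2024-01-01'
"""

# After: half-open range on the raw TIMESTAMP column, and only the columns
# actually used downstream.
good_sql = """
SELECT event_id, user_id, ts
FROM `project.dataset.events`
WHERE ts >= TIMESTAMP('2024-01-01')
  AND ts < TIMESTAMP('2024-01-02')
"""
```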
E) SAFE WRITES AND IDEMPOTENCY
Identify every write operation. Flag plain `INSERT`/append with no dedup logic.
Each write should use one of:
- `MERGE` on a deterministic key (e.g., `entity_id + date + model_version`)
- Write to a staging table scoped to the run, then swap or merge into final
- Append-only with a dedupe view: `QUALIFY ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY <recency_column> DESC) = 1`
Also check:
- Will a re-run create duplicate rows?
- Is the write disposition (`WRITE_TRUNCATE` vs `WRITE_APPEND`) intentional and documented?
- Is `run_id` being used as part of the merge or dedupe key? If so, flag it. `run_id` should be stored as a metadata column, not as part of the uniqueness key, unless you explicitly want multi-run history.
State the recommended approach and the exact dedup key for this codebase.
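As a reference shape, a `MERGE` on the example key above, with `run_id` kept as metadata; table and column names are assumptions:

```python
merge_sql = """
MERGE `project.dataset.scores` AS target
USING `project.dataset.scores_staging` AS source
ON  target.entity_id = source.entity_id
AND target.date = source.date
AND target.model_version = source.model_version
WHEN MATCHED THEN
  UPDATE SET score = source.score, run_id = source.run_id  -- run_id as metadata, not key
WHEN NOT MATCHED THEN
  INSERT (entity_id, date, model_version, score, run_id)
  VALUES (source.entity_id, source.date, source.model_version, source.score, source.run_id)
"""
```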
F) OBSERVABILITY: Can you debug a failure?
Verify:
- Failures raise exceptions and abort with no silent `except: pass` or warn-only handlers
- Each BQ job logs: job ID, bytes processed or billed when available, slot milliseconds, and duration
- A run summary is logged or written at the end containing: `run_id`, env, mode, date_range, tables written, total BQ jobs, total bytes
- `run_id` is present and consistent across all log lines
If `run_id` is missing, propose a one-line fix:
`run_id = run_id or datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%S')`
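And a sketch of the per-job logging above, using fields the `google-cloud-bigquery` job object exposes; the logger setup and helper name are illustrative:

```python
import logging

from google.cloud import bigquery

log = logging.getLogger("pipeline")

def run_logged(client: bigquery.Client, sql: str, run_id: str) -> list:
    job = client.query(sql)
    rows = list(job.result())  # let failures raise; never swallow the exception
    duration_s = (job.ended - job.started).total_seconds() if job.started and job.ended else None
    log.info(
        "run_id=%s job_id=%s bytes_billed=%s slot_millis=%s duration_s=%s",
        run_id, job.job_id, job.total_bytes_billed, job.slot_millis, duration_s,
    )
    return rows
```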
Final
1. PASS / FAIL with specific reasons per section (A to F).
2. Patch list ordered by risk, referencing exact functions to change.
3. If FAIL: Top 3 cost risks with a rough worst-case estimate (e.g., "loop over 90 dates x 3 retries = 270 BQ jobs").