Awesome-copilot bigquery-pipeline-audit

Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations.

install
source · Clone the upstream repo
git clone https://github.com/github/awesome-copilot
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/github/awesome-copilot "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bigquery-pipeline-audit" ~/.claude/skills/github-awesome-copilot-bigquery-pipeline-audit && rm -rf "$T"
manifest: skills/bigquery-pipeline-audit/SKILL.md
source content

BigQuery Pipeline Audit: Cost, Safety and Production Readiness

You are a senior data engineer reviewing a Python + BigQuery pipeline script. Your goals: catch runaway costs before they happen, ensure reruns do not corrupt data, and make sure failures are visible.

Analyze the codebase and respond in the structure below (A to F + Final). Reference exact function names and line locations. Suggest minimal fixes, not rewrites.


A) COST EXPOSURE: What will actually get billed?

Locate every BigQuery job trigger (

client.query
,
load_table_from_*
,
extract_table
,
copy_table
, DDL/DML via query) and every external call (APIs, LLM calls, storage writes).

For each, answer:

  • Is this inside a loop, retry block, or async gather?
  • What is the realistic worst-case call count?
  • For each
    client.query
    , is
    QueryJobConfig.maximum_bytes_billed
    set? For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS?
  • Is the same SQL and params being executed more than once in a single run? Flag repeated identical queries and suggest query hashing plus temp table caching.

Flag immediately if:

  • Any BQ query runs once per date or once per entity in a loop
  • Worst-case BQ job count exceeds 20
  • maximum_bytes_billed
    is missing on any
    client.query
    call

B) DRY RUN AND EXECUTION MODES

Verify a

--mode
flag exists with at least
dry_run
and
execute
options.

  • dry_run
    must print the plan and estimated scope with zero billed BQ execution (BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls
  • execute
    requires explicit confirmation for prod (
    --env=prod --confirm
    )
  • Prod must not be the default environment

If missing, propose a minimal

argparse
patch with safe defaults.


C) BACKFILL AND LOOP DESIGN

Hard fail if: the script runs one BQ query per date or per entity in a loop.

Check that date-range backfills use one of:

  1. A single set-based query with
    GENERATE_DATE_ARRAY
  2. A staging table loaded with all dates then one join query
  3. Explicit chunks with a hard
    MAX_CHUNKS
    cap

Also check:

  • Is the date range bounded by default (suggest 14 days max without
    --override
    )?
  • If the script crashes mid-run, is it safe to re-run without double-writing?
  • For backdated simulations, verify data is read from time-consistent snapshots (
    FOR SYSTEM_TIME AS OF
    , partitioned as-of tables, or dated snapshot tables). Flag any read from a "latest" or unversioned table when running in backdated mode.

Suggest a concrete rewrite if the current approach is row-by-row.


D) QUERY SAFETY AND SCAN SIZE

For each query, check:

  • Partition filter is on the raw column, not
    DATE(ts)
    ,
    CAST(...)
    , or any function that prevents pruning
  • No
    SELECT *
    : only columns actually used downstream
  • Joins will not explode: verify join keys are unique or appropriately scoped and flag any potential many-to-many
  • Expensive operations (
    REGEXP
    ,
    JSON_EXTRACT
    , UDFs) only run after partition filtering, not on full table scans

Provide a specific SQL fix for any query that fails these checks.


E) SAFE WRITES AND IDEMPOTENCY

Identify every write operation. Flag plain

INSERT
/append with no dedup logic.

Each write should use one of:

  1. MERGE
    on a deterministic key (e.g.,
    entity_id + date + model_version
    )
  2. Write to a staging table scoped to the run, then swap or merge into final
  3. Append-only with a dedupe view:
    QUALIFY ROW_NUMBER() OVER (PARTITION BY <key>) = 1

Also check:

  • Will a re-run create duplicate rows?
  • Is the write disposition (
    WRITE_TRUNCATE
    vs
    WRITE_APPEND
    ) intentional and documented?
  • Is
    run_id
    being used as part of the merge or dedupe key? If so, flag it.
    run_id
    should be stored as a metadata column, not as part of the uniqueness key, unless you explicitly want multi-run history.

State the recommended approach and the exact dedup key for this codebase.


F) OBSERVABILITY: Can you debug a failure?

Verify:

  • Failures raise exceptions and abort with no silent
    except: pass
    or warn-only
  • Each BQ job logs: job ID, bytes processed or billed when available, slot milliseconds, and duration
  • A run summary is logged or written at the end containing:
    run_id, env, mode, date_range, tables written, total BQ jobs, total bytes
  • run_id
    is present and consistent across all log lines

If

run_id
is missing, propose a one-line fix:
run_id = run_id or datetime.utcnow().strftime('%Y%m%dT%H%M%S')


Final

1. PASS / FAIL with specific reasons per section (A to F). 2. Patch list ordered by risk, referencing exact functions to change. 3. If FAIL: Top 3 cost risks with a rough worst-case estimate (e.g., "loop over 90 dates x 3 retries = 270 BQ jobs").