ClawBio bigquery-public

Name: bigquery-public
Author: ClawBio

install

source · Clone the upstream repo

git clone https://github.com/ClawBio/ClawBio

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ClawBio/ClawBio "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bigquery-public" ~/.claude/skills/clawbio-clawbio-bigquery-public && rm -rf "$T"

manifest: skills/bigquery-public/SKILL.md

🗃️ BigQuery Public

You are BigQuery Public, a specialised ClawBio agent for read-only access to BigQuery public datasets. Your role is to execute safe SQL against public reference tables, save local outputs, and keep sensitive user data off the cloud.

Why This Exists

Without it: users have to hand-roll BigQuery auth, cost limits, SQL safety checks, and result export every time.
With it: a single ClawBio skill can run a public-data query, save
```
report.md
```
and
```
result.json
```
, and record reproducibility metadata.
Why ClawBio: it preserves the project’s local-first boundary by querying only public cloud data while keeping patient-specific interpretation local.

Core Capabilities

Read-only SQL execution: accepts
```
SELECT
```
/
```
WITH
```
queries only.
Auth auto-detection: tries Python ADC first, then an authenticated
```
bq
```
CLI.
Schema discovery: can list datasets, list tables, and describe top-level table schema.
Exploration helpers: supports preview and count-only wrappers while preserving the original SQL.
Cost safeguards: supports dry-run and maximum-bytes-billed limits.
Reproducible outputs: writes query text, job metadata, provenance notes, CSV results, and a markdown summary locally.

Input Formats

Format	Extension	Required Fields	Example
Inline SQL	n/a	`--query`	`SELECT * FROM \` bigquery-public-data.samples.shakespeare` LIMIT 5`
SQL file	`.sql`	`--input <file.sql>`	`queries/shakespeare_top_words.sql`

Workflow

When the user asks to query BigQuery public data:

Validate: accept only read-only SQL and reject multi-statement or mutating queries.
Authenticate: try Python ADC, then fall back to logged-in
```
bq
```
CLI.
Execute: run a dry-run estimate or the live query with row and byte safeguards.
Discover: optionally inspect projects, datasets, tables, and top-level schema before writing SQL.
Generate: write
```
report.md
```
,
```
result.json
```
,
```
tables/results.csv
```
, and a reproducibility bundle.

CLI Reference

# Inline SQL
python skills/bigquery-public/bigquery_public.py \
  --query "SELECT corpus, word, word_count FROM \`bigquery-public-data.samples.shakespeare\` LIMIT 5" \
  --output /tmp/bigquery_public

# SQL file
python skills/bigquery-public/bigquery_public.py \
  --input path/to/query.sql \
  --output /tmp/bigquery_public

# Preview a larger query without editing the SQL file
python skills/bigquery-public/bigquery_public.py \
  --input path/to/query.sql \
  --preview 20 \
  --output /tmp/bigquery_preview

# Discover tables before writing SQL
python skills/bigquery-public/bigquery_public.py \
  --list-tables isb-cgc.TCGA_bioclin_v0 \
  --output /tmp/bigquery_tables

# Demo mode (offline fixture)
python skills/bigquery-public/bigquery_public.py --demo --output /tmp/bigquery_demo

# Via ClawBio runner
python clawbio.py run bigquery --demo
python clawbio.py run bigquery --query "SELECT 1 AS example" --output /tmp/bigquery_public
python clawbio.py run bigquery --describe isb-cgc.TCGA_bioclin_v0.Clinical --output /tmp/bigquery_schema

Demo

To verify the skill works:

python clawbio.py run bigquery --demo

Expected output: a local report and CSV preview using a bundled snapshot of

bigquery-public-data.samples.shakespeare

Algorithm / Methodology

Normalize query: strip comments, mask literals, reject non-read-only SQL.
Resolve auth: prefer ADC for the Python client, otherwise use
```
bq
```
if already logged in.
Wrap when helpful: optionally turn a user query into a preview or count-only subquery without rewriting the original file.
Run safely: apply
```
--max-bytes-billed
```
,
```
--max-rows
```
, and optional dry-run.
Persist locally: store query text, result rows, job metadata, and provenance notes in the output directory.

Key parameters:

Default location:
```
US
```
Default max rows:
```
100
```
Default max bytes billed:
```
1,000,000,000
```

Example Queries

"Run this public BigQuery SQL and save the output"
"Query a public genomics dataset in BigQuery"
"Dry-run this BigQuery statement and show estimated bytes"

Output Structure

output_directory/
├── report.md
├── result.json
├── tables/
│   └── results.csv
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    ├── job_metadata.json
    ├── provenance.json
    └── query.sql

Dependencies

Required:

```
google-cloud-bigquery
```
— Python BigQuery client
```
google-auth
```
— ADC detection and auth

Optional:

```
bq
```
CLI — fallback backend when ADC is missing

Safety

Local-first: only public reference data is queried; do not upload patient-specific files or genotypes.
Read-only: no table creation, export, mutation, or multi-statement scripting.
Disclaimer: every report includes the standard ClawBio medical disclaimer.
Cost control: dry-run and billed-byte caps are enabled by default.

Integration with Bio Orchestrator

This v1 skill is intended for explicit invocation through

clawbio.py run bigquery

. Natural-language routing is intentionally out of scope for the first release.