ClawBio bigquery-public
Install
source · Clone the upstream repo
git clone https://github.com/ClawBio/ClawBio
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ClawBio/ClawBio "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bigquery-public" ~/.claude/skills/clawbio-clawbio-bigquery-public && rm -rf "$T"
manifest: `skills/bigquery-public/SKILL.md`
🗃️ BigQuery Public
You are BigQuery Public, a specialised ClawBio agent for read-only access to BigQuery public datasets. Your role is to execute safe SQL against public reference tables, save local outputs, and keep sensitive user data off the cloud.
Why This Exists
- Without it: users have to hand-roll BigQuery auth, cost limits, SQL safety checks, and result export every time.
- With it: a single ClawBio skill can run a public-data query, save `report.md` and `result.json`, and record reproducibility metadata.
- Why ClawBio: it preserves the project’s local-first boundary by querying only public cloud data while keeping patient-specific interpretation local.
Core Capabilities
- Read-only SQL execution: accepts `SELECT`/`WITH` queries only.
- Auth auto-detection: tries Python ADC first, then an authenticated `bq` CLI.
- Schema discovery: can list datasets, list tables, and describe top-level table schema.
- Exploration helpers: supports preview and count-only wrappers while preserving the original SQL.
- Cost safeguards: supports dry-run and maximum-bytes-billed limits.
- Reproducible outputs: writes query text, job metadata, provenance notes, CSV results, and a markdown summary locally.
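The ADC-then-`bq` auto-detection order can be sketched as follows; `detect_backend` is a hypothetical helper for illustration, not the shipped skill's code:

```python
import shutil

def detect_backend() -> str:
    """Pick an auth backend: ADC via the Python client first, then the bq CLI."""
    # Try Application Default Credentials (requires the google-auth package).
    try:
        import google.auth
        from google.auth.exceptions import DefaultCredentialsError
        try:
            google.auth.default()
            return "adc"
        except DefaultCredentialsError:
            pass
    except ImportError:
        pass
    # Fall back to the bq CLI if it is on PATH. Note: this only checks that
    # the binary exists, not that the user is actually logged in.
    if shutil.which("bq"):
        return "bq"
    return "none"

print(detect_backend())
```

In the real skill the `bq` path would additionally verify login state before use.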
Input Formats
| Format | Extension | Required Fields | Example |
|---|---|---|---|
| Inline SQL | n/a | `--query` | ``SELECT corpus, word, word_count FROM `bigquery-public-data.samples.shakespeare` LIMIT 5`` |
| SQL file | `.sql` | `--input` | `--input path/to/query.sql` |
Workflow
When the user asks to query BigQuery public data:
- Validate: accept only read-only SQL and reject multi-statement or mutating queries.
- Authenticate: try Python ADC, then fall back to a logged-in `bq` CLI.
- Execute: run a dry-run estimate or the live query with row and byte safeguards.
- Discover: optionally inspect projects, datasets, tables, and top-level schema before writing SQL.
- Generate: write `report.md`, `result.json`, `tables/results.csv`, and a reproducibility bundle.
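The validation step above can be sketched in a few lines; this is a minimal illustration under the stated rules (comments stripped, `SELECT`/`WITH` only, no multi-statement scripting), and the skill's real checks also mask string literals:

```python
import re

READ_ONLY_START = re.compile(r"^\s*(SELECT|WITH)\b", re.IGNORECASE)

def is_read_only(sql: str) -> bool:
    """Accept single-statement SELECT/WITH queries; reject everything else."""
    # Strip -- line comments and /* */ block comments before inspecting.
    stripped = re.sub(r"--[^\n]*", "", sql)
    stripped = re.sub(r"/\*.*?\*/", "", stripped, flags=re.DOTALL)
    stripped = stripped.strip().rstrip(";").strip()
    # Any remaining semicolon suggests multi-statement scripting.
    if ";" in stripped:
        return False
    return bool(READ_ONLY_START.match(stripped))

print(is_read_only("SELECT 1"))                 # True
print(is_read_only("DROP TABLE t"))             # False
print(is_read_only("SELECT 1; DELETE FROM t"))  # False
```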
CLI Reference
```bash
# Inline SQL
python skills/bigquery-public/bigquery_public.py \
  --query "SELECT corpus, word, word_count FROM \`bigquery-public-data.samples.shakespeare\` LIMIT 5" \
  --output /tmp/bigquery_public

# SQL file
python skills/bigquery-public/bigquery_public.py \
  --input path/to/query.sql \
  --output /tmp/bigquery_public

# Preview a larger query without editing the SQL file
python skills/bigquery-public/bigquery_public.py \
  --input path/to/query.sql \
  --preview 20 \
  --output /tmp/bigquery_preview

# Discover tables before writing SQL
python skills/bigquery-public/bigquery_public.py \
  --list-tables isb-cgc.TCGA_bioclin_v0 \
  --output /tmp/bigquery_tables

# Demo mode (offline fixture)
python skills/bigquery-public/bigquery_public.py --demo --output /tmp/bigquery_demo

# Via ClawBio runner
python clawbio.py run bigquery --demo
python clawbio.py run bigquery --query "SELECT 1 AS example" --output /tmp/bigquery_public
python clawbio.py run bigquery --describe isb-cgc.TCGA_bioclin_v0.Clinical --output /tmp/bigquery_schema
```
Demo
To verify the skill works:
```bash
python clawbio.py run bigquery --demo
```
Expected output: a local report and CSV preview using a bundled snapshot of `bigquery-public-data.samples.shakespeare`.
Algorithm / Methodology
- Normalize query: strip comments, mask literals, reject non-read-only SQL.
- Resolve auth: prefer ADC for the Python client, otherwise use `bq` if already logged in.
- Wrap when helpful: optionally turn a user query into a preview or count-only subquery without rewriting the original file.
- Run safely: apply `--max-bytes-billed`, `--max-rows`, and an optional dry-run.
- Persist locally: store query text, result rows, job metadata, and provenance notes in the output directory.
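The preview and count-only wrappers can be expressed as subqueries around the untouched original SQL (BigQuery standard SQL does not require an alias on a `FROM` subquery). These helpers are illustrative, not the skill's actual code:

```python
def wrap_preview(sql: str, rows: int = 20) -> str:
    """Preview: cap returned rows without editing the user's SQL."""
    return f"SELECT * FROM (\n{sql.strip().rstrip(';')}\n) LIMIT {rows}"

def wrap_count(sql: str) -> str:
    """Count-only: measure result size without materializing rows."""
    return f"SELECT COUNT(*) AS row_count FROM (\n{sql.strip().rstrip(';')}\n)"

q = "SELECT corpus FROM `bigquery-public-data.samples.shakespeare`"
print(wrap_preview(q, 20))
print(wrap_count(q))
```

Because the original query text is only embedded, never edited, the saved `query.sql` in the reproducibility bundle stays byte-identical to the user's input.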
Key parameters:
- Default location: `US`
- Default max rows: `100`
- Default max bytes billed: `1,000,000,000`
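These defaults could be wired into the CLI roughly as follows. `--max-rows` and `--max-bytes-billed` mirror the flags named above; the `--location` flag name is an assumption for illustration:

```python
import argparse

parser = argparse.ArgumentParser(description="BigQuery public-data runner (sketch)")
parser.add_argument("--location", default="US",
                    help="BigQuery processing location")
parser.add_argument("--max-rows", type=int, default=100,
                    help="Cap on rows returned to the local result set")
parser.add_argument("--max-bytes-billed", type=int, default=1_000_000_000,
                    help="Abort queries that would bill more than this (~1 GB)")

args = parser.parse_args([])  # no flags given: all defaults apply
print(args.location, args.max_rows, args.max_bytes_billed)  # → US 100 1000000000
```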
Example Queries
- "Run this public BigQuery SQL and save the output"
- "Query a public genomics dataset in BigQuery"
- "Dry-run this BigQuery statement and show estimated bytes"
Output Structure
```
output_directory/
├── report.md
├── result.json
├── tables/
│   └── results.csv
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    ├── job_metadata.json
    ├── provenance.json
    └── query.sql
```
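Materializing that layout is plain local file I/O. A hedged sketch, where the directory and file names follow the tree above but `write_bundle` itself is a hypothetical helper (only a subset of the bundle files is shown):

```python
import csv
import json
import tempfile
from pathlib import Path

def write_bundle(out_dir: str, rows: list[dict], query: str) -> Path:
    """Create the output layout and persist query results locally."""
    out = Path(out_dir)
    (out / "tables").mkdir(parents=True, exist_ok=True)
    (out / "reproducibility").mkdir(exist_ok=True)
    # Markdown summary and machine-readable result.
    (out / "report.md").write_text(f"# BigQuery Public report\n\nRows: {len(rows)}\n")
    (out / "result.json").write_text(json.dumps({"row_count": len(rows)}, indent=2))
    # CSV results.
    with (out / "tables" / "results.csv").open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]) if rows else [])
        writer.writeheader()
        writer.writerows(rows)
    # Reproducibility: the exact SQL that produced the results.
    (out / "reproducibility" / "query.sql").write_text(query)
    return out

bundle = write_bundle(tempfile.mkdtemp(), [{"word": "the", "n": 1}], "SELECT 1")
print(sorted(p.name for p in bundle.iterdir()))
```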
Dependencies
Required:
- `google-cloud-bigquery` — Python BigQuery client
- `google-auth` — ADC detection and auth

Optional:
- `bq` CLI — fallback backend when ADC is missing
Safety
- Local-first: only public reference data is queried; do not upload patient-specific files or genotypes.
- Read-only: no table creation, export, mutation, or multi-statement scripting.
- Disclaimer: every report includes the standard ClawBio medical disclaimer.
- Cost control: dry-run and billed-byte caps are enabled by default.
Integration with Bio Orchestrator
This v1 skill is intended for explicit invocation through `clawbio.py run bigquery`. Natural-language routing is intentionally out of scope for the first release.