# asi vertex-ai-protein-interleave
Bridge layer connecting Vertex AI / Google Cloud to plurigrid/asi protein-scale biology skills. Wires AlphaFold, ESM, DiffDock, DeepChem, and TorchDrug into Vertex AI Pipelines, Endpoints, and BigQuery genomics for protein design-predict-validate loops.
```bash
# Clone the full repo
git clone https://github.com/plurigrid/asi

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/plurigrid/asi "$T" && \
  mkdir -p ~/.claude/skills && \
  cp -r "$T/skills/vertex-ai-protein-interleave" ~/.claude/skills/plurigrid-asi-vertex-ai-protein-interleave && \
  rm -rf "$T"
```
**`skills/vertex-ai-protein-interleave/SKILL.md`**

---
description: >
  Bridge connecting Vertex AI / Google Cloud to protein-scale biology skills.
  Wires AlphaFold, ESM, DiffDock, DeepChem, and TorchDrug into Vertex AI
  Pipelines, Endpoints, and BigQuery genomics for protein design-predict-validate
  loops. Use when orchestrating protein engineering on GCP, deploying ESM as a
  managed endpoint, querying gnomAD via BigQuery, or running batch docking.
---

# Vertex AI × Protein-Scale Biology Interleave

Bridge connecting Google Cloud's orchestration (Vertex AI Pipelines, Endpoints, BigQuery) to the ASI protein skill cluster.
## ASI Protein Skill Cluster
```
Protein Stack (existing in asi)
├── alphafold-database (-1)  ← structure retrieval, pLDDT/PAE, 200M+ structures
├── esm (-1)                 ← ESM3/ESM C: sequence generation, inverse folding
├── diffdock (-1)            ← structure-based docking, pose prediction
├── deepchem (0)             ← ADMET prediction, GNNs, MoleculeNet, 30+ datasets
├── torchdrug (0)            ← GNNs, retrosynthesis, KG reasoning, 40+ datasets
├── adaptyv (+1)             ← wet-lab validation: binding, expression, stability
├── uniprot-database (0)     ← search, sequence retrieval, ID mapping
└── gget (+1)                ← rapid bioinformatics: AlphaFold, ARCHS4, Enrichr
```
Vertex AI adds: orchestration, serverless inference, genomic warehouse, cost optimization.
### GF(3) Tripartite Tag
alphafold-database(-1) ⊗ vertex-ai-protein-interleave(0) ⊗ adaptyv(+1) = 0
Structure (-1) × Bridge (0) × Validation (+1) = balanced protein design loop.
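The balance condition is plain arithmetic mod 3 and can be checked mechanically. A minimal sketch, with tag values taken from the cluster listing above (the `gf3_balanced` helper is hypothetical, not part of any skill):

```python
# GF(3) tripartite tag check: a triad is balanced when its tags sum to 0 mod 3.
# Note -1 ≡ 2 (mod 3), so -1 + 0 + 1 = 0 is the balanced case.
def gf3_balanced(*tags):
    """Return True when the tags sum to 0 in GF(3)."""
    return sum(tags) % 3 == 0

# alphafold-database(-1) ⊗ vertex-ai-protein-interleave(0) ⊗ adaptyv(+1)
triad = {"alphafold-database": -1, "vertex-ai-protein-interleave": 0, "adaptyv": 1}
print(gf3_balanced(*triad.values()))  # True
```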
## Integration Points
## 1. Design → Predict → Validate Loop (Vertex AI Pipelines)
The core missing link: orchestrate the full protein engineering iteration via KFP.
```python
from kfp import dsl

@dsl.pipeline(name="protein-design-loop")
def protein_pipeline(target_sequence: str, iterations: int = 3):
    # Step 1: AlphaFold structure prediction
    fold = dsl.ContainerOp(
        name="alphafold-predict",
        image="gcr.io/PROJECT/alphafold:latest",
        command=["python", "predict.py"],
        arguments=["--sequence", target_sequence, "--output", "/tmp/structure.pdb"],
    )

    # Step 2: DiffDock binding site prediction
    dock = dsl.ContainerOp(
        name="diffdock",
        image="gcr.io/PROJECT/diffdock:latest",
        command=["python", "inference.py"],
        arguments=["--protein", fold.outputs["structure"], "--ligand", "/data/ligand.sdf"],
    ).after(fold)

    # Step 3: ESM inverse folding → generate sequence variants
    variants = dsl.ContainerOp(
        name="esm-inverse-fold",
        image="gcr.io/PROJECT/esm:latest",
        command=["python", "inverse_fold.py"],
        arguments=["--structure", fold.outputs["structure"], "--num_seqs", "100"],
    ).after(fold)

    # Step 4: DeepChem ADMET filtering
    filtered = dsl.ContainerOp(
        name="deepchem-admet",
        image="gcr.io/PROJECT/deepchem:latest",
        command=["python", "admet_screen.py"],
        arguments=["--sequences", variants.outputs["seqs"]],
    ).after(variants)

    # Step 5: Adaptyv wet-lab order (top candidates)
    dsl.ContainerOp(
        name="adaptyv-order",
        image="gcr.io/PROJECT/adaptyv-client:latest",
        command=["python", "order.py"],
        arguments=["--candidates", filtered.outputs["top_k"], "--assay", "binding"],
    ).after(filtered)
```
Deploy:

```bash
vertex ai pipelines run --pipeline-spec protein-design-loop.json
```
## 2. ESM Serverless Inference via Vertex AI Endpoints
Deploy ESM3 as a managed endpoint for on-demand embedding and sequence generation:
```bash
# Build and push ESM container
docker build -t gcr.io/$PROJECT/esm-server:latest -f esm.Dockerfile .
docker push gcr.io/$PROJECT/esm-server:latest

# Upload model artifact
gcloud ai models upload \
  --region=us-central1 \
  --display-name=esm3-protein-lm \
  --container-image-uri=gcr.io/$PROJECT/esm-server:latest \
  --container-predict-route=/predict \
  --container-health-route=/health

# Deploy to endpoint
gcloud ai endpoints create --region=us-central1 --display-name=esm-endpoint
gcloud ai endpoints deploy-model ESM_ENDPOINT_ID \
  --region=us-central1 \
  --model=ESM_MODEL_ID \
  --machine-type=n1-standard-4 \
  --accelerator=count=1,type=nvidia-tesla-t4 \
  --min-replica-count=0 \
  --max-replica-count=4   # min-replica-count=0 scales to zero when idle

# Call endpoint
ACCESS_TOKEN=$(gcloud auth print-access-token)
curl -s "https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT}/locations/us-central1/endpoints/${ESM_ENDPOINT_ID}:predict" \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"instances": [{"sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL"}]}'
```
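The same `:predict` call can be issued from Python instead of curl. A minimal sketch that only assembles the URL and request body in the shape shown above (the project name and endpoint ID are placeholder values, and `build_predict_request` is a hypothetical helper):

```python
import json

def build_predict_request(project, region, endpoint_id, sequences):
    """Assemble the Vertex AI :predict URL and JSON body used in the curl call."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{region}/endpoints/{endpoint_id}:predict"
    )
    body = json.dumps({"instances": [{"sequence": s} for s in sequences]})
    return url, body

url, body = build_predict_request("my-project", "us-central1", "1234567890", ["MKTAYIAKQR"])
# POST `body` to `url` with an "Authorization: Bearer $ACCESS_TOKEN" header, as in the curl call
```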
## 3. BigQuery Genomics → Protein ML Pipeline
Wire gnomAD variant data through BigQuery to Vertex AI custom training:
```sql
-- Extract high-impact missense variants for protein modeling
-- (gnomAD v3 in BigQuery: bigquery-public-data.gnomAD)
SELECT
  reference_name,
  start_position,
  reference_bases,
  alternate_bases,
  vep.consequence_terms,
  vep.protein_id,
  vep.amino_acids,
  af.AF AS allele_frequency
FROM `bigquery-public-data.gnomAD.v3_1_2_genomes`
CROSS JOIN UNNEST(vep) AS vep
CROSS JOIN UNNEST(allele_freq) AS af
WHERE 'missense_variant' IN UNNEST(vep.consequence_terms)
  AND af.AF > 0.001  -- common variants
  AND vep.protein_id = @target_protein
ORDER BY af.AF DESC
LIMIT 10000;
```
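When rows from this query are exported and post-processed locally, the same predicate can be mirrored in Python. A sketch over hypothetical variant records (the field names follow the SELECT list above; `filter_missense` is illustrative, not part of any SDK):

```python
def filter_missense(rows, target_protein, min_af=0.001):
    """Mirror the SQL WHERE clause: missense variants on one protein above an AF floor."""
    return sorted(
        (r for r in rows
         if "missense_variant" in r["consequence_terms"]
         and r["AF"] > min_af
         and r["protein_id"] == target_protein),
        key=lambda r: r["AF"],
        reverse=True,   # mirrors ORDER BY af.AF DESC
    )

rows = [
    {"consequence_terms": ["missense_variant"], "AF": 0.01, "protein_id": "ENSP0001"},
    {"consequence_terms": ["synonymous_variant"], "AF": 0.2, "protein_id": "ENSP0001"},
    {"consequence_terms": ["missense_variant"], "AF": 0.0001, "protein_id": "ENSP0001"},
]
hits = filter_missense(rows, "ENSP0001")  # only the first record passes all predicates
```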
```python
# Train a Vertex AI custom model on variant-phenotype associations
from google.cloud import aiplatform

aiplatform.init(project=PROJECT, location="us-central1")

job = aiplatform.CustomTrainingJob(
    display_name="gnomad-variant-phenotype",
    script_path="train_variant_model.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
    requirements=["scikit-learn", "pandas", "biopython"],
)

model = job.run(
    dataset=aiplatform.TabularDataset(DATASET_ID),
    model_display_name="variant-phenotype-predictor",
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```
## 4. AlphaFold Batch Inference Pipeline
For large-scale structural characterization using Vertex AI's pre-built AlphaFold integration:
```bash
# Use the pre-built Vertex AI AlphaFold pipeline
gcloud ai pipelines run \
  --pipeline-job-spec-uri=gs://vertex-pipeline-components-public/alphafold/alphafold_pipeline.yaml \
  --parameter-values='
    project='"$PROJECT"',
    region=us-central1,
    input_fasta_gs_path=gs://my-bucket/sequences.fasta,
    output_gs_path=gs://my-bucket/alphafold-output/,
    use_gpu=true,
    model_preset=multimer
  '
```
Process results via `alphafold-database` skill patterns:
```python
# Load outputs into DuckDB for analysis
import duckdb

conn = duckdb.connect("asi.db")
conn.execute("""
    CREATE TABLE alphafold_batch AS
    SELECT * FROM read_json_auto('gs://my-bucket/alphafold-output/**/*.json')
""")

# Filter by pLDDT confidence
conn.execute("SELECT * FROM alphafold_batch WHERE mean_plddt > 90 ORDER BY mean_plddt DESC")
```
## 5. Vertex AI Matching Engine for Protein Similarity Search
Index AlphaFold structure embeddings for semantic protein search:
```python
import subprocess

import numpy as np
import requests
from google.cloud import aiplatform

# Step 1: Generate ESM embeddings for all proteins in the dataset
def embed_sequences(sequences):
    """Call the ESM endpoint (from section 2) to batch-embed sequences."""
    access_token = subprocess.check_output(
        ["gcloud", "auth", "print-access-token"]
    ).strip().decode()
    response = requests.post(
        f"https://us-central1-aiplatform.googleapis.com/v1/projects/{PROJECT}"
        f"/locations/us-central1/endpoints/{ESM_ENDPOINT_ID}:predict",
        headers={"Authorization": f"Bearer {access_token}"},
        json={"instances": [{"sequence": s} for s in sequences]},
    )
    return np.array(response.json()["predictions"])

# Step 2: Create Matching Engine index
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="protein-embedding-index",
    contents_delta_uri=f"gs://{BUCKET}/protein-embeddings/",
    dimensions=1280,  # ESM3 embedding dimension
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)

# Step 3: Query for similar proteins
endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="protein-similarity-endpoint",
    public_endpoint_enabled=True,
)
endpoint.deploy_index(index=index)

# Find proteins similar to a query sequence
query_embedding = embed_sequences(["MKTAYIAKQR..."])[0]
neighbors = endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=[query_embedding.tolist()],
    num_neighbors=10,
)
```
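With `DOT_PRODUCT_DISTANCE`, Matching Engine ranks neighbors by inner product. A brute-force local equivalent is handy for sanity-checking small embedding sets before building the index (the `top_k_by_dot` helper below is illustrative, not part of the SDK):

```python
def top_k_by_dot(query, vectors, k=10):
    """Brute-force analogue of a DOT_PRODUCT_DISTANCE nearest-neighbor query."""
    scored = [
        (i, sum(q * v for q, v in zip(query, vec)))  # inner product per candidate
        for i, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest dot product first
    return scored[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k_by_dot([1.0, 0.0], vectors, k=2))  # [(0, 1.0), (2, 0.7)]
```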
## 6. Cost-Optimized Batch Docking via Vertex AI
Run DiffDock at scale using Vertex AI Batch Prediction with spot VMs:
```python
from google.cloud import aiplatform

# Create batch prediction job (spot VMs = 60-90% cost reduction)
batch_job = aiplatform.BatchPredictionJob.create(
    job_display_name="diffdock-batch-screen",
    model_name=DIFFDOCK_MODEL_ID,
    instances_format="jsonl",
    gcs_source=f"gs://{BUCKET}/ligand-protein-pairs.jsonl",
    predictions_format="jsonl",
    gcs_destination_prefix=f"gs://{BUCKET}/docking-results/",
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    starting_replica_count=10,
    max_replica_count=50,
    service_account=SERVICE_ACCOUNT,
)
```
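The spot-VM saving translates directly into per-run job cost. A back-of-envelope sketch (the hourly rate and the 70% discount are placeholder numbers within the 60-90% range above, not quoted GCP prices):

```python
def batch_cost(replicas, hours, on_demand_rate, spot_discount=0.7):
    """Estimate on-demand vs. spot cost for a batch docking run."""
    on_demand = replicas * hours * on_demand_rate
    spot = on_demand * (1 - spot_discount)
    return on_demand, spot

on_demand, spot = batch_cost(replicas=50, hours=4, on_demand_rate=0.95)
# 50 replicas x 4 h x $0.95/h = $190 on demand; ~$57 at a 70% spot discount
```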
## Gap Registry: What Vertex AI Cannot Replace
| Capability | ASI Skill | Vertex AI | Status |
|---|---|---|---|
| Structure retrieval (200M) | alphafold-database | Inference only | ASI owns retrieval |
| Protein sequence design | esm (ESM3) | Ginkgo LLM (limited) | ASI owns design |
| Molecular docking | diffdock | None | ASI owns this |
| ADMET prediction | deepchem | None | ASI owns this |
| Wet-lab validation | adaptyv | None | ASI owns this |
| BigQuery genomics | None | ✅ gnomAD, 1TB free | Vertex AI gap |
| Workflow orchestration | None | ✅ KFP Pipelines | Vertex AI gap |
| Scalable inference | Manual | ✅ Managed Endpoints | Vertex AI gap |
| Protein similarity search | None | ✅ Matching Engine | Vertex AI gap |
## Related ASI Skills
- `alphafold-database` — structure retrieval; batch analysis feed for Vertex Pipelines
- `esm` — ESM3 for design; deployable as a Vertex AI Endpoint
- `diffdock` — docking; deployable as a Vertex Batch Prediction job
- `deepchem` — ADMET screening; post-design filter stage
- `adaptyv` — wet-lab validation; final pipeline stage
- `torchdrug` — KG reasoning; drug target validation
- `uniprot-database` — sequence/annotation retrieval; input data source
- `gget` — rapid queries; replaces manual API calls in the pipeline
- `bigquery-asi-interleave` — parent GCP bridge; gnomAD query patterns
- `vertex-asi-interleave` — sibling Vertex bridge; generative AI patterns
- `lolita` — latent diffusion physics; analogous pipeline pattern for PDE emulation