Medical-research-skills esm

Toolkit for protein language models (ESM3 for multimodal generative protein design; ESM C for efficient embeddings). Use when you need sequence/structure/function generation or prediction, inverse folding, protein embeddings, or scalable inference via local weights or the Forge API.

Install

Source · Clone the upstream repo:
git clone https://github.com/aipoch/medical-research-skills

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Protocol Design/esm" ~/.claude/skills/aipoch-medical-research-skills-esm && rm -rf "$T"

Manifest: scientific-skills/Protocol Design/esm/SKILL.md

Source Content

Source: https://github.com/aipoch/medical-research-skills

When to Use

  • Designing novel proteins with desired properties by generating sequences (optionally conditioned on structure/function) using ESM3.
  • Completing or editing sequences (e.g., filling masked residues, generating variants) for protein engineering workflows.
  • Predicting 3D structure from sequence or performing inverse folding (designing sequences for a target structure) with ESM3’s structure/sequence tracks.
  • Generating protein embeddings for downstream ML tasks (classification, clustering, similarity search, function prediction) using ESM C.
  • Scaling inference to many sequences using the Forge API (async/batch execution, hosted large models).

Key Features

  • ESM3 multimodal generation across sequence, structure, and function tracks.
  • Local inference (e.g., esm3-sm-open-v1) and cloud inference via Forge (e.g., esm3-medium-2024-08, esm3-large-2024-03).
  • Structure prediction (sequence → coordinates/PDB) and inverse folding (structure → designed sequence).
  • Functional conditioning via function annotations to bias generation toward desired functional regions.
  • ESM C embeddings for efficient, high-quality protein representations.
  • Async batch processing with Forge for high-throughput workloads.

Additional reference docs (if present in this skill package):

  • references/esm3-api.md (ESM3 API, generation parameters, multimodal prompting)
  • references/esm-c-api.md (ESM C API, embedding strategies, optimization)
  • references/forge-api.md (authentication, rate limits, batching)
  • references/workflows.md (end-to-end workflows)

Dependencies

  • esm (Python package; install via pip/uv)
  • flash-attn (optional; recommended for faster attention on supported GPUs)

Version notes: exact versions depend on your environment and CUDA/PyTorch stack. Install commands below reflect the upstream package usage.

Example Usage

The following script demonstrates:

  1. local ESM3 sequence completion,
  2. Forge-based async batch generation, and
  3. local ESM C embeddings.
"""
End-to-end example for ESM:
- Local ESM3: sequence completion
- Forge ESM3: async batch generation (requires token)
- Local ESM C: embeddings
"""

import os
import asyncio

# ---------- 1) Local ESM3: sequence completion ----------
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

def local_esm3_sequence_completion():
    # Load a local ESM3 model (open weights)
    model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")

    # '_' indicates masked/unknown residues to be generated
    protein = ESMProtein(sequence="MPRT___KEND")

    completed = model.generate(
        protein,
        GenerationConfig(track="sequence", num_steps=8)
    )
    print("Local ESM3 completed sequence:", completed.sequence)


# ---------- 2) Forge ESM3: async batch generation ----------
from esm.sdk.forge import ESM3ForgeInferenceClient

async def forge_batch_generation():
    token = os.environ.get("FORGE_TOKEN", "<token>")
    client = ESM3ForgeInferenceClient(
        model="esm3-medium-2024-08",
        url="https://forge.evolutionaryscale.ai",
        token=token,
    )

    proteins = [ESMProtein(sequence="MPRT" + "_" * 50 + "KEND") for _ in range(5)]
    tasks = [
        client.async_generate(p, GenerationConfig(track="sequence", num_steps=50))
        for p in proteins
    ]
    results = await asyncio.gather(*tasks)
    print("Forge batch results (first):", results[0].sequence)


# ---------- 3) Local ESM C: embeddings ----------
from esm.models.esmc import ESMC
from esm.sdk.api import LogitsConfig

def local_esmc_embeddings():
    model = ESMC.from_pretrained("esmc-300m").to("cuda")

    protein = ESMProtein(sequence="MPRTKEINDAGLIVHSP")
    encoded = model.encode(protein)
    # Request embeddings explicitly; LogitsConfig controls which outputs
    # are returned alongside the sequence logits.
    output = model.logits(encoded, LogitsConfig(sequence=True, return_embeddings=True))

    # output.embeddings holds per-residue representations; the exact
    # shape depends on the model/config.
    print("ESM C embeddings computed:", output.embeddings.shape)


if __name__ == "__main__":
    local_esm3_sequence_completion()

    # Run the Forge example only if a valid token is set:
    # export FORGE_TOKEN="..."
    if os.environ.get("FORGE_TOKEN"):
        asyncio.run(forge_batch_generation())

    local_esmc_embeddings()

Installation Commands

# Base
uv pip install esm

# Optional acceleration (GPU environments where supported)
uv pip install flash-attn --no-build-isolation

Implementation Details

ESM3 Tracks and Generation

  • Tracks determine what the model generates:
    • track="sequence": generates amino-acid tokens (use _ for masked positions).
    • track="structure": predicts 3D coordinates; can be exported as PDB (see references/esm3-api.md).
    • track="function": predicts or conditions on functional annotations.
  • Core generation parameters (via GenerationConfig):
    • num_steps: number of iterative generation steps; commonly aligned with the number of masked residues for sequence completion, or set to a design budget for de novo generation.
    • temperature: controls sampling diversity (lower = more deterministic; higher = more diverse).
    • Additional advanced controls and multimodal prompting patterns are documented in references/esm3-api.md.
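
As a concrete illustration of the num_steps guidance above, a small helper can derive the step budget from the masked prompt itself. This is a hypothetical convenience function (plain Python, no esm dependency); the name and signature are illustrative, not part of the esm SDK:

```python
def masked_prompt(prefix: str, suffix: str, n_masked: int) -> tuple[str, int]:
    """Build an ESM3-style masked prompt ('_' marks residues to generate)
    and return it together with a matching num_steps budget."""
    sequence = prefix + "_" * n_masked + suffix
    # For sequence completion, num_steps is commonly set to the
    # number of masked positions.
    return sequence, sequence.count("_")

seq, steps = masked_prompt("MPRT", "KEND", 3)
print(seq, steps)  # MPRT___KEND 3
```

The returned step count would then be passed as GenerationConfig(track="sequence", num_steps=steps).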

Structure Prediction and Inverse Folding

  • Structure prediction: provide a sequence and generate on the structure track to obtain coordinates and/or a PDB representation.
  • Inverse folding: start from a target structure (e.g., ESMProtein.from_pdb(...)), remove/omit the sequence, then generate on the sequence track to design a sequence compatible with the structure.
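
The inverse-folding recipe above can be sketched as one function. This is a hedged sketch, not a tested pipeline: it assumes the esm package and a CUDA device, and it assumes the loaded protein's coordinates length equals the chain length; the function name and pdb_path parameter are illustrative:

```python
def design_sequence_for_structure(pdb_path: str) -> str:
    # Requires the `esm` package and a GPU; imports are kept local so
    # the sketch can be read without the dependency installed.
    from esm.models.esm3 import ESM3
    from esm.sdk.api import ESMProtein, GenerationConfig

    model = ESM3.from_pretrained("esm3-sm-open-v1").to("cuda")

    # Load the target backbone and drop the native sequence so that
    # only the structure track conditions generation.
    protein = ESMProtein.from_pdb(pdb_path)
    protein.sequence = None

    designed = model.generate(
        protein,
        GenerationConfig(track="sequence", num_steps=len(protein.coordinates)),
    )
    return designed.sequence
```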

ESM C Embeddings

  • ESM C models are optimized for representation learning:
    • Use model.encode(ESMProtein(...)) to tokenize/prepare inputs.
    • Use model.forward(...) (or the logits(...) helper with LogitsConfig(return_embeddings=True)) to obtain embeddings/logits suitable for downstream tasks (classification, clustering, similarity).
  • For batching and performance strategies (padding, caching, normalization), see references/esm-c-api.md.
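
Downstream tasks usually need one fixed-size vector per protein rather than per-residue embeddings; mean pooling over residue positions is a common baseline. The sketch below uses plain Python lists so the idea stays independent of any tensor library — with the real model you would average the per-residue embeddings along the sequence dimension instead:

```python
def mean_pool(per_residue: list[list[float]]) -> list[float]:
    """Collapse per-residue embeddings (L rows of dimension D) into one
    protein-level vector of dimension D by averaging over positions."""
    n = len(per_residue)
    dim = len(per_residue[0])
    return [sum(row[d] for row in per_residue) / n for d in range(dim)]

vec = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(vec)  # [3.0, 4.0]
```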

Forge API (Hosted Inference)

  • Forge provides access to larger hosted models and scalable execution:
    • Use ESM3ForgeInferenceClient(...) with a token.
    • Prefer async_generate + asyncio.gather(...) for throughput.
  • Authentication, rate limits, and batching modes are detailed in references/forge-api.md.
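
When fanning out many requests, an asyncio.Semaphore bounding in-flight calls is a common way to stay within rate limits while keeping throughput. The sketch below uses a stand-in coroutine (fake_generate, purely hypothetical) in place of client.async_generate so the control flow is runnable without a token; swap in the real client call:

```python
import asyncio

async def fake_generate(prompt: str) -> str:
    # Stand-in for client.async_generate(...); just fills masks with 'A'.
    await asyncio.sleep(0)
    return prompt.replace("_", "A")

async def bounded_batch(prompts: list[str], max_concurrent: int = 4) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)

    async def one(p: str) -> str:
        async with sem:  # at most max_concurrent requests in flight
            return await fake_generate(p)

    return await asyncio.gather(*(one(p) for p in prompts))

results = asyncio.run(bounded_batch(["MPRT__KEND"] * 3, max_concurrent=2))
print(results)  # ['MPRTAAKEND', 'MPRTAAKEND', 'MPRTAAKEND']
```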