api-ai-modal
Serverless GPU compute platform for AI model deployment -- web endpoints, GPU functions, model serving, and TypeScript client patterns

Install:

```bash
git clone https://github.com/agents-inc/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/agents-inc/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/dist/plugins/api-ai-modal/skills/api-ai-modal" ~/.claude/skills/agents-inc-skills-api-ai-modal && rm -rf "$T"
```

Source: dist/plugins/api-ai-modal/skills/api-ai-modal/SKILL.md

Modal Patterns
Quick Guide: Modal is a serverless GPU compute platform where you define Python functions with decorators and Modal handles containers, scaling, and GPU provisioning. TypeScript apps interact with Modal via HTTP endpoints (calling `@modal.fastapi_endpoint` or `@modal.asgi_app` functions) or the `modal` npm SDK (calling functions directly via gRPC). Define container images, secrets, and volumes as code -- no YAML config files. Use `modal deploy` for production, `modal serve` for dev.
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST define Modal functions in Python -- the TypeScript SDK can call functions and manage resources but cannot define them)
(You MUST use `@modal.fastapi_endpoint` (not the old `@modal.web_endpoint`) for simple web endpoints -- renamed in Modal 1.0)
(You MUST use `modal.Volume` for model weight caching -- `@modal.build` is deprecated in Modal 1.0)
(You MUST never hardcode secrets in Modal code -- use `modal.Secret.from_name()` and access via `os.environ`)
(You MUST bind to `0.0.0.0` (not `127.0.0.1`) when using `@modal.web_server`)
</critical_requirements>
Auto-detection: Modal, modal, modal.App, modal.Image, modal.Volume, modal.Secret, modal.gpu, modal.fastapi_endpoint, modal.asgi_app, modal.web_server, modal.Cron, modal.Period, modal deploy, modal serve, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, ModalClient
When to use:
- Deploying ML models (vLLM, Hugging Face, custom PyTorch) on serverless GPUs
- Creating HTTP API endpoints backed by GPU compute for TypeScript apps to consume
- Running scheduled GPU jobs (fine-tuning, batch inference, data processing)
- Calling Modal functions from TypeScript using the `modal` npm SDK
- Building AI inference pipelines with auto-scaling and pay-per-second billing
Key patterns covered:
- Web endpoints (`@modal.fastapi_endpoint`, `@modal.asgi_app`, `@modal.web_server`) for HTTP access
- TypeScript client patterns (fetch-based and `modal` npm SDK)
- Container images, secrets, volumes, and GPU configuration
- Model serving with vLLM and custom inference
- Scheduled functions and deployment
When NOT to use:
- Pure Python ML workloads with no TypeScript consumer -- this skill focuses on the TypeScript interaction surface
- Simple CPU-only tasks where a regular server or cloud function suffices
- When you need persistent long-running servers (Modal scales to zero by default)
Examples Index
- Core: Web Endpoints & TypeScript Client -- Defining endpoints, calling from TypeScript, authentication, GPU functions, images, secrets, volumes
- Quick API Reference -- CLI commands, decorator parameters, URL patterns, GPU types
<philosophy>
Philosophy
Modal eliminates infrastructure management for GPU workloads. Everything is code -- container images, GPU allocation, secrets, volumes, scaling rules. There are no YAML configs, Dockerfiles, or Kubernetes manifests.
Core principles:
- Infrastructure as Python code -- Container images, GPU types, secrets, and volumes are all declared as Python decorators and objects. No separate config files.
- Serverless GPU scaling -- Functions scale from zero to hundreds of GPUs automatically. You pay per second of compute, not for idle capacity.
- Two interaction models for TypeScript -- Call Modal via HTTP endpoints (most common) or via the `modal` npm SDK for direct function invocation without HTTP overhead.
- Immutable deployments -- `modal deploy` creates a named, persistent deployment with stable URLs. `modal serve` creates ephemeral dev endpoints.
When to use Modal:
- GPU inference endpoints that your TypeScript app calls via fetch
- Batch processing jobs (fine-tuning, embeddings generation, data pipelines)
- Model serving with auto-scaling (vLLM, TGI, custom PyTorch)
- Scheduled GPU tasks (cron-based retraining, periodic inference)
When NOT to use:
- You need sub-100ms cold starts (Modal cold starts are 2-4 seconds)
- You need persistent WebSocket connections beyond request/response
- Your workload is CPU-only and doesn't benefit from GPU acceleration
<patterns>
Core Patterns
Pattern 1: Web Endpoint (TypeScript Consumption)
The most common pattern: define a Python endpoint on Modal, call it from TypeScript via fetch.
Python Side
```python
# inference.py
import modal

app = modal.App("my-inference-api")

image = modal.Image.debian_slim().uv_pip_install(["transformers", "torch"])


@app.function(image=image, gpu="A10G")
@modal.fastapi_endpoint(method="POST")
def predict(payload: dict):
    # GPU-accelerated inference
    text = payload["text"]
    result = run_model(text)
    return {"prediction": result}
```
TypeScript Side
```typescript
// lib/modal-client.ts
const MODAL_ENDPOINT =
  "https://your-workspace--my-inference-api-predict.modal.run";
const REQUEST_TIMEOUT_MS = 30_000;

interface PredictionRequest {
  text: string;
}

interface PredictionResponse {
  prediction: string;
}

async function predict(input: PredictionRequest): Promise<PredictionResponse> {
  const response = await fetch(MODAL_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(input),
    signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
  });
  if (!response.ok) {
    throw new Error(
      `Modal API error [${response.status}]: ${await response.text()}`,
    );
  }
  return response.json() as Promise<PredictionResponse>;
}

export { predict };
```
Why good: Clean separation -- Python handles GPU compute, TypeScript handles the app. Named constants for config, typed request/response, timeout handling, proper error checking.
```typescript
// BAD: No timeout, no error handling, hardcoded URL in call site
const res = await fetch("https://workspace--app-fn.modal.run", {
  method: "POST",
  body: JSON.stringify({ text: "hello" }),
});
const data = await res.json();
```
Why bad: No Content-Type header (may fail), no timeout (hangs on cold start), no error checking, URL scattered across codebase
Pattern 2: Authenticated Endpoints
Modal supports proxy auth tokens that protect endpoints without spinning up containers for unauthorized requests.
Python Side
```python
@app.function(image=image, gpu="A10G")
@modal.fastapi_endpoint(method="POST", requires_proxy_auth=True)
def predict_secure(payload: dict):
    return {"prediction": run_model(payload["text"])}
```
TypeScript Side
```typescript
// lib/modal-client.ts
const MODAL_KEY = process.env.MODAL_PROXY_KEY;
const MODAL_SECRET = process.env.MODAL_PROXY_SECRET;

async function predictSecure(
  input: PredictionRequest,
): Promise<PredictionResponse> {
  if (!MODAL_KEY || !MODAL_SECRET) {
    throw new Error("Missing MODAL_PROXY_KEY or MODAL_PROXY_SECRET env vars");
  }
  const response = await fetch(MODAL_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Modal-Key": MODAL_KEY,
      "Modal-Secret": MODAL_SECRET,
    },
    body: JSON.stringify(input),
    signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
  });
  if (response.status === 401) {
    throw new Error(
      "Modal proxy auth failed -- check MODAL_PROXY_KEY and MODAL_PROXY_SECRET",
    );
  }
  if (!response.ok) {
    throw new Error(
      `Modal API error [${response.status}]: ${await response.text()}`,
    );
  }
  return response.json() as Promise<PredictionResponse>;
}
```
Why good: Auth handled at Modal's proxy layer (no container spin-up for bad requests), env vars for credentials, explicit 401 handling
Pattern 3: Modal npm SDK (Direct Function Calls)
For TypeScript apps that need to call Modal functions without HTTP overhead. Requires Node 22+.
```typescript
import { ModalClient } from "modal";

const modal = new ModalClient(); // Create once, reuse

const fn = await modal.functions.fromName("my-inference-api", "predict");

const result = await fn.remote([text]); // sync call

const call = await fn.spawn([text]); // async (fire-and-forget)
const later = await call.get(); // retrieve result later
```
Why good: No HTTP serialization overhead, typed SDK, supports async spawn for long-running jobs
See examples/core.md for complete TypeScript SDK patterns including error handling and fire-and-forget job IDs.
When to use: Backend-to-Modal calls where you control the Node.js runtime (Node 22+). Not for browser or edge runtimes.
Pattern 4: GPU Functions and Container Images
Modal functions define their compute environment inline.
```python
import modal

app = modal.App("gpu-inference")

# Container image with ML dependencies
inference_image = (
    modal.Image.debian_slim(python_version="3.11")
    .uv_pip_install(["torch==2.5.0", "transformers==4.47.0", "accelerate"])
    .apt_install(["libgl1"])
)

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"


@app.function(
    image=inference_image,
    gpu="A100",  # GPU type: "T4", "A10G", "A100", "H100", etc.
    secrets=[modal.Secret.from_name("huggingface-secret")],
    volumes={"/models": modal.Volume.from_name("model-cache", create_if_missing=True)},
    min_containers=1,  # Keep warm to avoid cold starts
    scaledown_window=300,  # Seconds before scaling to zero
)
def generate(prompt: str) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load from volume cache
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/models")
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, cache_dir="/models")
    # ... generate and return
```
Why good: Pinned dependency versions, volume-based model caching (avoids re-download), `min_containers` for warm starts, secrets for HF token
```python
# BAD: No version pinning, no volume cache, model re-downloads every cold start
@app.function(gpu="A100")
def generate(prompt: str):
    from transformers import pipeline

    pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
    return pipe(prompt)[0]["generated_text"]
```
Why bad: Unpinned deps break reproducibility, no volume means multi-GB model download on every cold start (30-60s+ delay), no secret for gated models
Pattern 5: Full ASGI App (FastAPI)
For endpoints that need routing, middleware, or multiple routes.
```python
import modal
from fastapi import FastAPI

app = modal.App("my-api")
web_app = FastAPI()


@web_app.post("/predict")
async def predict(payload: dict):
    return {"result": "prediction"}


@web_app.get("/health")
async def health():
    return {"status": "ok"}


@app.function(image=modal.Image.debian_slim().uv_pip_install(["fastapi"]))
@modal.asgi_app()
def serve():
    return web_app
```
Why good: Full FastAPI capabilities (routing, middleware, validation), multiple endpoints under one function
Pattern 6: Secrets and Environment Variables
```python
# Creating secrets via CLI:
# modal secret create my-api-keys API_KEY=sk-xxx DB_URL=postgres://...


@app.function(
    secrets=[
        modal.Secret.from_name("my-api-keys"),
        modal.Secret.from_name("huggingface-secret"),
    ]
)
def my_function():
    import os

    api_key = os.environ["API_KEY"]  # Injected by Modal
    hf_token = os.environ["HF_TOKEN"]
```
Why good: Secrets created via dashboard or CLI, referenced by name in code, accessed as standard env vars, multiple secrets composable
Pattern 7: Scheduled Functions
```python
@app.function(
    schedule=modal.Cron("0 2 * * *"),  # 2 AM daily
    image=inference_image,
    gpu="A10G",
    volumes={"/data": modal.Volume.from_name("training-data")},
)
def nightly_batch_inference():
    # Process accumulated data
    # Write results to volume
    pass


@app.function(schedule=modal.Period(hours=6))
def periodic_health_check():
    # Check model freshness, data quality, etc.
    pass
```
Why good: `modal.Cron` for precise scheduling, `modal.Period` for intervals. Scheduled functions cannot accept arguments -- use volumes or secrets for input data (see the sketch below).
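Since scheduled functions take no arguments, the usual workaround is to read work from an attached volume. A minimal sketch, assuming an illustrative `/data/pending` layout (the directory convention and `training-data` volume name are hypothetical, not part of Modal's API):

```python
import json
from pathlib import Path

import modal

app = modal.App("scheduled-jobs")
data_volume = modal.Volume.from_name("training-data", create_if_missing=True)


@app.function(schedule=modal.Cron("0 2 * * *"), volumes={"/data": data_volume})
def nightly_batch_inference():
    # Input arrives via the volume rather than via arguments: pick up
    # whatever upstream jobs dropped into /data/pending since the last run.
    for path in sorted(Path("/data/pending").glob("*.json")):
        payload = json.loads(path.read_text())
        # ... run inference on payload and write results back to /data ...
        path.unlink()  # mark as processed
    data_volume.commit()  # persist changes so other containers see them
```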
</patterns>
<decision_framework>
Decision Framework
How to Expose Modal to TypeScript
```
Does your TypeScript app need to call Modal?
+-- Via HTTP (most common)
|   +-- Single endpoint? -> @modal.fastapi_endpoint
|   +-- Multiple routes? -> @modal.asgi_app with FastAPI
|   +-- Non-Python server (vLLM, TGI)? -> @modal.web_server(port=8000)
|   +-- Need auth? -> Add requires_proxy_auth=True
+-- Via SDK (direct gRPC)
|   +-- Node 22+ backend? -> npm install modal, use ModalClient
|   +-- Browser/edge? -> Use HTTP endpoints instead
+-- Async job?
    +-- Fire-and-forget? -> SDK spawn() + later get()
    +-- Webhook callback? -> Modal calls your endpoint on completion
```
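For the webhook branch above, the convention is simply that the caller includes a callback URL in the job payload and the Modal function POSTs the result when it finishes. A hedged sketch (the payload shape and `callback_url` contract are assumptions for illustration; `run_model` stands in for your inference code, as in Pattern 1):

```python
import json
import urllib.request

import modal

app = modal.App("async-jobs")


@app.function(gpu="A10G")
def run_job(payload: dict) -> None:
    result = run_model(payload["text"])  # your inference code
    # Push the result to the caller instead of making it poll.
    request = urllib.request.Request(
        payload["callback_url"],
        data=json.dumps({"prediction": result}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(request, timeout=10)
```

The TypeScript app would trigger this with `spawn()` (via the SDK) and receive the POST on one of its own routes.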
When to Use Each Endpoint Type
```
What are you serving?
+-- Simple function -> @modal.fastapi_endpoint (auto-wraps in FastAPI)
+-- Full web app -> @modal.asgi_app (FastAPI, Starlette, FastHTML)
+-- Legacy sync app -> @modal.wsgi_app (Flask, Django)
+-- Custom server binary -> @modal.web_server(port=8000) (vLLM, TGI, Ollama)
```
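Of these, `@modal.web_server` is the only type not shown in the patterns above. A minimal sketch for fronting a non-Python server such as vLLM (the model choice and server flags are illustrative); per the critical requirements, the server must listen on `0.0.0.0` or Modal's proxy cannot reach it:

```python
import subprocess

import modal

VLLM_PORT = 8000

app = modal.App("vllm-server")
image = modal.Image.debian_slim(python_version="3.11").uv_pip_install(["vllm"])


@app.function(image=image, gpu="A100")
@modal.web_server(port=VLLM_PORT, startup_timeout=300)
def serve():
    # Start the server as a subprocess; Modal proxies HTTP to the port.
    # Bind to 0.0.0.0 -- 127.0.0.1 would leave the endpoint unreachable.
    subprocess.Popen(
        [
            "vllm",
            "serve",
            "meta-llama/Llama-3.1-8B-Instruct",
            "--host",
            "0.0.0.0",
            "--port",
            str(VLLM_PORT),
        ]
    )
```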
HTTP vs SDK
```
How should TypeScript call Modal?
+-- Browser/edge runtime? -> HTTP (fetch)
+-- Server-side Node 22+? -> Either works
|   +-- Need simplicity? -> HTTP
|   +-- Need speed (no serialization overhead)? -> SDK
|   +-- Need async spawn? -> SDK
+-- Multiple providers? -> HTTP (vendor-agnostic)
```
</decision_framework>
<red_flags>
RED FLAGS
High Priority Issues:
- Using deprecated `@modal.web_endpoint` instead of `@modal.fastapi_endpoint` (renamed in Modal 1.0)
- Using `@modal.build` for downloading model weights (deprecated -- use `modal.Volume` instead)
- Using `.lookup()` for object references (deprecated -- use `.from_name()`)
- Hardcoding secrets in Python source (use `modal.Secret.from_name()` + `os.environ`)
- Binding `@modal.web_server` to `127.0.0.1` instead of `0.0.0.0` (endpoint unreachable)
Medium Priority Issues:
- Not pinning dependency versions in `uv_pip_install()` (breaks reproducibility)
- No `min_containers` for latency-sensitive endpoints (2-4s cold starts)
- Missing `signal: AbortSignal.timeout()` on TypeScript fetch calls (hangs on cold starts)
- Not using volumes for model weight caching (re-downloads multi-GB models on cold starts)
- Using `modal.Period` when you need exact times (use `modal.Cron` -- Period resets on redeploy)
Common Mistakes:
- Confusing `modal serve` (ephemeral dev) with `modal deploy` (persistent production)
- Using a `messages` parameter with `@modal.fastapi_endpoint` (it is not OpenAI -- it is a plain HTTP endpoint)
- Forgetting that scheduled functions cannot accept arguments -- use volumes, secrets, or global variables for input
- Not setting the `Content-Type: application/json` header from TypeScript (FastAPI endpoints may reject the request)
- Using the `modal` npm SDK in browser or edge runtimes (requires Node 22+, native modules)
Gotchas & Edge Cases:
- Web endpoint max request timeout is 150 seconds (enforced by Modal's proxy)
- Rate limit: 200 requests/second default with 5-second burst multiplier
- Request bodies up to 4 GiB, response bodies unlimited
- `modal serve` URLs get a `-dev` suffix to avoid production conflicts
- URL pattern: `https://<workspace>--<app-name>-<function-name>.modal.run`
- Labels exceeding 63 characters are truncated with a SHA-256 hash suffix
- Volumes v1 are limited to 500,000 files; v2 has no limit (use `version=2`)
- `modal.Cron` maintains its schedule across redeploys; `modal.Period` resets
- Container class parameters are limited to primitives: `str`, `int`, `bool`, `bytes`
- Local Python files require `Image.add_local_python_source()` (automounting removed in 1.0) -- see the sketch after this list
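A short sketch of the volume and automounting gotchas in code form (the `my_helpers` module and volume name are illustrative):

```python
import modal

# Volume v2 lifts the 500,000-file cap of v1.
cache = modal.Volume.from_name("model-cache", create_if_missing=True, version=2)

# Automounting was removed in 1.0: local modules must be added explicitly.
image = modal.Image.debian_slim().add_local_python_source("my_helpers")

app = modal.App("gotchas-demo")


@app.function(image=image, volumes={"/models": cache})
def fn():
    import my_helpers  # importable because of add_local_python_source

    ...
```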
</red_flags>
<critical_reminders>
CRITICAL REMINDERS
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST define Modal functions in Python -- the TypeScript SDK can call functions and manage resources but cannot define them)
(You MUST use `@modal.fastapi_endpoint` (not the old `@modal.web_endpoint`) for simple web endpoints -- renamed in Modal 1.0)
(You MUST use `modal.Volume` for model weight caching -- `@modal.build` is deprecated in Modal 1.0)
(You MUST never hardcode secrets in Modal code -- use `modal.Secret.from_name()` and access via `os.environ`)
(You MUST bind to `0.0.0.0` (not `127.0.0.1`) when using `@modal.web_server`)
Failure to follow these rules will produce broken deployments, security vulnerabilities, or unreachable endpoints.
</critical_reminders>