Learn-skills.dev -- ai-infrastructure-replicate

Replicate SDK patterns for TypeScript/Node.js -- client setup, predictions, streaming, webhooks, file handling, model versioning, deployments, and training.

Install by cloning the repository:

```sh
git clone https://github.com/NeverSight/learn-skills.dev
```

Or copy just this skill into your local skills directory:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agents-inc/skills/ai-infrastructure-replicate" ~/.claude/skills/neversight-learn-skills-dev-ai-infrastructure-replicate && rm -rf "$T"
```

Replicate SDK Patterns
Quick Guide: Use the `replicate` npm package to run open-source ML models on serverless GPUs. Use `replicate.run()` for synchronous execution that returns output directly, `replicate.stream()` for SSE-based streaming, or `replicate.predictions.create()` for async background jobs with webhook notifications. Models are referenced as `owner/model` (uses latest version) or `owner/model:version` (pinned). File outputs are `FileOutput` objects implementing `ReadableStream`. Cold starts are expected for infrequently-used models -- use deployments with `min_instances` to keep models warm.
<critical_requirements>
CRITICAL: Before Using This Skill
- All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
- You MUST never hardcode API tokens -- always use environment variables via `process.env.REPLICATE_API_TOKEN`
- You MUST handle `FileOutput` objects for models that return files -- do not assume outputs are plain strings or URLs
- You MUST validate webhooks using `validateWebhook()` from the `replicate` package -- never trust unverified webhook payloads
- You MUST account for cold starts when running infrequently-used models -- use deployments for latency-sensitive applications
- You MUST specify model versions (`owner/model:version`) in production to ensure reproducible results -- unversioned references use the latest, which can change
</critical_requirements>
Auto-detection: Replicate, replicate, replicate.run, replicate.stream, replicate.predictions, replicate.deployments, replicate.trainings, replicate.models, FileOutput, validateWebhook, REPLICATE_API_TOKEN, serverless GPU, cold start, webhook_events_filter
When to use:
- Running open-source ML models (Llama, Stable Diffusion, Whisper, etc.) without managing GPU infrastructure
- Generating images, transcribing audio, running LLMs, or any ML inference via API
- Streaming LLM output in real-time with server-sent events
- Processing predictions asynchronously with webhook notifications
- Fine-tuning models with custom training data
- Running models on dedicated hardware with custom scaling via deployments
Key patterns covered:
- Client initialization and configuration (auth, user agent, file encoding)
- Running predictions (`replicate.run()`, `replicate.predictions.create()`, `replicate.wait()`)
- Streaming output (`replicate.stream()` with SSE events)
- Model versioning (`owner/model` vs `owner/model:version`)
- File input/output handling (`FileOutput`, file uploads, `Buffer` inputs)
- Webhooks (setup, event filtering, signature validation)
- Deployments (custom hardware, scaling, keeping models warm)
- Training / fine-tuning
When NOT to use:
- You need a unified multi-provider LLM SDK (OpenAI, Anthropic, Google) -- use a provider-agnostic SDK
- You want to run models locally -- Replicate is a cloud-only serverless platform
- You need sub-second latency guarantees without deployments -- cold starts can take minutes
Examples Index
- Core: Setup, Predictions & Files -- Client init, run(), predictions.create(), wait(), file I/O, error handling
- Streaming & Webhooks -- stream(), SSE events, webhook setup, signature validation
- Deployments & Training -- Custom hardware, scaling, fine-tuning, model management
- Quick API Reference -- Method signatures, constructor options, error types, model reference format
<philosophy>
Philosophy
Replicate provides serverless GPU infrastructure for running open-source ML models. You send inputs, Replicate allocates GPU hardware, runs the model, and returns outputs. No Docker, no CUDA drivers, no GPU provisioning.
Core principles:
- Serverless execution -- Models run on-demand on Replicate's infrastructure. You pay only for compute time. Cold starts are a trade-off for not maintaining always-on GPUs.
- Model marketplace -- Thousands of community and official models available at replicate.com/explore. Run any public model with just its identifier.
- Version pinning for reproducibility -- Models are versioned with SHA-256 hashes. Pin to a version in production (`owner/model:abc123...`) to guarantee identical behavior across deploys.
- Three execution modes -- `replicate.run()` for synchronous wait, `replicate.stream()` for real-time SSE output, `replicate.predictions.create()` for fire-and-forget with webhooks.
- File-first I/O -- Many models accept and produce files (images, audio, video). The SDK handles file uploads automatically and returns `FileOutput` objects for file outputs.
</philosophy>
<patterns>
Core Patterns
Pattern 1: Client Setup
Initialize the Replicate client. It auto-reads `REPLICATE_API_TOKEN` from the environment.
```typescript
// lib/replicate.ts -- basic setup
import Replicate from "replicate";

const replicate = new Replicate();

export { replicate };
```
```typescript
// lib/replicate.ts -- explicit auth + custom user agent
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN, // Auto-reads from env if omitted
  userAgent: "my-app/1.0.0",
});

export { replicate };
```
Why good: Minimal setup, env var auto-detected, explicit auth optional but useful for clarity
```typescript
// BAD: Hardcoded token
const replicate = new Replicate({
  auth: "r8_abc123...",
});
```
Why bad: Hardcoded API token is a security risk, will leak in version control
See: examples/core.md for full constructor options, error handling patterns
Pattern 2: Running Predictions
Use `replicate.run()` for synchronous execution. Returns the model output directly.
```typescript
// Run an image generation model
const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: {
    prompt: "a serene mountain landscape at sunset",
  },
});

// output is a FileOutput object for image models
console.log(output.url()); // URL of generated image
```
```typescript
// Run an LLM -- output is a string for text models
const output = await replicate.run("meta/meta-llama-3-70b-instruct", {
  input: {
    prompt: "Explain TypeScript generics in 3 sentences.",
    max_tokens: 512,
  },
});

console.log(output); // Text response
```
Why good: Simple API, returns output directly, destructuring works for array outputs (images)
```typescript
// BAD: Not pinning version in production
const output = await replicate.run("community-user/experimental-model", {
  input: { prompt: "hello" },
});
```
Why bad: Community models without version pinning can change behavior unexpectedly when authors push updates
See: examples/core.md for version pinning, `predictions.create()` + `wait()`, and progress callbacks
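As a sketch of the progress callbacks mentioned above, assuming the optional third argument to `run()` that receives the in-flight prediction on each status update:

```typescript
// Sketch: progress callback -- the third argument to run() (if supported by
// your SDK version) is called with the prediction as its status changes
const output = await replicate.run(
  "black-forest-labs/flux-schnell",
  { input: { prompt: "a lighthouse at dawn" } },
  (prediction) => {
    console.log(`status: ${prediction.status}`); // "starting" -> "processing" -> "succeeded"
  },
);
```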
Pattern 3: Streaming
Use `replicate.stream()` for real-time SSE output from language models.
```typescript
const stream = replicate.stream("meta/meta-llama-3-70b-instruct", {
  input: {
    prompt: "Write a short poem about TypeScript.",
    max_tokens: 512,
  },
});

for await (const event of stream) {
  if (event.event === "output") {
    process.stdout.write(event.data);
  }
}
```
Why good: Progressive output for better UX, event-based with typed `event` and `data` fields
```typescript
// BAD: Using replicate.run() for user-facing LLM output
const output = await replicate.run("meta/meta-llama-3-70b-instruct", {
  input: { prompt: "Write a long essay..." },
});
// User waits for entire generation to complete before seeing anything
```
Why bad: No progressive feedback, user sees a blank screen for seconds
See: examples/streaming-webhooks.md for event types, error handling, cancellation
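As a sketch of handling the other event types (the "output", "error", and "done" names follow the `ServerSentEvent` shape described in the gotchas below):

```typescript
// Sketch: handle all three SSE event types from replicate.stream()
for await (const event of replicate.stream("meta/meta-llama-3-70b-instruct", {
  input: { prompt: "Summarize the SOLID principles." },
})) {
  if (event.event === "output") {
    process.stdout.write(event.data); // Incremental text
  } else if (event.event === "error") {
    throw new Error(`Stream error: ${event.data}`); // Surface model errors
  } else if (event.event === "done") {
    break; // Generation finished
  }
}
```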
Pattern 4: Model Versioning
Models are referenced as `owner/model` (latest version) or `owner/model:sha256hash` (pinned version).
```typescript
// Development: use latest version for convenience
const devOutput = await replicate.run("stability-ai/sdxl", {
  input: { prompt: "a cat" },
});

// Production: pin to a specific version for reproducibility
const VERSION_HASH =
  "39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b";
const output = await replicate.run(`stability-ai/sdxl:${VERSION_HASH}`, {
  input: { prompt: "a cat" },
});
```
Why good: Pinned version guarantees identical behavior, hash is immutable
See: examples/core.md for listing model versions, getting version details
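A minimal sketch of finding a hash worth pinning, assuming the SDK's `models.versions.list()` method and the HTTP API's paginated `results` response shape:

```typescript
// Sketch: list a model's versions to pick a hash to pin
const versions = await replicate.models.versions.list("stability-ai", "sdxl");

for (const version of versions.results) {
  console.log(version.id, version.created_at); // id is the SHA-256 hash to pin
}
```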
Pattern 5: File Handling
Models that output files return `FileOutput` objects implementing `ReadableStream`.
```typescript
import { writeFile } from "node:fs/promises";

const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "a sunset over mountains" },
});

// FileOutput has .url() and .blob() methods
console.log(output.url()); // Underlying URL

// Save to disk
const blob = await output.blob();
const buffer = Buffer.from(await blob.arrayBuffer());
await writeFile("./output.png", buffer);
```
```typescript
// File inputs: pass URLs, Buffers, or ReadStreams
import { readFile } from "node:fs/promises";

const imageBuffer = await readFile("./input.png");

const output = await replicate.run("some-user/image-model", {
  input: {
    image: imageBuffer, // Auto-uploaded (max 100 MiB)
  },
});
```
Why good: `FileOutput` is a `ReadableStream`, works with Node.js stream APIs, `.url()` for the underlying URL
```typescript
// BAD: Treating file output as a plain URL string
const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "hello" },
});

const url = output; // WRONG: output is a FileOutput object, not a string
```
Why bad: `FileOutput` is an object, not a string -- use `.url()` to get the URL
See: examples/core.md for file uploads, large file handling, encoding strategies
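When the same file feeds many predictions, one option is to upload it once and pass the resulting URL as input -- a hedged sketch, assuming your SDK version exposes the `files` API and that its response carries a `urls.get` link:

```typescript
// Sketch: upload once via the files API (assumed available), reuse the URL
import { readFile } from "node:fs/promises";

const data = await readFile("./input.png");
const file = await replicate.files.create(
  new Blob([data], { type: "image/png" }),
);

const output = await replicate.run("some-user/image-model", {
  input: { image: file.urls.get }, // URL input avoids re-uploading per run
});
```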
Pattern 6: Async Predictions with Webhooks
Use `replicate.predictions.create()` for background jobs with webhook notifications.
```typescript
const prediction = await replicate.predictions.create({
  model: "owner/model", // OR version: "sha256hash" for pinned version
  input: { prompt: "a painting of a cat" },
  webhook: "https://my.app/webhooks/replicate",
  webhook_events_filter: ["completed"],
});

console.log(prediction.id); // Use to track status
console.log(prediction.status); // "starting"
```
```typescript
// Webhook signature validation (CRITICAL for security)
import { validateWebhook } from "replicate";

async function handleWebhook(request: Request): Promise<Response> {
  const secret = process.env.REPLICATE_WEBHOOK_SIGNING_SECRET;
  const isValid = await validateWebhook(request, secret);

  if (!isValid) {
    return new Response("Invalid signature", { status: 401 });
  }

  const prediction = await request.json();
  // Process prediction.output safely
  return new Response("OK", { status: 200 });
}
```
Why good: Decoupled processing, secure signature validation, filtered events reduce noise
See: examples/streaming-webhooks.md for webhook event types, polling alternative
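As a sketch of the polling alternative, using `replicate.predictions.get()` (the interval is an arbitrary choice; prefer webhooks in production):

```typescript
// Sketch: poll a prediction by id until it reaches a terminal state
const POLL_INTERVAL_MS = 2_000; // Arbitrary; tune for your workload

async function pollPrediction(id: string) {
  for (;;) {
    const prediction = await replicate.predictions.get(id);
    if (["succeeded", "failed", "canceled"].includes(prediction.status)) {
      return prediction;
    }
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
}
```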
Pattern 7: Deployments
Deployments give you a private, fixed endpoint with custom hardware and scaling.
```typescript
// Create a prediction on a deployment (no cold start if min_instances > 0)
// Note: the SDK takes the deployment owner and name as separate arguments
const prediction = await replicate.deployments.predictions.create(
  "my-org",
  "my-deployment",
  { input: { prompt: "hello world" } },
);

const result = await replicate.wait(prediction);
console.log(result.output);
```
Why good: Predictable latency with `min_instances`, private endpoint, custom hardware selection
See: examples/deployments-training.md for creating/managing deployments, training API
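For orientation, a hedged sketch of creating a deployment and starting a fine-tune; the `deployments.create()` and `trainings.create()` signatures plus the hardware, version, and destination values are illustrative assumptions, not confirmed by this skill:

```typescript
// Sketch: create a deployment with warm instances (values illustrative)
const deployment = await replicate.deployments.create({
  name: "my-deployment",
  model: "owner/model",
  version: "39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
  hardware: "gpu-a100-large", // Illustrative hardware SKU
  min_instances: 1, // Keeps the model warm
  max_instances: 3,
});

// Sketch: start a fine-tune -- trainings.create(owner, model, versionId, opts)
const training = await replicate.trainings.create("owner", "model", "version-id", {
  destination: "my-org/my-fine-tune", // Hypothetical destination model
  input: { train_data: "https://my.app/training-data.jsonl" },
});
console.log(training.status); // "starting"
```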
Pattern 8: Error Handling
Catch API errors with status codes. The SDK auto-retries on 429 and 5xx errors (5 retries by default with exponential backoff).
```typescript
try {
  const output = await replicate.run("owner/model", {
    input: { prompt: "hello" },
  });
} catch (error) {
  if (error instanceof Error) {
    console.error(`Replicate error: ${error.message}`);

    // Check for specific HTTP status codes in the error
    if ("status" in error) {
      const status = (error as { status: number }).status;
      if (status === 401) {
        throw new Error("Invalid API token. Check REPLICATE_API_TOKEN.");
      }
      if (status === 422) {
        console.error("Invalid input parameters");
      }
      if (status === 429) {
        console.error("Rate limited -- SDK auto-retries (5 attempts) exhausted");
      }
    }
  }
  throw error;
}
```
Why good: Checks error type, handles specific status codes, re-throws unexpected errors
See: examples/core.md for full error handling example with status code handling
</patterns>

<performance>
Performance Optimization
Cold Start Mitigation
- Frequent model with varying load -> Use deployments with min_instances >= 1
- One-off batch jobs -> Use predictions.create() with webhooks (no waiting)
- Popular public models -> Usually warm, replicate.run() is fine
- Custom/niche models -> Expect 30s-5min cold start on first run
Key Optimization Patterns
- Use deployments for latency-sensitive applications -- set `min_instances: 1` to eliminate cold starts
- Use webhooks instead of polling for async jobs -- reduces API calls and latency
- Batch file inputs as URLs instead of uploading buffers -- avoids 100 MiB upload limit and is faster
- Pin model versions in production -- avoids unexpected behavior changes and enables caching
- Use `replicate.stream()` for LLMs -- progressive output feels faster than waiting for full completion
- Cancel unneeded predictions with `replicate.predictions.cancel()` -- stops billing immediately (see the sketch after this list)
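The cancellation pattern from the list above, as a minimal sketch:

```typescript
// Sketch: cancel a prediction that is no longer needed
const prediction = await replicate.predictions.create({
  model: "owner/model",
  input: { prompt: "a long render the user abandoned" },
});

const canceled = await replicate.predictions.cancel(prediction.id);
console.log(canceled.status); // "canceled" -- billing stops here
```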
<decision_framework>
Decision Framework
Which Execution Method to Use
```
Is this a user-facing LLM response?
+-- YES -> Use replicate.stream() for real-time SSE output
+-- NO -> Do you need the result immediately?
    +-- YES -> Use replicate.run() (blocks until complete)
    +-- NO -> Use replicate.predictions.create() + webhook
        +-- Need to poll instead? -> Use replicate.wait(prediction)
```
Model Reference Format
```
Are you in development/prototyping?
+-- YES -> Use owner/model (latest version, convenient)
+-- NO -> Are you in production?
    +-- YES -> Use owner/model:version_hash (pinned, reproducible)
    +-- Does the model change frequently?
        +-- YES -> Pin version, test updates explicitly
        +-- NO -> Either format works, prefer pinned
```
Deployments vs Direct API
```
Do you need consistent low latency?
+-- YES -> Create a deployment with min_instances >= 1
+-- NO -> Do you need custom hardware (A100, H100)?
    +-- YES -> Create a deployment with specific hardware
    +-- NO -> Use replicate.run() / replicate.stream() directly
              (Replicate auto-allocates hardware)
```
When to Use This SDK vs Other AI SDKs
```
Are you running open-source models on serverless GPUs?
+-- YES -> Use Replicate SDK
+-- NO -> Are you calling proprietary APIs (OpenAI, Anthropic)?
    +-- YES -> Not this skill's scope -- use provider-specific SDKs
    +-- NO -> Do you need to switch between multiple providers?
        +-- YES -> Not this skill's scope -- use a unified provider SDK
        +-- NO -> Do you want to self-host models?
            +-- YES -> Not this skill's scope -- consider Cog or vLLM
            +-- NO -> Replicate SDK is appropriate
```
</decision_framework>
<red_flags>
RED FLAGS
High Priority Issues:
- Hardcoding `REPLICATE_API_TOKEN` in source code (security breach risk)
- Treating `FileOutput` as a string (it is a `ReadableStream` object -- use `.url()` or `.blob()`)
- Not validating webhook signatures with `validateWebhook()` (allows forged webhook payloads)
- Using `replicate.run()` for long-running models in request handlers (blocks the response, can timeout) -- see the sketch after this list
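A hedged sketch of the non-blocking alternative flagged above, reusing the async-prediction pattern from Pattern 6 (the handler shape uses web-standard `Request`/`Response` and is illustrative):

```typescript
// Sketch: accept the job, respond immediately, deliver results via webhook
async function handleGenerate(request: Request): Promise<Response> {
  const { prompt } = await request.json();

  const prediction = await replicate.predictions.create({
    model: "owner/model",
    input: { prompt },
    webhook: "https://my.app/webhooks/replicate",
    webhook_events_filter: ["completed"],
  });

  // 202 Accepted -- the client tracks the job by prediction.id
  return Response.json({ id: prediction.id }, { status: 202 });
}
```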
Medium Priority Issues:
- Not pinning model versions in production (`owner/model` uses latest, which can change without notice)
- Relying solely on default retry behavior for production (5 retries with exponential backoff may be too aggressive for some use cases)
- Uploading large files as `Buffer` instead of hosting them at a URL (100 MiB limit on uploads)
- Ignoring cold start latency for infrequently-used models (first request can take minutes)
Common Mistakes:
- Confusing `replicate.run()` (returns output directly) with `replicate.predictions.create()` (returns a prediction object with status/id)
- Destructuring image output incorrectly: `const output = await replicate.run(...)` instead of `const [output] = await replicate.run(...)` (image models return arrays)
- Using `replicate.stream()` with models that do not support streaming (only language models with SSE support)
- Forgetting that `replicate.predictions.create()` accepts either a `version` hash or a `model` string (`owner/model`) -- use `version` for pinned reproducibility, `model` for latest-version convenience
- Not consuming the async iterator from `replicate.stream()` (events are lost)
Gotchas & Edge Cases:
- Prediction inputs and outputs are automatically deleted after one hour -- persist outputs via webhooks or download immediately
- The SDK auto-retries on 429 (rate limit) and 5xx errors -- 5 retries by default with exponential backoff. GET requests retry on 429 and 5xx; non-GET requests retry only on 429
- `replicate.stream()` returns `ServerSentEvent` objects with `.event` ("output", "error", "done") and `.data` (string) properties
- File uploads are limited to 100 MiB -- for larger files, host them at a URL and pass the URL as input
- Browser usage is not supported -- the SDK requires a server-side environment (Node.js 18+, Bun, Deno, Cloudflare Workers)
- `webhook_events_filter` accepts ["start", "output", "logs", "completed"] -- use ["completed"] unless you need intermediate status updates
- The `Prefer: wait` header enables sync mode on the HTTP API (up to 60s), but `replicate.run()` already handles this automatically
- Community models may disappear or change without warning -- pin versions and maintain fallbacks for critical workflows
- `replicate.wait()` polls the API until the prediction completes -- use webhooks for production to avoid polling overhead
- `FileOutput.url()` returns the underlying URL, but these URLs are temporary -- download or persist the file before it expires
</red_flags>
<critical_reminders>
CRITICAL REMINDERS
- All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
- You MUST never hardcode API tokens -- always use environment variables via `process.env.REPLICATE_API_TOKEN`
- You MUST handle `FileOutput` objects for models that return files -- do not assume outputs are plain strings or URLs
- You MUST validate webhooks using `validateWebhook()` from the `replicate` package -- never trust unverified webhook payloads
- You MUST account for cold starts when running infrequently-used models -- use deployments for latency-sensitive applications
- You MUST specify model versions (`owner/model:version`) in production to ensure reproducible results -- unversioned references use the latest, which can change
Failure to follow these rules will produce insecure, unreliable, or unpredictable AI integrations.
</critical_reminders>