Learn-skills.dev ai-infrastructure-replicate

Replicate SDK patterns for TypeScript/Node.js -- client setup, predictions, streaming, webhooks, file handling, model versioning, deployments, and training

install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agents-inc/skills/ai-infrastructure-replicate" ~/.claude/skills/neversight-learn-skills-dev-ai-infrastructure-replicate && rm -rf "$T"
manifest: data/skills-md/agents-inc/skills/ai-infrastructure-replicate/SKILL.md
source content

Replicate SDK Patterns

Quick Guide: Use the replicate npm package to run open-source ML models on serverless GPUs. Use replicate.run() for synchronous execution that returns output directly, replicate.stream() for SSE-based streaming, or replicate.predictions.create() for async background jobs with webhook notifications. Models are referenced as owner/model (uses the latest version) or owner/model:version (pinned). File outputs are FileOutput objects implementing ReadableStream. Cold starts are expected for infrequently-used models -- use deployments with min_instances to keep models warm.
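
A minimal end-to-end sketch (same model as Patterns 2 and 5 below; assumes REPLICATE_API_TOKEN is exported in your environment):

// quick-start.ts -- minimal sketch
import Replicate from "replicate";

const replicate = new Replicate(); // Auto-reads REPLICATE_API_TOKEN

const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "a friendly robot waving hello" },
});

console.log(output.url()); // URL of the generated image (see Pattern 5)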


<critical_requirements>

CRITICAL: Before Using This Skill

  • All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
  • You MUST never hardcode API tokens -- always use environment variables via process.env.REPLICATE_API_TOKEN
  • You MUST handle FileOutput objects for models that return files -- do not assume outputs are plain strings or URLs
  • You MUST validate webhooks using validateWebhook() from the replicate package -- never trust unverified webhook payloads
  • You MUST account for cold starts when running infrequently-used models -- use deployments for latency-sensitive applications
  • You MUST specify model versions (owner/model:version) in production to ensure reproducible results -- unversioned references use the latest version, which can change

</critical_requirements>


Auto-detection: Replicate, replicate, replicate.run, replicate.stream, replicate.predictions, replicate.deployments, replicate.trainings, replicate.models, FileOutput, validateWebhook, REPLICATE_API_TOKEN, serverless GPU, cold start, webhook_events_filter

When to use:

  • Running open-source ML models (Llama, Stable Diffusion, Whisper, etc.) without managing GPU infrastructure
  • Generating images, transcribing audio, running LLMs, or any ML inference via API
  • Streaming LLM output in real-time with server-sent events
  • Processing predictions asynchronously with webhook notifications
  • Fine-tuning models with custom training data
  • Running models on dedicated hardware with custom scaling via deployments

Key patterns covered:

  • Client initialization and configuration (auth, user agent, file encoding)
  • Running predictions (replicate.run(), replicate.predictions.create(), replicate.wait())
  • Streaming output (replicate.stream() with SSE events)
  • Model versioning (owner/model vs owner/model:version)
  • File input/output handling (FileOutput, file uploads, Buffer inputs)
  • Webhooks (setup, event filtering, signature validation)
  • Deployments (custom hardware, scaling, keeping models warm)
  • Training / fine-tuning

When NOT to use:

  • You need a unified multi-provider LLM SDK (OpenAI, Anthropic, Google) -- use a provider-agnostic SDK
  • You want to run models locally -- Replicate is a cloud-only serverless platform
  • You need sub-second latency guarantees without deployments -- cold starts can take minutes

Examples Index

  • examples/core.md -- client setup, predictions, versioning, file handling, error handling
  • examples/streaming-webhooks.md -- streaming event types, webhook setup and validation
  • examples/deployments-training.md -- creating and managing deployments, training API


<philosophy>

Philosophy

Replicate provides serverless GPU infrastructure for running open-source ML models. You send inputs, Replicate allocates GPU hardware, runs the model, and returns outputs. No Docker, no CUDA drivers, no GPU provisioning.

Core principles:

  1. Serverless execution -- Models run on-demand on Replicate's infrastructure. You pay only for compute time. Cold starts are a trade-off for not maintaining always-on GPUs.
  2. Model marketplace -- Thousands of community and official models available at replicate.com/explore. Run any public model with just its identifier.
  3. Version pinning for reproducibility -- Models are versioned with SHA-256 hashes. Pin to a version in production (owner/model:abc123...) to guarantee identical behavior across deploys.
  4. Three execution modes -- replicate.run() for synchronous wait, replicate.stream() for real-time SSE output, replicate.predictions.create() for fire-and-forget with webhooks.
  5. File-first I/O -- Many models accept and produce files (images, audio, video). The SDK handles file uploads automatically and returns FileOutput objects for file outputs.
</philosophy>
<patterns>

Core Patterns

Pattern 1: Client Setup

Initialize the Replicate client. It auto-reads REPLICATE_API_TOKEN from the environment.

// lib/replicate.ts -- basic setup
import Replicate from "replicate";

const replicate = new Replicate();

export { replicate };
// lib/replicate.ts -- explicit auth + custom user agent
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN, // Auto-reads from env if omitted
  userAgent: "my-app/1.0.0",
});

export { replicate };

Why good: Minimal setup, env var auto-detected, explicit auth optional but useful for clarity

// BAD: Hardcoded token
const replicate = new Replicate({
  auth: "r8_abc123...",
});

Why bad: Hardcoded API token is a security risk, will leak in version control

See: examples/core.md for full constructor options, error handling patterns
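
Beyond auth and userAgent, the constructor accepts a few more options. A sketch of the commonly useful ones (option names assumed from the SDK's public typings -- verify against your installed version):

// lib/replicate.ts -- extended options (sketch)
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
  userAgent: "my-app/1.0.0",
  // baseUrl: "https://api.replicate.com/v1", // override, e.g. to route through a proxy
  // fileEncodingStrategy: "upload", // how file inputs are sent ("upload" or "data-uri")
});

export { replicate };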


Pattern 2: Running Predictions

Use replicate.run() for synchronous execution. It returns the model output directly.

// Run an image generation model
const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: {
    prompt: "a serene mountain landscape at sunset",
  },
});

// output is a FileOutput object for image models
console.log(output.url()); // URL of generated image
// Run an LLM -- language models typically return output as an array of text chunks
const output = await replicate.run("meta/meta-llama-3-70b-instruct", {
  input: {
    prompt: "Explain TypeScript generics in 3 sentences.",
    max_tokens: 512,
  },
});

console.log(output.join("")); // Full text response

Why good: Simple API, returns output directly, destructuring works for array outputs (images)

// BAD: Not pinning version in production
const output = await replicate.run("community-user/experimental-model", {
  input: { prompt: "hello" },
});

Why bad: Community models without version pinning can change behavior unexpectedly when authors push updates

See: examples/core.md for version pinning, predictions.create() + wait(), and progress callbacks
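
Sketches of both flows referenced above (model identifiers are the ones used elsewhere in this document):

// Create a prediction, then block until it reaches a terminal state
const prediction = await replicate.predictions.create({
  model: "black-forest-labs/flux-schnell",
  input: { prompt: "a cat" },
});

const completed = await replicate.wait(prediction);
console.log(completed.status); // "succeeded", "failed", or "canceled"
console.log(completed.output);

// Progress callback -- replicate.run() accepts one as its third argument
const [image] = await replicate.run(
  "black-forest-labs/flux-schnell",
  { input: { prompt: "a cat" } },
  (current) => console.log(current.status), // Called as the prediction updates
);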


Pattern 3: Streaming

Use replicate.stream() for real-time SSE output from language models.

const stream = replicate.stream("meta/meta-llama-3-70b-instruct", {
  input: {
    prompt: "Write a short poem about TypeScript.",
    max_tokens: 512,
  },
});

for await (const event of stream) {
  if (event.event === "output") {
    process.stdout.write(event.data);
  }
}

Why good: Progressive output for better UX, event-based with typed event and data fields

// BAD: Using replicate.run() for user-facing LLM output
const output = await replicate.run("meta/meta-llama-3-70b-instruct", {
  input: { prompt: "Write a long essay..." },
});
// User waits for entire generation to complete before seeing anything

Why bad: No progressive feedback, user sees a blank screen for seconds

See: examples/streaming-webhooks.md for event types, error handling, cancellation
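
A slightly fuller loop, handling the "error" and "done" events alongside "output" (a sketch built from the event types listed under Gotchas below):

for await (const event of replicate.stream("meta/meta-llama-3-70b-instruct", {
  input: { prompt: "Write a short poem about TypeScript." },
})) {
  switch (event.event) {
    case "output":
      process.stdout.write(event.data); // Append each token as it arrives
      break;
    case "error":
      console.error(`Stream error: ${event.data}`); // Surface model-side errors
      break;
    case "done":
      // Final event -- the stream is complete
      break;
  }
}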


Pattern 4: Model Versioning

Models are referenced as owner/model (latest version) or owner/model:sha256hash (pinned version).

// Development: use latest version for convenience
const output = await replicate.run("stability-ai/sdxl", {
  input: { prompt: "a cat" },
});

// Production: pin to a specific version for reproducibility
const VERSION_HASH =
  "39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b";
const output = await replicate.run(`stability-ai/sdxl:${VERSION_HASH}`, {
  input: { prompt: "a cat" },
});

Why good: Pinned version guarantees identical behavior, hash is immutable

See: examples/core.md for listing model versions, getting version details
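
To find a hash worth pinning, copy it from the model's page on replicate.com, or fetch it via the API. A sketch using models.get(), whose result includes a latest_version field:

// Look up the current latest version hash for pinning
const model = await replicate.models.get("stability-ai", "sdxl");
console.log(model.latest_version?.id); // SHA-256 hash to pin in production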


Pattern 5: File Handling

Models that output files return FileOutput objects implementing ReadableStream.

import { writeFile } from "node:fs/promises";

const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "a sunset over mountains" },
});

// FileOutput has .url() and .blob() methods
console.log(output.url()); // Underlying URL

// Save to disk
const blob = await output.blob();
const buffer = Buffer.from(await blob.arrayBuffer());
await writeFile("./output.png", buffer);
// File inputs: pass URLs, Buffers, or ReadStreams
import { readFile } from "node:fs/promises";

const imageBuffer = await readFile("./input.png");

const output = await replicate.run("some-user/image-model", {
  input: {
    image: imageBuffer, // Auto-uploaded (max 100 MiB)
  },
});

Why good: FileOutput is a ReadableStream, works with Node.js stream APIs, and exposes .url() for the underlying URL

// BAD: Treating file output as a plain URL string
const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "hello" },
});
const url = output; // WRONG: output is a FileOutput object, not a string

Why bad: FileOutput is an object, not a string -- use .url() to get the URL

See: examples/core.md for file uploads, large file handling, encoding strategies
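
Because FileOutput implements ReadableStream, large outputs can also be piped to disk without buffering the whole file in memory. A sketch assuming Node.js 18+, where Readable.fromWeb() is available (a cast may be needed depending on your TypeScript lib settings):

import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "a sunset over mountains" },
});

// Pipe the web ReadableStream straight into a file write stream
await pipeline(Readable.fromWeb(output), createWriteStream("./output.png"));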


Pattern 6: Async Predictions with Webhooks

Use replicate.predictions.create() for background jobs with webhook notifications.

const prediction = await replicate.predictions.create({
  model: "owner/model", // OR version: "sha256hash" for pinned version
  input: { prompt: "a painting of a cat" },
  webhook: "https://my.app/webhooks/replicate",
  webhook_events_filter: ["completed"],
});

console.log(prediction.id); // Use to track status
console.log(prediction.status); // "starting"
// Webhook signature validation (CRITICAL for security)
import { validateWebhook } from "replicate";

async function handleWebhook(request: Request): Promise<Response> {
  const secret = process.env.REPLICATE_WEBHOOK_SIGNING_SECRET;
  if (!secret) {
    return new Response("Webhook secret not configured", { status: 500 });
  }

  const isValid = await validateWebhook(request, secret);

  if (!isValid) {
    return new Response("Invalid signature", { status: 401 });
  }

  const prediction = await request.json();
  // Process prediction.output safely
  return new Response("OK", { status: 200 });
}

Why good: Decoupled processing, secure signature validation, filtered events reduce noise

See: examples/streaming-webhooks.md for webhook event types, polling alternative
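
The polling alternative mentioned above, for scripts where standing up a webhook endpoint is overkill (a sketch; prefer webhooks in production):

const TERMINAL_STATUSES = ["succeeded", "failed", "canceled"];
const POLL_INTERVAL_MS = 1_000;

// Re-fetch the prediction created earlier until it reaches a terminal state
let current = await replicate.predictions.get(prediction.id);
while (!TERMINAL_STATUSES.includes(current.status)) {
  await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  current = await replicate.predictions.get(current.id);
}
console.log(current.output);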


Pattern 7: Deployments

Deployments give you a private, fixed endpoint with custom hardware and scaling.

// Create a prediction on a deployment (no cold start if min_instances > 0)
const prediction = await replicate.deployments.predictions.create(
  "my-org", // deployment owner
  "my-deployment", // deployment name
  {
    input: { prompt: "hello world" },
  },
);

const result = await replicate.wait(prediction);
console.log(result.output);

Why good: Predictable latency with min_instances, private endpoint, custom hardware selection

See: examples/deployments-training.md for creating/managing deployments, training API
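
Deployments and fine-tunes are created through the same client. A sketch of each -- names, hardware SKU, and destination below are placeholders, so check the API reference for the current fields:

// Create a deployment (sketch)
const deployment = await replicate.deployments.create({
  name: "my-deployment",
  model: "owner/model",
  version: "39ed52f2...", // pinned version hash (placeholder)
  hardware: "gpu-a100-large", // hypothetical hardware SKU
  min_instances: 1, // keep one instance warm
  max_instances: 3,
});

// Start a fine-tune (sketch)
const training = await replicate.trainings.create(
  "owner", // owner of the base model
  "model", // base model name
  "39ed52f2...", // version to train from (placeholder)
  {
    destination: "my-org/my-fine-tune", // model that receives the trained weights
    input: { train_data: "https://example.com/data.jsonl" }, // hypothetical training input
  },
);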


Pattern 8: Error Handling

Catch API errors and inspect status codes. The SDK auto-retries with exponential backoff (5 retries by default): GET requests retry on 429 and 5xx errors; non-GET requests retry only on 429.

try {
  const output = await replicate.run("owner/model", {
    input: { prompt: "hello" },
  });
} catch (error) {
  if (error instanceof Error) {
    console.error(`Replicate error: ${error.message}`);

    // Check for specific HTTP status codes in the error
    if ("status" in error) {
      const status = (error as { status: number }).status;
      if (status === 401) {
        throw new Error("Invalid API token. Check REPLICATE_API_TOKEN.");
      }
      if (status === 422) {
        console.error("Invalid input parameters");
      }
      if (status === 429) {
        console.error(
          "Rate limited -- SDK auto-retries (5 attempts) exhausted",
        );
      }
    }
  }
  throw error;
}

Why good: Checks error type, handles specific status codes, re-throws unexpected errors

See: examples/core.md for full error handling example with status code handling
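
Note that replicate.run() throws on transport and API errors, but a prediction can also fail at the model level. When working with prediction objects, check status and error on the result (a brief sketch):

const prediction = await replicate.predictions.create({
  model: "owner/model",
  input: { prompt: "hello" },
});

const completed = await replicate.wait(prediction);
if (completed.status === "failed") {
  console.error(`Model failed: ${completed.error}`); // Model-level failure message
}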

</patterns>
<performance>

Performance Optimization

Cold Start Mitigation

Frequent model with varying load   -> Use deployments with min_instances >= 1
One-off batch jobs                  -> Use predictions.create() with webhooks (no waiting)
Popular public models               -> Usually warm, replicate.run() is fine
Custom/niche models                 -> Expect 30s-5min cold start on first run

Key Optimization Patterns

  • Use deployments for latency-sensitive applications -- set min_instances: 1 to eliminate cold starts
  • Use webhooks instead of polling for async jobs -- reduces API calls and latency
  • Pass file inputs as URLs instead of uploading buffers -- avoids the 100 MiB upload limit and is faster
  • Pin model versions in production -- avoids unexpected behavior changes and enables caching
  • Use replicate.stream() for LLMs -- progressive output feels faster than waiting for full completion
  • Cancel unneeded predictions with replicate.predictions.cancel() -- stops billing immediately (sketched below)
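
Cancellation is a one-liner on the predictions client (a sketch; useful behind a user-facing "stop" button or a timeout):

// Stop a running prediction by id (prediction obtained from predictions.create())
await replicate.predictions.cancel(prediction.id);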
</performance>

<decision_framework>

Decision Framework

Which Execution Method to Use

Is this a user-facing LLM response?
+-- YES -> Use replicate.stream() for real-time SSE output
+-- NO -> Do you need the result immediately?
    +-- YES -> Use replicate.run() (blocks until complete)
    +-- NO -> Use replicate.predictions.create() + webhook
        +-- Need to poll instead? -> Use replicate.wait(prediction)

Model Reference Format

Are you in development/prototyping?
+-- YES -> Use owner/model (latest version, convenient)
+-- NO (production) -> Use owner/model:version_hash (pinned, reproducible)
    +-- Model changes frequently? -> Pin the version and test updates explicitly
    +-- Model is stable? -> Either format works, but prefer pinned

Deployments vs Direct API

Do you need consistent low latency?
+-- YES -> Create a deployment with min_instances >= 1
+-- NO -> Do you need custom hardware (A100, H100)?
    +-- YES -> Create a deployment with specific hardware
    +-- NO -> Use replicate.run() / replicate.stream() directly
        (Replicate auto-allocates hardware)

When to Use This SDK vs Other AI SDKs

Are you running open-source models on serverless GPUs?
+-- YES -> Use Replicate SDK
+-- NO -> Are you calling proprietary APIs (OpenAI, Anthropic)?
    +-- YES -> Not this skill's scope -- use provider-specific SDKs
    +-- NO -> Do you need to switch between multiple providers?
        +-- YES -> Not this skill's scope -- use a unified provider SDK
        +-- NO -> Do you want to self-host models?
            +-- YES -> Not this skill's scope -- consider Cog or vLLM
            +-- NO -> Replicate SDK is appropriate

</decision_framework>


<red_flags>

RED FLAGS

High Priority Issues:

  • Hardcoding REPLICATE_API_TOKEN in source code (security breach risk)
  • Treating FileOutput as a string (it is a ReadableStream object -- use .url() or .blob())
  • Not validating webhook signatures with validateWebhook() (allows forged webhook payloads)
  • Using replicate.run() for long-running models in request handlers (blocks the response, can time out)

Medium Priority Issues:

  • Not pinning model versions in production (owner/model uses the latest version, which can change without notice)
  • Relying solely on default retry behavior in production (5 retries with exponential backoff may be too aggressive for some use cases)
  • Uploading large files as Buffer instead of hosting them at a URL (100 MiB limit on uploads)
  • Ignoring cold start latency for infrequently-used models (first request can take minutes)

Common Mistakes:

  • Confusing replicate.run() (returns output directly) with replicate.predictions.create() (returns a prediction object with status/id)
  • Destructuring image output incorrectly: const output = await replicate.run(...) instead of const [output] = await replicate.run(...) (image models return arrays)
  • Using replicate.stream() with models that do not support streaming (only language models with SSE support)
  • Forgetting that replicate.predictions.create() accepts either a version hash or a model string (owner/model) -- use version for pinned reproducibility, model for latest-version convenience
  • Not consuming the async iterator from replicate.stream() (events are lost)

Gotchas & Edge Cases:

  • Prediction inputs and outputs are automatically deleted after one hour -- persist outputs via webhooks or download immediately
  • The SDK auto-retries with exponential backoff (5 retries by default): GET requests retry on 429 (rate limit) and 5xx errors; non-GET requests retry only on 429
  • replicate.stream() returns ServerSentEvent objects with .event ("output", "error", "done") and .data (string) properties
  • File uploads are limited to 100 MiB -- for larger files, host them at a URL and pass the URL as input
  • Browser usage is not supported -- the SDK requires a server-side environment (Node.js 18+, Bun, Deno, Cloudflare Workers)
  • webhook_events_filter accepts ["start", "output", "logs", "completed"] -- use ["completed"] unless you need intermediate status updates
  • The Prefer: wait header enables sync mode on the HTTP API (up to 60s), but replicate.run() already handles this automatically
  • Community models may disappear or change without warning -- pin versions and maintain fallbacks for critical workflows
  • replicate.wait() polls the API until the prediction completes -- use webhooks in production to avoid polling overhead
  • FileOutput.url() returns the underlying URL, but these URLs are temporary -- download or persist the file before it expires

</red_flags>


<critical_reminders>

CRITICAL REMINDERS

  • All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
  • You MUST never hardcode API tokens -- always use environment variables via process.env.REPLICATE_API_TOKEN
  • You MUST handle FileOutput objects for models that return files -- do not assume outputs are plain strings or URLs
  • You MUST validate webhooks using validateWebhook() from the replicate package -- never trust unverified webhook payloads
  • You MUST account for cold starts when running infrequently-used models -- use deployments for latency-sensitive applications
  • You MUST specify model versions (owner/model:version) in production to ensure reproducible results -- unversioned references use the latest version, which can change

Failure to follow these rules will produce insecure, unreliable, or unpredictable AI integrations.

</critical_reminders>