Claude-skills cloudflare-workers-ai

Cloudflare Workers AI for serverless GPU inference. Use for LLMs, text/image generation, embeddings, or when encountering AI_ERROR, rate-limit, or token-exceeded errors.

install
source · Clone the upstream repo
git clone https://github.com/secondsky/claude-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/secondsky/claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/cloudflare-workers-ai/skills/cloudflare-workers-ai" ~/.claude/skills/secondsky-claude-skills-cloudflare-workers-ai && rm -rf "$T"
manifest: plugins/cloudflare-workers-ai/skills/cloudflare-workers-ai/SKILL.md
source content

Cloudflare Workers AI - Complete Reference

Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.

Status: Production Ready ✅
Last Updated: 2025-11-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler@4.43.0, @cloudflare/workers-types@4.20251014.0


Table of Contents

  1. Quick Start (5 minutes)
  2. Workers AI API Reference
  3. Model Selection Guide
  4. Common Patterns
  5. AI Gateway Integration
  6. Rate Limits & Pricing
  7. Production Checklist

Quick Start (5 minutes)

1. Add AI Binding

wrangler.jsonc:

{
  "ai": {
    "binding": "AI"
  }
}
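
If the project uses Wrangler's generated types, running npx wrangler types regenerates the Env interface to include the AI binding; step 2 below writes it out by hand for clarity.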

2. Run Your First Model

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'What is Cloudflare?',
    });

    return Response.json(response);
  },
};
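
To try it locally, run npx wrangler dev and request the dev URL; note that AI binding calls are proxied to Cloudflare's API even in local dev, so inference still runs remotely and counts against your usage.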

3. Add Streaming (Recommended)

const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true, // Always use streaming for text generation!
});

return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});

Why streaming?

  • Prevents buffering large responses in memory
  • Faster time-to-first-token
  • Better user experience for long-form content
  • Avoids Worker timeout issues
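
On the consuming side, the stream arrives as server-sent events. A minimal browser-side sketch, assuming the Worker above is deployed at a hypothetical URL and emits the usual data: {"response": "..."} SSE lines terminated by data: [DONE] (verify against your actual responses):

// Minimal SSE consumer for the streaming Worker above.
// The URL is hypothetical; the data framing is the usual Workers AI SSE format.
const res = await fetch('https://my-worker.example.workers.dev');
const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const events = buffer.split('\n\n'); // SSE events are blank-line separated
  buffer = events.pop() ?? '';
  for (const event of events) {
    const data = event.replace(/^data: /, '').trim();
    if (!data || data === '[DONE]') continue;
    const { response } = JSON.parse(data);
    console.log(response); // each chunk of generated text as it arrives
  }
}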

Workers AI API Reference

Core API:
env.AI.run()

const response = await env.AI.run(model, inputs, options?);
| Parameter | Type | Description |
|---|---|---|
| model | string | Model ID (e.g., @cf/meta/llama-3.1-8b-instruct) |
| inputs | object | Model-specific inputs (see model type below) |
| options.gateway.id | string | AI Gateway ID for caching/logging |
| options.gateway.skipCache | boolean | Skip AI Gateway cache |

Returns: Promise<ModelOutput> (non-streaming) or ReadableStream (streaming)
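
Because the return type depends on stream, a handler that supports both modes can narrow on ReadableStream. A minimal sketch (wantStream is a hypothetical flag decided by the caller):

// wantStream: hypothetical flag, e.g. derived from a query param
const wantStream = true;

const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Hello' }],
  stream: wantStream,
});

if (result instanceof ReadableStream) {
  // Streaming: forward the SSE bytes directly
  return new Response(result, { headers: { 'content-type': 'text/event-stream' } });
}

// Non-streaming: a plain object, e.g. { response: string } for text models
return Response.json(result);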

Input Types by Model Category

| Category | Key Inputs | Output |
|---|---|---|
| Text Generation | messages[], stream, max_tokens, temperature | { response: string } |
| Embeddings | text: string \| string[] | { data: number[][], shape: number[] } |
| Image Generation | prompt, num_steps, guidance | Binary PNG |
| Vision | messages[].content[].image_url | { response: string } |
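
For example, an embeddings call takes text and returns one row vector per input, with shape describing the matrix (a sketch using the bge-base model listed below):

const { data, shape } = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['first document', 'second document'],
});
// shape is [2, 768] here: two inputs, 768 dimensions each (bge-base)
const firstVector: number[] = data[0];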

📄 Full model details: Load references/models-catalog.md for complete model list, parameters, and rate limits.


Model Selection Guide

Text Generation (LLMs)

| Model | Best For | Rate Limit | Size |
|---|---|---|---|
| @cf/meta/llama-3.1-8b-instruct | General purpose, fast | 300/min | 8B |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B |
| @cf/qwen/qwen1.5-14b-chat-awq | High quality, complex reasoning | 150/min | 14B |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical content | 300/min | 32B |
| @hf/thebloke/mistral-7b-instruct-v0.1-awq | Fast, efficient | 400/min | 7B |

Text Embeddings

| Model | Dimensions | Best For | Rate Limit |
|---|---|---|---|
| @cf/baai/bge-base-en-v1.5 | 768 | General purpose RAG | 3000/min |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy search | 1500/min |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage | 3000/min |

Image Generation

| Model | Best For | Rate Limit | Speed |
|---|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | Fast |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | Medium |
| @cf/lykon/dreamshaper-8-lcm | Artistic, stylized | 720/min | Fast |
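
Because image models return binary PNG data (see the input-types table above), a Worker can serve the result directly. A minimal sketch with SDXL; the prompt and step count are illustrative:

const image = await env.AI.run('@cf/stabilityai/stable-diffusion-xl-base-1.0', {
  prompt: 'a lighthouse at dusk, photorealistic',
  num_steps: 20,
});

// Serve the PNG bytes as-is
return new Response(image, { headers: { 'content-type': 'image/png' } });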

Vision Models

| Model | Best For | Rate Limit |
|---|---|---|
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min |
| @cf/unum/uform-gen2-qwen-500m | Fast image captioning | 720/min |
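
One way to encode these recommendations is a task-to-model map; the model IDs come from the tables in this guide, and the task names are illustrative:

// Illustrative defaults drawn from the model tables above
const MODELS = {
  chat: '@cf/meta/llama-3.1-8b-instruct',               // general purpose, fast
  light: '@cf/meta/llama-3.2-1b-instruct',              // ultra-fast, simple tasks
  code: '@cf/deepseek-ai/deepseek-r1-distill-qwen-32b', // coding, technical content
  embed: '@cf/baai/bge-base-en-v1.5',                   // general purpose RAG
  image: '@cf/black-forest-labs/flux-1-schnell',        // high quality images
  vision: '@cf/meta/llama-3.2-11b-vision-instruct',     // image understanding
} as const;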

Common Patterns

Pattern 1: Chat with Streaming

app.post('/chat', async (c) => {
  const { messages } = await c.req.json<{ messages: Array<{ role: string; content: string }> }>();
  const stream = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', { messages, stream: true });
  return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });
});

Pattern 2: RAG (Retrieval Augmented Generation)

// Assumes a Vectorize binding named VECTORIZE alongside the AI binding
// 1. Generate embedding for query
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [userQuery] });
// 2. Search Vectorize (request metadata so the stored text comes back with each match)
const matches = await env.VECTORIZE.query(embeddings.data[0], { topK: 3, returnMetadata: 'all' });
// 3. Build context from the matched chunks
const context = matches.matches.map((m) => m.metadata?.text).join('\n\n');
// 4. Generate with context
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: `Answer using this context:\n${context}` },
    { role: 'user', content: userQuery },
  ],
  stream: true,
});
return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });

📄 More patterns: Load references/best-practices.md for structured output, image generation, multi-model consensus, and production patterns.


AI Gateway Integration

Enable caching, logging, and cost tracking with AI Gateway:

const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hello' }, {
  gateway: { id: 'my-gateway', skipCache: false },
});

Benefits: Cost tracking, response caching (50-90% savings on repeated queries), request logging, rate limiting, analytics.


Rate Limits & Pricing

Information last verified: 2025-01-14

Rate limits and pricing vary significantly by model; always check the official documentation for the most current information.

Free Tier: 10,000 neurons/day
Paid Tier: $0.011 per 1,000 neurons
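
As a worked example at the listed rates: a day that consumes 50,000 neurons uses the 10,000 free neurons first, leaving 40,000 billable, i.e. 40 × $0.011 ≈ $0.44 for that day.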

📄 Per-model details: See references/models-catalog.md for specific rate limits and pricing for each model.


Production Checklist

Essential before deploying:

  • Enable AI Gateway for cost tracking
  • Implement streaming for text generation
  • Add rate limit retry with exponential backoff (see the sketch after this list)
  • Validate input length (prevent token limit errors)
  • Add input sanitization (prevent prompt injection)
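
A minimal backoff sketch for the retry item above. The error-message check is illustrative; match on whatever your deployment actually surfaces:

// Hypothetical helper: retries a Workers AI call with exponential backoff
// when the failure looks like a rate-limit error.
async function runWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      const isRateLimit = /rate limit|429/i.test(message); // illustrative check
      if (!isRateLimit || attempt >= maxAttempts - 1) throw err;
      // Exponential backoff with jitter: ~500ms, ~1s, ~2s, ...
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage
const response = await runWithRetry(() =>
  env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hello' }),
);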

📄 Full checklist: Load references/best-practices.md for complete production checklist, error handling patterns, monitoring, and cost optimization.


External SDK Integrations

Workers AI supports OpenAI SDK compatibility and Vercel AI SDK:

// OpenAI SDK - use same patterns with Workers AI models
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});

// Vercel AI SDK - native integration
import { createWorkersAI } from 'workers-ai-provider';
const workersai = createWorkersAI({ binding: env.AI });
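
Usage then follows each SDK's normal surface. A sketch, assuming workers-ai-provider exposes the created provider as a model factory (verify against the package docs):

import { generateText } from 'ai';

// OpenAI SDK: chat completions against a Workers AI model
const completion = await openai.chat.completions.create({
  model: '@cf/meta/llama-3.1-8b-instruct',
  messages: [{ role: 'user', content: 'Hello' }],
});

// Vercel AI SDK: generateText with the provider created above
const { text } = await generateText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'Hello',
});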

📄 Full integration guide: Load references/integrations.md for OpenAI SDK, Vercel AI SDK, and REST API examples.


Limits Summary

| Feature | Limit |
|---|---|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers timeout applies (30s default, 5m max CPU) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |

When to Load References

| Reference File | Load When... |
|---|---|
| references/models-catalog.md | Choosing a model, checking rate limits, comparing model capabilities |
| references/best-practices.md | Production deployment, error handling, cost optimization, security |
| references/integrations.md | Using OpenAI SDK, Vercel AI SDK, or REST API instead of native binding |
