Claude-skills cloudflare-workers-ai
Cloudflare Workers AI for serverless GPU inference. Use for LLMs, text/image generation, and embeddings, or when encountering AI_ERROR, rate limits, or token-exceeded errors.
git clone https://github.com/secondsky/claude-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/secondsky/claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/cloudflare-workers-ai/skills/cloudflare-workers-ai" ~/.claude/skills/secondsky-claude-skills-cloudflare-workers-ai && rm -rf "$T"
plugins/cloudflare-workers-ai/skills/cloudflare-workers-ai/SKILL.md
Cloudflare Workers AI - Complete Reference
Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.
Status: Production Ready ✅
Last Updated: 2025-11-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler@4.43.0, @cloudflare/workers-types@4.20251014.0
Table of Contents
- Quick Start (5 minutes)
- Workers AI API Reference
- Model Selection Guide
- Common Patterns
- AI Gateway Integration
- Rate Limits & Pricing
- Production Checklist
Quick Start (5 minutes)
1. Add AI Binding
wrangler.jsonc:
{ "ai": { "binding": "AI" } }
2. Run Your First Model
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'What is Cloudflare?',
    });
    return Response.json(response);
  },
};
3. Add Streaming (Recommended)
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true, // Always use streaming for text generation!
});

return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});
Why streaming?
- Prevents buffering large responses in memory
- Faster time-to-first-token
- Better user experience for long-form content
- Avoids Worker timeout issues
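On the receiving side, the streamed body arrives as server-sent events. The following is a minimal sketch of pulling text tokens out of one chunk of that stream. It assumes each event is a line of the form `data: {"response":"..."}` and that the stream ends with `data: [DONE]`; check this against the current Workers AI documentation. `extractTokens` is a hypothetical helper, not part of any SDK.

```typescript
// Hypothetical helper: extracts text tokens from a chunk of SSE text.
// Assumes each event line looks like `data: {"response":"..."}` and that
// the stream terminates with `data: [DONE]`.
function extractTokens(sseText: string): { tokens: string[]; done: boolean } {
  const tokens: string[] = [];
  let done = false;

  for (const line of sseText.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('data:')) continue; // skip blank lines and comments

    const payload = trimmed.slice('data:'.length).trim();
    if (payload === '[DONE]') {
      done = true;
      break;
    }

    try {
      const parsed = JSON.parse(payload) as { response?: unknown };
      if (typeof parsed.response === 'string') tokens.push(parsed.response);
    } catch {
      // Incomplete JSON is ignored here; a real client would buffer the
      // remainder and retry once the next chunk arrives.
    }
  }

  return { tokens, done };
}
```

A client would typically decode each chunk from the response body reader to text, pass it through a helper like this, and append the tokens to the UI as they arrive.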
Workers AI API Reference
Core API: env.AI.run()
const response = await env.AI.run(model, inputs, options?);
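| Parameter | Type | Description |
|---|---|---|
| model | string | Model ID (e.g., @cf/meta/llama-3.1-8b-instruct) |
| inputs | object | Model-specific inputs (see "Input Types by Model Category" below) |
| options.gateway.id | string | AI Gateway ID for caching/logging |
| options.gateway.skipCache | boolean | Skip AI Gateway cache |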
Returns:
Promise<ModelOutput> (non-streaming) or ReadableStream (streaming)
Input Types by Model Category
| Category | Key Inputs | Output |
|---|---|---|
| Text Generation | , , , | |
| Embeddings | | |
| Image Generation | , , | Binary PNG |
| Vision | | |
📄 Full model details: Load references/models-catalog.md for the complete model list, parameters, and rate limits.
Model Selection Guide
Text Generation (LLMs)
| Model | Best For | Rate Limit | Size |
|---|---|---|---|
| General purpose, fast | 300/min | 8B |
| Ultra-fast, simple tasks | 300/min | 1B |
| High quality, complex reasoning | 150/min | 14B |
| Coding, technical content | 300/min | 32B |
| Fast, efficient | 400/min | 7B |
Text Embeddings
| Model | Dimensions | Best For | Rate Limit |
|---|---|---|---|
| 768 | General purpose RAG | 3000/min |
| 1024 | High accuracy search | 1500/min |
| 384 | Fast, low storage | 3000/min |
Image Generation
| Model | Best For | Rate Limit | Speed |
|---|---|---|---|
| High quality, photorealistic | 720/min | Fast |
| General purpose | 720/min | Medium |
| Artistic, stylized | 720/min | Fast |
Vision Models
| Model | Best For | Rate Limit |
|---|---|---|
| Image understanding | 720/min |
| Fast image captioning | 720/min |
Common Patterns
Pattern 1: Chat with Streaming
// Assumes `app` is a Hono app with the AI binding, e.g. new Hono<{ Bindings: Env }>()
app.post('/chat', async (c) => {
  const { messages } = await c.req.json<{
    messages: Array<{ role: string; content: string }>;
  }>();

  const stream = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages,
    stream: true,
  });

  return new Response(stream, {
    headers: { 'content-type': 'text/event-stream' },
  });
});
Pattern 2: RAG (Retrieval Augmented Generation)
// 1. Generate embedding for query
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: [userQuery],
});

// 2. Search Vectorize (metadata is not returned unless requested)
const matches = await env.VECTORIZE.query(embeddings.data[0], {
  topK: 3,
  returnMetadata: 'all',
});

// 3. Build context from the stored chunk text
const context = matches.matches
  .map((m) => m.metadata?.text)
  .filter((text) => typeof text === 'string')
  .join('\n\n');

// 4. Generate with context
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: `Answer using this context:\n${context}` },
    { role: 'user', content: userQuery },
  ],
  stream: true,
});

return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});
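Step 3 of the RAG pattern is worth hardening, since a match may lack metadata and the joined context can overflow the model's input window. Below is a minimal sketch of one way to do that. The `Match` shape, the `buildContext` name, and the 4000-character default are illustrative assumptions, not values from Workers AI or Vectorize.

```typescript
// Minimal shape of a search match as used in this sketch (assumption).
interface Match {
  score: number;
  metadata?: Record<string, unknown>;
}

// Hypothetical helper: joins matched chunk texts into one context string.
// Skips matches without a non-empty string `text` field and stops adding
// chunks once the combined text length (separators not counted) would
// exceed `maxChars`.
function buildContext(matches: Match[], maxChars = 4000): string {
  const parts: string[] = [];
  let used = 0;

  for (const m of matches) {
    const text = m.metadata?.text;
    if (typeof text !== 'string' || text.length === 0) continue;
    if (used + text.length > maxChars) break;
    parts.push(text);
    used += text.length;
  }

  return parts.join('\n\n');
}
```

Because matches are returned in relevance order, stopping at the first chunk that does not fit keeps the highest-scoring chunks and drops the tail.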
📄 More patterns: Load references/best-practices.md for structured output, image generation, multi-model consensus, and production patterns.
AI Gateway Integration
Enable caching, logging, and cost tracking with AI Gateway:
const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { prompt: 'Hello' },
  { gateway: { id: 'my-gateway', skipCache: false } },
);
Benefits: Cost tracking, response caching (50-90% savings on repeated queries), request logging, rate limiting, analytics.
Rate Limits & Pricing
Information last verified: 2025-01-14
Rate limits and pricing vary significantly by model. Always check the official documentation for the most current information:
- Rate Limits: https://developers.cloudflare.com/workers-ai/platform/limits/
- Pricing: https://developers.cloudflare.com/workers-ai/platform/pricing/
Free Tier: 10,000 neurons/day
Paid Tier: $0.011 per 1,000 neurons
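As a rough illustration of how these two figures combine, the sketch below estimates a daily bill. It assumes the free daily allocation is subtracted before the paid rate applies, which should be confirmed on the pricing page; `estimateDailyCostUsd` is a hypothetical helper.

```typescript
// Figures taken from the tiers listed above.
const FREE_NEURONS_PER_DAY = 10000;
const USD_PER_1000_NEURONS = 0.011;

// Hypothetical helper: estimated daily cost in USD.
// Assumption: only neurons beyond the free daily allocation are billed.
function estimateDailyCostUsd(neuronsUsed: number): number {
  const billable = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1000_NEURONS;
}
```

Under that assumption, 110,000 neurons in a day leaves 100,000 billable neurons, or about $1.10.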
📄 Per-model details: See references/models-catalog.md for specific rate limits and pricing for each model.
Production Checklist
Essential before deploying:
- Enable AI Gateway for cost tracking
- Implement streaming for text generation
- Add rate limit retry with exponential backoff
- Validate input length (prevent token limit errors)
- Add input sanitization (prevent prompt injection)
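Two of the checklist items, retry with exponential backoff and input length validation, can be sketched as small helpers. This is a minimal illustration under stated assumptions: `withRetry` and `assertWithinBudget` are hypothetical names, the default delays are arbitrary, and the 4-characters-per-token ratio is only a rough proxy rather than a property of any specific model.

```typescript
// Hypothetical helper: retries `fn` with exponential backoff.
// `isRetryable` decides which errors are worth retrying (e.g. rate limits);
// `sleep` is injectable so tests can run without real delays.
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: {
    maxAttempts?: number;
    baseDelayMs?: number;
    isRetryable?: (err: unknown) => boolean;
    sleep?: (ms: number) => Promise<void>;
  } = {},
): Promise<T> {
  const {
    maxAttempts = 3,
    baseDelayMs = 500,
    isRetryable = () => true,
    sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  } = opts;

  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on the last attempt or on errors that should not be retried.
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      // Delay doubles each attempt: base, 2x base, 4x base, ...
      await sleep(baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
}

// Hypothetical guard: rejects input that exceeds a rough token budget.
// Assumption: about 4 characters per token, which varies by model and language.
function assertWithinBudget(text: string, maxTokens: number, charsPerToken = 4): void {
  if (text.length > maxTokens * charsPerToken) {
    throw new Error(`Input too long: ${text.length} chars exceeds ~${maxTokens} tokens`);
  }
}
```

In a Worker, the model call would be wrapped as `withRetry(() => env.AI.run(model, inputs))`, with `isRetryable` narrowed to whatever rate-limit error shape the application actually observes.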
📄 Full checklist: Load references/best-practices.md for the complete production checklist, error handling patterns, monitoring, and cost optimization.
External SDK Integrations
Workers AI exposes an OpenAI-compatible endpoint and integrates with the Vercel AI SDK:
import OpenAI from 'openai';
import { createWorkersAI } from 'workers-ai-provider';

// OpenAI SDK - use the same patterns with Workers AI models
const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});

// Vercel AI SDK - native integration through the AI binding
const workersai = createWorkersAI({ binding: env.AI });
📄 Full integration guide: Load references/integrations.md for OpenAI SDK, Vercel AI SDK, and REST API examples.
Limits Summary
| Feature | Limit |
|---|---|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers timeout applies (30s default, 5m max CPU) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |
When to Load References
| Reference File | Load When... |
|---|---|
| references/models-catalog.md | Choosing a model, checking rate limits, comparing model capabilities |
| references/best-practices.md | Production deployment, error handling, cost optimization, security |
| references/integrations.md | Using OpenAI SDK, Vercel AI SDK, or REST API instead of native binding |