# claude-skill-registry · llm-gateway-routing

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-gateway-routing" ~/.claude/skills/majiayu000-claude-skill-registry-llm-gateway-routing && rm -rf "$T"
```

Manifest: `skills/data/llm-gateway-routing/SKILL.md`
# LLM Gateway & Routing

Configure multi-model access, fallbacks, cost optimization, and A/B testing.
## Why Use a Gateway?

Without a gateway:

- Vendor lock-in (one provider)
- No fallbacks (a provider outage takes your app down)
- Hard to A/B test models
- Scattered API keys and configs

With a gateway:

- Single API for 400+ models
- Automatic fallbacks
- Easy model switching
- Unified cost tracking
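A minimal sketch of what "easy model switching" buys you in code, assuming an OpenAI-compatible gateway endpoint (OpenRouter's URL is used here; any compatible gateway works the same way): the client and key stay fixed, and swapping providers is a one-string change.

```typescript
import OpenAI from "openai";

// One client for every provider behind the gateway
const gateway = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

// Switching providers is just a different model string
for (const model of ["anthropic/claude-3-5-sonnet", "openai/gpt-4o"]) {
  const res = await gateway.chat.completions.create({
    model,
    messages: [{ role: "user", content: "Hello!" }],
  });
  console.log(model, res.choices[0].message.content);
}
```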
## Quick Decision
| Need | Solution |
|---|---|
| Fastest setup, multi-model | OpenRouter |
| Full control, self-hosted | LiteLLM |
| Observability + routing | Helicone |
| Enterprise, guardrails | Portkey |
## OpenRouter (Recommended)

### Why OpenRouter

- **400+ models**: OpenAI, Anthropic, Google, Meta, Mistral, and more
- **Single API**: one key for all providers
- **Automatic fallbacks**: built-in reliability
- **A/B testing**: easy model comparison
- **Cost tracking**: unified billing dashboard
- **Free credits**: $1 to start
### Setup

```bash
# 1. Sign up at openrouter.ai
# 2. Get an API key from the dashboard
# 3. Add it to .env:
OPENROUTER_API_KEY=sk-or-v1-...
```
### Basic Usage

```typescript
// Using fetch
const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "anthropic/claude-3-5-sonnet",
    messages: [{ role: "user", content: "Hello!" }],
  }),
});
```
### With Vercel AI SDK (Recommended)

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const openrouter = createOpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const { text } = await generateText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});
```
### Model IDs

```typescript
// Format: provider/model-name
const models = {
  // Anthropic
  claude35Sonnet: "anthropic/claude-3-5-sonnet",
  claudeHaiku: "anthropic/claude-3-5-haiku",
  // OpenAI
  gpt4o: "openai/gpt-4o",
  gpt4oMini: "openai/gpt-4o-mini",
  // Google
  geminiPro: "google/gemini-pro-1.5",
  geminiFlash: "google/gemini-flash-1.5",
  // Meta
  llama3: "meta-llama/llama-3.1-70b-instruct",
  // Auto (OpenRouter picks the best model)
  auto: "openrouter/auto",
};
```
### Fallback Chains

```typescript
import { generateText } from "ai";

// Define fallback order
const modelChain = [
  "anthropic/claude-3-5-sonnet", // Primary
  "openai/gpt-4o",               // Fallback 1
  "google/gemini-pro-1.5",       // Fallback 2
];

async function callWithFallback(prompt: string) {
  for (const model of modelChain) {
    try {
      // `openrouter` is the provider created in the AI SDK example above
      return await generateText({ model: openrouter(model), prompt });
    } catch (error) {
      console.warn(`${model} failed, trying next...`);
    }
  }
  throw new Error("All models failed");
}
```
### Cost Routing

```typescript
// Route based on query complexity
function selectModel(query: string): string {
  const complexity = analyzeComplexity(query);

  if (complexity === "simple") {
    // Simple queries → cheap model
    return "openai/gpt-4o-mini"; // ~$0.15/1M input tokens
  } else if (complexity === "medium") {
    // Medium → balanced
    return "google/gemini-flash-1.5"; // ~$0.075/1M input tokens
  } else {
    // Complex → best quality
    return "anthropic/claude-3-5-sonnet"; // ~$3/1M input tokens
  }
}

function analyzeComplexity(query: string): "simple" | "medium" | "complex" {
  // Simple heuristics
  if (query.length < 50) return "simple";
  if (query.includes("explain") || query.includes("analyze")) return "complex";
  return "medium";
}
```
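Wiring the router into a call is then one line; a short usage sketch, reusing the `openrouter` provider and `generateText` from the AI SDK example above:

```typescript
import { generateText } from "ai";

const query = "Analyze the trade-offs between REST and GraphQL";

// Pick the cheapest model that can handle the query, then call it
const { text } = await generateText({
  model: openrouter(selectModel(query)),
  prompt: query,
});
```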
### A/B Testing

```typescript
import { generateText } from "ai";

// Deterministic assignment: hash the whole user ID so each user
// always sees the same variant (hashing only the first character
// would skew the split)
function getModel(userId: string): string {
  let hash = 0;
  for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) % 100;
  return hash < 50
    ? "anthropic/claude-3-5-sonnet" // 50%
    : "openai/gpt-4o"; // 50%
}

// Track which model was used
const model = getModel(userId);
const start = Date.now();
const { text, usage } = await generateText({
  model: openrouter(model),
  prompt,
});
await analytics.track("llm_call", {
  model,
  userId,
  latency: Date.now() - start,
  usage,
});
```
## LiteLLM (Self-Hosted)

### Why LiteLLM

- **Self-hosted**: full control over data
- **100+ providers**: broad coverage, comparable to OpenRouter
- **Load balancing**: distribute across providers
- **Cost tracking**: built-in spend management
- **Caching**: Redis or in-memory
- **Rate limiting**: per-user limits
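Per-user limits work through virtual keys: the proxy can mint scoped API keys with their own budgets and expiry. A hedged sketch against the proxy's `/key/generate` endpoint (field names follow LiteLLM's key-management API as documented; verify against your version, and assume a proxy like the one configured in the Setup section below running on localhost:4000):

```typescript
// Mint a scoped key for one user via the LiteLLM proxy.
// LITELLM_MASTER_KEY is an assumed env var holding the proxy master key.
const res = await fetch("http://localhost:4000/key/generate", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.LITELLM_MASTER_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    models: ["claude-sonnet"], // restrict to configured model names
    max_budget: 5,             // $5 budget for this key
    duration: "30d",           // key expires after 30 days
  }),
});
const { key } = await res.json();
```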
### Setup

```bash
# Install (quoted so the brackets survive your shell)
pip install 'litellm[proxy]'

# Run the proxy
litellm --config config.yaml

# Use it as an OpenAI-compatible endpoint
export OPENAI_API_BASE=http://localhost:4000
```
### Configuration

```yaml
# config.yaml
model_list:
  # Claude models
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...

  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

  # Load balanced: list the same model_name more than once and
  # requests are distributed across all matching entries
  - model_name: balanced
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
  - model_name: balanced
    litellm_params:
      model: openai/gpt-4o

# General settings
general_settings:
  master_key: sk-master-...
  database_url: postgresql://...

# Routing
router_settings:
  routing_strategy: simple-shuffle # or latency-based-routing
  num_retries: 3
  timeout: 30

# Budget limits
litellm_settings:
  max_budget: 100 # $100/month
  budget_duration: monthly
```
### Fallbacks in LiteLLM

```yaml
model_list:
  - model_name: primary
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
  - model_name: fallback-1
    litellm_params:
      model: openai/gpt-4o
  - model_name: fallback-2
    litellm_params:
      model: google/gemini-pro

# Fallbacks are declared at the router level, mapping a model name
# to the ordered list of models to try when it fails
router_settings:
  fallbacks:
    - primary: ["fallback-1", "fallback-2"]
```
### Usage

```typescript
// Use it like the OpenAI SDK
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: "sk-master-...",
});

const response = await client.chat.completions.create({
  model: "claude-sonnet", // Maps to the configured model
  messages: [{ role: "user", content: "Hello!" }],
});
```
## Routing Strategies

### 1. Cost-Based Routing

```typescript
const costTiers = {
  cheap: ["openai/gpt-4o-mini", "google/gemini-flash-1.5"],
  balanced: ["anthropic/claude-3-5-haiku", "openai/gpt-4o"],
  premium: ["anthropic/claude-3-5-sonnet", "openai/o1-preview"],
};

function routeByCost(budget: "cheap" | "balanced" | "premium"): string {
  const models = costTiers[budget];
  // Pick randomly within the tier to spread load
  return models[Math.floor(Math.random() * models.length)];
}
```
### 2. Latency-Based Routing

```typescript
// Track latency per model
const latencyStats: Record<string, number[]> = {};

function routeByLatency(): string {
  const avgLatencies = Object.entries(latencyStats)
    .map(([model, times]) => ({
      model,
      avg: times.reduce((a, b) => a + b, 0) / times.length,
    }))
    .sort((a, b) => a.avg - b.avg);
  return avgLatencies[0].model;
}

// Update after each call
function recordLatency(model: string, latencyMs: number) {
  if (!latencyStats[model]) latencyStats[model] = [];
  latencyStats[model].push(latencyMs);
  // Keep the last 100 samples
  if (latencyStats[model].length > 100) {
    latencyStats[model].shift();
  }
}
```
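A usage sketch tying the two together: time each call and feed the measurement back so the router's view stays current. It reuses the `openrouter` provider and `generateText` from earlier, and assumes `latencyStats` has been seeded with at least one model (otherwise `routeByLatency` has nothing to rank).

```typescript
import { generateText } from "ai";

async function timedCall(prompt: string) {
  const model = routeByLatency();
  const start = Date.now();
  try {
    return await generateText({ model: openrouter(model), prompt });
  } finally {
    // Record latency whether the call succeeded or failed
    recordLatency(model, Date.now() - start);
  }
}
```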
### 3. Task-Based Routing

```typescript
const taskModels = {
  coding: "anthropic/claude-3-5-sonnet",   // Best for code
  reasoning: "openai/o1-preview",          // Best for logic
  creative: "anthropic/claude-3-5-sonnet", // Best for writing
  simple: "openai/gpt-4o-mini",            // Cheap and fast
  multimodal: "google/gemini-pro-1.5",     // Vision + text
};

function routeByTask(task: keyof typeof taskModels): string {
  return taskModels[task];
}
```
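Nothing above decides which task a request actually is. A hedged heuristic classifier (the keyword lists are illustrative, not from the source; a cheap classifier model is the more robust production choice) could feed `routeByTask`:

```typescript
// Illustrative keyword heuristic; "multimodal" would be selected by
// the presence of image input rather than by text matching
function classifyTask(query: string): keyof typeof taskModels {
  const q = query.toLowerCase();
  if (/\bcode\b|function|class |bug|refactor/.test(q)) return "coding";
  if (/prove|step by step|logic|deduce/.test(q)) return "reasoning";
  if (/story|poem|rewrite|tone/.test(q)) return "creative";
  if (q.length < 50) return "simple";
  return "reasoning";
}

const model = routeByTask(classifyTask("Refactor this function to be pure"));
```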
### 4. Hybrid Routing

```typescript
interface RoutingConfig {
  task: string;
  maxCost: number;
  maxLatency: number;
}

// Assumes a model registry and a task-fitness scorer defined elsewhere
interface ModelInfo { id: string; cost: number; avgLatency: number }
declare const models: ModelInfo[];
declare function getTaskScore(modelId: string, task: string): number;

function hybridRoute(config: RoutingConfig): string {
  // Filter by cost
  const affordable = models.filter((m) => m.cost <= config.maxCost);

  // Filter by latency
  const fast = affordable.filter((m) => m.avgLatency <= config.maxLatency);

  // Select the best model for the task
  const taskScores = fast.map((m) => ({
    model: m.id,
    score: getTaskScore(m.id, config.task),
  }));

  return taskScores.sort((a, b) => b.score - a.score)[0].model;
}
```
## Best Practices

### 1. Always Have Fallbacks

```typescript
// Bad: single point of failure
const response = await openai.chat({ model: "gpt-4o", messages });

// Good: fallback chain
const models = ["gpt-4o", "claude-3-5-sonnet", "gemini-pro"];
for (const model of models) {
  try {
    return await gateway.chat({ model, messages });
  } catch (e) {
    continue; // try the next model
  }
}
```
### 2. Pin Model Versions

```typescript
// Bad: the model behind this alias can change
const model = "gpt-4";

// Good: pinned version
const model = "openai/gpt-4-0125-preview";
```
### 3. Track Costs

```typescript
// Log every call
async function trackedCall(model: string, messages: Message[]) {
  const start = Date.now();
  const response = await gateway.chat({ model, messages });
  const latency = Date.now() - start;

  await analytics.track("llm_call", {
    model,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cost: calculateCost(model, response.usage),
    latency,
  });

  return response;
}
```
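`calculateCost` is left undefined above; a minimal sketch with an illustrative price table (the numbers are examples, and prices change, so load them from config in practice):

```typescript
// Illustrative input/output prices in $ per 1M tokens
const pricing: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o-mini": { input: 0.15, output: 0.6 },
  "anthropic/claude-3-5-sonnet": { input: 3, output: 15 },
};

function calculateCost(
  model: string,
  usage: { prompt_tokens: number; completion_tokens: number },
): number {
  const p = pricing[model];
  if (!p) return 0; // unknown model: report zero rather than guess
  return (
    (usage.prompt_tokens * p.input + usage.completion_tokens * p.output) / 1e6
  );
}
```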
### 4. Set Token Limits

```typescript
// Prevent runaway costs
const response = await gateway.chat({
  model,
  messages,
  max_tokens: 500, // Limit output length
});
```
### 5. Use Caching

```yaml
# LiteLLM caching
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600 # 1 hour
```
## References

- [OpenRouter deep dive](references/openrouter-guide.md)
- [LiteLLM self-hosting](references/litellm-guide.md)
- [Advanced routing patterns](references/routing-strategies.md)
- [Helicone, Portkey, and other alternatives](references/alternatives.md)
## Templates

- [TypeScript OpenRouter setup](templates/openrouter-config.ts)
- [LiteLLM proxy config](templates/litellm-config.yaml)
- [Fallback implementation](templates/fallback-chain.ts)