Trending-skills freellmapi-proxy
OpenAI-compatible proxy aggregating 14 free-tier LLM providers with automatic failover and per-key rate tracking.
```bash
git clone https://github.com/Aradotso/trending-skills

T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/freellmapi-proxy" ~/.claude/skills/aradotso-trending-skills-freellmapi-proxy && rm -rf "$T"
```
skills/freellmapi-proxy/SKILL.md

FreeLLMAPI Proxy
Skill by ara.so — Daily 2026 Skills collection.
FreeLLMAPI is a self-hosted OpenAI-compatible proxy that aggregates free-tier API keys from ~14 AI providers (Google, Groq, Cerebras, SambaNova, NVIDIA, Mistral, OpenRouter, GitHub Models, Hugging Face, Cohere, Cloudflare, Zhipu, Moonshot, MiniMax) behind a single
/v1/chat/completions endpoint. It handles automatic failover on 429/5xx, per-key rate tracking, sticky sessions for multi-turn conversations, and AES-256-GCM encrypted key storage.
Installation
Prerequisites: Node.js 20+, npm.
```bash
git clone https://github.com/tashfeenahmed/freellmapi.git
cd freellmapi
npm install

# Generate encryption key and set up environment
cp .env.example .env
echo "ENCRYPTION_KEY=$(node -e "console.log(require('crypto').randomBytes(32).toString('hex'))")" >> .env

# Development (server + Vite dashboard on :5173)
npm run dev

# Production build
npm run build
node server/dist/index.js   # serves API + dashboard on :3001
```
Environment Variables
```bash
# .env
ENCRYPTION_KEY=<64-char hex string>   # Required — AES-256 key for provider key storage
PORT=3001                             # Optional — defaults to 3001
NODE_ENV=production                   # Optional
```
Never commit `.env`. The `ENCRYPTION_KEY` protects all stored provider API keys.
Key Commands
```bash
npm run dev     # Start Express server + Vite dashboard in watch mode
npm run build   # Compile TypeScript server + build React dashboard
npm run lint    # ESLint across server/ and client/
npm run test    # Run test suite
```
Provider Setup
- Open the dashboard at http://localhost:5173 (dev) or http://localhost:3001 (prod).
- Navigate to the Keys page.
- Add raw API keys for each provider you have. Keys are encrypted before SQLite storage.
- Navigate to Fallback Chain to reorder provider priority.
- Copy your unified `freellmapi-…` bearer token from the Keys page header, then verify it with the smoke test below.
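A quick way to confirm the unified token works is to list models through the proxy (the same `/v1/models` endpoint shown in the curl section below). A minimal sketch, assuming you have exported the token as `FREELLMAPI_KEY`:

```python
import os
from openai import OpenAI

# Point the standard openai SDK at the local proxy
client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key=os.environ["FREELLMAPI_KEY"],  # the freellmapi-… token from the Keys page
)

# If keys were added correctly, this lists the models the router can currently serve
for model in client.models.list():
    print(model.id)
```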
Supported providers and where to get a free key:
| Provider | Where to get a free key |
|---|---|
| Google Gemini | https://ai.google.dev |
| Groq | https://groq.com |
| Cerebras | https://cerebras.ai |
| SambaNova | https://cloud.sambanova.ai |
| NVIDIA NIM | https://build.nvidia.com |
| Mistral | https://mistral.ai |
| OpenRouter | https://openrouter.ai |
| GitHub Models | https://github.com/marketplace/models |
| Hugging Face | https://huggingface.co |
| Cohere | https://cohere.com |
| Cloudflare Workers AI | https://developers.cloudflare.com/workers-ai |
| Zhipu | https://bigmodel.cn |
| Moonshot | https://platform.moonshot.cn |
| MiniMax | https://platform.minimax.io |
Using the API
Python (openai SDK)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",  # from dashboard Keys page
)

# Let the router pick the best available provider
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain async/await in Python in two sentences."}],
)
print(response.choices[0].message.content)

# The provider that actually served the request is reported in the
# x-routed-via response header — see "Response Headers" below.
```
Request a specific model
```python
# Request a specific model — router finds a provider that has it
response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Write a haiku about SQLite."}],
)
```
Streaming
```python
stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "List 5 TypeScript best practices."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```
curl
```bash
# Non-streaming
curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer $FREELLMAPI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Streaming
curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer $FREELLMAPI_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Count to 5 slowly"}],
    "stream": true
  }'

# List available models
curl http://localhost:3001/v1/models \
  -H "Authorization: Bearer $FREELLMAPI_KEY"
```
TypeScript / Node.js
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3001/v1",
  apiKey: process.env.FREELLMAPI_KEY,
});

async function chat(userMessage: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: userMessage }],
  });
  return response.choices[0].message.content ?? "";
}

// Streaming version
async function streamChat(userMessage: string): Promise<void> {
  const stream = await client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: userMessage }],
    stream: true,
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) process.stdout.write(delta);
  }
  console.log();
}
```
Tool Calling
Tool calling works across all supported providers. OpenAI-compatible providers receive requests verbatim; Gemini requests are automatically translated to the `functionDeclarations`/`functionResponse` format and back.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# Step 1: Model requests a tool call
first = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
    tools=tools,
    tool_choice="required",
)
call = first.choices[0].message.tool_calls[0]
print(f"Tool requested: {call.function.name}({call.function.arguments})")

# Step 2: Execute the tool locally, feed result back
final = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "What's the weather in Karachi?"},
        first.choices[0].message,  # assistant message with tool_calls
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": '{"temp_c": 32, "condition": "sunny"}',
        },
    ],
    tools=tools,
)
print(final.choices[0].message.content)
```
Streaming tool calls
```python
stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
    tools=tools,
    tool_choice="required",
    stream=True,
)

tool_call_chunks = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        tool_call_chunks.extend(delta.tool_calls)
    if chunk.choices[0].finish_reason == "tool_calls":
        print("Tool call complete — assemble chunks and execute")
```
Multi-turn Conversations (Sticky Sessions)
The proxy keeps multi-turn conversations on the same model for 30 minutes to avoid hallucination spikes from mid-conversation model switches. Pass a consistent `session_id` in requests if the provider supports it, or rely on the proxy's automatic session tracking.
```python
messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

# Turn 1
messages.append({"role": "user", "content": "Write a Python function to flatten a nested list."})
resp1 = client.chat.completions.create(model="auto", messages=messages)
assistant_msg = resp1.choices[0].message
messages.append({"role": "assistant", "content": assistant_msg.content})
print(assistant_msg.content)

# Turn 2 — sticky session keeps same provider
messages.append({"role": "user", "content": "Now add type hints to that function."})
resp2 = client.chat.completions.create(model="auto", messages=messages)
print(resp2.choices[0].message.content)
```
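If you want to pin a session explicitly rather than rely on automatic tracking, one way is to pass the `session_id` mentioned above through the SDK's `extra_body`. This is a sketch only — whether the proxy reads a `session_id` field from the request body, and its exact name, is an assumption based on the description above:

```python
import uuid

# Reuse the same id for every turn of one conversation
session_id = str(uuid.uuid4())

resp = client.chat.completions.create(
    model="auto",
    messages=messages,
    extra_body={"session_id": session_id},  # hypothetical field; not part of the standard OpenAI schema
)
```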
LangChain Integration
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import os

llm = ChatOpenAI(
    model="auto",
    openai_api_base="http://localhost:3001/v1",
    openai_api_key=os.environ["FREELLMAPI_KEY"],
    streaming=True,
)

response = llm.invoke([HumanMessage(content="Summarise the CAP theorem in one paragraph.")])
print(response.content)
```
Response Headers
Every response includes diagnostic headers:
| Header | Description |
|---|---|
| x-routed-via | Which provider/model served the request |
| x-fallback-attempts | Number of providers tried before success (only present if > 0) |
```python
# Use with_raw_response to access HTTP headers alongside the parsed completion
raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "hi"}],
)
response = raw.parse()  # the usual ChatCompletion object

print(raw.headers.get("x-routed-via"))         # e.g. "groq/llama-4-scout"
print(raw.headers.get("x-fallback-attempts"))  # e.g. "2"
```
How the Router Works
```
Request arrives
   │
   ▼
Router scans fallback chain (priority order)
   │
   ├─ For each model: is there a healthy key under all rate caps?
   │      RPM / RPD / TPM / TPD tracked per (platform, model, key)
   │
   ├─ Picks first viable (platform, model, key) tuple
   │
   ├─ Decrypts key in-memory, calls provider SDK
   │
   └─ On 429 / 5xx / timeout: put key on cooldown → retry next model (up to 20 attempts)
```
Rate limit tracking: The router tracks RPM, RPD, TPM, and TPD counters per (platform, model, key) triple. When a key hits a cap it's cooled down automatically and the next viable key/model is tried.
Health checks: Background probes classify each key as `healthy`, `rate_limited`, `invalid`, or `error`. The router skips non-healthy keys without making a live request.
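For intuition only, here is a minimal sketch of the per-(platform, model, key) bookkeeping described above. It is not the proxy's actual implementation; the cap value and cooldown duration are illustrative:

```python
import time
from collections import defaultdict, deque

RPM_CAP = 30  # illustrative; real caps are configured per provider/model


class KeyTracker:
    """Tracks requests per minute for each (platform, model, key) and cools keys down on 429s."""

    def __init__(self):
        self.requests = defaultdict(deque)       # slot -> timestamps of recent requests
        self.cooldown_until = defaultdict(float)  # slot -> epoch seconds until the key is usable again

    def is_viable(self, slot):
        now = time.time()
        if now < self.cooldown_until[slot]:
            return False  # still cooling down after a 429/5xx
        window = self.requests[slot]
        while window and now - window[0] > 60:
            window.popleft()  # drop requests older than the 1-minute window
        return len(window) < RPM_CAP

    def record_request(self, slot):
        self.requests[slot].append(time.time())

    def record_rate_limit(self, slot, seconds=60):
        self.cooldown_until[slot] = time.time() + seconds
```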
Dashboard Pages
| Page | Purpose |
|---|---|
| Keys | Add/remove provider credentials, view health status, copy unified API key |
| Fallback Chain | Drag to reorder provider priority |
| Playground | Interactive chat showing which provider served each message + latency |
| Analytics | Request volume, success rate, token counts, latency, per-provider breakdown (24h/7d/30d) |
Production Deployment (Raspberry Pi / Linux)
```bash
# Build
npm run build

# Install PM2
npm install -g pm2

# Start
pm2 start server/dist/index.js --name freellmapi
pm2 save
pm2 startup
```

```nginx
# nginx reverse proxy (optional)
# /etc/nginx/sites-available/freellmapi
server {
    listen 80;
    server_name your.domain.com;

    location / {
        proxy_pass http://localhost:3001;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_buffering off;   # Required for SSE streaming
        proxy_cache off;       # Required for SSE streaming
    }
}
```
Memory footprint: ~40 MB RSS at idle on a Pi 4.
Adding a New Provider
Create a new adapter in `server/src/providers/`:
```typescript
// server/src/providers/myprovider.ts
import type { ProviderAdapter, ChatRequest, ChatResponse } from "../types";

export const myProviderAdapter: ProviderAdapter = {
  name: "myprovider",
  models: ["my-model-v1", "my-model-v2"],

  async chat(request: ChatRequest, apiKey: string): Promise<ChatResponse> {
    // Call provider API, return OpenAI-shaped response
    const res = await fetch("https://api.myprovider.com/v1/chat", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: request.model,
        messages: request.messages,
      }),
    });
    const data = await res.json();
    return {
      id: data.id,
      object: "chat.completion",
      choices: [{ message: data.choices[0].message, finish_reason: "stop", index: 0 }],
      usage: data.usage,
    };
  },

  async *stream(request: ChatRequest, apiKey: string): AsyncGenerator<string> {
    // Yield SSE chunks
  },
};
```
Register the adapter in `server/src/providers/index.ts` and add rate limit caps to the router config.
Troubleshooting
"No healthy keys available"
- Check the Keys dashboard — all keys may be rate-limited or invalid.
- Wait for cooldown (usually a few minutes for RPM limits) or add more keys.
- Verify the key is valid by testing it directly against the provider's API (example below).
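For example, a Groq key can be sanity-checked directly against Groq's own OpenAI-compatible endpoint, bypassing the proxy. The model name below is an assumption — use any model your Groq account can access:

```python
from openai import OpenAI

# Talk to Groq directly to confirm the raw key itself works
groq = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # Groq's OpenAI-compatible endpoint
    api_key="gsk_your-raw-groq-key",
)

resp = groq.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)  # any successful reply means the key is valid
```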
Requests always fall back to the same provider
- Check the Fallback Chain order in the dashboard.
- Ensure keys for higher-priority providers are marked `healthy`.
Streaming stops mid-response
- If behind nginx, ensure `proxy_buffering off` is set.
- Check provider-side token/minute caps — the stream may be cut by a mid-stream rate limit.
ENCRYPTION_KEY error on startup
- Ensure `ENCRYPTION_KEY` in `.env` is exactly 64 hex characters (32 bytes).
- Regenerate: `node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"`
Tool calls not working with a specific provider
- Not all free-tier models support function calling. Check the provider's docs.
- Try `model="auto"` — the router will pick a tool-capable model.
- Gemini tool calls are auto-translated; others pass through as-is.
High latency on first request
- Health checks run periodically in the background. The first request after startup may probe a few keys. Subsequent requests are faster.
Limitations
- Text-only — no vision/multimodal inputs
- No embeddings (`/v1/embeddings`)
- No image generation (`/v1/images/*`)
- No audio/speech (`/v1/audio/*`)
- No legacy completions (`/v1/completions`)
- No moderation (`/v1/moderations`)
- `n > 1` not supported (single completion per request)
- Single-user by design — no per-user billing or multi-tenant auth
- Personal/experimental use only — review each provider's ToS before production use