Trending-skills freellmapi-proxy

OpenAI-compatible proxy aggregating 14 free-tier LLM providers with automatic failover and per-key rate tracking.

Install

Source · Clone the upstream repo
git clone https://github.com/Aradotso/trending-skills

Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/freellmapi-proxy" ~/.claude/skills/aradotso-trending-skills-freellmapi-proxy && rm -rf "$T"

Manifest: skills/freellmapi-proxy/SKILL.md

Source content

FreeLLMAPI Proxy

Skill by ara.so — Daily 2026 Skills collection.

FreeLLMAPI is a self-hosted OpenAI-compatible proxy that aggregates free-tier API keys from ~14 AI providers (Google, Groq, Cerebras, SambaNova, NVIDIA, Mistral, OpenRouter, GitHub Models, Hugging Face, Cohere, Cloudflare, Zhipu, Moonshot, MiniMax) behind a single /v1/chat/completions endpoint. It handles automatic failover on 429/5xx, per-key rate tracking, sticky sessions for multi-turn conversations, and AES-256-GCM encrypted key storage.


Installation

Prerequisites: Node.js 20+, npm.

git clone https://github.com/tashfeenahmed/freellmapi.git
cd freellmapi
npm install

# Generate encryption key and set up environment
cp .env.example .env
echo "ENCRYPTION_KEY=$(node -e "console.log(require('crypto').randomBytes(32).toString('hex'))")" >> .env

# Development (server + Vite dashboard on :5173)
npm run dev

# Production build
npm run build
node server/dist/index.js   # serves API + dashboard on :3001

Environment Variables

# .env
ENCRYPTION_KEY=<64-char hex string>   # Required — AES-256 key for provider key storage
PORT=3001                              # Optional — defaults to 3001
NODE_ENV=production                    # Optional

Never commit .env. The ENCRYPTION_KEY protects all stored provider API keys.


Key Commands

npm run dev        # Start Express server + Vite dashboard in watch mode
npm run build      # Compile TypeScript server + build React dashboard
npm run lint       # ESLint across server/ and client/
npm run test       # Run test suite

Provider Setup

  1. Open the dashboard at http://localhost:5173 (dev) or http://localhost:3001 (prod).
  2. Navigate to the Keys page.
  3. Add raw API keys for each provider you have. Keys are encrypted before SQLite storage.
  4. Navigate to Fallback Chain to reorder provider priority.
  5. Copy your unified freellmapi-… bearer token from the Keys page header.

Supported providers and what to put in:

| Provider | Where to get a free key |
|---|---|
| Google Gemini | https://ai.google.dev |
| Groq | https://groq.com |
| Cerebras | https://cerebras.ai |
| SambaNova | https://cloud.sambanova.ai |
| NVIDIA NIM | https://build.nvidia.com |
| Mistral | https://mistral.ai |
| OpenRouter | https://openrouter.ai |
| GitHub Models | https://github.com/marketplace/models |
| Hugging Face | https://huggingface.co |
| Cohere | https://cohere.com |
| Cloudflare Workers AI | https://developers.cloudflare.com/workers-ai |
| Zhipu | https://bigmodel.cn |
| Moonshot | https://platform.moonshot.cn |
| MiniMax | https://platform.minimax.io |
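
Once keys are added, a quick sanity check is to list the models the proxy exposes via the /v1/models endpoint shown later. A minimal sketch (the model IDs returned depend on which providers you configured):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",  # unified token from the Keys page
)

# List the models the proxy currently exposes across your configured providers
for model in client.models.list():
    print(model.id)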

Using the API

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",  # from dashboard Keys page
)

# Let the router pick the best available provider
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain async/await in Python in two sentences."}],
)

print(response.choices[0].message.content)
# Which provider actually served the request is exposed via HTTP response
# headers — see the "Response Headers" section below for how to read them.

Request a specific model

# Request a specific model — router finds a provider that has it
response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Write a haiku about SQLite."}],
)

Streaming

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "List 5 TypeScript best practices."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

curl

# Non-streaming
curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer $FREELLMAPI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Streaming
curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer $FREELLMAPI_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Count to 5 slowly"}],
    "stream": true
  }'

# List available models
curl http://localhost:3001/v1/models \
  -H "Authorization: Bearer $FREELLMAPI_KEY"

TypeScript / Node.js

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3001/v1",
  apiKey: process.env.FREELLMAPI_KEY,
});

async function chat(userMessage: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: userMessage }],
  });
  return response.choices[0].message.content ?? "";
}

// Streaming version
async function streamChat(userMessage: string): Promise<void> {
  const stream = await client.chat.completions.create({
    model: "auto",
    messages: [{ role: "user", content: userMessage }],
    stream: true,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) process.stdout.write(delta);
  }
  console.log();
}

Tool Calling

Tool calling works across all supported providers. OpenAI-compatible providers receive requests verbatim; Gemini requests are automatically translated to functionDeclarations / functionResponse format and back.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# Step 1: Model requests a tool call
first = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
    tools=tools,
    tool_choice="required",
)

call = first.choices[0].message.tool_calls[0]
print(f"Tool requested: {call.function.name}({call.function.arguments})")

# Step 2: Execute the tool locally, feed result back
final = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "What's the weather in Karachi?"},
        first.choices[0].message,  # assistant message with tool_calls
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": '{"temp_c": 32, "condition": "sunny"}',
        },
    ],
    tools=tools,
)

print(final.choices[0].message.content)

Streaming tool calls

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
    tools=tools,
    tool_choice="required",
    stream=True,
)

tool_call_chunks = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        tool_call_chunks.extend(delta.tool_calls)
    if chunk.choices[0].finish_reason == "tool_calls":
        print("Tool call complete — assemble chunks and execute")

Multi-turn Conversations (Sticky Sessions)

The proxy keeps multi-turn conversations on the same model for 30 minutes to avoid hallucination spikes from mid-conversation model switches. Pass a consistent session_id in requests if the provider supports it, or rely on the proxy's automatic session tracking.

messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

# Turn 1
messages.append({"role": "user", "content": "Write a Python function to flatten a nested list."})
resp1 = client.chat.completions.create(model="auto", messages=messages)
assistant_msg = resp1.choices[0].message
messages.append({"role": "assistant", "content": assistant_msg.content})
print(assistant_msg.content)

# Turn 2 — sticky session keeps same provider
messages.append({"role": "user", "content": "Now add type hints to that function."})
resp2 = client.chat.completions.create(model="auto", messages=messages)
print(resp2.choices[0].message.content)

LangChain Integration

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import os

llm = ChatOpenAI(
    model="auto",
    openai_api_base="http://localhost:3001/v1",
    openai_api_key=os.environ["FREELLMAPI_KEY"],
    streaming=True,
)

response = llm.invoke([HumanMessage(content="Summarise the CAP theorem in one paragraph.")])
print(response.content)
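
Because streaming=True is set, the same client can also stream tokens through LangChain's standard interface. A small usage sketch continuing the example above:

# Stream the answer token-by-token through the proxy
for chunk in llm.stream([HumanMessage(content="Summarise the CAP theorem in one paragraph.")]):
    print(chunk.content, end="", flush=True)
print()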

Response Headers

Every response includes diagnostic headers:

| Header | Description |
|---|---|
| X-Routed-Via | <platform>/<model> — which provider served the request |
| X-Fallback-Attempts | Number of providers tried before success (only present if > 0) |

# Use the openai SDK's raw-response wrapper to read HTTP headers
raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "hi"}],
)
print(raw.headers.get("x-routed-via"))        # e.g. "groq/llama-4-scout"
print(raw.headers.get("x-fallback-attempts")) # e.g. "2"

response = raw.parse()  # the usual ChatCompletion object
print(response.choices[0].message.content)

How the Router Works

Request arrives
      │
      ▼
Router scans fallback chain (priority order)
      │
      ├─ For each model: is there a healthy key under all rate caps?
      │     RPM / RPD / TPM / TPD tracked per (platform, model, key)
      │
      ├─ Picks first viable (platform, model, key) tuple
      │
      ├─ Decrypts key in-memory, calls provider SDK
      │
      └─ On 429 / 5xx / timeout:
            Put key on cooldown → retry next model (up to 20 attempts)

Rate limit tracking: The router tracks RPM, RPD, TPM, and TPD counters per (platform, model, key) triple. When a key hits a cap it's cooled down automatically and the next viable key/model is tried.

Health checks: Background probes classify each key as healthy, rate_limited, invalid, or error. The router skips non-healthy keys without making a live request.
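
The selection loop can be summarised in a short illustrative sketch. This is not the proxy's actual code — the real router is TypeScript and tracks all four RPM/RPD/TPM/TPD counters — and Key, call_provider, and the cap/cooldown numbers below are hypothetical stand-ins:

import time
from dataclasses import dataclass

@dataclass
class Key:
    value: str
    status: str = "healthy"        # healthy | rate_limited | invalid | error
    cooldown_until: float = 0.0
    rpm_used: int = 0
    rpm_cap: int = 30              # hypothetical cap; real caps vary per provider/model

    def viable(self) -> bool:
        return (self.status == "healthy"
                and time.time() >= self.cooldown_until
                and self.rpm_used < self.rpm_cap)

def call_provider(platform: str, model: str, key: str, messages: list) -> dict:
    """Stand-in for the real provider SDK call."""
    raise NotImplementedError

def route(messages: list, fallback_chain: list, max_attempts: int = 20) -> dict:
    """Walk the fallback chain in priority order; cool keys down on failure."""
    attempts = 0
    for platform, model, keys in fallback_chain:
        for key in keys:
            if attempts >= max_attempts:
                raise RuntimeError("Exceeded max fallback attempts")
            if not key.viable():
                continue                      # unhealthy or over a rate cap: skip
            attempts += 1
            key.rpm_used += 1
            try:
                return call_provider(platform, model, key.value, messages)
            except Exception:
                # 429 / 5xx / timeout: put this key on cooldown, try the next one
                key.cooldown_until = time.time() + 60.0
    raise RuntimeError("No healthy keys available")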


Dashboard Pages

| Page | Purpose |
|---|---|
| Keys | Add/remove provider credentials, view health status, copy unified API key |
| Fallback Chain | Drag to reorder provider priority |
| Playground | Interactive chat showing which provider served each message + latency |
| Analytics | Request volume, success rate, token counts, latency, per-provider breakdown (24h/7d/30d) |

Production Deployment (Raspberry Pi / Linux)

# Build
npm run build

# Install PM2
npm install -g pm2

# Start
pm2 start server/dist/index.js --name freellmapi
pm2 save
pm2 startup

# nginx reverse proxy (optional)
# /etc/nginx/sites-available/freellmapi
server {
    listen 80;
    server_name your.domain.com;
    location / {
        proxy_pass http://localhost:3001;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_buffering off;          # Required for SSE streaming
        proxy_cache off;              # Disable caching for SSE streaming
    }
}

Memory footprint: ~40 MB RSS at idle on a Pi 4.


Adding a New Provider

Create a new adapter in server/src/providers/:

// server/src/providers/myprovider.ts
import type { ProviderAdapter, ChatRequest, ChatResponse } from "../types";

export const myProviderAdapter: ProviderAdapter = {
  name: "myprovider",
  models: ["my-model-v1", "my-model-v2"],

  async chat(request: ChatRequest, apiKey: string): Promise<ChatResponse> {
    // Call provider API, return OpenAI-shaped response
    const res = await fetch("https://api.myprovider.com/v1/chat", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: request.model,
        messages: request.messages,
      }),
    });
    const data = await res.json();
    return {
      id: data.id,
      object: "chat.completion",
      choices: [{ message: data.choices[0].message, finish_reason: "stop", index: 0 }],
      usage: data.usage,
    };
  },

  async *stream(request: ChatRequest, apiKey: string): AsyncGenerator<string> {
    // Yield SSE chunks
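    // (Hypothetical sketch — forward the provider's SSE bytes as they arrive;
    //  assumes the provider exposes an OpenAI-style streaming endpoint at the same URL.)
    const res = await fetch("https://api.myprovider.com/v1/chat", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: request.model, messages: request.messages, stream: true }),
    });
    const decoder = new TextDecoder();
    for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
      yield decoder.decode(chunk, { stream: true });
    }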
  },
};

Register it in server/src/providers/index.ts and add rate limit caps to the router config.


Troubleshooting

"No healthy keys available"

  • Check the Keys dashboard — all keys may be rate-limited or invalid.
  • Wait for cooldown (usually a few minutes for RPM limits) or add more keys.
  • Verify the key is valid by testing it directly against the provider's API.
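
For the last point, many of these providers expose OpenAI-compatible endpoints, so the openai SDK can be pointed at one directly. For example, checking a raw Groq key against Groq's OpenAI-compatible base URL (other providers have their own endpoints and auth schemes):

from openai import OpenAI

# Test a raw provider key directly, bypassing the proxy
provider = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="gsk_your-raw-groq-key",
)

try:
    models = provider.models.list()        # a valid key returns the model list
    print("Key OK —", len(models.data), "models available")
except Exception as exc:
    print("Key rejected:", exc)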

Requests always fall back to the same provider

  • Check the Fallback Chain order in the dashboard.
  • Ensure keys for higher-priority providers are marked healthy.

Streaming stops mid-response

  • If behind nginx, ensure proxy_buffering off is set.
  • Check provider-side token/minute caps — the stream may be cut by a mid-stream rate limit.

ENCRYPTION_KEY error on startup

  • Ensure ENCRYPTION_KEY in .env is exactly 64 hex characters (32 bytes).
  • Regenerate: node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"

Tool calls not working with a specific provider

  • Not all free-tier models support function calling. Check the provider's docs.
  • Try model="auto" — the router will pick a tool-capable model.
  • Gemini tool calls are auto-translated; others pass through as-is.

High latency on first request

  • Health checks run periodically in the background. The first request after startup may probe a few keys. Subsequent requests are faster.

Limitations

  • Text-only — no vision/multimodal inputs
  • No embeddings (/v1/embeddings)
  • No image generation (/v1/images/*)
  • No audio/speech (/v1/audio/*)
  • No legacy completions (/v1/completions)
  • No moderation (/v1/moderations)
  • n > 1 not supported (single completion per request)
  • Single-user by design — no per-user billing or multi-tenant auth
  • Personal/experimental use only — review each provider's ToS before production use