tfy-gateway-skills: truefoundry-ai-gateway

Configures the TrueFoundry AI Gateway for unified OpenAI-compatible LLM access. Covers auth (PAT/VAT), model routing, rate limiting, and budget controls.

Install:

```shell
git clone https://github.com/truefoundry/tfy-gateway-skills
```

Or copy just this skill into Claude's skills directory:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/truefoundry/tfy-gateway-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ai-gateway" ~/.claude/skills/truefoundry-tfy-gateway-skills-truefoundry-ai-gateway && rm -rf "$T"
```

skills/ai-gateway/SKILL.md
- makes HTTP requests (curl)
- references API keys
<objective>
Routing note: For ambiguous user intents, use the shared clarification templates in references/intent-clarification.md.
AI Gateway
Use TrueFoundry's AI Gateway to access 1000+ LLMs through a unified OpenAI-compatible API with rate limiting, budget controls, load balancing, routing, and observability.
When to Use
Access LLMs through TrueFoundry's unified OpenAI-compatible gateway, configure auth tokens (PAT/VAT), set up rate limiting, budget controls, or load balancing across providers.
When NOT to Use
- User wants to deploy a self-hosted model → deploying self-hosted models requires a TrueFoundry Enterprise account with a connected cluster. See https://truefoundry.com
- User wants to deploy tool servers → deploying workloads requires a TrueFoundry Enterprise account with a connected cluster. See https://truefoundry.com
- User wants to manage TrueFoundry platform credentials → prefer the `status` skill; ask if the user wants another valid path
Overview
The AI Gateway sits between your application and LLM providers:
```
Your App → AI Gateway → OpenAI / Anthropic / Azure / Self-hosted vLLM / etc.
                ↑
     Unified API + Auth + Rate Limiting + Routing + Logging
```
Key benefits:
- Single endpoint for all models (cloud + self-hosted)
- One API key (PAT or VAT) instead of managing per-provider keys
- OpenAI-compatible — works with any OpenAI SDK client
- Rate limiting per user, team, or application
- Budget controls to enforce cost limits
- Load balancing across model instances with fallback
- Observability — request logging, cost tracking, analytics
Gateway Endpoint
The gateway base URL is your TrueFoundry platform URL plus `/api/llm`:

```
{TFY_BASE_URL}/api/llm
```

Example:

```
https://your-org.truefoundry.cloud/api/llm
```
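The URL convention can be captured in a small helper. This is an illustrative sketch (the `gateway_url` function name is ours, not part of any TrueFoundry SDK), assuming `TFY_BASE_URL` is exported in the environment:

```python
import os

def gateway_url(path: str) -> str:
    """Join the TrueFoundry platform URL with the /api/llm prefix and an API path."""
    base = os.environ["TFY_BASE_URL"].rstrip("/")
    return f"{base}/api/llm/{path.lstrip('/')}"

# gateway_url("chat/completions") -> "<TFY_BASE_URL>/api/llm/chat/completions"
```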
Authentication
Personal Access Token (PAT)
For development and individual use:
- Go to TrueFoundry dashboard → Access → Personal Access Tokens
- Click New Personal Access Token
- Copy the token
Virtual Access Token (VAT)
For production applications (recommended):
- Go to TrueFoundry dashboard → Access → Virtual Account Tokens
- Click New Virtual Account (requires admin privileges)
- Name it and select which models it can access
- Copy the token
VATs are recommended for production because:
- Not tied to a specific user (survives team changes)
- Support granular model access control
- Better for tracking per-application usage
Calling Models
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    api_key="<your-PAT-or-VAT>",
    base_url="https://<your-truefoundry-url>/api/llm",
)

# Chat completion
response = client.chat.completions.create(
    model="openai/gpt-4o",  # or any configured model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
```
Python (Streaming)
```python
stream = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about AI"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
cURL
```shell
curl "${TFY_BASE_URL}/api/llm/chat/completions" \
  -H "Authorization: Bearer ${TFY_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 200
  }'
```
JavaScript / Node.js
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "<your-PAT-or-VAT>",
  baseURL: "https://<your-truefoundry-url>/api/llm",
});

const response = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [{ role: "user", content: "Hello!" }],
});
```
Environment Variables
Set these to use with any OpenAI-compatible library:
```shell
export OPENAI_BASE_URL="${TFY_BASE_URL}/api/llm"
export OPENAI_API_KEY="<your-PAT-or-VAT>"
```
Then any code using `openai.OpenAI()` without explicit parameters will use the gateway automatically.
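A minimal preflight check (stdlib only; the function name is ours, and the variable names match the exports above) can catch missing configuration before the first SDK call:

```python
import os

def check_gateway_env() -> list[str]:
    """Return names of gateway environment variables that are not set."""
    required = ("OPENAI_BASE_URL", "OPENAI_API_KEY")
    return [name for name in required if not os.environ.get(name)]

missing = check_gateway_env()
if missing:
    print(f"Gateway env vars not set: {', '.join(missing)}")
```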
Supported APIs
| API | Endpoint | Description |
|---|---|---|
| Chat Completions | `/chat/completions` | Chat with any model (streaming + non-streaming) |
| Completions | `/completions` | Legacy text completions |
| Embeddings | `/embeddings` | Text embeddings (text + list inputs) |
| Image Generation | `/images/generations` | Generate images |
| Image Editing | `/images/edits` | Edit images |
| Audio Transcription | `/audio/transcriptions` | Speech-to-text |
| Audio Translation | `/audio/translations` | Translate audio |
| Text-to-Speech | `/audio/speech` | Generate speech |
| Reranking | `/rerank` | Rerank documents |
| Batch Processing | `/batches` | Batch predictions |
| Moderations | `/moderations` | Content safety |

Endpoints are relative to the gateway base URL (`{TFY_BASE_URL}/api/llm`) and follow the standard OpenAI API paths; verify exact paths in your gateway's docs.
Supported Providers
The gateway supports 25+ providers, including OpenAI, Anthropic, Google Vertex, AWS Bedrock, Azure OpenAI, Mistral, Groq, Cohere, Together AI, and self-hosted models served with vLLM or TGI.
Model names depend on how they're configured in your gateway. Check the TrueFoundry dashboard → AI Gateway → Models for exact names.
Adding Models & Providers
Currently done through the TrueFoundry dashboard UI:
- Go to AI Gateway → Models
- Click Add Provider Account
- Select provider (OpenAI, Anthropic, etc.)
- Enter API credentials
- Select models to enable
Adding Self-Hosted Models (Cluster-Internal)
After deploying a self-hosted model:
- Go to AI Gateway → Models → Add Provider Account
- Select "Self Hosted" as the provider type
- Enter the internal endpoint: `http://{model-name}.{namespace}.svc.cluster.local:8000`
- The model becomes accessible through the gateway alongside cloud models
Security: Only register model endpoints that you control. External or untrusted model endpoints can return manipulated responses. Use internal cluster DNS (`svc.cluster.local`) for self-hosted models. Verify provider API credentials are stored securely in TrueFoundry secrets, not hardcoded.
Adding External OpenAI-Compatible APIs (NVIDIA, custom providers)
For externally hosted APIs that are OpenAI-compatible (e.g. NVIDIA Cloud APIs, custom inference endpoints), use `type: provider-account/self-hosted-model` with `auth_data`:
```yaml
# gateway.yaml — External hosted API (e.g. NVIDIA Cloud)
- name: nvidia-external
  type: provider-account/self-hosted-model
  integrations:
    - name: nemotron-nano
      type: integration/model/self-hosted-model
      hosted_model_name: nvidia/nemotron-3-nano-30b-a3b
      url: "https://integrate.api.nvidia.com/v1"
      model_server: "openai-compatible"
      model_types: ["chat"]
      auth_data:
        type: bearer-auth
        bearer_token: "tfy-secret://<tenant>:<group>:<key>"
```
And in a virtual model routing target, reference it as `"<provider-account-name>/<integration-name>"`:
```yaml
targets:
  - model: "nvidia-external/nemotron-nano"  # "<provider-account-name>/<integration-name>"
```
Apply with:
```shell
tfy apply -f gateway.yaml
```
WARNING: `provider-account/nvidia-nim` does not exist in the schema — do not use it. Use `provider-account/self-hosted-model` with `auth_data` for all external OpenAI-compatible APIs (as shown above).
Schema source of truth: For authoritative field names and types, read `servicefoundry-server/src/autogen/models.ts` in the platform repo. Do not guess field names from documentation alone.
Applying Gateway Config
Gateway YAML is applied directly with `tfy apply` — no service build or Docker image involved:

```shell
# Preview changes
tfy apply -f gateway.yaml --dry-run --show-diff

# Apply
tfy apply -f gateway.yaml
```
Do NOT delegate gateway applies to a deployment skill. Gateway configs (`type: gateway-*`, `type: provider-account/*`) are applied inline with `tfy apply`.
Test after apply:
```shell
# Quick smoke test via curl
curl "${TFY_BASE_URL}/api/llm/chat/completions" \
  -H "Authorization: Bearer ${TFY_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia-external/nemotron-nano",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```
Or via Python:
```python
import os

from openai import OpenAI

TFY_BASE_URL = os.environ["TFY_BASE_URL"]
client = OpenAI(api_key="<PAT-or-VAT>", base_url=f"{TFY_BASE_URL}/api/llm")
resp = client.chat.completions.create(
    model="nvidia-external/nemotron-nano",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```
Note: One-off gateway config applies should use `tfy apply` directly. For CI/CD pipelines, integrate `tfy apply` into your existing automation.
Virtual Models & Load Balancing
Virtual models route requests across multiple model instances using a `gateway-load-balancing-config` manifest. Targets reference real catalog models as `"<provider-account-name>/<integration-name>"`.
Weight-Based Routing
```yaml
name: chat-routing
type: gateway-load-balancing-config
rules:
  - id: weighted-chat
    type: weight-based-routing
    when:
      subjects: ["*"]
      models: ["openai/gpt-4o"]
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        weight: 70
        fallback_candidate: true
        retry_config:
          delay: 100
          attempts: 1
          on_status_codes: ["429", "500", "502", "503"]
      - target: "azure-backup/gpt-4o"
        weight: 30
        fallback_candidate: true
        retry_config:
          delay: 100
          attempts: 1
          on_status_codes: ["429", "500", "502", "503"]
```
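The 70/30 split above behaves like a weighted random choice per request. A rough stdlib sketch of the selection semantics (ours, for intuition only, not the gateway's actual implementation):

```python
import random

# Hypothetical targets mirroring the 70/30 rule above.
targets = {"openai-main/gpt-4o": 70, "azure-backup/gpt-4o": 30}

def pick_target(rng: random.Random) -> str:
    """Weighted random pick, approximating weight-based routing."""
    names = list(targets)
    return rng.choices(names, weights=[targets[n] for n in names], k=1)[0]

rng = random.Random(0)
sample = [pick_target(rng) for _ in range(1_000)]
# Roughly 70% of picks land on openai-main/gpt-4o.
```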
Latency-Based Routing
Automatically routes to the lowest-latency model (measures time per output token over last 20 minutes):
```yaml
rules:
  - id: latency-chat
    type: latency-based-routing
    when:
      subjects: ["*"]
      models: ["openai/gpt-4o"]
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        fallback_candidate: true
      - target: "azure-backup/gpt-4o"
        fallback_candidate: true
```
Priority-Based Routing
Routes to highest-priority healthy model with SLA cutoff (auto-marks unhealthy when TPOT exceeds threshold):
```yaml
rules:
  - id: priority-chat
    type: priority-based-routing
    when:
      subjects: ["team:premium"]
      models: ["*"]
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        priority: 0
        sla_cutoff:
          time_per_output_token_ms: 50
        fallback_candidate: true
      - target: "azure-backup/gpt-4o"
        priority: 1
        fallback_candidate: true
```
Sticky Sessions
Pin users to the same target for a duration:
```yaml
rules:
  - id: sticky-chat
    type: weight-based-routing
    sticky_routing:
      ttl_seconds: 3600
      session_identifiers:
        - key: x-user-id
          source: headers
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        weight: 50
      - target: "azure-backup/gpt-4o"
        weight: 50
```
Header Overrides Per Target
```yaml
load_balance_targets:
  - target: "openai-main/gpt-4o"
    weight: 80
    headers_override:
      set:
        x-region: us-east-1
      remove:
        - x-internal-debug
```
Fallback Behavior
Fallback is configured per-target inside `load_balance_targets`:
- `fallback_status_codes`: defaults to `["401", "403", "404", "429", "500", "502", "503"]`
- `fallback_candidate: true` marks a target as eligible for failover
- `retry_config.on_status_codes` controls which errors trigger retries
Apply
```shell
tfy apply -f gateway-load-balancing-config.yaml --dry-run --show-diff
tfy apply -f gateway-load-balancing-config.yaml
```
Note: Targets must be real catalog models, not nested virtual models.
Rate Limiting
Configure rate limits per user, team, model, or custom metadata using a `gateway-rate-limiting-config` manifest. Only the first matching rule applies — place specific rules before generic ones.
```yaml
name: rate-limits
type: gateway-rate-limiting-config
rules:
  - id: "team-rpm-limit"
    when:
      subjects: ["team:backend"]
      models: ["openai-main/gpt-4o"]
    limit_to: 20000
    unit: tokens_per_minute
  - id: "user-daily-limit"
    when:
      subjects: ["user:bob@example.com"]
      models: ["openai-main/gpt-4o"]
    limit_to: 1000
    unit: requests_per_day
  - id: "per-project-hourly"
    when: {}
    limit_to: 50000
    unit: tokens_per_hour
    rate_limit_applies_per: ["metadata.project_id"]
  - id: "global-fallback"
    when: {}
    limit_to: 500
    unit: requests_per_minute
    rate_limit_applies_per: ["user"]
```
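Because only the first matching rule applies, ordering matters. A toy sketch of that matching logic (illustrative only; `first_matching_rule` is our name, and an empty `when` is treated as match-all, consistent with the examples above):

```python
def first_matching_rule(rules, subject, model):
    """Return the id of the first rule whose `when` clause matches; {} matches everything."""
    for rule in rules:
        when = rule.get("when", {})
        if "subjects" in when and subject not in when["subjects"]:
            continue
        if "models" in when and model not in when["models"]:
            continue
        return rule["id"]
    return None

rules = [
    {"id": "team-rpm-limit",
     "when": {"subjects": ["team:backend"], "models": ["openai-main/gpt-4o"]}},
    {"id": "global-fallback", "when": {}},
]
# A backend-team request to openai-main/gpt-4o hits the specific rule;
# anything else falls through to the global one.
```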
Units: `requests_per_minute`, `requests_per_hour`, `requests_per_day`, `tokens_per_minute`, `tokens_per_hour`, `tokens_per_day`

`rate_limit_applies_per`: creates separate limits per entity (max 2 values). Options: `user`, `model`, `virtualaccount`, `metadata.<key>`.
```shell
tfy apply -f gateway-rate-limiting-config.yaml
```
Budget Controls
Enforce cost limits per user, team, or metadata using a `gateway-budget-config` manifest. Costs are tracked automatically based on model pricing.
```yaml
name: budget-controls
type: gateway-budget-config
rules:
  - id: "team-monthly-budget"
    when:
      subjects: ["team:engineering"]
    limit_to: 5000
    unit: cost_per_month
    budget_applies_per: ["team"]
    alerts:
      thresholds: [75, 90, 100]
      notification_target:
        - type: email
          notification_channel: "budget-alerts"
          to_emails: ["lead@example.com"]
  - id: "user-daily-budget"
    when: {}
    limit_to: 100
    unit: cost_per_day
    budget_applies_per: ["user"]
  - id: "project-daily-budget"
    when:
      metadata:
        environment: "production"
    limit_to: 200
    unit: cost_per_day
    budget_applies_per: ["metadata.project_id"]
```
Units: `cost_per_day` (resets at UTC midnight), `cost_per_week` (resets Monday), `cost_per_month` (resets on the 1st)

`budget_applies_per`: same options as rate limiting — `user`, `model`, `team`, `virtualaccount`, `metadata.<key>`.
Alerts: Configure threshold percentages with email, Slack webhook, or Slack bot notifications.
```shell
tfy apply -f gateway-budget-config.yaml
```
Observability
Request Logging
All gateway requests are logged with:
- Input/output tokens
- Latency (TTFT, total)
- Cost
- Model and provider
- User identity
- Custom metadata
Custom Metadata
Tag requests with custom metadata for tracking:
```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={
        "X-TFY-LOGGING-CONFIG": '{"project": "my-app", "environment": "production"}'
    },
)
```
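Rather than hand-writing the JSON string, it is safer to build the header value with `json.dumps` (a small convenience on our part, not a gateway requirement):

```python
import json

metadata = {"project": "my-app", "environment": "production"}
logging_headers = {"X-TFY-LOGGING-CONFIG": json.dumps(metadata)}
# Pass as extra_headers=logging_headers in chat.completions.create(...)
```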
Analytics
View usage analytics in TrueFoundry dashboard:
- Requests/minute per model
- Tokens/minute per model
- Failures/minute per model
- Cost breakdown by model, user, team
OpenTelemetry Integration
Export traces to your observability stack:
- Prometheus + Grafana
- Datadog
- Custom OTEL collectors
Guardrails
For content filtering, PII detection, prompt injection prevention, and custom safety rules, use the
guardrails skill. It configures guardrail providers and rules that apply to this gateway's traffic.
MCP Gateway Attachment Flow
If a user has already deployed a tool server and wants to attach it to MCP gateway:
- Verify deployment status and endpoint URL via the TrueFoundry dashboard
- Register the endpoint as an MCP server (`mcp-servers` skill)
- Confirm registration ID/name and share how to reference it in policies
Framework Integration
The gateway works with popular AI frameworks:
LangChain
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="openai/gpt-4o",
    api_key="<your-PAT-or-VAT>",
    base_url="https://<your-truefoundry-url>/api/llm",
)
```
LlamaIndex
```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="openai/gpt-4o",
    api_key="<your-PAT-or-VAT>",
    api_base="https://<your-truefoundry-url>/api/llm",
)
```
Cursor / Claude Code / Cline
Configure the gateway as a custom API endpoint in your coding assistant settings:
- Base URL: `{TFY_BASE_URL}/api/llm`
- API Key: Your PAT or VAT
Presenting Gateway Info
When the user asks about gateway configuration:
```
AI Gateway:
  Endpoint: https://your-org.truefoundry.cloud/api/llm
  Auth: Personal Access Token (PAT) or Virtual Access Token (VAT)

Available Models (check dashboard for current list):

| Model Name       | Provider    | Type      |
|------------------|-------------|-----------|
| openai/gpt-4o    | OpenAI      | Cloud     |
| my-gemma-2b      | Self-hosted | vLLM (T4) |
| anthropic/claude | Anthropic   | Cloud     |

Usage:
  export OPENAI_BASE_URL="https://your-org.truefoundry.cloud/api/llm"
  export OPENAI_API_KEY="your-token"
  # Then use any OpenAI-compatible SDK
```
<success_criteria>
Success Criteria
- The user can call LLMs through the gateway endpoint using an OpenAI-compatible SDK or cURL
- The user has a valid authentication token (PAT or VAT) configured for gateway access
- The agent has confirmed the target model name is available in the user's gateway configuration
- The user can verify successful responses from the gateway with correct model output
- The agent has provided working code snippets tailored to the user's language and framework
- Rate limiting, budget controls, or routing are configured if the user requested them
</success_criteria>
<references>
Composability
- Deploy model first: Deploy a self-hosted model (requires TrueFoundry Enterprise), then add to gateway
- Need API key: Create PAT/VAT in TrueFoundry dashboard → Access
- Rate limiting: Configure in dashboard → AI Gateway → Rate Limiting
- Routing config: Apply routing YAML directly with `tfy apply`; for CI/CD pipelines, integrate `tfy apply` into your automation
- Tool servers: Deploy tool servers to your infrastructure, then register in gateway
- Check deployed models: Check the TrueFoundry dashboard to see running model services
- Benchmark through gateway: Use your preferred load-testing tool against gateway endpoints
</references>
<troubleshooting>
Error Handling
401 Unauthorized
Gateway authentication failed. Check:
- API key (PAT or VAT) is valid and not expired
- Using correct header: `Authorization: Bearer <token>`
403 Forbidden
Model access denied. Your token may not have access to this model.
- PATs inherit user permissions
- VATs only have access to explicitly selected models
- Check with your admin to grant model access
429 Rate Limited
Rate limit exceeded. Options:
- Wait and retry (check the `Retry-After` header)
- Request higher limits from admin
- Use load balancing to distribute across providers
502/503 Provider Error
Upstream provider error. The gateway will automatically:
- Retry on configured status codes
- Fall back to alternate models if routing is configured

If persistent, check the provider status page or self-hosted model health.
Model Not Found

Model name not found in gateway. Check:
- Exact model name in TrueFoundry dashboard → AI Gateway → Models
- Provider account is active and model is enabled
- Your token has access to this model
</troubleshooting>