Claude-skill-registry litellm

When calling LLM APIs from Python code. When connecting to llamafile or local LLM servers. When switching between OpenAI/Anthropic/local providers. When implementing retry/fallback logic for LLM calls. When code imports litellm or uses completion() patterns.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/litellm-bbgnsurftech-claude-skills-collec" ~/.claude/skills/majiayu000-claude-skill-registry-litellm && rm -rf "$T"
manifest: skills/data/litellm-bbgnsurftech-claude-skills-collec/SKILL.md
source content

LiteLLM

Unified Python interface for calling 100+ LLM APIs using consistent OpenAI format. Provides standardized exception handling, retry/fallback logic, and cost tracking across multiple providers.

When to Use This Skill

Use this skill when:

  • Integrating with multiple LLM providers through a single interface
  • Routing requests to local llamafile servers using OpenAI-compatible endpoints
  • Implementing retry and fallback logic for LLM calls
  • Building applications requiring consistent error handling across providers
  • Tracking LLM usage costs across different providers
  • Converting between provider-specific APIs and OpenAI format
  • Deploying LLM proxy servers with unified configuration
  • Testing applications against both cloud and local LLM endpoints

Core Capabilities

Provider Support

LiteLLM supports 100+ providers through consistent OpenAI-style API:

  • Cloud Providers: OpenAI, Anthropic, Google, Azure, AWS Bedrock
  • Local Servers: llamafile, Ollama, LocalAI, vLLM
  • Unified Format: All requests use OpenAI message format
  • Exception Mapping: All provider errors map to OpenAI exception types

Key Features

  1. Unified API: Single completion() function for all providers
  2. Exception Handling: All exceptions inherit from OpenAI types
  3. Retry Logic: Built-in retry with configurable attempts
  4. Streaming Support: Sync and async streaming for all providers
  5. Cost Tracking: Automatic usage and cost calculation
  6. Proxy Mode: Deploy centralized LLM gateway
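
Because every provider shares the same call shape, switching providers is just a change of model string. A minimal sketch of that idea; the `completion_kwargs` helper and the cloud model names are illustrative, not part of LiteLLM:

```python
def completion_kwargs(model: str, prompt: str, **overrides) -> dict:
    """Build one OpenAI-format request dict; only the model string varies per provider."""
    kwargs = {
        # e.g. "gpt-4o-mini", "anthropic/claude-3-haiku-20240307",
        # or "llamafile/gemma-3-3b" for a local server
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    kwargs.update(overrides)
    return kwargs

# The same dict feeds litellm.completion(**kwargs) regardless of provider;
# only local servers additionally need an api_base override.
local = completion_kwargs(
    "llamafile/gemma-3-3b", "Hello", api_base="http://localhost:8080/v1"
)
```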

Installation

# Using pip
pip install litellm

# Using uv
uv add litellm

Llamafile Integration

Provider Configuration

All llamafile models MUST use the llamafile/ prefix for routing:

model = "llamafile/mistralai/mistral-7b-instruct-v0.2"
model = "llamafile/gemma-3-3b"

API Base URL

The api_base MUST point to llamafile's OpenAI-compatible endpoint:

api_base = "http://localhost:8080/v1"

Critical Requirements:

  • Include the /v1 suffix
  • Do NOT add endpoint paths like /chat/completions (LiteLLM adds these automatically)
  • Default llamafile port is 8080

Environment Variable Configuration

import os

os.environ["LLAMAFILE_API_BASE"] = "http://localhost:8080/v1"

Basic Usage Patterns

Synchronous Completion

import litellm

response = litellm.completion(
    model="llamafile/mistralai/mistral-7b-instruct-v0.2",
    messages=[{"role": "user", "content": "Summarize this diff"}],
    api_base="http://localhost:8080/v1",
    temperature=0.2,
    max_tokens=80,
)

print(response.choices[0].message.content)

Asynchronous Completion

from litellm import acompletion
import asyncio

async def generate_message():
    response = await acompletion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Write a commit message"}],
        api_base="http://localhost:8080/v1",
        temperature=0.3,
        max_tokens=200,
    )
    return response.choices[0].message.content

result = asyncio.run(generate_message())
print(result)

Async Streaming

from litellm import acompletion
import asyncio

async def stream_response():
    response = await acompletion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        api_base="http://localhost:8080/v1",
        stream=True,
    )

    async for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

asyncio.run(stream_response())

Embeddings

from litellm import embedding
import os

os.environ["LLAMAFILE_API_BASE"] = "http://localhost:8080/v1"

response = embedding(
    model="llamafile/sentence-transformers/all-MiniLM-L6-v2",
    input=["Hello world"],
)

print(response)

Exception Handling

Import Pattern

All exceptions can be imported directly from litellm:

from litellm import (
    BadRequestError,           # 400 errors
    AuthenticationError,       # 401 errors
    NotFoundError,             # 404 errors
    Timeout,                   # 408 errors (alias: openai.APITimeoutError)
    RateLimitError,            # 429 errors
    APIConnectionError,        # 500 errors / connection issues (default)
    ServiceUnavailableError,   # 503 errors
)

Exception Types Reference

Status Code | Exception Type              | Inherits From                | Description
----------- | --------------------------- | ---------------------------- | ---------------------------
400         | BadRequestError             | openai.BadRequestError       | Invalid request
400         | ContextWindowExceededError  | litellm.BadRequestError      | Token limit exceeded
400         | ContentPolicyViolationError | litellm.BadRequestError      | Content policy violation
401         | AuthenticationError         | openai.AuthenticationError   | Auth failure
403         | PermissionDeniedError       | openai.PermissionDeniedError | Permission denied
404         | NotFoundError               | openai.NotFoundError         | Invalid model/endpoint
408         | Timeout                     | openai.APITimeoutError       | Request timeout
429         | RateLimitError              | openai.RateLimitError        | Rate limited
500         | APIConnectionError          | openai.APIConnectionError    | Default for unmapped errors
500         | APIError                    | openai.APIError              | Generic 500 error
503         | ServiceUnavailableError     | openai.APIStatusError        | Service unavailable
>=500       | InternalServerError         | openai.InternalServerError   | Unmapped 500+ errors

Exception Attributes

All LiteLLM exceptions include:

  • status_code: HTTP status code
  • message: Error message
  • llm_provider: Provider that raised the exception

Exception Handling Example

import litellm
import openai

try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
        timeout=30.0,
    )
except openai.APITimeoutError as e:
    # LiteLLM exceptions inherit from OpenAI types
    print(f"Timeout: {e}")
except litellm.APIConnectionError as e:
    print(f"Connection failed: {e.message}")
    print(f"Provider: {e.llm_provider}")

Alternative Import from litellm.exceptions

from litellm.exceptions import BadRequestError, AuthenticationError, APIError

try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
    )
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except BadRequestError as e:
    print(f"Bad request: {e}")
except APIError as e:
    print(f"API error: {e}")

Checking If Exception Should Retry

import litellm

try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
    )
except Exception as e:
    if hasattr(e, 'status_code'):
        # _should_retry is an internal LiteLLM helper; it may change between versions
        should_retry = litellm._should_retry(e.status_code)
        print(f"Should retry: {should_retry}")

Retry and Fallback Configuration

from litellm import completion

response = completion(
    model="llamafile/gemma-3-3b",
    messages=[{"role": "user", "content": "Hello"}],
    api_base="http://localhost:8080/v1",
    num_retries=3,      # Retry 3 times on failure
    timeout=30.0,       # 30 second timeout
)
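
When backoff between attempts matters, a manual loop over the mapped status codes can complement num_retries. This is a sketch: call stands in for any zero-argument wrapper around litellm.completion, and the set of retryable status codes is an assumption to tune for your deployment:

```python
import time

def call_with_backoff(call, max_attempts=3, base_delay=0.5,
                      retryable=(408, 429, 500, 503)):
    """Retry call() on retryable HTTP status codes with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # LiteLLM's mapped exceptions carry a status_code attribute
            status = getattr(exc, "status_code", None)
            if status not in retryable or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Usage would look like `call_with_backoff(lambda: litellm.completion(model=..., messages=..., api_base=...))`.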

Proxy Server Configuration

For proxy deployments, use config.yaml:

model_list:
  - model_name: commit-polish-model
    litellm_params:
      model: llamafile/gemma-3-3b          # add llamafile/ prefix
      api_base: http://localhost:8080/v1   # add api base for OpenAI compatible provider

Application Integration Patterns

Connection Verification Pattern

import litellm
from litellm import APIConnectionError

def verify_llamafile_connection(api_base: str = "http://localhost:8080/v1") -> bool:
    """Check if llamafile server is running."""
    try:
        litellm.completion(
            model="llamafile/test",
            messages=[{"role": "user", "content": "test"}],
            api_base=api_base,
            max_tokens=1,
        )
        return True
    except APIConnectionError:
        return False

Async Service Pattern

import litellm
from litellm import acompletion, APIConnectionError
import asyncio

class AIService:
    """LiteLLM wrapper with llamafile routing."""

    def __init__(self, model: str, api_base: str, temperature: float = 0.3, max_tokens: int = 200):
        self.model = model
        self.api_base = api_base
        self.temperature = temperature
        self.max_tokens = max_tokens

    async def generate_commit_message(self, diff: str, system_prompt: str) -> str:
        """Generate a commit message using the LLM."""
        try:
            response = await acompletion(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": f"Generate a commit message for this diff:\n\n{diff}"},
                ],
                api_base=self.api_base,
                temperature=self.temperature,
                max_tokens=self.max_tokens,
            )
            return response.choices[0].message.content.strip()
        except APIConnectionError as e:
            raise RuntimeError(f"Failed to connect to llamafile server at {self.api_base}: {e.message}") from e

Common Pitfalls to Avoid

  1. Missing llamafile/ prefix: Without the prefix, LiteLLM won't route to the OpenAI-compatible endpoint
  2. Wrong port: llamafile uses 8080 by default, not 8000
  3. Missing /v1 suffix: The API base must end with /v1
  4. Adding extra path segments: Do NOT use http://localhost:8080/v1/chat/completions - LiteLLM appends the endpoint path automatically
  5. API key requirement: No API key is needed for a local llamafile server (use an empty string or any placeholder value if validation requires one)
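
The pitfalls above can be encoded as a pre-flight sanity check. A hypothetical helper, not part of LiteLLM:

```python
def check_llamafile_config(model: str, api_base: str) -> list[str]:
    """Return a list of configuration problems; an empty list means the pair looks sane."""
    problems = []
    if not model.startswith("llamafile/"):
        problems.append("model is missing the llamafile/ prefix")
    if "/chat/completions" in api_base:
        problems.append("api_base must not include /chat/completions")
    elif not api_base.rstrip("/").endswith("/v1"):
        problems.append("api_base should end with /v1")
    return problems

# A correct pair produces no complaints:
assert check_llamafile_config(
    "llamafile/gemma-3-3b", "http://localhost:8080/v1"
) == []
```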

Configuration Examples

TOML Configuration

# ~/.config/commit-polish/config.toml
[ai]
model = "llamafile/gemma-3-3b"  # MUST have llamafile/ prefix
temperature = 0.3
max_tokens = 200

Environment Variables

export LLAMAFILE_API_BASE="http://localhost:8080/v1"
export LITELLM_LOG="INFO"  # Enable LiteLLM debug logging

Related Skills

For comprehensive documentation on related tools:

  • llamafile: Activate the llamafile skill using Skill(command: "llamafile") for llamafile server setup, model management, and local LLM deployment patterns
  • uv: Activate the uv skill using Skill(command: "uv") for Python project management, dependency handling, and virtual environment workflows

References


Version Information

  • Documentation verified against: LiteLLM GitHub repository (main branch, accessed 2025-01-15)
  • Python: 3.11+
  • Llamafile: 0.9.3+