Claude-code-plugins-plus-skills openrouter-caching-strategy

install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/openrouter-pack/skills/openrouter-caching-strategy" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-openrouter-caching-strategy && rm -rf "$T"
manifest: plugins/saas-packs/openrouter-pack/skills/openrouter-caching-strategy/SKILL.md

OpenRouter Caching Strategy

Overview

OpenRouter charges per token, so serving repeated requests from a cache can cut costs substantially. Requests sent with temperature=0 and the same model, messages, and parameters usually produce the same output, which makes them safe to cache. Providers do not guarantee bit-identical results, so treat a cached response as a representative answer rather than an exact replay. This skill covers in-memory caching, persistent caching with a TTL, and Anthropic prompt caching via OpenRouter.

In-Memory Cache

import os, hashlib, json, time
from typing import Optional
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    default_headers={"HTTP-Referer": "https://my-app.com", "X-Title": "my-app"},
)

class LLMCache:
    def __init__(self, ttl_seconds: int = 3600):
        self._cache: dict[str, tuple[dict, float]] = {}
        self._ttl = ttl_seconds
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, messages: list, **kwargs) -> str:
        blob = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def get(self, model: str, messages: list, **kwargs) -> Optional[dict]:
        k = self._key(model, messages, **kwargs)
        if k in self._cache:
            data, ts = self._cache[k]
            if time.time() - ts < self._ttl:
                self.hits += 1
                return data
            del self._cache[k]
        self.misses += 1
        return None

    def set(self, model: str, messages: list, response: dict, **kwargs):
        k = self._key(model, messages, **kwargs)
        self._cache[k] = (response, time.time())

cache = LLMCache(ttl_seconds=1800)

def cached_completion(messages, model="anthropic/claude-3.5-sonnet", **kwargs):
    """Cache only deterministic requests (temperature=0); send others straight to the API."""
    kwargs.setdefault("temperature", 0)
    kwargs.setdefault("max_tokens", 1024)
    cacheable = kwargs["temperature"] == 0  # a caller may override the default

    if cacheable:
        cached = cache.get(model, messages, **kwargs)
        if cached is not None:
            return cached

    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    result = {
        "content": response.choices[0].message.content,
        "model": response.model,
        "usage": {"prompt": response.usage.prompt_tokens, "completion": response.usage.completion_tokens},
    }
    if cacheable:
        cache.set(model, messages, result, **kwargs)
    return result
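
The LLMCache above only drops an expired entry when that same key is read again, so a long-running process with many distinct prompts can grow without bound. One possible remedy, sketched here as an assumption rather than as part of the original skill, is to cap the entry count and evict the least recently used key. The class name BoundedTTLCache and the max_entries parameter are hypothetical; the `now` argument exists only to make the behaviour easy to test.

```python
import time
from collections import OrderedDict
from typing import Any, Optional


class BoundedTTLCache:
    """TTL cache that also caps the number of entries (LRU eviction)."""

    def __init__(self, max_entries: int = 1000, ttl_seconds: float = 3600):
        self._data: OrderedDict[str, tuple[Any, float]] = OrderedDict()
        self._max = max_entries
        self._ttl = ttl_seconds

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        now = time.time() if now is None else now
        item = self._data.get(key)
        if item is None:
            return None
        value, stored_at = item
        if now - stored_at >= self._ttl:
            del self._data[key]  # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._data[key] = (value, now)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._data)
```

The key passed in would be the SHA-256 digest that LLMCache._key already produces, so this could replace the plain dict without changing how keys are built.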

Persistent Cache with Redis

import redis, json, hashlib

# Reuses the OpenRouter `client` defined in the in-memory example above.
r = redis.Redis(host="localhost", port=6379, db=0)

def redis_cached_completion(messages, model="openai/gpt-4o-mini", ttl=3600, **kwargs):
    """Cache in Redis with automatic TTL expiry."""
    kwargs["temperature"] = 0  # Must be deterministic
    key = f"or:{hashlib.sha256(json.dumps({'m': model, 'msgs': messages, **kwargs}, sort_keys=True).encode()).hexdigest()}"

    cached = r.get(key)
    if cached:
        return json.loads(cached)

    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    result = {
        "content": response.choices[0].message.content,
        "model": response.model,
        "tokens": response.usage.prompt_tokens + response.usage.completion_tokens,
    }
    r.setex(key, ttl, json.dumps(result))
    return result

Anthropic Prompt Caching via OpenRouter

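Anthropic models on OpenRouter support prompt caching: content blocks marked with cache_control are cached server-side, and on a cache hit those input tokens are billed at about 0.1x the normal input price (a 90% reduction). The request that first writes the cache is billed at 1.25x for the cached tokens, so the saving only appears from the second call onward. The ephemeral cache is short-lived (about five minutes by default on Anthropic's side, refreshed on each hit), so it pays off for prompts that are reused in quick succession.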

# Mark large static content blocks with cache_control.
# large_context is assumed to be a str holding the static material, defined elsewhere.
response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert. Here is the full source:\n" + large_context,
                    "cache_control": {"type": "ephemeral"},  # Cache this block
                }
            ],
        },
        {"role": "user", "content": "What does the main() function do?"},
    ],
    max_tokens=1024,
)
# First call: cache_creation_input_tokens charged at 1.25x
# Subsequent: cache_read_input_tokens charged at 0.1x (90% savings)
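
To see when the 1.25x write surcharge pays for itself, the multipliers in the comments above can be turned into a small calculation. This is a rough sketch: the function name relative_input_cost is hypothetical, costs are expressed in units of "one uncached pass over the static prefix", and it assumes every call after the first lands inside the cache lifetime and hits the cache.

```python
def relative_input_cost(
    calls: int, write_mult: float = 1.25, read_mult: float = 0.1
) -> tuple[float, float]:
    """Return (uncached, cached) cost of the static prefix over `calls` requests.

    Costs are relative: 1.0 means one uncached pass over the prefix.
    Assumes the first call writes the cache and every later call reads it.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    uncached = float(calls)
    cached = write_mult + read_mult * (calls - 1)
    return uncached, cached


if __name__ == "__main__":
    for n in (1, 2, 10):
        uncached, cached = relative_input_cost(n)
        print(f"{n} calls: uncached={uncached:.2f}, cached={cached:.2f}")
```

With these multipliers a single call costs more with caching (1.25 versus 1.0), two calls already cost less (1.35 versus 2.0), and ten calls cost 1.25 + 0.1 × 9 = 2.15 versus 10.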

Cache Key Design

import hashlib, json

def cache_key(model: str, messages: list, **params) -> str:
    """Deterministic cache key. Include everything that affects output.

    Include: model ID (with variant like :floor), messages, temperature,
    max_tokens, top_p, transforms, provider routing.
    Exclude: stream (doesn't affect content), HTTP-Referer, X-Title.
    """
    canonical = json.dumps({
        "model": model, "messages": messages,
        "temperature": params.get("temperature", 0),
        "max_tokens": params.get("max_tokens"),
        "top_p": params.get("top_p"),
        # OpenRouter-specific options, present only if the caller passes them
        "transforms": params.get("transforms"),
        "provider": params.get("provider"),
    }, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

Cache Invalidation

| Trigger | Action | Why |
| --- | --- | --- |
| Model version update | Flush keys for that model | New version may give different outputs |
| System prompt change | Flush all keys | Output semantics changed |
| TTL expiry | Automatic eviction | Prevents stale data |
| Manual purge | r.delete(key) or clear by prefix | Debugging or policy change |
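
Flushing "keys for that model" or clearing "by prefix" only works if the model is visible in the key, which a bare SHA-256 digest hides. One way to arrange that, offered as a sketch under assumptions, is to put the model ID and a prompt version in front of the digest. The function name versioned_key, the prompt_version argument, and the or:model:version:digest layout are all hypothetical.

```python
import hashlib
import json


def versioned_key(model: str, messages: list, prompt_version: str, **params) -> str:
    """Build a key shaped like 'or:<model>:<prompt_version>:<sha256>'.

    The readable prefix allows per-model flushes; bumping prompt_version
    makes every older entry unreachable without deleting anything.
    """
    blob = json.dumps({"messages": messages, **params}, sort_keys=True)
    digest = hashlib.sha256(blob.encode()).hexdigest()
    return f"or:{model}:{prompt_version}:{digest}"


if __name__ == "__main__":
    msgs = [{"role": "user", "content": "hi"}]
    print(versioned_key("openai/gpt-4o-mini", msgs, "v1", temperature=0))
```

With redis-py, the entries for one model could then be enumerated with r.scan_iter(match="or:openai/gpt-4o-mini:*") and deleted; entries orphaned by a version bump simply age out through their TTL.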

Error Handling

| Error | Cause | Fix |
| --- | --- | --- |
| Stale cache response | TTL too long | Reduce TTL or version cache keys |
| Cache miss storm | Cold start or invalidation | Warm cache with common queries at deploy |
| Redis connection error | Redis down | Fall through to direct API call |
| Non-deterministic cache | temperature > 0 cached | Only cache when temperature=0 |
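
The "fall through to direct API call" fix can be sketched as a wrapper that treats any cache failure as a miss. This is an illustration under assumptions, not the skill's own code: completion_with_fallback, store, call_api, and cache_errors are hypothetical names, and store is any object with Redis-style get and setex methods. The default catches every Exception; with redis-py one would more likely pass (redis.exceptions.RedisError,).

```python
import json
from typing import Any, Callable


def completion_with_fallback(
    key: str,
    store: Any,
    call_api: Callable[[], dict],
    ttl: int = 3600,
    cache_errors: tuple = (Exception,),
) -> dict:
    """Read through the cache, but never let a cache outage fail the request."""
    try:
        cached = store.get(key)
        if cached is not None:
            return json.loads(cached)
    except cache_errors:
        pass  # cache unavailable: treat it as a miss

    result = call_api()  # the real OpenRouter request goes here

    try:
        store.setex(key, ttl, json.dumps(result))
    except cache_errors:
        pass  # best effort: the response is still returned
    return result
```

The trade-off is that a Redis outage silently turns every request into a paid API call, so it is worth logging or counting the swallowed errors in a real deployment.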

Enterprise Considerations

  • Only cache deterministic requests (temperature=0) -- non-zero temperatures produce different outputs each time
  • Use Anthropic prompt caching for large system prompts (RAG context) -- 90% cost reduction on cache hits
  • Set TTL based on content freshness needs (30 min for dynamic, 24h for reference data)
  • Track cache hit rate to justify caching infrastructure cost
  • Use Redis or Memcached for multi-instance deployments; in-memory only works for single-process
  • Version cache keys when updating system prompts or switching model versions
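
To make "track cache hit rate" concrete, the hits and misses counters that LLMCache already keeps can be turned into a hit rate and a rough savings figure. A minimal sketch: the function name cache_savings is hypothetical, and avg_cost_per_call is a number you would supply from your own billing data, not something OpenRouter reports in this form.

```python
def cache_savings(hits: int, misses: int, avg_cost_per_call: float) -> dict:
    """Summarize cache effectiveness from hit/miss counters.

    avg_cost_per_call is an assumed average price of one uncached request.
    """
    total = hits + misses
    hit_rate = hits / total if total else 0.0
    return {
        "hit_rate": hit_rate,
        "saved": hits * avg_cost_per_call,    # calls the cache absorbed
        "spent": misses * avg_cost_per_call,  # calls that reached the API
    }


if __name__ == "__main__":
    print(cache_savings(hits=75, misses=25, avg_cost_per_call=0.002))
```

Comparing the "saved" figure against what the Redis or Memcached deployment costs over the same period gives a direct answer to whether the caching layer is paying for itself.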
