# claude-code-plugins-plus-skills · openrouter-caching-strategy

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
```

Claude Code · Install into ~/.claude/skills/:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/openrouter-pack/skills/openrouter-caching-strategy" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-openrouter-caching-strategy && rm -rf "$T"
```

Manifest: plugins/saas-packs/openrouter-pack/skills/openrouter-caching-strategy/SKILL.md
# OpenRouter Caching Strategy

## Overview

OpenRouter charges per token, so caching identical or similar requests can dramatically cut costs. Deterministic requests (`temperature=0`) with the same model and messages produce identical outputs -- these are safe to cache. This skill covers in-memory caching, persistent caching with TTL, and Anthropic prompt caching via OpenRouter.
## In-Memory Cache

```python
import os, hashlib, json, time
from typing import Optional

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    default_headers={"HTTP-Referer": "https://my-app.com", "X-Title": "my-app"},
)

class LLMCache:
    def __init__(self, ttl_seconds: int = 3600):
        self._cache: dict[str, tuple[dict, float]] = {}
        self._ttl = ttl_seconds
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, messages: list, **kwargs) -> str:
        blob = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def get(self, model: str, messages: list, **kwargs) -> Optional[dict]:
        k = self._key(model, messages, **kwargs)
        if k in self._cache:
            data, ts = self._cache[k]
            if time.time() - ts < self._ttl:
                self.hits += 1
                return data
            del self._cache[k]  # expired entry: evict and fall through to a miss
        self.misses += 1
        return None

    def set(self, model: str, messages: list, response: dict, **kwargs):
        k = self._key(model, messages, **kwargs)
        self._cache[k] = (response, time.time())

cache = LLMCache(ttl_seconds=1800)

def cached_completion(messages, model="anthropic/claude-3.5-sonnet", **kwargs):
    """Only cache deterministic requests (temperature=0)."""
    kwargs.setdefault("temperature", 0)
    kwargs.setdefault("max_tokens", 1024)
    cached = cache.get(model, messages, **kwargs)
    if cached:
        return cached
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    result = {
        "content": response.choices[0].message.content,
        "model": response.model,
        "usage": {"prompt": response.usage.prompt_tokens, "completion": response.usage.completion_tokens},
    }
    cache.set(model, messages, result, **kwargs)
    return result
```
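A quick usage sketch (the prompt string is a placeholder): the second identical call is served from memory, which the hit counters confirm.

```python
msgs = [{"role": "user", "content": "Summarize RFC 2119 in one sentence."}]
first = cached_completion(msgs)   # miss: calls the API, then stores the result
second = cached_completion(msgs)  # hit: returned straight from memory, no API call
print(cache.hits, cache.misses, first == second)  # 1 1 True
```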
## Persistent Cache with Redis

```python
import redis, json, hashlib

r = redis.Redis(host="localhost", port=6379, db=0)

def redis_cached_completion(messages, model="openai/gpt-4o-mini", ttl=3600, **kwargs):
    """Cache in Redis with automatic TTL expiry. Reuses the OpenAI client configured above."""
    kwargs["temperature"] = 0  # Must be deterministic
    blob = json.dumps({"m": model, "msgs": messages, **kwargs}, sort_keys=True)
    key = f"or:{hashlib.sha256(blob.encode()).hexdigest()}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    result = {
        "content": response.choices[0].message.content,
        "model": response.model,
        "tokens": response.usage.prompt_tokens + response.usage.completion_tokens,
    }
    r.setex(key, ttl, json.dumps(result))  # SETEX stores the value with a TTL in one call
    return result
```
## Anthropic Prompt Caching via OpenRouter

Anthropic models on OpenRouter support prompt caching -- large system prompts are cached server-side, reducing input cost by 90% on cache hits.

```python
# Mark large static content blocks with cache_control.
# large_context: your big static document (placeholder)
response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert. Here is the full source:\n" + large_context,
                    "cache_control": {"type": "ephemeral"},  # Cache this block
                }
            ],
        },
        {"role": "user", "content": "What does the main() function do?"},
    ],
    max_tokens=1024,
)
# First call: cache_creation_input_tokens charged at 1.25x
# Subsequent: cache_read_input_tokens charged at 0.1x (90% savings)
```
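To see the payoff concretely, a back-of-the-envelope sketch: the 10,000-token prefix and $3-per-million-token input price below are illustrative assumptions, while the 1.25x/0.1x multipliers come from the comments above.

```python
# Illustrative only: 10,000-token static prefix, assumed $3.00 per 1M input tokens.
PRICE_PER_MTOK = 3.00
prefix_tokens = 10_000

first_call = prefix_tokens / 1e6 * PRICE_PER_MTOK * 1.25  # cache write: $0.0375
cache_hit = prefix_tokens / 1e6 * PRICE_PER_MTOK * 0.10   # cache read:  $0.0030
uncached = prefix_tokens / 1e6 * PRICE_PER_MTOK           # no caching:  $0.0300
# The 25% write premium is repaid on the very first hit; every later hit saves ~90%.
```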
## Cache Key Design

```python
def cache_key(model: str, messages: list, **params) -> str:
    """Deterministic cache key. Include everything that affects output.

    Include: model ID (with variant like :floor), messages, temperature,
    max_tokens, top_p, transforms, provider routing.
    Exclude: stream (doesn't affect content), HTTP-Referer, X-Title.
    """
    canonical = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": params.get("temperature", 0),
        "max_tokens": params.get("max_tokens"),
        "top_p": params.get("top_p"),
        "transforms": params.get("transforms"),
        "provider": params.get("provider"),
    }, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```
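As a quick check of the variant rule, two requests that differ only in a routing suffix land in different cache slots (the message list here is a stand-in):

```python
msgs = [{"role": "user", "content": "ping"}]
k1 = cache_key("anthropic/claude-3.5-sonnet", msgs, temperature=0)
k2 = cache_key("anthropic/claude-3.5-sonnet:floor", msgs, temperature=0)
assert k1 != k2  # :floor changes provider routing, which can change the output
```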
## Cache Invalidation
| Trigger | Action | Why |
|---|---|---|
| Model version update | Flush keys for that model | New version may give different outputs |
| System prompt change | Flush all keys | Output semantics changed |
| TTL expiry | Automatic eviction | Prevents stale data |
| Manual purge | Delete affected keys or clear by prefix | Debugging or policy change |
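The "clear by prefix" action can be sketched against the Redis cache above. The `or:` prefix matches the key scheme used earlier; note that scheme hashes the model into the digest, so per-model flushes would need the model name in the key prefix (an assumption beyond the original layout). `scan_iter` walks the keyspace incrementally instead of blocking the server the way `KEYS` would.

```python
def purge_prefix(prefix: str = "or:") -> int:
    """Delete every cached completion whose key starts with `prefix`."""
    deleted = 0
    for key in r.scan_iter(match=f"{prefix}*", count=500):
        r.delete(key)
        deleted += 1
    return deleted
```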
## Error Handling
| Error | Cause | Fix |
|---|---|---|
| Stale cache response | TTL too long | Reduce TTL or version cache keys |
| Cache miss storm | Cold start or invalidation | Warm cache with common queries at deploy |
| Redis connection error | Redis down | Fall through to direct API call |
| Non-deterministic cache hits | Sampled (`temperature>0`) responses were cached | Only cache when `temperature=0` |
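The "fall through to direct API call" fix can be a thin wrapper around the Redis helper above -- a minimal sketch, assuming redis-py's top-level `redis.ConnectionError`:

```python
def safe_cached_completion(messages, model="openai/gpt-4o-mini", **kwargs):
    """Serve from the Redis cache when possible; degrade to a direct call if Redis is down."""
    try:
        return redis_cached_completion(messages, model=model, **kwargs)
    except redis.ConnectionError:
        # Cache unavailable: pay full price for this request rather than failing it
        kwargs.pop("ttl", None)  # ttl is a cache knob, not an API parameter
        kwargs.setdefault("temperature", 0)
        response = client.chat.completions.create(model=model, messages=messages, **kwargs)
        return {"content": response.choices[0].message.content, "model": response.model}
```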
## Enterprise Considerations

- Only cache deterministic requests (`temperature=0`) -- non-zero temperatures produce different outputs each time
- Use Anthropic prompt caching for large system prompts (RAG context) -- 90% cost reduction on cache hits
- Set TTL based on content freshness needs (30 min for dynamic, 24h for reference data)
- Track cache hit rate to justify caching infrastructure cost
- Use Redis or Memcached for multi-instance deployments; in-memory only works for single-process
- Version cache keys when updating system prompts or switching model versions (see the sketch below)
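A minimal sketch of that versioning, building on `cache_key` from the Cache Key Design section (the `PROMPT_VERSION` constant and the key layout are illustrative assumptions):

```python
PROMPT_VERSION = "v3"  # bump whenever the system prompt or pinned model changes

def versioned_key(model: str, messages: list, **params) -> str:
    """Old-version entries are never read again and simply age out via TTL."""
    return f"or:{PROMPT_VERSION}:{model}:{cache_key(model, messages, **params)}"
```

Because the version and model now sit in the key prefix, the prefix purge from the invalidation section can target exactly the stale slice instead of flushing everything.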