Skilllibrary model-routing
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/model-routing" ~/.claude/skills/merceralex397-collab-skilllibrary-model-routing && rm -rf "$T"
manifest:
11-ai-llm-runtime-and-integration/model-routing/SKILL.md
Purpose
Route LLM requests to appropriate models based on task complexity, cost constraints, and latency requirements.
When to use this skill
- building a router that dispatches to small vs. large models based on query complexity
- implementing fallback chains (try cheap model first, escalate on failure)
- optimizing LLM costs by routing simple queries to smaller/cheaper models
- adding model routing to an existing agent or API gateway
Do not use this skill when
- choosing a single model for a project — prefer model-selection
- deploying inference infrastructure — prefer inference-serving
- managing context windows — prefer context-management-memory
Procedure
- Define routing tiers — cheap/fast (Haiku, GPT-4o-mini), standard (Sonnet, GPT-4o), premium (Opus, o1).
- Classify request complexity — use heuristics: token count, keyword signals (code, math, creative), or a small classifier model.
- Implement router — check complexity score against thresholds. Route to cheapest tier that can handle the task.
- Add fallback chain — if the cheap model fails (low confidence, refusal, malformed output), retry with the next tier up (see the sketch after this list).
- Set cost budgets — track per-request cost. Alert when daily/monthly spend approaches limits.
- Cache responses — hash (model + prompt) for deterministic requests. Serve from cache before routing.
- Monitor quality — log model used, latency, and output quality score per request. Detect tier-mismatch patterns.
- Tune thresholds — adjust complexity thresholds weekly based on quality and cost data.
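Step 4 in practice is a loop over the same tier table. A minimal sketch, assuming a hypothetical `call_model(model, request)` provider wrapper and an `is_acceptable` output check (both are stand-ins for your SDK and validators; a concrete validator sketch appears under Decision rules):

```python
def route_with_fallback(router, request, call_model, is_acceptable):
    """Escalate through tiers when a cheaper model's output fails validation."""
    order = ["fast", "standard", "premium"]
    models = [router.TIERS[t]["model"] for t in order]
    # Start at the tier the complexity classifier picked, then walk upward.
    for model in models[models.index(router.route(request)):]:
        response = call_model(model, request)   # assumed provider wrapper
        if is_acceptable(response):             # e.g. parses, no refusal markers
            return model, response
    return models[-1], response  # all tiers failed; surface the premium attempt
```

Escalation happens only after an observed failure, which keeps spend aligned with the first decision rule below.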
Routing architecture
```
Request --> Classifier --> Complexity Score
                               |
          +--------------------+--------------------+
          |                    |                    |
     Low (<0.3)          Med (0.3-0.7)         High (>0.7)
          |                    |                    |
     Haiku/Mini            Sonnet/4o             Opus/o1
          |                    |                    |
      Response              Response             Response
                               |
                 Confidence < threshold?
                               |
                 Escalate to next tier
```
Key patterns
```python
class ModelRouter:
    TIERS = {
        "fast":     {"model": "claude-haiku",  "max_complexity": 0.3, "cost_per_1k": 0.0003},
        "standard": {"model": "claude-sonnet", "max_complexity": 0.7, "cost_per_1k": 0.003},
        "premium":  {"model": "claude-opus",   "max_complexity": 1.0, "cost_per_1k": 0.015},
    }

    def route(self, request):
        score = self.classify_complexity(request)
        for tier in ["fast", "standard", "premium"]:
            if score <= self.TIERS[tier]["max_complexity"]:
                return self.TIERS[tier]["model"]
        return self.TIERS["premium"]["model"]

    def classify_complexity(self, request):
        # Heuristics: length, code presence, reasoning keywords
        text = request["content"]
        score = min(len(text) / 2000, 1.0)  # length signal
        if any(kw in text for kw in ["explain", "analyze", "compare"]):
            score += 0.3
        return min(score, 1.0)
```
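For example (a sketch assuming the class above; the scores follow from its length and keyword heuristics, and the model names are the shorthand from the tier table):

```python
router = ModelRouter()

# Short lookup: length signal ~0.02, no reasoning keywords -> fast tier.
print(router.route({"content": "What is the capital of France?"}))  # claude-haiku

# Reasoning keyword ("compare") adds 0.3 -> lands in the standard tier.
req = {"content": "Please compare event sourcing and CRUD for an audit-heavy system."}
print(router.route(req))                                            # claude-sonnet
```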
Decision rules
- Default to the cheapest model that can handle the task — escalate on failure, not preemptively.
- Use output validation to detect when a cheap model fails — JSON schema check, confidence score, refusal detection (sketched after this list).
- Cache identical requests — many applications send repeated or near-identical prompts (cache sketch below).
- Log every routing decision with model, latency, cost, and quality — you cannot optimize without data (logging sketch below).
- Re-evaluate thresholds monthly — model capabilities and pricing change frequently.
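For JSON-producing endpoints, the validation rule can be a plain function. A sketch only; the refusal markers and required keys here are illustrative, not a fixed list:

```python
import json

REFUSAL_MARKERS = ["i can't", "i cannot", "i'm unable", "as an ai"]  # illustrative

def is_acceptable(text, required_keys=("answer",)):
    """Heuristic check that a cheap model's output is usable before accepting it."""
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return False                  # refusal detection
    try:
        payload = json.loads(text)    # structured-output check
    except json.JSONDecodeError:
        return False                  # malformed output -> escalate
    return all(key in payload for key in required_keys)  # schema-ish key check
```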
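For the cache rule, key on a hash of the model plus the exact prompt, and only cache deterministic requests (e.g. temperature 0). A minimal in-process sketch; a shared store such as Redis would replace the dict in production, and `call_model` is again an assumed provider wrapper:

```python
import hashlib

_cache = {}

def cache_key(model, prompt):
    # Identical (model, prompt) pairs map to the same key; any change busts it.
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def cached_call(model, prompt, call_model, deterministic=True):
    """Serve repeated deterministic requests from cache before routing."""
    if not deterministic:
        return call_model(model, prompt)  # sampled output varies; don't cache it
    key = cache_key(model, prompt)
    if key not in _cache:
        _cache[key] = call_model(model, prompt)
    return _cache[key]
```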
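The logging rule reduces to one record per request plus a running spend check. A sketch with an in-memory list and an assumed daily budget; production systems would emit structured logs or metrics instead:

```python
import time

DAILY_BUDGET_USD = 50.0  # assumed limit, not prescribed by this skill
decision_log = []

def log_decision(model, latency_ms, cost_usd, quality_score):
    """Record every routing decision; this data drives threshold tuning."""
    decision_log.append({
        "ts": time.time(), "model": model,
        "latency_ms": latency_ms, "cost": cost_usd, "quality": quality_score,
    })
    cutoff = time.time() - 86400
    spend_today = sum(d["cost"] for d in decision_log if d["ts"] > cutoff)
    if spend_today > 0.8 * DAILY_BUDGET_USD:
        print(f"WARN: daily spend ${spend_today:.2f} nearing ${DAILY_BUDGET_USD} budget")
```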
References
Related skills
- model-selection — choosing models for a project
- inference-serving — hosting the models being routed to
- context-management-memory — managing context per model tier