skilllibrary · model-selection
Install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/model-selection" ~/.claude/skills/merceralex397-collab-skilllibrary-model-selection && rm -rf "$T"
manifest:
11-ai-llm-runtime-and-integration/model-selection/SKILL.md
Purpose
Choose the right LLM for a task based on capability benchmarks, cost, latency, context window, and deployment constraints.
When to use this skill
- evaluating which model to use for a new AI feature
- comparing model capabilities across providers (OpenAI, Anthropic, Google, open-source)
- deciding between API-hosted and self-hosted models
- sizing context window and throughput requirements
Do not use this skill when
- deploying inference servers — prefer inference-serving
- building model routing logic — prefer model-routing
- quantizing or compiling models — prefer llama-cpp
Procedure
- Define requirements — list task type (chat, code, reasoning, classification), required context window, latency target, cost budget.
- Shortlist candidates — filter by context window and capability. Include at least one from each category: API-hosted, open-source.
- Check benchmarks — compare on relevant benchmarks (MMLU, HumanEval, MATH, MT-Bench). Use LMSYS Chatbot Arena for subjective quality.
- Estimate cost — calculate monthly cost: (requests/day * avg_tokens * price_per_token * 30). Include both input and output tokens.
- Test on your data — run 50-100 representative examples through each candidate and score them by task-specific metrics (accuracy, format compliance); see the evaluation sketch after this list.
- Evaluate latency — measure TTFT (time to first token) and tokens/sec. API models: test from your deployment region. Self-hosted: benchmark on your hardware.
- Check constraints — data residency, content policies, rate limits, SLAs. Some use cases require self-hosted for compliance.
- Select and document — choose model, document rationale, set fallback model, plan re-evaluation schedule.
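
A minimal sketch of steps 5-6, assuming an OpenAI-compatible chat API via the official `openai` Python client. The `eval_set.jsonl` path, the model names, and the exact-match `score` function are placeholders for your own dataset, shortlist, and task metric; tokens/sec is approximated from whitespace-split words.

```python
# Evaluation harness sketch for steps 5-6: quality on your data plus TTFT and throughput.
# Assumes an OpenAI-compatible chat API; dataset, models, and scoring are placeholders.
import json
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY; set base_url for other providers

def score(output: str, expected: str) -> int:
    """Placeholder metric: exact-match containment. Replace with your task-specific check."""
    return int(expected.strip().lower() in output.lower())

def run_candidate(model: str, examples: list[dict]) -> dict:
    """Stream each example through one model, recording accuracy, TTFT, and output rate."""
    correct, ttfts, rates = 0, [], []
    for ex in examples:  # 50-100 representative examples
        start = time.monotonic()
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": ex["prompt"]}],
            stream=True,
        )
        first_token_at, chunks = None, []
        for chunk in stream:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content or ""
            if delta and first_token_at is None:
                first_token_at = time.monotonic()  # time to first token
            chunks.append(delta)
        output = "".join(chunks)
        elapsed = time.monotonic() - start
        ttfts.append((first_token_at or time.monotonic()) - start)
        rates.append(len(output.split()) / max(elapsed, 1e-9))  # rough words/sec proxy
        correct += score(output, ex["expected"])
    n = len(examples)
    return {"model": model, "accuracy": correct / n,
            "ttft_p50_s": sorted(ttfts)[n // 2], "rate_mean": sum(rates) / n}

if __name__ == "__main__":
    examples = [json.loads(line) for line in open("eval_set.jsonl")]  # {"prompt", "expected"}
    for model in ["gpt-4o-mini", "gpt-4o"]:  # your shortlist from step 2
        print(run_candidate(model, examples))
```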
Model landscape (2024-2025)
| Category | Models | Best for |
|---|---|---|
| Premium reasoning | Claude Opus, o1/o3, Gemini Pro | Complex analysis, math, coding |
| Standard | Claude Sonnet, GPT-4o, Gemini Flash | General tasks, good cost/quality |
| Fast/cheap | Claude Haiku, GPT-4o-mini | Classification, extraction, routing |
| Open-source large | Llama 3.1 70B, Mixtral 8x22B | Self-hosted, data-sensitive |
| Open-source small | Llama 3.1 8B, Phi-3, Qwen2 | Edge, CPU, fine-tuning |
| Code | Claude Sonnet, GPT-4o, DeepSeek | Code generation, review |
| Embedding | text-embedding-3, Cohere embed-v3 | Search, RAG |
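
For step 2 of the procedure, a small sketch of turning requirements into a shortlist by filtering on hard constraints. The catalog entries, context windows, and prices below are illustrative placeholders (prices taken from the cost example further down); verify current figures against provider documentation.

```python
# Shortlisting sketch for step 2: keep only candidates that satisfy hard requirements.
# Entries, context windows, and prices are illustrative; check provider docs for current values.
CANDIDATES = [
    {"name": "gpt-4o-mini",   "ctx": 128_000, "in_per_m": 0.15, "hosted": "api"},
    {"name": "gpt-4o",        "ctx": 128_000, "in_per_m": 2.50, "hosted": "api"},
    {"name": "claude-sonnet", "ctx": 200_000, "in_per_m": 3.00, "hosted": "api"},
    {"name": "llama-3.1-70b", "ctx": 128_000, "in_per_m": None, "hosted": "self"},  # self-hosted
]

def shortlist(min_ctx: int, max_in_per_m: float | None = None) -> list[str]:
    """Filter by context window, and by input price for API-hosted candidates."""
    keep = []
    for c in CANDIDATES:
        if c["ctx"] < min_ctx:
            continue
        if (max_in_per_m is not None and c["hosted"] == "api"
                and c["in_per_m"] > max_in_per_m):
            continue
        keep.append(c["name"])
    return keep

# Example: need at least 100k context and at most $3 per 1M input tokens
print(shortlist(min_ctx=100_000, max_in_per_m=3.0))
```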
Cost estimation
Example: Customer support chatbot
- 10,000 conversations/day
- Avg 2,000 input tokens + 500 output tokens per conversation

GPT-4o-mini: (10k * 2k * $0.15/1M) + (10k * 500 * $0.60/1M) = $6/day = $180/month
GPT-4o: (10k * 2k * $2.50/1M) + (10k * 500 * $10/1M) = $100/day = $3,000/month
Sonnet: (10k * 2k * $3/1M) + (10k * 500 * $15/1M) = $135/day = $4,050/month
Self-hosted 8B: ~$500/month (1x A10G) — fixed cost, unlimited requests
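
The same arithmetic as a small helper, using the per-1M-token prices quoted in the example (which may since have changed):

```python
# Reproduces the worked example: monthly API cost from daily volume and per-1M-token prices.
def monthly_cost(conv_per_day: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float, days: int = 30) -> float:
    daily = (conv_per_day * in_tokens * in_price_per_m
             + conv_per_day * out_tokens * out_price_per_m) / 1_000_000
    return daily * days

# 10,000 conversations/day, 2,000 input + 500 output tokens each
print(monthly_cost(10_000, 2_000, 500, 0.15, 0.60))  # GPT-4o-mini -> 180.0
print(monthly_cost(10_000, 2_000, 500, 2.50, 10.0))  # GPT-4o      -> 3000.0
print(monthly_cost(10_000, 2_000, 500, 3.00, 15.0))  # Sonnet      -> 4050.0
```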
Decision rules
- Start with the cheapest model that meets quality requirements — upgrade only with evidence.
- Test on YOUR data, not just benchmarks — benchmark performance does not always transfer.
- Factor in total cost: API fees + engineering time + infrastructure for self-hosted.
- Plan for model deprecation — always have a fallback model tested and ready.
- Re-evaluate quarterly — new models release frequently and shift the cost/quality frontier.
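
One way to keep a tested fallback ready, per the deprecation rule above: a minimal wrapper that retries against a second model when the primary fails. It assumes an OpenAI-compatible client and example model names; anything more elaborate belongs to the model-routing skill listed below.

```python
# Minimal fallback sketch: try the primary model, fall back to the tested alternative on error.
# Assumes an OpenAI-compatible API; model names are examples only.
from openai import OpenAI

client = OpenAI()
PRIMARY, FALLBACK = "gpt-4o-mini", "gpt-4o"  # your chosen pair, both pre-tested

def complete(prompt: str) -> str:
    last_error: Exception | None = None
    for model in (PRIMARY, FALLBACK):
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content or ""
        except Exception as exc:  # rate limit, deprecation, outage; log and move on
            last_error = exc
    raise RuntimeError("all candidate models failed") from last_error
```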
References
- https://artificialanalysis.ai/
- https://chat.lmsys.org/?leaderboard
- https://platform.openai.com/docs/models
Related skills
- model-routing — routing requests across multiple models
- inference-serving — deploying chosen models
- llama-cpp — running open-source models locally