Skillshub cohere-cost-tuning
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/cohere-cost-tuning" ~/.claude/skills/comeonoliver-skillshub-cohere-cost-tuning && rm -rf "$T"
manifest:
skills/jeremylongshore/claude-code-plugins-plus-skills/cohere-cost-tuning/SKILL.mdsource content
Cohere Cost Tuning
Overview
Optimize Cohere costs through model selection, token budgets, embedding compression, and usage monitoring. Cohere pricing is token-based with separate input/output rates.
Prerequisites
- Cohere production key (trial is free but limited)
- Access to dashboard.cohere.com billing page
Cohere Pricing Model
Key principle: Cohere charges per token. Input tokens and output tokens have different rates. Embed, Rerank, and Classify have separate pricing based on search units.
| Tier | Access | Rate Limits | Cost |
|---|---|---|---|
| Trial | Free | 5-20 calls/min, 1000/month | $0 |
| Production | Metered | 1000 calls/min, unlimited | Per-token |
Model Cost Comparison
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Lowest | Lowest | High-volume, simple tasks |
| Low | Low | RAG, cost-effective |
| Medium | Medium | Complex reasoning |
| Higher | Higher | Best quality |
Non-Chat Pricing
| Endpoint | Pricing Unit | Notes |
|---|---|---|
| Embed | Per input token | Batch 96 texts to minimize calls |
| Rerank | Per search unit | 1 query + N docs = 1 search unit |
| Classify | Per classification | Charges per input classified |
Instructions
Strategy 1: Model Tiering
type CostTier = 'economy' | 'standard' | 'premium'; function selectModel(tier: CostTier): string { switch (tier) { case 'economy': return 'command-r7b-12-2024'; // ~5x cheaper case 'standard': return 'command-r-08-2024'; // Good balance case 'premium': return 'command-a-03-2025'; // Best quality } } // Route by use case function routeModel(task: string): string { // High-volume, simple tasks → cheapest model if (['classify', 'extract', 'summarize-short'].includes(task)) { return selectModel('economy'); } // RAG, moderate complexity if (['rag', 'search', 'qa'].includes(task)) { return selectModel('standard'); } // Complex reasoning, user-facing return selectModel('premium'); }
Strategy 2: Token Budget Controls
import { CohereClientV2 } from 'cohere-ai'; const cohere = new CohereClientV2(); // Set maxTokens to prevent runaway generation costs async function budgetedChat(message: string, maxOutputTokens = 500) { const response = await cohere.chat({ model: 'command-r-08-2024', messages: [{ role: 'user', content: message }], maxTokens: maxOutputTokens, // Hard limit on output tokens }); // Track actual usage const usage = response.usage?.billedUnits; console.log(`Tokens: in=${usage?.inputTokens} out=${usage?.outputTokens}`); return response; }
Strategy 3: Embedding Cost Reduction
// 1. Use int8 embeddings (same quality, cheaper storage) const response = await cohere.embed({ model: 'embed-v4.0', texts: documents, inputType: 'search_document', embeddingTypes: ['int8'], // 75% less storage than float }); // 2. Batch to 96 per call (minimize API calls) // 3. Cache embeddings (they're deterministic — embed once, use forever) // 4. Use embed-multilingual-v3.0 if you don't need v4 features
Strategy 4: Usage Monitoring
class CohereUsageTracker { private usage: Record<string, { inputTokens: number; outputTokens: number; calls: number }> = {}; private dailyBudget: number; constructor(dailyBudgetUSD: number) { this.dailyBudget = dailyBudgetUSD; } track(endpoint: string, billedUnits: { inputTokens?: number; outputTokens?: number }) { if (!this.usage[endpoint]) { this.usage[endpoint] = { inputTokens: 0, outputTokens: 0, calls: 0 }; } this.usage[endpoint].inputTokens += billedUnits.inputTokens ?? 0; this.usage[endpoint].outputTokens += billedUnits.outputTokens ?? 0; this.usage[endpoint].calls++; } getReport(): string { return Object.entries(this.usage) .map(([ep, u]) => `${ep}: ${u.calls} calls, ${u.inputTokens} in, ${u.outputTokens} out` ) .join('\n'); } estimateDailyCost(): number { // Rough estimate — check cohere.com/pricing for exact rates const chatIn = (this.usage['chat']?.inputTokens ?? 0) / 1_000_000; const chatOut = (this.usage['chat']?.outputTokens ?? 0) / 1_000_000; const embedIn = (this.usage['embed']?.inputTokens ?? 0) / 1_000_000; // Multiply by per-million-token rates from pricing page return (chatIn * 0.5) + (chatOut * 1.5) + (embedIn * 0.1); // example rates } } // Wrap all API calls const tracker = new CohereUsageTracker(10); // $10/day budget async function trackedChat(params: any) { const response = await cohere.chat(params); tracker.track('chat', response.usage?.billedUnits ?? {}); if (tracker.estimateDailyCost() > tracker['dailyBudget'] * 0.8) { console.warn('WARNING: Approaching daily Cohere budget limit'); } return response; }
Strategy 5: Rerank Before RAG (Skip Embed for Small Corpora)
// If you have < 1000 documents, skip embedding entirely // Rerank is cheaper than Embed + vector search for small collections async function cheapRAG(query: string, corpus: string[]) { // 1 search unit instead of N embed calls const ranked = await cohere.rerank({ model: 'rerank-v3.5', query, documents: corpus, topN: 3, }); const docs = ranked.results.map((r, i) => ({ id: `doc-${i}`, data: { text: corpus[r.index] }, })); // Use cheaper model for generation return cohere.chat({ model: 'command-r-08-2024', // Not command-a (cheaper) messages: [{ role: 'user', content: query }], documents: docs, maxTokens: 300, }); }
Cost Optimization Checklist
- Use
for simple tasks,command-r7b
only for complex onescommand-a - Set
on all chat callsmaxTokens - Batch embed calls (96 texts per request)
- Cache embeddings (deterministic — compute once)
- Use
embeddings for storageint8 - Monitor
in every responseusage.billedUnits - Set daily budget alerts
- Use
instead ofrerank
for small corpora (< 1000 docs)embed
Output
- Model tiering by cost/quality
- Token budget controls preventing runaway costs
- Usage tracking with daily budget alerts
- Cost-effective RAG with rerank pre-filtering
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Unexpected bill spike | No maxTokens | Set maxTokens on all chat calls |
| High embed costs | Individual texts | Batch to 96 per call |
| Budget exceeded | No monitoring | Track billedUnits per response |
| Over-provisioned model | Using premium everywhere | Tier models by task complexity |
Resources
Next Steps
For architecture patterns, see
cohere-reference-architecture.