Skillshub cohere-cost-tuning

install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/cohere-cost-tuning" ~/.claude/skills/comeonoliver-skillshub-cohere-cost-tuning && rm -rf "$T"
manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/cohere-cost-tuning/SKILL.md
source content

Cohere Cost Tuning

Overview

Optimize Cohere costs through model selection, token budgets, embedding compression, and usage monitoring. Cohere pricing is token-based with separate input/output rates.

Prerequisites

Cohere Pricing Model

Key principle: Cohere charges per token. Input tokens and output tokens have different rates. Embed, Rerank, and Classify are priced separately (per input token, per search unit, and per classification, respectively).

Tier       | Access  | Rate Limits                        | Cost
Trial      | Free    | 5-20 calls/min, 1,000 calls/month  | $0
Production | Metered | 1,000 calls/min, unlimited         | Per-token
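Because input and output tokens bill at different per-million-token rates, the cost of a single call works out to `(inputTokens / 1M) × inputRate + (outputTokens / 1M) × outputRate`. A minimal sketch (the rate values used in the usage comment are placeholders, not real Cohere prices — check cohere.com/pricing):

```typescript
// Estimate the dollar cost of one call from billed token counts.
// Rates are expressed per million tokens.
function requestCostUSD(
  inputTokens: number,
  outputTokens: number,
  ratePerMTokIn: number,
  ratePerMTokOut: number
): number {
  return (
    (inputTokens / 1_000_000) * ratePerMTokIn +
    (outputTokens / 1_000_000) * ratePerMTokOut
  );
}

// Example with placeholder rates ($0.50 in / $1.50 out per 1M tokens):
// requestCostUSD(500_000, 1_000_000, 0.5, 1.5) → 1.75
```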

Model Cost Comparison

Model                  | Input (per 1M tokens) | Output (per 1M tokens) | Best For
command-r7b-12-2024    | Lowest                | Lowest                 | High-volume, simple tasks
command-r-08-2024      | Low                   | Low                    | RAG, cost-effective
command-r-plus-08-2024 | Medium                | Medium                 | Complex reasoning
command-a-03-2025      | Higher                | Higher                 | Best quality

Non-Chat Pricing

Endpoint | Pricing Unit       | Notes
Embed    | Per input token    | Batch 96 texts to minimize calls
Rerank   | Per search unit    | 1 query + N docs = 1 search unit
Classify | Per classification | Charges per input classified
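The batch-of-96 guidance for Embed can be sketched as a generic chunking helper. This is an illustrative sketch, not Cohere SDK code: `embedBatch` stands in for whatever function issues the actual Embed request, e.g. `(texts) => cohere.embed({ model: 'embed-v4.0', texts, ... })`.

```typescript
// Split an array into batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed a large corpus with the minimum number of API calls:
// one call per 96-text batch instead of one call per text.
async function embedAll(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, 96)) {
    vectors.push(...(await embedBatch(batch)));
  }
  return vectors;
}
```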

Instructions

Strategy 1: Model Tiering

type CostTier = 'economy' | 'standard' | 'premium';

function selectModel(tier: CostTier): string {
  switch (tier) {
    case 'economy':  return 'command-r7b-12-2024';    // ~5x cheaper
    case 'standard': return 'command-r-08-2024';       // Good balance
    case 'premium':  return 'command-a-03-2025';       // Best quality
  }
}

// Route by use case
function routeModel(task: string): string {
  // High-volume, simple tasks → cheapest model
  if (['classify', 'extract', 'summarize-short'].includes(task)) {
    return selectModel('economy');
  }
  // RAG, moderate complexity
  if (['rag', 'search', 'qa'].includes(task)) {
    return selectModel('standard');
  }
  // Complex reasoning, user-facing
  return selectModel('premium');
}

Strategy 2: Token Budget Controls

import { CohereClientV2 } from 'cohere-ai';

const cohere = new CohereClientV2({ token: process.env.CO_API_KEY });

// Set maxTokens to prevent runaway generation costs
async function budgetedChat(message: string, maxOutputTokens = 500) {
  const response = await cohere.chat({
    model: 'command-r-08-2024',
    messages: [{ role: 'user', content: message }],
    maxTokens: maxOutputTokens,  // Hard limit on output tokens
  });

  // Track actual usage
  const usage = response.usage?.billedUnits;
  console.log(`Tokens: in=${usage?.inputTokens} out=${usage?.outputTokens}`);

  return response;
}

Strategy 3: Embedding Cost Reduction

// 1. Use int8 embeddings (same quality, cheaper storage)
const response = await cohere.embed({
  model: 'embed-v4.0',
  texts: documents,
  inputType: 'search_document',
  embeddingTypes: ['int8'],     // 75% less storage than float
});

// 2. Batch to 96 per call (minimize API calls)
// 3. Cache embeddings (they're deterministic — embed once, use forever)
// 4. Use embed-multilingual-v3.0 if you don't need v4 features

Strategy 4: Usage Monitoring

class CohereUsageTracker {
  private usage: Record<string, { inputTokens: number; outputTokens: number; calls: number }> = {};
  private dailyBudget: number;

  constructor(dailyBudgetUSD: number) {
    this.dailyBudget = dailyBudgetUSD;
  }

  track(endpoint: string, billedUnits: { inputTokens?: number; outputTokens?: number }) {
    if (!this.usage[endpoint]) {
      this.usage[endpoint] = { inputTokens: 0, outputTokens: 0, calls: 0 };
    }
    this.usage[endpoint].inputTokens += billedUnits.inputTokens ?? 0;
    this.usage[endpoint].outputTokens += billedUnits.outputTokens ?? 0;
    this.usage[endpoint].calls++;
  }

  getReport(): string {
    return Object.entries(this.usage)
      .map(([ep, u]) =>
        `${ep}: ${u.calls} calls, ${u.inputTokens} in, ${u.outputTokens} out`
      )
      .join('\n');
  }

  estimateDailyCost(): number {
    // Rough estimate — check cohere.com/pricing for exact rates
    const chatIn = (this.usage['chat']?.inputTokens ?? 0) / 1_000_000;
    const chatOut = (this.usage['chat']?.outputTokens ?? 0) / 1_000_000;
    const embedIn = (this.usage['embed']?.inputTokens ?? 0) / 1_000_000;
    // Multiply by per-million-token rates from pricing page
    return (chatIn * 0.5) + (chatOut * 1.5) + (embedIn * 0.1); // example rates
  }

  isNearBudget(fraction = 0.8): boolean {
    return this.estimateDailyCost() > this.dailyBudget * fraction;
  }
}

// Wrap all API calls
const tracker = new CohereUsageTracker(10); // $10/day budget

async function trackedChat(params: any) {
  const response = await cohere.chat(params);
  tracker.track('chat', response.usage?.billedUnits ?? {});

  if (tracker.isNearBudget()) {
    console.warn('WARNING: Approaching daily Cohere budget limit');
  }

  return response;
}

Strategy 5: Rerank Before RAG (Skip Embed for Small Corpora)

// If you have < 1000 documents, skip embedding entirely
// Rerank is cheaper than Embed + vector search for small collections

async function cheapRAG(query: string, corpus: string[]) {
  // 1 search unit instead of N embed calls
  const ranked = await cohere.rerank({
    model: 'rerank-v3.5',
    query,
    documents: corpus,
    topN: 3,
  });

  const docs = ranked.results.map((r, i) => ({
    id: `doc-${i}`,
    data: { text: corpus[r.index] },
  }));

  // Use cheaper model for generation
  return cohere.chat({
    model: 'command-r-08-2024', // Not command-a (cheaper)
    messages: [{ role: 'user', content: query }],
    documents: docs,
    maxTokens: 300,
  });
}

Cost Optimization Checklist

  • Use command-r7b for simple tasks, command-a only for complex ones
  • Set maxTokens on all chat calls
  • Batch embed calls (96 texts per request)
  • Cache embeddings (deterministic — compute once)
  • Use int8 embeddings for storage
  • Monitor usage.billedUnits in every response
  • Set daily budget alerts
  • Use rerank instead of embed for small corpora (< 1000 docs)

Output

  • Model tiering by cost/quality
  • Token budget controls preventing runaway costs
  • Usage tracking with daily budget alerts
  • Cost-effective RAG with rerank pre-filtering

Error Handling

Issue                  | Cause                    | Solution
Unexpected bill spike  | No maxTokens             | Set maxTokens on all chat calls
High embed costs       | Individual texts         | Batch to 96 per call
Budget exceeded        | No monitoring            | Track billedUnits per response
Over-provisioned model | Using premium everywhere | Tier models by task complexity

Resources

Next Steps

For architecture patterns, see cohere-reference-architecture.