# Skillshub · cohere-performance-tuning

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/ComeOnOliver/skillshub
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/cohere-performance-tuning" ~/.claude/skills/comeonoliver-skillshub-cohere-performance-tuning && rm -rf "$T"
```

Manifest: `skills/jeremylongshore/claude-code-plugins-plus-skills/cohere-performance-tuning/SKILL.md`
# Cohere Performance Tuning

## Overview

Optimize Cohere API v2 performance through model selection, embedding batches, rerank pipelines, caching, and streaming for time-to-first-token.
## Prerequisites

- `cohere-ai` SDK installed (see the setup sketch below)
- Understanding of Cohere endpoints (Chat, Embed, Rerank)
- Redis or in-memory cache (optional)
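Every snippet below assumes an initialized v2 client in scope. A minimal setup sketch, assuming the `cohere-ai` SDK's `CohereClientV2` and an API key in a `CO_API_KEY` environment variable:

```typescript
// Minimal client setup; the CO_API_KEY variable name is a convention,
// not required by the SDK. All later snippets reuse this `cohere` client.
import { CohereClientV2 } from 'cohere-ai';

const cohere = new CohereClientV2({ token: process.env.CO_API_KEY });
```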
## Latency Benchmarks (Typical)
| Operation | Model | P50 | P95 |
|---|---|---|---|
| Chat (short) | | 500ms | 1.5s |
| Chat (short) | | 800ms | 2.5s |
| Chat (stream TTFT) | | 200ms | 600ms |
| Embed (96 texts) | embed-v4.0 | 150ms | 400ms |
| Rerank (100 docs) | rerank-v3.5 | 100ms | 300ms |
| Classify (96 inputs) | | 200ms | 500ms |
## Instructions

### Strategy 1: Model Selection by Latency Budget

```typescript
// Use smaller models for latency-sensitive paths
function selectModel(latencyBudgetMs: number): string {
  if (latencyBudgetMs < 1000) return 'command-r7b-12-2024'; // 7B, fastest
  if (latencyBudgetMs < 3000) return 'command-r-08-2024';   // Mid-tier
  return 'command-a-03-2025';                               // Best quality
}

// Pair with maxTokens to control output length
await cohere.chat({
  model: selectModel(1500),
  messages: [{ role: 'user', content: query }],
  maxTokens: 200, // Shorter output = lower latency
});
```
### Strategy 2: Streaming for Time-to-First-Token

```typescript
// Non-streaming: user waits for the entire response (800ms-5s)
// Streaming: first token arrives in ~200ms
async function streamForUI(message: string): Promise<string> {
  const stream = await cohere.chatStream({
    model: 'command-a-03-2025',
    messages: [{ role: 'user', content: message }],
  });

  let fullText = '';
  for await (const event of stream) {
    if (event.type === 'content-delta') {
      const text = event.delta?.message?.content?.text ?? '';
      fullText += text;
      // Emit to frontend immediately — perceived latency drops to ~200ms
    }
  }
  return fullText;
}
```
### Strategy 3: Batch Embeddings (96 per Call)

```typescript
// BAD: 1000 texts = 1000 API calls
for (const text of texts) {
  await cohere.embed({ model: 'embed-v4.0', texts: [text], /* ... */ });
}

// GOOD: 1000 texts = 11 API calls (96 per batch)
async function batchEmbed(texts: string[]): Promise<number[][]> {
  const BATCH = 96; // Cohere max per request
  const results: number[][] = [];

  const batches: string[][] = [];
  for (let i = 0; i < texts.length; i += BATCH) {
    batches.push(texts.slice(i, i + BATCH));
  }

  // Parallel batches (respect rate limits)
  const responses = await Promise.all(
    batches.map(batch =>
      cohere.embed({
        model: 'embed-v4.0',
        texts: batch,
        inputType: 'search_document',
        embeddingTypes: ['float'],
      })
    )
  );

  for (const resp of responses) {
    results.push(...resp.embeddings.float);
  }
  return results;
}
```
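`Promise.all` preserves input order, so the concatenated results stay aligned with the original texts. If firing every batch at once trips rate limits, bound the concurrency instead; a sketch of the same batching with a capped in-flight window (the cap of 4 is an assumption, not a Cohere limit):

```typescript
// Bounded-concurrency variant: at most `concurrency` embed batches in
// flight at a time. The default of 4 is an assumption; tune it to your
// account's rate limits.
async function batchEmbedBounded(texts: string[], concurrency = 4): Promise<number[][]> {
  const BATCH = 96;
  const batches: string[][] = [];
  for (let i = 0; i < texts.length; i += BATCH) {
    batches.push(texts.slice(i, i + BATCH));
  }

  const results: number[][] = [];
  for (let i = 0; i < batches.length; i += concurrency) {
    const window = batches.slice(i, i + concurrency);
    const responses = await Promise.all(
      window.map(batch =>
        cohere.embed({
          model: 'embed-v4.0',
          texts: batch,
          inputType: 'search_document',
          embeddingTypes: ['float'],
        })
      )
    );
    for (const resp of responses) {
      results.push(...(resp.embeddings.float ?? []));
    }
  }
  return results;
}
```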
### Strategy 4: Compressed Embeddings

```typescript
// float: 1024 dims * 4 bytes = 4KB per vector
// int8: 1024 dims * 1 byte = 1KB per vector (75% smaller)
// binary: 1024 dims / 8 = 128 bytes per vector (97% smaller)
const response = await cohere.embed({
  model: 'embed-v4.0',
  texts: documents,
  inputType: 'search_document',
  embeddingTypes: ['int8'], // or ['binary'] for maximum compression
});

// Use int8 for storage, float for final scoring
const storageVectors = response.embeddings.int8; // Store these
```
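The "int8 for storage, float for final scoring" split implies a two-stage pipeline: cheap dot products over the stored int8 vectors pick candidates, then only those get exact float scoring. A hypothetical sketch, where `floatVectorFor` stands in for however you fetch or recompute full-precision vectors:

```typescript
// Two-stage scoring sketch. Stage 1 ranks the whole corpus with cheap
// int8 dot products; stage 2 rescores only the top candidates with
// full-precision float vectors. `floatVectorFor` is hypothetical;
// back it with your own store or a fresh embed call.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

async function twoStageScore(
  queryInt8: number[],
  queryFloat: number[],
  storedInt8: number[][],
  floatVectorFor: (index: number) => Promise<number[]>,
  topK = 50,
) {
  // Stage 1: coarse ranking over compressed vectors
  const coarse = storedInt8
    .map((vec, index) => ({ index, score: dot(queryInt8, vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);

  // Stage 2: exact rescoring of the survivors only
  return Promise.all(
    coarse.map(async ({ index }) => ({
      index,
      score: dot(queryFloat, await floatVectorFor(index)),
    })),
  );
}
```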
### Strategy 5: Rerank as a Pre-filter

```typescript
// Instead of embedding everything, use rerank as a fast pre-filter
async function efficientSearch(query: string, corpus: string[]) {
  // Step 1: Rerank finds top candidates in ~100ms (up to 1000 docs)
  const reranked = await cohere.rerank({
    model: 'rerank-v3.5',
    query,
    documents: corpus,
    topN: 5,
  });

  // Step 2: Only embed the top 5 for fine-grained scoring (optional)
  const topDocs = reranked.results.map(r => ({
    text: corpus[r.index],
    score: r.relevanceScore,
  }));
  return topDocs;
}
```
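Usage is a single call. A hypothetical invocation, where `faqDocs` stands in for whatever string corpus you already hold:

```typescript
// Hypothetical usage: faqDocs is any string[] corpus.
const hits = await efficientSearch('How do I rotate my API key?', faqDocs);
for (const hit of hits) {
  console.log(hit.score.toFixed(3), hit.text.slice(0, 80));
}
```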
### Strategy 6: Embedding Cache

```typescript
import { LRUCache } from 'lru-cache';
import crypto from 'crypto';

const embedCache = new LRUCache<string, number[]>({
  max: 10_000,
  ttl: 24 * 60 * 60 * 1000, // 24h — embeddings are deterministic
});

function hashText(text: string): string {
  return crypto.createHash('sha256').update(text).digest('hex').slice(0, 16);
}

async function cachedEmbed(texts: string[]): Promise<number[][]> {
  const results: number[][] = new Array(texts.length);
  const uncached: { index: number; text: string }[] = [];

  // Check cache first
  for (let i = 0; i < texts.length; i++) {
    const key = hashText(texts[i]);
    const cached = embedCache.get(key);
    if (cached) {
      results[i] = cached;
    } else {
      uncached.push({ index: i, text: texts[i] });
    }
  }

  // Embed only uncached texts
  if (uncached.length > 0) {
    const vectors = await batchEmbed(uncached.map(u => u.text));
    for (let j = 0; j < uncached.length; j++) {
      results[uncached[j].index] = vectors[j];
      embedCache.set(hashText(uncached[j].text), vectors[j]);
    }
  }
  return results;
}
```
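Design note: truncating the SHA-256 digest to 16 hex characters leaves 64 bits of key space, so accidental collisions are vanishingly unlikely at 10,000 entries. One caveat: the key covers only the text, so if you embed with more than one model or `inputType`, fold those into the key as well.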
### Strategy 7: Response Caching for Chat

```typescript
import { LRUCache } from 'lru-cache';

// Cache chat responses for deterministic queries
const chatCache = new LRUCache<string, string>({
  max: 1000,
  ttl: 5 * 60 * 1000, // 5 min TTL — chat responses can vary
});

async function cachedChat(message: string, system?: string): Promise<string> {
  const key = `${system ?? ''}:${message}`;
  const cached = chatCache.get(key);
  if (cached) return cached;

  const response = await cohere.chat({
    model: 'command-a-03-2025',
    messages: [
      ...(system ? [{ role: 'system' as const, content: system }] : []),
      { role: 'user' as const, content: message },
    ],
    temperature: 0, // Deterministic for caching
  });

  const text = response.message?.content?.[0]?.text ?? '';
  chatCache.set(key, text);
  return text;
}
```
## Performance Monitoring

```typescript
async function timedCohereCall<T>(
  endpoint: string,
  fn: () => Promise<T>
): Promise<T> {
  const start = performance.now();
  try {
    const result = await fn();
    const ms = performance.now() - start;
    console.log(`[cohere] ${endpoint}: ${ms.toFixed(0)}ms`);
    return result;
  } catch (err) {
    const ms = performance.now() - start;
    console.error(`[cohere] ${endpoint} FAILED: ${ms.toFixed(0)}ms`, err);
    throw err;
  }
}
```
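The wrapper composes with any call in this document; a minimal usage sketch with the rerank example from Strategy 5:

```typescript
// Logs e.g. "[cohere] rerank: 112ms" alongside the normal result.
const reranked = await timedCohereCall('rerank', () =>
  cohere.rerank({
    model: 'rerank-v3.5',
    query: 'example query',
    documents: ['doc one', 'doc two'],
    topN: 1,
  })
);
```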
## Output
- Model selection by latency budget
- Streaming for ~200ms time-to-first-token
- Batch embedding (up to 96x fewer API calls)
- Compressed embeddings (75-97% storage savings)
- Cache layer for deterministic queries
- Rerank as fast pre-filter
## Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Chat > 5s | Long output + slow model | Use streaming, reduce maxTokens |
| Embed timeout | Too many texts | Batch to 96 per call |
| Cache stale | Long TTL | Reduce TTL for volatile data |
| High costs | No caching | Cache embeddings (deterministic) |
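For the transient failures above (slow chats, embed timeouts), a retry wrapper buys resilience without code churn. A minimal sketch; the attempt count and backoff schedule are assumptions to tune:

```typescript
// Minimal retry-with-backoff sketch. The attempt count and delay
// schedule are assumptions; tune them to your latency budget.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Exponential backoff: 250ms, 500ms, 1000ms, ...
      const delayMs = 250 * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
  throw lastErr;
}

// Hypothetical usage, composed with the timing wrapper above:
const embeddings = await withRetry(() =>
  timedCohereCall('embed', () =>
    cohere.embed({
      model: 'embed-v4.0',
      texts: ['hello world'],
      inputType: 'search_query',
      embeddingTypes: ['float'],
    })
  )
);
```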
## Next Steps

For cost optimization, see `cohere-cost-tuning`.