Claude-code-plugins-plus-skills groq-performance-tuning
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/groq-pack/skills/groq-performance-tuning" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-groq-performance-tuning && rm -rf "$T"
Manifest: plugins/saas-packs/groq-pack/skills/groq-performance-tuning/SKILL.md
Groq Performance Tuning
Overview
Maximize Groq's LPU inference speed advantage. Groq already delivers extreme throughput (280-560 tok/s) and low latency (<200ms TTFT), but client-side optimization -- model selection, prompt size, streaming, caching, and parallelism -- determines whether your application fully exploits that speed.
Groq Speed Benchmarks
| Model | TTFT | Throughput | Context |
|---|---|---|---|
| llama-3.1-8b-instant | ~50ms | ~560 tok/s | 128K |
| llama-3.3-70b-versatile | ~150ms | ~280 tok/s | 128K |
| | ~100ms | ~400 tok/s | 128K |
| | ~80ms | ~460 tok/s | 128K |
TTFT = Time to First Token. Actual values depend on prompt size and server load.
Instructions
Step 1: Choose the Right Model for Speed
import Groq from "groq-sdk"; const groq = new Groq(); // Speed tiers for different use cases const SPEED_MAP = { // Under 100ms TTFT -- use for latency-critical paths instant: "llama-3.1-8b-instant", // Under 200ms TTFT -- use for quality-sensitive paths balanced: "llama-3.3-70b-versatile", // Speculative decoding -- same quality as 70b, faster throughput fast70b: "llama-3.3-70b-specdec", } as const; type SpeedTier = keyof typeof SPEED_MAP; async function tieredCompletion(prompt: string, tier: SpeedTier = "instant") { return groq.chat.completions.create({ model: SPEED_MAP[tier], messages: [{ role: "user", content: prompt }], temperature: 0, // Deterministic = cacheable max_tokens: 256, // Only request what you need }); }
Step 2: Minimize Token Count
```typescript
// Groq charges per token AND rate limits on TPM
// Smaller prompts = faster responses + less quota usage

// BAD: verbose system prompt (200+ tokens)
const verbosePrompt =
  "You are an AI assistant that classifies text. Given a piece of text, analyze it carefully and determine whether the sentiment is positive, negative, or neutral. Consider the tone, word choice, and overall message...";

// GOOD: concise system prompt (15 tokens)
const concisePrompt = "Classify as positive/negative/neutral. One word only.";

// BAD: high max_tokens for short expected output
const wasteful = { max_tokens: 4096 }; // for a one-word response

// GOOD: match max_tokens to expected output
const efficient = { max_tokens: 5 }; // "positive" is 1 token
```
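As a rough guard, you can estimate token counts before sending. The helper below is a sketch using the common ~4 characters per token heuristic for English text; it is not Groq's actual tokenizer, and the 500-token budget is an illustrative threshold.

```typescript
// Rough pre-flight check: ~4 characters per token is a common heuristic
// for English text; it approximates, but does not match, the real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function assertPromptBudget(prompt: string, maxPromptTokens = 500): void {
  const estimate = estimateTokens(prompt);
  if (estimate > maxPromptTokens) {
    throw new Error(
      `Prompt is ~${estimate} tokens; trim it below ${maxPromptTokens} to protect TPM quota.`
    );
  }
}
```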
Step 3: Streaming for Perceived Performance
```typescript
async function streamWithMetrics(
  messages: any[],
  onToken: (token: string) => void
): Promise<{ content: string; ttftMs: number; totalMs: number; tokPerSec: number }> {
  const start = performance.now();
  let ttft = 0;
  let content = "";
  let tokenCount = 0;

  const stream = await groq.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages,
    stream: true,
    max_tokens: 1024,
  });

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content || "";
    if (token) {
      if (!ttft) ttft = performance.now() - start;
      content += token;
      tokenCount++;
      onToken(token);
    }
  }

  const totalMs = performance.now() - start;
  return {
    content,
    ttftMs: Math.round(ttft),
    totalMs: Math.round(totalMs),
    tokPerSec: Math.round(tokenCount / (totalMs / 1000)),
  };
}
```
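A usage sketch, assuming a Node environment where tokens can be written straight to stdout as they arrive:

```typescript
// Print tokens as they stream in, then report the measured numbers.
const metrics = await streamWithMetrics(
  [{ role: "user", content: "Summarize the benefits of streaming in one paragraph." }],
  (token) => process.stdout.write(token)
);
console.log(
  `\nTTFT ${metrics.ttftMs}ms | total ${metrics.totalMs}ms | ${metrics.tokPerSec} tok/s`
);
```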
Step 4: Semantic Prompt Cache
import { LRUCache } from "lru-cache"; import { createHash } from "crypto"; const promptCache = new LRUCache<string, string>({ max: 1000, ttl: 10 * 60_000, // 10 min TTL for deterministic responses }); function hashRequest(messages: any[], model: string): string { return createHash("sha256") .update(JSON.stringify({ messages, model })) .digest("hex"); } async function cachedCompletion( messages: any[], model = "llama-3.1-8b-instant" ): Promise<string> { const key = hashRequest(messages, model); const cached = promptCache.get(key); if (cached) return cached; const response = await groq.chat.completions.create({ model, messages, temperature: 0, // Cache only works with deterministic output }); const result = response.choices[0].message.content!; promptCache.set(key, result); return result; }
Step 5: Parallel Request Orchestration
import PQueue from "p-queue"; // Respect RPM limits while maximizing throughput const queue = new PQueue({ concurrency: 10, intervalCap: 25, interval: 60_000, }); async function parallelCompletions( prompts: string[], model = "llama-3.1-8b-instant" ): Promise<string[]> { const results = await Promise.all( prompts.map((prompt) => queue.add(() => cachedCompletion( [{ role: "user", content: prompt }], model ) ) ) ); return results as string[]; }
Step 6: Latency Benchmarking
```typescript
async function benchmarkModels(prompt: string, iterations = 3) {
  const models = [
    "llama-3.1-8b-instant",
    "llama-3.3-70b-versatile",
    "llama-3.3-70b-specdec",
  ];

  for (const model of models) {
    const latencies: number[] = [];
    const speeds: number[] = [];

    for (let i = 0; i < iterations; i++) {
      const start = performance.now();
      const result = await groq.chat.completions.create({
        model,
        messages: [{ role: "user", content: prompt }],
        max_tokens: 100,
      });
      const elapsed = performance.now() - start;
      latencies.push(elapsed);
      const tps =
        result.usage!.completion_tokens /
        ((result.usage as any).completion_time || elapsed / 1000);
      speeds.push(tps);
    }

    const avgLatency = latencies.reduce((a, b) => a + b) / latencies.length;
    const avgSpeed = speeds.reduce((a, b) => a + b) / speeds.length;
    console.log(
      `${model.padEnd(45)} | ${avgLatency.toFixed(0)}ms avg | ${avgSpeed.toFixed(0)} tok/s avg`
    );
  }
}
```
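Run it against a prompt shaped like your real traffic so the averages reflect your workload; the prompt below is only an example.

```typescript
// Benchmark with a prompt representative of production requests.
await benchmarkModels(
  "Extract the city and date from: 'Flight to Austin on March 3rd, returning the 10th.'"
);
```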
Performance Decision Matrix
| Scenario | Model | max_tokens | stream | cache |
|---|---|---|---|---|
| Classification | 8b-instant | 5 | No | Yes |
| Chat response | 70b-versatile | 1024 | Yes | No |
| Data extraction | 8b-instant | 200 | No | Yes |
| Code generation | 70b-versatile | 2048 | Yes | No |
| Bulk processing | 8b-instant | 256 | No | Yes |
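One way to apply the matrix is to encode it as a scenario-keyed preset map, so call sites pick a scenario instead of hand-tuning parameters. The sketch below mirrors the table and reuses `cachedCompletion` from Step 4; the scenario names and wiring are illustrative, and scenarios flagged `stream: true` would route through `streamWithMetrics` from Step 3.

```typescript
// Scenario presets mirroring the decision matrix above; names are illustrative.
const SCENARIO_PRESETS = {
  classification: { model: "llama-3.1-8b-instant", max_tokens: 5, stream: false, cache: true },
  chat: { model: "llama-3.3-70b-versatile", max_tokens: 1024, stream: true, cache: false },
  extraction: { model: "llama-3.1-8b-instant", max_tokens: 200, stream: false, cache: true },
  codegen: { model: "llama-3.3-70b-versatile", max_tokens: 2048, stream: true, cache: false },
  bulk: { model: "llama-3.1-8b-instant", max_tokens: 256, stream: false, cache: true },
} as const;

type Scenario = keyof typeof SCENARIO_PRESETS;

async function completeForScenario(scenario: Scenario, prompt: string) {
  const { model, max_tokens, cache } = SCENARIO_PRESETS[scenario];
  const messages = [{ role: "user" as const, content: prompt }];
  // Cacheable scenarios reuse the Step 4 helper; streaming scenarios would use
  // streamWithMetrics from Step 3 instead of the plain call below.
  if (cache) return cachedCompletion(messages, model);
  const response = await groq.chat.completions.create({ model, messages, max_tokens });
  return response.choices[0].message.content;
}
```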
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| High TTFT | Using 70b for simple tasks | Switch to llama-3.1-8b-instant |
| Rate limit (429) | Over RPM or TPM | Use queue with interval limiting |
| Stream disconnect | Network timeout | Implement reconnection with partial content |
| Token overflow | max_tokens too high | Set to expected output size |
| Cache miss rate high | Unique prompts | Normalize prompts, use template patterns |
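For the 429 row, a retry wrapper with exponential backoff complements the rate-limited queue from Step 5. This is a sketch: it assumes the SDK's error object exposes a numeric `status` field, so adapt the check to the error shape you actually observe.

```typescript
// Sketch: retry only on 429s, backing off exponentially (1s, 2s, 4s, ...).
// The `status` field is an assumption about the SDK error shape.
async function withRateLimitRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const isRateLimit = err?.status === 429;
      if (!isRateLimit || attempt >= maxRetries) throw err;
      const delayMs = 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: wrap any completion call, including the queued ones from Step 5.
const text = await withRateLimitRetry(() =>
  cachedCompletion([
    { role: "user", content: "Classify as positive/negative/neutral. One word only.\n\nLoved it." },
  ])
);
```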
Next Steps
For cost optimization, see groq-cost-tuning.