Claude-code-plugins-plus-skills groq-rate-limits
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/groq-pack/skills/groq-rate-limits" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-groq-rate-limits && rm -rf "$T"
manifest:
plugins/saas-packs/groq-pack/skills/groq-rate-limits/SKILL.md
Groq Rate Limits
Overview
Handle Groq rate limits using the retry-after header, exponential backoff, and request queuing. Groq enforces limits at the organization level with both RPM (requests per minute) and TPM (tokens per minute) constraints; hitting either one triggers an HTTP 429.
Rate Limit Structure
Groq rate limits vary by plan and model. Both limits are applied simultaneously: you must stay under RPM and TPM at the same time.
| Constraint | Description |
|---|---|
| RPM | Requests per minute |
| RPD | Requests per day |
| TPM | Tokens per minute |
| TPD | Tokens per day |
Free tier limits are significantly lower than paid tier. Check your current limits at console.groq.com/settings/limits.
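Because both constraints apply at once, a pre-flight check has to consider requests and tokens together. A minimal sketch of that idea (the helper name and its inputs are illustrative, not part of the Groq SDK; the remaining counts would come from the x-ratelimit-remaining-* response headers described below):

```typescript
// Returns true only if the next call fits under BOTH budgets:
// at least one request left, and enough token headroom for the
// estimated size of this request.
function hasBudget(
  remainingRequests: number,
  remainingTokens: number,
  estimatedTokens: number
): boolean {
  return remainingRequests >= 1 && remainingTokens >= estimatedTokens;
}

// One request left but not enough token headroom: still blocked.
hasBudget(1, 200, 500); // false
hasBudget(10, 8000, 500); // true
```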
Rate Limit Response Headers
When Groq responds (even on success), it includes these headers:
| Header | Description |
|---|---|
| x-ratelimit-limit-requests | Max requests in current window |
| x-ratelimit-limit-tokens | Max tokens in current window |
| x-ratelimit-remaining-requests | Requests remaining before limit |
| x-ratelimit-remaining-tokens | Tokens remaining before limit |
| x-ratelimit-reset-requests | Time until request limit resets |
| x-ratelimit-reset-tokens | Time until token limit resets |
| retry-after | Seconds to wait (only on 429 responses) |
Instructions
Step 1: Parse Rate Limit Headers
```typescript
import Groq from "groq-sdk";

interface RateLimitInfo {
  limitRequests: number;
  limitTokens: number;
  remainingRequests: number;
  remainingTokens: number;
  resetRequestsMs: number;
  resetTokensMs: number;
}

function parseRateLimitHeaders(headers: Record<string, string>): RateLimitInfo {
  return {
    limitRequests: parseInt(headers["x-ratelimit-limit-requests"] || "0"),
    limitTokens: parseInt(headers["x-ratelimit-limit-tokens"] || "0"),
    remainingRequests: parseInt(headers["x-ratelimit-remaining-requests"] || "0"),
    remainingTokens: parseInt(headers["x-ratelimit-remaining-tokens"] || "0"),
    resetRequestsMs: parseResetTime(headers["x-ratelimit-reset-requests"]),
    resetTokensMs: parseResetTime(headers["x-ratelimit-reset-tokens"]),
  };
}

function parseResetTime(value?: string): number {
  if (!value) return 0;
  // Groq returns reset times like "1.2s" or "120ms"
  if (value.endsWith("ms")) return parseFloat(value);
  if (value.endsWith("s")) return parseFloat(value) * 1000;
  return parseFloat(value) * 1000;
}
```
Step 2: Exponential Backoff with Retry-After
```typescript
async function withRateLimitRetry<T>(
  operation: () => Promise<T>,
  options = { maxRetries: 5, baseDelayMs: 1000, maxDelayMs: 60_000 }
): Promise<T> {
  for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt === options.maxRetries) throw err;

      if (err instanceof Groq.APIError && err.status === 429) {
        // Prefer the retry-after header from Groq
        const retryAfterSec = parseInt(err.headers?.["retry-after"] || "0");
        let delayMs: number;
        if (retryAfterSec > 0) {
          delayMs = retryAfterSec * 1000;
        } else {
          // Exponential backoff with jitter
          const exponential = options.baseDelayMs * Math.pow(2, attempt);
          const jitter = Math.random() * 500;
          delayMs = Math.min(exponential + jitter, options.maxDelayMs);
        }
        console.warn(
          `Rate limited (attempt ${attempt + 1}/${options.maxRetries}). Waiting ${(delayMs / 1000).toFixed(1)}s...`
        );
        await new Promise((r) => setTimeout(r, delayMs));
        continue;
      }

      // Non-rate-limit errors: only retry 5xx
      if (err instanceof Groq.APIError && err.status >= 500) {
        const delayMs = options.baseDelayMs * Math.pow(2, attempt);
        await new Promise((r) => setTimeout(r, delayMs));
        continue;
      }

      throw err; // 4xx (except 429) are not retryable
    }
  }
  throw new Error("Unreachable");
}
```
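The delay schedule used above can be isolated into a pure function for testing. This is a sketch of the same formula (the function name and jitter parameter are illustrative, not from any SDK):

```typescript
// Exponential backoff delay, capped at a maximum, with optional jitter.
// delay = min(base * 2^attempt + jitter, max)
function backoffDelayMs(
  attempt: number,
  baseDelayMs = 1000,
  maxDelayMs = 60_000,
  jitterMs = 0
): number {
  const exponential = baseDelayMs * Math.pow(2, attempt);
  return Math.min(exponential + jitterMs, maxDelayMs);
}

// Attempts 0..6 with a 1s base: 1s, 2s, 4s, 8s, 16s, 32s, then 60s (capped)
const schedule = [0, 1, 2, 3, 4, 5, 6].map((a) => backoffDelayMs(a));
```

Keeping the formula pure makes it easy to verify the cap and growth rate without actually sleeping.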
Step 3: Request Queue with Concurrency Control
```typescript
import PQueue from "p-queue";

// Queue that respects Groq RPM limits
function createGroqQueue(requestsPerMinute: number) {
  return new PQueue({
    intervalCap: requestsPerMinute,
    interval: 60_000, // 1 minute window
    concurrency: 5, // Max parallel requests
  });
}

const queue = createGroqQueue(30); // Free tier: 30 RPM

async function queuedCompletion(messages: any[], model: string) {
  return queue.add(() =>
    withRateLimitRetry(() =>
      groq.chat.completions.create({ model, messages })
    )
  );
}
```
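The intervalCap behavior p-queue provides can be approximated with a sliding-window counter. This hand-rolled sketch is not part of p-queue's API; it only illustrates the underlying idea:

```typescript
// Tracks send timestamps and answers whether another request may be
// sent now without exceeding `cap` requests per `windowMs` window.
class SlidingWindowLimiter {
  private sent: number[] = [];
  constructor(private cap: number, private windowMs: number) {}

  canSend(now: number): boolean {
    // Drop timestamps that have aged out of the window.
    this.sent = this.sent.filter((t) => now - t < this.windowMs);
    return this.sent.length < this.cap;
  }

  record(now: number): void {
    this.sent.push(now);
  }
}

const limiter = new SlidingWindowLimiter(30, 60_000); // 30 RPM
```

In production, a maintained library like p-queue is preferable, since it also handles concurrency and task scheduling.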
Step 4: Proactive Rate Limit Monitor
```typescript
class RateLimitMonitor {
  private remaining = { requests: Infinity, tokens: Infinity };
  private resets = { requests: 0, tokens: 0 };

  update(headers: Record<string, string>): void {
    const info = parseRateLimitHeaders(headers);
    this.remaining.requests = info.remainingRequests;
    this.remaining.tokens = info.remainingTokens;
    this.resets.requests = Date.now() + info.resetRequestsMs;
    this.resets.tokens = Date.now() + info.resetTokensMs;
  }

  shouldThrottle(): boolean {
    return this.remaining.requests < 3 || this.remaining.tokens < 500;
  }

  async waitIfNeeded(): Promise<void> {
    if (!this.shouldThrottle()) return;
    const waitMs = Math.max(
      this.resets.requests - Date.now(),
      this.resets.tokens - Date.now(),
      0
    );
    if (waitMs > 0) {
      console.log(`Throttling: waiting ${(waitMs / 1000).toFixed(1)}s for rate limit reset`);
      await new Promise((r) => setTimeout(r, waitMs));
    }
  }

  getStatus(): string {
    return `Requests: ${this.remaining.requests} remaining | Tokens: ${this.remaining.tokens} remaining`;
  }
}
```
Step 5: Model-Aware Rate Limit Strategy
```typescript
// Different models have different limits -- route accordingly
async function smartModelSelect(
  messages: any[],
  preferredModel: string,
  monitor: RateLimitMonitor
): Promise<string> {
  // If rate limited on the preferred model, try a different one
  if (monitor.shouldThrottle()) {
    const fallbacks: Record<string, string> = {
      "llama-3.3-70b-versatile": "llama-3.1-8b-instant",
      "llama-3.1-8b-instant": "llama-3.3-70b-versatile", // Different limit pool
    };
    const fallback = fallbacks[preferredModel];
    if (fallback) {
      console.log(`Switching from ${preferredModel} to ${fallback} (rate limit)`);
      return fallback;
    }
  }
  return preferredModel;
}
```
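The routing decision in Step 5 reduces to a table lookup gated on the throttle state. A self-contained sketch of just that decision (model names copied from above; the function name is illustrative):

```typescript
// Fallback pairs: each model routes to an alternative with a
// separate rate limit pool.
const fallbacks: Record<string, string> = {
  "llama-3.3-70b-versatile": "llama-3.1-8b-instant",
  "llama-3.1-8b-instant": "llama-3.3-70b-versatile",
};

// Pick a fallback only when throttled AND one is configured;
// otherwise keep the preferred model.
function routeModel(preferred: string, throttled: boolean): string {
  if (throttled && fallbacks[preferred]) return fallbacks[preferred];
  return preferred;
}
```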
Error Handling
| Scenario | Symptom | Solution |
|---|---|---|
| Burst of requests | Many 429s in quick succession | Use queue with interval limiting |
| Large prompts burn TPM | 429 on tokens, not requests | Lower the output token budget, compress prompts |
| Free tier too restrictive | Constant 429s | Upgrade to Developer plan at console.groq.com |
| Multiple services sharing key | Cascading 429s | Use separate API keys per service |
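For the "large prompts burn TPM" case, a rough pre-flight token estimate helps decide whether a prompt needs trimming before it is sent. The ~4 characters per token heuristic below is an approximation for English text, not Groq's tokenizer, and the function names are illustrative:

```typescript
// Rough token estimate: ~4 characters per token for English text.
// Real counts come from the model's tokenizer; treat this as a floor check.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// A request must fit its prompt tokens plus its maximum output tokens
// inside the remaining TPM budget.
function fitsTpmBudget(
  prompt: string,
  remainingTokens: number,
  maxOutputTokens: number
): boolean {
  return estimateTokens(prompt) + maxOutputTokens <= remainingTokens;
}
```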
Resources
Next Steps
For security configuration, see groq-security-basics.