together-rate-limits · claude-code-plugins-plus-skills

Install

Source · Clone the upstream repo

git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/together-pack/skills/together-rate-limits" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-together-rate-limits && rm -rf "$T"

Manifest: plugins/saas-packs/together-pack/skills/together-rate-limits/SKILL.md

Source content

Together AI Rate Limits

Overview

Together AI's OpenAI-compatible inference API enforces per-key rate limits that vary by model tier and operation type. Chat completions and embeddings share a global request quota, while fine-tuning jobs and batch inference have separate concurrency caps. High-throughput workloads like embedding entire document corpora or running evaluations across 100+ prompts require client-side token bucket limiting. Together's batch inference endpoint offers 50% cost savings but has its own queue depth limits that differ from real-time inference.
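
For orientation, here is a minimal request against the OpenAI-compatible chat completions endpoint; the model ID and the TOGETHER_API_KEY environment variable are illustrative assumptions, not values taken from this skill:

// Minimal chat completion request (sketch; model ID is illustrative).
const res = await fetch("https://api.together.xyz/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.TOGETHER_API_KEY}`,
  },
  body: JSON.stringify({
    model: "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages: [{ role: "user", content: "Hello" }],
  }),
});
// A 429 status here means the per-key quota described below is exhausted.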

Rate Limit Reference

| Endpoint | Limit | Window | Scope |
|---|---|---|---|
| Chat completions | 600 req | 1 minute | Per API key |
| Embeddings | 300 req | 1 minute | Per API key |
| Image generation (FLUX) | 60 req | 1 minute | Per API key |
| Fine-tune jobs (concurrent) | 3 jobs | Rolling | Per API key |
| Batch inference | 100 req/batch, 10 batches | Rolling | Per API key |
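
The limiter budgets used below leave headroom under these hard limits; collecting both in one constants map makes the relationship explicit (the buffer sizes are judgment calls, not documented figures):

// Per-minute request budgets from the table above, with headroom so
// short bursts don't hit the hard limit (buffer sizes are assumptions).
const LIMITS = {
  chat:       { hard: 600, budget: 500 },
  embeddings: { hard: 300, budget: 250 },
  imageGen:   { hard: 60,  budget: 50 },
} as const;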

Rate Limiter Implementation

class TogetherRateLimiter {
  private tokens: number;
  private lastRefill: number;
  private readonly max: number;
  private readonly refillRate: number; // tokens per millisecond
  private queue: Array<{ resolve: () => void }> = [];
  private drainTimer: ReturnType<typeof setTimeout> | null = null;

  constructor(maxPerMinute: number) {
    this.max = maxPerMinute;
    this.tokens = maxPerMinute;
    this.lastRefill = Date.now();
    this.refillRate = maxPerMinute / 60_000;
  }

  async acquire(): Promise<void> {
    this.refill();
    if (this.tokens >= 1) { this.tokens -= 1; return; }
    // Out of tokens: park the caller and schedule a drain so queued
    // waiters resolve even if no further acquire() calls arrive.
    return new Promise(resolve => {
      this.queue.push({ resolve });
      this.scheduleDrain();
    });
  }

  private scheduleDrain() {
    if (this.drainTimer) return;
    const msUntilToken = Math.ceil((1 - this.tokens) / this.refillRate);
    this.drainTimer = setTimeout(() => {
      this.drainTimer = null;
      this.refill();
      if (this.queue.length) this.scheduleDrain();
    }, msUntilToken);
  }

  private refill() {
    const now = Date.now();
    this.tokens = Math.min(this.max, this.tokens + (now - this.lastRefill) * this.refillRate);
    this.lastRefill = now;
    // Hand refilled tokens to waiters in FIFO order.
    while (this.tokens >= 1 && this.queue.length) {
      this.tokens -= 1;
      this.queue.shift()!.resolve();
    }
  }
}

const chatLimiter = new TogetherRateLimiter(500);  // buffer under 600
const embedLimiter = new TogetherRateLimiter(250); // buffer under 300
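
A call site then acquires a token before each request, so bursts queue client-side instead of returning 429s (callChat is a hypothetical helper name):

// Acquire a token before every request; excess calls wait in the queue.
async function callChat(body: unknown): Promise<Response> {
  await chatLimiter.acquire(); // waits if the bucket is empty
  return fetch("https://api.together.xyz/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.TOGETHER_API_KEY}`,
    },
    body: JSON.stringify(body),
  });
}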

Retry Strategy

async function togetherRetry<T>(
  limiter: TogetherRateLimiter, fn: () => Promise<Response>, maxRetries = 4
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    await limiter.acquire();
    const res = await fn();
    if (res.ok) return res.json() as Promise<T>;
    if (res.status === 429) {
      // Honor Retry-After, plus jitter so parallel workers don't retry in lockstep.
      const retryAfter = parseInt(res.headers.get("Retry-After") || "5", 10);
      const jitter = Math.random() * 2000;
      await new Promise(r => setTimeout(r, retryAfter * 1000 + jitter));
      continue;
    }
    if (res.status >= 500 && attempt < maxRetries) {
      // Exponential backoff for transient server errors: 1s, 2s, 4s, ...
      await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
      continue;
    }
    // Non-retryable client error: surface the response body.
    throw new Error(`Together API ${res.status}: ${await res.text()}`);
  }
  throw new Error("Max retries exceeded");
}
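
Putting the limiter and retry helper together for one chat completion might look like the following; the ChatResponse shape follows the OpenAI-compatible schema, and the model ID is illustrative:

// Example: rate-limited, retrying chat completion.
interface ChatResponse {
  choices: Array<{ message: { role: string; content: string } }>;
}

const reply = await togetherRetry<ChatResponse>(chatLimiter, () =>
  fetch("https://api.together.xyz/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.TOGETHER_API_KEY}`,
    },
    body: JSON.stringify({
      model: "meta-llama/Llama-3.3-70B-Instruct-Turbo",
      messages: [{ role: "user", content: "Say hello." }],
    }),
  })
);
console.log(reply.choices[0].message.content);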

Batch Processing

const headers = {
  "Content-Type": "application/json",
  Authorization: `Bearer ${process.env.TOGETHER_API_KEY}`,
};

async function batchEmbedDocuments(texts: string[], model: string, batchSize = 20) {
  const results: any[] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const result = await togetherRetry(embedLimiter, () =>
      fetch("https://api.together.xyz/v1/embeddings", {
        method: "POST", headers,
        body: JSON.stringify({ model, input: batch }),
      })
    );
    results.push(result);
    // Pause between batches to keep sustained throughput under the embedding quota.
    if (i + batchSize < texts.length) await new Promise(r => setTimeout(r, 3000));
  }
  return results;
}
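
Usage over a corpus, flattening each batch response into one vector list (the model ID is illustrative; the data[].embedding shape is the OpenAI-compatible embeddings schema):

const corpus = ["first document...", "second document...", "third document..."];
const batches = await batchEmbedDocuments(corpus, "BAAI/bge-large-en-v1.5");
// Flatten per-batch responses into one list of vectors.
const vectors = batches.flatMap(b => b.data.map((d: any) => d.embedding));
console.log(`${vectors.length} embeddings, dim ${vectors[0].length}`);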

Error Handling

| Issue | Cause | Fix |
|---|---|---|
| 429 on chat completions | Exceeded 600 req/min key limit | Use token bucket, avoid burst patterns |
| 429 on embeddings | Embedding limit is half of chat | Batch inputs (up to 20 texts per request) |
| Model not found | Wrong model ID string | Verify with GET /v1/models endpoint |
| 503 model overloaded | Popular model at peak demand | Retry with backoff, or use fallback model |
| Fine-tune 409 | 3 concurrent job limit reached | Wait for running job to complete first |
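
For the 503 case, a fallback chain can keep requests flowing while one model is saturated (the helper name and both model IDs are illustrative assumptions):

// Try models in order; fall through to the next on 503 "model overloaded".
async function chatWithFallback(messages: unknown[]): Promise<Response> {
  const models = [
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",     // preferred
    "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", // fallback
  ];
  let last: Response | undefined;
  for (const model of models) {
    await chatLimiter.acquire();
    last = await fetch("https://api.together.xyz/v1/chat/completions", {
      method: "POST", headers,
      body: JSON.stringify({ model, messages }),
    });
    if (last.status !== 503) return last;
  }
  return last!;
}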

Next Steps

See together-performance-tuning.