Claude-code-plugins-plus-skills together-reference-architecture

install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/together-pack/skills/together-reference-architecture" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-together-reference-architecture && rm -rf "$T"
manifest: plugins/saas-packs/together-pack/skills/together-reference-architecture/SKILL.md
source content

Together AI Reference Architecture

Overview

Production architecture for AI inference, fine-tuning, and batch processing with Together AI's OpenAI-compatible API. Designed for teams routing requests across 100+ open-source models (Llama, Mixtral, Qwen, FLUX) with intelligent model selection, response caching, fine-tune pipeline management, and cost optimization via batch inference at 50% discount. Key design drivers: model routing for cost/quality tradeoffs, inference caching for repeated queries, fine-tune lifecycle management, and graceful degradation across model providers.

Architecture Diagram

Application ──→ Model Router ──→ Cache (Redis) ──→ Together API (v1)
                    ↓                                /chat/completions
               Queue (Bull) ──→ Batch Worker         /completions
                    ↓                                /images/generations
               Fine-Tune Manager ──→ Together API    /fine-tunes
                    ↓                                /models
               Cost Tracker ──→ Analytics Dashboard

Service Layer

class InferenceService {
  constructor(private together: TogetherClient, private cache: CacheLayer, private router: ModelRouter) {}

  async complete(request: InferenceRequest): Promise<InferenceResponse> {
    const model = this.router.selectModel(request.task, request.priority);
    const cacheKey = `inference:${model}:${this.hashPrompt(request.prompt)}`;
    const cached = await this.cache.get(cacheKey);
    if (cached && request.allowCached) return cached;
    const response = await this.together.chatCompletions({ model, messages: request.messages, temperature: request.temperature ?? 0.7 });
    await this.cache.set(cacheKey, response, CACHE_CONFIG.inference.ttl);
    await this.costTracker.record(model, response.usage);
    return response;
  }

  async submitBatch(requests: InferenceRequest[]): Promise<string> {
    const batchId = await this.together.createBatch(requests.map(r => ({
      model: this.router.selectModel(r.task, 'batch'), messages: r.messages })));
    return batchId;  // 50% cost reduction for batch processing
  }
}

Caching Strategy

const CACHE_CONFIG = {
  inference:   { ttl: 3600,  prefix: 'infer' },    // 1 hr — deterministic prompts (temp=0) cache well
  embeddings:  { ttl: 86400, prefix: 'embed' },     // 24 hr — embeddings are stable for same input
  modelList:   { ttl: 3600,  prefix: 'models' },    // 1 hr — available models change infrequently
  fineTune:    { ttl: 60,    prefix: 'ft' },         // 1 min — training status needs near-real-time
  batchStatus: { ttl: 30,    prefix: 'batch' },      // 30s — batch completion polling
};
// Cache only temp=0 responses by default; stochastic responses bypass cache unless explicitly opted in

Event Pipeline

class InferencePipeline {
  private queue = new Bull('together-events', { redis: process.env.REDIS_URL });

  async onFineTuneComplete(event: FineTuneEvent): Promise<void> {
    await this.queue.add('deploy-model', event, { attempts: 3, backoff: { type: 'exponential', delay: 5000 } });
  }

  async processFineTuneEvent(event: FineTuneEvent): Promise<void> {
    if (event.status === 'completed') {
      await this.router.registerModel(event.modelId, { task: event.task, cost: event.inferCostPerToken });
      await this.runEvalSuite(event.modelId, event.evalDataset);
    }
    if (event.status === 'failed') await this.notifyTeam(event.error);
  }

  async processBatchComplete(batchId: string): Promise<void> {
    const results = await this.together.getBatchResults(batchId);
    await this.storeResults(results);
    await this.costTracker.recordBatch(batchId, results.usage);
  }
}

Data Model

interface InferenceRequest  { task: 'chat' | 'code' | 'embedding' | 'image'; messages: Message[]; prompt?: string; temperature?: number; priority: 'realtime' | 'standard' | 'batch'; allowCached?: boolean; }
interface ModelRoute         { modelId: string; task: string; costPerToken: number; latencyP50Ms: number; qualityScore: number; }
interface FineTuneJob        { id: string; baseModel: string; trainingFile: string; status: 'pending' | 'running' | 'completed' | 'failed'; epochs: number; learningRate: number; }
interface CostRecord         { model: string; promptTokens: number; completionTokens: number; costUsd: number; timestamp: string; }

Scaling Considerations

  • Route low-priority requests to cheaper models (Llama 8B) and high-priority to larger models (Llama 70B, Mixtral)
  • Use batch API for non-interactive workloads — 50% cost savings with acceptable latency tradeoff
  • Cache embeddings aggressively — identical text produces identical vectors, high cache hit rate
  • Monitor per-model cost and latency; auto-shift traffic when a model degrades or pricing changes
  • Fine-tune pipeline should use a separate API key with isolated rate limits from production inference

Error Handling

ComponentFailure ModeRecovery
Inference requestModel overloaded (500)Fallback to alternative model in same task category
Rate limiting429 Too Many RequestsToken bucket with exponential backoff, queue overflow to batch
Fine-tune jobTraining divergenceAuto-stop on loss plateau, notify team with checkpoint artifacts
Batch processingPartial batch failureRetry failed items individually, report partial results
Model routingSelected model deprecatedAuto-reroute to replacement model, alert team to update config

Resources

Next Steps

See

together-deploy-integration
.