Claude-code-plugins-plus-skills · anth-performance-tuning
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/anthropic-pack/skills/anth-performance-tuning" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-anth-performance-tuning && rm -rf "$T"
manifest:
plugins/saas-packs/anthropic-pack/skills/anth-performance-tuning/SKILL.md
Anthropic Performance Tuning
Overview
Optimize Claude API latency and throughput via prompt caching, model selection, streaming, and request optimization. The biggest wins come from prompt caching (roughly 90% cost reduction on cached input tokens) and model selection (Haiku is about 4x faster than Sonnet).
Prompt Caching (Biggest Win)
```python
import anthropic

client = anthropic.Anthropic()

# Mark long, reusable content with cache_control.
# Cached content: 90% cheaper on subsequent requests,
# near-zero latency for the cached portion.
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert on the following 50-page document: ...<long document>...",
            "cache_control": {"type": "ephemeral"},  # Cache this block
        }
    ],
    messages=[{"role": "user", "content": "What does section 3.2 say?"}],
)

# Check cache performance
print(f"Cache read tokens: {message.usage.cache_read_input_tokens}")  # Free/cheap
print(f"Cache creation tokens: {message.usage.cache_creation_input_tokens}")  # First call only
print(f"Uncached input tokens: {message.usage.input_tokens}")
```
Cache requirements: Minimum 1,024 tokens for Sonnet/Opus, 2,048 for Haiku. Cache lives for 5 minutes (refreshed on each hit).
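Because cache hits require the prefix to repeat exactly, build the cached block once and reuse the same object across calls. A minimal sketch of keeping the 5-minute window warm over several questions (LONG_DOCUMENT is a placeholder, not part of the SDK):

```python
import anthropic

client = anthropic.Anthropic()

# The cached prefix must repeat exactly; build it once and reuse it.
CACHED_SYSTEM = [
    {
        "type": "text",
        "text": LONG_DOCUMENT,  # placeholder; must exceed the minimum cacheable size
        "cache_control": {"type": "ephemeral"},
    }
]

for question in ["What does section 1 cover?", "Summarize section 2."]:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        system=CACHED_SYSTEM,
        messages=[{"role": "user", "content": question}],
    )
    # Each hit inside the 5-minute window refreshes the cache TTL.
    print(msg.usage.cache_read_input_tokens, msg.usage.input_tokens)
```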
Model Selection for Speed
| Model | Speed | Cost (per MTok in/out) | Best For |
|---|---|---|---|
| Claude Haiku | Fastest | $0.80 / $4.00 | Classification, extraction, routing |
| Claude Sonnet | Balanced | $3.00 / $15.00 | General tasks, tool use, code |
| Claude Opus | Deepest | $15.00 / $75.00 | Complex reasoning, research |
```python
# Route by task complexity
def select_model(task_type: str) -> str:
    routing = {
        "classify": "claude-haiku-4-20250514",
        "extract": "claude-haiku-4-20250514",
        "summarize": "claude-sonnet-4-20250514",
        "code": "claude-sonnet-4-20250514",
        "research": "claude-opus-4-20250514",
    }
    return routing.get(task_type, "claude-sonnet-4-20250514")
```
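As a usage sketch, the router drops straight into a request (the prompt here is illustrative):

```python
msg = client.messages.create(
    model=select_model("classify"),
    max_tokens=64,
    messages=[{"role": "user", "content": "Bug report or feature request? 'App crashes on login.'"}],
)
```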
Streaming for Perceived Speed
```python
# Streaming reduces time-to-first-token from seconds to ~200ms
def stream_response(prompt: str):
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text  # User sees response immediately
```
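To verify the perceived win, a rough sketch that times the first token (using the stream_response generator above):

```python
import time

start = time.monotonic()
for i, chunk in enumerate(stream_response("Explain prompt caching in one paragraph.")):
    if i == 0:
        # Time-to-first-token: the latency the user actually feels
        print(f"TTFT: {time.monotonic() - start:.2f}s")
    print(chunk, end="", flush=True)
```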
Reduce Token Count
```python
# 1. Set max_tokens to what you actually need (not max)
msg = client.messages.create(
    model="claude-haiku-4-20250514",
    max_tokens=128,  # Not 4096; smaller = faster generation
    messages=[{"role": "user", "content": "Classify as positive/negative: 'Great product!'"}],
)

# 2. Use prefill to skip preamble
msg = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=64,
    messages=[
        {"role": "user", "content": "Classify sentiment: 'Great product!'"},
        {"role": "assistant", "content": "Sentiment:"},  # Skip "Sure, I'd be happy to..."
    ],
)

# 3. Pre-check token count for large inputs
count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": large_document}],
)
if count.input_tokens > 100_000:
    # Chunk or summarize first
    pass
```
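One possible way to handle the oversized case (a rough sketch; chunk_text is a hypothetical helper and the character budget is arbitrary):

```python
def chunk_text(text: str, max_chars: int = 200_000) -> list[str]:
    # Crude character-based splitting; a token-aware splitter would
    # check count_tokens per chunk instead.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chunks = chunk_text(large_document)
```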
Parallel Requests
```typescript
import Anthropic from '@anthropic-ai/sdk';
import PQueue from 'p-queue';

const client = new Anthropic();
const queue = new PQueue({ concurrency: 10 });

// Process multiple prompts in parallel (within rate limits)
const results = await Promise.all(
  prompts.map(p =>
    queue.add(() =>
      client.messages.create({
        model: 'claude-haiku-4-20250514',
        max_tokens: 256,
        messages: [{ role: 'user', content: p }],
      })
    )
  )
);
```
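If you are calling from Python instead, the same fan-out can be sketched with the SDK's async client and a semaphore as the concurrency cap (the limits here are illustrative):

```python
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()
semaphore = asyncio.Semaphore(10)  # stay within your rate limits

async def run_one(prompt: str):
    async with semaphore:
        return await client.messages.create(
            model="claude-haiku-4-20250514",
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}],
        )

async def run_all(prompts: list[str]):
    return await asyncio.gather(*(run_one(p) for p in prompts))

# results = asyncio.run(run_all(prompts))
```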
Performance Benchmarks
| Optimization | Latency Impact | Cost Impact |
|---|---|---|
| Prompt caching | -50% (cached portion) | -90% input cost |
| Haiku over Sonnet | -75% TTFT | -73% cost |
| Streaming | -80% TTFT (perceived) | Same cost |
| Lower max_tokens | -10% to -30% total time | Same cost |
| Prefill technique | -20% output tokens | Proportional savings |
Next Steps
For cost optimization, see anth-cost-tuning.