# llm-cost-optimization

Reduce LLM API costs without sacrificing quality. Covers prompt caching (Anthropic), local response caching, prompt compression, debouncing triggers, and cost analysis. Use when building LLM-powered features, analyzing API costs, optimizing prompts, or implementing caching strategies.
## Install

Source · clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-cost-optimization" ~/.claude/skills/majiayu000-claude-skill-registry-llm-cost-optimization && rm -rf "$T"
```

Manifest: `skills/data/llm-cost-optimization/SKILL.md` (source content below)
# LLM Cost Optimization

Practical techniques to reduce LLM API costs by 35-65%.

## Quick Reference
| Technique | Savings | When to Use | Reference |
|---|---|---|---|
| Prompt Caching | 25-45% | Same system prompt, frequent calls | caching.md |
| Response Cache | 100% on hits | Repeated identical requests | caching.md |
| Prompt Compression | 10-20% | Long system prompts | prompts.md |
| Debouncing | 50%+ | Duplicate triggers | triggers.md |
## The 80/20 of LLM Costs
For short user inputs, system prompts dominate costs:
| Text Length | Input Tokens | System Prompt Share of Input |
|---|---|---|
| Short (~100 chars) | ~250 | 80-87% |
| Medium (~500 chars) | ~450 | 44% |
| Long (~2000 chars) | ~900 | 22% |
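A quick way to sanity-check these shares: they are consistent with a system prompt of roughly 200 tokens. A minimal sketch, where the 200-token figure is an assumption inferred from the table rather than a measured value:

```rust
/// Back-of-envelope: what fraction of input tokens does the system prompt take?
fn system_prompt_share(system_tokens: u32, user_tokens: u32) -> f64 {
    system_tokens as f64 / (system_tokens + user_tokens) as f64
}

fn main() {
    // Assuming the ~200-token system prompt the table above implies:
    println!("{:.0}%", 100.0 * system_prompt_share(200, 50));  // short:  80%
    println!("{:.0}%", 100.0 * system_prompt_share(200, 250)); // medium: 44%
    println!("{:.0}%", 100.0 * system_prompt_share(200, 700)); // long:   22%
}
```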
Optimization priority:

1. Cache system prompts (biggest impact)
2. Cache identical requests (free repeats)
3. Debounce triggers (prevent waste)
4. Compress prompts (last resort)
## Cost Estimation (Claude Haiku 3.5)
| Text Length | Est. Cost per Request |
|---|---|
| Short (~100 chars) | ~$0.0004 |
| Medium (~500 chars) | ~$0.0008 |
| Long (~2000 chars) | ~$0.002 |
Benchmark: 1000 translations ≈ $0.80 (before optimization)
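A minimal estimator that reproduces these numbers. The per-token prices are Anthropic's list prices for Claude Haiku 3.5 at the time of writing ($0.80/MTok input, $4.00/MTok output; check current pricing), and the output token counts are assumptions chosen to match the table:

```rust
/// Rough per-request cost in USD at assumed Claude Haiku 3.5 list prices.
fn estimate_cost(input_tokens: u64, output_tokens: u64) -> f64 {
    const INPUT_PER_MTOK: f64 = 0.80;
    const OUTPUT_PER_MTOK: f64 = 4.00;
    (input_tokens as f64 * INPUT_PER_MTOK + output_tokens as f64 * OUTPUT_PER_MTOK)
        / 1_000_000.0
}

fn main() {
    // Short request from the table: ~250 input + ~50 output tokens
    println!("${:.4}", estimate_cost(250, 50)); // ~$0.0004
    // 1000 medium translations: ~450 input + ~110 output tokens each
    println!("${:.2}", 1000.0 * estimate_cost(450, 110)); // ~$0.80
}
```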
## Implementation Checklist

### Before Building
- Add logging to every AI trigger point
- Verify triggers fire exactly once per user action
- Check for Pressed/Released event duplicates (a minimal debouncer sketch follows this list)
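Debouncing (the 50%+ line in the Quick Reference) needs nothing more than remembering the last accepted trigger. A std-only sketch; the 300 ms window is an arbitrary starting point, not a recommendation from triggers.md:

```rust
use std::time::{Duration, Instant};

/// Swallows triggers that arrive within `window` of the last accepted one.
struct Debouncer {
    last: Option<Instant>,
    window: Duration,
}

impl Debouncer {
    fn new(window: Duration) -> Self {
        Self { last: None, window }
    }

    /// Returns true if this trigger should actually reach the LLM call.
    fn should_fire(&mut self) -> bool {
        let now = Instant::now();
        match self.last {
            Some(prev) if now.duration_since(prev) < self.window => false,
            _ => {
                self.last = Some(now);
                true
            }
        }
    }
}

// Usage: one Debouncer per trigger point.
// let mut debounce = Debouncer::new(Duration::from_millis(300));
// if !debounce.should_fire() { return; }
```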
### Caching Strategy
- Enable Anthropic Prompt Caching for system prompts
- Implement local response cache (hash-based)
- Include model name in cache key
- Set reasonable cache limits (e.g., 500 entries LRU; see the sketch after this list)
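A std-only sketch of the local response cache described above. `get_cached` and `save_to_cache` mirror the hypothetical helpers used in Quick Win 3 below; eviction here is FIFO for brevity (a true LRU, e.g. the `lru` crate, would also refresh entries on reads):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, VecDeque};
use std::hash::{Hash, Hasher};

const CACHE_CAP: usize = 500; // the checklist's suggested limit

struct ResponseCache {
    map: HashMap<u64, String>,
    order: VecDeque<u64>, // oldest key at the front
}

impl ResponseCache {
    fn new() -> Self {
        Self { map: HashMap::new(), order: VecDeque::new() }
    }

    /// The model name MUST be part of the key:
    /// Haiku and Sonnet answer the same input differently.
    fn key(text: &str, model: &str) -> u64 {
        let mut h = DefaultHasher::new();
        model.hash(&mut h);
        text.hash(&mut h);
        h.finish()
    }

    fn get_cached(&self, text: &str, model: &str) -> Option<String> {
        self.map.get(&Self::key(text, model)).cloned()
    }

    fn save_to_cache(&mut self, text: &str, result: &str, model: &str) {
        if self.map.len() >= CACHE_CAP {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        let k = Self::key(text, model);
        self.map.insert(k, result.to_string());
        self.order.push_back(k);
    }
}
```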
### Prompt Design
- Measure current token count (a counting sketch follows this list)
- Identify critical rules (security, output format)
- Test quality after compression
- Document WHY for each rule kept
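For the first item, Anthropic's Messages API exposes a `count_tokens` endpoint, so you can measure instead of guessing. A sketch assuming the `reqwest` crate (with the `blocking` and `json` features) and `serde_json`; the model alias and minimal error handling are simplifications:

```rust
use serde_json::json;

/// Ask the API how many input tokens a prompt uses (counting itself is free).
fn count_tokens(system_prompt: &str, api_key: &str) -> Result<u64, Box<dyn std::error::Error>> {
    let body = json!({
        "model": "claude-3-5-haiku-latest",
        "system": system_prompt,
        "messages": [{ "role": "user", "content": "ping" }]
    });
    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("https://api.anthropic.com/v1/messages/count_tokens")
        .header("x-api-key", api_key)
        .header("anthropic-version", "2023-06-01")
        .json(&body)
        .send()?
        .error_for_status()?
        .json()?;
    Ok(resp["input_tokens"].as_u64().unwrap_or(0))
}
```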
## Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Trigger fires twice | 2x cost | Check `event.state` |
| No prompt caching | Full price every call | Use `cache_control` |
| Aggressive prompt compression | Quality drops | Keep critical rules |
| Cache key missing model | Wrong results | Include model in key |
## Quick Wins

### 1. Check for Duplicate Triggers
```rust
// Before ANY optimization, verify this
log::info!("AI trigger fired: {:?}", event);
if event.state != ShortcutState::Pressed {
    return; // Ignore Released events
}
```
### 2. Enable Prompt Caching (Anthropic)
```rust
let system = vec![SystemBlock {
    block_type: "text".to_string(),
    text: system_prompt,
    cache_control: CacheControl {
        cache_type: "ephemeral".to_string(),
    },
}];
```
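`SystemBlock` and `CacheControl` above are hypothetical local types; what matters is the JSON they serialize to. A `serde_json` sketch of the request shape the Messages API expects, where `cache_control` marks the prefix up to and including that block as cacheable. Note that per Anthropic's docs at the time of writing, prompts below the minimum cacheable length (1024 tokens for Sonnet/Opus, 2048 for Haiku models) are not cached and bill at the normal rate:

```rust
use serde_json::{json, Value};

/// Build a Messages API body with the system prompt marked cacheable.
fn cached_request_body(system_prompt: &str, user_text: &str) -> Value {
    json!({
        "model": "claude-3-5-haiku-latest",
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": { "type": "ephemeral" }
        }],
        "messages": [{ "role": "user", "content": user_text }]
    })
}
```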
### 3. Add Response Cache
```rust
// Check cache before API call
if let Some(cached) = get_cached(&text, &model) {
    return Ok(cached); // Free!
}

// ... make the API call ...

// Save after API call
save_to_cache(&text, &result, &model)?;
```
## Anti-Patterns

- **TOON format for plain text** - only helps with structured data
- **Caching without the model in the key** - Haiku vs Sonnet give different results
- **Prompt compression first** - optimize triggers and caching before touching prompts