ai-design-components / embedding-optimization
Optimizing vector embeddings for RAG systems through model selection, chunking strategies, caching, and performance tuning. Use when building semantic search, RAG pipelines, or document retrieval systems that require cost-effective, high-quality embeddings.
```bash
git clone https://github.com/ancoleman/ai-design-components
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/ancoleman/ai-design-components "$T" && \
  mkdir -p ~/.claude/skills && \
  cp -r "$T/skills/embedding-optimization" ~/.claude/skills/ancoleman-ai-design-components-embedding-optimization && \
  rm -rf "$T"
```
skills/embedding-optimization/SKILL.md

Embedding Optimization
Optimize embedding generation for cost, performance, and quality in RAG and semantic search systems.
When to Use This Skill
Trigger this skill when:
- Building RAG (Retrieval Augmented Generation) systems
- Implementing semantic search or similarity detection
- Optimizing embedding API costs (reducing by 70-90%)
- Improving document retrieval quality through better chunking
- Processing large document corpora (thousands to millions of documents)
- Selecting between API-based vs. local embedding models
Model Selection Framework
Choose the optimal embedding model based on requirements:
Quick Recommendations:
- Startup/MVP: `all-MiniLM-L6-v2` (local, 384 dims, zero API costs)
- Production: `text-embedding-3-small` (API, 1,536 dims, balanced quality/cost)
- High Quality: `text-embedding-3-large` (API, 3,072 dims, premium)
- Multilingual: `multilingual-e5-base` (local, 768 dims) or Cohere `embed-multilingual-v3.0`
For detailed decision frameworks including cost comparisons, quality benchmarks, and data privacy considerations, see `references/model-selection-guide.md`.
Model Comparison Summary:
| Model | Type | Dimensions | Cost per 1M tokens | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | Local | 384 | $0 (compute only) | High volume, tight budgets |
| BGE-base-en-v1.5 | Local | 768 | $0 (compute only) | Quality + cost balance |
| text-embedding-3-small | API | 1,536 | $0.02 | General purpose production |
| text-embedding-3-large | API | 3,072 | $0.13 | Premium quality requirements |
| embed-multilingual-v3.0 | API | 1,024 | $0.10 | 100+ language support |
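To make the local vs. API trade-off concrete, here is a minimal sketch of generating embeddings both ways, using the model names from the table above. It assumes the `sentence-transformers` and `openai` packages are installed and that `OPENAI_API_KEY` is set; it is an illustration, not the skill's own tooling.

```python
# Illustrative local vs. API embedding generation.
from sentence_transformers import SentenceTransformer
from openai import OpenAI

texts = ["What is retrieval augmented generation?", "RAG combines search with LLMs."]

# Local: all-MiniLM-L6-v2 (384 dims, no API cost, compute only)
local_model = SentenceTransformer("all-MiniLM-L6-v2")
local_vectors = local_model.encode(texts, normalize_embeddings=True)
print(local_vectors.shape)  # (2, 384)

# API: text-embedding-3-small (1,536 dims, $0.02 per 1M tokens)
client = OpenAI()
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
api_vectors = [item.embedding for item in response.data]
print(len(api_vectors[0]))  # 1536
```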
Chunking Strategies
Select chunking strategy based on content type and use case:
Content Type → Strategy Mapping:
- Documentation: Recursive (heading-aware), 800 chars, 100 overlap
- Code: Recursive (function-level), 1,000 chars, 100 overlap
- Q&A/FAQ: Fixed-size, 500 chars, 50 overlap (precise retrieval)
- Legal/Technical: Semantic (large), 1,500 chars, 200 overlap (context preservation)
- Blog Posts: Semantic (paragraph), 1,000 chars, 100 overlap
- Academic Papers: Recursive (section-aware), 1,200 chars, 150 overlap
For detailed chunking patterns, decision trees, and implementation guidance, see `references/chunking-strategies.md`.
Quick Start with CLI:
```bash
python scripts/chunk_document.py \
  --input document.txt \
  --content-type markdown \
  --chunk-size 800 \
  --overlap 100 \
  --output chunks.jsonl
```
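For orientation, a minimal sketch of fixed-size chunking with overlap in plain Python (illustrative only; the repo's `scripts/chunk_document.py` adds content-type awareness such as heading- and function-level splitting):

```python
# Illustrative fixed-size chunker with overlap, breaking at natural boundaries.
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer ending at a newline or space near the window boundary.
            boundary = max(text.rfind("\n", start, end), text.rfind(" ", start, end))
            if boundary > start:
                end = boundary
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # step back for overlap, but always advance
    return chunks

chunks = chunk_text(open("document.txt", encoding="utf-8").read(), chunk_size=800, overlap=100)
```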
Caching Implementation
Achieve 80-90% cost reduction through content-addressable caching.
Caching Architecture by Query Volume:
- <10K queries/month: In-memory cache (Python `lru_cache`)
- 10K-100K queries/month: Redis (fast, TTL-based expiration)
- 100K-1M queries/month: Redis (hot) + PostgreSQL (warm)
- >1M queries/month: Multi-tier (Redis + PostgreSQL + S3)
Production Caching with Redis:
```bash
# Embed documents with caching enabled
python scripts/cached_embedder.py \
  --model text-embedding-3-small \
  --input documents.jsonl \
  --output embeddings.npy \
  --cache-backend redis \
  --cache-ttl 2592000  # 30 days
```
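The core idea behind content-addressable caching is to key each vector on a hash of the model name plus the exact text, so identical content is only ever embedded once. A minimal Redis-backed sketch (illustrative; the repo's `scripts/cached_embedder.py` handles batching, backends, and fallbacks):

```python
# Sketch of content-addressable embedding caching with Redis.
import hashlib
import numpy as np
import redis
from openai import OpenAI

r = redis.Redis()
client = OpenAI()
MODEL = "text-embedding-3-small"
TTL_SECONDS = 30 * 24 * 3600  # 30 days

def embed_with_cache(text: str) -> np.ndarray:
    # Key on (model, content hash) so identical text hits the cache.
    key = f"emb:{MODEL}:{hashlib.sha256(text.encode()).hexdigest()}"
    cached = r.get(key)
    if cached is not None:
        return np.frombuffer(cached, dtype=np.float32)
    vector = np.array(
        client.embeddings.create(model=MODEL, input=text).data[0].embedding,
        dtype=np.float32,
    )
    r.set(key, vector.tobytes(), ex=TTL_SECONDS)
    return vector
```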
Caching ROI Example:
- 50,000 document chunks
- 20% duplicate content
- Without caching: $0.50 API cost
- With caching (60% hit rate): $0.20 API cost
- Savings: 60% ($0.30)
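The $0.50 baseline is consistent with roughly 500 tokens per chunk at text-embedding-3-small pricing (the per-chunk token count is an assumption made here for illustration):

```python
# Back-of-envelope cache ROI for the example above.
chunks = 50_000
tokens_per_chunk = 500          # assumed average chunk length
price_per_million = 0.02        # text-embedding-3-small, USD per 1M tokens
hit_rate = 0.60

baseline = chunks * tokens_per_chunk / 1_000_000 * price_per_million  # $0.50
with_cache = baseline * (1 - hit_rate)                                # $0.20
print(f"baseline ${baseline:.2f}, cached ${with_cache:.2f}, saved ${baseline - with_cache:.2f}")
```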
Dimensionality Trade-offs
Balance storage, search speed, and quality:
| Dimensions | Storage (1M vectors) | Search Speed (p95) | Quality | Use Case |
|---|---|---|---|---|
| 384 | 1.5 GB | 10ms | Good | Large-scale search |
| 768 | 3 GB | 15ms | High | General purpose RAG |
| 1,536 | 6 GB | 25ms | Very High | High-quality retrieval |
| 3,072 | 12 GB | 40ms | Highest | Premium applications |
Key Insight: For most RAG applications, 768 dimensions (BGE-base-en-v1.5 local or equivalent) provides the best quality/cost/speed balance.
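The storage column follows directly from float32 vectors (4 bytes per dimension). The sketch below shows the arithmetic, plus the `dimensions` parameter that OpenAI's text-embedding-3 models accept for requesting shorter vectors directly; treat it as an illustration rather than a benchmark.

```python
# Rough storage math behind the table, plus API-side dimension reduction.
from openai import OpenAI

def storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dims * bytes_per_value / 1e9

for dims in (384, 768, 1536, 3072):
    # Roughly 1.5, 3, 6, and 12 GB for 1M float32 vectors.
    print(dims, f"{storage_gb(1_000_000, dims):.1f} GB")

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="shorter vectors, cheaper storage",
    dimensions=768,  # truncate from the native 1,536 dims at request time
)
print(len(resp.data[0].embedding))  # 768
```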
Batch Processing Optimization
Maximize throughput for large-scale ingestion:
OpenAI API:
- Batch up to 2,048 inputs per request
- Implement rate limiting (tier-dependent: 500-5,000 RPM)
- Use parallel requests with backoff on rate limits
Local Models (sentence-transformers):
- GPU acceleration (CUDA, MPS for Apple Silicon)
- Batch size tuning (32-128 based on GPU memory)
- Multi-GPU support for maximum throughput
Expected Throughput:
- OpenAI API: 1,000-5,000 texts/minute (rate limit dependent)
- Local GPU (RTX 3090): 5,000-10,000 texts/minute
- Local CPU: 100-500 texts/minute
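A minimal sketch of batched API embedding with exponential backoff on rate limits, assuming the `openai` Python package (production code should also track token counts and tier-specific RPM limits):

```python
# Illustrative batched embedding with retry-on-rate-limit.
import time
import openai
from openai import OpenAI

client = OpenAI()

def embed_in_batches(texts: list[str], batch_size: int = 2048, max_retries: int = 5) -> list[list[float]]:
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        for attempt in range(max_retries):
            try:
                resp = client.embeddings.create(model="text-embedding-3-small", input=batch)
                vectors.extend(item.embedding for item in resp.data)
                break
            except openai.RateLimitError:
                # Back off exponentially when the tier's rate limit is hit.
                time.sleep(2 ** attempt)
        else:
            raise RuntimeError(f"batch starting at index {i} failed after {max_retries} retries")
    return vectors
```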
Performance Monitoring
Track key metrics for optimization:
Critical Metrics:
- Latency: Embedding generation time (p50, p95, p99)
- Throughput: Embeddings per second/minute
- Cost: API usage tracking (USD per 1K/1M tokens)
- Cache Efficiency: Hit rate percentage
For detailed monitoring setup, metric collection patterns, and dashboarding, see `references/performance-monitoring.md`.
Monitor with Wrapper:
```python
from scripts.performance_monitor import MonitoredEmbedder

monitored = MonitoredEmbedder(
    embedder=your_embedder,
    cost_per_1k_tokens=0.00002,  # OpenAI pricing
)

embeddings = monitored.embed_batch(texts)
metrics = monitored.get_metrics()
print(f"Cache hit rate: {metrics['cache_hit_rate_pct']}%")
print(f"Total cost: ${metrics['total_cost_usd']}")
```
Working Examples
See the `examples/` directory for complete implementations:
Python Examples:
- `examples/openai_cached.py`: OpenAI embeddings with Redis caching
- `examples/local_embedder.py`: Local embeddings with sentence-transformers
- `examples/smart_chunker.py`: Content-aware recursive chunking
- `examples/performance_monitor.py`: Pipeline performance tracking
- `examples/batch_processor.py`: Large-scale document processing
All examples include:
- Complete, runnable code
- Dependency installation instructions
- Error handling and retry logic
- Configuration options
Integration Points
Upstream (This skill provides to):
- Vector Databases: Embeddings flow to Pinecone, Weaviate, Qdrant, pgvector
- RAG Systems: Optimized embeddings for retrieval pipelines
- Semantic Search: Query and document embeddings for similarity search
Downstream (This skill uses from):
- Document Processing: Chunk documents before embedding
- Data Ingestion: Process documents from various sources
Related Skills:
- For RAG architecture, see the `building-ai-chat` skill
- For vector database operations, see the `databases-vector` skill
- For data ingestion pipelines, see the `ingesting-data` skill
Common Patterns
Pattern 1: RAG Pipeline
Document → Chunk → Embed → Store (vector DB) → Retrieve
Pattern 2: Semantic Search
Query → Embed → Search (vector DB) → Rank → Display
Pattern 3: Multi-Stage Retrieval (Cost Optimization)
Query → Cheap Embedding (384d) → Initial Search → Expensive Embedding (1,536d) → Rerank Top-K → Return
Cost Savings: 70% reduction vs. single-stage with expensive embeddings
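A sketch of Pattern 3, where a cheap 384-dim model handles the broad candidate search and a higher-quality model only reranks the shortlist. `vector_db.search` is a hypothetical stand-in for whatever vector store client you use:

```python
# Illustrative two-stage retrieval: cheap candidate search, expensive rerank.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

cheap = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dims, local
client = OpenAI()

def two_stage_search(query: str, vector_db, top_k: int = 5, candidates: int = 50):
    # Stage 1: wide, cheap candidate retrieval against 384-dim vectors.
    query_384 = cheap.encode(query, normalize_embeddings=True)
    hits = vector_db.search(query_384, limit=candidates)  # hypothetical API

    # Stage 2: rerank only the candidates with the higher-quality model.
    texts = [query] + [hit["text"] for hit in hits]
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([item.embedding for item in resp.data])
    scores = vecs[1:] @ vecs[0] / (np.linalg.norm(vecs[1:], axis=1) * np.linalg.norm(vecs[0]))
    order = np.argsort(-scores)[:top_k]
    return [hits[i] for i in order]
```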
Quick Reference Checklist
Model Selection:
- Identified data privacy requirements (local vs. API)
- Calculated expected query volume
- Determined quality requirements (good/high/highest)
- Checked multilingual support needs
Chunking:
- Analyzed content type (code, docs, legal, etc.)
- Selected appropriate chunk size (500-1,500 chars)
- Set overlap to prevent context loss (50-200 chars)
- Validated chunks preserve semantic boundaries
Caching:
- Implemented content-addressable hashing
- Selected cache backend (Redis, PostgreSQL)
- Set TTL based on content volatility
- Monitoring cache hit rate (target: >60%)
Performance:
- Tracking latency (embedding generation time)
- Measuring throughput (embeddings/sec)
- Monitoring costs (USD spent on API calls)
- Optimizing batch sizes for maximum efficiency