# claude-skill-registry · llm-serving-patterns
LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-serving-patterns" ~/.claude/skills/majiayu000-claude-skill-registry-llm-serving-patterns && rm -rf "$T"
```

**Manifest**: `skills/data/llm-serving-patterns/SKILL.md`
# LLM Serving Patterns

## When to Use This Skill

Use this skill when:
- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively
Keywords: LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
## LLM Serving Architecture Overview

```
LLM Serving Stack

Clients (API, Chat UI, Agents)
        │
        ▼
Load Balancer / API Gateway
  • Rate limiting   • Authentication   • Request routing
        │
        ▼
Inference Server
  Request Queue ──▶ Batching Engine ──▶ KV Cache Management
                          │                     │
                          ▼                     ▼
  Model Execution Engine
  • Tensor operations   • Attention   • Token sampling
        │
        ▼
GPU/TPU Cluster
  • Model sharding   • Tensor parallelism   • Pipeline parallelism
```
## Serving Framework Comparison
| Framework | Strengths | Best For | Considerations |
|---|---|---|---|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
### Framework Selection Decision Tree

```
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No  → vLLM (general purpose)
```
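For the common case, a minimal vLLM offline-inference sketch looks like the following (the model name is just an example; swap in any Hugging Face checkpoint you have access to):

```python
# Minimal vLLM sketch: load a model and generate; continuous batching
# is handled internally by the engine.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```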
## Quantization Techniques

### Precision Levels
| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
### Quantization Methods
| Method | Description | Quality | Speed |
|---|---|---|---|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates quantization difficulty from activations to weights | Excellent | Moderate |
### Quantization Selection

```
Quality vs. Efficiency Trade-off:

Quality ────────────────────────────────────────────▶ Efficiency

  FP32      FP16     INT8+AWQ   INT8+GPTQ    INT4      INT2
   ○──────────○──────────○───────────○──────────○─────────○
  Best      Great       Good        Good       Fair      Poor
```
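In practice you usually load an already-quantized checkpoint rather than quantizing yourself. A sketch of both common routes, assuming the quantized repos exist on the Hugging Face Hub (model names are illustrative):

```python
# Route 1: serve a pre-quantized AWQ (4-bit) checkpoint with vLLM.
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

# Route 2: load a model in INT8 on the fly with transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```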
## Batching Strategies

### Static Batching

```
Request 1: [tokens: 100] ─┐
Request 2: [tokens:  50] ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens:  80] ─┘

Problem: short requests wait for long ones (head-of-line blocking)
```
### Continuous Batching (Preferred)

```
Time ──────────────────────────────────────────────────────────▶
Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]

• New requests join the batch as others complete
• No padding waste
• Optimal GPU utilization
```
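The scheduling idea is easy to see in a toy simulation (no real model here; each request just needs a fixed number of decode steps):

```python
# Toy continuous-batching scheduler: finished sequences free their slot
# immediately and queued requests join the batch mid-flight.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    remaining: int  # tokens left to generate

def continuous_batching(requests, max_batch=4):
    queue, active, steps = deque(requests), [], 0
    while queue or active:
        while queue and len(active) < max_batch:  # admit new requests
            active.append(queue.popleft())
        for r in active:                          # one decode step per request
            r.remaining -= 1
        active = [r for r in active if r.remaining > 0]  # slots free up
        steps += 1
    return steps

print(continuous_batching([Request(n) for n in (32, 12, 20, 8, 16)]))
```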
### Batching Parameters

| Parameter | Description | Trade-off |
|---|---|---|
| Max batch size | Maximum concurrent requests | Memory vs. throughput |
| Max waiting tokens | Tokens to wait before forcing a batch | Latency vs. throughput |
| Max sequences | Maximum sequences in a batch | Memory vs. concurrency |
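These knobs map onto engine arguments in most servers; in vLLM, for example (values below are illustrative starting points, not recommendations):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    max_num_seqs=64,              # max sequences per batch (memory vs. concurrency)
    max_num_batched_tokens=8192,  # per-step token budget (latency vs. throughput)
    gpu_memory_utilization=0.90,  # VRAM fraction reserved for weights + KV cache
)
```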
## KV Cache Management

### The KV Cache Problem

```
Attention: softmax(Q × K^T / √d) × V

For each new token generated:
• Attention is computed against the K and V of ALL previous tokens
  (the KV cache stores them so they are not recomputed)
• K and V tensors grow linearly with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B-class model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB of GPU memory
```
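The memory formula is worth computing directly. A sketch below, assuming FP16 (2 bytes per value) and full multi-head attention; real 70B-class models usually use GQA, which shrinks this considerably, so treat the output as an upper bound consistent with the ~8GB figure above:

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, hidden_dim, dtype_bytes=2):
    """Factor of 2 covers K and V; one hidden_dim vector per layer per token."""
    return 2 * batch_size * seq_len * num_layers * hidden_dim * dtype_bytes

# Illustrative 70B-class dimensions: 80 layers, hidden size 8192, 4K context.
gb = kv_cache_bytes(1, 4096, 80, 8192) / 1e9
print(f"~{gb:.1f} GB per request")  # ~10.7 GB without GQA; GQA divides this down
```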
PagedAttention (vLLM Innovation)
Traditional KV Cache: ┌──────────────────────────────────────────┐ │ Request 1 KV Cache (contiguous, fixed) │ ← Wastes memory ├──────────────────────────────────────────┤ │ Request 2 KV Cache (contiguous, fixed) │ ├──────────────────────────────────────────┤ │ FRAGMENTED/WASTED SPACE │ └──────────────────────────────────────────┘ PagedAttention: ┌────┬────┬────┬────┬────┬────┬────┬────┐ │ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │ ← Pages allocated on demand └────┴────┴────┴────┴────┴────┴────┴────┘ • Non-contiguous memory allocation • Near-zero memory waste • 2-4x higher throughput
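A toy allocator makes the mechanism concrete: each sequence keeps a block table mapping logical token blocks to whatever physical pages happen to be free. This is a conceptual sketch, not vLLM's implementation:

```python
BLOCK_SIZE = 16  # tokens per physical page

class PagedAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of physical page ids

    def append_token(self, seq_id, seq_len):
        """Grab a new page only when the sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if seq_len % BLOCK_SIZE == 0:  # current page is full (or first token)
            table.append(self.free.pop())
        return table

    def release(self, seq_id):
        """Finished sequences return their pages to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))

alloc = PagedAllocator(num_blocks=8)
for t in range(40):
    alloc.append_token("req-1", t)
print(alloc.tables["req-1"])  # 3 pages for 40 tokens, possibly non-contiguous
```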
### KV Cache Optimization Strategies
| Strategy | Description | Memory Savings |
|---|---|---|
| Paged Attention | Virtual memory for KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Grouped query attention | Architecture-dependent |
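Several of these strategies are one-line flags in vLLM (argument names below are vLLM's; other servers expose equivalents):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    enable_prefix_caching=True,  # reuse KV blocks for shared prompt prefixes
    kv_cache_dtype="fp8",        # quantized KV cache (hardware-dependent)
)
```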
## Streaming Response Patterns

### Server-Sent Events (SSE)

```
Client                                      Server
  │                                           │
  │──── POST /v1/chat/completions ───────────▶│
  │          (stream: true)                   │
  │                                           │
  │◀──── HTTP 200 OK ─────────────────────────│
  │      Content-Type: text/event-stream      │
  │                                           │
  │◀──── data: {"token": "Hello"} ────────────│
  │◀──── data: {"token": " world"} ───────────│
  │◀──── data: {"token": "!"} ────────────────│
  │◀──── data: [DONE] ────────────────────────│
  │                                           │
```
SSE Benefits:
- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
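A minimal SSE endpoint sketch with FastAPI; `generate_tokens` is a hypothetical stand-in for your inference engine's streaming API:

```python
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    for token in ["Hello", " world", "!"]:  # replace with real engine streaming
        yield token

@app.get("/v1/stream")
async def stream(prompt: str):
    async def event_stream():
        # SSE framing: each event is "data: <payload>\n\n"
        async for token in generate_tokens(prompt):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```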
### WebSocket Streaming

```
Client                                      Server
  │                                           │
  │──── WebSocket Upgrade ───────────────────▶│
  │◀──── 101 Switching Protocols ─────────────│
  │                                           │
  │──── {"prompt": "Hello"} ─────────────────▶│
  │                                           │
  │◀──── {"token": "Hi"} ─────────────────────│
  │◀──── {"token": " there"} ─────────────────│
  │◀──── {"token": "!"} ──────────────────────│
  │◀──── {"done": true} ──────────────────────│
  │                                           │
```
WebSocket Benefits:
- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
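The WebSocket counterpart, again with FastAPI and a stubbed token generator:

```python
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/chat")
async def chat(ws: WebSocket):
    await ws.accept()
    request = await ws.receive_json()     # e.g. {"prompt": "Hello"}
    for token in ["Hi", " there", "!"]:   # replace with real engine streaming
        await ws.send_json({"token": token})
    await ws.send_json({"done": True})
```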
### Streaming Implementation Considerations
| Aspect | SSE | WebSocket |
|---|---|---|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |
## Speculative Decoding

### Concept

```
Standard decoding:
  Large model: [T1] → [T2] → [T3] → [T4] → [T5]
               10ms   10ms   10ms   10ms   10ms   = 50ms total

Speculative decoding:
  Draft model: [T1, T2, T3, T4, T5]   (drafted cheaply, 5ms)
                        │
                        ▼
  Large model: [verify T1–T5 in one pass]   (15ms)
               Accept: T1, T2, T3 ✓
               Reject: T4, T5 ✗
                        │
                        ▼
               [generate T4, T5 correctly]

  Total: ~25ms (≈2x speedup at 60% acceptance)
```
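The accept/reject loop, reduced to a greedy toy. Real systems sample with a rejection rule and score all draft positions in one batched forward pass; here `target` and `draft` are hypothetical next-token callables, and the target is queried per position for clarity:

```python
def speculative_step(target, draft, context, k=5):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    ctx, proposal = list(context), []
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. Target model verifies the proposal: keep the longest agreeing prefix,
    #    then substitute the target's own token at the first mismatch.
    ctx, accepted = list(context), []
    for tok in proposal:
        expected = target(ctx)
        accepted.append(expected)
        ctx.append(expected)
        if expected != tok:  # draft diverged: stop speculating
            break
    return accepted          # every token matches greedy target decoding
```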
### Speculative Decoding Trade-offs
| Factor | Impact |
|---|---|
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be < sequential generation |
## Scaling Strategies

### Horizontal Scaling

```
┌─────────────────────────────────────────┐
│              Load Balancer              │
│    (round-robin, least-connections)     │
└─────────────────────────────────────────┘
        │            │            │
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │  vLLM   │  │  vLLM   │  │  vLLM   │
   │ Node 1  │  │ Node 2  │  │ Node 3  │
   │ (GPU×4) │  │ (GPU×4) │  │ (GPU×4) │
   └─────────┘  └─────────┘  └─────────┘
```
### Model Parallelism
| Strategy | Description | Use Case |
|---|---|---|
| Tensor Parallelism | Split layers across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model, different batches | High throughput |
```
Tensor parallelism (TP=4): each layer split across all GPUs
┌─────────────────────────────────────────┐
│                 Layer N                 │
│  GPU0    │  GPU1    │  GPU2    │  GPU3  │
│  25%     │  25%     │  25%     │  25%   │
└─────────────────────────────────────────┘

Pipeline parallelism (PP=4): consecutive layers on different GPUs
GPU0: layers 0–7
GPU1: layers 8–15
GPU2: layers 16–23
GPU3: layers 24–31
```
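In vLLM, tensor parallelism is a single argument (the model name is an example; the TP size must divide the model's attention-head count):

```python
from vllm import LLM

# Shard one 70B-class model across 4 GPUs with tensor parallelism.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)
```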
## Latency Optimization Checklist

### Pre-deployment
- Choose appropriate quantization (INT8 for production)
- Enable continuous batching
- Configure KV cache size appropriately
- Set optimal batch size for hardware
- Enable prefix caching for system prompts
### Runtime
- Monitor GPU memory utilization
- Track p50/p95/p99 latencies
- Measure time-to-first-token (TTFT)
- Monitor tokens-per-second (TPS)
- Set appropriate timeouts
### Infrastructure
- Use fastest available interconnect (NVLink, InfiniBand)
- Minimize network hops
- Place inference close to users (edge)
- Consider dedicated inference hardware
## Cost Optimization

### Cost Drivers
| Factor | Impact | Optimization |
|---|---|---|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |
### Cost Estimation Formula

```
Monthly cost = (requests/month × avg tokens/request × GPU-seconds/token × $/GPU-hour) / 3600

Example:
• 10M requests/month
• 500 tokens per request on average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour

Cost = (10,000,000 × 500 × 0.001 × 2) / 3600 ≈ $2,778/month
```
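The same formula as a small helper, reproducing the worked example:

```python
def monthly_cost(requests, avg_tokens, gpu_seconds_per_token, usd_per_gpu_hour):
    """Total GPU-seconds consumed per month, converted to GPU-hours and priced."""
    return requests * avg_tokens * gpu_seconds_per_token * usd_per_gpu_hour / 3600

print(f"${monthly_cost(10_000_000, 500, 0.001, 2.0):,.0f}/month")  # ~$2,778/month
```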
## Common Patterns

### Multi-model Routing

```
┌─────────────────────────────────────────┐
│                 Router                  │
│  • Classify request complexity          │
│  • Route to the appropriate model       │
└─────────────────────────────────────────┘
        │            │            │
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │  Small  │  │ Medium  │  │  Large  │
   │  Model  │  │  Model  │  │  Model  │
   │  (7B)   │  │  (13B)  │  │  (70B)  │
   │  Fast   │  │ Balanced│  │ Quality │
   └─────────┘  └─────────┘  └─────────┘
```
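Routing can start as simple heuristics before graduating to a learned classifier. A toy sketch (thresholds and model tiers are illustrative):

```python
def route(prompt: str) -> str:
    """Pick the cheapest model tier likely to handle the request."""
    words = len(prompt.split())
    needs_reasoning = any(kw in prompt.lower() for kw in ("prove", "analyze", "code"))
    if words < 30 and not needs_reasoning:
        return "small-7b"     # fast path for short, simple queries
    if words < 200 and not needs_reasoning:
        return "medium-13b"
    return "large-70b"        # long or complex requests get the big model

print(route("What is the capital of France?"))  # small-7b
```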
### Caching Strategies
| Cache Type | What to Cache | TTL |
|---|---|---|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
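An exact-match response cache is a few lines; this sketch keys on a hash of the prompt plus sampling parameters, since a different temperature should not hit the same entry:

```python
import hashlib
import time

class ResponseCache:
    """Exact-match cache with a TTL; useful only for deterministic or
    repeat-heavy workloads (FAQ bots, identical tool calls)."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (response, inserted_at)

    def _key(self, prompt, params):
        raw = f"{prompt}|{sorted(params.items())}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, prompt, params):
        entry = self.store.get(self._key(prompt, params))
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None

    def put(self, prompt, params, response):
        self.store[self._key(prompt, params)] = (response, time.time())
```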
## Related Skills

- `ml-system-design` · End-to-end ML pipeline design
- `rag-architecture` · Retrieval-augmented generation patterns
- `vector-databases` · Vector search for LLM context
- `ml-inference-optimization` · General inference optimization
- `estimation-techniques` · Capacity planning for LLM systems
## Version History
- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews
## Last Updated
Date: 2025-12-26