# claude-skill-registry · llm-serving-patterns
LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-serving-patterns" ~/.claude/skills/majiayu000-claude-skill-registry-llm-serving-patterns && rm -rf "$T"
```

**Manifest**: `skills/data/llm-serving-patterns/SKILL.md`
# LLM Serving Patterns

## When to Use This Skill

Use this skill when:
- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively
Keywords: LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
## LLM Serving Architecture Overview

```
LLM Serving Stack

Clients (API, Chat UI, Agents)
        │
        ▼
Load Balancer / API Gateway
  • Rate limiting   • Authentication   • Request routing
        │
        ▼
Inference Server
  Request Queue ──▶ Batching Engine ──▶ KV Cache Management
                          │                     │
                          ▼                     ▼
  Model Execution Engine
  • Tensor operations   • Attention   • Token sampling
        │
        ▼
GPU/TPU Cluster
  • Model sharding   • Tensor parallelism   • Pipeline parallelism
```
## Serving Framework Comparison
| Framework | Strengths | Best For | Considerations |
|---|---|---|---|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
### Framework Selection Decision Tree

```
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
    └── Need high throughput with many concurrent users?
        ├── Yes → vLLM (PagedAttention)
        └── No
            └── Need enterprise features + HF integration?
                ├── Yes → TGI
                └── No
                    └── Simple local/edge deployment?
                        ├── Yes → Ollama or llama.cpp
                        └── No  → vLLM (general purpose)
```
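For the common case, a minimal vLLM offline-inference sketch looks like the following (the model name is just an example; swap in any Hugging Face checkpoint you have access to):

```python
# Minimal vLLM sketch: load a model and generate; continuous batching
# is handled internally by the engine.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```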
## Quantization Techniques

### Precision Levels
| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
### Quantization Methods
| Method | Description | Quality | Speed |
|---|---|---|---|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates quantization difficulty from activations to weights | Excellent | Moderate |
### Quantization Selection

```
Quality vs. Efficiency Trade-off:

Quality ────────────────────────────────────────────▶ Efficiency

  FP32      FP16     INT8+AWQ   INT8+GPTQ    INT4      INT2
   ○──────────○──────────○───────────○──────────○─────────○
  Best      Great       Good        Good       Fair      Poor
```
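In practice you usually load an already-quantized checkpoint rather than quantizing yourself. A sketch of both common routes, assuming the quantized repos exist on the Hugging Face Hub (model names are illustrative):

```python
# Route 1: serve a pre-quantized AWQ (4-bit) checkpoint with vLLM.
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

# Route 2: load a model in INT8 on the fly with transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```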
## Batching Strategies

### Static Batching

```
Request 1: [tokens: 100] ─┐
Request 2: [tokens:  50] ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens:  80] ─┘

Problem: short requests wait for long ones (head-of-line blocking)
```
### Continuous Batching (Preferred)

```
Time ──────────────────────────────────────────────────────────▶
Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]

• New requests join the batch as others complete
• No padding waste
• Optimal GPU utilization
```
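The scheduling idea is easy to see in a toy simulation (no real model here; each request just needs a fixed number of decode steps):

```python
# Toy continuous-batching scheduler: finished sequences free their slot
# immediately and queued requests join the batch mid-flight.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    remaining: int  # tokens left to generate

def continuous_batching(requests, max_batch=4):
    queue, active, steps = deque(requests), [], 0
    while queue or active:
        while queue and len(active) < max_batch:  # admit new requests
            active.append(queue.popleft())
        for r in active:                          # one decode step per request
            r.remaining -= 1
        active = [r for r in active if r.remaining > 0]  # slots free up
        steps += 1
    return steps

print(continuous_batching([Request(n) for n in (32, 12, 20, 8, 16)]))
```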
### Batching Parameters

| Parameter | Description | Trade-off |
|---|---|---|
| Max batch size | Maximum concurrent requests | Memory vs. throughput |
| Max waiting tokens | Tokens to wait before forcing a batch | Latency vs. throughput |
| Max sequences | Maximum sequences in a batch | Memory vs. concurrency |
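These knobs map onto engine arguments in most servers; in vLLM, for example (values below are illustrative starting points, not recommendations):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    max_num_seqs=64,              # max sequences per batch (memory vs. concurrency)
    max_num_batched_tokens=8192,  # per-step token budget (latency vs. throughput)
    gpu_memory_utilization=0.90,  # VRAM fraction reserved for weights + KV cache
)
```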
## KV Cache Management

### The KV Cache Problem

```
Attention: softmax(Q × K^T / √d) × V

For each new token generated:
• Attention is computed against the K and V of ALL previous tokens
  (the KV cache stores them so they are not recomputed)
• K and V tensors grow linearly with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)

Example (70B-class model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB of GPU memory
```
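The memory formula is worth computing directly. A sketch below, assuming FP16 (2 bytes per value) and full multi-head attention; real 70B-class models usually use GQA, which shrinks this considerably, so treat the output as an upper bound consistent with the ~8GB figure above:

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, hidden_dim, dtype_bytes=2):
    """Factor of 2 covers K and V; one hidden_dim vector per layer per token."""
    return 2 * batch_size * seq_len * num_layers * hidden_dim * dtype_bytes

# Illustrative 70B-class dimensions: 80 layers, hidden size 8192, 4K context.
gb = kv_cache_bytes(1, 4096, 80, 8192) / 1e9
print(f"~{gb:.1f} GB per request")  # ~10.7 GB without GQA; GQA divides this down
```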
PagedAttention (vLLM Innovation)
Traditional KV Cache: ┌──────────────────────────────────────────┐ │ Request 1 KV Cache (contiguous, fixed) │ ← Wastes memory ├──────────────────────────────────────────┤ │ Request 2 KV Cache (contiguous, fixed) │ ├──────────────────────────────────────────┤ │ FRAGMENTED/WASTED SPACE │ └──────────────────────────────────────────┘ PagedAttention: ┌────┬────┬────┬────┬────┬────┬────┬────┐ │ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │ ← Pages allocated on demand └────┴────┴────┴────┴────┴────┴────┴────┘ • Non-contiguous memory allocation • Near-zero memory waste • 2-4x higher throughput
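A toy allocator makes the mechanism concrete: each sequence keeps a block table mapping logical token blocks to whatever physical pages happen to be free. This is a conceptual sketch, not vLLM's implementation:

```python
BLOCK_SIZE = 16  # tokens per physical page

class PagedAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of physical page ids

    def append_token(self, seq_id, seq_len):
        """Grab a new page only when the sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if seq_len % BLOCK_SIZE == 0:  # current page is full (or first token)
            table.append(self.free.pop())
        return table

    def release(self, seq_id):
        """Finished sequences return their pages to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))

alloc = PagedAllocator(num_blocks=8)
for t in range(40):
    alloc.append_token("req-1", t)
print(alloc.tables["req-1"])  # 3 pages for 40 tokens, possibly non-contiguous
```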
### KV Cache Optimization Strategies
| Strategy | Description | Memory Savings |
|---|---|---|
| Paged Attention | Virtual memory for KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | System prompt: 100% |
| Quantized KV Cache | INT8/FP8 for KV values | 50-75% reduction |
| Sliding Window | Limited attention context | Linear memory |
| MQA/GQA | Grouped query attention | Architecture-dependent |
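Several of these strategies are one-line flags in vLLM (argument names below are vLLM's; other servers expose equivalents):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    enable_prefix_caching=True,  # reuse KV blocks for shared prompt prefixes
    kv_cache_dtype="fp8",        # quantized KV cache (hardware-dependent)
)
```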
## Streaming Response Patterns

### Server-Sent Events (SSE)

```
Client                                      Server
  │                                           │
  │──── POST /v1/chat/completions ───────────▶│
  │          (stream: true)                   │
  │                                           │
  │◀──── HTTP 200 OK ─────────────────────────│
  │      Content-Type: text/event-stream      │
  │                                           │
  │◀──── data: {"token": "Hello"} ────────────│
  │◀──── data: {"token": " world"} ───────────│
  │◀──── data: {"token": "!"} ────────────────│
  │◀──── data: [DONE] ────────────────────────│
  │                                           │
```
SSE Benefits:
- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
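A minimal SSE endpoint sketch with FastAPI; `generate_tokens` is a hypothetical stand-in for your inference engine's streaming API:

```python
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    for token in ["Hello", " world", "!"]:  # replace with real engine streaming
        yield token

@app.get("/v1/stream")
async def stream(prompt: str):
    async def event_stream():
        # SSE framing: each event is "data: <payload>\n\n"
        async for token in generate_tokens(prompt):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```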
### WebSocket Streaming

```
Client                                      Server
  │                                           │
  │──── WebSocket Upgrade ───────────────────▶│
  │◀──── 101 Switching Protocols ─────────────│
  │                                           │
  │──── {"prompt": "Hello"} ─────────────────▶│
  │                                           │
  │◀──── {"token": "Hi"} ─────────────────────│
  │◀──── {"token": " there"} ─────────────────│
  │◀──── {"token": "!"} ──────────────────────│
  │◀──── {"done": true} ──────────────────────│
  │                                           │
```
WebSocket Benefits:
- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
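The WebSocket counterpart, again with FastAPI and a stubbed token generator:

```python
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/chat")
async def chat(ws: WebSocket):
    await ws.accept()
    request = await ws.receive_json()     # e.g. {"prompt": "Hello"}
    for token in ["Hi", " there", "!"]:   # replace with real engine streaming
        await ws.send_json({"token": token})
    await ws.send_json({"done": True})
```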
### Streaming Implementation Considerations
| Aspect | SSE | WebSocket |
|---|---|---|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |
## Speculative Decoding

### Concept

```
Standard decoding:
  Large model: [T1] → [T2] → [T3] → [T4] → [T5]
               10ms   10ms   10ms   10ms   10ms   = 50ms total

Speculative decoding:
  Draft model: [T1, T2, T3, T4, T5]   (drafted cheaply, 5ms)
                        │
                        ▼
  Large model: [verify T1–T5 in one pass]   (15ms)
               Accept: T1, T2, T3 ✓
               Reject: T4, T5 ✗
                        │
                        ▼
               [generate T4, T5 correctly]

  Total: ~25ms (≈2x speedup at 60% acceptance)
```
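The accept/reject loop, reduced to a greedy toy. Real systems sample with a rejection rule and score all draft positions in one batched forward pass; here `target` and `draft` are hypothetical next-token callables, and the target is queried per position for clarity:

```python
def speculative_step(target, draft, context, k=5):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    ctx, proposal = list(context), []
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. Target model verifies the proposal: keep the longest agreeing prefix,
    #    then substitute the target's own token at the first mismatch.
    ctx, accepted = list(context), []
    for tok in proposal:
        expected = target(ctx)
        accepted.append(expected)
        ctx.append(expected)
        if expected != tok:  # draft diverged: stop speculating
            break
    return accepted          # every token matches greedy target decoding
```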
### Speculative Decoding Trade-offs
| Factor | Impact |
|---|---|
| Draft model quality | Higher match rate = more speedup |
| Draft model size | Larger = better quality, slower |
| Speculation depth | More tokens = higher risk/reward |
| Verification cost | Must be < sequential generation |
## Scaling Strategies

### Horizontal Scaling

```
┌─────────────────────────────────────────┐
│              Load Balancer              │
│    (round-robin, least-connections)     │
└─────────────────────────────────────────┘
        │            │            │
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │  vLLM   │  │  vLLM   │  │  vLLM   │
   │ Node 1  │  │ Node 2  │  │ Node 3  │
   │ (GPU×4) │  │ (GPU×4) │  │ (GPU×4) │
   └─────────┘  └─────────┘  └─────────┘
```
### Model Parallelism
| Strategy | Description | Use Case |
|---|---|---|
| Tensor Parallelism | Split layers across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model, different batches | High throughput |
```
Tensor parallelism (TP=4): each layer split across all GPUs
┌─────────────────────────────────────────┐
│                 Layer N                 │
│  GPU0    │  GPU1    │  GPU2    │  GPU3  │
│  25%     │  25%     │  25%     │  25%   │
└─────────────────────────────────────────┘

Pipeline parallelism (PP=4): consecutive layers on different GPUs
GPU0: layers 0–7
GPU1: layers 8–15
GPU2: layers 16–23
GPU3: layers 24–31
```
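In vLLM, tensor parallelism is a single argument (the model name is an example; the TP size must divide the model's attention-head count):

```python
from vllm import LLM

# Shard one 70B-class model across 4 GPUs with tensor parallelism.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)
```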
## Latency Optimization Checklist

### Pre-deployment
- Choose appropriate quantization (INT8 for production)
- Enable continuous batching
- Configure KV cache size appropriately
- Set optimal batch size for hardware
- Enable prefix caching for system prompts
### Runtime
- Monitor GPU memory utilization
- Track p50/p95/p99 latencies
- Measure time-to-first-token (TTFT)
- Monitor tokens-per-second (TPS)
- Set appropriate timeouts
### Infrastructure
- Use fastest available interconnect (NVLink, InfiniBand)
- Minimize network hops
- Place inference close to users (edge)
- Consider dedicated inference hardware
## Cost Optimization

### Cost Drivers
| Factor | Impact | Optimization |
|---|---|---|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |
### Cost Estimation Formula

```
Monthly cost = (requests/month × avg tokens/request × GPU-seconds/token × $/GPU-hour) / 3600

Example:
• 10M requests/month
• 500 tokens per request on average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour

Cost = (10,000,000 × 500 × 0.001 × 2) / 3600 ≈ $2,778/month
```
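The same formula as a small helper, reproducing the worked example:

```python
def monthly_cost(requests, avg_tokens, gpu_seconds_per_token, usd_per_gpu_hour):
    """Total GPU-seconds consumed per month, converted to GPU-hours and priced."""
    return requests * avg_tokens * gpu_seconds_per_token * usd_per_gpu_hour / 3600

print(f"${monthly_cost(10_000_000, 500, 0.001, 2.0):,.0f}/month")  # ~$2,778/month
```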
## Common Patterns

### Multi-model Routing

```
┌─────────────────────────────────────────┐
│                 Router                  │
│  • Classify request complexity          │
│  • Route to the appropriate model       │
└─────────────────────────────────────────┘
        │            │            │
        ▼            ▼            ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │  Small  │  │ Medium  │  │  Large  │
   │  Model  │  │  Model  │  │  Model  │
   │  (7B)   │  │  (13B)  │  │  (70B)  │
   │  Fast   │  │ Balanced│  │ Quality │
   └─────────┘  └─────────┘  └─────────┘
```
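Routing can start as simple heuristics before graduating to a learned classifier. A toy sketch (thresholds and model tiers are illustrative):

```python
def route(prompt: str) -> str:
    """Pick the cheapest model tier likely to handle the request."""
    words = len(prompt.split())
    needs_reasoning = any(kw in prompt.lower() for kw in ("prove", "analyze", "code"))
    if words < 30 and not needs_reasoning:
        return "small-7b"     # fast path for short, simple queries
    if words < 200 and not needs_reasoning:
        return "medium-13b"
    return "large-70b"        # long or complex requests get the big model

print(route("What is the capital of France?"))  # small-7b
```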
### Caching Strategies
| Cache Type | What to Cache | TTL |
|---|---|---|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
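An exact-match response cache is a few lines; this sketch keys on a hash of the prompt plus sampling parameters, since a different temperature should not hit the same entry:

```python
import hashlib
import time

class ResponseCache:
    """Exact-match cache with a TTL; useful only for deterministic or
    repeat-heavy workloads (FAQ bots, identical tool calls)."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (response, inserted_at)

    def _key(self, prompt, params):
        raw = f"{prompt}|{sorted(params.items())}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, prompt, params):
        entry = self.store.get(self._key(prompt, params))
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None

    def put(self, prompt, params, response):
        self.store[self._key(prompt, params)] = (response, time.time())
```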
## Related Skills

- `ml-system-design` · End-to-end ML pipeline design
- `rag-architecture` · Retrieval-augmented generation patterns
- `vector-databases` · Vector search for LLM context
- `ml-inference-optimization` · General inference optimization
- `estimation-techniques` · Capacity planning for LLM systems
## Version History
- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews
## Last Updated
Date: 2025-12-26