Skilllibrary inference-serving

Install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/inference-serving" ~/.claude/skills/merceralex397-collab-skilllibrary-inference-serving && rm -rf "$T"
manifest: 11-ai-llm-runtime-and-integration/inference-serving/SKILL.md
Source content

Purpose

Deploy and optimize LLM inference servers using vLLM, TGI, or Ollama with proper batching, quantization, and resource management.

When to use this skill

  • deploying a model with vLLM or TGI for production inference
  • configuring continuous batching and max concurrent requests
  • fitting models into GPU memory with quantization (AWQ, GPTQ, GGUF)
  • benchmarking throughput (tokens/sec) and latency (time to first token, time per output token)

Do not use this skill when

  • choosing which model to use — prefer model-selection
  • running on CPU only — prefer offline-cpu-inference
  • building agent memory or RAG — prefer agent-memory or embeddings-indexing

Procedure

  1. Choose serving framework — vLLM for high-throughput GPU serving, TGI for HuggingFace models, Ollama for local dev.
  2. Select quantization — AWQ/GPTQ for GPU (4-bit, minimal quality loss), GGUF for llama.cpp/CPU. Match the quant to your VRAM budget.
  3. Launch the server: vllm serve <model> --tensor-parallel-size <gpus> --max-model-len <ctx> --quantization awq
  4. Configure batching — vLLM uses continuous batching by default. Set --max-num-seqs to cap how many sequences run concurrently.
  5. Set memory limits: pass --gpu-memory-utilization 0.9 to vLLM. Weights and KV cache are allocated inside that 90% budget; the remaining 10% is headroom for the runtime and other processes on the GPU.
  6. Add an OpenAI-compatible API — vLLM and TGI both serve /v1/chat/completions. Point clients at http://localhost:8000 (vLLM's default port); see Querying the server after this list.
  7. Benchmark: use benchmark_serving.py from the vLLM repo's benchmarks/ directory, or load-test with locust / hey (see the sketch after the vLLM deployment example).
  8. Monitor — track GPU utilization (nvidia-smi), request queue depth, TTFT (time to first token), and throughput; see Monitoring below.
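
Querying the server

A minimal smoke test for step 6, assuming an OpenAI-compatible server is already listening on localhost:8000 (for example the vLLM launch in the next section) and serving meta-llama/Llama-3.1-8B-Instruct:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'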

vLLM deployment

# Basic serving
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

# With quantization (fits larger models in less VRAM);
# substitute an AWQ-quantized checkpoint for the placeholder
vllm serve <awq-quantized-70b-model> \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 4096
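
To load-test the server above (procedure step 7), one option is hey (https://github.com/rakyll/hey); the request body below is illustrative. vLLM's benchmark_serving.py script, in the benchmarks/ directory of the vLLM repo, reports TTFT and token throughput in more detail.

# Rough load test: 200 requests, 16 concurrent
hey -n 200 -c 16 -m POST -T application/json \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 32}' \
  http://localhost:8000/v1/chat/completions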

TGI deployment

docker run --gpus all -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-batch-total-tokens 32768 \
  --max-input-tokens 4096
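
Monitoring

For procedure step 8, basic GPU-level monitoring needs nothing beyond nvidia-smi; vLLM additionally exposes Prometheus metrics (queue depth, TTFT histograms) at /metrics on its serving port. A sketch:

# GPU utilization and memory, refreshed every second
watch -n 1 nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv

# Scrape vLLM's Prometheus metrics
curl http://localhost:8000/metrics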

GPU memory estimation

Model params * bytes per param = VRAM needed for weights
  7B  * 2 bytes   (fp16)  = ~14 GB   + KV cache
  7B  * 0.5 bytes (4-bit) = ~3.5 GB  + KV cache
  70B * 2 bytes   (fp16)  = ~140 GB  + KV cache (2x A100 80GB)
  70B * 0.5 bytes (4-bit) = ~35 GB   + KV cache
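
The rule of thumb above is easy to script for quick sizing checks. A minimal sketch, assuming only a POSIX shell with awk; vram_gb is a made-up helper name, and the result covers weights only (KV cache comes on top):

# params in billions * bytes per param ~= GB of weight memory
vram_gb() { awk -v p="$1" -v b="$2" 'BEGIN { printf "~%.1f GB for weights, plus KV cache\n", p * b }'; }

vram_gb 8 2      # Llama 3.1 8B in fp16    -> ~16.0 GB
vram_gb 70 0.5   # 70B quantized to 4-bit  -> ~35.0 GB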

Decision rules

  • Use vLLM for production GPU serving — best throughput via PagedAttention.
  • Use Ollama for local development and testing — simplest setup (see Ollama quick start after this list).
  • AWQ quantization for GPU, GGUF for CPU — do not mix formats.
  • Set max-model-len to the actual max needed, not the model max — saves KV cache memory.
  • Always benchmark with realistic request patterns before deploying.
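
Ollama quick start

For the local-development path in the decision rules, a minimal Ollama sketch; it assumes Ollama is installed from https://ollama.com and uses its documented OpenAI-compatible endpoint on port 11434:

# Pull and run a small model locally
ollama pull llama3.1:8b
ollama run llama3.1:8b "Summarize continuous batching in one sentence."

# Ollama also serves /v1/chat/completions
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "hello"}]}'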

References

Related skills

  • model-selection — choosing the right model
  • llama-cpp — CPU/hybrid inference with llama.cpp
  • offline-cpu-inference — CPU-only optimization