Skilllibrary inference-serving
Install
Source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/inference-serving" ~/.claude/skills/merceralex397-collab-skilllibrary-inference-serving && rm -rf "$T"
manifest: 11-ai-llm-runtime-and-integration/inference-serving/SKILL.md
Purpose
Deploy and optimize LLM inference servers using vLLM, TGI, or Ollama with proper batching, quantization, and resource management.
When to use this skill
- deploying a model with vLLM or TGI for production inference
- configuring continuous batching and max concurrent requests
- fitting models into GPU memory with quantization (AWQ, GPTQ, GGUF)
- benchmarking throughput (tokens/sec) and latency (TTFT, TPS)
Do not use this skill when
- choosing which model to use — prefer model-selection
- running on CPU only — prefer offline-cpu-inference
- building agent memory or RAG — prefer agent-memory or embeddings-indexing
Procedure
- Choose serving framework — vLLM for high-throughput GPU serving, TGI for HuggingFace models, Ollama for local dev.
- Select quantization — AWQ/GPTQ for GPU (4-bit, minimal quality loss), GGUF for llama.cpp/CPU. Match quant to VRAM budget.
- Launch server — `vllm serve <model> --tensor-parallel-size <gpus> --max-model-len <ctx> --quantization awq`.
- Configure batching — vLLM uses continuous batching by default. Set `--max-num-seqs` to control concurrency.
- Set memory limits — `--gpu-memory-utilization 0.9` for vLLM. Reserve 10% for KV cache overhead.
- Add OpenAI-compatible API — vLLM and TGI both serve `/v1/chat/completions`. Point clients at `http://localhost:8000`.
- Benchmark — `python -m vllm.entrypoints.openai.api_server --benchmark` or use `locust`/`hey` for load testing (see the load-test sketch after this list).
- Monitor — track GPU utilization (`nvidia-smi`), request queue depth, TTFT (time to first token), and throughput.
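A minimal load-test sketch for the benchmark step, assuming the server from the launch step is listening on http://localhost:8000 and serving meta-llama/Llama-3.1-8B-Instruct; the request body, request count, and concurrency are illustrative values, not tuned recommendations.

```bash
# Send 200 chat completions at 16 concurrent connections and report
# latency percentiles from the OpenAI-compatible endpoint.
hey -n 200 -c 16 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize continuous batching in two sentences."}], "max_tokens": 128}' \
  http://localhost:8000/v1/chat/completions
```

Note that `hey` reports whole-request latency only; measuring TTFT requires a streaming client.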
vLLM deployment
```bash
# Basic serving
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

# With quantization (fits larger models in less VRAM)
vllm serve TheBloke/Llama-3.1-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 4096
```
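A quick smoke test once the server is up (a sketch; assumes the default port 8000 and the first model above, with an arbitrary prompt):

```bash
# One chat completion against the vLLM OpenAI-compatible API
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
        "max_tokens": 16
      }'
```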
TGI deployment
```bash
docker run --gpus all -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-batch-total-tokens 32768 \
  --max-input-tokens 4096
```
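To verify the container responds, a sketch using TGI's native generate route (the port comes from the `-p 8080:80` mapping above; prompt and token budget are arbitrary):

```bash
# Native TGI text-generation endpoint
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is continuous batching?", "parameters": {"max_new_tokens": 64}}'
```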
GPU memory estimation
```
Model params * bytes per param = VRAM needed

 7B * 2   bytes (fp16)  = ~14 GB
 7B * 0.5 bytes (4-bit) = ~3.5 GB + KV cache
70B * 2   bytes (fp16)  = ~140 GB (2x A100 80GB)
70B * 0.5 bytes (4-bit) = ~35 GB  + KV cache
```
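The same arithmetic as a small shell helper (a sketch; the 1.2 factor is an assumed ~20% allowance for KV cache and activation overhead, not a measured constant, and `bc` must be installed):

```bash
# estimate_vram <params_in_billions> <bytes_per_param>
estimate_vram() {
  local params_b=$1 bytes=$2
  # weights = params * bytes per param, plus ~20% headroom for KV cache/activations
  echo "scale=1; $params_b * $bytes * 1.2" | bc
}

estimate_vram 8 2     # fp16 8B model   -> ~19.2 GB
estimate_vram 70 0.5  # 4-bit 70B model -> ~42.0 GB
```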
Decision rules
- Use vLLM for production GPU serving — best throughput via PagedAttention.
- Use Ollama for local development and testing — simplest setup (see the sketch after this list).
- AWQ quantization for GPU, GGUF for CPU — do not mix formats.
- Set `max-model-len` to actual max needed, not model max — saves KV cache memory.
- Always benchmark with realistic request patterns before deploying.
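For the Ollama local-development rule above, a minimal sketch (assumes Ollama is installed and that the llama3.1 tag exists in its registry; model and prompt are illustrative):

```bash
# Pull a model and run it interactively for local testing
ollama pull llama3.1
ollama run llama3.1 "Explain PagedAttention in one paragraph."

# Ollama also serves an HTTP API on port 11434
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1", "prompt": "Explain PagedAttention in one paragraph.", "stream": false}'
```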
References
Related skills
- model-selection — choosing the right model
- llama-cpp — CPU/hybrid inference with llama.cpp
- offline-cpu-inference — CPU-only optimization