asi · vllm-deployment

Install

Source · Clone the upstream repo:

git clone https://github.com/plurigrid/asi

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/plurigrid/asi "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/vllm-deployment" ~/.claude/skills/plurigrid-asi-vllm-deployment && rm -rf "$T"

Manifest: skills/vllm-deployment/SKILL.md

Source content

vLLM Model Serving and Inference

Quick Start

Docker (CPU)

docker run --rm -p 8000:8000 \
  --shm-size=4g \
  --cap-add SYS_NICE \
  --security-opt seccomp=unconfined \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  <vllm-cpu-image> \
  --model <model-name> \
  --dtype float32
# Access: http://localhost:8000

Docker (GPU)

docker run --rm -p 8000:8000 \
  --gpus all \
  --shm-size=4g \
  <vllm-gpu-image> \
  --model <model-name>
# Access: http://localhost:8000
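As a concrete smoke test, a hedged sketch: vllm/vllm-openai is the public GPU image on Docker Hub, and facebook/opt-125m is a small model commonly used in vLLM examples; substitute your own image tag and model.

docker run --rm -p 8000:8000 \
  --gpus all \
  --shm-size=4g \
  vllm/vllm-openai:latest \
  --model facebook/opt-125m
# Access: http://localhost:8000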

Docker Deployment

1. Assess Hardware Requirements

Hardware   Minimum RAM          Recommended
CPU        2x model size        4x model size
GPU        model size + 2 GB    model size + 4 GB VRAM

  • Check model documentation for specific requirements (a rough sizing sketch follows this list)
  • Consider quantized variants to reduce memory footprint
  • Allocate 50-100GB storage for model downloads
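The sizing sketch below assumes a dense model where the weights dominate memory use: roughly 2 bytes per parameter for bfloat16/float16 and 4 bytes for float32. PARAMS_B and BYTES are illustrative values to adjust.

# Back-of-envelope memory estimate (sketch; adjust PARAMS_B and BYTES)
PARAMS_B=7   # parameters in billions, e.g. a 7B model
BYTES=2      # 2 = bf16/fp16, 4 = float32
echo "weights: $((PARAMS_B * BYTES)) GB, CPU minimum: $((PARAMS_B * BYTES * 2)) GB RAM"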

2. Pull the Container Image

# CPU image (check vLLM docs for latest tag)
docker pull <vllm-cpu-image>

# GPU image (check vLLM docs for latest tag)
docker pull <vllm-gpu-image>

Notes:

  • Use CPU-specific images for CPU inference
  • Use CUDA-enabled images matching your GPU architecture
  • Verify CPU instruction set compatibility (AVX512, AVX2); see the check after this list
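On Linux, instruction-set support can be checked before choosing an image (assumes /proc/cpuinfo is available):

# List supported AVX variants; empty output means neither is available
grep -oE 'avx512f|avx2' /proc/cpuinfo | sort -u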

3. Configure and Run

CPU Deployment:

docker run --rm \
  --shm-size=4g \
  --cap-add SYS_NICE \
  --security-opt seccomp=unconfined \
  -p 8000:8000 \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-7 \
  <vllm-cpu-image> \
  --model <model-name> \
  --dtype float32 \
  --max-model-len 2048

GPU Deployment:

docker run --rm \
  --gpus all \
  --shm-size=4g \
  -p 8000:8000 \
  <vllm-gpu-image> \
  --model <model-name> \
  --dtype auto \
  --max-model-len 4096
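Multi-GPU deployment, as a sketch: a model that does not fit on one device can be sharded with --tensor-parallel-size (see the Configuration Reference below); this example assumes two visible GPUs.

docker run --rm \
  --gpus all \
  --shm-size=4g \
  -p 8000:8000 \
  <vllm-gpu-image> \
  --model <model-name> \
  --dtype auto \
  --tensor-parallel-size 2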

4. Verify Deployment

# Check health
curl http://localhost:8000/health

# List models
curl http://localhost:8000/v1/models

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 10}'

5. Update

docker pull <vllm-image>
docker stop <container-id>
# Re-run with same parameters
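Naming the container makes the stop/re-run cycle repeatable; a sketch follows (the name vllm and the detached/restart flags are one choice, not a requirement):

# Repeatable update cycle with a named, detached container (sketch)
docker pull <vllm-image>
docker stop vllm && docker rm vllm
docker run -d --name vllm --restart unless-stopped \
  -p 8000:8000 <vllm-image> --model <model-name>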

Cloud VM Deployment

1. Provision Infrastructure

# Create security group with rules:
# - TCP 22 (SSH)
# - TCP 8000 (API)

# Launch instance with:
# - Sufficient RAM/VRAM for model
# - Docker pre-installed (or install manually)
# - 50-100GB root volume
# - Public IP for external access
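As one example, assuming the AWS CLI (other providers have equivalent commands):

# Sketch: open SSH and the API port on an existing security group
aws ec2 authorize-security-group-ingress --group-id <sg-id> \
  --protocol tcp --port 22 --cidr <your-ip>/32
aws ec2 authorize-security-group-ingress --group-id <sg-id> \
  --protocol tcp --port 8000 --cidr <your-ip>/32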

2. Connect and Deploy

ssh -i <key-file> <user>@<instance-ip>

# Install Docker if not present
# Pull and run vLLM container (see Docker Deployment section)
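If Docker is not preinstalled, the upstream convenience script is one option (assumes a supported Linux distribution; review the script before running it):

# Install Docker via Docker's convenience script (sketch)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh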

3. Verify External Access

# From local machine
curl http://<instance-ip>:8000/health
curl http://<instance-ip>:8000/v1/models

4. Cleanup

# Stop container
docker stop <container-id>

# Terminate instance to stop costs
# Delete associated resources (volumes, security groups) if temporary

Configuration Reference

Environment Variables

Variable                   Purpose                        Example
VLLM_CPU_KVCACHE_SPACE     KV cache size in GB (CPU)      4
VLLM_CPU_OMP_THREADS_BIND  CPU core binding (CPU)         0-7
CUDA_VISIBLE_DEVICES       GPU device selection           0,1
HF_TOKEN                   Hugging Face authentication    hf_xxx
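For gated models, a sketch that passes HF_TOKEN and persists the download cache across container restarts (the mount target is the default Hugging Face cache path inside the container):

docker run --rm -p 8000:8000 --gpus all \
  -e HF_TOKEN=<hf-token> \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  <vllm-gpu-image> \
  --model <gated-model-name>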

Docker Flags

Flag                               Purpose
--shm-size=4g                      Shared memory for IPC
--cap-add SYS_NICE                 NUMA optimization (CPU)
--security-opt seccomp=unconfined  Memory policy syscalls (CPU)
--gpus all                         GPU access
-p 8000:8000                       Port mapping

vLLM Arguments

Argument                Purpose                 Example
--model                 Model name or path      <model-name>
--dtype                 Data type               float32, auto, bfloat16
--max-model-len         Maximum context length  2048
--tensor-parallel-size  Multi-GPU parallelism   2

API Endpoints

Endpoint              Method  Purpose
/health               GET     Health check
/v1/models            GET     List available models
/v1/completions       POST    Text completion
/v1/chat/completions  POST    Chat completion
/metrics              GET     Prometheus metrics
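A quick spot check of the metrics endpoint, as a sketch; this assumes a recent vLLM, where metric names carry the vllm: prefix:

# Show a few server metrics, e.g. running/waiting request counts
curl -s http://localhost:8000/metrics | grep '^vllm:' | head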

Production Checklist

  • Verify model fits in available memory
  • Configure appropriate data type for hardware
  • Set up firewall/security group rules
  • Test API endpoints before production use
  • Configure monitoring (Prometheus metrics)
  • Set up health check alerts (a probe sketch follows this list)
  • Document model and configuration used
  • Plan for model updates and rollbacks
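A minimal liveness probe for alerting, as a sketch; route the failure branch into whatever alerting mechanism you use:

# Cron-friendly health probe (sketch)
curl -sf --max-time 5 http://localhost:8000/health \
  || echo "vLLM health check failed at $(date -u)" >&2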

Troubleshooting

Issue                            Solution
Container exits immediately      Increase RAM or use a smaller model
Slow inference (CPU)             Verify OMP thread binding configuration
Connection refused externally    Check firewall/security group rules
Model download fails             Set HF_TOKEN for gated models
Out of memory during inference   Reduce --max-model-len or batch size
Port already in use              Change the host port mapping
Warmup takes too long            Normal for large models (1-5 minutes)
