# asi · vllm-deployment

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/plurigrid/asi
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/plurigrid/asi "$T" && \
  mkdir -p ~/.claude/skills && \
  cp -r "$T/skills/vllm-deployment" ~/.claude/skills/plurigrid-asi-vllm-deployment && \
  rm -rf "$T"
```

Manifest: `skills/vllm-deployment/SKILL.md`

# vLLM Model Serving and Inference

## Quick Start

### Docker (CPU)

```bash
docker run --rm -p 8000:8000 \
  --shm-size=4g \
  --cap-add SYS_NICE \
  --security-opt seccomp=unconfined \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  <vllm-cpu-image> \
  --model <model-name> \
  --dtype float32

# Access: http://localhost:8000
```

### Docker (GPU)

```bash
docker run --rm -p 8000:8000 \
  --gpus all \
  --shm-size=4g \
  <vllm-gpu-image> \
  --model <model-name>

# Access: http://localhost:8000
```

## Docker Deployment

### 1. Assess Hardware Requirements

| Hardware | Minimum Memory | Recommended Memory |
|---|---|---|
| CPU | 2x model size (RAM) | 4x model size (RAM) |
| GPU | Model size + 2GB (VRAM) | Model size + 4GB (VRAM) |
- Check model documentation for specific requirements
- Consider quantized variants to reduce memory footprint
- Allocate 50-100GB storage for model downloads
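
As a rough sanity check against the table above, the weights alone cost about parameters × bytes per parameter; a minimal sketch (the 7B figure is an arbitrary example, not a recommendation):

```bash
# Back-of-envelope memory estimate; adjust for your model
PARAMS_B=7          # parameters, in billions (placeholder)
BYTES_PER_PARAM=4   # float32; use 2 for float16/bfloat16
awk -v p="$PARAMS_B" -v b="$BYTES_PER_PARAM" 'BEGIN {
  w = p * b   # approximate weight footprint in GB
  printf "weights ~%d GB | CPU minimum (2x) ~%d GB | CPU recommended (4x) ~%d GB\n", w, 2*w, 4*w
}'
```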
### 2. Pull the Container Image

```bash
# CPU image (check vLLM docs for latest tag)
docker pull <vllm-cpu-image>

# GPU image (check vLLM docs for latest tag)
docker pull <vllm-gpu-image>
```

Notes:
- Use CPU-specific images for CPU inference
- Use CUDA-enabled images matching your GPU architecture
- Verify CPU instruction set compatibility (AVX512, AVX2)
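
On Linux, the last point can be checked straight from `/proc/cpuinfo`:

```bash
# List the AVX-family instruction sets this CPU supports
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep '^avx' | sort -u
```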
### 3. Configure and Run

CPU deployment:

```bash
docker run --rm \
  --shm-size=4g \
  --cap-add SYS_NICE \
  --security-opt seccomp=unconfined \
  -p 8000:8000 \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-7 \
  <vllm-cpu-image> \
  --model <model-name> \
  --dtype float32 \
  --max-model-len 2048
```

GPU deployment:

```bash
docker run --rm \
  --gpus all \
  --shm-size=4g \
  -p 8000:8000 \
  <vllm-gpu-image> \
  --model <model-name> \
  --dtype auto \
  --max-model-len 4096
```

### 4. Verify Deployment

```bash
# Check health
curl http://localhost:8000/health

# List models
curl http://localhost:8000/v1/models

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 10}'
```
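
The same server also answers the OpenAI-compatible chat endpoint (listed under API Endpoints below); an equivalent chat-style smoke test:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-name>",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 10
  }'
```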
### 5. Update

```bash
docker pull <vllm-image>
docker stop <container-id>
# Re-run with the same parameters
```
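
One way to avoid the container-ID lookup is to run under a fixed name; a sketch of that update cycle (the name `vllm-server` is an arbitrary choice, and flags are abbreviated):

```bash
# Initial run, detached and named; use your full parameter set
docker run -d --name vllm-server -p 8000:8000 <vllm-image> --model <model-name>

# Update cycle
docker pull <vllm-image>
docker stop vllm-server && docker rm vllm-server
docker run -d --name vllm-server -p 8000:8000 <vllm-image> --model <model-name>
```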
## Cloud VM Deployment

### 1. Provision Infrastructure

```bash
# Create security group with rules:
#   - TCP 22 (SSH)
#   - TCP 8000 (API)
# Launch instance with:
#   - Sufficient RAM/VRAM for the model
#   - Docker pre-installed (or install manually)
#   - 50-100GB root volume
#   - Public IP for external access
```
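
As one concrete example, the equivalent security-group setup with the AWS CLI (the group name and CIDR are placeholders; other providers differ):

```bash
aws ec2 create-security-group --group-name vllm-sg --description "vLLM API"
aws ec2 authorize-security-group-ingress --group-name vllm-sg \
  --protocol tcp --port 22 --cidr <your-ip>/32
aws ec2 authorize-security-group-ingress --group-name vllm-sg \
  --protocol tcp --port 8000 --cidr <your-ip>/32
```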
### 2. Connect and Deploy

```bash
ssh -i <key-file> <user>@<instance-ip>
# Install Docker if not present
# Pull and run the vLLM container (see Docker Deployment above)
```
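
If Docker is missing, the official convenience script is the quickest route on most Linux distributions:

```bash
# Install Docker only if it is not already on PATH
command -v docker >/dev/null 2>&1 || curl -fsSL https://get.docker.com | sh
```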
### 3. Verify External Access

```bash
# From local machine
curl http://<instance-ip>:8000/health
curl http://<instance-ip>:8000/v1/models
```
### 4. Cleanup

```bash
# Stop the container
docker stop <container-id>

# Terminate the instance to stop costs; delete associated resources
# (volumes, security groups) if the deployment was temporary
```
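
Continuing the AWS CLI example from the provisioning step (IDs are placeholders):

```bash
aws ec2 terminate-instances --instance-ids <instance-id>
# Once the instance has terminated:
aws ec2 delete-security-group --group-name vllm-sg
```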
## Configuration Reference

### Environment Variables

| Variable | Purpose | Example |
|---|---|---|
| `VLLM_CPU_KVCACHE_SPACE` | KV cache size in GB (CPU) | `4` |
| `VLLM_CPU_OMP_THREADS_BIND` | CPU core binding (CPU) | `0-7` |
| `CUDA_VISIBLE_DEVICES` | GPU device selection | `0,1` |
| `HF_TOKEN` | HuggingFace authentication | `hf_...` |
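
All of these are passed with `-e` at container start; for example, a CPU run that also authenticates to HuggingFace for a gated model (values are illustrative):

```bash
docker run --rm -p 8000:8000 \
  --shm-size=4g \
  -e VLLM_CPU_KVCACHE_SPACE=8 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-15 \
  -e HF_TOKEN=<your-token> \
  <vllm-cpu-image> \
  --model <model-name>
```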
### Docker Flags

| Flag | Purpose |
|---|---|
| `--shm-size` | Shared memory for IPC |
| `--cap-add SYS_NICE` | NUMA optimization (CPU) |
| `--security-opt seccomp=unconfined` | Memory policy syscalls (CPU) |
| `--gpus all` | GPU access |
| `-p <host>:<container>` | Port mapping |
### vLLM Arguments

| Argument | Purpose | Example |
|---|---|---|
| `--model` | Model name/path | `<model-name>` |
| `--dtype` | Data type | `auto`, `float16`, `float32` |
| `--max-model-len` | Max context length | `4096` |
| `--tensor-parallel-size` | Multi-GPU parallelism | `2` |
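
For example, `--tensor-parallel-size` shards one model across multiple GPUs; a two-GPU variant of the GPU deployment above:

```bash
docker run --rm --gpus all --shm-size=4g -p 8000:8000 \
  <vllm-gpu-image> \
  --model <model-name> \
  --tensor-parallel-size 2
```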
### API Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/completions` | POST | Text completion |
| `/v1/chat/completions` | POST | Chat completion |
| `/metrics` | GET | Prometheus metrics |
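
The metrics endpoint serves standard Prometheus text; vLLM's own gauges and counters are typically prefixed `vllm:`, so a quick scrape check looks like:

```bash
curl -s http://localhost:8000/metrics | grep '^vllm:' | head
```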
## Production Checklist

- Verify model fits in available memory
- Configure appropriate data type for hardware
- Set up firewall/security group rules
- Test API endpoints before production use
- Configure monitoring (Prometheus metrics)
- Set up health check alerts (see the probe sketch after this list)
- Document model and configuration used
- Plan for model updates and rollbacks
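
A minimal cron-able probe covers the health-alert item; the alert command is a placeholder for whatever hook you use:

```bash
#!/bin/sh
# Exit non-zero (and log) if the server misses its health check
if ! curl -fsS --max-time 5 http://localhost:8000/health >/dev/null; then
  echo "vLLM health check failed at $(date)" >&2
  # <your-alert-command> "vLLM down"   # placeholder alerting hook
  exit 1
fi
```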
## Troubleshooting

| Issue | Solution |
|---|---|
| Container exits immediately | Increase RAM or use smaller model |
| Slow inference (CPU) | Verify OMP thread binding configuration |
| Connection refused externally | Check firewall/security group rules |
| Model download fails | Set HF_TOKEN for gated models |
| Out of memory during inference | Reduce `--max-model-len` or batch size |
| Port already in use | Change host port mapping |
| Warmup takes too long | Normal for large models (1-5 min) |