# rocm_vllm_deployment

Production-ready vLLM deployment on AMD ROCm GPUs. Combines environment auto-check, model parameter detection, Docker Compose deployment, health verification, and functional testing with comprehensive logging and security best practices.

Install from https://github.com/openclaw/skills:

```bash
# Clone the repository
git clone https://github.com/openclaw/skills

# Install into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/alexhegit/rocm-vllm-deployment" ~/.claude/skills/openclaw-skills-rocm-vllm-deployment && rm -rf "$T"

# Or install into ~/.openclaw/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/alexhegit/rocm-vllm-deployment" ~/.openclaw/skills/openclaw-skills-rocm-vllm-deployment && rm -rf "$T"
```

`skills/alexhegit/rocm-vllm-deployment/SKILL.md`

# ROCm vLLM Deployment Skill
Production-ready automation for deploying vLLM inference services on AMD ROCm GPUs using Docker Compose.
## Features

- **Environment Auto-Check** - Detects and repairs missing dependencies
- **Model Parameter Detection** - Auto-reads config.json for optimal settings
- **VRAM Estimation** - Calculates memory requirements before deployment
- **Secure Token Handling** - Never writes tokens to compose files
- **Structured Output** - All logs and test results saved per-model
- **Deployment Reports** - Human-readable summary for each deployment
- **Health Verification** - Automated health checks and functional tests
- **Troubleshooting Guide** - Common issues and solutions
## Environment Prerequisites

**Recommended (for production):** Add to `~/.bash_profile`:

```bash
# HuggingFace authentication token (required for gated models)
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Model cache directory (optional)
export HF_HOME="$HOME/models"

# Apply changes
source ~/.bash_profile
```

**Not required for testing:** The skill will proceed without these set:

- `HF_TOKEN`: Optional — public models work without it; gated models fail at download with a clear error
- `HF_HOME`: Optional — defaults to `/root/.cache/huggingface/hub`
## Environment Variable Detection

**Priority Order:**

1. **Explicit parameter** (highest) — Provided in the task/request (e.g., `hf_token: "xxx"`)
2. **Environment variable** — Already set in the shell or inherited from the parent process
3. **`~/.bash_profile`** — Sourced to load variables
4. **Default value** (lowest) — `HF_HOME` defaults to `/root/.cache/huggingface/hub`

| Variable | Required | If Missing |
|---|---|---|
| `HF_TOKEN` | Conditional | Continue without token (public models work; gated models fail at download with a clear error) |
| `HF_HOME` | No | Warning + default — use `/root/.cache/huggingface/hub` |

**Philosophy:** Fail fast for configuration errors; fail at download time for authentication errors.
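The priority order above can be sketched as a small resolver function; `resolve_hf_home` is a hypothetical helper for illustration, not one of the skill's scripts:

```bash
# Hypothetical sketch of the HF_HOME resolution cascade
resolve_hf_home() {
  local explicit="${1:-}"
  if [ -n "$explicit" ]; then            # 1. explicit parameter (highest)
    echo "$explicit"; return
  fi
  if [ -n "${HF_HOME:-}" ]; then         # 2. already set in the environment
    echo "$HF_HOME"; return
  fi
  [ -f ~/.bash_profile ] && . ~/.bash_profile       # 3. load shell config
  echo "${HF_HOME:-/root/.cache/huggingface/hub}"   # 4. default (lowest)
}

resolve_hf_home "/data/models"   # explicit parameter wins regardless of environment
```

The same cascade applies to `HF_TOKEN`, except there is no default value: a gated model simply fails at download time.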
## Helper Scripts

Location: `<skill-dir>/scripts/`

### check-env.sh

Validates and loads environment variables before deployment.

**Usage:**

```bash
# Basic check (HF_TOKEN optional, HF_HOME optional with default)
./scripts/check-env.sh

# Strict mode (HF_HOME required, fails if not set)
./scripts/check-env.sh --strict

# Quiet mode (minimal output, for automation)
./scripts/check-env.sh --quiet

# Test with environment variables
HF_TOKEN="hf_xxx" HF_HOME="/models" ./scripts/check-env.sh
```
**Exit Codes:**

| Code | Meaning |
|---|---|
| 0 | Environment check completed (variables loaded or defaulted) |
| 2 | Critical error (e.g., cannot source `~/.bash_profile`) |

**Note:** This script is optional; you can also run `source ~/.bash_profile` directly.
### generate-report.sh

Generates a human-readable deployment report after a successful deployment.

**Usage:**

```bash
./scripts/generate-report.sh <model-id> <container-name> <port> <status> [model-load-time] [memory-used]

# Example:
./scripts/generate-report.sh \
  "Qwen-Qwen3-0.6B" \
  "vllm-qwen3-0-6b" \
  "8001" \
  "✅ Success" \
  "3.6" \
  "1.2"
```
**Parameters:**

| Parameter | Required | Description |
|---|---|---|
| `model-id` | Yes | Model ID (with `/` replaced by `-`) |
| `container-name` | Yes | Docker container name |
| `port` | Yes | Host port for API endpoint |
| `status` | Yes | Deployment status (e.g., "✅ Success") |
| `model-load-time` | No | Model loading time in seconds |
| `memory-used` | No | Memory consumption in GiB |
**Output:** `$HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md`

**Exit Codes:**

| Code | Meaning |
|---|---|
| 0 | Report generated successfully |
| 1 | Missing required parameters |
| 2 | Output directory not found |

**Integration:** This script is called automatically in Phase 7 of the deployment workflow.
## Input Schema

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `model_id` | String | Yes | - | HuggingFace model ID |
| `docker_image` | String | No | `rocm/vllm-dev:nightly` | vLLM Docker image |
| `tensor_parallel_size` | Integer | No | 1 | Number of GPUs |
| `port` | Integer | No | 9999 | API server port |
| `hf_home` | String | No | `$HF_HOME` or `/root/.cache/huggingface/hub` | Model cache directory |
| `hf_token` | Secret | Conditional | - | HuggingFace token (optional for public models, required for gated models) |
| `max_model_len` | Integer | No | Auto-detect | Maximum sequence length |
| `gpu_memory_utilization` | Float | No | 0.85 | GPU memory utilization |
| `auto_install` | Boolean | No | true | Auto-install dependencies |
| `log_level` | String | No | INFO | Logging verbosity |
## Output Structure

All deployment artifacts MUST be saved to:

```
$HOME/vllm-compose/<model-id-slash-to-dash>/
```

Convert the model ID to a directory name by replacing `/` with `-`:

- `openai/gpt-oss-20b` → `$HOME/vllm-compose/openai-gpt-oss-20b/`
- `Qwen/Qwen3-Coder-Next-FP8` → `$HOME/vllm-compose/Qwen-Qwen3-Coder-Next-FP8/`
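The conversion is a single substitution; a minimal sketch in bash:

```bash
# Map a HuggingFace model ID to its per-model output directory
model_id="openai/gpt-oss-20b"
model_dir="${model_id//\//-}"   # replace every "/" with "-"
echo "$HOME/vllm-compose/$model_dir/"
```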
**Per-model directory structure:**

```
$HOME/vllm-compose/<model-id>/
├── deployment.log          # Full deployment logs (stdout + stderr)
├── test-results.json       # Functional test results (JSON format)
├── docker-compose.yml      # Generated Docker Compose file
├── .env                    # HF_TOKEN environment (chmod 600, optional)
└── DEPLOYMENT_REPORT.md    # Human-readable deployment summary
```

**File requirements:**

- `deployment.log` — Capture ALL container logs during deployment
- `test-results.json` — Save the API response from the functional test request
- `DEPLOYMENT_REPORT.md` — Generated in Phase 7
- All three files MUST exist before marking the deployment as complete
## Execution Workflow

### Phase 0: Environment Check & Auto-Repair

**Step 0.1: Load Environment Variables**

```bash
# Source ~/.bash_profile to load HF_HOME and HF_TOKEN
source ~/.bash_profile
```

If `HF_HOME` is not defined in `~/.bash_profile`, it defaults to `/root/.cache/huggingface/hub`.
**Step 0.2: Create Output Directory**

- Create: `$HOME/vllm-compose/<model-id>/`

**Step 0.3: Initialize Logging**

- All output → `$HOME/vllm-compose/<model-id>/deployment.log`

**Step 0.4: System Checks**

- Detect OS and package manager
- Check Python, pip, huggingface_hub
- Check Docker, docker compose
- Check ROCm tools (rocm-smi/amd-smi)
- Check GPU access (/dev/kfd, /dev/dri)
- Check disk space (20 GB minimum)
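The checks above might look like the following sketch (command names and the 20 GB threshold come from this section; the skill's actual script may differ):

```bash
# Sketch of the Phase 0 system checks; reports issues instead of exiting
issues=0

for cmd in python3 pip3 docker; do
  command -v "$cmd" >/dev/null || { echo "MISSING: $cmd"; issues=$((issues+1)); }
done

# Either SMI tool satisfies the ROCm check
command -v rocm-smi >/dev/null || command -v amd-smi >/dev/null \
  || { echo "MISSING: rocm-smi/amd-smi"; issues=$((issues+1)); }

# GPU device nodes needed for container passthrough
for dev in /dev/kfd /dev/dri; do
  [ -e "$dev" ] || { echo "MISSING: $dev"; issues=$((issues+1)); }
done

# Disk space: at least 20 GB free under $HOME (df -Pk reports KiB)
avail_kib=$(df -Pk "$HOME" | awk 'NR==2 {print $4}')
[ "$avail_kib" -ge $((20 * 1024 * 1024)) ] \
  || { echo "LOW DISK: ${avail_kib} KiB free"; issues=$((issues+1)); }

echo "system checks finished with ${issues} issue(s)"
```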
### Phase 1: Model Download

Use `HF_HOME` from Phase 0 (environment variable or default):

```bash
# Download model to HF_HOME
huggingface-cli download <model_id> --local-dir "$HF_HOME/hub/models--<org>--<model>"

# Or use snapshot_download via Python:
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='<model_id>', cache_dir='$HF_HOME')"
```
**Authentication Handling:**
| Scenario | Behavior |
|---|---|
| Public model + no token | ✅ Download succeeds |
| Public model + token provided | ✅ Download succeeds |
| Gated model + no token | ❌ Download fails with "authentication required" error |
| Gated model + invalid token | ❌ Download fails with "invalid token" error |
| Gated model + valid token | ✅ Download succeeds |
**On Authentication Failure:**

```bash
echo "ERROR: Model download failed - authentication required"
echo "This model requires a valid HF_TOKEN."
echo ""
echo "Please add to ~/.bash_profile:"
echo "  export HF_TOKEN=\"hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\""
echo "Then run: source ~/.bash_profile"
exit 1
```

- Locate the model path in the HF cache: `$HF_HOME/hub/models--<org>--<model-name>/`
- Log download progress to `deployment.log`
### Phase 2: Model Parameter Detection

- Read `config.json` from the model
- Auto-detect: `max_model_len`, `hidden_size`, `num_attention_heads`, `num_hidden_layers`, `vocab_size`, `dtype`
- Validate that the TP size evenly divides the number of attention heads
- Estimate the VRAM requirement
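As an illustration of the VRAM estimate, here is a rough sketch using a synthetic `config.json`. The formula weights ≈ (12·L·h² + V·h) × bytes-per-param is a common approximation for transformer weight memory, not necessarily the skill's exact method, and it excludes the KV cache and runtime overhead:

```bash
# Rough weight-memory estimate from config.json fields (illustration only)
cat > /tmp/config.json <<'EOF'
{"hidden_size": 1024, "num_hidden_layers": 28, "num_attention_heads": 16,
 "vocab_size": 151936, "torch_dtype": "bfloat16"}
EOF

estimate=$(python3 - <<'EOF'
import json
cfg = json.load(open("/tmp/config.json"))
h, L, V = cfg["hidden_size"], cfg["num_hidden_layers"], cfg["vocab_size"]
bytes_per = 2 if cfg["torch_dtype"] in ("bfloat16", "float16") else 4
params = 12 * L * h * h + V * h          # transformer blocks + embeddings
weights_gib = params * bytes_per / 2**30
print(f"~{params/1e9:.2f}B params, ~{weights_gib:.1f} GiB weights")
EOF
)
echo "$estimate"
```

The real requirement also includes the KV cache (which scales with `max_model_len`) plus activation and framework overhead, which is why `gpu_memory_utilization` defaults to 0.85 rather than 1.0.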
### Phase 3: Docker Compose Configuration

Generate files in the output directory:

- `docker-compose.yml` → `$HOME/vllm-compose/<model-id>/docker-compose.yml`
  - Mount `HF_HOME` as a volume (read-only for models)
  - NO hardcoded tokens in the compose file
- `.env` → `$HOME/vllm-compose/<model-id>/.env` (optional)
  - Contains: `HF_TOKEN=<value>`
  - Permissions: `chmod 600`
  - Only created if the user explicitly requests persistent token storage

**Volume mount example:**

```yaml
volumes:
  - ${HF_HOME}:/root/.cache/huggingface/hub:ro
  - /dev/kfd:/dev/kfd
  - /dev/dri:/dev/dri
```

**Important:** Docker Compose reads `${HF_HOME}` from the host environment at runtime. Before running `docker compose`, run `source ~/.bash_profile`.
### Phase 4: Container Launch

**Important:** Before deploying, pull the latest image to pick up updates:

```bash
docker pull rocm/vllm-dev:nightly
```

**Note:** The default port is 9999. Before running `docker compose`, check that the port is available with `ss -tlnp | grep :<port>`. If the port is in use, specify a different port in `docker-compose.yml`.
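The port check from the note above can be wrapped in a small helper with a fallback for hosts without `ss`; the `port_free` name is an illustration:

```bash
# Returns success when nothing is listening on the given TCP port
port_free() {
  local port="$1"
  if command -v ss >/dev/null; then
    ! ss -tln | grep -q ":${port} "
  else
    # Fallback: try to bind the port briefly (assumes python3 is available)
    python3 -c "import socket; s = socket.socket(); s.bind(('', ${port})); s.close()" 2>/dev/null
  fi
}

port_free 9999 && echo "port 9999 is free" || echo "port 9999 is in use"
```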
- Pass `HF_TOKEN` at runtime: `HF_TOKEN=$HF_TOKEN docker compose up -d`
- Wait for container initialization
### Phase 5: Health Verification

- Check container status
- Test the `/health` endpoint
- Test the `/v1/models` endpoint
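A minimal sketch of that verification loop; the `wait_for_health` helper and retry counts are assumptions, not the skill's exact script:

```bash
# Poll /health until the server responds; then the API can be exercised
wait_for_health() {
  local port="${1:-9999}" retries="${2:-30}"
  local i
  for i in $(seq 1 "$retries"); do
    if curl -sf "http://localhost:${port}/health" >/dev/null; then
      echo "healthy after ${i} attempt(s)"; return 0
    fi
    sleep 2
  done
  echo "health check failed after ${retries} attempts" >&2
  return 1
}

# Once healthy, confirm the model is registered:
# curl -s http://localhost:<port>/v1/models
```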
### Phase 6: Functional Testing

- Run a completion test via the `/v1/chat/completions` API
- Save the response to `$HOME/vllm-compose/<model-id>/test-results.json`
- Verify the response contains a valid completion
- Log deployment complete → append to `deployment.log`
- The deployment is complete only when both `deployment.log` and `test-results.json` exist
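A hedged sketch of what the functional test request might look like; the model ID, prompt, and port below are placeholders, and the skill's actual payload may differ:

```bash
# Send one chat completion and save the raw response as test-results.json
port=9999
model_id="Qwen/Qwen3-0.6B"                       # example model ID
out_dir="$HOME/vllm-compose/${model_id//\//-}"   # slash-to-dash directory
mkdir -p "$out_dir"

request='{"model":"'"$model_id"'","messages":[{"role":"user","content":"Say hello"}],"max_tokens":32}'

curl -s "http://localhost:${port}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$request" \
  -o "$out_dir/test-results.json" \
  || echo "request failed - is the container running on port ${port}?"
```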
### Phase 7: Deployment Report

Generate a human-readable deployment report using the helper script.

**Step 7.1: Extract Deployment Metrics**

```bash
# Parse deployment.log for metrics
MODEL_LOAD_TIME=$(grep -o "model loading took [0-9.]* seconds" deployment.log | grep -o '[0-9.]*' || echo "N/A")
MEMORY_USED=$(grep -o "took [0-9.]* GiB memory" deployment.log | grep -o '[0-9.]*' || echo "N/A")
```
**Step 7.2: Generate Report**

```bash
# Execute the report generation script
<skill-dir>/scripts/generate-report.sh \
  "<model-id>" \
  "<container-name>" \
  "<port>" \
  "<status>" \
  "$MODEL_LOAD_TIME" \
  "$MEMORY_USED"

# Example:
./scripts/generate-report.sh \
  "Qwen-Qwen3-0.6B" \
  "vllm-qwen3-0-6b" \
  "8001" \
  "✅ Success" \
  "3.6" \
  "1.2"
```
**Output:** `$HOME/vllm-compose/<model-id>/DEPLOYMENT_REPORT.md`

**Report Contents:**

- Output structure verification (file checklist)
- Deployment summary table (health, test, metrics)
- Test results (request/response preview)
- Environment configuration
- Quick commands for operations

**Completion Criteria:**

- `DEPLOYMENT_REPORT.md` exists in the output directory
- The report contains all required sections
- All file checks show ✅
## Security Best Practices

- **Never commit tokens to version control** — Add `.env` to `.gitignore`
- **Use `.env` files with `chmod 600`** — Restrict access to the owner only
- **Mask tokens in logs** — Show only the first 10 characters: `${TOKEN:0:10}...`
- **Pass tokens at runtime** — `HF_TOKEN=$HF_TOKEN docker compose up -d`
- **Store tokens in `~/.bash_profile`** — For production environments, set `HF_TOKEN` in the user's shell config
- **Set the token for gated models** — `HF_TOKEN` is validated at download time; set it in `~/.bash_profile` for production
## Troubleshooting

### Environment Variables

| Issue | Solution |
|---|---|
| `HF_TOKEN` not set | Add `export HF_TOKEN="hf_..."` to `~/.bash_profile`, then `source ~/.bash_profile`. Or provide it via the `hf_token` parameter. |
| `HF_HOME` not set | `HF_HOME` defaults to `/root/.cache/huggingface/hub`. For production, add `export HF_HOME=...` to `~/.bash_profile`. |
| `~/.bash_profile` does not exist | Create `~/.bash_profile` and add the environment variables. |
| Variables not loaded | Run `source ~/.bash_profile` or restart the terminal. |
| Authentication error despite token | The token may be invalid or lack access to the model. Verify the token at https://huggingface.co/settings/tokens |
### Model Download

| Issue | Solution |
|---|---|
| Authentication required (gated model) | Set `HF_TOKEN` in `~/.bash_profile` or provide it via the `hf_token` parameter. Ensure the token has access to the model. |
| Model not found | Verify the model ID is correct (case-sensitive). Check that the model exists on HuggingFace. |
| Download slow or stalled | Check the network connection. Large models may take time. |
### Deployment

| Issue | Solution |
|---|---|
| hf CLI not found | Install it with `pip install -U "huggingface_hub[cli]"` |
| Docker Compose fails | Use `docker compose` (no hyphen) |
| GPU access fails | Add the user to the `video` and `render` groups: `sudo usermod -aG video,render $USER` |
| Port in use | Change the `port` parameter |
| OOM | Reduce `gpu_memory_utilization` or `max_model_len` |
### Cleanup

```bash
cd $HOME/vllm-compose/<model-id>
docker compose down
```
### Status Check

Check deployment status and logs:

```bash
# View deployment directory
ls -la $HOME/vllm-compose/<model-id>/

# View live logs
tail -f $HOME/vllm-compose/<model-id>/deployment.log

# View test results
cat $HOME/vllm-compose/<model-id>/test-results.json

# Check container status
docker ps | grep <model-id>

# Verify environment variables
echo "HF_TOKEN: ${HF_TOKEN:0:10}..."
echo "HF_HOME: $HF_HOME"
```
## Quick Start (Production)

**Step 1: Add environment variables to `~/.bash_profile`**

```bash
# Required: HuggingFace token
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Recommended: Custom model storage path (production)
export HF_HOME="/data/models/huggingface"

# Apply changes
source ~/.bash_profile
```

**Step 2: Verify the environment is ready**

```bash
# Load the variables, then validate with the helper script
source ~/.bash_profile
./scripts/check-env.sh

# Expected output:
# === Environment Ready ===
# Summary:
#   HF_TOKEN: hf_xxxxxx...
#   HF_HOME: /data/models/huggingface
```

**Step 3: Run the deployment**

The skill will automatically:

1. Source `~/.bash_profile` to load `HF_HOME` and `HF_TOKEN`
2. Use `HF_TOKEN` and `HF_HOME` from the environment (or `~/.bash_profile`, or defaults)
3. Proceed without a token for public models
4. Fail at download time with a clear error if a gated model requires a token
## Version History
| Version | Changes |
|---|---|
| 1.0.0 | Initial release |