Agent-plugins hyperpod-version-checker
Check and compare software component versions on SageMaker HyperPod cluster nodes - NVIDIA drivers, CUDA toolkit, cuDNN, NCCL, EFA, AWS OFI NCCL, GDRCopy, MPI, Neuron SDK (Trainium/Inferentia), Python, and PyTorch. Use when checking component versions, verifying CUDA/driver compatibility, detecting version mismatches across nodes, planning upgrades, documenting cluster configuration, or troubleshooting version-related issues on HyperPod. Triggers on requests about versions, compatibility, component checks, or upgrade planning for HyperPod clusters.
git clone https://github.com/awslabs/agent-plugins
T=$(mktemp -d) && git clone --depth=1 https://github.com/awslabs/agent-plugins "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/sagemaker-ai/skills/hyperpod-version-checker" ~/.claude/skills/awslabs-agent-plugins-hyperpod-version-checker && rm -rf "$T"
plugins/sagemaker-ai/skills/hyperpod-version-checker/SKILL.mdHyperPod Version Checker
Upload to cluster nodes via
hyperpod-ssm skill, then execute.
Usage
# Text report to console + file bash hyperpod_check_versions.sh # JSON only to stdout (text report still saved to file) — best for piping/parsing bash hyperpod_check_versions.sh --json # Custom output file bash hyperpod_check_versions.sh --output /tmp/versions.txt # No color (for logging) bash hyperpod_check_versions.sh --no-color
Output file:
component_versions_<hostname>_<timestamp>.txt (default)
What It Checks
| Component | Detection Method | Applicable When |
|---|---|---|
| NVIDIA Driver | | GPU instances (p3/p4/p5/g5) |
| CUDA Toolkit | , symlink | GPU instances |
| cuDNN | Header file, packages | GPU instances doing deep learning |
| NCCL | Library filename, header, packages | Distributed GPU training |
| EFA | , | EFA-capable instances (p4d/p4de/p5/trn1/trn2) |
| AWS OFI NCCL | , library search | EFA + NCCL workloads |
| GDRCopy | rpm/dpkg, kernel module | GPU instances with RDMA (p4d+/p5) |
| MPI | , | Distributed training |
| Neuron SDK | , , packages | Trainium/Inferentia (trn1/trn2/inf1/inf2) |
| Python/PyTorch | , import | ML workloads |
| Container runtime | , , , | EKS clusters |
Multi-Node Comparison
Run on each node individually via the
hyperpod-ssm skill. With --json, stdout is clean JSON for easy diffing.
Compatibility Reference
The script automatically analyzes CUDA/driver compatibility. For reference:
| Driver Series | Supported CUDA |
|---|---|
| 580+ | 13.x, 12.x, 11.x |
| 570+ | 12.8+ (Blackwell), 12.x, 11.x |
| 545+ | 12.3-12.7, 11.x |
| 525-535 | 12.0-12.2, 11.x |
| 450+ | 11.x only |
NCCL: Use 2.18+ for CUDA 12.x, 2.12+ for CUDA 11.x. Must be consistent across all nodes.
| EFA Installer | AWS OFI NCCL |
|---|---|
| 1.29+ | v1.7.3+ (recommended) |
| 1.26-1.28 | v1.7.0-v1.7.2 |
| 1.20-1.25 | v1.6.0+ |