Agent-plugins hyperpod-version-checker

Check and compare software component versions on SageMaker HyperPod cluster nodes - NVIDIA drivers, CUDA toolkit, cuDNN, NCCL, EFA, AWS OFI NCCL, GDRCopy, MPI, Neuron SDK (Trainium/Inferentia), Python, and PyTorch. Use when checking component versions, verifying CUDA/driver compatibility, detecting version mismatches across nodes, planning upgrades, documenting cluster configuration, or troubleshooting version-related issues on HyperPod. Triggers on requests about versions, compatibility, component checks, or upgrade planning for HyperPod clusters.

Install

Source · Clone the upstream repo

```shell
git clone https://github.com/awslabs/agent-plugins
```

Claude Code · Install into ~/.claude/skills/

```shell
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/awslabs/agent-plugins "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/plugins/sagemaker-ai/skills/hyperpod-version-checker" \
       ~/.claude/skills/awslabs-agent-plugins-hyperpod-version-checker \
  && rm -rf "$T"
```

manifest: plugins/sagemaker-ai/skills/hyperpod-version-checker/SKILL.md
source content

HyperPod Version Checker

Upload to cluster nodes via the `hyperpod-ssm` skill, then execute.

Usage

```shell
# Text report to console + file
bash hyperpod_check_versions.sh

# JSON only to stdout (text report still saved to file); best for piping/parsing
bash hyperpod_check_versions.sh --json

# Custom output file
bash hyperpod_check_versions.sh --output /tmp/versions.txt

# No color (for logging)
bash hyperpod_check_versions.sh --no-color
```

Output file (default): `component_versions_<hostname>_<timestamp>.txt`
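The default filename can be reconstructed from that template; a minimal sketch, where the exact timestamp format is an assumption rather than taken from the script:

```shell
# Hypothetical reconstruction of the default report filename; the real
# script's timestamp format may differ.
out="component_versions_$(hostname)_$(date +%Y%m%d_%H%M%S).txt"
echo "$out"
```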

What It Checks

| Component | Detection Method | Applicable When |
|---|---|---|
| NVIDIA Driver | `nvidia-smi` | GPU instances (p3/p4/p5/g5) |
| CUDA Toolkit | `nvcc`, `/usr/local/cuda` symlink | GPU instances |
| cuDNN | Header file, packages | GPU instances doing deep learning |
| NCCL | Library filename, header, packages | Distributed GPU training |
| EFA | `/opt/amazon/efa_installed_packages`, `fi_info` | EFA-capable instances (p4d/p4de/p5/trn1/trn2) |
| AWS OFI NCCL | `efa_installed_packages`, library search | EFA + NCCL workloads |
| GDRCopy | rpm/dpkg, kernel module | GPU instances with RDMA (p4d+/p5) |
| MPI | `mpirun`, `/opt/amazon/openmpi` | Distributed training |
| Neuron SDK | `neuronx-cc`, `neuron-ls`, packages | Trainium/Inferentia (trn1/trn2/inf1/inf2) |
| Python/PyTorch | `python3`, `torch` import | ML workloads |
| Container runtime | `docker`, `containerd`, `kubectl`, `nvidia-ctk` | EKS clusters |
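The detection methods above share one pattern: probe for a tool and degrade to "not found" rather than failing, so components absent on a given instance type don't abort the report. A minimal sketch of that pattern, where `check` is illustrative and not the script's actual function:

```shell
# Illustrative probe pattern (not the script's actual code): try each
# version command, fall back to "not found" when the tool is absent.
check() {
  name="$1"; shift                      # remaining args: the command to try
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s: %s\n' "$name" "$("$@" 2>&1 | head -n1)"
  else
    printf '%s: not found\n' "$name"
  fi
}
check "Python"        python3 --version
check "NVIDIA Driver" nvidia-smi
check "MPI"           mpirun --version
```

On a CPU-only node the last two lines simply report `not found` instead of erroring out.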

Multi-Node Comparison

Run on each node individually via the `hyperpod-ssm` skill. With `--json`, stdout is clean JSON for easy diffing.
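Diffing two nodes' reports can then be sketched as below; `compare_reports` and the file names are illustrative, assuming each node's `--json` output was pulled back locally:

```shell
# Illustrative comparison of two --json reports collected from two nodes.
# Key-sorting normalizes the JSON so diff shows only real value differences.
compare_reports() {  # usage: compare_reports node1.json node2.json
  python3 -m json.tool --sort-keys "$1" > /tmp/n1.norm
  python3 -m json.tool --sort-keys "$2" > /tmp/n2.norm
  diff /tmp/n1.norm /tmp/n2.norm && echo "reports match"
}
```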

Compatibility Reference

The script automatically analyzes CUDA/driver compatibility. For reference:

| Driver Series | Supported CUDA |
|---|---|
| 580+ | 13.x, 12.x, 11.x |
| 570+ | 12.8+ (Blackwell), 12.x, 11.x |
| 545+ | 12.3-12.7, 11.x |
| 525-535 | 12.0-12.2, 11.x |
| 450+ | 11.x only |
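The table can be turned into a quick lookup; a sketch assuming you feed it the full driver string (e.g. from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`). `supported_cuda` is illustrative, and driver majors that fall between the table's rows are mapped to the nearest lower row:

```shell
# Illustrative lookup over the reference table above (not part of the script).
supported_cuda() {
  major="${1%%.*}"                      # "570.86.10" -> "570"
  if   [ "$major" -ge 580 ]; then echo "13.x, 12.x, 11.x"
  elif [ "$major" -ge 570 ]; then echo "12.8+ (Blackwell), 12.x, 11.x"
  elif [ "$major" -ge 545 ]; then echo "12.3-12.7, 11.x"
  elif [ "$major" -ge 525 ]; then echo "12.0-12.2, 11.x"
  elif [ "$major" -ge 450 ]; then echo "11.x only"
  else                            echo "driver too old for CUDA 11+"
  fi
}
supported_cuda "570.86.10"              # -> 12.8+ (Blackwell), 12.x, 11.x
```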

NCCL: use 2.18+ with CUDA 12.x and 2.12+ with CUDA 11.x. The NCCL version must be consistent across all nodes.
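These floors can be enforced with a semantic version compare; a sketch relying on GNU `sort -V`, where `meets_floor` is illustrative and not part of the script:

```shell
# Illustrative minimum-version check built on GNU sort -V: the floor must
# sort first (or tie) for the installed version to pass.
meets_floor() {  # usage: meets_floor INSTALLED FLOOR
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}
meets_floor "2.19.4" "2.18" && echo "2.19.4 OK for CUDA 12.x"
```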

EFA installer to AWS OFI NCCL pairing:

| EFA Installer | AWS OFI NCCL |
|---|---|
| 1.29+ | v1.7.3+ (recommended) |
| 1.26-1.28 | v1.7.0-v1.7.2 |
| 1.20-1.25 | v1.6.0+ |