# Agent-plugins hyperpod-ssm
Remote command execution and file transfer on SageMaker HyperPod cluster nodes via AWS Systems Manager (SSM). This is the primary interface for accessing HyperPod nodes — direct SSH is not available. Use when any skill, workflow, or user request needs to execute commands on cluster nodes, upload files to nodes, read/download files from nodes, run diagnostics, install packages, or perform any operation requiring shell access to HyperPod instances. Other HyperPod skills depend on this skill for all node-level operations.
## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/awslabs/agent-plugins
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/awslabs/agent-plugins "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/plugins/sagemaker-ai/skills/hyperpod-ssm" ~/.claude/skills/awslabs-agent-plugins-hyperpod-ssm \
  && rm -rf "$T"
```
Manifest: `plugins/sagemaker-ai/skills/hyperpod-ssm/SKILL.md`
## HyperPod SSM Access

### SSM Target Format

Target:

```
sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>
```
- `CLUSTER_ID`: Last segment of the cluster ARN (NOT the cluster name). Extract via `get-cluster-info.sh`.
- `GROUP_NAME`: Instance group name — retrieve via `list-nodes.sh`.
- `INSTANCE_ID`: EC2 instance ID (e.g., `i-0123456789abcdef0`).
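The scripts below assemble this target for you; for reference, a minimal sketch of building it by hand with the AWS CLI (cluster name and region are hypothetical):

```bash
CLUSTER_NAME="my-hyperpod-cluster"   # hypothetical name
REGION="us-west-2"                   # hypothetical region

# CLUSTER_ID is the last segment of the cluster ARN
ARN=$(aws sagemaker describe-cluster --cluster-name "$CLUSTER_NAME" \
        --region "$REGION" --query ClusterArn --output text)
CLUSTER_ID="${ARN##*/}"

# Pick the first node; InstanceGroupName and InstanceId come from list-cluster-nodes
read -r GROUP_NAME INSTANCE_ID < <(aws sagemaker list-cluster-nodes \
        --cluster-name "$CLUSTER_NAME" --region "$REGION" \
        --query 'ClusterNodeSummaries[0].[InstanceGroupName,InstanceId]' \
        --output text)

TARGET="sagemaker-cluster:${CLUSTER_ID}_${GROUP_NAME}-${INSTANCE_ID}"
```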
### Scripts

Three scripts under `scripts/`. Resolve cluster info and nodes once, then execute per node.
**get-cluster-info.sh** — Resolve cluster name → ID (call once):

```bash
scripts/get-cluster-info.sh CLUSTER_NAME [--region REGION]
# Output: {"cluster_id":"...","cluster_arn":"...","cluster_name":"...","region":"..."}
```
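For example, capturing the ID for later calls (cluster name hypothetical; assumes `jq` is installed):

```bash
CLUSTER_ID=$(scripts/get-cluster-info.sh my-hyperpod-cluster --region us-west-2 | jq -r .cluster_id)
```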
**list-nodes.sh** — List all nodes with pagination (call once):

```bash
scripts/list-nodes.sh CLUSTER_NAME [--region REGION] [--instance-group GROUP] [--instance-id ID]
# Output: JSON array of ClusterNodeSummaries (InstanceId, InstanceGroupName, InstanceStatus, etc.)
```

`list-cluster-nodes` paginates at 100 nodes; this script handles pagination automatically.
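For example, pulling just the instance IDs of one group (names hypothetical; assumes `jq`):

```bash
scripts/list-nodes.sh my-hyperpod-cluster --instance-group worker-group | jq -r '.[].InstanceId'
```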
**ssm-exec.sh** — Execute command on a node (call per node):

```bash
# Execute — with pre-built target
scripts/ssm-exec.sh --target "sagemaker-cluster:CLUSTERID_GROUP-INSTANCEID" 'command' [--region REGION]

# Execute — with parts
scripts/ssm-exec.sh --cluster-id ID --group GROUP --instance-id INSTANCE_ID 'command' [--region REGION]

# Upload
scripts/ssm-exec.sh --target TARGET --upload LOCAL_PATH REMOTE_PATH [--region REGION]

# Read remote file
scripts/ssm-exec.sh --target TARGET --read REMOTE_PATH [--region REGION]
```
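A typical sequence (all values hypothetical): run a quick check, push a diagnostic script, execute it, then read the result back:

```bash
TARGET="sagemaker-cluster:abc123xyz_worker-group-i-0123456789abcdef0"  # hypothetical target
scripts/ssm-exec.sh --target "$TARGET" 'uname -a'
scripts/ssm-exec.sh --target "$TARGET" --upload ./diag.sh /tmp/diag.sh
scripts/ssm-exec.sh --target "$TARGET" 'bash /tmp/diag.sh > /tmp/diag.out 2>&1'
scripts/ssm-exec.sh --target "$TARGET" --read /tmp/diag.out
```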
### Running Commands Across Many Nodes

SSM `start-session` rate limit: 3 TPS per account. Plan batch size and delay accordingly.

`aws ssm send-command` does NOT support `sagemaker-cluster:` targets — only `start-session` works.
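A minimal fan-out sketch that paces sessions to stay under the 3 TPS account limit (cluster name hypothetical; assumes `jq`):

```bash
CLUSTER="my-hyperpod-cluster"   # hypothetical
CLUSTER_ID=$(scripts/get-cluster-info.sh "$CLUSTER" | jq -r .cluster_id)

scripts/list-nodes.sh "$CLUSTER" \
  | jq -r '.[] | "\(.InstanceGroupName) \(.InstanceId)"' \
  | while read -r GROUP INSTANCE; do
      scripts/ssm-exec.sh --cluster-id "$CLUSTER_ID" --group "$GROUP" \
        --instance-id "$INSTANCE" 'nvidia-smi -L' || echo "FAILED: $INSTANCE" >&2
      sleep 0.5   # ~2 sessions/sec, safely under 3 TPS
    done
```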
### Manual SSM Commands

When the scripts aren't suitable, use `aws ssm start-session` directly with `AWS-StartNonInteractiveCommand`:

```bash
cat > /tmp/cmd.json << 'EOF'
{"command": ["bash -c 'echo hello && whoami'"]}
EOF
aws ssm start-session \
  --target sagemaker-cluster:{CLUSTER_ID}_{GROUP_NAME}-{INSTANCE_ID} \
  --region REGION \
  --document-name AWS-StartNonInteractiveCommand \
  --parameters file:///tmp/cmd.json
```
Always use a JSON file for `--parameters` — inline parameters break with special characters.
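When the remote command itself contains quotes or shell metacharacters, building the file with `jq` keeps the JSON escaping correct (a sketch, assuming `jq` is available):

```bash
REMOTE_CMD="bash -c 'df -h | grep -v tmpfs'"          # arbitrary command with quotes and pipes
jq -n --arg cmd "$REMOTE_CMD" '{command: [$cmd]}' > /tmp/cmd.json
```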
### Common Diagnostic Commands
| Task | Command |
|---|---|
| Lifecycle logs | |
| Memory | |
| Disk/mounts | |
| GPU status | |
| GPU memory | |
| EFA/network | |
| CloudWatch agent | |
| Top processes | |
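Typical commands for these tasks on a GPU node, as a hedged starting point (exact log paths, service names, and tool availability vary by cluster image and configuration):

```bash
sudo tail -n 100 /var/log/provision/provisioning.log            # Lifecycle logs (path is an assumption; check /var/log on your image)
free -h                                                          # Memory
df -h && mount | grep -E 'fsx|lustre|nfs'                        # Disk/mounts (shared-FS filter is an assumption)
nvidia-smi                                                       # GPU status
nvidia-smi --query-gpu=memory.used,memory.total --format=csv     # GPU memory
fi_info -p efa                                                   # EFA/network (assumes libfabric tools are installed)
systemctl status amazon-cloudwatch-agent                         # CloudWatch agent (standard service name)
ps aux --sort=-%mem | head -n 15                                 # Top processes by memory
```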
### Key Details

- Default SSM non-interactive user is `root`.
- SSM rate limit: 3 TPS per account.
- For interactive sessions (rare), omit `--document-name` to get a shell (see the one-liner after this list).
- Interactive commands (vim, top) are not supported via `AWS-StartNonInteractiveCommand`.
- Large outputs may be truncated by SSM.
- For troubleshooting common errors, see `references/troubleshooting.md`.
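For example, an interactive shell on a node (target values hypothetical):

```bash
aws ssm start-session \
  --target "sagemaker-cluster:abc123xyz_worker-group-i-0123456789abcdef0" \
  --region us-west-2
```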