Agent-plugins hyperpod-issue-report

Generate comprehensive issue reports from HyperPod clusters (EKS and Slurm) by collecting diagnostic logs and configurations for troubleshooting and AWS Support cases. Use when users need to collect diagnostics from HyperPod cluster nodes, generate issue reports for AWS Support, investigate node failures or performance problems, document cluster state, or create diagnostic snapshots. Triggers on requests involving issue reports, diagnostic collection, support case preparation, or cluster troubleshooting that requires gathering logs and system information from multiple nodes.

install

source · Clone the upstream repo

git clone https://github.com/awslabs/agent-plugins

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/awslabs/agent-plugins "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/sagemaker-ai/skills/hyperpod-issue-report" ~/.claude/skills/awslabs-agent-plugins-hyperpod-issue-report && rm -rf "$T"

manifest: plugins/sagemaker-ai/skills/hyperpod-issue-report/SKILL.md

source content

HyperPod Issue Report

Collect diagnostic logs from HyperPod cluster nodes via SSM and store the results in S3. Supports both EKS and Slurm clusters with auto-detection. Uses the bundled `scripts/hyperpod_issue_report.py` for reliable parallel collection.

Prerequisites

AWS CLI configured with permissions: `sagemaker:DescribeCluster`, `sagemaker:ListClusterNodes`, `ssm:StartSession`, `s3:PutObject`, `s3:GetObject`, `eks:DescribeCluster`

Python 3.8+ and uv (see uv installation docs for install options)
SSM Agent running on target nodes; node IAM roles need `s3:GetObject`/`s3:PutObject` on the report bucket
For EKS clusters: kubectl installed and configured (see Workflow step 2)
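
The node-role requirement above can be expressed as an IAM policy document. The following is a minimal sketch, not the plugin's own tooling: the helper name `report_bucket_policy` is hypothetical, and the `arn:aws:s3:::` prefix assumes the standard `aws` partition.

```python
import json


def report_bucket_policy(bucket: str, prefix: str = "") -> dict:
    """Build a policy granting object read/write on the report bucket.

    Hypothetical helper for illustration; assumes the standard "aws" partition.
    """
    cleaned = prefix.strip("/")
    key_pattern = f"{cleaned}/*" if cleaned else "*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The two actions the prerequisites list for node IAM roles.
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{key_pattern}",
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder bucket and prefix names.
    print(json.dumps(report_bucket_policy("my-report-bucket", "hyperpod-reports"), indent=2))
```

Scoping the resource to the report prefix rather than the whole bucket keeps the grant narrow; pass an empty prefix if reports are written at the bucket root.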

Workflow

1. Gather Information

Collect from the user:

Cluster identifier (required): accepts cluster name or full cluster ARN (e.g., `arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123`)
AWS region (required unless extractable from ARN)
S3 path for report storage (required, e.g., `s3://bucket/prefix`). If the user doesn't have a bucket, create one (e.g., `s3://hyperpod-diagnostics-<account-id>-<region>`)
Issue description (optional)
Target scope: all nodes, specific instance groups, or specific node IDs (optional)
Additional commands to run on nodes (optional)
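
The inputs above can be validated before anything is run. The sketch below shows one way to pull the region out of a cluster ARN, split an S3 path, and derive the suggested default bucket name. All three function names are hypothetical and are not part of the bundled script.

```python
from typing import Optional, Tuple


def region_from_cluster_arn(identifier: str) -> Optional[str]:
    """Return the region embedded in a SageMaker ARN, or None for a plain name."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = identifier.split(":", 5)
    if len(parts) == 6 and parts[0] == "arn" and parts[2] == "sagemaker":
        return parts[3] or None
    return None


def split_s3_path(s3_path: str) -> Tuple[str, str]:
    """Split s3://bucket[/prefix] into (bucket, prefix); the prefix may be empty."""
    if not s3_path.startswith("s3://"):
        raise ValueError(f"expected an s3:// path, got {s3_path!r}")
    bucket, _, prefix = s3_path[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError("S3 path is missing a bucket name")
    return bucket, prefix.strip("/")


def default_bucket_name(account_id: str, region: str) -> str:
    """Name pattern suggested above for users without an existing bucket."""
    return f"hyperpod-diagnostics-{account_id}-{region}"


if __name__ == "__main__":
    arn = "arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123"
    print(region_from_cluster_arn(arn))
    print(split_s3_path("s3://bucket/prefix"))
```

If `region_from_cluster_arn` returns `None`, the identifier is a plain cluster name and the region has to come from the user.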

2. Verify Environment

aws sts get-caller-identity
aws sagemaker describe-cluster --cluster-name <name-or-arn> --region <region>

If the S3 bucket doesn't exist, create it:

aws s3 mb s3://<bucket-name> --region <region>

For EKS clusters (check `Orchestrator.Eks` in describe-cluster output):

Ensure kubectl is installed (`which kubectl`). If missing, install it for the current platform.

Configure kubeconfig using the EKS cluster name from the describe-cluster response:

aws eks update-kubeconfig --name <eks-cluster-name> --region <region>
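
The EKS check above can be sketched against the JSON that `describe-cluster` prints. This is an illustration under two assumptions: that a response without `Orchestrator.Eks` means a Slurm cluster, and that the EKS cluster name is the final segment of an `Orchestrator.Eks.ClusterArn` field. Both function names are hypothetical.

```python
import json
from typing import Optional


def orchestrator_type(describe_output: dict) -> str:
    """Return "eks" when Orchestrator.Eks is present, otherwise "slurm" (assumed)."""
    orchestrator = describe_output.get("Orchestrator") or {}
    return "eks" if "Eks" in orchestrator else "slurm"


def eks_cluster_name(describe_output: dict) -> Optional[str]:
    """Extract the EKS cluster name from Orchestrator.Eks.ClusterArn, if any."""
    eks = (describe_output.get("Orchestrator") or {}).get("Eks") or {}
    # Assumed ARN shape: arn:aws:eks:<region>:<account-id>:cluster/<name>
    _, sep, name = eks.get("ClusterArn", "").rpartition(":cluster/")
    return name if sep and name else None


if __name__ == "__main__":
    # Hand-written sample, not real CLI output.
    sample = json.loads(
        '{"ClusterName": "demo", "Orchestrator": {"Eks": {"ClusterArn": '
        '"arn:aws:eks:us-west-2:123456789012:cluster/demo-eks"}}}'
    )
    print(orchestrator_type(sample), eks_cluster_name(sample))
```

The returned name is what `aws eks update-kubeconfig --name` expects.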

3. Run the Collection Script

uv run scripts/hyperpod_issue_report.py \
  --cluster <cluster-name-or-arn> \
  --region <region> \
  --s3-path s3://<bucket>[/prefix]
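
When the invocation above is driven from another program, passing the arguments as a list avoids shell-quoting problems. The sketch below only assembles the three required flags shown above. `build_collection_command` is a hypothetical helper, and the script path is assumed to be relative to the skill directory.

```python
import shlex
from typing import List


def build_collection_command(cluster: str, region: str, s3_path: str) -> List[str]:
    """Assemble the uv invocation as an argv list (hypothetical helper)."""
    return [
        "uv", "run", "scripts/hyperpod_issue_report.py",
        "--cluster", cluster,
        "--region", region,
        "--s3-path", s3_path,
    ]


if __name__ == "__main__":
    argv = build_collection_command("my-cluster", "us-west-2", "s3://bucket/prefix")
    # Print a shell-safe rendering; hand argv to subprocess.run(argv, check=True) to execute.
    print(shlex.join(argv))
```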

Use `--help` for all options, including `--instance-groups`, `--nodes`, `--command`, `--max-workers`, and `--debug`. Note: `--instance-groups` and `--nodes` are mutually exclusive. Node identifiers accept instance IDs (`i-*`), EKS names (`hyperpod-i-*`), or Slurm names (`ip-*`).

4. Present Results

After collection, the script shows statistics and offers interactive download. Report the S3 location and offer to:

Download the report locally
Help analyze collected diagnostics (see references/collection-details.md for what's in each file)
Prepare a summary for AWS Support

Troubleshooting

See references/troubleshooting.md for error handling, large cluster tuning, and known limitations.