Awesome-omni-skill terway-troubleshooting
Troubleshoot Terway CNI issues in Kubernetes using Kubernetes events and Terway logs. Use when diagnosing "cni plugin not initialized", Pod create/delete failures, or ENI/IPAM problems in Terway (centralized or non-centralized IPAM).
git clone https://github.com/diegosouzapw/awesome-omni-skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/terway-troubleshooting" ~/.claude/skills/diegosouzapw-awesome-omni-skill-terway-troubleshooting && rm -rf "$T"
skills/development/terway-troubleshooting/SKILL.md
Terway Troubleshooting SOP
When to use this Skill
Use this Skill whenever the user:
- Reports "cni plugin not initialized" or similar CNI errors on nodes
- Reports Pod creation or deletion failures in a cluster using Terway as the CNI
- Suspects ENI/IPAM/resource issues related to Terway (centralized or non-centralized)
Always assume the cluster is running Kubernetes and Terway is the CNI plugin.
High-level troubleshooting flow
Follow this order unless the user has already done some steps:
1. Gather cluster-level configuration first
   - Run the cluster configuration inspection script to understand the environment: `./scripts/inspect-terway-cluster.sh`
   - This provides Terway version, IPAM type (centralized vs non-centralized), service CIDR, kube-proxy mode, and key Terway feature flags.
   - Use this information to guide the rest of the troubleshooting flow.
2. Check Terway components health
   - Verify Terway DaemonSet Pod is created and running on the node.
   - If using centralized IPAM (identified in step 1), also verify the Terway controlplane Pod.
3. Inspect the problematic Pod and Node configuration
   - Once you've identified the problematic Pod and its Node:
     - Run `./scripts/inspect-terway-pod.sh <namespace> <pod-name>` to check Pod-level config (hostNetwork, pod-eni, annotation-based config source).
     - Run `./scripts/inspect-terway-node.sh <node-name>` to check Node-level config (ENI mode, dynamic config, LingJun status, ignore-by-terway, no-kube-proxy).
4. Use Kubernetes Events as the primary signal
   - For any problematic Pod, inspect its Events first.
   - Map Terway-specific event reasons to likely causes and next checks.
5. Inspect Terway IPAM / ENI controllers
   - Depending on centralized vs non-centralized IPAM (from step 1), check relevant CRDs and their Events.
6. Only then, inspect logs
   - Use Terway daemon and controlplane logs to deepen analysis when Events are missing or unclear.
Keep answers structured: first restate what has been checked, then propose next verification steps.
Step 1 – Terway and CNI initialization
1. If the user reports "cni plugin not initialized" or similar:
   - Do not immediately blame Terway IPAM logic.
   - First ensure Terway Pods (daemon and, if present, controlplane) are created, scheduled, and running on the node.
   - If Terway Pod is missing:
     - Ask the user to check Mutating/Validating Webhooks, runtime, and CNI configuration (kubelet cni dirs, etc.).
   - If Terway Pod is CrashLooping:
     - Ask for the Pod description/logs and help debug that before going to Pod-level network issues.
2. Only after Terway is confirmed running on the node, proceed to Pod create/delete failures and Events.
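As a rough sketch of the check above, the helper below classifies the Terway daemon Pod on one node from `kubectl get pods --no-headers` output. The helper name `classify_terway_pod` is hypothetical, and the `kube-system` namespace and `app=terway-eniip` label selector in the usage comment are assumptions about a typical install; adjust them to the cluster being debugged.

```shell
# Hypothetical helper: classify the Terway daemon Pod state on one node.
# Reads `kubectl get pods --no-headers` lines on stdin (NAME READY STATUS ...).
classify_terway_pod() {
  awk '
    BEGIN { state = "missing" }
    NF >= 3 && $3 == "CrashLoopBackOff" { state = "crashlooping"; exit }
    NF >= 3 && $3 == "Running" { state = "running"; next }
    NF >= 3 && state == "missing" { state = "not-ready" }
    END { print state }
  '
}

# Usage (namespace and label selector are assumptions, not confirmed by this SOP):
#   kubectl get pods -n kube-system -l app=terway-eniip --no-headers \
#     --field-selector spec.nodeName=<node-name> | classify_terway_pod
```

A result of `missing` points toward the webhook, runtime, and CNI configuration checks; `crashlooping` points toward the Pod description and logs.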
Step 2 – Always start from Kubernetes Events
For any Pod with network-related failures:
1. Inspect Pod Events
   - Instruct the user to run `kubectl describe pod <pod> -n <ns>` and paste relevant Events.
   - Focus on Terway-related reasons (case-sensitive):
     - `AllocIPFailed` (Warning, Pod)
     - `AllocIPSucceed` (Normal, Pod)
     - `VirtualModeChanged` (Warning, Pod)
     - `CniPodCreateError` (Warning, Pod)
     - `CniPodDeleteError` (Warning, Pod)
     - `CniCreateENIError` (Warning, Pod)
     - `CniPodENIDeleteErr` (Warning, Pod)
2. Interpret common Pod event reasons
   - `AllocIPFailed` (Warning, Pod)
     - Means CNI ADD reached Terway backend but IP allocation failed.
     - Likely causes:
       - ENI quota exhausted (`ErrEniPerInstanceLimitExceeded`).
       - VSwitch IP exhaustion (`InvalidVSwitchID.IPNotEnough`, `QuotaExceeded.PrivateIPAddress`).
       - OpenAPI permission or configuration errors.
     - Next checks:
       - Node-level Events on the Node and Node CR (if centralized IPAM).
       - Terway daemon logs around the same time.
   - `AllocIPSucceed` (Normal, Pod)
     - IP allocation succeeded; if the Pod still fails, the issue is likely after IP allocation (datapath setup, routes, iptables, etc.).
   - `VirtualModeChanged` (Warning, Pod)
     - IPvlan datapath is unavailable, Terway falls back to veth.
     - Usually not fatal but indicates kernel or capability problems on the node.
   - `CniPodCreateError` (Warning, Pod)
     - From the controlplane Pod controller. Means Pod create path failed (annotation parsing, PodENI/PodNetworking, vswitch selection, etc.).
     - Ask for the full event message; it usually contains the specific error string.
   - `CniPodDeleteError` (Warning, Pod)
     - Failure in Pod delete cleanup (PodENI/ENI status or detach). Investigate PodENI and Node CR status.
   - `CniCreateENIError` / `CniPodENIDeleteErr` (Warning, Pod)
     - Emitted by the PodENI controller when ENI creation/deletion for the Pod fails. Use PodENI CR Events for more details.
3. If no Terway-specific Events are present
   - Confirm that the Pod is scheduled to a node where Terway is running.
   - Then move to node-level and CRD-level Events.
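The reason-to-interpretation mapping in this step can be sketched as a small lookup. The helper name `explain_pod_event_reason` is hypothetical, and its one-line summaries only paraphrase the bullets in this step; it is an illustration, not part of Terway or of the scripts in this repository.

```shell
# Hypothetical helper: map a Terway Pod event reason (case-sensitive) to the
# interpretation and next check described in this step.
explain_pod_event_reason() {
  case "$1" in
    AllocIPFailed)
      echo "IP allocation failed: check ENI quota, vSwitch IPs, OpenAPI permissions; then Node events and daemon logs" ;;
    AllocIPSucceed)
      echo "IP allocated: look after allocation (datapath setup, routes, iptables)" ;;
    VirtualModeChanged)
      echo "IPvlan unavailable, fell back to veth: check kernel or capabilities on the node" ;;
    CniPodCreateError)
      echo "Pod create path failed in controlplane: read the full event message" ;;
    CniPodDeleteError)
      echo "Pod delete cleanup failed: check PodENI and Node CR status" ;;
    CniCreateENIError|CniPodENIDeleteErr)
      echo "ENI create/delete failed in PodENI controller: check PodENI CR events" ;;
    *)
      echo "not a Terway Pod event reason listed here" ;;
  esac
}

# Usage: list the distinct event reasons for one Pod, then explain each one.
#   kubectl get events -n <ns> --field-selector involvedObject.name=<pod> \
#     -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' | sort -u |
#     while read -r reason; do
#       printf '%s: %s\n' "$reason" "$(explain_pod_event_reason "$reason")"
#     done
```

Because the match is case-sensitive, a reason such as `allocipfailed` falls through to the default branch, which mirrors the case-sensitivity note above.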
Step 3 – Node and Node CR Events
Distinguish between:
- Kubernetes Node object (`corev1.Node`).
- Terway Node CRD (`network.alibabacloud.com/v1beta1 Node`) used in centralized IPAM.
1. On the Kubernetes Node (`corev1.Node`)
   - Important Terway-related event reasons:
     - `AllocIPFailed` (Warning, Node)
       - From local IPAM; indicates ENI/IP issues at node level.
     - `ConfigError` (Warning, Node)
       - From Terway node controllers when `eni-config` or node capabilities are invalid.
   - Use these to distinguish between misconfiguration vs. resource exhaustion.
2. On the Terway Node CRD (centralized IPAM)
   - When centralized IPAM is enabled, a `Node` CR under `network.alibabacloud.com` exists.
   - Terway emits events on this CR for ENI lifecycle and pool operations, using reasons defined in `types/k8s.go`, such as:
     - `CreateENISucceed` / `CreateENIFailed`
     - `AttachENISucceed` / `AttachENIFailed`
     - `DetachENISucceed` / `DetachENIFailed`
     - `DeleteENISucceed` / `DeleteENIFailed`
   - Use these events to answer questions like:
     - Is the IP pool being warmed correctly?
     - Are new ENIs failing to create because of OpenAPI errors or configuration?
3. Link Node events to Pod failures
   - If Pods report `AllocIPFailed` or `CniPodCreateError`, check whether the corresponding Node / Node CR shows ENI/IPAM failures.
   - Use that correlation to explain whether the problem is capacity, config, or bug.
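To narrow Node CR output down to ENI lifecycle failures, a filter along the following lines may help. The helper name `filter_eni_failures` is hypothetical, and the usage comments assume the Node CR is named after the Kubernetes node, which this SOP does not state explicitly.

```shell
# Hypothetical helper: keep only ENI lifecycle failures from event lines on stdin.
# Matches the *Failed reasons listed above; *Succeed events are dropped.
filter_eni_failures() {
  grep -E '(Create|Attach|Detach|Delete)ENIFailed' || true
}

# Usage (<node-name> is a placeholder; CR naming is an assumption):
#   kubectl describe nodes.network.alibabacloud.com <node-name> | filter_eni_failures
#   kubectl get events -A \
#     --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>
```

An empty result from the filter suggests looking at the Kubernetes Node events (`AllocIPFailed`, `ConfigError`) instead.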
Step 4 – Centralized vs non-centralized IPAM behavior
When reasoning about Terway behavior, always clarify which IPAM mode is in use.
1. Detect mode from context
   - Centralized IPAM indicators:
     - Presence of Terway controlplane deployment.
     - CRDs like `podenis.network.alibabacloud.com`, `nodes.network.alibabacloud.com`, `podnetworkings.network.alibabacloud.com`.
     - Helm/config flag `centralizedIPAM: true` or controlplane config with `CentralizedIPAM` set.
   - Non-centralized/local IPAM indicators:
     - IPAM type in `eni-config` is `default`.
     - Node-local IPAM logic in the daemon is responsible for ENI/IP management.
2. If centralized IPAM
   - In addition to Pod and Node events, always consider:
     - PodENI CR (per-pod ENI and IP state): events like `CreateENIFailed`, `AttachENIFailed`, `UpdatePodENIFailed`.
     - Node CR: ENI pool and warmup behavior.
     - PodNetworking CR: Events `SyncPodNetworkingSucceed/Failed` when syncing vswitch lists.
   - For Pod failures:
     - Check Pod Events (Cni* reasons) → PodENI Events → Node CR Events → controlplane logs.
3. If non-centralized IPAM
   - Focus on:
     - Node Events (`AllocIPFailed`, `ConfigError`).
     - `eni-config` ConfigMap correctness (vswitches, security groups, ip_stack, trunk/erdma flags, etc.).
     - Terway daemon logs on the affected node.
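One of the indicators above, the presence of the Terway CRDs, can be checked with a sketch like the following. The helper name `centralized_crd_indicators` is hypothetical. Installed CRDs are only a hint; they do not prove that centralized IPAM is actually enabled.

```shell
# Hypothetical helper: report which centralized-IPAM CRDs appear in
# `kubectl get crd -o name` output read from stdin (one match per line).
centralized_crd_indicators() {
  grep -oE '(podenis|nodes|podnetworkings)\.network\.alibabacloud\.com' | sort -u
}

# Usage:
#   kubectl get crd -o name | centralized_crd_indicators
```

If the output is empty, treat the cluster as likely non-centralized and confirm against the IPAM type in `eni-config`.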
Step 5 – Using logs only when Events are insufficient
1. When to move to logs
   - Events point to a failure but not the exact cause (e.g., only `AllocIPFailed` without OpenAPI error details).
   - There are no Terway Events on the relevant Pod/Node/CR, but the behavior clearly involves Terway.
2. Which logs to inspect
   - Terway daemon logs on the affected node:
     - Look for:
       - The Pod name / namespace.
       - OpenAPI errors (quota, IP shortage, permission issues).
       - Internal errors in ENI/route/datapath setup.
   - Terway controlplane logs (centralized IPAM):
     - Look for:
       - Errors in Pod controller, PodENI controller, Node controller.
       - PodNetworking sync failures.
3. How to combine logs with Events
   - Use Event timestamps and reasons as an index into the logs.
   - Explain to the user:
     - Which event indicates the failure.
     - Which log line confirms the root cause.
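A minimal sketch of using an Event timestamp as an index into the logs: fetch daemon logs starting shortly before the Event, then keep only lines that mention the Pod. The helper name `logs_mentioning` is hypothetical, and the `kube-system` namespace in the usage comment is an assumption.

```shell
# Hypothetical helper: keep only log lines on stdin that contain the given
# fixed string (for example the Pod name taken from the failing Event).
logs_mentioning() {
  grep -F -- "$1" || true
}

# Usage (namespace is an assumption; the timestamp is RFC3339, set slightly
# earlier than the Event so the lines leading up to the failure are included):
#   kubectl logs -n kube-system <terway-daemon-pod> --all-containers \
#     --since-time=<timestamp-shortly-before-event> | logs_mentioning <pod-name>
```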
Utility scripts
Cluster-level configuration
Before starting troubleshooting, gather cluster-wide Terway configuration:
./scripts/inspect-terway-cluster.sh
This script inspects:
- Terway version from the `terway-eniip` DaemonSet image tag
- Service CIDR and IP stack from `ack-cluster-profile` ConfigMap
- Kube-proxy mode (iptables/ipvs) and cluster CIDR from `kube-proxy-worker` ConfigMap
- IPAM type (`crd` for centralized, `default` for non-centralized) from `eni-config` ConfigMap
- Terway feature flags: `enable_eni_trunking`, `enable_erdma`, `vswitch_selection_policy`, `max_pool_size`, `min_pool_size`, etc.
Use this information to determine whether centralized IPAM is enabled and which Terway features are active. This guides the rest of the troubleshooting flow.
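If the script is unavailable, the IPAM type check can be approximated by hand. The sketch below assumes the `eni-config` ConfigMap lives in `kube-system`, stores its settings as JSON under an `eni_conf` key, and names the field `ipam_type`; none of these are confirmed by this document, so verify them against the actual ConfigMap. The helper name `ipam_mode_from_eni_conf` is hypothetical.

```shell
# Hypothetical helper: read eni-config JSON on stdin and map the IPAM type to
# a mode, using the crd/default mapping listed above.
# Assumption: the JSON field is called "ipam_type".
ipam_mode_from_eni_conf() {
  ipam_type=$(grep -o '"ipam_type"[[:space:]]*:[[:space:]]*"[^"]*"' |
    head -n 1 | sed 's/.*"\([^"]*\)"$/\1/')
  case "$ipam_type" in
    crd) echo "centralized" ;;
    default) echo "non-centralized" ;;
    *) echo "unknown" ;;
  esac
}

# Usage (namespace and data key are assumptions):
#   kubectl get configmap eni-config -n kube-system \
#     -o jsonpath='{.data.eni_conf}' | ipam_mode_from_eni_conf
```

The helper prints `unknown` when the field is absent rather than guessing a default, so a missing value is surfaced to the user.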
Node-level configuration
To inspect Terway-related node configuration for a problematic Pod, first identify the Pod's node (for example via `kubectl get pod -o wide`). Then, from the repository root, run:
./scripts/inspect-terway-node.sh <node-name>
This prints ENI mode (shared vs exclusive), node-level dynamic config (`terway-config`), LingJun node flags, `k8s.aliyun.com/ignore-by-terway` and `k8s.aliyun.com/no-kube-proxy` labels, and the ENO API type from the `nodes.network.alibabacloud.com` CR. Use this information as input to the troubleshooting steps above when you have located the Pod's node.
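The label part of this check can be reproduced manually with a sketch like the one below. The helper name `terway_node_labels` is hypothetical; it only reports whether the two labels named above are present and with what value, without interpreting them.

```shell
# Hypothetical helper: from "key=value" label lines on stdin, print the two
# Terway-related node labels named above, if present.
terway_node_labels() {
  grep -E '^k8s\.aliyun\.com/(ignore-by-terway|no-kube-proxy)=' || true
}

# Usage: print node labels one per line as key=value, then filter.
#   kubectl get node <node-name> \
#     -o go-template='{{range $k, $v := .metadata.labels}}{{$k}}={{$v}}{{"\n"}}{{end}}' |
#     terway_node_labels
```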
Pod-level configuration
To inspect Terway-related Pod configuration, run:
./scripts/inspect-terway-pod.sh <namespace> <pod-name>
This checks:
- Whether the Pod uses `hostNetwork` (if true, Terway CNI does not process it).
- Whether the Pod has `k8s.aliyun.com/pod-eni: "true"` annotation (indicating trunk/exclusive ENI mode).
- Which annotation-based config source is used, following the webhook priority order:
  - `k8s.aliyun.com/pod-networks` (explicit pod-networks config)
  - `k8s.aliyun.com/pod-networks-request` (pod-networks-request config)
  - `k8s.aliyun.com/pod-networking` (matched PodNetworking resource)
  - Fallback to `eni-config` default on eth0 if none of the above are set.
Use this to determine if the Pod should be managed by Terway, whether it uses PodENI, and which configuration source drives its ENI/IP allocation.
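The priority order above can be sketched as a first-match lookup over the Pod's annotation keys. The helper name `pod_config_source` is hypothetical, and the sketch only mirrors the order listed in this section; it does not reproduce the webhook's actual implementation.

```shell
# Hypothetical helper: given annotation keys on stdin (one per line), print the
# config source that wins under the priority order listed above.
pod_config_source() {
  annotations=$(cat)
  for key in \
    k8s.aliyun.com/pod-networks \
    k8s.aliyun.com/pod-networks-request \
    k8s.aliyun.com/pod-networking
  do
    # -x matches the whole line, so pod-networks never matches pod-networks-request.
    if printf '%s\n' "$annotations" | grep -qxF -- "$key"; then
      echo "$key"
      return 0
    fi
  done
  echo "eni-config default"
}

# Usage: print the Pod's annotation keys one per line, then resolve the source.
#   kubectl get pod <pod-name> -n <namespace> \
#     -o go-template='{{range $k, $v := .metadata.annotations}}{{$k}}{{"\n"}}{{end}}' |
#     pod_config_source
```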
Response style guidelines
When this Skill is active:
- Always start from Events when diagnosing Pod or node-level issues; do not jump straight into logs unless Events are missing.
- Reference concrete Terway event reasons (e.g., `AllocIPFailed`, `CniPodCreateError`, `CreateENIFailed`) and explain what they mean.
- Ask for specific artifacts when needed:
  - `kubectl describe pod` output for the problematic Pod.
  - Node and Node CR describe output when centralized IPAM is used.
  - Excerpts from Terway daemon/controlplane logs around the relevant time.
- Keep answers structured and concise, but be explicit about next steps (what to inspect next and why).
- Clearly distinguish between configuration issues, resource exhaustion/quota, and potential Terway bugs based on Events and logs.