# gsd-skill-creator: openstack-networking-debug
OpenStack networking debug operations skill for SDN troubleshooting, packet tracing, and flow analysis. Covers OVS/OVN debugging (ovs-vsctl, ovs-ofctl, ovs-appctl, ovn-nbctl, ovn-sbctl, ovn-trace), security group analysis via OVS flow rules and conntrack, DHCP troubleshooting through namespace inspection and dnsmasq diagnostics, floating IP diagnosis with NAT rule and ARP verification, network namespace inspection (ip netns), MTU chain analysis for overlay networks, DNS resolution debugging, and east-west traffic diagnosis. Use when diagnosing network connectivity failures, tracing packets through the SDN stack, or analyzing flow tables in a running OpenStack cloud.
```bash
# Clone the full repository
git clone https://github.com/Tibsfox/gsd-skill-creator

# Or copy just this skill into your local skills directory
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/openstack/networking-debug" \
       ~/.claude/skills/tibsfox-gsd-skill-creator-openstack-networking-debug \
  && rm -rf "$T"
```
`skills/openstack/networking-debug/SKILL.md`

# OpenStack Networking Debug -- SDN Troubleshooting Operations
Networking debug is the most hands-on troubleshooting domain in cloud operations. Virtual networks add multiple abstraction layers between user intent and physical packets -- an instance's traffic passes through a tap device, a Linux bridge or OVS port, integration bridge flows, tunnel encapsulation, and physical NIC before reaching the wire. When connectivity breaks, the operator must trace through every layer to find where packets stop flowing.
The debugging mental model: Start at the instance and trace outward. The packet path for a tenant instance is: instance vNIC -> tap device -> qbr bridge (if OVS with iptables) -> OVS br-int -> tunnel or VLAN tag -> OVS br-ex (for external traffic) -> physical NIC. For OVN, the path simplifies: instance vNIC -> OVS br-int (with OVN flows) -> tunnel or physical port. Every hop is inspectable. Every hop can be the failure point.
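To see this wiring directly on a compute host before any deep tracing, a minimal sketch (assuming the Kolla-Ansible container names used throughout this skill):

```bash
# Enumerate the layers of the packet path on a compute host
docker exec openvswitch_vswitchd ovs-vsctl show   # bridges, ports, tunnel endpoints
ip link | grep -E 'tap|qbr|qvb|qvo'               # instance-side devices (qbr/qvb/qvo appear only with the iptables hybrid driver)
ip netns list                                     # qrouter-/qdhcp- namespaces (OVS backend)
```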
This skill is the primary reference for the CRAFT-network agent when diagnosing connectivity issues during Phase E operations.
## Deploy

### Debug Tooling Setup

Verify all diagnostic tools are available before beginning any debug session.

OVS diagnostic commands (available inside the `openvswitch_vswitchd` container):

```bash
# Verify OVS tools are accessible
docker exec openvswitch_vswitchd ovs-vsctl --version
docker exec openvswitch_vswitchd ovs-ofctl --version
docker exec openvswitch_vswitchd ovs-appctl --version

# Show complete OVS configuration
docker exec openvswitch_vswitchd ovs-vsctl show
```
OVN diagnostic commands (available inside the `ovn_northd` and `ovn_controller` containers):

```bash
# Verify OVN tools
docker exec ovn_northd ovn-nbctl --version
docker exec ovn_northd ovn-sbctl --version

# OVN trace (powerful logical packet tracing)
docker exec ovn_controller ovn-trace --version
```
Network namespace tools (on the host or inside Neutron containers):

```bash
# List all network namespaces
ip netns list
# Expected: qrouter-<id>, qdhcp-<id> (OVS backend)
# OVN uses fewer namespaces (metadata only)
```
Packet capture (tcpdump inside containers or namespaces):

```bash
# Capture on a tap interface (instance-facing)
tcpdump -i tap<port-id-prefix> -n -c 50

# Capture inside a network namespace
ip netns exec qrouter-<router-id> tcpdump -i qr-<port-prefix> -n -c 50

# Capture on the physical NIC
tcpdump -i eth1 -n port 4789   # VXLAN traffic
```
### Kolla-Ansible Debug Container Options

For persistent debug environments, Kolla-Ansible provides tooling containers:

```bash
# Enter the neutron_server container for API-level debugging
docker exec -it neutron_server /bin/bash

# Enter openvswitch_vswitchd for flow-level debugging
docker exec -it openvswitch_vswitchd /bin/bash

# Enter the relevant agent container for namespace access
docker exec -it neutron_l3_agent /bin/bash    # OVS backend
docker exec -it neutron_dhcp_agent /bin/bash  # OVS backend
```
## Configure

### OVS Logging Levels

Adjust OVS logging to capture more detail during active debugging, then restore it to production levels.

```bash
# Increase OVS daemon logging (temporary, resets on restart)
docker exec openvswitch_vswitchd ovs-appctl vlog/set vswitchd:dbg
docker exec openvswitch_vswitchd ovs-appctl vlog/set ofproto:dbg

# Restore production logging
docker exec openvswitch_vswitchd ovs-appctl vlog/set vswitchd:warn
docker exec openvswitch_vswitchd ovs-appctl vlog/set ofproto:warn

# Check current log levels
docker exec openvswitch_vswitchd ovs-appctl vlog/list
```
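When a failure is reproducible, it helps to bracket the reproduction with these log-level changes. A minimal sketch -- the `ovs_debug_run` helper below is hypothetical, not part of OVS:

```bash
# Hypothetical wrapper: raise OVS logging, reproduce, restore
ovs_debug_run() {
  docker exec openvswitch_vswitchd ovs-appctl vlog/set vswitchd:dbg
  docker exec openvswitch_vswitchd ovs-appctl vlog/set ofproto:dbg
  "$@"   # the reproduction step runs with debug logging active
  docker exec openvswitch_vswitchd ovs-appctl vlog/set vswitchd:warn
  docker exec openvswitch_vswitchd ovs-appctl vlog/set ofproto:warn
}

ovs_debug_run ping -c 3 <target-ip>
```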
### OVN Tracing Enablement

`ovn-trace` simulates a packet through the logical pipeline without sending real traffic.

```bash
# Trace a packet from a logical port through OVN
# (newlines inside the quoted expression are treated as whitespace)
docker exec ovn_controller ovn-trace <datapath> \
  'inport == "<logical-port>" && eth.src == <mac> && eth.dst == <mac>
   && ip4.src == <src-ip> && ip4.dst == <dst-ip> && ip.ttl == 64'
```
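For orientation, a filled-in invocation might look like the following; the datapath name, logical port, MACs, and addresses are hypothetical placeholders, not values from any real deployment:

```bash
# Hypothetical values for illustration only
docker exec ovn_controller ovn-trace net-tenant1 \
  'inport == "vm1-port" && eth.src == 52:54:00:aa:bb:01 && eth.dst == 52:54:00:aa:bb:02
   && ip4.src == 10.0.0.5 && ip4.dst == 10.0.0.6 && ip.ttl == 64'
```

The output walks each logical stage (port security, ACLs, L2 lookup), which shows exactly where a packet would be dropped.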
### Neutron Agent Debug Logging

Enable debug logging on individual agents for detailed event tracing.

```bash
# Check current log level
docker exec neutron_server grep -i "debug" /etc/neutron/neutron.conf

# Enable debug via a Kolla-Ansible config override.
# In /etc/kolla/config/neutron/neutron.conf:
#   [DEFAULT]
#   debug = True

# After the config change, reconfigure the service
# kolla-ansible -i inventory reconfigure --tags neutron
```
### Packet Capture Setup

```bash
# Identify the tap device for an instance port
openstack port show <port-id> -c id
# Tap device name: tap<first-11-chars-of-port-id>

# Identify the OVS port number for correlation with flow tables
docker exec openvswitch_vswitchd ovs-vsctl --columns=name,ofport list Interface | grep tap

# Set up continuous capture with rotation (for intermittent issues)
tcpdump -i tap<prefix> -n -w /tmp/capture-%H%M.pcap -G 300 -W 12
```
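Deriving the tap name can be scripted instead of counting characters by hand; a minimal sketch using shell substring expansion:

```bash
# Tap device name = "tap" + first 11 characters of the port UUID
PORT_ID=$(openstack port show <port-id> -f value -c id)
TAP="tap${PORT_ID:0:11}"
echo "$TAP"
```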
### Flow Table Inspection Setup

```bash
# Dump all flow tables on br-int (primary integration bridge)
docker exec openvswitch_vswitchd ovs-ofctl dump-flows br-int

# Dump flows for a specific table (table 0 = ingress classification)
docker exec openvswitch_vswitchd ovs-ofctl dump-flows br-int table=0

# Watch flows in real time (packet/byte counters are shown by default)
watch -n 2 'docker exec openvswitch_vswitchd ovs-ofctl dump-flows br-int'
```
## Operate

### Connectivity Diagnosis Workflow

Scenario: Instance cannot reach the external network.

Step-by-step trace from instance outward (a wrap-up script follows the list):

1. Verify the instance has an IP: `openstack server show <instance> -c addresses`
2. Check the port is bound: `openstack port show <port-id> -c binding_vif_type` -- must be `ovs` or `ovn`, not `binding_failed`
3. Check the tap device exists: `ip link show tap<port-prefix>` -- if missing, the port was not wired by the agent
4. Check OVS port attachment: `docker exec openvswitch_vswitchd ovs-vsctl list-ports br-int | grep <port-prefix>`
5. Trace through br-int flows: `docker exec openvswitch_vswitchd ovs-ofctl dump-flows br-int | grep <port-tag>` -- look for matching ingress/egress rules
6. Check the router namespace (OVS): `ip netns exec qrouter-<router-id> ip route` -- verify the default route points to the external gateway
7. Check br-ex configuration: `docker exec openvswitch_vswitchd ovs-vsctl list-ports br-ex` -- the physical NIC must be attached
8. Check the physical NIC: `ip link show <nic>` -- verify it is UP, check for errors with `ip -s link show <nic>`
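The instance-side hops (steps 1-4) are mechanical enough to script. A minimal sketch -- the helper is hypothetical and assumes the Kolla container names used above:

```bash
#!/usr/bin/env bash
# Hypothetical trace helper: verify instance-side wiring for one port
PORT_ID="$1"
TAP="tap${PORT_ID:0:11}"

openstack port show "$PORT_ID" -f value -c binding_vif_type \
  || { echo "FAIL: port not found"; exit 1; }

ip link show "$TAP" >/dev/null 2>&1 \
  || { echo "FAIL: tap device $TAP missing (port not wired)"; exit 1; }

docker exec openvswitch_vswitchd ovs-vsctl list-ports br-int | grep -q "$TAP" \
  || { echo "FAIL: $TAP not attached to br-int"; exit 1; }

echo "OK: instance-side wiring intact; continue with flow and namespace checks"
```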
### DHCP Troubleshooting

Scenario: Instance gets no IP address. (A look at the DHCP agent's on-disk state follows the list.)

1. Check the DHCP agent (OVS): `openstack network agent list | grep dhcp` -- must show `alive` and `UP`
2. Check the DHCP namespace: `ip netns exec qdhcp-<network-id> ps aux | grep dnsmasq` -- dnsmasq must be running
3. Capture DHCP traffic: `ip netns exec qdhcp-<network-id> tcpdump -i tap<dhcp-port-prefix> -n -c 20 'port 67 or port 68'`
4. Run `openstack server reboot <instance>` to trigger a DHCP request
5. Look for: DHCP Discover (from instance), DHCP Offer (from dnsmasq), DHCP Request, DHCP Ack
   - Missing Discover: instance network stack or tap device issue
   - Discover but no Offer: dnsmasq config or port mismatch
6. Check the lease file: `ip netns exec qdhcp-<network-id> cat /var/lib/neutron/dhcp/<network-id>/leases`
7. OVN DHCP: `docker exec ovn_northd ovn-nbctl list DHCP_Options` -- verify DHCP options are programmed for the subnet
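When dnsmasq runs but never answers, its per-network state files are worth reading. A sketch, assuming the standard Neutron DHCP agent state directory (file names may vary by release):

```bash
# Per-network dnsmasq state kept by the DHCP agent
docker exec neutron_dhcp_agent ls /var/lib/neutron/dhcp/<network-id>/
# host   -> MAC,IP,hostname entries dnsmasq will answer for
# opts   -> per-port DHCP options (routes, DNS servers, MTU)
# leases -> currently active leases
```

If the instance's MAC is absent from the host file, the agent never learned about the port, and an agent restart or resync is the next step.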
### Floating IP Diagnosis

Scenario: Cannot reach instance from external network via floating IP. (An example of a healthy NAT entry follows the list.)

1. Verify floating IP assignment: `openstack floating ip show <fip>` -- check the `fixed_ip_address` and `floating_ip_address` fields, confirm `port_id` is set
2. Check the router namespace (OVS): `ip netns exec qrouter-<router-id> ip addr show` -- the floating IP must appear on the `qg-<port>` interface
3. Check NAT rules: `ip netns exec qrouter-<router-id> iptables -t nat -L -n -v` -- look for the DNAT rule mapping the floating IP to the fixed IP
4. Check ARP on the external network: `arping -I <external-iface> <floating-ip>` -- if no response, the L3 agent is not answering ARP for this IP
5. Check security groups: the instance port's security groups must allow the traffic (ICMP for ping, TCP 22 for SSH)
6. OVN NAT: `docker exec ovn_northd ovn-nbctl lr-nat-list <router-name>` -- verify the `dnat_and_snat` entry exists
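For reference, a healthy DNAT entry in the qrouter namespace looks roughly like the commented line below; the chain name reflects the L3 agent's usual layout, and the addresses are illustrative:

```bash
ip netns exec qrouter-<router-id> iptables -t nat -S | grep DNAT
# Expect something like:
# -A neutron-l3-agent-PREROUTING -d 203.0.113.10/32 -j DNAT --to-destination 10.0.0.5
```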
### Security Group Analysis

Scenario: Traffic blocked that should be allowed. (A conntrack cleanup sketch follows the list.)

1. List applied rules: `openstack security group rule list <group> --long` -- check protocol, port range, direction, remote prefix
2. Check port security is enabled: `openstack port show <port-id> -c port_security_enabled` -- if `False`, security groups are bypassed entirely
3. Decode OVS flows: `docker exec openvswitch_vswitchd ovs-ofctl dump-flows br-int | grep <port-tag>` -- match flow rules against security group rules
4. Check conntrack state: security groups are stateful, so existing connections persist after rule changes. Check with `conntrack -L | grep <instance-ip>`
5. Force a flow resync: restart the OVS agent to regenerate all flows: `docker restart neutron_openvswitch_agent`
6. OVN ACLs: `docker exec ovn_northd ovn-nbctl acl-list <logical-switch>` -- verify ACLs match security group intent
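Because of that statefulness, a tightened rule does not cut already-established flows; deleting the relevant conntrack entries forces re-evaluation against the current rules. A minimal sketch:

```bash
# Drop established conntrack state involving the instance
conntrack -D -s <instance-ip>   # entries sourced from the instance
conntrack -D -d <instance-ip>   # entries destined to the instance
```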
### MTU Issues

Scenario: Large packets fail -- SSH works but SCP stalls, HTTP transfers hang.

1. Check physical NIC MTU: `ip link show <nic> | grep mtu` -- note the value (typically 1500 or 9000)
2. Check overlay overhead: VXLAN = 50 bytes, GRE = 42 bytes. Tenant MTU = physical MTU - overhead
3. Test effective MTU: `ping -M do -s 1400 <target>` -- decrease the size until it works; that is the effective MTU (a bisection sketch follows this list)
4. Check DHCP-advertised MTU: `openstack subnet show <subnet> -c mtu` -- must match the calculated tenant MTU
5. Check instance MTU: inside the instance, `ip link show eth0 | grep mtu` -- must match the subnet MTU
6. Fix: set `neutron_mtu` in `globals.yml` to match the physical MTU, then run `kolla-ansible reconfigure --tags neutron`
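Step 3 can be automated as a bisection over the ICMP payload size; a minimal sketch (payload plus 28 bytes of IP and ICMP headers gives the on-wire packet size):

```bash
# Bisect the largest ICMP payload that survives with DF set
lo=1200; hi=9000
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(((lo + hi) / 2))
  if ping -M do -c 1 -W 1 -s "$mid" <target> >/dev/null 2>&1; then
    lo=$mid   # this size fits
  else
    hi=$mid   # this size is dropped or needs fragmentation
  fi
done
echo "effective path MTU: $((lo + 28)) bytes"
```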
### DNS Resolution

Scenario: Instance cannot resolve hostnames.

1. Check DHCP options: `openstack subnet show <subnet> -c dns_nameservers` -- must have at least one DNS server
2. Check inside the instance: `cat /etc/resolv.conf` -- should list the DNS server from the DHCP options
3. Check DNS connectivity: `ip netns exec qdhcp-<network-id> nslookup google.com <dns-server>` -- tests DNS from the DHCP namespace
4. Check the metadata agent: DNS may fail if the metadata service is also broken -- run `curl http://169.254.169.254/latest/meta-data/` from inside the instance
## Troubleshoot

### Instance Has No Network Connectivity

Symptoms: Instance boots, may or may not have an IP, cannot reach the gateway or other instances.

Resolution steps:

1. Run `openstack port list --server <instance>` -- get the port ID
2. Check `binding_vif_type` -- if `binding_failed`, check agent logs: `docker logs neutron_openvswitch_agent --tail 100`
3. Verify the tap device exists on the host: `ip link show tap<port-prefix>`
4. If the tap is missing: restart the OVS agent or the instance, and check whether the compute host matches the port's `binding_host_id`
5. If the tap exists but there is no connectivity: dump flows and trace the packet path through OVS
6. Check for a wrong VLAN tag: `docker exec openvswitch_vswitchd ovs-vsctl get port tap<prefix> tag` -- compare with the expected network segmentation ID
7. Check namespace routing: `ip netns exec qrouter-<id> ip route` -- a missing default route means no external connectivity
### Floating IP Unreachable from External

Symptoms: Floating IP assigned in OpenStack but not reachable from the external network.

Resolution steps:

1. Verify the router has an external gateway: `openstack router show <router> -c external_gateway_info`
2. Check the qrouter namespace exists: `ip netns list | grep qrouter`
3. If the namespace is missing: restart the L3 agent: `docker restart neutron_l3_agent`
4. Check NAT rules in the namespace: `ip netns exec qrouter-<id> iptables -t nat -S` -- DNAT and SNAT rules must exist
5. If NAT rules are absent: run `openstack floating ip set --port <port-id> <fip>` to reassociate
6. Check ARP resolution: `arping -c 3 -I <external-iface> <floating-ip>` from a machine on the external network
7. If ARP fails: check that br-ex has the physical interface: `docker exec openvswitch_vswitchd ovs-vsctl list-ports br-ex`
### Inter-Tenant Traffic Leaking

Symptoms: Instances in different projects can communicate when they should not.

Resolution steps (a comparison sketch follows the list):

1. Check VXLAN/VLAN segmentation IDs: `openstack network show <net1> -c provider:segmentation_id` and compare with net2 -- a collision means a shared L2 domain
2. Check security groups on both ports: `openstack port show <port> -c security_group_ids` -- default groups deny cross-tenant ingress
3. Inspect OVS flow tables: `docker exec openvswitch_vswitchd ovs-ofctl dump-flows br-int` -- look for flows that bridge between different tunnel IDs
4. Check for shared networks: `openstack network show <net> -c shared` -- shared networks are accessible across projects by design
5. Flow table corruption: restart the OVS agent to force a full flow resync: `docker restart neutron_openvswitch_agent`
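Step 1 is quick to script for any pair of networks; a minimal sketch:

```bash
# Compare network type and segmentation ID for two networks
for net in <net1> <net2>; do
  echo "== $net"
  openstack network show "$net" -c provider:network_type -c provider:segmentation_id
done
```

Identical segmentation IDs only indicate a shared L2 domain when both networks use the same tunnel type or physical network, which is why the type is printed alongside the ID.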
### DHCP Failures

Symptoms: Instance boots without an IP address, gets the wrong IP, or IP assignment is delayed.

Resolution steps:

1. Check agent status: `openstack network agent list | grep dhcp` -- must be alive
2. Check the namespace: `ip netns list | grep qdhcp-<network-id>` -- if missing, restart the DHCP agent
3. Check the dnsmasq process: `ip netns exec qdhcp-<network-id> ps aux | grep dnsmasq` -- if not running, check agent logs
4. Check for a port mismatch: compare `openstack port list --network <net> --device-owner network:dhcp` with the dnsmasq config file
5. OVN: `docker exec ovn_northd ovn-nbctl list DHCP_Options` -- verify subnet options exist and contain the correct CIDR
### Metadata Service Unreachable

Symptoms: Instance cannot reach 169.254.169.254, cloud-init fails, SSH key injection fails.

Resolution steps (a proxy probe sketch follows the list):

1. Check the metadata agent: `docker ps | grep metadata` -- the container must be running
2. Check metadata proxy routing: `ip netns exec qrouter-<id> iptables -t nat -S | grep 169.254` -- a DNAT rule must redirect metadata requests
3. Check the Nova metadata API: `curl http://localhost:8775/` from the controller -- must return the metadata API version list
4. Check the Neutron metadata proxy config: `docker exec neutron_metadata_agent cat /etc/neutron/metadata_agent.ini | grep nova_metadata`
5. OVN: metadata is served by `neutron_ovn_metadata_agent` running in a namespace on the chassis hosting the instance
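To separate a dead proxy from a dead Nova API, the proxy listener can be probed directly inside the qrouter namespace. A sketch assuming the proxy listens on its customary port 9697 (confirm the actual port from the redirect rule found in step 2); an HTTP error response is fine here, while a refused connection means the proxy is not running:

```bash
# Probe the metadata proxy inside the router namespace (OVS backend)
ip netns exec qrouter-<router-id> curl -sv -m 5 http://127.0.0.1:9697/
```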
### East-West Traffic Between Instances Fails

Symptoms: Instances on the same or different subnets cannot communicate.

Resolution steps (a flow-simulation sketch follows the list):

1. Same subnet: traffic stays on br-int. Check that both ports are on the same VLAN tag: `docker exec openvswitch_vswitchd ovs-vsctl get port tap<prefix1> tag` vs `tap<prefix2>`
2. Different subnets: traffic routes through the router namespace. Check `ip netns exec qrouter-<id> ip route` -- both subnets must be present
3. Check security groups: default egress is allow-all, but default ingress denies all. Both instances need rules permitting each other's traffic
4. Check ARP tables inside namespaces: `ip netns exec qrouter-<id> arp -n` -- missing entries indicate L2 reachability problems
5. Test from the namespace: `ip netns exec qrouter-<id> ping <instance-ip>` -- if this works, the issue is between the namespace and the instance
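Beyond live pings, the installed flow tables can be exercised without sending any traffic at all using OVS's `ofproto/trace`; a minimal sketch with placeholder values for the OpenFlow port number and addresses:

```bash
# Simulate an east-west frame through br-int's installed flows
docker exec openvswitch_vswitchd ovs-appctl ofproto/trace br-int \
  in_port=<ofport>,dl_src=<mac1>,dl_dst=<mac2>,ip,nw_src=<ip1>,nw_dst=<ip2>
```

The trace prints each table lookup and the final action, so a drop or an unexpected output port is visible immediately.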
## Integration Points
- Neutron skill: Core networking knowledge. Networking-debug extends Neutron's troubleshooting section with deeper diagnostic procedures and systematic trace workflows. Load Neutron skill first for architecture context, then networking-debug for active troubleshooting.
- Security skill: Security group analysis is a core part of network debugging. When traffic is unexpectedly blocked, cross-reference security group rules with OVS flow tables. Security skill provides the policy layer; this skill provides the enforcement inspection layer.
- Monitoring skill: Network metrics (packet drops, interface errors, latency, bandwidth utilization) guide where to start debugging. High packet drops on br-int suggest flow issues. High latency on tunnel interfaces suggests overlay problems. Check monitoring dashboards before diving into packet traces.
- Capacity skill: Network resource exhaustion causes connectivity failures. Floating IP pool depletion prevents external access. Port quota exhaustion prevents new instances from getting network connectivity. Subnet address pool exhaustion prevents DHCP assignment. When debugging connectivity, verify resource availability first.
- Nova skill: Instance networking depends on Nova's port binding workflow. When an instance has no connectivity, the first question is whether Nova successfully requested a port from Neutron and whether the compute host wired the tap device. Nova logs contain the port binding request; Neutron logs contain the binding response.
- CRAFT-network agent: Primary consumer of this skill. The CRAFT-network agent activates networking-debug when keywords like "connectivity," "packet trace," "flow analysis," or "debug" appear in the problem context. The agent uses the systematic diagnostic workflows in this skill to trace issues methodically.
## NASA SE Cross-References
| SE Phase | Networking Debug Activity | Reference |
|---|---|---|
| Phase D (Integration & Test) | Network integration testing: verify end-to-end connectivity through the SDN stack, validate security group enforcement, confirm DHCP assignment across all network types, test floating IP reachability from external networks. Each test exercises a different segment of the packet path. | SP-6105 § 5.2 (Product Integration -- service interface verification) |
| Phase E (Operations) | Operational network troubleshooting: diagnose connectivity failures using the systematic trace workflows in this skill. Every troubleshooting procedure follows the "observe symptom, form hypothesis, test hypothesis, resolve or escalate" pattern from NASA's anomaly resolution process. | SP-6105 § 5.4 (Product Validation -- operational environment verification) |
| Phase E (Sustainment) | Network configuration changes during operations: MTU adjustments, security group updates, new network creation. Each change requires verification using the diagnostic procedures in this skill to confirm the change achieved the intended effect without side effects. | NPR 7123.1 § 5.4 (Sustainment -- operational baseline management) |