Robotics-agent-skills robot-bringup
git clone https://github.com/arpitg1304/robotics-agent-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/arpitg1304/robotics-agent-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/robot-bringup" ~/.claude/skills/arpitg1304-robotics-agent-skills-robot-bringup && rm -rf "$T"
skills/robot-bringup/SKILL.mdRobot Bringup Skill
When to Use This Skill
- Configuring a robot to automatically start its full ROS2 stack on boot via systemd
- Writing systemd unit files that correctly source ROS2 workspaces and set DDS environment
- Composing layered launch files (hardware, drivers, perception, application) into a single bringup
- Setting up ordered startup with health checks to avoid race conditions between dependent nodes
- Writing udev rules for deterministic device naming of cameras, LiDARs, and serial devices
- Configuring CycloneDDS or FastDDS for multi-machine ROS2 discovery across robot and base station
- Implementing watchdog and heartbeat monitoring for production robot systems
- Setting up log rotation and structured logging for long-running robot deployments
- Writing graceful shutdown handlers that bring actuators to a safe state before exit
- Debugging boot-time failures, service ordering issues, or device enumeration races
The Robot Bringup Stack
A production robot bringup follows a layered startup sequence from hardware initialization through application-level nodes. Each layer depends on the one below it.
┌─────────────────────────────────────────────────────────────────────┐ │ APPLICATION LAYER │ │ Navigation, manipulation, mission planning, HRI │ ├─────────────────────────────────────────────────────────────────────┤ │ PERCEPTION LAYER │ │ Object detection, SLAM, point cloud filtering, sensor fusion │ ├─────────────────────────────────────────────────────────────────────┤ │ DRIVER LAYER │ │ Camera drivers, LiDAR drivers, motor controllers, IMU │ ├─────────────────────────────────────────────────────────────────────┤ │ HARDWARE LAYER │ │ udev rules, device enumeration, USB reset, firmware check │ ├─────────────────────────────────────────────────────────────────────┤ │ ROS2 ENVIRONMENT │ │ Source workspace, set RMW, ROS_DOMAIN_ID, DDS config │ ├─────────────────────────────────────────────────────────────────────┤ │ SYSTEMD TARGETS & SERVICES │ │ network-online.target → robot-hw.target → robot-bringup.target │ ├─────────────────────────────────────────────────────────────────────┤ │ LINUX BOOT (systemd) │ │ BIOS/UEFI → GRUB → kernel → systemd init │ ├─────────────────────────────────────────────────────────────────────┤ │ HARDWARE BOOT │ │ Power supply, onboard computer, peripherals │ └─────────────────────────────────────────────────────────────────────┘
systemd Service Units for ROS2
Basic ROS2 Service Unit
Place service files in
/etc/systemd/system/. This template starts a ROS2 launch file as a long-running service with watchdog support.
# /etc/systemd/system/robot-bringup.service [Unit] Description=Robot ROS2 Bringup Stack Documentation=https://github.com/my-org/my-robot After=network-online.target robot-hw.target Wants=network-online.target Requires=robot-hw.target [Service] Type=notify User=robot Group=robot WorkingDirectory=/home/robot # Load ROS2 environment variables from a dedicated env file EnvironmentFile=/etc/robot/ros2.env # Pre-start check: verify critical devices exist ExecStartPre=/usr/local/bin/robot-device-check.sh # Start the ROS2 launch file via bash so we can source the workspace ExecStart=/bin/bash -c '\ source /opt/ros/${ROS_DISTRO}/setup.bash && \ source /home/robot/ros2_ws/install/setup.bash && \ exec ros2 launch my_robot_bringup bringup.launch.py' # Graceful shutdown: send SIGINT first (Ctrl+C equivalent for ROS2) ExecStop=/bin/kill -INT $MAINPID TimeoutStopSec=30 # Restart on failure, but not on clean exit Restart=on-failure RestartSec=5 # systemd watchdog: service must call sd_notify(WATCHDOG=1) within this interval WatchdogSec=30 # Process management KillMode=mixed KillSignal=SIGINT FinalKillSignal=SIGKILL TimeoutStartSec=60 # Logging StandardOutput=journal StandardError=journal SyslogIdentifier=robot-bringup [Install] WantedBy=multi-user.target
Environment Setup in systemd
Store environment variables in a dedicated file rather than sourcing .bashrc (which is not loaded by systemd).
# /etc/robot/ros2.env # ROS2 distribution ROS_DISTRO=humble # DDS middleware selection RMW_IMPLEMENTATION=rmw_cyclonedds_cpp # Domain isolation: unique per robot to avoid cross-talk ROS_DOMAIN_ID=42 # CycloneDDS configuration file path CYCLONEDDS_URI=file:///etc/robot/cyclonedds.xml # Disable localhost-only mode for multi-machine setups ROS_LOCALHOST_ONLY=0 # Logging configuration ROS_LOG_DIR=/var/log/ros2 RCUTILS_LOGGING_USE_STDOUT=0 RCUTILS_COLORIZED_OUTPUT=0 # Robot-specific configuration ROBOT_NAME=my_robot_01 ROBOT_CONFIG_DIR=/etc/robot/config
Dependencies Between Services
Split the robot stack into multiple systemd services with explicit ordering. This allows independent restart of layers and clearer failure isolation.
# /etc/systemd/system/robot-drivers.service [Unit] Description=Robot Hardware Drivers (cameras, LiDAR, IMU, motors) After=network-online.target robot-hw.target Wants=network-online.target Requires=robot-hw.target [Service] Type=notify User=robot EnvironmentFile=/etc/robot/ros2.env ExecStart=/bin/bash -c '\ source /opt/ros/${ROS_DISTRO}/setup.bash && \ source /home/robot/ros2_ws/install/setup.bash && \ exec ros2 launch my_robot_bringup drivers.launch.py' Restart=on-failure RestartSec=5 WatchdogSec=30 KillMode=mixed KillSignal=SIGINT TimeoutStopSec=20 StandardOutput=journal SyslogIdentifier=robot-drivers [Install] WantedBy=robot-bringup.target
# /etc/systemd/system/robot-perception.service [Unit] Description=Robot Perception Stack (SLAM, detection, sensor fusion) After=robot-drivers.service Requires=robot-drivers.service PartOf=robot-drivers.service [Service] Type=notify User=robot EnvironmentFile=/etc/robot/ros2.env ExecStart=/bin/bash -c '\ source /opt/ros/${ROS_DISTRO}/setup.bash && \ source /home/robot/ros2_ws/install/setup.bash && \ exec ros2 launch my_robot_bringup perception.launch.py' Restart=on-failure RestartSec=5 WatchdogSec=30 KillMode=mixed KillSignal=SIGINT TimeoutStopSec=20 StandardOutput=journal SyslogIdentifier=robot-perception [Install] WantedBy=robot-bringup.target
# /etc/systemd/system/robot-application.service [Unit] Description=Robot Application Layer (navigation, planning, HRI) After=robot-perception.service Requires=robot-perception.service PartOf=robot-perception.service [Service] Type=notify User=robot EnvironmentFile=/etc/robot/ros2.env ExecStart=/bin/bash -c '\ source /opt/ros/${ROS_DISTRO}/setup.bash && \ source /home/robot/ros2_ws/install/setup.bash && \ exec ros2 launch my_robot_bringup application.launch.py' Restart=on-failure RestartSec=10 WatchdogSec=30 KillMode=mixed KillSignal=SIGINT TimeoutStopSec=20 StandardOutput=journal SyslogIdentifier=robot-application [Install] WantedBy=robot-bringup.target
Restart Policies and Failure Recovery
Configure rate limiting to prevent restart loops when a service is fundamentally broken (e.g., missing device, configuration error).
# Add to the [Service] section of any robot service Restart=on-failure RestartSec=5 # Allow at most 5 restart attempts within 120 seconds StartLimitIntervalSec=120 StartLimitBurst=5 # Ramp up restart delay to avoid thrashing # RestartSec can also be set dynamically via drop-in overrides: # RestartSec=5 (first few retries, fast recovery) # After StartLimitBurst is hit, the unit enters failed state # Use systemctl reset-failed robot-drivers.service to retry # On final failure, trigger an alert OnFailure=robot-alert@%n.service
Resource Limits and cgroups
Constrain resource usage to prevent a runaway node from starving the rest of the system.
# Add to the [Service] section # Limit memory to 2 GB (hard kill at 2.5 GB) MemoryMax=2G MemoryHigh=1800M # Limit CPU to 300% (3 cores on a multi-core system) CPUQuota=300% # Set real-time scheduling priority for time-critical drivers # Requires the user to have rtprio permissions in /etc/security/limits.d/ Nice=-5 IOSchedulingClass=realtime IOSchedulingPriority=0 # Restrict filesystem access ProtectHome=read-only ProtectSystem=strict ReadWritePaths=/var/log/ros2 /tmp PrivateTmp=true
Launch File Composition and Layering
Launch Layer Architecture
Organize launch files into layers that mirror the systemd service architecture. Each layer is an independent launch file that can be tested in isolation.
bringup.launch.py (top-level: composes all layers) ├── hardware.launch.py (udev checks, device readiness) ├── drivers.launch.py (camera, LiDAR, IMU, motor drivers) │ ├── camera.launch.py │ ├── lidar.launch.py │ └── motors.launch.py ├── perception.launch.py (SLAM, detection, fusion) │ ├── slam.launch.py │ └── detection.launch.py └── application.launch.py (navigation, planning, HRI) ├── navigation.launch.py └── mission.launch.py
Hardware Layer Launch
# my_robot_bringup/launch/hardware.launch.py from launch import LaunchDescription from launch.actions import LogInfo, ExecuteProcess, TimerAction from launch.conditions import IfCondition from launch.substitutions import LaunchConfiguration, EnvironmentVariable def generate_launch_description(): # Declare arguments for hardware configuration robot_name = LaunchConfiguration('robot_name', default=EnvironmentVariable('ROBOT_NAME', default_value='default_robot')) # Check that critical devices are present check_camera = ExecuteProcess( cmd=['test', '-e', '/dev/robot/camera_front'], name='check_camera_front', output='screen', ) check_lidar = ExecuteProcess( cmd=['test', '-e', '/dev/robot/lidar'], name='check_lidar', output='screen', ) check_imu = ExecuteProcess( cmd=['test', '-e', '/dev/robot/imu'], name='check_imu', output='screen', ) log_ready = TimerAction( period=2.0, actions=[LogInfo(msg='Hardware checks passed, devices ready')], ) return LaunchDescription([ check_camera, check_lidar, check_imu, log_ready, ])
Driver Layer Launch
# my_robot_bringup/launch/drivers.launch.py from launch import LaunchDescription from launch.actions import DeclareLaunchArgument, IncludeLaunchDescription, GroupAction from launch.launch_description_sources import PythonLaunchDescriptionSource from launch.substitutions import LaunchConfiguration, PathJoinSubstitution from launch_ros.actions import Node, SetRemap from launch_ros.substitutions import FindPackageShare def generate_launch_description(): use_sim = LaunchConfiguration('use_sim', default='false') camera_config = LaunchConfiguration('camera_config', default='default') # Camera driver camera_node = Node( package='usb_cam', executable='usb_cam_node_exe', name='camera_front', parameters=[PathJoinSubstitution([ FindPackageShare('my_robot_bringup'), 'config', 'camera_front.yaml' ])], remappings=[('/image_raw', '/camera/front/image_raw')], ) # LiDAR driver lidar_node = Node( package='sllidar_ros2', executable='sllidar_node', name='lidar', parameters=[{ 'serial_port': '/dev/robot/lidar', 'serial_baudrate': 460800, 'frame_id': 'lidar_link', 'angle_compensate': True, }], ) # IMU driver imu_node = Node( package='imu_driver', executable='imu_node', name='imu', parameters=[{ 'port': '/dev/robot/imu', 'frame_id': 'imu_link', 'publish_rate': 100.0, }], ) # Motor controller driver motor_node = Node( package='motor_driver', executable='motor_controller_node', name='motor_controller', parameters=[PathJoinSubstitution([ FindPackageShare('my_robot_bringup'), 'config', 'motors.yaml' ])], ) return LaunchDescription([ DeclareLaunchArgument('use_sim', default_value='false'), DeclareLaunchArgument('camera_config', default_value='default'), camera_node, lidar_node, imu_node, motor_node, ])
Perception Layer Launch
# my_robot_bringup/launch/perception.launch.py from launch import LaunchDescription from launch.actions import DeclareLaunchArgument, GroupAction from launch.conditions import IfCondition from launch.substitutions import LaunchConfiguration, PathJoinSubstitution from launch_ros.actions import Node, ComposableNodeContainer, LoadComposableNode from launch_ros.descriptions import ComposableNode from launch_ros.substitutions import FindPackageShare def generate_launch_description(): enable_slam = LaunchConfiguration('enable_slam', default='true') enable_detection = LaunchConfiguration('enable_detection', default='true') # Use a composable node container for zero-copy perception pipeline perception_container = ComposableNodeContainer( name='perception_container', namespace='', package='rclcpp_components', executable='component_container_mt', composable_node_descriptions=[ ComposableNode( package='image_proc', plugin='image_proc::RectifyNode', name='rectify', remappings=[('image', '/camera/front/image_raw')], ), ComposableNode( package='my_detection', plugin='my_detection::DetectorNode', name='detector', parameters=[PathJoinSubstitution([ FindPackageShare('my_robot_bringup'), 'config', 'detector.yaml' ])], ), ], condition=IfCondition(enable_detection), ) # SLAM node slam_node = Node( package='slam_toolbox', executable='async_slam_toolbox_node', name='slam', parameters=[PathJoinSubstitution([ FindPackageShare('my_robot_bringup'), 'config', 'slam.yaml' ])], condition=IfCondition(enable_slam), ) return LaunchDescription([ DeclareLaunchArgument('enable_slam', default_value='true'), DeclareLaunchArgument('enable_detection', default_value='true'), perception_container, slam_node, ])
Application Layer Launch
# my_robot_bringup/launch/application.launch.py from launch import LaunchDescription from launch.actions import DeclareLaunchArgument, IncludeLaunchDescription from launch.launch_description_sources import PythonLaunchDescriptionSource from launch.substitutions import LaunchConfiguration, PathJoinSubstitution from launch_ros.actions import Node from launch_ros.substitutions import FindPackageShare def generate_launch_description(): nav_params = LaunchConfiguration('nav_params', default=PathJoinSubstitution([ FindPackageShare('my_robot_bringup'), 'config', 'nav2_params.yaml' ])) # Include Nav2 bringup nav2_bringup = IncludeLaunchDescription( PythonLaunchDescriptionSource(PathJoinSubstitution([ FindPackageShare('nav2_bringup'), 'launch', 'bringup_launch.py' ])), launch_arguments={ 'params_file': nav_params, 'use_sim_time': LaunchConfiguration('use_sim', default='false'), }.items(), ) # Mission planner mission_node = Node( package='my_mission', executable='mission_planner', name='mission_planner', parameters=[PathJoinSubstitution([ FindPackageShare('my_robot_bringup'), 'config', 'mission.yaml' ])], ) return LaunchDescription([ DeclareLaunchArgument('nav_params', default_value=''), DeclareLaunchArgument('use_sim', default_value='false'), nav2_bringup, mission_node, ])
Top-Level Bringup Launch
This launch file composes all layers into a single entry point, with conditional arguments for simulation vs. real hardware and robot variant selection.
# my_robot_bringup/launch/bringup.launch.py from launch import LaunchDescription from launch.actions import ( DeclareLaunchArgument, IncludeLaunchDescription, GroupAction, LogInfo, TimerAction, ) from launch.conditions import IfCondition, UnlessCondition from launch.launch_description_sources import PythonLaunchDescriptionSource from launch.substitutions import ( LaunchConfiguration, PathJoinSubstitution, PythonExpression, ) from launch_ros.actions import PushRosNamespace, SetParameter from launch_ros.substitutions import FindPackageShare def generate_launch_description(): pkg_share = FindPackageShare('my_robot_bringup') use_sim = LaunchConfiguration('use_sim') robot_variant = LaunchConfiguration('robot_variant') enable_perception = LaunchConfiguration('enable_perception') enable_navigation = LaunchConfiguration('enable_navigation') # Hardware layer (skip in simulation) hardware_launch = GroupAction( actions=[ LogInfo(msg='Starting hardware layer...'), IncludeLaunchDescription( PythonLaunchDescriptionSource( PathJoinSubstitution([pkg_share, 'launch', 'hardware.launch.py']) ), ), ], condition=UnlessCondition(use_sim), ) # Driver layer drivers_launch = GroupAction( actions=[ LogInfo(msg='Starting driver layer...'), IncludeLaunchDescription( PythonLaunchDescriptionSource( PathJoinSubstitution([pkg_share, 'launch', 'drivers.launch.py']) ), launch_arguments={'use_sim': use_sim}.items(), ), ], ) # Perception layer (conditional) perception_launch = GroupAction( actions=[ LogInfo(msg='Starting perception layer...'), IncludeLaunchDescription( PythonLaunchDescriptionSource( PathJoinSubstitution([pkg_share, 'launch', 'perception.launch.py']) ), ), ], condition=IfCondition(enable_perception), ) # Application layer (conditional) application_launch = GroupAction( actions=[ LogInfo(msg='Starting application layer...'), IncludeLaunchDescription( PythonLaunchDescriptionSource( PathJoinSubstitution([pkg_share, 'launch', 'application.launch.py']) ), launch_arguments={'use_sim': use_sim}.items(), ), ], condition=IfCondition(enable_navigation), ) return LaunchDescription([ DeclareLaunchArgument('use_sim', default_value='false', description='Use simulation instead of real hardware'), DeclareLaunchArgument('robot_variant', default_value='standard', description='Robot variant: standard, heavy_payload, outdoor'), DeclareLaunchArgument('enable_perception', default_value='true', description='Enable perception stack'), DeclareLaunchArgument('enable_navigation', default_value='true', description='Enable navigation and application stack'), # Set use_sim_time globally SetParameter(name='use_sim_time', value=use_sim), LogInfo(msg=['Bringing up robot variant: ', robot_variant]), hardware_launch, drivers_launch, perception_launch, application_launch, ])
Conditional Loading (Sim vs Real, Robot Variants)
Use
IfCondition, UnlessCondition, and PythonExpression to swap configurations based on runtime arguments.
from launch.conditions import IfCondition, UnlessCondition from launch.substitutions import PythonExpression, LaunchConfiguration robot_variant = LaunchConfiguration('robot_variant') # Load a variant-specific config file variant_config = PathJoinSubstitution([ FindPackageShare('my_robot_bringup'), 'config', 'variants', PythonExpression(["'", robot_variant, "' + '.yaml'"]), ]) # Conditional node: only load the arm driver for heavy_payload variant arm_driver = Node( package='arm_driver', executable='arm_controller_node', name='arm_controller', condition=IfCondition( PythonExpression(["'", robot_variant, "' == 'heavy_payload'"]) ), )
Ordered Startup with Health Checks
Startup Dependency Graph
Nodes must start in a specific order to avoid subscribing to topics that do not yet exist or calling services before they are available.
┌──────────────┐ │ motors.srv │ └──────┬───────┘ │ ┌────────────┼────────────┐ ▼ ▼ ▼ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ camera │ │ lidar │ │ imu │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │ │ ▼ ▼ ▼ ┌──────────────────────────────────┐ │ perception / SLAM │ └──────────────┬───────────────────┘ │ ▼ ┌──────────────────────────────────┐ │ navigation / planning │ └──────────────────────────────────┘
Health Check Scripts
Use health check scripts in
ExecStartPre to block service startup until dependencies are ready.
#!/bin/bash # /usr/local/bin/robot-device-check.sh # Verifies that all required hardware devices are present before starting drivers. # Exit code 0 = all devices found, non-zero = missing device. set -euo pipefail REQUIRED_DEVICES=( "/dev/robot/camera_front" "/dev/robot/lidar" "/dev/robot/imu" "/dev/robot/motor_controller" ) TIMEOUT=30 POLL_INTERVAL=1 elapsed=0 for device in "${REQUIRED_DEVICES[@]}"; do elapsed=0 while [ ! -e "$device" ]; do if [ "$elapsed" -ge "$TIMEOUT" ]; then echo "ERROR: Device $device not found after ${TIMEOUT}s" >&2 exit 1 fi echo "Waiting for $device... (${elapsed}s/${TIMEOUT}s)" sleep "$POLL_INTERVAL" elapsed=$((elapsed + POLL_INTERVAL)) done echo "Found device: $device" done echo "All required devices are present." exit 0
#!/bin/bash # /usr/local/bin/wait-for-ros2-nodes.sh # Blocks until specified ROS2 nodes are active. # Usage: wait-for-ros2-nodes.sh node_name1 node_name2 ... set -euo pipefail source /opt/ros/${ROS_DISTRO}/setup.bash source /home/robot/ros2_ws/install/setup.bash TIMEOUT=60 POLL_INTERVAL=2 for node_name in "$@"; do elapsed=0 while ! ros2 node list 2>/dev/null | grep -q "$node_name"; do if [ "$elapsed" -ge "$TIMEOUT" ]; then echo "ERROR: Node $node_name not found after ${TIMEOUT}s" >&2 exit 1 fi echo "Waiting for node $node_name... (${elapsed}s/${TIMEOUT}s)" sleep "$POLL_INTERVAL" elapsed=$((elapsed + POLL_INTERVAL)) done echo "Node active: $node_name" done echo "All required nodes are active." exit 0
Wait-for-Topic Pattern
A reusable Python utility to block until a topic is being published, useful for ordered startup in launch files.
#!/usr/bin/env python3 # wait_for_topic.py # Usage: python3 wait_for_topic.py /scan sensor_msgs/msg/LaserScan --timeout 30 import argparse import sys import time import importlib import rclpy from rclpy.node import Node from rclpy.qos import qos_profile_sensor_data class TopicWaiter(Node): def __init__(self, topic_name, msg_type_str, timeout): super().__init__('topic_waiter') self.received = False self.timeout = timeout self.start_time = time.time() # Dynamically import the message type module_name, class_name = msg_type_str.rsplit('/', 1) module_name = module_name.replace('/', '.') module = importlib.import_module(module_name) msg_type = getattr(module, class_name) self.sub = self.create_subscription( msg_type, topic_name, self._callback, qos_profile_sensor_data) self.timer = self.create_timer(1.0, self._check_timeout) self.get_logger().info(f'Waiting for topic {topic_name}...') def _callback(self, msg): self.get_logger().info('Topic is active, message received.') self.received = True def _check_timeout(self): if self.received: raise SystemExit(0) elapsed = time.time() - self.start_time if elapsed > self.timeout: self.get_logger().error(f'Timeout after {self.timeout}s') raise SystemExit(1) def main(): parser = argparse.ArgumentParser() parser.add_argument('topic', help='Topic name to wait for') parser.add_argument('msg_type', help='Message type (e.g., sensor_msgs/msg/LaserScan)') parser.add_argument('--timeout', type=float, default=30.0) args = parser.parse_args() rclpy.init() node = TopicWaiter(args.topic, args.msg_type, args.timeout) rclpy.spin(node) if __name__ == '__main__': main()
Lifecycle Node Orchestration for Ordered Startup
Use lifecycle (managed) nodes to enforce startup ordering. A lifecycle manager configures and activates nodes in sequence, ensuring each node completes its configuration before the next one starts.
# lifecycle_manager.launch.py from launch import LaunchDescription from launch_ros.actions import Node def generate_launch_description(): # Lifecycle manager controls the startup/shutdown order lifecycle_manager = Node( package='nav2_lifecycle_manager', executable='lifecycle_manager', name='lifecycle_manager', output='screen', parameters=[{ # Nodes are transitioned in order: configure, then activate 'node_names': [ 'motor_controller', 'camera_driver', 'lidar_driver', 'slam', 'navigation', ], 'autostart': True, # Timeout for each node transition 'bond_timeout': 10.0, # Check period for node bonds 'bond_respawn_max_duration': 2.0, }], ) return LaunchDescription([ lifecycle_manager, ])
udev Rules for Deterministic Device Naming
Writing udev Rules for Cameras
USB cameras can enumerate in any order on boot, causing
/dev/video0 to be unpredictable. Use udev rules to assign stable symlinks based on device attributes.
# /etc/udev/rules.d/99-robot-cameras.rules # Assign stable device names based on USB port path (physical location). # Find attributes with: udevadm info --name=/dev/video0 --attribute-walk # Front camera: USB hub port 1, interface 0 (video capture) SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1234", ATTRS{idProduct}=="5678", \ KERNELS=="1-1.2:1.0", ATTR{index}=="0", \ SYMLINK+="robot/camera_front", MODE="0666", GROUP="video" # Rear camera: USB hub port 2, interface 0 (video capture) SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1234", ATTRS{idProduct}=="5678", \ KERNELS=="1-1.3:1.0", ATTR{index}=="0", \ SYMLINK+="robot/camera_rear", MODE="0666", GROUP="video" # Depth camera (RealSense): by serial number SUBSYSTEM=="video4linux", ATTRS{idVendor}=="8086", ATTRS{idProduct}=="0b3a", \ ATTRS{serial}=="123456789", ATTR{index}=="0", \ SYMLINK+="robot/camera_depth", MODE="0666", GROUP="video"
Writing udev Rules for Serial Devices
Serial devices (IMU, motor controller, GPS) also need stable names since
/dev/ttyUSB* numbering is non-deterministic.
# /etc/udev/rules.d/99-robot-serial.rules # IMU on FTDI serial adapter (identified by serial number) SUBSYSTEM=="tty", ATTRS{idVendor}=="0403", ATTRS{idProduct}=="6001", \ ATTRS{serial}=="AB0CDEFG", \ SYMLINK+="robot/imu", MODE="0666", GROUP="dialout" # Motor controller on USB port path SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", \ KERNELS=="1-1.4:1.0", \ SYMLINK+="robot/motor_controller", MODE="0666", GROUP="dialout" # GPS receiver SUBSYSTEM=="tty", ATTRS{idVendor}=="1546", ATTRS{idProduct}=="01a7", \ SYMLINK+="robot/gps", MODE="0666", GROUP="dialout" # LiDAR (CP2102 adapter) SUBSYSTEM=="tty", ATTRS{idVendor}=="10c4", ATTRS{idProduct}=="ea60", \ ATTRS{serial}=="0001", \ SYMLINK+="robot/lidar", MODE="0666", GROUP="dialout"
Reloading and Testing
# Reload udev rules without rebooting sudo udevadm control --reload-rules sudo udevadm trigger # Test a rule against a specific device sudo udevadm test $(udevadm info --query=path --name=/dev/ttyUSB0) # View all attributes for a device (use to find idVendor, serial, etc.) udevadm info --name=/dev/ttyUSB0 --attribute-walk # Monitor udev events in real time (plug/unplug devices to see events) udevadm monitor --subsystem-match=tty --property
Network Configuration for Multi-Machine ROS2
Static IP Configuration
Assign a static IP to the robot's wired interface using netplan (Ubuntu 22.04+).
# /etc/netplan/01-robot-network.yaml network: version: 2 ethernets: eth0: addresses: - 10.0.0.10/24 routes: - to: default via: 10.0.0.1 nameservers: addresses: - 8.8.8.8 - 8.8.4.4 wifis: wlan0: dhcp4: true access-points: "RobotNetwork": password: "securepassword"
# Apply netplan configuration sudo netplan apply
DDS Discovery Across Machines
CycloneDDS requires explicit peer configuration for multi-machine setups since multicast may not work across network segments.
<!-- /etc/robot/cyclonedds.xml --> <CycloneDDS> <Domain> <General> <Interfaces> <NetworkInterface name="eth0" priority="default" multicast="false"/> </Interfaces> <AllowMulticast>false</AllowMulticast> </General> <Discovery> <ParticipantIndex>auto</ParticipantIndex> <Peers> <!-- Robot onboard computer --> <Peer address="10.0.0.10"/> <!-- Base station / operator workstation --> <Peer address="10.0.0.20"/> <!-- Second robot (if applicable) --> <Peer address="10.0.0.11"/> </Peers> <MaxAutoParticipantIndex>30</MaxAutoParticipantIndex> </Discovery> <Internal> <SocketReceiveBufferSize min="10MB"/> </Internal> </Domain> </CycloneDDS>
Firewall Rules
DDS uses a range of UDP ports for discovery and data exchange. Open these ports on both the robot and the base station.
#!/bin/bash # /usr/local/bin/robot-firewall-setup.sh # Open firewall ports for CycloneDDS discovery and data exchange. # DDS discovery (SPDP) uses UDP port 7400 + (250 * domain_id) + participant_id # For domain_id=42: base port = 7400 + 250*42 = 17900 DOMAIN_ID=42 BASE_PORT=$((7400 + 250 * DOMAIN_ID)) # Allow discovery (SPDP) multicast/unicast sudo ufw allow proto udp from 10.0.0.0/24 to any port $BASE_PORT:$((BASE_PORT + 100)) # Allow data exchange (SEDP) user traffic ports DATA_PORT=$((BASE_PORT + 1)) sudo ufw allow proto udp from 10.0.0.0/24 to any port $DATA_PORT:$((DATA_PORT + 200)) # Allow all traffic on the robot subnet (simpler alternative) # sudo ufw allow from 10.0.0.0/24 sudo ufw reload echo "Firewall configured for ROS2 DDS on domain $DOMAIN_ID"
ROS_DOMAIN_ID and ROS_LOCALHOST_ONLY
# Isolate robots on the same network by domain ID (0-232) export ROS_DOMAIN_ID=42 # Lock DDS traffic to localhost only (useful for single-machine development) export ROS_LOCALHOST_ONLY=1 # For multi-machine setups, ensure ROS_LOCALHOST_ONLY is 0 on all machines export ROS_LOCALHOST_ONLY=0 # Verify DDS discovery across machines ros2 daemon stop && ros2 daemon start ros2 topic list # Should see topics from both machines
Watchdog and Heartbeat Monitoring
systemd Watchdog Integration
When
WatchdogSec is set in the service unit, the process must periodically notify systemd that it is alive. If the notification is missed, systemd restarts the service.
#!/usr/bin/env python3 # watchdog_node.py # A ROS2 node that integrates with systemd watchdog via sd_notify. import os import socket import time import rclpy from rclpy.node import Node from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus class WatchdogNode(Node): """Notifies systemd that the ROS2 process is alive.""" def __init__(self): super().__init__('watchdog_node') # Read the watchdog interval from systemd environment watchdog_usec = os.environ.get('WATCHDOG_USEC') if watchdog_usec: # Notify at half the watchdog interval for safety margin interval_sec = int(watchdog_usec) / 1_000_000 / 2.0 else: interval_sec = 10.0 self.get_logger().warn('WATCHDOG_USEC not set, using 10s interval') # Connect to systemd notification socket self.notify_socket = os.environ.get('NOTIFY_SOCKET') # Signal that startup is complete self._sd_notify('READY=1') self.get_logger().info( f'Watchdog node started, notify interval: {interval_sec:.1f}s') # Periodically send watchdog keepalive self.create_timer(interval_sec, self._watchdog_tick) # Subscribe to system diagnostics to detect failures self.diag_sub = self.create_subscription( DiagnosticArray, '/diagnostics', self._diag_callback, 10) self.system_healthy = True def _watchdog_tick(self): """Send watchdog keepalive to systemd if system is healthy.""" if self.system_healthy: self._sd_notify('WATCHDOG=1') else: self.get_logger().error( 'System unhealthy, withholding watchdog notification') def _diag_callback(self, msg): """Monitor diagnostics for critical errors.""" for status in msg.status: if status.level == DiagnosticStatus.ERROR: self.get_logger().error(f'Critical error: {status.name}: {status.message}') self.system_healthy = False return self.system_healthy = True def _sd_notify(self, state): """Send notification to systemd.""" if not self.notify_socket: return addr = self.notify_socket if addr[0] == '@': addr = '\0' + addr[1:] sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) try: sock.connect(addr) sock.sendall(state.encode()) finally: sock.close() def main(): rclpy.init() node = WatchdogNode() rclpy.spin(node) node.destroy_node() rclpy.shutdown() if __name__ == '__main__': main()
ROS2-Level Heartbeat Monitor Node
A monitor node that subscribes to heartbeat topics from critical subsystems and publishes overall system health. If a heartbeat is missed, it triggers a safe stop.
#!/usr/bin/env python3 # heartbeat_monitor.py # Monitors heartbeats from critical nodes and triggers safe stop if any go silent. import time import rclpy from rclpy.node import Node from rclpy.qos import QoSProfile, ReliabilityPolicy, DurabilityPolicy from std_msgs.msg import Bool, String from geometry_msgs.msg import Twist from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue class HeartbeatMonitor(Node): def __init__(self): super().__init__('heartbeat_monitor') # Declare parameters for monitored nodes and timeout self.declare_parameter('monitored_nodes', [ 'motor_controller', 'camera_driver', 'lidar_driver', 'slam' ]) self.declare_parameter('heartbeat_timeout_sec', 5.0) self.declare_parameter('check_period_sec', 1.0) self.monitored_nodes = self.get_parameter('monitored_nodes').value self.timeout = self.get_parameter('heartbeat_timeout_sec').value check_period = self.get_parameter('check_period_sec').value # Track last heartbeat time for each monitored node self.last_heartbeat = {name: time.time() for name in self.monitored_nodes} # Subscribe to each node's heartbeat topic reliable_qos = QoSProfile( reliability=ReliabilityPolicy.RELIABLE, durability=DurabilityPolicy.VOLATILE, depth=1, ) for node_name in self.monitored_nodes: self.create_subscription( Bool, f'/{node_name}/heartbeat', lambda msg, n=node_name: self._heartbeat_callback(n, msg), reliable_qos, ) # Publishers self.health_pub = self.create_publisher( DiagnosticArray, '/system_health', 10) self.estop_pub = self.create_publisher( Bool, '/emergency_stop', reliable_qos) self.cmd_vel_pub = self.create_publisher( Twist, '/cmd_vel', 10) # Periodic health check self.create_timer(check_period, self._check_health) self.get_logger().info( f'Monitoring heartbeats for: {self.monitored_nodes}') def _heartbeat_callback(self, node_name, msg): """Record heartbeat reception time.""" self.last_heartbeat[node_name] = time.time() def _check_health(self): """Check all heartbeats and publish diagnostics.""" now = time.time() diag_array = DiagnosticArray() diag_array.header.stamp = self.get_clock().now().to_msg() all_healthy = True for node_name in self.monitored_nodes: elapsed = now - self.last_heartbeat[node_name] status = DiagnosticStatus() status.name = f'heartbeat/{node_name}' if elapsed < self.timeout: status.level = DiagnosticStatus.OK status.message = f'Alive ({elapsed:.1f}s ago)' else: status.level = DiagnosticStatus.ERROR status.message = f'TIMEOUT ({elapsed:.1f}s since last heartbeat)' all_healthy = False self.get_logger().error( f'Heartbeat timeout for {node_name}: {elapsed:.1f}s') status.values = [ KeyValue(key='elapsed_sec', value=f'{elapsed:.2f}'), KeyValue(key='timeout_sec', value=f'{self.timeout:.2f}'), ] diag_array.status.append(status) self.health_pub.publish(diag_array) if not all_healthy: self._trigger_safe_stop() def _trigger_safe_stop(self): """Send zero velocity and emergency stop signal.""" self.get_logger().warn('Triggering safe stop due to heartbeat failure') # Publish zero velocity self.cmd_vel_pub.publish(Twist()) # Publish emergency stop estop_msg = Bool() estop_msg.data = True self.estop_pub.publish(estop_msg) def main(): rclpy.init() node = HeartbeatMonitor() rclpy.spin(node) node.destroy_node() rclpy.shutdown() if __name__ == '__main__': main()
Hardware Watchdog Integration
Many robot onboard computers have a hardware watchdog timer (e.g., Intel TCO, iTCO_wdt). If the software fails to pet the watchdog, the hardware performs a hard reboot.
# Enable the hardware watchdog in systemd # /etc/systemd/system.conf # RuntimeWatchdogSec=30 # RebootWatchdogSec=10min # ShutdownWatchdogSec=10min # Or configure per-service in the unit file: # WatchdogSec=30 triggers systemd to restart the service # The hardware watchdog (configured via RuntimeWatchdogSec) reboots # the entire machine if systemd itself becomes unresponsive. # Verify hardware watchdog is active sudo cat /sys/class/watchdog/watchdog0/state # Should output: active # Check watchdog timeout sudo cat /sys/class/watchdog/watchdog0/timeout
Logging and Log Rotation
ROS2 Log Configuration
# Set log level via environment export RCUTILS_LOGGING_USE_STDOUT=0 # Log to stderr (captured by journald) export RCUTILS_COLORIZED_OUTPUT=0 # Disable color codes in log files export RCUTILS_CONSOLE_OUTPUT_FORMAT="[{severity}] [{time}] [{name}]: {message}" # Set log level at runtime ros2 run my_pkg my_node --ros-args --log-level debug ros2 run my_pkg my_node --ros-args --log-level my_node:=debug # Set log level via parameter (Humble+) ros2 param set /my_node use_sim_time false ros2 service call /my_node/set_logger_level rcl_interfaces/srv/SetLoggerLevel \ "{logger_name: 'my_node', level: 10}"
journald Configuration for ROS2 Services
# /etc/systemd/journald.conf.d/robot.conf [Journal] # Persist logs across reboots Storage=persistent # Limit total journal size to 1 GB SystemMaxUse=1G # Limit per-file size to 100 MB SystemMaxFileSize=100M # Keep logs for 30 days MaxRetentionSec=30day # Rate limit: allow bursts during startup RateLimitIntervalSec=10s RateLimitBurst=10000 # Forward to syslog for remote logging ForwardToSyslog=yes
# View logs for a specific robot service journalctl -u robot-drivers.service -f # View logs since last boot journalctl -u robot-bringup.service -b # View logs with priority filtering (error and above) journalctl -u robot-bringup.service -p err # Export logs for analysis journalctl -u robot-bringup.service --since "2024-01-01" --output=json > logs.json
logrotate for ROS2 Log Files
ROS2 writes log files to
~/.ros/log/ by default, or to $ROS_LOG_DIR. These grow unbounded without rotation.
# /etc/logrotate.d/ros2 /var/log/ros2/*.log { daily rotate 7 compress delaycompress missingok notifempty create 0644 robot robot maxsize 100M dateext dateformat -%Y%m%d postrotate # Notify ROS2 nodes to reopen log files (if using file logging) systemctl kill --signal=HUP robot-bringup.service 2>/dev/null || true endscript } /home/robot/.ros/log/**/*.log { daily rotate 3 compress missingok notifempty maxsize 50M }
Structured Logging for Production
import json import logging from rclpy.node import Node class StructuredLogger: """Wraps ROS2 logger with structured JSON output for production monitoring.""" def __init__(self, node: Node): self.node = node self.logger = node.get_logger() def log_event(self, event_type: str, level: str = 'info', **kwargs): """Log a structured event with key-value metadata.""" entry = { 'event': event_type, 'node': self.node.get_name(), 'namespace': self.node.get_namespace(), 'stamp': self.node.get_clock().now().nanoseconds, **kwargs, } message = json.dumps(entry) getattr(self.logger, level)(message) # Usage in a node: # self.slog = StructuredLogger(self) # self.slog.log_event('detection', count=5, latency_ms=12.3) # self.slog.log_event('motor_fault', level='error', motor_id=2, code=0x0A)
Graceful Shutdown Sequences
Signal Handling in ROS2 Nodes
ROS2 nodes should handle
SIGINT and SIGTERM to bring actuators to a safe state before exiting.
#!/usr/bin/env python3 # safe_shutdown_node.py # Demonstrates graceful shutdown with safe state transitions. import signal import sys import rclpy from rclpy.node import Node from geometry_msgs.msg import Twist from std_msgs.msg import Bool class SafeShutdownNode(Node): def __init__(self): super().__init__('safe_shutdown_node') self.cmd_vel_pub = self.create_publisher(Twist, '/cmd_vel', 10) self.brake_pub = self.create_publisher(Bool, '/brakes/engage', 10) # Register signal handlers for graceful shutdown signal.signal(signal.SIGTERM, self._shutdown_handler) signal.signal(signal.SIGINT, self._shutdown_handler) self.get_logger().info('Node started with graceful shutdown handler') def _shutdown_handler(self, signum, frame): """Handle shutdown signals by commanding safe state.""" sig_name = signal.Signals(signum).name self.get_logger().warn(f'Received {sig_name}, initiating safe shutdown...') # Step 1: Command zero velocity immediately zero_twist = Twist() # All fields default to 0.0 for _ in range(5): self.cmd_vel_pub.publish(zero_twist) # Step 2: Engage brakes brake_msg = Bool() brake_msg.data = True self.brake_pub.publish(brake_msg) # Step 3: Wait briefly for commands to be received self.get_logger().info('Safe state commanded, shutting down...') # Step 4: Clean exit self.destroy_node() rclpy.shutdown() sys.exit(0) def main(): rclpy.init() node = SafeShutdownNode() rclpy.spin(node) node.destroy_node() rclpy.shutdown() if __name__ == '__main__': main()
Ordered Shutdown via systemd Dependencies
The
PartOf and Before directives ensure that application-level services are stopped before drivers, preventing the situation where a navigation node sends velocity commands after the motor driver has exited.
# /etc/systemd/system/robot-application.service [Unit] # ... PartOf=robot-perception.service Before=robot-perception.service # When robot-perception stops, robot-application is stopped FIRST # (Before= reverses the stop order relative to start order) [Service] # Use SIGINT for ROS2's signal handler, then SIGTERM, then SIGKILL KillSignal=SIGINT TimeoutStopSec=15 FinalKillSignal=SIGTERM SendSIGKILL=yes
Safe State on Shutdown
# Use rclpy context shutdown callback for cleanup import rclpy from rclpy.node import Node class ActuatorNode(Node): def __init__(self): super().__init__('actuator_node') self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', 10) # Register a callback that runs during rclpy.shutdown() context = self.context context.on_shutdown(self._on_shutdown) def _on_shutdown(self): """Called automatically during rclpy.shutdown().""" self.get_logger().info('Shutdown callback: commanding zero velocity') self.cmd_pub.publish(Twist())
Remote Monitoring and Debugging
SSH Tunneling for ROS2 Topics
Forward DDS traffic over SSH when direct network connectivity is not available (e.g., robot is on a cellular connection).
#!/bin/bash # ssh-ros2-tunnel.sh # Creates an SSH tunnel for ROS2 DDS traffic between local machine and robot. # Usage: ./ssh-ros2-tunnel.sh robot@10.0.0.10 set -euo pipefail ROBOT_HOST="${1:?Usage: $0 robot@host}" DOMAIN_ID="${ROS_DOMAIN_ID:-0}" BASE_PORT=$((7400 + 250 * DOMAIN_ID)) echo "Setting up SSH tunnel for ROS2 domain $DOMAIN_ID (ports $BASE_PORT-$((BASE_PORT + 200)))" # Forward DDS discovery and data ports ssh -N \ -L ${BASE_PORT}:localhost:${BASE_PORT} \ -L $((BASE_PORT + 1)):localhost:$((BASE_PORT + 1)) \ -L $((BASE_PORT + 10)):localhost:$((BASE_PORT + 10)) \ -L $((BASE_PORT + 11)):localhost:$((BASE_PORT + 11)) \ "$ROBOT_HOST" & SSH_PID=$! echo "SSH tunnel PID: $SSH_PID" # Set environment for local ROS2 to use localhost-only discovery export ROS_LOCALHOST_ONLY=1 echo "Run: export ROS_LOCALHOST_ONLY=1" echo "Then use ros2 topic list, ros2 topic echo, etc." echo "Press Ctrl+C to close tunnel." wait $SSH_PID
Remote journalctl and Service Management
# View live robot logs remotely ssh robot@10.0.0.10 'journalctl -u robot-bringup.service -f' # Check service status ssh robot@10.0.0.10 'systemctl status robot-drivers.service robot-perception.service' # Restart a single layer without rebooting ssh robot@10.0.0.10 'sudo systemctl restart robot-perception.service' # View boot-time service ordering ssh robot@10.0.0.10 'systemd-analyze blame | head -20' # Check for failed services ssh robot@10.0.0.10 'systemctl --failed' # Stream structured logs as JSON ssh robot@10.0.0.10 'journalctl -u robot-bringup.service -o json --follow'
Deploying Updates via SSH
#!/bin/bash # deploy-to-robot.sh # Build locally, copy to robot, and restart services. # Usage: ./deploy-to-robot.sh robot@10.0.0.10 set -euo pipefail ROBOT_HOST="${1:?Usage: $0 robot@host}" WORKSPACE="/home/robot/ros2_ws" echo "=== Building workspace locally ===" colcon build --cmake-args -DCMAKE_BUILD_TYPE=Release --packages-select my_robot_bringup echo "=== Syncing to robot ===" rsync -avz --delete \ --exclude='build/' --exclude='log/' \ src/ "${ROBOT_HOST}:${WORKSPACE}/src/" echo "=== Building on robot ===" ssh "$ROBOT_HOST" "cd ${WORKSPACE} && \ source /opt/ros/\${ROS_DISTRO}/setup.bash && \ colcon build --cmake-args -DCMAKE_BUILD_TYPE=Release" echo "=== Restarting robot services ===" ssh "$ROBOT_HOST" 'sudo systemctl restart robot-bringup.target' echo "=== Checking service status ===" ssh "$ROBOT_HOST" 'sleep 3 && systemctl status robot-bringup.target --no-pager' echo "Deploy complete."
Robot Bringup Anti-Patterns
1. Sourcing setup.bash in .bashrc for systemd
Problem: systemd services do not load
~/.bashrc or ~/.profile. Environment variables set there are invisible to the service, causing ROS2 commands to fail with "command not found" or missing package errors.
# BAD: Relying on .bashrc for systemd services # ~/.bashrc source /opt/ros/humble/setup.bash # systemd will never see this # GOOD: Use EnvironmentFile in the service unit and source explicitly in ExecStart # /etc/robot/ros2.env ROS_DISTRO=humble RMW_IMPLEMENTATION=rmw_cyclonedds_cpp # In the service unit: # EnvironmentFile=/etc/robot/ros2.env # ExecStart=/bin/bash -c 'source /opt/ros/${ROS_DISTRO}/setup.bash && ...'
2. No Startup Ordering
Problem: Starting all ROS2 nodes simultaneously causes race conditions. A navigation node may attempt to call a service that has not yet been advertised by the driver, leading to intermittent startup failures.
Fix: Use
After= and Requires= in systemd units, or use a lifecycle manager to enforce ordered transitions:
# BAD: All services start in parallel with no ordering [Unit] Description=Robot Navigation # No After= or Requires= directives # GOOD: Explicit dependency chain [Unit] Description=Robot Navigation After=robot-perception.service Requires=robot-perception.service
3. Using Restart=always Without Rate Limiting
Problem: A service that crashes on startup (e.g., missing config file, hardware disconnected) will restart in a tight loop, consuming CPU and flooding the journal.
Fix: Use
StartLimitIntervalSec and StartLimitBurst to cap restart attempts:
# BAD: Infinite restart loop [Service] Restart=always RestartSec=1 # GOOD: Rate-limited restarts with failure notification [Service] Restart=on-failure RestartSec=5 StartLimitIntervalSec=120 StartLimitBurst=5
4. Relying on network.target Instead of network-online.target
Problem:
network.target is reached as soon as the network configuration starts, not when connectivity is actually established. DDS discovery fails because the network interface does not have an IP address yet.
Fix: Use
network-online.target and ensure systemd-networkd-wait-online.service or NetworkManager-wait-online.service is enabled:
# BAD: network.target does not guarantee connectivity [Unit] After=network.target # GOOD: Wait for actual network connectivity [Unit] After=network-online.target Wants=network-online.target
5. No Log Rotation
Problem: ROS2 log files in
~/.ros/log/ and journal entries grow without limit, eventually filling the disk on an embedded system with limited storage.
Fix: Configure logrotate for ROS2 log files and set journald size limits:
# BAD: No log management # Logs in ~/.ros/log/ grow forever, disk fills up after weeks of operation # GOOD: logrotate config + journald limits # /etc/logrotate.d/ros2 (see Logging section above) # /etc/systemd/journald.conf: SystemMaxUse=1G
6. Hardcoded Device Paths (/dev/ttyUSB0)
Problem:
/dev/ttyUSB0 can be assigned to any USB serial device depending on enumeration order. After a reboot, the IMU might become /dev/ttyUSB1 and the motor controller /dev/ttyUSB0, reversing the mapping.
Fix: Use udev rules to create stable symlinks:
# BAD: Hardcoded device path in ROS2 params serial_port: "/dev/ttyUSB0" # Which device is this? It changes on reboot! # GOOD: Stable symlink via udev rule serial_port: "/dev/robot/imu" # Always points to the correct device
7. Running the Entire Stack as Root
Problem: Running ROS2 as root is a security risk and can cause permission issues with
rosbag2, log files, and parameter persistence. A bug in a ROS2 node could damage the operating system.
Fix: Create a dedicated
robot user and grant only the necessary device permissions via udev GROUP and MODE rules:
# BAD: Running as root # ExecStart=/bin/bash -c 'source /opt/ros/humble/setup.bash && ros2 launch ...' # (runs as root because no User= is specified) # GOOD: Dedicated user with minimal privileges # Create robot user sudo useradd -r -m -s /bin/bash robot sudo usermod -aG dialout,video,plugdev robot # In the service unit: # User=robot # Group=robot # udev rules grant device access to the robot user's groups: # MODE="0666", GROUP="dialout"
8. No Graceful Shutdown Handler
Problem: When systemd sends
SIGTERM or SIGINT to stop a ROS2 node, the node exits immediately without commanding zero velocity or engaging brakes. The robot may coast or continue moving with the last commanded velocity.
Fix: Register signal handlers or use rclpy's shutdown callback to command a safe state:
# BAD: No shutdown handling, node just exits def main(): rclpy.init() node = MotorControlNode() rclpy.spin(node) # Robot is still moving with last commanded velocity! # GOOD: Shutdown handler commands safe state def main(): rclpy.init() node = MotorControlNode() try: rclpy.spin(node) except KeyboardInterrupt: pass finally: node.command_zero_velocity() node.engage_brakes() node.destroy_node() rclpy.shutdown()
Robot Bringup Checklist
- udev rules written and tested for all USB devices (cameras, LiDARs, serial adapters) with stable symlinks under
/dev/robot/ - systemd service units created for each layer (drivers, perception, application) with correct
/After=
orderingRequires= - ROS2 environment file (
) configured with/etc/robot/ros2.env
,ROS_DISTRO
,RMW_IMPLEMENTATION
, andROS_DOMAIN_IDCYCLONEDDS_URI - CycloneDDS or FastDDS XML configured with explicit peer list for multi-machine discovery
- Launch files layered and composable with conditional arguments for sim/real and robot variants
- Health check scripts written for
to verify device presence before starting driversExecStartPre - Watchdog integration configured:
in service units andWatchdogSec
in the ROS2 processsd_notify(WATCHDOG=1) - Heartbeat monitor node deployed to detect node failures and trigger safe stop
- Graceful shutdown handlers registered in all actuator nodes (zero velocity, engage brakes on
/SIGINT
)SIGTERM - Log rotation configured via logrotate for
and journald$ROS_LOG_DIR
limitsSystemMaxUse - Restart policies rate-limited with
andStartLimitIntervalSec
to prevent restart loopsStartLimitBurst - Resource limits set via
,MemoryMax
to prevent runaway nodes from starving the systemCPUQuota - Network and firewall configured with static IPs, DDS port rules, and
set correctlyROS_LOCALHOST_ONLY - Full boot test performed from power-off to autonomous operation, verifying service ordering and recovery from simulated failures