Docker swarm on NVidia Jetson TX1 - bridge device issues

I am attempting to use an NVidia Jetson TX1 (aarch64) as worker within a docker swarm.

The swarm also contains two x86_64 nodes - the master and another worker - both running stock latest revisions of Ubuntu 18.04 and docker-ce 18.09.3.

The TX1 is on a standard dev board, running a somewhat stripped-down install of L4T from Jetpack 3.3 (Ubuntu 16.04 derived), also with docker-ce 18.09.3.

Using docker standalone on the TX1 works fine, for simple things at least: I can start and stop containers, connect to them, etc.

When I then try to add the TX1 to the cluster, after the usual set of info-level messages about the gossip cluster getting wired up, I see two error messages in 'journalctl -u docker.service':

Mar 19 10:06:47 tegra-ubuntu dockerd[1028]: time="2019-03-19T10:06:47Z" level=error msg="enabling default vlan on bridge br0 failed open /sys/class/net/br0/bridge/default_pvid: permission denied"
Mar 19 10:06:47 tegra-ubuntu dockerd[1028]: time="2019-03-19T10:06:47.662790794Z" level=error msg="reexec to set bridge default vlan failed exit status 1"

… which I don’t see on my other worker.
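
The sysfs attribute the daemon is complaining about can be checked directly on the TX1. A rough sketch, using docker0 as an example (no br0 device is visible on the host, so I'm not sure where that name in the error comes from):

$ ls -l /sys/class/net/docker0/bridge/default_pvid
$ cat /sys/class/net/docker0/bridge/default_pvid
$ echo 1 | sudo tee /sys/class/net/docker0/bridge/default_pvid   # a write similar to the one the daemon attempts

If the attribute is missing, or that write fails even as root, it might point at the L4T kernel/sysfs setup rather than at Docker itself.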

The TX1 appears to have joined the cluster successfully from the manager:

$ docker node ls
ID                            HOSTNAME            STATUS    AVAILABILITY    MANAGER STATUS    ENGINE VERSION
s14o76ap2sdgf2g7jfyka8b5h     geoff-OldMacBook    Ready     Active                            18.09.3
1k74jna5lhfba50ks6g7k0r7e     tegra-ubuntu        Ready     Active                            18.09.3
94zgws7ym9dwbo2x6be67hp9u *   toc17-office        Ready     Active          Leader            18.09.3

When I try to deploy a stack which puts a simple container onto the TX1 based on the following compose file fragment:

pub:
  image: ros:melodic-ros-core
  environment:
    - "ROS_MASTER_URI=http://ros-master:11311"
    - "ROS_HOSTNAME=pub"
  command: stdbuf -o L rostopic pub /turtle1/cmd_vel geometry_msgs/Twist -r 1 -- '[2.0, 0.0, 0.0]' '[0.0, 0.0, -1.8]'
  deploy:
    placement:
      constraints: [node.hostname == tegra-ubuntu]

It seems to get stuck in a continual fail-restart loop. If I deploy it to the other worker instead, it works fine, and I can run that container image directly on the TX1 outside the swarm via docker run.
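
To see why the task keeps restarting, the task state and any error string can be pulled from the manager. A sketch, assuming the stack was deployed as 'ros' (the actual stack name isn't shown above):

$ docker stack deploy -c docker-compose.yml ros
$ docker service ps --no-trunc ros_pub    # task state, restart history and error string
$ docker service logs ros_pub             # any output from the failing task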

I am guessing that something in the L4T setup is preventing the docker swarm overlay network from being created correctly. Has anybody come across this sort of thing before, on Jetson or otherwise?
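
In case it helps anyone reproduce this, the networks each engine has created can be compared on the two workers with something like the following (the overlay only shows up on a worker while a task attached to it is running there; 'ros_default' is a guess at the stack network name):

$ docker network ls
$ docker network inspect ros_default        # compare the peers/containers sections on each node
$ docker network inspect docker_gwbridge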

Thanks,

Geoff

Having dug a little further into it, it does seem to be a network setup issue in the vicinity of the TX1.

If I create a quiescent container on each worker using the following compose fragment:

pub:
  image: ros:melodic-ros-core
  environment:
    - "ROS_MASTER_URI=http://ros-master:11311"
    - "ROS_HOSTNAME=pub"
  command: sleep 999999
  deploy:
    placement:
      constraints: [node.hostname == tegra-ubuntu]

pub2:
  image: ros:melodic-ros-core
  environment:
    - "ROS_MASTER_URI=http://ros-master:11311"
    - "ROS_HOSTNAME=pub2"
  command: sleep 999999
  deploy:
    placement:
      constraints: [node.hostname == geoff-OldMacBook]

… and then shell into each by docker exec'ing /bin/bash, I find that (the exact checks are sketched after this list):

  • ros-master resolves to 10.0.22.8 on both workers
  • "ping ros-master" works correctly on both workers
  • "curl http://ros-master:11311" succeeds on the other worker but gets connection refused on the TX1
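
For reference, the checks above were along these lines, run from a shell obtained with docker exec (container IDs are just placeholders):

$ docker exec -it <pub-container-id> /bin/bash
root@pub:/# getent hosts ros-master            # one way to check the resolution; comes back as 10.0.22.8 in both containers
root@pub:/# ping -c 3 ros-master               # replies on both workers
root@pub:/# curl http://ros-master:11311       # OK on the x86_64 worker, connection refused on the TX1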

Intriguing!

Geoff

PS. ros-master is another container in the swarm, running on the manager node.

Focusing on the network setup on the two workers, there does seem to be an extra bridge interface on the working one compared to the TX1.

Working Ubuntu 18.04 worker:

$ ip -o a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
2: enp0s10 inet 192.168.0.3/24 brd 192.168.0.255 scope global noprefixroute enp0s10\ valid_lft forever preferred_lft forever
2: enp0s10 inet6 fe80::a673:3634:ff93:a8c2/64 scope link noprefixroute \ valid_lft forever preferred_lft forever
5: wls3 inet 192.168.218.129/24 brd 192.168.218.255 scope global dynamic noprefixroute wls3\ valid_lft 60406sec preferred_lft 60406sec
5: wls3 inet6 fe80::f51c:2d63:6ef1:b748/64 scope link noprefixroute \ valid_lft forever preferred_lft forever
6: docker0 inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\ valid_lft forever preferred_lft forever
7: br-26255e7ca904 inet 172.18.0.1/16 brd 172.18.255.255 scope global br-26255e7ca904\ valid_lft forever preferred_lft forever
8: docker_gwbridge inet 172.19.0.1/16 brd 172.19.255.255 scope global docker_gwbridge\ valid_lft forever preferred_lft forever
8: docker_gwbridge inet6 fe80::42:22ff:febb:c01f/64 scope link \ valid_lft forever preferred_lft forever
33: veth1fe0a5f inet6 fe80::e064:92ff:fed5:46d5/64 scope link \ valid_lft forever preferred_lft forever
46: enp0s4f1u1 inet6 fe80::2ec1:94bd:6108:6f78/64 scope link noprefixroute \ valid_lft forever preferred_lft forever
47: enp0s4f1u1i5 inet6 fe80::db42:3d8a:4c38:3cf2/64 scope link noprefixroute \ valid_lft forever preferred_lft forever
61: vethf8cf197 inet6 fe80::a01b:32ff:fe66:57d7/64 scope link \ valid_lft forever preferred_lft forever

TX1 L4T worker:

$ ip -o a
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
6: usb0 inet6 fe80::f886:67ff:fe3d:bb53/64 scope link \ valid_lft forever preferred_lft forever
7: usb1 inet6 fe80::ac3a:37ff:fed1:eb8/64 scope link \ valid_lft forever preferred_lft forever
8: l4tbr0 inet 192.168.55.1/24 brd 192.168.55.255 scope global l4tbr0\ valid_lft forever preferred_lft forever
8: l4tbr0 inet6 fe80::fc5c:7dff:fea1:7948/64 scope link \ valid_lft forever preferred_lft forever
11: eth1 inet 192.168.0.2/24 brd 192.168.0.255 scope global eth1\ valid_lft forever preferred_lft forever
11: eth1 inet6 fe80::204:4bff:fe5a:b45d/64 scope link \ valid_lft forever preferred_lft forever
12: docker0 inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\ valid_lft forever preferred_lft forever
12: docker0 inet6 fe80::42:23ff:feda:569d/64 scope link \ valid_lft forever preferred_lft forever
13: docker_gwbridge inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge\ valid_lft forever preferred_lft forever
13: docker_gwbridge inet6 fe80::42:aff:fe51:e49d/64 scope link \ valid_lft forever preferred_lft forever
29: veth1f61ce0 inet6 fe80::e881:baff:fe04:20b1/64 scope link \ valid_lft forever preferred_lft forever
247: veth53367a5 inet6 fe80::3c01:e8ff:fef6:eabe/64 scope link \ valid_lft forever preferred_lft forever
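
The extra br-26255e7ca904 device on the working worker looks like a bridge Docker created for a user-defined network. Comparing the networks each engine knows about with the bridges the kernel actually has might show what is missing on the TX1; a sketch (names will differ per node):

$ docker network ls              # run on each worker
$ ip -d link show type bridge    # bridges the kernel actually has
$ bridge link                    # which interfaces are attached to which bridge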

Geoff

Digging deeper into the running containers I notice that the one running on the TX1 has a set of extra network devices compared with containers running on either the other worker or the manager node:

root@cc1d3ae129ad:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
link/ipip 0.0.0.0 brd 0.0.0.0
3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
link/sit 0.0.0.0 brd 0.0.0.0
4: ip6tnl0@NONE: <NOARP> mtu 1452 qdisc noop state DOWN group default qlen 1
link/tunnel6 :: brd ::
244: eth0@if245: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 02:42:0a:00:16:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.22.3/24 brd 10.0.22.255 scope global eth0
valid_lft forever preferred_lft forever
246: eth1@if247: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
valid_lft forever preferred_lft forever

They are all NOARP and DOWN, so they shouldn't be doing anything, but they are unexpected. Here is the list from the other worker for comparison:

root@270468515fa2:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
58: eth0@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 02:42:0a:00:16:06 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.22.6/24 brd 10.0.22.255 scope global eth0
valid_lft forever preferred_lft forever
60: eth1@if61: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:13:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet 172.19.0.3/16 brd 172.19.255.255 scope global eth1
valid_lft forever preferred_lft forever
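
Those tunl0/sit0/ip6tnl0 devices look like the kernel's fallback tunnel devices, which get created automatically in every new network namespace when the corresponding tunnel support (ipip, sit, ip6_tunnel) is loaded or built into the host kernel. A quick way to check on the TX1 host, assuming /proc/config.gz is exposed by that L4T kernel:

$ lsmod | grep -E '^(ipip|sit|ip6_tunnel)'
$ zcat /proc/config.gz | grep -E 'CONFIG_(NET_IPIP|IPV6_SIT|IPV6_TUNNEL)='

If these come back as built in (=y) that would explain why the devices appear in the TX1 containers but not in the ones on the x86_64 nodes.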

Geoff

I have switched my board over to a Jetson TX2 flashed with the latest Jetpack 4.2 / L4T R32.1, and swarm works fine with that setup, so it looks like the issue is related to the network setup in the older Jetpack 3.3 L4T image.

Unfortunately, the signs are that NVIDIA aren't going to support the original TX1 in Jetpack 4.x :-<

Geoff