Network interfaces being renamed and Docker Swarm worker nodes disappearing from the network

Docker version 19.03.8, build afacb8b
OS: Raspbian, kernel 5.4.42-v8+ #1319 SMP PREEMPT Wed May 20 14:18:56 BST 2020 aarch64 GNU/Linux

I have a swarm with 1 manager and 3 workers, all running on identical Raspberry Pi 4 machines.

Every 24 hours or so, one or more of the worker nodes disappears from the network (often all three workers simultaneously). The affected nodes stay up and continue logging to /var/log, but they are not accessible over SSH and do not respond to ping.

I’d appreciate any help with debugging what is going on and getting this cluster stable!

Here’s an extract from the logs on one of the worker nodes that failed on Jul 31 at approx 18:36:

syslog:

Jul 31 18:36:47 pinode02 dockerd[10156]: time="2020-07-31T18:36:47.560127105+01:00" level=info msg="memberlist: Suspect f5bda1824f98 has failed, no acks received"
Jul 31 18:36:49 pinode02 dockerd[10156]: time="2020-07-31T18:36:49.578821398+01:00" level=info msg="Node f5bda1824f98/192.168.86.228, left gossip cluster"
Jul 31 18:36:49 pinode02 dockerd[10156]: time="2020-07-31T18:36:49.579095950+01:00" level=info msg="Node f5bda1824f98 change state NodeActive → NodeFailed"
Jul 31 18:36:49 pinode02 dockerd[10156]: time="2020-07-31T18:36:49.587828841+01:00" level=info msg="Node f5bda1824f98/192.168.86.228, added to failed nodes list"
Jul 31 18:36:50 pinode02 dockerd[10156]: time="2020-07-31T18:36:50.549229306+01:00" level=info msg="memberlist: Suspect f5bda1824f98 has failed, no acks received"
Jul 31 18:36:51 pinode02 dockerd[10156]: time="2020-07-31T18:36:51.975090954+01:00" level=error msg="heartbeat to manager {buvirih35jhbxhkuff4th7rfh 192.168.86.228:2377} failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=ssowjkssmk5a5yv1wcvg101jv session.id=vhg4tcu3uite7z12ltypi8dig sessionID=vhg4tcu3uite7z12ltypi8dig
Jul 31 18:36:51 pinode02 dockerd[10156]: time="2020-07-31T18:36:51.976552896+01:00" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=ssowjkssmk5a5yv1wcvg101jv
Jul 31 18:36:51 pinode02 dockerd[10156]: time="2020-07-31T18:36:51.976810855+01:00" level=info msg="parsed scheme: \"\"" module=grpc
Jul 31 18:36:51 pinode02 dockerd[10156]: time="2020-07-31T18:36:51.976860613+01:00" level=info msg="scheme \"\" not registered, fallback to default scheme" module=grpc
Jul 31 18:36:51 pinode02 dockerd[10156]: time="2020-07-31T18:36:51.977329070+01:00" level=info msg="ccResolverWrapper: sending update to cc: {[{192.168.86.228:2377 0 }] }" module=grpc
Jul 31 18:36:51 pinode02 dockerd[10156]: time="2020-07-31T18:36:51.977389087+01:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jul 31 18:36:51 pinode02 dockerd[10156]: time="2020-07-31T18:36:51.977472216+01:00" level=info msg="manager selected by agent for new session: {buvirih35jhbxhkuff4th7rfh 192.168.86.228:2377}" module=node/agent node.id=ssowjkssmk5a5yv1wcvg101jv
Jul 31 18:36:51 pinode02 dockerd[10156]: time="2020-07-31T18:36:51.977587307+01:00" level=info msg="waiting 50.073616ms before registering session" module=node/agent node.id=ssowjkssmk5a5yv1wcvg101jv
Jul 31 18:36:57 pinode02 dockerd[10156]: time="2020-07-31T18:36:57.028214385+01:00" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=ssowjkssmk5a5yv1wcvg101jv
Jul 31 18:36:57 pinode02 dockerd[10156]: time="2020-07-31T18:36:57.028687563+01:00" level=info msg="parsed scheme: \"\"" module=grpc
Jul 31 18:36:57 pinode02 dockerd[10156]: time="2020-07-31T18:36:57.028793580+01:00" level=info msg="scheme \"\" not registered, fallback to default scheme" module=grpc
Jul 31 18:36:57 pinode02 dockerd[10156]: time="2020-07-31T18:36:57.029778381+01:00" level=info msg="ccResolverWrapper: sending update to cc: {[{192.168.86.228:2377 0 }] }" module=grpc
Jul 31 18:36:57 pinode02 dockerd[10156]: time="2020-07-31T18:36:57.029900861+01:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jul 31 18:36:57 pinode02 dockerd[10156]: time="2020-07-31T18:36:57.030065062+01:00" level=info msg="manager selected by agent for new session: {buvirih35jhbxhkuff4th7rfh 192.168.86.228:2377}" module=node/agent node.id=ssowjkssmk5a5yv1wcvg101jv

On the manager node (pinode01), at around the same time, I see these kernel messages telling me that some network interfaces have been renamed:

Jul 31 18:37:00 pinode01 kernel: [86438.658276] veth2: renamed from veth3ac8593
Jul 31 18:37:00 pinode01 kernel: [86438.678203] br0: port 4(veth2) entered blocking state
Jul 31 18:37:00 pinode01 kernel: [86438.678211] br0: port 4(veth2) entered disabled state
Jul 31 18:37:00 pinode01 kernel: [86438.678480] device veth2 entered promiscuous mode
Jul 31 18:37:00 pinode01 kernel: [86438.689774] docker_gwbridge: port 13(veth6648b21) entered blocking state
Jul 31 18:37:00 pinode01 kernel: [86438.689792] docker_gwbridge: port 13(veth6648b21) entered disabled state
Jul 31 18:37:00 pinode01 kernel: [86438.690250] device veth6648b21 entered promiscuous mode
Jul 31 18:37:00 pinode01 kernel: [86438.690644] docker_gwbridge: port 13(veth6648b21) entered blocking state
Jul 31 18:37:00 pinode01 kernel: [86438.690652] docker_gwbridge: port 13(veth6648b21) entered forwarding state
Jul 31 18:37:01 pinode01 kernel: [86438.826205] veth11: renamed from veth1dd941b
Jul 31 18:37:01 pinode01 kernel: [86438.853819] br0: port 13(veth11) entered blocking state
Jul 31 18:37:01 pinode01 kernel: [86438.853827] br0: port 13(veth11) entered disabled state
Jul 31 18:37:01 pinode01 kernel: [86438.854014] device veth11 entered promiscuous mode
Jul 31 18:37:01 pinode01 kernel: [86438.854243] br0: port 13(veth11) entered blocking state
Jul 31 18:37:01 pinode01 kernel: [86438.854248] br0: port 13(veth11) entered forwarding state
Jul 31 18:37:01 pinode01 kernel: [86439.607799] eth0: renamed from veth06c40d1
Jul 31 18:37:01 pinode01 kernel: [86439.633904] docker_gwbridge: port 13(veth6648b21) entered disabled state
Jul 31 18:37:01 pinode01 kernel: [86439.634078] br0: port 13(veth11) entered disabled state
Jul 31 18:37:01 pinode01 kernel: [86439.634733] br0: port 13(veth11) entered blocking state
Jul 31 18:37:01 pinode01 kernel: [86439.634739] br0: port 13(veth11) entered forwarding state
Jul 31 18:37:02 pinode01 kernel: [86440.054787] eth1: renamed from vethc2c329d
Jul 31 18:37:02 pinode01 kernel: [86440.078353] br0: port 4(veth2) entered blocking state
Jul 31 18:37:02 pinode01 kernel: [86440.078362] br0: port 4(veth2) entered forwarding state
Jul 31 18:37:02 pinode01 kernel: [86440.106290] eth2: renamed from vethd7277da
Jul 31 18:37:02 pinode01 kernel: [86440.122570] IPv6: ADDRCONF(NETDEV_CHANGE): veth6648b21: link becomes ready
Jul 31 18:37:02 pinode01 kernel: [86440.122676] docker_gwbridge: port 13(veth6648b21) entered blocking state
Jul 31 18:37:02 pinode01 kernel: [86440.122682] docker_gwbridge: port 13(veth6648b21) entered forwarding state
Jul 31 18:37:03 pinode01 kernel: [86441.688994] veth12: renamed from veth4806182
Jul 31 18:37:04 pinode01 kernel: [86441.722197] br0: port 14(veth12) entered blocking state
Jul 31 18:37:04 pinode01 kernel: [86441.722205] br0: port 14(veth12) entered disabled state
Jul 31 18:37:04 pinode01 kernel: [86441.722469] device veth12 entered promiscuous mode
Jul 31 18:37:04 pinode01 kernel: [86441.741921] docker_gwbridge: port 14(vethec5b7c8) entered blocking state
Jul 31 18:37:04 pinode01 kernel: [86441.741931] docker_gwbridge: port 14(vethec5b7c8) entered disabled state
Jul 31 18:37:04 pinode01 kernel: [86441.742202] device vethec5b7c8 entered promiscuous mode
Jul 31 18:37:04 pinode01 kernel: [86441.742504] docker_gwbridge: port 14(vethec5b7c8) entered blocking state
Jul 31 18:37:04 pinode01 kernel: [86441.742513] docker_gwbridge: port 14(vethec5b7c8) entered forwarding state
Jul 31 18:37:04 pinode01 kernel: [86441.851006] veth3: renamed from veth26de17e
Jul 31 18:37:04 pinode01 kernel: [86441.874668] br0: port 5(veth3) entered blocking state
Jul 31 18:37:04 pinode01 kernel: [86441.874678] br0: port 5(veth3) entered disabled state
Jul 31 18:37:04 pinode01 kernel: [86441.875066] device veth3 entered promiscuous mode
Jul 31 18:37:04 pinode01 kernel: [86441.875552] br0: port 5(veth3) entered blocking state
Jul 31 18:37:04 pinode01 kernel: [86441.875561] br0: port 5(veth3) entered forwarding state
Jul 31 18:37:04 pinode01 kernel: [86442.669653] docker_gwbridge: port 14(vethec5b7c8) entered disabled state
Jul 31 18:37:04 pinode01 kernel: [86442.669908] br0: port 5(veth3) entered disabled state
Jul 31 18:37:05 pinode01 kernel: [86442.882311] eth0: renamed from vetha31591b
Jul 31 18:37:05 pinode01 kernel: [86442.901668] br0: port 14(veth12) entered blocking state
Jul 31 18:37:05 pinode01 kernel: [86442.901680] br0: port 14(veth12) entered forwarding state
Jul 31 18:37:05 pinode01 kernel: [86443.350341] eth1: renamed from vethd967dda
Jul 31 18:37:05 pinode01 kernel: [86443.379922] br0: port 5(veth3) entered blocking state
Jul 31 18:37:05 pinode01 kernel: [86443.379934] br0: port 5(veth3) entered forwarding state
Jul 31 18:37:05 pinode01 kernel: [86443.522398] eth2: renamed from veth800b8d8
Jul 31 18:37:05 pinode01 kernel: [86443.548668] IPv6: ADDRCONF(NETDEV_CHANGE): vethec5b7c8: link becomes ready
Jul 31 18:37:05 pinode01 kernel: [86443.548804] docker_gwbridge: port 14(vethec5b7c8) entered blocking state
Jul 31 18:37:05 pinode01 kernel: [86443.548810] docker_gwbridge: port 14(vethec5b7c8) entered forwarding state

I’m having exactly the same problem in an almost identical setup. Does anyone have any ideas about this?

What’s odd: I’m reading lots of similar descriptions of this problem on the net, but almost all the threads end with someone asking the question and nobody offering a way to solve it. Is this a super rare corner case?

My issue seemed to stem from the DHCP lease configuration on my Raspberry Pis. I tweaked too many settings to count without getting a clean resolution. My “fix” is a horrific hack that just continually renews the lease. It’s worked without issue for quite a while now.

Here’s the code:
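In outline, it’s a small script run from cron every few minutes that tells dhcpcd to renew its lease. A minimal sketch of that kind of script, assuming dhcpcd (Raspbian’s default DHCP client) managing eth0; the interface name, binary path, and the fallback to a full rebind are assumptions for a stock Raspbian install rather than anything exact:

#!/bin/sh
# renew-lease.sh: force a DHCP lease renewal so the interface never goes stale.
# The interface name and dhcpcd path are assumptions for stock Raspbian; adjust as needed.
IFACE="${1:-eth0}"

# Ask the running dhcpcd daemon to renew its existing lease (-N);
# if that fails for any reason, fall back to a full rebind (-n).
/sbin/dhcpcd -N "$IFACE" || /sbin/dhcpcd -n "$IFACE"

The idea is just to drop it somewhere like /usr/local/bin on every node and trigger it from cron every few minutes.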

And the playbook:
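The playbook only needs to copy that script onto each node and install the cron entry. A sketch, assuming an inventory group called swarm_nodes, the script saved next to the playbook as renew-lease.sh, and an illustrative five-minute interval (all of these names are placeholders):

---
- hosts: swarm_nodes
  become: yes
  tasks:
    - name: Install the DHCP lease-renewal script
      copy:
        src: renew-lease.sh
        dest: /usr/local/bin/renew-lease.sh
        owner: root
        group: root
        mode: '0755'

    - name: Renew the DHCP lease every five minutes
      cron:
        name: renew dhcp lease
        cron_file: renew-dhcp-lease
        user: root
        minute: "*/5"
        job: /usr/local/bin/renew-lease.sh eth0

Run it against the workers (or all nodes) and the lease gets refreshed on a schedule instead of waiting for it to expire.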