All Docker services in Docker Swarm suddenly restarted

Hello,

I have a Docker Swarm cluster of 14 nodes, all of which are managers.
Server Version: 20.10.12
Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.7.1-docker)
  scan: Docker Scan (Docker Inc., v0.12.0)

At some point all services were restarted and I don’t understand exactly why. Here are logs from two nodes.
I only included warnings and errors. Is it a network problem?

115	Sep 22 11:29:18 node002 dockerd[2948005]: time="2022-09-22T11:29:18.540568839Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Unknown desc = context canceled" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
119	Sep 22 11:29:23 node002 dockerd[2948005]: time="2022-09-22T11:29:23.558436742Z" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
125	Sep 22 11:29:28 node002 dockerd[2948005]: time="2022-09-22T11:29:28.856514806Z" level=error msg="agent: session failed" backoff=700ms error="session initiation timed out" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
135	Sep 22 11:29:40 node002 dockerd[2948005]: time="2022-09-22T11:29:40.784675768Z" level=warning msg="failed to deactivate service binding for container traefik-gateway_traefik.npfekhs89310r9c3fiqszvju5.l08gj1do3br7kxf1q3sqfk9lu" error="No such container: traefik-gateway_traefik.npfekhs89310r9c3fiqszvju5.l08gj1do3br7kxf1q3sqfk9lu" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
162	Sep 22 11:29:47 node002 dockerd[2948005]: time="2022-09-22T11:29:47.575638185Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Unknown desc = context canceled" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
167	Sep 22 11:29:47 node002 dockerd[2948005]: time="2022-09-22T11:29:47.691096033Z" level=warning msg="Peer operation failed:could not delete fdb entry for nid:0gmfglhi3i27yv32tou5p35fa eid:d552e9aaff46b2586e906d8e678559c0a4205532d311cfe0434329de3ef4f15a into the sandbox:Search neighbor failed for IP 10.226.0.19, mac 02:42:0a:00:06:04, present in db:false op:&{2 0gmfglhi3i27yv32tou5p35fa d552e9aaff46b2586e906d8e678559c0a4205532d311cfe0434329de3ef4f15a [0 0 0 0 0 0 0 0 0 0 255 255 10 0 6 4] [255 255 255 0] [2 66 10 0 6 4] [0 0 0 0 0 0 0 0 0 0 255 255 10 226 0 19] false false false EventNotify}"
168	Sep 22 11:29:47 node002 dockerd[2948005]: time="2022-09-22T11:29:47.692479770Z" level=warning msg="Neighbor entry already present for IP 10.0.6.12, mac 02:42:0a:00:06:0c neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 0 6 12] dstMac:[2 66 10 0 6 12] linkName:vx-001006-0gmfg linkDst:vxlan0 family:0} forceUpdate:false"
191	Sep 22 11:29:52 node002 containerd[577]: time="2022-09-22T11:29:52.543010526Z" level=error msg="copy shim log" error="read /proc/self/fd/15: file already closed"
192	Sep 22 11:29:52 node002 dockerd[2948005]: time="2022-09-22T11:29:52.547309340Z" level=warning msg="rmServiceBinding bd07dfbf0c5f93724ecc5b770f0b2a388139e8d27c3b7b39f757a21ba538f1e7 possible transient state ok:false entries:0 set:false "
198	Sep 22 11:29:52 node002 dockerd[2948005]: time="2022-09-22T11:29:52.589079718Z" level=error msg="Handler for GET /v1.24/tasks returned error: write unix /run/docker.sock->@: write: broken pipe"
200	Sep 22 11:29:52 node002 dockerd[2948005]: time="2022-09-22T11:29:52.617154429Z" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
265	Sep 22 11:29:57 node002 dockerd[2948005]: time="2022-09-22T11:29:57.837383087Z" level=error msg="agent: session failed" backoff=700ms error="session initiation timed out" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
269	Sep 22 11:29:58 node002 dockerd[2948005]: time="2022-09-22T11:29:58.474352978Z" level=error msg="agent: session failed" backoff=1.5s error="rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online." module=node/agent node.id=npfekhs89310r9c3fiqszvju5
272	Sep 22 11:29:58 node002 dockerd[2948005]: time="2022-09-22T11:29:58.786772449Z" level=error msg="agent: session failed" backoff=3.1s error="rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online." module=node/agent node.id=npfekhs89310r9c3fiqszvju5
275	Sep 22 11:29:59 node002 dockerd[2948005]: time="2022-09-22T11:29:59.265387941Z" level=error msg="agent: session failed" backoff=6.3s error="rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online." module=node/agent node.id=npfekhs89310r9c3fiqszvju5
280	Sep 22 11:29:59 node002 dockerd[2948005]: time="2022-09-22T11:29:59.512851831Z" level=warning msg="rmServiceBinding 5bca1782ecafde6b16595ed018f940f7111b5ac3b331d6d279da13cc8d870065 possible transient state ok:false entries:0 set:false "
303	Sep 22 11:30:01 node002 dockerd[2948005]: time="2022-09-22T11:30:01.715244933Z" level=warning msg="rmServiceBinding 1ff30b215db1aa3007ae62853fb467d12a8e7698ec39163e0d268a67dc12d96b possible transient state ok:false entries:0 set:false "
304	Sep 22 11:30:01 node002 dockerd[2948005]: time="2022-09-22T11:30:01.719435050Z" level=warning msg="rmServiceBinding c8602047e3c4a0c5eb6428505d6d1aaf84eebe902991e4b6be6507b2f24c608f possible transient state ok:false entries:0 set:false "
305	Sep 22 11:30:01 node002 dockerd[2948005]: time="2022-09-22T11:30:01.721746587Z" level=warning msg="rmServiceBinding 813d446cca450cb8dd3ca667d808e170431a76be8a7f1f6b9bc204e1b8c5288a possible transient state ok:false entries:0 set:false "
306	Sep 22 11:30:01 node002 dockerd[2948005]: time="2022-09-22T11:30:01.722701128Z" level=warning msg="rmServiceBinding 69920cb261a4e85852c80e751dabe4f6b3c7e5962fc6560d1d0431908441062d possible transient state ok:false entries:0 set:false "
307	Sep 22 11:30:01 node002 dockerd[2948005]: time="2022-09-22T11:30:01.797986881Z" level=error msg="agent: session failed" backoff=8s error="rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online." module=node/agent node.id=npfekhs89310r9c3fiqszvju5
313	Sep 22 11:30:03 node002 containerd[577]: time="2022-09-22T11:30:03.337565622Z" level=error msg="copy shim log" error="read /proc/self/fd/15: file already closed"
346	Sep 22 11:30:07 node002 dockerd[2948005]: time="2022-09-22T11:30:07.683232349Z" level=error msg="agent: session failed" backoff=8s error="rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online." module=node/agent node.id=npfekhs89310r9c3fiqszvju5
360	Sep 22 11:30:18 node002 dockerd[2948005]: time="2022-09-22T11:30:18.478136367Z" level=warning msg="rmServiceBinding 416f5bea8e8e23743ad05485d797d9a423673adffe35ec75079911ba0f3b73de possible transient state ok:false entries:0 set:false "
361	Sep 22 11:30:18 node002 dockerd[2948005]: time="2022-09-22T11:30:18.640400285Z" level=warning msg="rmServiceBinding e8f294a103f52957a40d0620d6183c1fd4e38b47a1fd41293990c813e837cd66 possible transient state ok:false entries:0 set:false "
400	Sep 22 11:30:21 node002 dockerd[2948005]: time="2022-09-22T11:30:21.041939582Z" level=warning msg="rmServiceBinding 78902856456aa27fc7b52386d34fb17aa62508473026fafe37d581863e0adc8c possible transient state ok:false entries:0 set:false "
401	Sep 22 11:30:21 node002 dockerd[2948005]: time="2022-09-22T11:30:21.044079775Z" level=warning msg="rmServiceBinding 08649475f6c8e6df7fceadada8c11d3e2eb2a3a89d24004b7d2248b9d517773b possible transient state ok:false entries:0 set:false "
420	Sep 22 11:30:21 node002 dockerd[2948005]: time="2022-09-22T11:30:21.285331576Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 0gmfglhi3i27yv32tou5p35fa b7faade661b5e4f0786b11cea93bc8a8cc8de5e48853cf47493e35b4699c3be5], retrying...."
422	Sep 22 11:30:21 node002 dockerd[2948005]: time="2022-09-22T11:30:21.305331378Z" level=warning msg="rmServiceBinding handleEpTableEvent docker-system-prune_main 438e7108bc3ab4c83236100ef74d7e22416b7a54e23fa14290034b5a3a7b306a aborted c.serviceBindings[skey] !ok"
445	Sep 22 11:30:23 node002 dockerd[2948005]: time="2022-09-22T11:30:23.411618778Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Unknown desc = context canceled" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
468	Sep 22 11:30:27 node002 dockerd[2948005]: time="2022-09-22T11:30:27.418375164Z" level=warning msg="rmServiceBinding ddb1006f9800c48b6851f9dd5363cfa81708e9a0299be7ca92b7cee0e59fb0a2 possible transient state ok:false entries:0 set:false "
469	Sep 22 11:30:27 node002 dockerd[2948005]: time="2022-09-22T11:30:27.560684401Z" level=warning msg="rmServiceBinding a07de496b307c3f031d72e501b75ba0e0f0c7db19b8a9a27eef97ec28300ccb6 possible transient state ok:false entries:0 set:false "
470	Sep 22 11:30:27 node002 dockerd[2948005]: time="2022-09-22T11:30:27.562064675Z" level=warning msg="rmServiceBinding 1e146a65175604224624787c4da0582efbc6860782cda589f931f427cf0a8483 possible transient state ok:false entries:0 set:false "
471	Sep 22 11:30:27 node002 dockerd[2948005]: time="2022-09-22T11:30:27.769339209Z" level=warning msg="rmServiceBinding 30550d91e27e78762293f42d28558fbbe1b7056a381e91b3ad2e4d2fcf9b2169 possible transient state ok:false entries:0 set:false "
472	Sep 22 11:30:28 node002 dockerd[2948005]: time="2022-09-22T11:30:28.457418736Z" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
475	Sep 22 11:30:28 node002 dockerd[2948005]: time="2022-09-22T11:30:28.818614168Z" level=warning msg="rmServiceBinding 12628e5dd47cc001e5e1cddaf3d391ade5264dc365c1b8610957bdeb2356eb15 possible transient state ok:false entries:0 set:false "
476	Sep 22 11:30:31 node002 dockerd[2948005]: time="2022-09-22T11:30:31.838001372Z" level=warning msg="rmServiceBinding 1f78b3abdf94a25f4e64a0d9e836461fb559d8dd4f5e666bb4637fa66eface7b possible transient state ok:false entries:0 set:false "
503	Sep 22 11:30:33 node002 dockerd[2948005]: time="2022-09-22T11:30:33.629004786Z" level=error msg="agent: session failed" backoff=700ms error="session initiation timed out" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
509	Sep 22 11:30:34 node002 dockerd[2948005]: time="2022-09-22T11:30:34.169415465Z" level=warning msg="rmServiceBinding 3d17037ae2ec8f026efb4be4e4dbb14419ba18a723a05a5cf0a4a7acf4fd33a2 possible transient state ok:false entries:0 set:false "
511	Sep 22 11:30:39 node002 dockerd[2948005]: time="2022-09-22T11:30:39.107370752Z" level=error msg="agent: session failed" backoff=1.5s error="session initiation timed out" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
516	Sep 22 11:30:39 node002 dockerd[2948005]: time="2022-09-22T11:30:39.312158879Z" level=warning msg="rmServiceBinding afa29a6c864aac1302921383a1f15279bb943bd16753e83c4fa5136114d2f40c possible transient state ok:false entries:0 set:false "
554	Sep 22 11:30:43 node002 dockerd[2948005]: time="2022-09-22T11:30:43.153125159Z" level=error msg="agent: session failed" backoff=3.1s error="rpc error: code = Aborted desc = dispatcher is stopped" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
557	Sep 22 11:30:43 node002 dockerd[2948005]: time="2022-09-22T11:30:43.312871815Z" level=error msg="error creating cluster object" error="name conflicts with an existing object" module=node node.id=npfekhs89310r9c3fiqszvju5
561	Sep 22 11:30:45 node002 dockerd[2948005]: time="2022-09-22T11:30:45.373237599Z" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
585	Sep 22 11:30:50 node002 dockerd[2948005]: time="2022-09-22T11:30:50.437797166Z" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
586	Sep 22 11:30:51 node002 dockerd[2948005]: time="2022-09-22T11:30:51.647542550Z" level=warning msg="failed to deactivate service binding for container docker-system-prune_main.npfekhs89310r9c3fiqszvju5.61larr04mynn5rsh9u4ofsqrk" error="No such container: docker-system-prune_main.npfekhs89310r9c3fiqszvju5.61larr04mynn5rsh9u4ofsqrk" module=node/agent node.id=npfekhs89310r9c3fiqszvju5
587	Sep 22 11:30:52 node002 dockerd[2948005]: time="2022-09-22T11:30:52.959428017Z" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
589	Sep 22 11:30:55 node002 dockerd[2948005]: time="2022-09-22T11:30:55.520151160Z" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
591	Sep 22 11:30:57 node002 dockerd[2948005]: time="2022-09-22T11:30:57.520633562Z" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
592	Sep 22 11:31:02 node002 dockerd[2948005]: time="2022-09-22T11:31:02.673381909Z" level=error msg="error receiving response" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
618	Sep 22 11:31:09 node002 dockerd[2948005]: time="2022-09-22T11:31:09.868119911Z" level=warning msg="Neighbor entry already present for IP 10.0.6.21, mac 02:42:0a:00:06:15 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 0 6 21] dstMac:[2 66 10 0 6 21] linkName:vx-001006-0gmfg linkDst:vxlan0 family:0} forceUpdate:false"
619	Sep 22 11:31:09 node002 dockerd[2948005]: time="2022-09-22T11:31:09.868211943Z" level=warning msg="Neighbor entry already present for IP 10.0.6.9, mac 02:42:0a:00:06:09 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 0 6 9] dstMac:[2 66 10 0 6 9] linkName:vx-001006-0gmfg linkDst:vxlan0 family:0} forceUpdate:false"
705	Sep 22 11:31:19 node002 containerd[577]: time="2022-09-22T11:31:19.445020599Z" level=error msg="copy shim log" error="read /proc/self/fd/32: file already closed"
756	Sep 22 11:31:24 node002 dockerd[2948005]: time="2022-09-22T11:31:24.290948765Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 0gmfglhi3i27yv32tou5p35fa ab70b026fca20e42a20a10f6675e9591f831e7365ecc8df8e635c8192166b99d], retrying...."
811	Sep 22 11:34:55 node002 dockerd[2948005]: time="2022-09-22T11:34:55.240558589Z" level=error msg="Bulk sync to node 0757c57d0808 timed out"
844	Sep 22 11:36:55 node002 dockerd[2948005]: time="2022-09-22T11:36:55.233987690Z" level=error msg="Bulk sync to node d3266578b182 timed out"

123	Sep 22 11:28:59 node001 dockerd[1068]: time="2022-09-22T11:28:59.879247338Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:48492->10.226.96.3:7946: i/o timeout"
127	Sep 22 11:29:03 node001 dockerd[1068]: time="2022-09-22T11:29:03.879197167Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:48506->10.226.80.4:7946: i/o timeout"
133	Sep 22 11:29:13 node001 dockerd[1068]: time="2022-09-22T11:29:13.878815439Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:55898->10.226.96.4:7946: i/o timeout"
134	Sep 22 11:29:14 node001 dockerd[1068]: time="2022-09-22T11:29:14.846668082Z" level=warning msg="memberlist: Refuting a suspect message (from: 7d1a2a939c07)"
136	Sep 22 11:29:18 node001 dockerd[1068]: time="2022-09-22T11:29:18.465143187Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Unknown desc = context canceled" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
139	Sep 22 11:29:21 node001 dockerd[1068]: time="2022-09-22T11:29:21.879000773Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:36672->10.226.80.2:7946: i/o timeout"
140	Sep 22 11:29:22 node001 dockerd[1068]: time="2022-09-22T11:29:22.478707081Z" level=warning msg="memberlist: failed to receive: read tcp 10.226.0.18:7946->10.226.80.2:52064: i/o timeout from=10.226.80.2:52064"
141	Sep 22 11:29:22 node001 dockerd[1068]: time="2022-09-22T11:29:22.478853856Z" level=warning msg="memberlist: Failed to send error: write tcp 10.226.0.18:7946->10.226.80.2:52064: i/o timeout from=10.226.80.2:52064"
143	Sep 22 11:29:23 node001 dockerd[1068]: time="2022-09-22T11:29:23.555832143Z" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
147	Sep 22 11:29:28 node001 dockerd[1068]: time="2022-09-22T11:29:28.855037437Z" level=error msg="agent: session failed" backoff=700ms error="session initiation timed out" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
161	Sep 22 11:29:38 node001 dockerd[1068]: time="2022-09-22T11:29:38.879706862Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:48496->10.226.96.3:7946: i/o timeout"
162	Sep 22 11:29:41 node001 dockerd[1068]: time="2022-09-22T11:29:41.878478964Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:36680->10.226.80.2:7946: i/o timeout"
168	Sep 22 11:29:46 node001 dockerd[1068]: time="2022-09-22T11:29:46.603132699Z" level=error msg="error while reading from stream" error="rpc error: code = Canceled desc = context canceled"
208	Sep 22 11:29:47 node001 dockerd[1068]: time="2022-09-22T11:29:47.499404814Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Unknown desc = context canceled" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
246	Sep 22 11:29:47 node001 dockerd[1068]: time="2022-09-22T11:29:47.878877340Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:55900->10.226.96.4:7946: i/o timeout"
266	Sep 22 11:29:49 node001 dockerd[1068]: time="2022-09-22T11:29:49.102420567Z" level=warning msg="rmServiceBinding 5bca1782ecafde6b16595ed018f940f7111b5ac3b331d6d279da13cc8d870065 possible transient state ok:false entries:0 set:false "
267	Sep 22 11:29:49 node001 dockerd[1068]: time="2022-09-22T11:29:49.503311843Z" level=warning msg="rmServiceBinding 1ff30b215db1aa3007ae62853fb467d12a8e7698ec39163e0d268a67dc12d96b possible transient state ok:false entries:0 set:false "
270	Sep 22 11:29:49 node001 containerd[603]: time="2022-09-22T11:29:49.522001753Z" level=error msg="copy shim log" error="read /proc/self/fd/36: file already closed"
294	Sep 22 11:29:50 node001 dockerd[1068]: time="2022-09-22T11:29:50.505793229Z" level=warning msg="rmServiceBinding c8602047e3c4a0c5eb6428505d6d1aaf84eebe902991e4b6be6507b2f24c608f possible transient state ok:false entries:0 set:false "
311	Sep 22 11:29:52 node001 dockerd[1068]: time="2022-09-22T11:29:52.579273897Z" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
314	Sep 22 11:29:52 node001 dockerd[1068]: time="2022-09-22T11:29:52.952507819Z" level=warning msg="rmServiceBinding bd07dfbf0c5f93724ecc5b770f0b2a388139e8d27c3b7b39f757a21ba538f1e7 possible transient state ok:false entries:0 set:false "
315	Sep 22 11:29:53 node001 dockerd[1068]: time="2022-09-22T11:29:53.750684645Z" level=warning msg="rmServiceBinding 813d446cca450cb8dd3ca667d808e170431a76be8a7f1f6b9bc204e1b8c5288a possible transient state ok:false entries:0 set:false "
320	Sep 22 11:29:54 node001 containerd[603]: time="2022-09-22T11:29:54.274907510Z" level=error msg="copy shim log" error="read /proc/self/fd/29: file already closed"
321	Sep 22 11:29:54 node001 dockerd[1068]: time="2022-09-22T11:29:54.275682610Z" level=warning msg="rmServiceBinding 69920cb261a4e85852c80e751dabe4f6b3c7e5962fc6560d1d0431908441062d possible transient state ok:false entries:0 set:false "
353	Sep 22 11:29:57 node001 dockerd[1068]: time="2022-09-22T11:29:57.790504243Z" level=error msg="agent: session failed" backoff=700ms error="session initiation timed out" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
358	Sep 22 11:30:02 node001 dockerd[1068]: time="2022-09-22T11:30:02.970988344Z" level=error msg="agent: session failed" backoff=1.5s error="session initiation timed out" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
366	Sep 22 11:30:04 node001 dockerd[1068]: time="2022-09-22T11:30:04.150389507Z" level=warning msg="memberlist: Refuting a suspect message (from: ec2f2f7ee3d7)"
368	Sep 22 11:30:08 node001 dockerd[1068]: time="2022-09-22T11:30:08.098575172Z" level=error msg="agent: session failed" backoff=3.1s error="session initiation timed out" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
371	Sep 22 11:30:08 node001 dockerd[1068]: time="2022-09-22T11:30:08.695144642Z" level=warning msg="memberlist: Push/Pull with 5be85f88c168 failed: read tcp 10.226.0.18:36682->10.226.80.2:7946: i/o timeout"
402	Sep 22 11:30:17 node001 dockerd[1068]: time="2022-09-22T11:30:17.150203892Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 0gmfglhi3i27yv32tou5p35fa 2f60d7d3b83ba89fff188b19be149b071cbe5df49d71c1068ab1c5cc51ce0a9e], retrying...."
411	Sep 22 11:30:18 node001 dockerd[1068]: time="2022-09-22T11:30:18.552976990Z" level=warning msg="rmServiceBinding 416f5bea8e8e23743ad05485d797d9a423673adffe35ec75079911ba0f3b73de possible transient state ok:false entries:0 set:false "
413	Sep 22 11:30:18 node001 dockerd[1068]: time="2022-09-22T11:30:18.702711127Z" level=warning msg="rmServiceBinding e8f294a103f52957a40d0620d6183c1fd4e38b47a1fd41293990c813e837cd66 possible transient state ok:false entries:0 set:false "
414	Sep 22 11:30:18 node001 dockerd[1068]: time="2022-09-22T11:30:18.856504178Z" level=warning msg="rmServiceBinding 78902856456aa27fc7b52386d34fb17aa62508473026fafe37d581863e0adc8c possible transient state ok:false entries:0 set:false "
415	Sep 22 11:30:19 node001 dockerd[1068]: time="2022-09-22T11:30:19.353769461Z" level=warning msg="rmServiceBinding 08649475f6c8e6df7fceadada8c11d3e2eb2a3a89d24004b7d2248b9d517773b possible transient state ok:false entries:0 set:false "
454	Sep 22 11:30:23 node001 dockerd[1068]: time="2022-09-22T11:30:23.419043488Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Unknown desc = context canceled" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
458	Sep 22 11:30:25 node001 dockerd[1068]: time="2022-09-22T11:30:25.877967377Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:48498->10.226.96.3:7946: i/o timeout"
479	Sep 22 11:30:27 node001 dockerd[1068]: time="2022-09-22T11:30:27.523522178Z" level=warning msg="rmServiceBinding 30550d91e27e78762293f42d28558fbbe1b7056a381e91b3ad2e4d2fcf9b2169 possible transient state ok:false entries:0 set:false "
480	Sep 22 11:30:27 node001 dockerd[1068]: time="2022-09-22T11:30:27.524899861Z" level=warning msg="rmServiceBinding ddb1006f9800c48b6851f9dd5363cfa81708e9a0299be7ca92b7cee0e59fb0a2 possible transient state ok:false entries:0 set:false "
481	Sep 22 11:30:27 node001 dockerd[1068]: time="2022-09-22T11:30:27.686097208Z" level=warning msg="rmServiceBinding 1e146a65175604224624787c4da0582efbc6860782cda589f931f427cf0a8483 possible transient state ok:false entries:0 set:false "
482	Sep 22 11:30:27 node001 dockerd[1068]: time="2022-09-22T11:30:27.686941427Z" level=warning msg="rmServiceBinding a07de496b307c3f031d72e501b75ba0e0f0c7db19b8a9a27eef97ec28300ccb6 possible transient state ok:false entries:0 set:false "
483	Sep 22 11:30:28 node001 dockerd[1068]: time="2022-09-22T11:30:28.452089513Z" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
486	Sep 22 11:30:28 node001 dockerd[1068]: time="2022-09-22T11:30:28.563925526Z" level=warning msg="rmServiceBinding 12628e5dd47cc001e5e1cddaf3d391ade5264dc365c1b8610957bdeb2356eb15 possible transient state ok:false entries:0 set:false "
488	Sep 22 11:30:31 node001 dockerd[1068]: time="2022-09-22T11:30:31.726979044Z" level=warning msg="rmServiceBinding 1f78b3abdf94a25f4e64a0d9e836461fb559d8dd4f5e666bb4637fa66eface7b possible transient state ok:false entries:0 set:false "
497	Sep 22 11:30:33 node001 dockerd[1068]: time="2022-09-22T11:30:33.636539717Z" level=error msg="agent: session failed" backoff=700ms error="session initiation timed out" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
500	Sep 22 11:30:34 node001 dockerd[1068]: time="2022-09-22T11:30:34.877950930Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:48518->10.226.80.4:7946: i/o timeout"
502	Sep 22 11:30:37 node001 dockerd[1068]: time="2022-09-22T11:30:37.520624342Z" level=warning msg="rmServiceBinding afa29a6c864aac1302921383a1f15279bb943bd16753e83c4fa5136114d2f40c possible transient state ok:false entries:0 set:false "
504	Sep 22 11:30:39 node001 dockerd[1068]: time="2022-09-22T11:30:39.116306759Z" level=error msg="agent: session failed" backoff=1.5s error="session initiation timed out" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
507	Sep 22 11:30:39 node001 dockerd[1068]: time="2022-09-22T11:30:39.877721917Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:36690->10.226.80.2:7946: i/o timeout"
513	Sep 22 11:30:44 node001 dockerd[1068]: time="2022-09-22T11:30:44.615412170Z" level=error msg="agent: session failed" backoff=3.1s error="session initiation timed out" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
520	Sep 22 11:31:05 node001 dockerd[1068]: time="2022-09-22T11:31:05.433118209Z" level=warning msg="Entry was not in db: nid:9d326qs1i69j93a5pt692rtno eid:1c4b6b6898f67fd931b44c3ccadf8f0e577ff0ebdf9ca44bbf7d8990ced9e5d2 peerIP:10.0.8.7 peerMac:02:42:0a:00:08:07 isLocal:false vtep:10.226.80.4"
521	Sep 22 11:31:05 node001 dockerd[1068]: time="2022-09-22T11:31:05.433200871Z" level=warning msg="Peer operation failed:could not delete fdb entry for nid:9d326qs1i69j93a5pt692rtno eid:1c4b6b6898f67fd931b44c3ccadf8f0e577ff0ebdf9ca44bbf7d8990ced9e5d2 into the sandbox:Search neighbor failed for IP 10.226.80.4, mac 02:42:0a:00:08:07, present in db:false op:&{2 9d326qs1i69j93a5pt692rtno 1c4b6b6898f67fd931b44c3ccadf8f0e577ff0ebdf9ca44bbf7d8990ced9e5d2 [0 0 0 0 0 0 0 0 0 0 255 255 10 0 8 7] [255 255 255 0] [2 66 10 0 8 7] [0 0 0 0 0 0 0 0 0 0 255 255 10 226 80 4] false false false EventNotify}"
522	Sep 22 11:31:05 node001 dockerd[1068]: time="2022-09-22T11:31:05.877958641Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:48520->10.226.80.4:7946: i/o timeout"
525	Sep 22 11:31:08 node001 dockerd[1068]: time="2022-09-22T11:31:08.878137711Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:48522->10.226.80.4:7946: i/o timeout"
628	Sep 22 11:31:24 node001 containerd[603]: time="2022-09-22T11:31:24.139554314Z" level=error msg="copy shim log" error="read /proc/self/fd/36: file already closed"
683	Sep 22 11:31:27 node001 dockerd[1068]: time="2022-09-22T11:31:27.150891694Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint 0gmfglhi3i27yv32tou5p35fa df9dc8f17096ec004575b7a13939d2d5f71874439f2b1f29a67b420f7b27c7df], retrying...."
689	Sep 22 11:31:27 node001 dockerd[1068]: time="2022-09-22T11:31:27.183681209Z" level=warning msg="rmServiceBinding handleEpTableEvent docker-system-prune_main 1efb4caa5edda37b49edf10a662f40523ceea3f8338cd3cc371506358e765c61 aborted c.serviceBindings[skey] !ok"
690	Sep 22 11:31:27 node001 dockerd[1068]: time="2022-09-22T11:31:27.183716616Z" level=warning msg="rmServiceBinding handleEpTableEvent docker-system-prune_main 2be25847a2dfaf780b4a929ce44153ececd961e5d4cf0e0943d56524938275ac aborted c.serviceBindings[skey] !ok"
695	Sep 22 11:31:46 node001 dockerd[1068]: time="2022-09-22T11:31:46.164241915Z" level=error msg="Bulk sync to node 21b9d0cb629c timed out"
701	Sep 22 11:32:30 node001 dockerd[1068]: time="2022-09-22T11:32:30.878451417Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:48504->10.226.96.3:7946: i/o timeout"
707	Sep 22 11:32:44 node001 dockerd[1068]: time="2022-09-22T11:32:44.877879857Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:36696->10.226.80.2:7946: i/o timeout"
716	Sep 22 11:33:23 node001 dockerd[1068]: time="2022-09-22T11:33:23.878195710Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:55914->10.226.96.4:7946: i/o timeout"
727	Sep 22 11:34:01 node001 containerd[603]: time="2022-09-22T11:34:01.307584705Z" level=error msg="copy shim log" error="read /proc/self/fd/29: file already closed"
728	Sep 22 11:34:01 node001 dockerd[1068]: time="2022-09-22T11:34:01.308827825Z" level=warning msg="deleteServiceInfoFromCluster NetworkDB DeleteEntry failed for a51e9455fd7a0aa00d1dca7bd909f28a8314d9ef1163946ecb5a59469e408424 l9on31aw69tfju100xpzq6wrl err:cannot delete entry endpoint_table with network id l9on31aw69tfju100xpzq6wrl and key a51e9455fd7a0aa00d1dca7bd909f28a8314d9ef1163946ecb5a59469e408424 does not exist or is already being deleted"
729	Sep 22 11:34:01 node001 dockerd[1068]: time="2022-09-22T11:34:01.308920009Z" level=warning msg="rmServiceBinding deleteServiceInfoFromCluster traefik-gateway_traefik a51e9455fd7a0aa00d1dca7bd909f28a8314d9ef1163946ecb5a59469e408424 aborted lb.backEnds[eid] && lb.disabled[eid] !ok"
756	Sep 22 11:34:01 node001 dockerd[1068]: time="2022-09-22T11:34:01.596493352Z" level=error msg="fatal task error" error="task: non-zero exit (137): dockerexec: unhealthy container" module=node/agent/taskmanager node.id=rwwmifudcdpkivgf3h8gaaxpx service.id=mywcxlajhr0kxj63afymy95lk task.id=t86eq3wgob3ybzbi9plt3gn1b
757	Sep 22 11:34:05 node001 dockerd[1068]: time="2022-09-22T11:34:05.282415694Z" level=warning msg="failed to deactivate service binding for container traefik-gateway_traefik.rwwmifudcdpkivgf3h8gaaxpx.q52wmnzwh0vbhdql5cylh5j07" error="No such container: traefik-gateway_traefik.rwwmifudcdpkivgf3h8gaaxpx.q52wmnzwh0vbhdql5cylh5j07" module=node/agent node.id=rwwmifudcdpkivgf3h8gaaxpx
866	Sep 22 11:37:45 node001 containerd[603]: time="2022-09-22T11:37:45.241267605Z" level=error msg="copy shim log" error="read /proc/self/fd/29: file already closed"
867	Sep 22 11:37:45 node001 dockerd[1068]: time="2022-09-22T11:37:45.242806973Z" level=warning msg="deleteServiceInfoFromCluster NetworkDB DeleteEntry failed for 88ed88c187433f9ca781422ec164b4eb9f360a0b4f5f593b94cb6b4cee117751 l9on31aw69tfju100xpzq6wrl err:cannot delete entry endpoint_table with network id l9on31aw69tfju100xpzq6wrl and key 88ed88c187433f9ca781422ec164b4eb9f360a0b4f5f593b94cb6b4cee117751 does not exist or is already being deleted"
868	Sep 22 11:37:45 node001 dockerd[1068]: time="2022-09-22T11:37:45.242857455Z" level=warning msg="rmServiceBinding deleteServiceInfoFromCluster traefik-gateway_traefik 88ed88c187433f9ca781422ec164b4eb9f360a0b4f5f593b94cb6b4cee117751 aborted lb.backEnds[eid] && lb.disabled[eid] !ok"
895	Sep 22 11:37:45 node001 dockerd[1068]: time="2022-09-22T11:37:45.514221259Z" level=error msg="fatal task error" error="task: non-zero exit (137): dockerexec: unhealthy container" module=node/agent/taskmanager node.id=rwwmifudcdpkivgf3h8gaaxpx service.id=mywcxlajhr0kxj63afymy95lk task.id=zix8lp1fx9omve98f9hkyjjz7

I wouldn’t be surprised if your problem is caused by having 14 manager nodes.

Years ago I researched consensus algorithms, and it stuck in my head that it’s not recommended to use more than 7 manager nodes with Raft, due to overhead and performance degradation (often 3 or 5 is enough). Odd numbers are preferred, because an even number of managers tolerates no more failures than the odd number just below it: a majority of floor(N/2) + 1 managers must stay healthy for the cluster not to become headless. E.g.: with 13 nodes you need 7 healthy nodes, while with 14 or 15 nodes you need 8 healthy nodes, so 13 and 14 managers both survive the loss of 6, and only the 15th raises that to 7.
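
To make the arithmetic concrete, here is a minimal Python sketch of the majority rule described above. The function names `quorum` and `fault_tolerance` are my own, chosen for illustration; they are not part of any Docker tooling.

```python
def quorum(managers: int) -> int:
    """Smallest majority of `managers`: Raft needs more than half to agree."""
    return managers // 2 + 1


def fault_tolerance(managers: int) -> int:
    """How many managers can be lost while a majority is still reachable."""
    return managers - quorum(managers)


# Print the quorum and tolerated failures for a few cluster sizes.
for n in (3, 5, 7, 13, 14, 15):
    print(f"{n:>2} managers: quorum {quorum(n)}, tolerates {fault_tolerance(n)} lost")
```

The 14th manager adds Raft traffic without letting the cluster survive any more failures than it would with 13.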

It is not recommended to put manager nodes in an auto-scaling group, where new instances are spawned and old instances might be destroyed. Instead, it’s better to have 3 smaller nodes that act purely as manager nodes and to put the worker nodes, where all workloads are deployed, in an auto-scaling group.

That said, the logs indicate that the node is either in a split-brain scenario, or your cluster is headless because too few managers are healthy.
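
If you want to see which symptom shows up first on each node, you could tally the relevant messages. The following is only a rough Python sketch: the substrings are copied from the log lines you posted, while the labels, `PATTERNS`, and `summarize` are names I made up for illustration. The two sample lines are taken verbatim from your excerpts, minus the leading line number.

```python
import re
from collections import Counter

# Substrings copied from the posted logs, mapped to hypothetical short labels.
PATTERNS = {
    "no_leader": "The swarm does not have a leader",
    "session_timeout": "session initiation timed out",
    "gossip_timeout": "memberlist: Failed fallback ping",
    "bulk_sync_timeout": "Bulk sync to node",
}

# dockerd embeds its own timestamp as time="..." in every message.
TIME_RE = re.compile(r'time="([^"]+)"')


def summarize(lines):
    """Count each symptom and remember the timestamp of its first occurrence."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = TIME_RE.search(line)
        stamp = match.group(1) if match else None
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
                first_seen.setdefault(label, stamp)
    return counts, first_seen


SAMPLE = """
Sep 22 11:29:58 node002 dockerd[2948005]: time="2022-09-22T11:29:58.474352978Z" level=error msg="agent: session failed" backoff=1.5s error="rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online." module=node/agent node.id=npfekhs89310r9c3fiqszvju5
Sep 22 11:28:59 node001 dockerd[1068]: time="2022-09-22T11:28:59.879247338Z" level=warning msg="memberlist: Failed fallback ping: read tcp 10.226.0.18:48492->10.226.96.3:7946: i/o timeout"
""".strip().splitlines()

counts, first_seen = summarize(SAMPLE)
for label in counts:
    print(label, counts[label], first_seen[label])
```

Running it over the full daemon log of every manager would show whether the gossip timeouts on port 7946 started before the leader was lost, which would point towards the network rather than the managers themselves.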

Also, Raft needs a low-latency network for stable operation. Running a cluster across availability zones within one region of a cloud provider is fine; running it across regions is asking for trouble…