"Bulk sync to node XXX timed out" in Docker Swarm

I'm having a problem with one of my worker nodes. I have 1 manager and 9 worker nodes, and the workers are on a different IP range from the manager node. My problem is that a service deployed on node A (the node having trouble) does not seem to communicate with the other nodes in my swarm. I have opened connections both ways between node A and all of my swarm nodes, but it doesn't seem to be working.
All I get is this error message when I view the Docker logs: time="2022-06-03T21:55:22.399445386+06:30" level=error msg="Bulk sync to node eb8646bd85de timed out".

I'm using Docker version 20.10.16 on node A and 19.03.9 on the manager node.
All my machines are running CentOS 7.

Have you tried searching for this message? There are multiple reports related to it. They say it might be that a required port is not accessible. I realize that not all of the related issues have a solution, so I'll quote the part that may be more helpful:

What I would do:

  • Go to the node (SSH) and check if Docker is running properly (you have probably done that already)
  • Check if Docker Swarm is listening on its ports: Open protocols and ports between the hosts
    You can use netstat to check the ports (see the example commands after this list)
  • Try to check if you can access those ports locally. You can use telnet or netcat for that.
  • Try to check the ports from another node…
  • Configure your firewall if that is the problem
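
For reference, these are the ports a default swarm setup needs open between all nodes, plus a rough sketch of how I would check and open them. This assumes CentOS 7 with firewalld and the default Swarm ports; NODE_A_IP and MANAGER_IP are placeholders for your own addresses:

```bash
# Default ports a swarm needs open between all nodes (unmodified setup):
#   2377/tcp      - cluster management (only manager nodes listen here)
#   7946/tcp+udp  - node-to-node communication / gossip (the path the
#                   "bulk sync" messages travel over)
#   4789/udp      - overlay network data traffic (VXLAN)

# On node A: is dockerd listening on the gossip port?
sudo netstat -tulpn | grep 7946

# From another node: can node A be reached on the swarm ports?
# (NODE_A_IP is a placeholder; UDP checks with nc are only a rough indicator.)
nc -zv  NODE_A_IP 7946
nc -zvu NODE_A_IP 7946
nc -zvu NODE_A_IP 4789
nc -zv  MANAGER_IP 2377    # from node A towards the manager

# If the ports are blocked and you use firewalld (the CentOS 7 default),
# open them permanently and reload:
sudo firewall-cmd --permanent --add-port=2377/tcp --add-port=7946/tcp \
     --add-port=7946/udp --add-port=4789/udp
sudo firewall-cmd --reload
```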

To expand on @rimelek's post: make sure that the subnets (what you call "different IP range") allow low-latency network connections among all nodes. Swarm uses Raft for cluster membership and coordination, which relies on low-latency networking for stable operation - everything else will be brittle.

For instance: running swarm cluster nodes in different availability zones inside a region of a cloud provider works like a charm, but running swarm cluster nodes in different regions is brittle, and it is even worse if the nodes are spread across different cloud providers… What all those scenarios have in common is that the nodes are in different subnets (as in your scenario).
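
If you want a quick sanity check of the latency between your subnets, a rough sketch like this, run from node A, is usually enough for a first impression (the IPs are placeholders for your manager and worker addresses):

```bash
# Rough round-trip latency check from node A to the other swarm nodes
# (replace the IPs with your actual manager/worker addresses).
for host in 10.0.1.10 10.0.2.11 10.0.2.12; do
  echo "--- $host ---"
  ping -c 5 -q "$host" | tail -n 2   # packet loss and rtt min/avg/max/mdev
done
```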