Node takes a long time to join swarm

Joining a node to the Docker swarm produces the following error:
“Error response from daemon: Timeout was reached before node joined. The attempt to join the swarm will continue in the background. Use the “docker info” command to see the current swarm status of your node.”

The node eventually joins the swarm, but only after a long time (several minutes).

This is an issue since we want to automate the building of the swarm.

What could cause this delay? Is it possible to extend the timeout period? What would be the best way to debug this?
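
Since the error message says the join continues in the background, an automation script could treat the timeout as non-fatal and poll the node state until it becomes active. The following is only a minimal sketch of that idea: it assumes the `docker` CLI is on the PATH, the function names `local_swarm_state` and `wait_until_active` are made up for illustration, and the 600-second limit is an arbitrary choice. It does not change the join timeout itself.

```python
import subprocess
import time


def local_swarm_state():
    """Return this node's swarm state as reported by `docker info`.

    Typical values are "inactive", "pending", "active", "error" and "locked".
    """
    result = subprocess.run(
        ["docker", "info", "--format", "{{.Swarm.LocalNodeState}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_until_active(get_state=local_swarm_state, timeout=600.0, interval=5.0):
    """Poll get_state() until it returns "active".

    Returns True once the node is active, or False if `timeout` seconds
    pass first. The state getter is a parameter so it can be stubbed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_state() == "active":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A script would run `docker swarm join ...`, ignore the timeout error, and then call `wait_until_active()` before continuing with the next node.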

By any chance, are you trying to create a swarm with nodes connected over a WAN? Swarm uses the Raft consensus algorithm under the hood, and Raft requires a low-latency network for stable operation.

If you want to join nodes in edge/WAN locations to a swarm cluster, it’s not going to work reliably. You might want to look at Portainer and its Edge Agent if you want a single point of control for such a scenario. It’s not going to give you a swarm cluster, but it will allow you to control all instances from a single Portainer instance…

The ping times from one manager server to two different worker servers are as follows:

64 bytes from worker1 ( icmp_seq=1 ttl=64 time=0.157 ms
64 bytes from worker1 ( icmp_seq=2 ttl=64 time=0.252 ms
64 bytes from worker1 ( icmp_seq=3 ttl=64 time=0.207 ms
64 bytes from worker1 ( icmp_seq=4 ttl=64 time=0.289 ms
64 bytes from worker1 ( icmp_seq=5 ttl=64 time=0.238 ms

64 bytes from worker2 ( icmp_seq=2 ttl=64 time=0.277 ms
64 bytes from worker2 ( icmp_seq=3 ttl=64 time=0.392 ms
64 bytes from worker2 ( icmp_seq=4 ttl=64 time=0.289 ms
64 bytes from worker2 ( icmp_seq=5 ttl=64 time=0.303 ms

Joining workers to the swarm does not time out, but the timeout occurs when we try to join a second manager to the swarm.

We are running Docker version 20.10.2.

The only thing that comes to mind is that a firewall (or security groups, if the nodes are in the cloud) blocks the required ports.

See: Getting started with swarm mode | Docker Documentation.
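
To rule out blocked ports, a quick reachability test from the joining manager to the existing one (and in the other direction, since managers also talk to each other) may help. Below is a rough sketch, not an official tool: it only probes the TCP ports that the swarm documentation lists (2377 for cluster management, 7946 for node communication), and the function names are made up for illustration. 7946/udp and 4789/udp are required as well, but a simple connect test cannot verify UDP. Port 2377 only listens on managers, and 7946 only once the node is part of a swarm.

```python
import socket

# TCP ports listed in the swarm mode documentation:
# 2377 = cluster management, 7946 = communication among nodes.
# UDP ports (7946/udp, 4789/udp) are not probed here.
SWARM_TCP_PORTS = (2377, 7946)


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_swarm_ports(host, ports=SWARM_TCP_PORTS, timeout=3.0):
    """Return a dict mapping each port to whether it accepted a connection."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

For example, `check_swarm_ports("manager1")` run on the second manager (with `manager1` standing in for the real hostname of the first manager) returns a dict with an entry of `False` for every port that could not be reached.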

Note: you didn’t share the ping result for the second manager instance, the one that fails to join. The results you posted say nothing about the connectivity of the affected node.