I have run into this problem a few times already. I start a few hundred VMs on AWS, each with a cloud-init script that runs a command to join the swarm: “docker swarm join --token …”
Sometimes the swarm manager is unresponsive or unreachable, and the join times out on some nodes. Those nodes never appear to try again, so I’m left with a few dozen instances that didn’t join.
What is the correct way to deal with this? I have tried waiting, but the nodes don’t seem to retry joining. Rebooting the instances that didn’t join causes them to forget about the swarm.
Do I need to script a loop that retries until the node joins, or is there a built-in way to make nodes retry by themselves, at least every couple of minutes?
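
To make the question concrete, the kind of loop I have in mind is roughly the following. This is only a sketch: the function name `retry`, the attempt count, and the delay are made up for illustration, and `SWARM_TOKEN` and `MANAGER_ADDR` are hypothetical placeholders for the real token and manager address.

```shell
#!/bin/sh
# Sketch of a generic retry helper (hypothetical name, not part of Docker).
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the command exactly as given; stop on the first success.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$max_attempts failed: $*" >&2
        attempt=$((attempt + 1))
        # Only wait if another attempt is coming.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Intended use from the cloud-init script (placeholders are assumptions):
#   retry 30 120 docker swarm join --token "$SWARM_TOKEN" "$MANAGER_ADDR"
```

With 30 attempts and a 120-second delay, a node would keep trying for about an hour before giving up, which matches the "every couple of minutes" interval I would be happy with.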