Docker swarm tries to connect to removed managers

Issue type

Question or bug report: our Docker swarm tries to connect to manager nodes that have been removed from the swarm.

time="2020-06-02T04:04:53.360082825Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {172.31.65.20:2377 0  <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.31.65.20:2377: connect: no route to host\". Reconnecting..." module=grpc

OS Version/build

Ubuntu 19.10, docker engine: 19.03.11

App version

NA

Steps to reproduce

  1. Create a new swarm with 1 manager and n workers
  2. Add one new manager and n workers
  3. Remove the oldest manager and workers from the swarm.
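The removal step can be sketched roughly like this (remove_manager is a made-up wrapper name, not a Docker command; the docker CLI calls inside it are the actual ones we run):

```shell
# Sketch of the manager-removal step. Run ON the manager that is
# being removed from the swarm. remove_manager is a hypothetical
# helper name for illustration only.
remove_manager() {
  docker node demote "$1"   # drop it out of the raft quorum first
  docker swarm leave        # then leave the swarm as a plain worker
}

# Example invocation (needs a live swarm, so commented out here):
# remove_manager "$(docker info --format '{{.Swarm.NodeID}}')"
```

After the node has left, docker node rm is run from one of the remaining managers so the node disappears from docker node ls.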

That’s basically the process we use to update our swarm. We remove a manager with a script on the manager node that demotes it and then leaves the swarm; workers are paused and then leave the swarm. We have no issue with the worker nodes, but the manager nodes cause trouble: they appear to be removed (they no longer show up in docker node ls, and the demote/remove commands succeed), yet later, when we go through this process again (rolling the swarm), we get the following:

time="2020-06-02T04:04:53.360082825Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {172.31.65.20:2377 0  <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.31.65.20:2377: connect: no route to host\". Reconnecting..." module=grpc

We get one of these warnings for each manager that was part of the swarm and then removed. For example, if we have rolled the swarm three times, we get three of these warnings. 172.31.65.20 is the IP of an old manager node.

After rolling a swarm several times, we run into an issue where joining a new manager times out because Docker is trying to connect to several old managers that no longer exist.
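One way to see which manager addresses the local daemon still tracks is docker info, which exposes them in its Swarm section as RemoteManagers (list_tracked_managers below is our own helper name, a sketch rather than anything official):

```shell
# Print every manager address the local daemon still knows about.
# list_tracked_managers is a hypothetical helper; the RemoteManagers
# field is read from docker info's Swarm section via --format.
list_tracked_managers() {
  docker info --format '{{range .Swarm.RemoteManagers}}{{.Addr}}{{"\n"}}{{end}}'
}

# Example (needs a running daemon, so commented out here). Any stale
# IP printed should match one of the gRPC warnings above:
# list_tracked_managers
```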

We’ve ruled out a firewall issue: the swarm is in a VPC with a security group that allows all communication between nodes, and our process works fine until several managers have left, at which point Docker seems to get overwhelmed trying to connect to the missing managers.

Hello. I have the same error (bug?) with the latest Docker, and I found a (slightly hacky) way to fix it:
1. Stop the Docker daemon on the server showing the error.
2. Open /var/lib/docker/swarm/docker-state.json and replace the old IP in the RemoteAddr attribute with the IP of one of the current swarm managers.
3. Start the Docker daemon again. The error will be gone from the logs :slight_smile:
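For anyone scripting step 2, a minimal sketch (replace_stale_manager is a made-up helper; the file path and RemoteAddr layout are as described above, so keep a backup and verify the result before restarting Docker):

```shell
# replace_stale_manager FILE OLD_IP NEW_IP
# Rewrites every occurrence of the stale manager IP in the swarm
# state file, keeping a .bak copy. Run only while dockerd is stopped.
replace_stale_manager() {
  cp "$1" "$1.bak"          # keep a backup before editing
  sed -i "s/$2/$3/g" "$1"   # swap the stale RemoteAddr IP
}

# Example (on the affected node, after stopping Docker;
# 172.31.66.10 stands in for the IP of a current manager):
# replace_stale_manager /var/lib/docker/swarm/docker-state.json 172.31.65.20 172.31.66.10
```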