Docker swarm tries to connect to removed managers

sgilman · June 2, 2020, 7:32am

Issue type

Question or bug report: our Docker swarm keeps trying to connect to manager nodes that have been removed from the swarm.

time="2020-06-02T04:04:53.360082825Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {172.31.65.20:2377 0  <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.31.65.20:2377: connect: no route to host\". Reconnecting..." module=grpc

OS Version/build

Ubuntu 19.10, Docker Engine 19.03.11

App version

NA

Steps to reproduce

1. Create a new swarm with 1 manager and n workers.
2. Add one new manager and n workers.
3. Remove the oldest manager and workers from the swarm.
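
For readers following along, here is a minimal sketch of the manager removal in the last step, written as a dry-run plan rather than the poster's actual script (which is not shown in this thread). The function name, the grouping keys, and the node name old-manager-1 are hypothetical; the docker subcommands themselves (node demote, swarm leave, node rm) are standard CLI commands. The final docker node rm from a remaining manager is an assumption about what a full cleanup might look like, not something the post confirms was run.

```python
import shlex


def manager_removal_plan(node_name: str) -> dict:
    """Return the docker CLI commands for removing one manager, grouped by
    the host they would run on. Nothing is executed here (dry run only)."""
    return {
        # Demotion must be issued from a node that is currently a manager.
        "on_a_manager": [["docker", "node", "demote", node_name]],
        # Once demoted, the node itself leaves the swarm.
        "on_departing_node": [["docker", "swarm", "leave"]],
        # Assumption: a remaining manager then drops the node from its list.
        "on_remaining_manager": [["docker", "node", "rm", node_name]],
    }


if __name__ == "__main__":
    # "old-manager-1" is a hypothetical node name used for illustration.
    for where, commands in manager_removal_plan("old-manager-1").items():
        for command in commands:
            print(f"{where}: {shlex.join(command)}")
```

Printing the plan instead of running it keeps the sketch safe to try on a machine that is part of a live swarm.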

That’s basically the process we use to update our swarm. We remove a manager by running a script on that manager node, which demotes it and then leaves the swarm; workers are paused and then leave the swarm. We have no issue with the worker nodes. The managers also appear to be removed cleanly: the demote/remove commands succeed and the nodes no longer show up in docker node ls. But later, when we go through this process again (rolling the swarm), we get the following:

time="2020-06-02T04:04:53.360082825Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {172.31.65.20:2377 0  <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.31.65.20:2377: connect: no route to host\". Reconnecting..." module=grpc

We get one of these warnings for each manager that was part of the swarm and then removed. For example, if we have rolled the swarm three times, we get three of these warnings. 172.31.65.20 is the IP of the old manager node.
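
To see which removed managers the daemon is still dialing, the addresses can be pulled out of the daemon log. The following is a rough sketch that assumes the warnings have exactly the shape quoted above; the function name and the regular expression are illustrative, not part of Docker.

```python
import re

# Matches the "{ip:port" part of the gRPC warning quoted in this post.
DIAL_FAILURE = re.compile(r"failed to connect to \{(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def unreachable_peers(log_text: str) -> list:
    """Return the distinct ip:port pairs the daemon failed to dial,
    in order of first appearance in the log text."""
    seen = []
    for match in DIAL_FAILURE.finditer(log_text):
        addr = f"{match.group(1)}:{match.group(2)}"
        if addr not in seen:
            seen.append(addr)
    return seen


if __name__ == "__main__":
    sample = (
        r'time="2020-06-02T04:04:53.360082825Z" level=warning '
        r'msg="grpc: addrConn.createTransport failed to connect to '
        r'{172.31.65.20:2377 0  <nil>}. Err :connection error: desc = '
        r'\"transport: Error while dialing dial tcp 172.31.65.20:2377: '
        r'connect: no route to host\". Reconnecting..." module=grpc'
    )
    print(unreachable_peers(sample))  # ['172.31.65.20:2377']
```

Comparing this list against the addresses of the current managers shows how many stale entries have accumulated across rolls.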

After rolling the swarm several times, we run into an issue where joining a new manager ends up timing out because Docker is trying to connect to several old managers that no longer exist.

We’ve ruled out a firewall issue. The swarm is in a VPC with a security group that allows all communication between nodes. Besides, our process works until several managers have left; eventually Docker seems to become overwhelmed trying to connect to the missing managers.

rangerx · May 6, 2021, 4:45am

Hello. I have the same error (bug?) with the latest Docker and I found a (slightly hacky) way to fix it:
1. Stop Docker on the server that shows the error.
2. Open /var/lib/docker/swarm/docker-state.json and replace the old IP in the RemoteAddr attribute with the IP of one of the current swarm managers.
3. Start Docker again. The error should now be gone from the logs.
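
The edit in step 2 could be scripted roughly as follows. This is only a sketch under assumptions: it treats docker-state.json as a single JSON object with a top-level RemoteAddr string in ip:port form, which matches the description above but is not a documented, stable format. The function name is hypothetical. It should only be run while Docker is stopped, and it writes a .bak copy first.

```python
import json
import shutil
from pathlib import Path


def replace_remote_ip(state_file: str, new_ip: str) -> str:
    """Point RemoteAddr at new_ip, keeping the original port if there is one.
    Returns the previous RemoteAddr value. Run only while Docker is stopped."""
    path = Path(state_file)
    state = json.loads(path.read_text())

    # Assumption: RemoteAddr is a top-level key. A KeyError here means the
    # file layout differs from what this sketch expects, so nothing is changed.
    old_addr = state["RemoteAddr"]

    # Keep a backup next to the original before modifying anything.
    shutil.copy2(path, path.with_name(path.name + ".bak"))

    # Split "ip:port" on the last colon; sep is empty if there is no port.
    _host, sep, port = old_addr.rpartition(":")
    state["RemoteAddr"] = f"{new_ip}{sep}{port}" if sep else new_ip

    path.write_text(json.dumps(state))
    return old_addr
```

A call would look like replace_remote_ip("/var/lib/docker/swarm/docker-state.json", "172.31.70.5"), where 172.31.70.5 is a made-up address standing in for one of the current managers.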
