Hi everyone,
I’m having a strange manager issue with Docker Swarm that I can’t reproduce in other environments.
I start from a situation where I have a Swarm of 26 workers and 3 managers (I call them host-manager-0{1,2,3}). At some point I had to decommission the 3 managers, so I built 3 new VMs as replacements (I call them vm-manager-0{1,2,3}).
So, using the manager token of the leader, I added the 3 new VMs to the Swarm (the join command I used is sketched after the listing below), which gave me this situation:
HOSTNAME          STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
(... workers ...)
host-manager-01   Ready    Drain          Leader           19.03.12
host-manager-02   Ready    Drain          Reachable        19.03.12
host-manager-03   Ready    Drain          Reachable        19.03.12
vm-manager-01     Ready    Drain          Reachable        19.03.12
vm-manager-02     Ready    Drain          Reachable        19.03.12
vm-manager-03     Ready    Drain          Reachable        19.03.12
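For reference, the join I ran was along these lines (the token and IP below are placeholders, not the real values):
docker swarm join-token manager   # on the current leader, to print the manager join token
docker swarm join --token SWMTKN-1-<manager-token> <leader-ip>:2377   # on each vm-manager-0{1,2,3}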
At this point, I removed the 3 host-manager-0{1,2,3} nodes, running the following on vm-manager-01:
docker node rm host-manager-01 --force
docker node rm host-manager-02 --force
docker node rm host-manager-03 --force
and on the decommissioned hosts themselves:
docker swarm leave
This went well and I had 3 new managers:
HOSTNAME        STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
(... workers ...)
vm-manager-01   Ready    Drain          Leader           19.03.12
vm-manager-02   Ready    Drain          Reachable        19.03.12
vm-manager-03   Ready    Drain          Reachable        19.03.12
I then tried to add 2 new managers, for a total of 5, using the manager token of the leader (vm-manager-01 in this case) and the same join procedure as above:
HOSTNAME        STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
(... workers ...)
vm-manager-01   Ready    Drain          Leader           19.03.12
vm-manager-02   Ready    Drain          Reachable        19.03.12
vm-manager-03   Ready    Drain          Reachable        19.03.12
vm-manager-04   Ready    Drain          Reachable        19.03.12
vm-manager-05   Ready    Drain          Reachable        19.03.12
but then I realized that running manager operations on vm-manager-0{4,5} failed with:
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
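For example, commands as simple as these, run on vm-manager-04 or vm-manager-05, would fail that way (a sketch, not an exact transcript):
docker node ls
docker node inspect self --format '{{ .ManagerStatus.Reachability }}'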
I followed this documentation to recover, running the following on vm-manager-01:
docker swarm init --force-new-cluster
Now I have a single manager. Using its manager token, I tried to add the other managers, vm-manager-0{2,5}, back,
but as soon as I add them as managers I lose the quorum, leading to:
Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
It seems a new manager joining the cluster tries to reach the old managers, host-manager-0{1,2,3}.
Even the workers try to connect to the IPs of the old host-manager-0{1,2,3}:
Dec 13 20:34:53 docker-host-017 dockerd[517]: time="2021-12-13T20:34:53.964726550Z" level=error msg="Failed to join memberlist [10.2.33.27 10.2.0.17 10.2.33.28] on retry: 3 errors occurred:\n\t* Failed to join 10.2.33.27: dial tcp 10.2.33.27:7946: connect: connection refused\n\t* Failed to join 10.2.0.17: dial tcp 10.2.0.17:7946: connect: connection refused\n\t* Failed to join 10.2.33.28: dial tcp 10.2.33.28:7946: i/o timeout\n\n"
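As a side note, the list of managers a node still knows about can be checked like this, I believe (I'm assuming .Swarm.RemoteManagers is the right field in docker info):
docker info --format '{{ json .Swarm.RemoteManagers }}'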
I am now stuck with a single manager: I can’t add new managers, and the cluster seems to be in bad shape, still trying to reach old managers that have been removed from it.
Do you have any idea how to recover from this situation?