
Docker Swarm :: possible split-brain issue

Hi everyone,

I’m having a strange manager issue with Docker Swarm that I can’t reproduce in other environments.

I started from a situation where I had a Swarm of 26 workers and 3 managers (I’ll call them host-manager-0{1,2,3}). At some point I had to decommission the 3 managers, so I built 3 new VMs as replacements (I’ll call them vm-manager-0{1,2,3}).
So, using the manager token of the leader, I added the 3 new VMs to the Swarm (the join commands are sketched after the listing), ending up with a situation like this:

HOSTNAME            STATUS      AVAILABILITY    MANAGER STATUS      ENGINE VERSION
(... workers ...)
host-manager-01     Ready         Drain               Leader           19.03.12
host-manager-02     Ready         Drain               Reachable        19.03.12
host-manager-03     Ready         Drain               Reachable        19.03.12
vm-manager-01       Ready         Drain               Reachable        19.03.12
vm-manager-02       Ready         Drain               Reachable        19.03.12
vm-manager-03       Ready         Drain               Reachable        19.03.12
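
For reference, this is roughly how the replacement managers were joined; the token and the leader address below are placeholders, not the real values from my environment:

# on the current leader (host-manager-01): print the manager join token
docker swarm join-token manager

# on each new VM (vm-manager-0{1,2,3}), using the token and address printed above
docker swarm join --token <MANAGER-TOKEN> <leader-ip>:2377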

At this point, I removed the 3 host-manager-0{1,2,3} nodes by running the following on vm-manager-01:

docker node rm host-manager-01 --force
docker node rm host-manager-02 --force
docker node rm host-manager-03 --force

and on the hosts:

docker swarm leave

This went well and I had 3 new managers:

HOSTNAME            STATUS      AVAILABILITY    MANAGER STATUS      ENGINE VERSION
(... workers ...)
vm-manager-01       Ready         Drain               Leader           19.03.12
vm-manager-02       Ready         Drain               Reachable        19.03.12
vm-manager-03       Ready         Drain               Reachable        19.03.12

I then tried to add 2 more managers, for a total of 5, using the manager token of the leader (vm-manager-01 in this case):

HOSTNAME            STATUS      AVAILABILITY    MANAGER STATUS      ENGINE VERSION
(... workers ...)
vm-manager-01       Ready         Drain               Leader           19.03.12
vm-manager-02       Ready         Drain               Reachable        19.03.12
vm-manager-03       Ready         Drain               Reachable        19.03.12
vm-manager-04       Ready         Drain               Reachable        19.03.12
vm-manager-05       Ready         Drain               Reachable        19.03.12

but I realized that when running manager operations on vm-manager-0{4,5}, I got:

Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
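
(To be concrete about what I mean by “manager operations”: anything that goes through the Raft store, even simple listings like the ones below run on vm-manager-0{4,5}, returned that error.)

# run on vm-manager-04 / vm-manager-05
docker node ls
docker service ls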

I followed this documentation to recover, running the following on vm-manager-01:

 docker swarm init --force-new-cluster
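
(For completeness, this is the kind of check I use to confirm the node came back as the only manager after the forced re-init; a generic sketch, not pasted from my session:)

# confirm this node is a reachable manager and see how many managers/nodes the swarm counts
docker info --format 'manager: {{.Swarm.ControlAvailable}}, managers: {{.Swarm.Managers}}, nodes: {{.Swarm.Nodes}}'
docker node ls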

Now I have a single manager. Using its manager token, I tried to add the other vm-manager-0{2,5} nodes back, but as soon as I add them as managers I lose the quorum, leading to:

Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
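
(While this happens, I’ve been checking the manager list from vm-manager-01 with something along these lines; a generic sketch:)

# run on vm-manager-01: list only the manager nodes and their reachability
docker node ls --filter "role=manager"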

It seems the new manager joining the cluster tries to reach the old managers host-manager-0{1,2,3}.
Even the workers try to connect to the IPs of the old host-manager-0{1,2,3}:

Dec 13 20:34:53 docker-host-017 dockerd[517]: time="2021-12-13T20:34:53.964726550Z" level=error msg="Failed to join memberlist [10.2.33.27 10.2.0.17 10.2.33.28] on retry: 3 errors occurred:\n\t* Failed to join 10.2.33.27: dial tcp 10.2.33.27:7946: connect: connection refused\n\t* Failed to join 10.2.0.17: dial tcp 10.2.0.17:7946: connect: connection refused\n\t* Failed to join 10.2.33.28: dial tcp 10.2.33.28:7946: i/o timeout\n\n"
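
The log above is from journald on one of the workers, and the three addresses it keeps retrying are the old host-manager IPs. This is roughly how I’m looking at it on a worker (the time filter is just an example):

# grep the engine logs for the failing memberlist joins
journalctl -u docker.service --since "1 hour ago" | grep memberlist

# see which manager addresses the engine still has cached
docker info --format '{{json .Swarm.RemoteManagers}}'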

I am now stuck with a single manager: I can’t add new managers, and the cluster seems to be in bad shape, still trying to reach old managers that have already been removed from it.

Do you have any idea how to recover from this situation?