Hi everyone,
I’m having a strange manager issue with Docker Swarm that I can’t reproduce in other environments.
I start from a situation where I have a Swarm of 26 workers and 3 managers (I call them host-manager-0{1,2,3}). At some point I had to decommission the 3 managers, so I built 3 new VMs as replacements (I call them vm-manager-0{1,2,3}).
So, using the manager token of the leader, I added the 3 new VMs to the Swarm (the join command I used is sketched after the listing below), which gave me this situation:
HOSTNAME          STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
(... workers ...)
host-manager-01   Ready    Drain          Leader           19.03.12
host-manager-02   Ready    Drain          Reachable        19.03.12
host-manager-03   Ready    Drain          Reachable        19.03.12
vm-manager-01     Ready    Drain          Reachable        19.03.12
vm-manager-02     Ready    Drain          Reachable        19.03.12
vm-manager-03     Ready    Drain          Reachable        19.03.12
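For reference, the join I ran was along these lines (the token and IP below are placeholders, not the real values):
docker swarm join-token manager   # on the current leader, to print the manager join token
docker swarm join --token SWMTKN-1-<manager-token> <leader-ip>:2377   # on each vm-manager-0{1,2,3}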
At this point, I removed the 3 host-manager-0{1,2,3} nodes, running the following on vm-manager-01:
docker node rm host-manager-01 --force
docker node rm host-manager-02 --force
docker node rm host-manager-03 --force
and on the decommissioned hosts themselves:
docker swarm leave
This went well and I had 3 new managers:
HOSTNAME        STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
(... workers ...)
vm-manager-01   Ready    Drain          Leader           19.03.12
vm-manager-02   Ready    Drain          Reachable        19.03.12
vm-manager-03   Ready    Drain          Reachable        19.03.12
I then tried to add 2 new managers, for a total of 5, using the manager token of the leader (vm-manager-01 in this case) and the same join procedure as above:
HOSTNAME        STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
(... workers ...)
vm-manager-01   Ready    Drain          Leader           19.03.12
vm-manager-02   Ready    Drain          Reachable        19.03.12
vm-manager-03   Ready    Drain          Reachable        19.03.12
vm-manager-04   Ready    Drain          Reachable        19.03.12
vm-manager-05   Ready    Drain          Reachable        19.03.12
but then I realized that running manager operations on vm-manager-0{4,5} failed with:
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
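For example, commands as simple as these, run on vm-manager-04 or vm-manager-05, would fail that way (a sketch, not an exact transcript):
docker node ls
docker node inspect self --format '{{ .ManagerStatus.Reachability }}'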
I followed this documentation to recover, running the following on vm-manager-01:
docker swarm init --force-new-cluster
Now I have a single manager. Using its manager token, I tried to add the other managers, vm-manager-0{2,5}, back,
but as soon as I add them as managers I lose the quorum, leading to:
Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
It seems a new manager joining the cluster tries to reach the old managers, host-manager-0{1,2,3}.
Even the workers try to connect to the IPs of the old host-manager-0{1,2,3}:
Dec 13 20:34:53 docker-host-017 dockerd[517]: time="2021-12-13T20:34:53.964726550Z" level=error msg="Failed to join memberlist [10.2.33.27 10.2.0.17 10.2.33.28] on retry: 3 errors occurred:\n\t* Failed to join 10.2.33.27: dial tcp 10.2.33.27:7946: connect: connection refused\n\t* Failed to join 10.2.0.17: dial tcp 10.2.0.17:7946: connect: connection refused\n\t* Failed to join 10.2.33.28: dial tcp 10.2.33.28:7946: i/o timeout\n\n"
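As a side note, the list of managers a node still knows about can be checked like this, I believe (I'm assuming .Swarm.RemoteManagers is the right field in docker info):
docker info --format '{{ json .Swarm.RemoteManagers }}'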
I am now stuck with a single manager: I can’t add new managers, and the cluster seems to be in bad shape, still trying to reach old managers that have been removed from it.
Do you have any idea how to recover from this situation?