The Docker Swarm docs say:
If a swarm loses the quorum of managers, swarm tasks on existing worker nodes continue to run. However, swarm nodes cannot be added, updated, or removed, and new or existing tasks cannot be started, stopped, moved, or updated.
Situation:
Three out of five manager nodes go down, so consensus is lost. Two manager nodes remain active, but the swarm cluster can no longer be managed (the quorum for five managers is a majority of 3, so two survivors are not enough).
Questions:
Is there a setting in the swarm cluster that automates consensus recovery when consensus is lost?
Is there a tool that automatically solves the consensus problem (i.e. forces the election of a new leader)?
Answer from the Docker docs:
Force the creation of a new cluster from a manager node and then add the other nodes. (clearly this is manual)
I hope the situation I raised is clear.
I appreciate the help, thank you.
Please be more specific: are the manager nodes permanently lost or just temporarily unavailable? If it's only temporary, just make sure at least one of the failed manager nodes gets up and running again (so you have at least 3 healthy manager nodes in total); the nodes will then align their Raft logs and regain consensus. This is a fully automatic process.
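Once quorum is back, you can verify the state from any reachable manager, for instance:

```
# MANAGER STATUS should show one "Leader" and the remaining
# managers as "Reachable"; "Unreachable" managers still count
# against the quorum.
docker node ls
```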
When I mentioned that three nodes went down, I meant permanently, since if it were temporary, consensus would be restored and a leader elected again automatically once they came back up, as you mentioned.
In the case of permanent failure, I built five VMs, all of them managers (it is not recommended to make every node a manager, but I did it to test consensus), and when I tore down three nodes, consensus was lost. Although the application kept working, the swarm cluster was unmanageable. I had to force the creation of a new cluster from one of the surviving manager nodes and then re-add the other node. Everything manually.
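For reference, the manual recovery boiled down to the following commands (the IP address and token are placeholders for my setup):

```
# On one of the two surviving managers: discard the old Raft state
# and start a fresh single-manager cluster that keeps the existing
# services, networks, and configuration.
docker swarm init --force-new-cluster --advertise-addr 10.0.0.11

# Print the join command (including the token) for new managers.
docker swarm join-token manager

# On the other surviving node: leave the dead cluster, then join
# the new one using the token printed above.
docker swarm leave --force
docker swarm join --token <manager-token> 10.0.0.11:2377
```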
My question is whether what I did manually with the two surviving nodes can be done automatically, either with a script or with some HA (High Availability) tool.
Indeed, it is clear now. I am not aware of a tool that automatically reinitializes a cluster. Though, it shouldn't be hard to write an Ansible playbook and roles that automate the task; a rough sketch of the idea in plain bash follows below. I would understand if such a role does not already exist on Ansible Galaxy, as this is not a use case people expect to encounter on a regular basis.
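Here is a minimal bash sketch of such automation, assuming it runs (e.g. via cron or a systemd timer) on one designated survivor node only; the quorum-loss detection is naive, and running this on more than one node at a time risks creating two competing clusters (split-brain):

```bash
#!/usr/bin/env bash
# Naive quorum watchdog: if this node is a manager and the swarm
# control plane keeps failing, force a new single-manager cluster.
# RECOVERY_ADDR is a placeholder for this node's own IP.
set -euo pipefail

RECOVERY_ADDR="10.0.0.11"
MAX_FAILURES=5
SLEEP_SECONDS=30

failures=0
while true; do
    # ControlAvailable is true only on manager nodes.
    if [ "$(docker info --format '{{.Swarm.ControlAvailable}}')" != "true" ]; then
        echo "not a manager, nothing to do" >&2
        exit 0
    fi

    # 'docker node ls' requires a working Raft quorum and fails
    # while the quorum is lost.
    if docker node ls > /dev/null 2>&1; then
        failures=0
    else
        failures=$((failures + 1))
        echo "quorum check failed ($failures/$MAX_FAILURES)" >&2
    fi

    if [ "$failures" -ge "$MAX_FAILURES" ]; then
        echo "quorum lost, forcing a new cluster" >&2
        docker swarm init --force-new-cluster --advertise-addr "$RECOVERY_ADDR"
        exit 0
    fi

    sleep "$SLEEP_SECONDS"
done
```

An Ansible role would essentially wrap the same checks and commands, plus drive the `docker swarm leave`/`docker swarm join` on the remaining node over SSH.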
Personally, I would follow the IaC and immutable-infrastructure approach in this situation instead: spin up new compute nodes/VMs, initialize a new swarm cluster, restore the volumes from a backup if necessary (if they are not on a remote share anyway), then deploy the stacks again.
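In shell terms, the rebuild is essentially this (addresses, tokens, and the stack file name are placeholders):

```
# On the first fresh VM: initialize a brand-new swarm.
docker swarm init --advertise-addr 10.0.1.10

# On each additional VM: join using the token printed by
# 'docker swarm join-token manager' (or 'worker').
docker swarm join --token <token> 10.0.1.10:2377

# Restore volume data from backup if needed, then redeploy
# each stack from its compose file.
docker stack deploy -c my-stack.yml my-stack
```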