The Docker Swarm docs say:
If a swarm loses the quorum of managers, swarm tasks on existing worker nodes continue to run. However, swarm nodes cannot be added, updated, or removed, and new or existing tasks cannot be started, stopped, moved, or updated.
Situation:
Three out of five manager nodes go down, so consensus is lost. Two manager nodes remain active, but the swarm cluster can no longer be managed (the quorum for five managers is a majority of 3, so two survivors are not enough).
Questions:
Is there a setting in the swarm cluster that automates consensus recovery when consensus is lost?
Is there a tool that automatically solves the consensus problem (i.e. forces the election of a new leader)?
Answer from the Docker docs:
Force the creation of a new cluster from a manager node and then add the other nodes. (clearly this is manual)
I hope the situation I raised is clear.
I appreciate the help, thank you.
Please be more specific: are the manager nodes permanently lost or just temporarily unavailable? If it's only temporary, just make sure at least one of the failed manager nodes gets up and running again (so you have at least 3 healthy manager nodes in total); the nodes will then align their Raft logs and regain consensus. This is a fully automatic process.
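Once quorum is back, you can verify the state from any reachable manager, for instance:

```
# MANAGER STATUS should show one "Leader" and the remaining
# managers as "Reachable"; "Unreachable" managers still count
# against the quorum.
docker node ls
```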
When I mentioned that three nodes went down, I meant permanently, since if it were temporary, consensus would be restored and a leader elected again automatically once they came back up, as you mentioned.
In the case of permanent failure, I built five VMs, all of them managers (it is not recommended to make every node a manager, but I did it to test consensus), and when I tore down three nodes, consensus was lost. Although the application kept working, the swarm cluster was unmanageable. I had to force the creation of a new cluster from one of the surviving manager nodes and then re-add the other node. Everything manually.
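For reference, the manual recovery boiled down to the following commands (the IP address and token are placeholders for my setup):

```
# On one of the two surviving managers: discard the old Raft state
# and start a fresh single-manager cluster that keeps the existing
# services, networks, and configuration.
docker swarm init --force-new-cluster --advertise-addr 10.0.0.11

# Print the join command (including the token) for new managers.
docker swarm join-token manager

# On the other surviving node: leave the dead cluster, then join
# the new one using the token printed above.
docker swarm leave --force
docker swarm join --token <manager-token> 10.0.0.11:2377
```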
My question is whether what I did manually with the two surviving nodes can be done automatically, either with a script or with some HA (High Availability) tool.
Indeed, it is clear now. I am not aware of a tool that automatically reinitializes a cluster. Though, it shouldn't be hard to write an Ansible playbook and roles that automate the task; a rough sketch of the idea in plain bash follows below. I would understand if such a role does not already exist on Ansible Galaxy, as this is not a use case people expect to encounter on a regular basis.
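Here is a minimal bash sketch of such automation, assuming it runs (e.g. via cron or a systemd timer) on one designated survivor node only; the quorum-loss detection is naive, and running this on more than one node at a time risks creating two competing clusters (split-brain):

```bash
#!/usr/bin/env bash
# Naive quorum watchdog: if this node is a manager and the swarm
# control plane keeps failing, force a new single-manager cluster.
# RECOVERY_ADDR is a placeholder for this node's own IP.
set -euo pipefail

RECOVERY_ADDR="10.0.0.11"
MAX_FAILURES=5
SLEEP_SECONDS=30

failures=0
while true; do
    # ControlAvailable is true only on manager nodes.
    if [ "$(docker info --format '{{.Swarm.ControlAvailable}}')" != "true" ]; then
        echo "not a manager, nothing to do" >&2
        exit 0
    fi

    # 'docker node ls' requires a working Raft quorum and fails
    # while the quorum is lost.
    if docker node ls > /dev/null 2>&1; then
        failures=0
    else
        failures=$((failures + 1))
        echo "quorum check failed ($failures/$MAX_FAILURES)" >&2
    fi

    if [ "$failures" -ge "$MAX_FAILURES" ]; then
        echo "quorum lost, forcing a new cluster" >&2
        docker swarm init --force-new-cluster --advertise-addr "$RECOVERY_ADDR"
        exit 0
    fi

    sleep "$SLEEP_SECONDS"
done
```

An Ansible role would essentially wrap the same checks and commands, plus drive the `docker swarm leave`/`docker swarm join` on the remaining node over SSH.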
Personally, I would follow the IaC and immutable-infrastructure approach in this situation instead: spin up new compute nodes/VMs, initialize a new swarm cluster, restore the volumes from a backup if necessary (if they are not on a remote share anyway), then deploy the stacks again.
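In shell terms, the rebuild is essentially this (addresses, tokens, and the stack file name are placeholders):

```
# On the first fresh VM: initialize a brand-new swarm.
docker swarm init --advertise-addr 10.0.1.10

# On each additional VM: join using the token printed by
# 'docker swarm join-token manager' (or 'worker').
docker swarm join --token <token> 10.0.1.10:2377

# Restore volume data from backup if needed, then redeploy
# each stack from its compose file.
docker stack deploy -c my-stack.yml my-stack
```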