Detect leader manager is down

shereenelsayed · June 11, 2020, 4:47pm

I am running a service over docker swarm with 5 nodes (3 managers and 2 workers). I have a question:
What happens if the host machine for the leader manager goes down for some reason? Is there a way to detect this? I noticed that no other manager becomes the leader (other managers become leader only if the leader is drained). I need to run a script once the leader manager is down to update some Elastic IP configuration.

meyay · June 11, 2020, 8:24pm

Are you sure about your observations regarding the leader?

Swarm uses Raft. Raft has a periodic health check between nodes and calls for a leader election if the current leader is not responding.

I am quite sure that if you write some code that uses the SDK to listen on the Docker event stream, you should get all the information you need.
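As a rough illustration of the detection part, here is a minimal sketch in Python. It assumes the node data has the same shape as the JSON printed by `docker node inspect` (a `ManagerStatus.Leader` flag and a `Description.Hostname` field). The helper names `find_leader` and `leader_changed` and the sample hostnames are made up for this example. Fetching the data through the Docker SDK for Python, as hinted in the comment, is an assumption about your setup and has to run against a manager node.

```python
from typing import Dict, List, Optional


def find_leader(nodes: List[Dict]) -> Optional[str]:
    """Return the hostname of the node flagged as leader, or None."""
    for node in nodes:
        # Worker nodes have no "ManagerStatus" entry at all.
        manager = node.get("ManagerStatus") or {}
        if manager.get("Leader"):
            return node.get("Description", {}).get("Hostname")
    return None


def leader_changed(previous: Optional[str], nodes: List[Dict]) -> bool:
    """True if a previously known leader is no longer the leader."""
    return previous is not None and find_leader(nodes) != previous


# Hypothetical sample data; with the Docker SDK for Python this could
# come from: [n.attrs for n in docker.from_env().nodes.list()]
sample_nodes = [
    {"Description": {"Hostname": "m1"},
     "ManagerStatus": {"Leader": False, "Reachability": "unreachable"}},
    {"Description": {"Hostname": "m2"},
     "ManagerStatus": {"Leader": True, "Reachability": "reachable"}},
    {"Description": {"Hostname": "w1"}},
]

if leader_changed("m1", sample_nodes):
    # This is where the Elastic IP script would be triggered.
    print("leader moved to", find_leader(sample_nodes))
```

Calling this in a loop (or whenever a node event arrives) and remembering the last known leader would be enough to trigger an external script.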

shereenelsayed · June 11, 2020, 9:35pm

Yes, I see some behavior that I cannot understand, and maybe you can help me understand it. In the image below, I created a swarm of 3 managers (just to test the leadership part), then I drained the leader node, but it remained the leader. I expected another node to become the leader.

I returned this leader to the active state and then shut the node down. Now I get the following error, and none of the managers is a leader:

Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.

I will have a look at the SDK, thanks.

meyay · June 12, 2020, 6:26am

Your expectation regarding drain is not correct.

The AVAILABILITY column indicates whether a node will be used for swarm service deployments. Drain simply tells the scheduler to stop all running tasks on the node and to prevent new task deployments. Plain docker containers are not affected by Drain.

The leader will change once the STATUS of the node being leader is anything other than Ready.

Are you running firewalls on the hosts that might prevent cluster internal communication?
The error message indicates that your cluster has too few nodes for a quorum. Since you have three manager nodes and none of them actually is out of service, I would assume that something is disturbing the cluster-internal communication. Raft requires low-latency network connections, so if your nodes are distributed over WAN connections or regions, cluster management will fail.
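To make the quorum requirement concrete, here is a small sketch of the arithmetic behind "more than half of the managers are online". The function names are only illustrative and are not part of any Docker API.

```python
def quorum(managers: int) -> int:
    """Smallest number of managers that is more than half of them."""
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many managers can be lost while a majority remains."""
    return managers - quorum(managers)


def has_quorum(managers: int, reachable: int) -> bool:
    """True if the reachable managers still form a majority."""
    return reachable >= quorum(managers)


for total in (1, 3, 5):
    print(total, "managers: quorum", quorum(total),
          "tolerates", tolerated_failures(total), "failure(s)")
```

With three managers the quorum is two, so losing a single manager should still leave the remaining two able to elect a new leader.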

That said, something must be generally wrong in your swarm cluster. What you are experiencing is NOT how swarm membership detection and leader election generally work.
