I am running a service over docker swarm with 5 nodes (3 managers and 2 workers). I have a question:
What happens if the host machine of the leader manager goes down for some reason? Is there a way to detect this? I noticed that no other manager becomes the leader in that case (other managers become leader only if the leader is drained). I need to run a script as soon as the leader manager goes down, to update some Elastic IP configuration.
Are you sure about your observations regarding the leader?
Swarm uses Raft. Raft has a periodic health check between nodes and calls for leader election in case the current leader is not responding.
I am quite sure that if you write some code that uses the SDK to listen on the Docker event stream, you should get all the information you need.
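A rough sketch of the detection part in Python, assuming the Docker SDK for Python (the `docker` package). It polls the node list instead of listening on the event stream, simply because that is shorter to show. The names `find_leader`, `leader_changed` and `watch_leader` are made up for this example, and the loop has to run against a manager node that is still up:

```python
import time


def find_leader(nodes):
    """Return the hostname of the current leader, or None if there is none.

    `nodes` is a list of dicts shaped like Engine API node objects
    (what `Node.attrs` holds in the Docker SDK for Python).
    """
    for node in nodes:
        manager = node.get("ManagerStatus") or {}  # workers have no ManagerStatus
        if manager.get("Leader"):
            return node.get("Description", {}).get("Hostname")
    return None


def leader_changed(previous, current):
    """True when the leader differs from the last poll, including losing the leader."""
    return previous != current


def watch_leader(on_change, interval=5.0):
    """Poll the local manager and call on_change(old, new) whenever the leader changes.

    Hypothetical helper: needs the `docker` package and must run on a manager.
    The first poll reports a change from None to the current leader.
    """
    import docker  # imported lazily so the helpers above work without it

    client = docker.from_env()
    previous = None
    while True:
        try:
            current = find_leader([n.attrs for n in client.nodes.list()])
        except docker.errors.APIError:
            current = None  # e.g. "The swarm does not have a leader"
        if leader_changed(previous, current):
            on_change(previous, current)
            previous = current
        time.sleep(interval)
```

The `on_change` callback is where your Elastic IP script would be triggered. Treat this as a starting point under the assumptions above, not as a finished watcher.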
Yes, I am seeing some behavior that I cannot understand, and maybe you can help me understand it. In the image below, I created a swarm of 3 managers (just to test the leadership part), then I drained the leader node, but it remained the leader. I expected another node to become the leader.
I returned this leader to the active state and then shut down the node. Now I get: Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
and none of the managers is now a leader.
I will have a look at the SDK, thanks.
Your expectation regarding drain is not correct.
The AVAILABILITY column indicates whether a node will be used for swarm service deployments. Drain simply tells the scheduler to stop all running tasks on the node and to prevent new task deployments. Plain docker containers are not affected by Drain.
The leader will change once the STATUS of the node being leader is anything other than Ready.
Are you running firewalls on the hosts that might prevent cluster internal communication?
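To rule that out quickly, something like the following can be run from each manager against the others. It is only a sketch: `tcp_port_open` and `check_manager` are names made up here, and it covers just the TCP ports swarm uses (2377 for cluster management, 7946 for node-to-node communication). Ports 7946/udp and 4789/udp (overlay network traffic) cannot be verified with a plain TCP connect.

```python
import socket

# TCP ports used by swarm mode: cluster management and node communication.
SWARM_TCP_PORTS = (2377, 7946)


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False


def check_manager(host):
    """Map each swarm TCP port to whether it is reachable on the given host."""
    return {port: tcp_port_open(host, port) for port in SWARM_TCP_PORTS}
```

For example, `check_manager("10.0.0.12")` (a placeholder address) returns a dict keyed by port. A `False` for 2377 between two managers would point to a firewall or security group blocking the Raft traffic.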
The error message indicates that your cluster has too few nodes for a quorum. Since you have three manager nodes and none of them is actually out of service, I would assume that something is disturbing the cluster-internal communication. Raft requires low-latency network connections, so if your nodes are distributed over WAN connections or regions, cluster management will fail.
That said, something must be generally wrong in your swarm cluster. What you experience is NOT how swarm membership detection and leadership election generally work.
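For reference, the quorum arithmetic behind that error: Raft needs a majority of the managers, that is floor(N/2) + 1. A small sketch (the function names are made up for this example):

```python
def quorum(managers):
    """Smallest number of managers that forms a majority."""
    return managers // 2 + 1


def tolerated_failures(managers):
    """How many managers can be lost while a majority still remains."""
    return managers - quorum(managers)


def has_quorum(managers, reachable):
    """True if the reachable managers are enough to elect a leader."""
    return reachable >= quorum(managers)
```

With three managers the quorum is two, so losing a single manager should still leave the other two able to elect a new leader. The "does not have a leader" error is what you would expect only when a manager cannot reach at least one of the other two.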