From the docs here, it appears the expectation is that a 3-manager setup can survive 2 failures. But in my experimentation, a 2-manager setup does not recover when I kill the node the primary manager is running on. I assume this means the surviving manager cannot reach a majority of the managers it knows about, so it cannot elect a leader (assuming Raft works like ZooKeeper's consensus protocol, which I'm more familiar with). Is this correct? Can someone edify my understanding?
The docs claim to tolerate the failure of 2 availability zones in a 3-availability-zone deployment. I'm also curious how Swarm supports that. What happens if a network partition occurs?
For example, suppose az1 is still alive but has lost network connectivity to az2 and az3, while az2 and az3 can still talk to each other. The swarm nodes in az2 and az3 will definitely keep working, since they hold the majority. But according to the doc, the swarm nodes in az1 will keep working too? Once the network recovers, how does swarm resolve the conflict?
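The partition scenario above comes down to simple majority math. Here is a small sketch (my own illustration, not Swarm code) assuming 3 managers, one per availability zone:

```python
# Quorum math for the partition described above: 3 managers, one per AZ.
# In Raft, a partition can elect a leader and commit writes only if it
# contains a strict majority of the managers.
def has_quorum(managers_in_partition: int, total_managers: int) -> bool:
    return managers_in_partition > total_managers // 2

total = 3
# az1 is isolated; az2 and az3 can still reach each other.
print(has_quorum(1, total))  # az1 alone: False, cannot elect a leader
print(has_quorum(2, total))  # az2 + az3: True, quorum is retained
```

This is also why there is no conflict to resolve on recovery: the minority side (az1) cannot commit any Raft log entries while partitioned, so when the network heals, its manager simply rejoins as a follower and catches up from the leader.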
That doc link may be out of date; it is filed under “Superseded products and tools”. Docker swarm managers use the Raft consensus protocol, see the swarm raft page. With 3 swarm managers, losing 2 leaves the remaining manager unable to form a quorum, so the swarm managers go down. With 2 swarm managers, losing either one breaks quorum and brings the swarm managers down. We should not use 2 swarm managers in production.
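The fault-tolerance rule behind this answer is easy to tabulate. A short sketch (illustrative only, not part of Swarm):

```python
# Raft fault tolerance: with N managers, quorum is floor(N/2) + 1,
# so the cluster tolerates losing N - quorum = floor((N - 1) / 2) managers.
def fault_tolerance(n_managers: int) -> int:
    quorum = n_managers // 2 + 1
    return n_managers - quorum

for n in (1, 2, 3, 5, 7):
    print(f"{n} managers -> quorum {n // 2 + 1}, "
          f"tolerates {fault_tolerance(n)} failure(s)")
```

Note that 2 managers tolerate 0 failures, the same as 1 manager, while doubling the chance that a manager fails; hence the advice to always run an odd number of managers.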