Hi, I have two data centers, and I want to set up a Docker Swarm cluster with fault tolerance and high availability. I’ve created three nodes, two in the first data center and the third in the second data center. There will be three manager nodes. However, I’m concerned that if the first data center goes down, the entire cluster might stop working. How can I set up high availability across two data centers with three nodes?
I don’t have experience in HA Swarm clusters, so everyone feel free to correct me if I’m wrong. My answer will be a general one.
Everything depends on the level of HA you want to achieve. For HA managers you need at least three manager nodes. Ideally those nodes would only be managers and not workers, so you would need more than three nodes to run workers too. Let’s say, though, that your nodes act as both managers and workers.
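As a rough sketch of what that looks like (the IP and node names below are placeholders, not from your setup), forming a three-manager swarm and keeping the managers free of workloads could be done like this:

```bash
# On the first node (placeholder IP):
docker swarm init --advertise-addr 10.0.1.11

# Print the join command for additional managers, then run it on node 2 and node 3:
docker swarm join-token manager

# Optional: drain the managers so they only manage the cluster; extra worker nodes
# (joined via "docker swarm join-token worker") would then run the actual containers.
docker node update --availability drain manager1
docker node update --availability drain manager2
docker node update --availability drain manager3
```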
You would have two nodes in one datacenter and one in the other. If the link between the two datacenters breaks, the containers in the single-node datacenter would probably keep running, but that node could no longer act as a manager. The two nodes in the other datacenter would still have quorum and work as managers, but could not change anything in the single-node datacenter. Containers that need to communicate across the datacenters would fail as well.
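To make that concrete (node names and labels here are made up): you can label each node with its datacenter and pin a service to one of them, so its containers never depend on the inter-datacenter link:

```bash
# Label the nodes with the datacenter they live in (placeholder names):
docker node update --label-add datacenter=dc1 node1
docker node update --label-add datacenter=dc1 node2
docker node update --label-add datacenter=dc2 node3

# Constrain a service to one datacenter so its replicas do not talk across the link:
docker service create --name app --replicas 2 \
  --constraint 'node.labels.datacenter==dc1' \
  nginx:alpine
```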
Let’s say something happens in one datacenter, for example a power loss or a fire, and the nodes cannot be restored. If the problem happens in the single-node datacenter, you lose that node and its containers, but you can add a new node to the remaining two managers and reschedule the missing containers.
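Roughly like this, assuming the lost node was one of the managers (node name, token, and IP are placeholders):

```bash
# On one of the two surviving managers:
docker node ls                      # the lost node shows up as Down / Unreachable
docker node demote lost-node        # demote it first, since it was a manager
docker node rm --force lost-node    # then remove it from the cluster

# Print the join command for the replacement node:
docker swarm join-token manager

# On the new node, run the printed command, e.g.:
docker swarm join --token <manager-token> 10.0.1.11:2377

# Replicated services are rescheduled automatically once capacity is available again.
```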
If something happens in the two-node datacenter that cannot be restored, you have lost your cluster.
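One caveat I am not sure applies in every case, so take it only as a pointer: if the surviving node in the other datacenter was itself a manager, Docker documents a last-resort recovery that rebuilds a one-manager cluster from that node’s local copy of the swarm state (other nodes have to rejoin afterwards):

```bash
# Disaster-recovery command documented by Docker; run it on the surviving manager only.
# It forms a new single-manager cluster from the local Raft data (placeholder IP).
docker swarm init --force-new-cluster --advertise-addr 10.0.2.21
```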
Of course, a cluster is not highly available if only the managers are HA but the applications, like a database, are not, and the requirements for those could be different.
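To illustrate the difference (service names, images, and the node name below are just examples): a stateless service can simply be replicated by Swarm, while a stateful one like a database typically needs its own replication on top:

```bash
# Stateless: Swarm can freely reschedule and scale these replicas across nodes.
docker service create --name web --replicas 3 \
  --publish published=80,target=80 nginx:alpine

# Stateful: a single replica pinned to one node with a local volume; if that node
# dies, Swarm cannot move the data, so HA has to come from database-level replication.
docker service create --name db --replicas 1 \
  --constraint 'node.hostname==node1' \
  --mount type=volume,source=db-data,target=/var/lib/postgresql/data \
  -e POSTGRES_PASSWORD=example \
  postgres:16
```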
So I guess you could practice HA mode across two datacenters using three nodes, and it would probably be better than a single datacenter, but it would not be truly highly available in every sense. Not to mention that if the network between the two datacenters is not fast enough, it could even slow your cluster down or make your applications unstable.
So whether it is good for you or not depends on what you want to use it for and what you want to run in the cluster.
The Raft consensus algorithm used by Docker Swarm (and Kubernetes) requires low-latency network connections in order to work reliably. For instance, it works reliably across availability zones in the same region of a cloud provider, but does not work reliably across regions of the same cloud provider due to the higher latency.
Let’s assume for a minute your datacenters have a low latency network connection, so that it doesn’t really matter if all nodes are in the same DC or not.
A single surviving node of a three-node cluster will always be headless. Raft requires at least floor(n/2)+1 healthy manager nodes for quorum on state changes within the cluster.
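A quick way to see what that formula means for different cluster sizes (plain shell arithmetic, nothing Swarm-specific):

```bash
# quorum = floor(n/2) + 1, tolerated manager failures = n - quorum
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  echo "managers=$n  quorum=$quorum  tolerated failures=$(( n - quorum ))"
done
# With 3 managers the quorum is 2, so losing the two-node datacenter leaves the
# remaining single node headless.
```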
Like @rimelek wrote: in a headless cluster containers will keep running, but overlay traffic to the other nodes can’t succeed. And of course no state change can be applied to the cluster.
Are you sure trying to solve this challenge on the container orchestrator level is the right approach?
I appreciate your insights. Let me outline my current project: I have two datacenters operating at Layer 2, with a monitoring server that includes Prometheus, Grafana, etc. I’m aiming to Dockerize this setup and ensure high availability using Docker Swarm.
My plan involves setting up a cluster with two nodes and keeping one of them paused as a passive standby. However, I would prefer to establish a standard cluster with three nodes. My concern is that if one datacenter, housing two nodes, goes down, the third node might struggle to reach quorum, leading to a cluster outage.
I’m seeking advice on how to handle this scenario effectively. Any recommendations or insights would be greatly appreciated.
If you run two manager nodes, turning off one of them will render the cluster headless. There is a reason why I mentioned it in my previous response.
This is not a “might” situation → it is guaranteed.
Please define outage. We already shared what will happen.
There is not much I can recommend. You cannot split an odd number of manager nodes across an even number of datacenters and have HA in both datacenters independently. It is not designed for that. You could run a separate cluster in each datacenter and handle the problem with replication, if the application supports it.
It’s been a couple of years since I have seen anything other than Kubernetes clusters running across three availability zones in a cloud region, typically combined with autoscaling groups that allow the cluster to create/destroy nodes when necessary.
I still believe neither Swarm, nor Kubernetes alone are the answer to your challenge.
Thank you for your response. Okay, I will attempt to set up a two-node configuration, with one node as active and the other as passive.
I believe I might face challenges setting up a Kubernetes cluster in my particular case. Can you please provide guidance or correct me if I’m mistaken?
If you have two data centers and require a quorum, why not add a 3rd manager from an independent cloud VM?
Even if you do have the HA Docker Swarm setup, how do you ensure availability for external clients? Do you have a load balancer or reverse proxy in front of your workers?
The main question is what you want to achieve. You want to make your setup survive even a full datacenter outage, not only a server outage?
Then you need to have a third “datacenter”. Instead of a full datacenter you could just use a cloud VM (a virtual machine from a third provider) to host a third Docker manager node, so that the surviving datacenter together with that VM still has a majority.
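A minimal sketch of what adding that tie-breaker VM could look like (token, IP, and node name are placeholders). Note that it still needs a reasonably low-latency connection to both datacenters, as mentioned above regarding Raft:

```bash
# On any existing manager, print the manager join command:
docker swarm join-token manager

# On the cloud VM, join as the third manager (run the command printed above):
docker swarm join --token <manager-token> <existing-manager-ip>:2377

# Optionally keep the VM as a pure tie-breaker that never runs workloads:
docker node update --availability drain cloud-arbiter
```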
Recently I saw a nice diagram of how a startup created redundancy; it may have been in this forum. They used AWS and Google for a full setup each, plus Azure for services requiring a quorum.
I would say with 2 data centers you can’t really do HA across data centers.
It may seem possible with something like PostgreSQL, which can run in primary/standby (formerly master/slave) replication.
But if internet connectivity between the DCs fails, each DB may end up acting as primary. You might write data to the former standby. When connectivity is back, how do you sync between two primaries?
You need to ensure HA for all components (LB, proxy, app, DB), but some use Raft and need three instances. I would say if you want HA across DCs, you need a third DC (or at least an additional tiny node to act as the tie-breaker for quorum).