Docker High Availability Advice

Hi,

I have had Docker running fine on a single RPi with multiple product containers for my home automation. I have been worried about it failing, so I now have 3 more RPis and intend to set up a High Availability cluster so that I don’t lose service during a node failure.

I would appreciate some advice to point me in the right direction on which High Availability products/configuration I should use, since the options seem endless. Requirements/thoughts are:

  • Assuming I should use Swarm, but I can only run one instance of each product, e.g. Home Assistant, so it needs to fail over automatically to another node in the event of a failure. The container will need its own IP address, since I can’t use the host IP because it will change in a failover. Use Traefik?
  • Want to protect against Swarm Manager failing, therefore assuming I should run another Manager, but Consul seems to be required to ensure only one Manager is in control? Where should you host Consul?
  • Since the data will be shared, I’m planning to host it on a Synology NAS. Any recommendations on which product/method I should use to connect to it?

Appreciate any help. I’ve spent hours researching, but I only ever seem to get half the solution of what I’m looking for and the above seems quite a basic configuration requirement for HA to me.

Thanks

  • Run 3 RPis as manager nodes in your Swarm cluster. This allows your cluster to compensate for one unhealthy manager node.
  • Deploy Traefik as a global service, using a deployment constraint that limits it to the manager nodes
    • introduce a failover IP for the manager nodes (e.g. keepalived or ucarp)
    • use the failover IP to communicate with Traefik; Traefik will distribute the traffic to the target service, regardless of which node it is running on.
  • Make sure you create volumes backed by remote nfsv4 shares of your NAS.
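
For illustration, the points above could look roughly like this in a stack file. This is only a sketch under assumptions, not a tested setup: the image tags, the hostname ha.home.example, the NAS address 192.168.1.10, the export path /volume1/docker/homeassistant and the network/volume names are all placeholders, and the Traefik flags assume Traefik v2. The failover IP (keepalived/ucarp) is configured on the hosts and is not part of this file.

```yaml
# Sketch only: every name, address and path below is a placeholder.
version: "3.8"

services:
  traefik:
    image: traefik:v2.10
    command:
      - --providers.docker=true
      - --providers.docker.swarmMode=true          # read services from the Swarm API (Traefik v2)
      - --providers.docker.exposedByDefault=false  # only route services that opt in via labels
      - --entrypoints.web.address=:80
    ports:
      - "80:80"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - proxy
    deploy:
      mode: global                  # one Traefik task per matching node
      placement:
        constraints:
          - node.role == manager    # limit Traefik to the manager nodes

  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    volumes:
      - ha_config:/config
    networks:
      - proxy
    deploy:
      replicas: 1                   # single instance, rescheduled on node failure
      labels:
        - traefik.enable=true
        - "traefik.http.routers.ha.rule=Host(`ha.home.example`)"
        - traefik.http.routers.ha.entrypoints=web
        - traefik.http.services.ha.loadbalancer.server.port=8123

networks:
  proxy:
    driver: overlay

volumes:
  ha_config:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=192.168.1.10,nfsvers=4,rw"       # placeholder NAS address
      device: ":/volume1/docker/homeassistant"  # placeholder NFS export path
```

Something like `docker stack deploy -c stack.yml home` would deploy it. The volume is declared on whichever node runs the task, so each node mounts the same NFS share when the container lands there.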

What does Consul do in your scenario? I am not aware that Traefik2 would require Consul.

Swarm and Consul both use the Raft consensus algorithm to elect a leader and to agree on state changes and cluster membership. Each (Swarm manager/Consul) cluster node accepts write operations/state changes, though under the hood follower nodes forward them to the leader node, which performs the actual write operation or state change and then propagates it to the follower nodes.

Update:
I am afraid this wasn’t clear enough: Swarm neither requires nor uses Consul for anything related to the Swarm node state. I thought you wanted to use Consul in the context of Traefik, where it is not required either.

You want all your services to be HA, but then host your data on a single NAS? :thinking:

You can use Docker Swarm for an HA cluster. You need 3 manager nodes; then one of them can fail. You deploy to Swarm with docker stack deploy from a compose file.

You can set the replica count to 1 to have only one instance running in the Swarm, and use constraints to pin a service to a certain node (not HA) or to a group of nodes if required.
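
As a rough sketch of those three cases in a compose file (the service names, the hostname rpi-2 and the node label storage=ssd are made up for illustration):

```yaml
# Sketch only: hostnames, labels and the choice of services are hypothetical.
version: "3.8"

services:
  # Single floating instance: Swarm reschedules it on any healthy node.
  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    deploy:
      replicas: 1

  # Pinned to one host (not HA): the service stays down while that node is down.
  zigbee2mqtt:
    image: koenkk/zigbee2mqtt
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == rpi-2

  # Restricted to a group of nodes that carry a given label.
  mosquitto:
    image: eclipse-mosquitto
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.storage == ssd
```

A node label like the one above would be set on a manager with `docker node update --label-add storage=ssd <node>`.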

We use Traefik as a reverse proxy; it will even forward requests to multiple instances if available. (Simple Traefik Swarm example; note that LE (Let's Encrypt) only works with a single Traefik CE instance.)

The real challenge for HA is the entrypoint. What happens if Traefik dies? You can also have that replicated to all Docker Swarm manager nodes, but how do you change the IP of the entry node? Or how do you update your DNS? There are tools like keepalived, but full HA is not easy.

Many thanks for the replies with some great suggestions/recommendations.

I appreciate that chasing 100% HA can feel never-ending once you start going through failure scenarios, and whilst a NAS failure would cause an outage, there is RAID redundancy. I suppose I’m mainly trying to protect against an RPi crash or failure.

Thanks

It can be done though. One of my lab environments runs a Swarm cluster of 3 manager nodes only, with Traefik, and it uses keepalived for a failover IPv4 address plus an IPv6 ULA. My setup uses a single Synology NAS for the NFSv4 shares.

Though, I run an old version of Portworx Developer (px-dev) in the cluster, which itself acts as a storage cluster. It takes care of replicating the volumes amongst the nodes. I use a dedicated 10G network just for the Portworx backend communication. BUT: Portworx stopped releasing px-dev, so I am not sure whether it can still be recommended.