Help and advice needed for my homelab setup

This is my current network setup:

Changes I want:

  • All services (Plex, Transmission, etc.) moved to docker (not an issue - I know how to do this, the other requirements are where I get stuck!)

  • Docker running in swarm mode on Server 1 and Server 2. Docker needs its own network with a virtual IP on the 192.168.1.x 1 Gb/s network, and it should allow local communication on the 10.0.0.x 40 Gb/s InfiniBand network. All containers must be able to use the 192.168.1.x network IP. I would also like to run two instances of PiHole on unique 192.168.1.x IPs; these currently run on Raspberry Pis, but stability is an issue and manual reboots are a pain.

  • If Server 1 or Server 2 goes down, Docker Swarm should relaunch all affected containers on the surviving server.

  • Persistent data for docker containers must be stored locally on both Server 1 and Server 2, with immediate duplication between the servers via the 40 Gb/s 10.0.0.x InfiniBand network. I would prefer to use the boot SSD drives for this data and run a scheduled task or cron job within a docker container to back up this data to the drivepool. The drivepool would then replicate this data to 2 or 3 physical drives.

  • Non-persistent data for docker containers (e.g. container images) can be stored on the boot SSD drives but does not require duplication.

  • Use of VMs on Server 1 and/or Server 2 (via the available hypervisors) to host Docker Swarm is optional but not preferred. I’ll only do this if it makes configuration much easier.

  • Windows Subsystem for Linux is available on both Server 1 and Server 2, so running Linux-based docker images directly on the servers is not an issue.
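As a minimal sketch of the scheduled copy step such a backup container could run (the mount paths are placeholders — the SSD data and the DrivePool would be bind-mounted into the container wherever you choose; `rsync -a --delete` is the more typical tool when it is in the image, plain `cp` is shown for brevity):

```shell
# Hypothetical sync step for a cron-driven backup container.
#   src:  container-side mount of the SSD persistent data
#   dest: container-side mount of the DrivePool target
backup_data() {
  src=$1
  dest=$2
  mkdir -p "$dest"
  # copy everything, preserving attributes and timestamps
  cp -a "$src/." "$dest/"
}
```

A cron entry inside the container would then call something like `backup_data /data /pool` on whatever schedule you prefer.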

Questions:

1. Is it possible to set up docker to make use of a virtual IP on the 192.168.1.x subnet (while allowing communication between containers on the InfiniBand 10.0.0.x subnet) with automated failover if one server becomes unavailable? If this is possible, what would be the best way to do it?
2. Is it possible to run PiHole on dedicated IPs on the 192.168.1.x subnet (one copy running on each server)? If this is possible, how should I set this up? I’m happy to use dedicated VMs on both Server 1 and Server 2 rather than docker if this makes it easier.
3. For creating and using VMs with this setup, should I use VirtualBox or Hyper-V? VMs will not need failover and only need to be present on one server. VM image files would be stored on the DrivePool and thus could be launched on either server. What is the best way to set up a VM with normal networking on the 192.168.1.x subnet and private communication (data transfer etc.) on the 10.0.0.x subnet?
4. I would like to use a container manager like Portainer with this setup for simplicity. I am also not averse to using Docker Compose YAML files to specify container settings. What container manager would be suggested for my setup?
5. I would like to have additional malware protection and run ClamAV within either a docker container or VM and have it periodically scan the pooled drives. I’ve not seen a container image that will do this - any suggestions?
6. If I wanted to add ClamAV (or similar) scanning to the data in each of my running docker containers, is it possible to create a fork of these containers that adds a ClamAV service to scan the internal container data, and that will still automatically update when the upstream image is updated? Alternatively, is there a dedicated service container that will periodically connect to running docker containers and scan them? Depending on the answer, what would be the best way to collate the logs from each container’s scans and flag issues? I would prefer a single “scanning container” that can connect to all other docker containers.
7. If the answer to question 1 is “no, it isn’t possible”, what alternatives do I have for automated failover? I’m planning to implement Traefik as a virtual edge router, allowing domain-name-based access to containers without worrying about ports; would this work better and simplify things?
8. What would be the simplest method through which to make the pooled drives accessible from outside the network so I can access all the data while not at home? VPN? Dedicated file sharing container?

I am going to address a point that is not in your list of questions:

Swarm uses the Raft consensus algorithm for quorum, which requires floor(n/2)+1 healthy manager nodes to manage the cluster. Without a third node, if either one of your nodes becomes unhealthy, the cluster is headless. Though, you could add your RPi as an additional manager (I had it running like that a long time ago, it works) and simply not deploy workloads on it.
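The quorum arithmetic is easy to check: with floor(n/2)+1 managers required, a 2-manager cluster tolerates zero failures, which is exactly why the third (workload-free) manager helps:

```shell
# quorum = floor(n/2) + 1 healthy managers
# tolerated failures = n - quorum
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "$n managers: quorum $quorum, tolerates $(( n - quorum )) failure(s)"
done
```

For n=2 this prints a tolerance of 0 failures, while n=3 tolerates 1 — the smallest cluster that survives losing a manager.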

I hope someone else is going to address your list of questions 🙂

Does this mean there is always a single point of failure? Is there any way around this? I’m aware that in swarm mode I cannot mix arm and x64 servers, but I hadn’t thought of using one of the Raspberry Pis as a controller, since that can be on a different architecture. I was originally planning to use VMs as controllers, or is this not a good idea?

Just out of curiosity: what makes you think that?

Of course it works. Been there, done that: I had a three-node, manager-only cluster, which consisted of 2x x86_64 nodes and 1x Arm64 node. I used the same Ubuntu and Docker-CE version on all nodes. Worked just fine. In this setup, the loss of one manager node (regardless of which machine) can be compensated and the cluster remains controllable. Though, eventually I replaced it with another x86_64 machine in order to reliably operate replicated services that rely on consensus as well (etcd, Consul, …). An old i3 Intel NUC is cheap and just fine for that task, as is any other mini PC that can be bought cheaply on the second-hand market…

May I suggest these excellent free self paced trainings to get a better understanding of Docker and Swarm:
Introduction to Docker and Containers
Container Orchestration with Docker and Swarm

Portainer is just fine. Though, advanced users either create a Makefile and use make to control their deployment, or use a CI/CD server and create deployment jobs within it - of course with everything versioned in an SCM like Git.
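As a sketch of that versioned-deployment idea, a Makefile target or CI job would typically boil down to a single Swarm command (the stack file name and the stack name `homelab` are placeholders):

```shell
# deploy (or update in place) the stack described by the versioned file
docker stack deploy --compose-file docker-compose.yml homelab

# and the matching teardown
docker stack rm homelab
```

`docker stack deploy` is idempotent, so re-running the same job after a Git commit rolls the changed services forward without touching the rest.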

You can create a custom network in Docker that uses the InfiniBand controller as its parent, then assign this network to your services; the container will get a virtual interface that is a child interface of the InfiniBand parent interface. Though, it’s unclear what you mean by “virtual IP”. If you mean a failover IP, then this is not covered by Docker and needs to be handled at the OS level. You might want to take a look at keepalived for this - if DNS is handled by a consumer-grade router, chances are high that it will get confused when a failover happens, resulting in “flapping” name resolutions.
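A sketch of the two networks with the macvlan driver (the interface names `eth0`/`ib0`, the address ranges, and the network names are all assumptions — substitute your own; this is the single-host form, and a Swarm-scoped macvlan additionally needs per-node `--config-only` networks):

```shell
# LAN-facing macvlan: containers get their own 192.168.1.x addresses,
# so each PiHole instance can own a dedicated IP on the 1 Gb/s network
docker network create -d macvlan \
  --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
  --ip-range=192.168.1.192/27 \
  -o parent=eth0 lan_net

# InfiniBand-facing network for fast server-to-server traffic
docker network create -d macvlan \
  --subnet=10.0.0.0/24 \
  -o parent=ib0 ib_net

# example: PiHole pinned to a dedicated LAN address
docker run -d --name pihole --network lan_net --ip 192.168.1.202 \
  pihole/pihole:latest
```

One macvlan caveat to keep in mind: the host itself cannot reach its own macvlan containers directly without an extra macvlan shim interface on the host side.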

I would use neither VirtualBox nor Hyper-V. You might want to consider running ESXi or Proxmox on your hosts. Proxmox brings a lot of fun stuff to the table.

Traefik does the job just fine: register a *.domain wildcard and point it at your public interface, then let Traefik handle the subdomain-specific routing to the containers.
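A minimal sketch of that pattern with the Traefik v2 Docker provider (the domain, router name, and service image are placeholders):

```shell
# shared network so Traefik can reach the services it routes to
docker network create proxy

# Traefik itself, watching the Docker socket for labelled containers
docker run -d --name traefik --network proxy -p 80:80 \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  traefik:v2.10 \
  --providers.docker=true \
  --providers.docker.exposedbydefault=false \
  --entrypoints.web.address=:80

# a service published as plex.example.com - no host port mapping needed,
# the router rule on the label does the work
docker run -d --name plex --network proxy \
  --label traefik.enable=true \
  --label 'traefik.http.routers.plex.rule=Host(`plex.example.com`)' \
  --label traefik.http.routers.plex.entrypoints=web \
  plexinc/pms-docker
```

Each additional container just needs its own `traefik.http.routers.<name>.rule` label; Traefik picks it up at start without a restart.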