We have a Docker Swarm cluster with 5 nodes hosted on DigitalOcean, one of which is the manager node. Occasionally, the manager node experiences networking issues, causing the worker nodes to lose connectivity to it. After connectivity is restored, all services on the nodes restart. Is this expected behavior? From what I understand of the ‘Losing Quorum’ documentation, I thought the cluster would continue to run normally without restarts, although administrative changes wouldn’t be possible.
If the question is not clear, I can provide some logs.
On which node? The manager or a worker? Maybe some kind of health check is involved that restarts the containers? Or maybe the restart was initiated because of the network issue, but couldn’t be finished until the connection was restored.
Note: I’m not a Swarm user, so you can still wait for answers from actual Swarm users.
The services are primarily webapps that have a healthcheck on 0.0.0.0:8000/healthcheck/. The healthcheck should not depend on the manager then (at least I think). The services that were restarted are on the worker nodes (the other 4 nodes).
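For context, the healthcheck is declared in the stack file roughly like this (the service and image names here are placeholders, not our real ones):

```yaml
services:
  webapp:
    image: example/webapp:latest   # placeholder image name
    healthcheck:
      # runs inside the container via the local engine,
      # so it should not depend on the manager being reachable
      test: ["CMD", "curl", "-f", "http://0.0.0.0:8000/healthcheck/"]
      interval: 30s
      timeout: 5s
      retries: 3
```

As far as I understand, a healthcheck defined this way is executed by the Docker engine on the worker node itself, not by the manager.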
What is the difference between the network of worker nodes and the manager node? When the connection between the manager and the workers breaks, can you be sure that the worker nodes are running normally and have access to everything they need?
I could imagine for example a health check that relies on an external service like a DNS server. When that is not available, the healthcheck fails and the container is restarted.
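To illustrate, a healthcheck like the following would start failing as soon as name resolution breaks, even though the app itself is fine (the hostname is hypothetical):

```yaml
healthcheck:
  # resolving an external hostname makes the check depend on DNS,
  # so a network outage can fail the check and trigger restarts
  test: ["CMD", "curl", "-f", "http://status.example.com/ping"]
  interval: 30s
  retries: 3
```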
I believe that what you experienced shouldn’t happen normally. If it does happen, it would be important to know more about your deployment to find out why: the reason could be something that just never happens to others, because there is something special in your environment that isn’t in theirs.
You could test some scenarios, like stopping the manager node while there is nothing wrong with the network, waiting some time (about as long as the network issues usually last), and starting it again. If the containers don’t restart, the problem was not the unavailability of the manager node, but the network issue itself.
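A rough way to run that test with the standard Docker CLI could look like this (`my_webapp` is a placeholder service name; run the `systemctl` commands on the manager host itself):

```shell
# On the manager, note the current state of the service's tasks
docker service ps my_webapp

# Stop the Docker engine on the manager host
sudo systemctl stop docker

# Wait roughly as long as the network outages usually last,
# then bring the engine back up
sudo systemctl start docker

# Check again: restarted tasks show up as new entries
# with a recent value in the CURRENT STATE column
docker service ps my_webapp
```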
As a heavy Swarm user, my first question would be: why do you only have a single manager? For HA you should have 3 managers, and you can still run workloads on them.
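Promoting two of your existing workers would get you to three managers (the node names below are placeholders; substitute the names shown by `docker node ls`):

```shell
# Run on the current manager
docker node promote worker-2 worker-3

# Verify: three nodes should now show a MANAGER STATUS
# of Leader or Reachable
docker node ls
```

With three managers, the raft quorum is 2, so the cluster keeps functioning normally when any one manager becomes unreachable.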