We have a Docker Swarm cluster with 5 nodes hosted on DigitalOcean, one of which is the manager node. Occasionally, the manager node experiences networking issues, causing the worker nodes to lose connectivity to it. After connectivity is restored, all services on the nodes restart. Is this expected behavior? From what I understand of the ‘Losing Quorum’ documentation, I thought the cluster would continue to run normally without restarts, although administrative changes wouldn’t be possible.
If the question is not clear, I can provide some logs.
On which node? The manager or a worker? Maybe some kind of health check is involved that restarts the containers? Or maybe the restart was initiated because of the network issue, but couldn’t be finished until the connection was restored.
Note: I’m not a Swarm user, so you can still wait for answers from actual Swarm users.
The services are primarily webapps that have a healthcheck on 0.0.0.0:8000/healthcheck/. The healthcheck should not depend on the manager then (at least I think). The services that were restarted are on the worker nodes (the other 4 nodes).
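For context, the healthcheck is declared in the stack file roughly like this (the service and image names here are placeholders, not our real ones):

```yaml
services:
  webapp:
    image: example/webapp:latest   # placeholder image name
    healthcheck:
      # runs inside the container via the local engine,
      # so it should not depend on the manager being reachable
      test: ["CMD", "curl", "-f", "http://0.0.0.0:8000/healthcheck/"]
      interval: 30s
      timeout: 5s
      retries: 3
```

As far as I understand, a healthcheck defined this way is executed by the Docker engine on the worker node itself, not by the manager.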
What is the difference between the network of worker nodes and the manager node? When the connection between the manager and the workers breaks, can you be sure that the worker nodes are running normally and have access to everything they need?
I could imagine for example a health check that relies on an external service like a DNS server. When that is not available, the healthcheck fails and the container is restarted.
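To illustrate, a healthcheck like the following would start failing as soon as name resolution breaks, even though the app itself is fine (the hostname is hypothetical):

```yaml
healthcheck:
  # resolving an external hostname makes the check depend on DNS,
  # so a network outage can fail the check and trigger restarts
  test: ["CMD", "curl", "-f", "http://status.example.com/ping"]
  interval: 30s
  retries: 3
```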
I believe that what you experienced shouldn’t happen normally. If it does happen, it would be important to know more about your deployment to find out why: the reason could be something that just never happens to others, because there is something special in your environment that isn’t in theirs.
You could test some scenarios, like stopping the manager node while there is nothing wrong with the network, waiting some time (about as long as the network issues usually last), and starting it again. If the containers don’t restart, the problem was not the unavailability of the manager node, but the network issue itself.
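A rough way to run that test with the standard Docker CLI could look like this (`my_webapp` is a placeholder service name; run the `systemctl` commands on the manager host itself):

```shell
# On the manager, note the current state of the service's tasks
docker service ps my_webapp

# Stop the Docker engine on the manager host
sudo systemctl stop docker

# Wait roughly as long as the network outages usually last,
# then bring the engine back up
sudo systemctl start docker

# Check again: restarted tasks show up as new entries
# with a recent value in the CURRENT STATE column
docker service ps my_webapp
```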
As a heavy Swarm user, my first question would be: why do you only have a single manager? For HA you should have 3 managers, and you can still run workloads on them.
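Promoting two of your existing workers would get you to three managers (the node names below are placeholders; substitute the names shown by `docker node ls`):

```shell
# Run on the current manager
docker node promote worker-2 worker-3

# Verify: three nodes should now show a MANAGER STATUS
# of Leader or Reachable
docker node ls
```

With three managers, the raft quorum is 2, so the cluster keeps functioning normally when any one manager becomes unreachable.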