Swarm cluster freezes when an EC2 node freezes

ricardobocchi · September 2, 2024, 10:14pm

Hello everyone,

I have a swarm cluster with 3 instances EC2s on Amazon. 2 machines are master and 1 slave. If one of these machines crashes/freezes for some reason, such as processing limit, the entire cluster crashes. So it only starts working again when the machine is turned off or restarted. It is as if the cluster is stuck due to the lack of return from the node that crashed. These crashes in EC2 instances are relatively common in my case, and sometimes it takes several minutes for the machine to shut down. This happens even if I promote the 3 instances with master. Has anyone experienced this? Is this behavior normal? Is there anything that can be done to prevent this cluster crash?

Thank you all.

Docker version 27.1.2, build d01f264

bluepuma77 · September 3, 2024, 11:40am

For a functioning Docker Swarm, you need at least 3 manager nodes (which can also act as workers).

This works fine for us, even when one node crashes.

ricardobocchi · September 3, 2024, 5:39pm

I understand, but I’ve already used it with the three nodes configured as master and I had the same problem.

meyay · September 3, 2024, 6:37pm

Please define what freeze means.

Did you try the forum search with “swarm raft headless” already?
It should find posts like this one: 3 Docker Engines - Number of Managers in Swarm - #2 by meyay

ricardobocchi · September 3, 2024, 7:43pm

I don’t know exactly what happens physically, but when I say freeze, it means that the machine stops responding for a period of time. If I connect via SSH, the connection waits and gives a timeout error. This happens when the machine stays at 100% processing for too long, so it becomes unstable, making it difficult to even restart. In this case, the entire swarm cluster freezes.

meyay · September 3, 2024, 7:59pm

Doesn’t really sound like a swarm problem to me. Sounds like the machine is overwhelmed.

I have never encountered such a situation on ec2 nodes. By any chance are you using too small instance types so that your resources are maxed out because they are not powerful enough to run the payload?

ricardobocchi · September 3, 2024, 9:57pm

It seems to me to be a swarm problem, since one machine in the cluster crashes and the swarm of the other two machines stops responding. It’s as if the swarm is waiting for a response from this crashed machine, affecting the entire cluster. If I remove the crashed node from the cluster, it starts responding again, this is when docker doesn’t crash together and also with an rpc error.

ricardobocchi · September 11, 2024, 2:34pm

Hi,

Apparently the problem was in my file system cluster. Swarm runs on three machines on top of a glusterfs to manage the files used by some services. The version of glusterfs was old (7.9) and had a bug in the case of a node crash that caused the entire cluster to crash. And this caused the swarm cluster to crash. I updated to the latest version of glusterfs (11) and so far I haven’t had any more crashing problems. So it looks like this resolved it. Thank you to those interested.

Topic		Replies	Views
No running container after node swarm failover General docker	0	782	March 9, 2018
Docker swarm completely freezes system General	2	1857	August 27, 2017
One node on a 3 node-cluster is randomly down and restarts all its services General docker , swarm	2	99	January 7, 2025
Swarm failure - voting does not work General aws	4	2441	August 29, 2016
No elected cluster leader General	2	1824	September 13, 2016

Swarm cluster freezes when an EC2 node freezes

Related topics