I have a Swarm cluster with 3 EC2 instances on AWS: 2 manager nodes and 1 worker. If one of these machines crashes or freezes for some reason, such as hitting its processing limit, the entire cluster stops working. It only starts working again once the frozen machine is shut down or restarted. It is as if the cluster is stuck waiting for a reply from the node that crashed. These freezes are relatively common on my EC2 instances, and sometimes it takes several minutes for the machine to shut down. This happens even if I promote all 3 instances to manager. Has anyone experienced this? Is this behavior normal? Is there anything that can be done to prevent the whole cluster from hanging?
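For reference, this is roughly what I do when I say "promote all 3 instances to manager" (the hostnames `node2` and `node3` below are placeholders, not my real node names):

```sh
# List nodes and their current roles (run on a manager)
docker node ls

# Promote the remaining nodes so all three are managers.
# Swarm managers use Raft: with 2 managers the quorum is 2, so losing
# either one freezes the control plane; with 3 managers the quorum is
# still 2, so the cluster can tolerate one failed manager.
docker node promote node2 node3

# Verify the manager status of each node
docker node ls --format '{{.Hostname}}: {{.ManagerStatus}}'
```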
I don’t know exactly what happens physically, but when I say "freeze" I mean that the machine stops responding for a period of time: if I try to connect via SSH, the connection hangs and eventually times out. This happens when the machine sits at 100% CPU for too long; it becomes so unstable that even restarting it is difficult. When that happens, the entire Swarm cluster freezes with it.
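One mitigation I am looking at is capping per-service resources so a runaway container can’t pin the host at 100% CPU and starve the kernel, SSH, and the Swarm agent. A minimal sketch, assuming a hypothetical service named `web`:

```sh
# Cap how much CPU/memory this service's containers may use, and
# reserve a floor so the scheduler doesn't overcommit the node.
docker service update \
  --limit-cpu 1.5 \
  --limit-memory 1G \
  --reserve-cpu 0.25 \
  --reserve-memory 256M \
  web
```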
Doesn’t really sound like a swarm problem to me. Sounds like the machine is overwhelmed.
I have never encountered such a situation on EC2 nodes. By any chance are you using instance types that are too small, so that your resources are maxed out because the machines are not powerful enough to run the workload?
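A quick way to check whether that is happening, run on the node while it is under load (standard tools, nothing Swarm-specific):

```sh
docker stats --no-stream   # per-container CPU and memory usage
uptime                     # load average vs. the number of vCPUs
free -h                    # memory pressure and swap usage
```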
It does seem to me to be a Swarm problem, since one machine in the cluster freezes and the Swarm on the other two machines stops responding. It’s as if the Swarm is waiting for a reply from the frozen machine, affecting the entire cluster. If I remove the frozen node from the cluster, the Swarm starts responding again; that is, when the Docker daemon itself hasn’t crashed along with it and started throwing an RPC error.
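For reference, this is roughly the recovery sequence I use, with `node2` standing in for whatever hostname the frozen node has in `docker node ls`:

```sh
# If the remaining managers still have quorum, demote the dead node
# so it leaves the Raft consensus group, then remove it:
docker node demote node2
docker node rm --force node2

# Last resort, when manager quorum is lost entirely: rebuild a
# single-manager cluster from a surviving manager's local state,
# then re-join and re-promote the other nodes.
docker swarm init --force-new-cluster
```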