That’s a failed heartbeat of the consensus algorithm underneath. It might be nothing, or it might be a problem with network latency and/or reliability. The Raft consensus algorithm requires a low-latency, low-jitter network for proper operation.
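If you want to confirm that it really is heartbeat trouble, a quick sketch (assuming a systemd host with the standard docker.service unit) is to check how the swarm sees its nodes and to search the daemon log for heartbeat messages:

# how the swarm currently sees its nodes (run on a manager)
docker node ls

# daemon log entries mentioning heartbeats or the dispatcher
journalctl -u docker.service --no-pager | grep -iE 'heartbeat|dispatcher'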
Thanks for the response. We currently have a Docker Swarm with 12 stacks deployed, and on several occasions (at different times) we noticed that a stack, including all of its containers, was shut down and rebuilt, causing service disruptions, but we don’t know why.
The only reference we have is to look through the syslog entries, but we can’t find a definitive answer there. We have increased the heartbeat from 5s to 20s, then 30s, and now 1m in order to prevent the stack from shutting down the containers.
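For context, the setting we changed is the swarm dispatcher heartbeat; assuming it was done via the CLI, the change looked roughly like this (1m being the current value):

# show the swarm’s dispatcher and raft settings (run on a manager)
docker info

# raise the dispatcher heartbeat period from the 5s default
docker swarm update --dispatcher-heartbeat 1m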
We have seen a spike in CPU usage (and in network bandwidth and memory) when the stack is rebuilt, but we don’t know what the cause is.
We deployed Docker Swarm onto a single server and had no issues for well over a year, and now we seem to experience this issue every two days.
So your solution to missing heartbeats is to make them appear less frequently? Brilliant!
The heartbeat concerns cluster membership; it does not influence the time frame required to reach consensus amongst the manager nodes for changes. Though, since you seem to have only one node, this shouldn’t be the problem.
Ah, by rebuild you mean redeploy. The stack itself shouldn’t be redeployed, though. If the number of tasks of a service matches the number of desired replicas, there shouldn’t be any redeployment. Depending on your restart policy, the death of a container created by a task will either do nothing or deploy a new task to satisfy the number of desired replicas. Are you sure your containers are not OOM-killed (see: dmesg)? Of course deployments put stress on the resources - do you expect that they don’t? Bootstrapping applications usually is not a cheap task…
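To rule the OOM killer out quickly, something like this should do (just a sketch; adjust the grep patterns to your distribution’s kernel messages):

# human-readable timestamps make it easier to correlate with the daemon log
dmesg -T | grep -iE 'out of memory|oom|killed process'

# containers that were OOM-killed are also flagged in their inspect output
docker inspect --format '{{.State.OOMKilled}} {{.Name}}' $(docker ps -aq)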
If you have a single server, consensus can’t be the problem. Neither can the network.
I agree with your assessment. We are trying to figure out why, on a server with little stress, all of the Docker managers suddenly have to rebuild all of the containers at the same time; the application logs don’t show spikes in memory or CPU. We will begin taking a closer look at dmesg for hints of OOM issues.
Is it one manager or more than one?! I feel like you are not sharing all the details.
I pretty much lose interest if the level of detail is insufficient to give a fair chance to think through the situation. Some brilliant minds try to run a swarm cluster with nodes at different locations - which would pretty much explain the situation. Others simply overprovision their cluster nodes because they didn’t understand why it’s imperative to set resource reservations and limits for CPU and memory.
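For illustration only (the service name, image and numbers below are made up), this is the kind of per-service reservation/limit block meant here, in a version 3 stack file:

version: "3.7"
services:
  app:                            # hypothetical service name
    image: mycompany/app:latest   # placeholder image
    deploy:
      resources:
        reservations:
          cpus: "0.50"
          memory: 256M
        limits:
          cpus: "1.00"
          memory: 512M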
Before you ask again: this literally is the dmesg command.
Good luck with your troubleshooting. I will leave this one to others. I am not satisfied with the level of detail provided.
In any case, we are moving to turn debug mode on and see if we can narrow down the issue further. Unfortunately, the dmesg entries do not show a driver error at the same time our Docker daemon rebuilt the entire stack of applications.
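For the record, the way we expect to enable debug mode (a sketch; the path assumes a standard Linux install) is via /etc/docker/daemon.json:

# /etc/docker/daemon.json
{
  "debug": true
}

# the daemon re-reads this setting on SIGHUP, so running containers are untouched
sudo kill -SIGHUP $(pidof dockerd)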
May I suggest installing Prometheus, Grafana and a decent log management solution like Loki or ELK in your environment? What about the output of the dmesg command?
Running containers in a professional setting does not make sense without proper system monitoring and log management.
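If you go that route, the Docker engine itself can expose Prometheus metrics; a minimal sketch (the address, target host and job name are just examples, and older engine versions additionally require experimental mode for this):

# /etc/docker/daemon.json - expose engine metrics
{
  "metrics-addr": "0.0.0.0:9323"
}

# prometheus.yml - scrape the engine
scrape_configs:
  - job_name: docker
    static_configs:
      - targets: ["dockerhost:9323"]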
[Wed Sep 23 03:07:35 2020] IPVS: Creating netns size=2200 id=8176
[Wed Sep 23 03:07:35 2020] br0: port 8(veth5359) entered forwarding state
[Wed Sep 23 03:07:36 2020] br0: port 2(veth0) entered forwarding state
[Wed Sep 23 03:07:36 2020] br0: port 20(veth5370) entered forwarding state
[Wed Sep 23 03:07:36 2020] br0: port 3(veth1) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 15(veth8c355a4) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 26(vethd0cb96e) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 12(vethd51d233) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 27(vethea45d5e) entered forwarding state
But they don’t correlate directly with the timestamps at which our Docker host rebuilt all of the stacks.
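To line the two up, we plan to compare dmesg against the daemon log for the same window, along these lines (times copied from the dmesg excerpt above; assumes a systemd host):

journalctl -u docker.service --since "2020-09-23 03:00:00" --until "2020-09-23 03:15:00" --no-pager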
So far the only theory we have is that VMware is creating issues with our Docker host and the containers are marked to be recreated. We have deployed 12 different apps (Node, Java Tomcat) and mongo_db containers, and they are all stopped and recreated without a definitive reason.
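Before settling on that theory, we will also check what the swarm itself records about why the tasks were shut down; roughly like this (the service name below is a placeholder):

# per-task history including the error that caused each shutdown
docker service ps --no-trunc <stack>_<service>

# recent engine events: container die/kill, service updates, etc. (Ctrl-C to stop)
docker events --since 24h --filter type=container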
We have also added memory limits to each stack deployment and increased the RAM and CPU on the virtual host.
We will also bring in Grafana to continue searching for clues.