Docker swarm rebuilds all containers at different times

Hello,

We are trying to figure out why Docker Swarm suddenly rebuilds all of our containers.

We see this entry in the log

Sep 17 02:10:00 rincewind dockerd[1328]: time="2020-09-17T02:10:00.095327640-04:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=rvq7xnny86cq39caskqswtfnl session.id=x2sca7wk6j1gr5ui1295ope19 sessionID=x2sca7wk6j1gr5ui1295ope19

But we can't figure out why exactly the stack rebuild was triggered.

docker version
Client: Docker Engine - Community
 Version:           19.03.4
 API version:       1.40
 Go version:        go1.12.10
 Git commit:        9013bf583a
 Built:             Fri Oct 18 15:53:51 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.4
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.10
  Git commit:       9013bf583a
  Built:            Fri Oct 18 15:52:23 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

And the heartbeat error above means exactly what?

That's a failed heartbeat of the consensus algorithm underneath. Might be nothing. Might be a problem with network latency and/or reliability. The Raft consensus requires a low-latency, low-jitter network for proper operation.

Metin,

Thanks for the response. We currently have a Docker Swarm with 12 stacks deployed, and on several occasions (at different times) we noticed that a stack (and all of its containers) was shut down and rebuilt, causing service disruptions, but we don't know why.

The only reference we have is to look through the syslog entries, but we can't find a definitive answer. We have increased the heartbeat from 5s to 20s, then 30s, and now 1m in order to prevent the stack from shutting down the containers.
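
For reference, this is roughly how we raised it each time (run on the manager node; 1m is the latest value we tried):

# Raise the dispatcher heartbeat period for the whole swarm
docker swarm update --dispatcher-heartbeat 1m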

We have seen a spike in CPU usage when the stack is rebuilt, but we don't know the cause.

Network bandwidth? Memory?

We deployed Docker Swarm onto a single server and had no issues for well over a year, and now we seem to experience this issue every two days.

So your solution to missing heartbeats is to make them appear less frequently? Brilliant! :wink:

The heartbeat concerns cluster membership; it does not influence the timeframe required to reach consensus amongst the manager nodes for changes. Since you seem to have only one node, this shouldn't be the problem.

Ah, by rebuild you mean redeploy. Though the stack itself shouldn't be redeployed. If the number of tasks of a service matches the number of desired replicas, there shouldn't be any redeployment. Depending on your restart policy, the death of a container created by a task will either do nothing or deploy a new task to satisfy the number of desired replicas. Are you sure your containers are not OOM-killed (see: dmesg)? Of course deployments put stress on the resources; do you expect that they don't? Bootstrapping applications usually is not a cheap task…
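
A quick way to check, as a sketch (the grep pattern is just a heuristic):

# Kernel log with human-readable timestamps, filtered for OOM killer activity
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"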

If you have a single server, consensus can't be the problem. Neither can the network.

I agree with your assessment. We are trying to figure out why, on a server under little stress, the Docker manager all of a sudden rebuilds all of the containers at the same time; the application logs don't show spikes in memory or CPU. We will begin taking a closer look at dmesg for hints of OOM issues.

Is there a log where OOM-killed containers (or services) are reported?

Is it one manager or more than one?! I feel like you are not sharing all the details.
I pretty much lose interest if the level of detail is insufficient to get a fair chance to think through the situation. Some brilliant minds try to run a swarm cluster with nodes at different locations, which would pretty much explain the situation. Others simply overprovision their cluster nodes because they didn't understand why it's imperative to set resource reservations and limits for CPUs and memory.

Before you ask again: this literally is the dmesg command.

Good luck with your troubleshooting. I will leave this one to others. I am not satisfied with the level of detail provided.

Metin,
Thanks for your help

I found this link to an issue similar to the one we are facing:

https://forums.docker.com/t/containers-rebooting-because-heartbeat-to-manager-failed/73590

In any case, we are moving to turn debug mode on and see if we can narrow down the issue further. Unfortunately, the dmesg entries do not show a driver error at the same time our Docker daemon rebuilt the entire stack of applications.
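
This is the minimal sketch we are following to enable debug logging (assuming the default /etc/docker/daemon.json location; it overwrites any existing file, so merge by hand if one is already there):

# Enable daemon debug logging
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json

# dockerd reloads a few options, including "debug", on SIGHUP,
# so the running containers are left alone
sudo kill -SIGHUP "$(pidof dockerd)"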

I've found this article that recommends increasing the Docker Swarm heartbeat when running a swarm manager on VMware (with vMotion):

https://wynandbooysen.com/posts/2019-03-28-docker-swarm-heartbeat-timeout/

Is there a reason why this recommendation is made?

Taking a closer look at a killed container, it exited with code 143 (OOMKilled: false).
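
For example (the container ID here is just a placeholder), docker inspect shows this directly:

# Exit code, OOM flag and stop time of a stopped container;
# 143 is 128 + 15 (SIGTERM), i.e. the container was told to stop, not OOM-killed
docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.FinishedAt}}' <container-id>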

The syslog does not show exactly why all of the containers (the whole set of stacks) were started at the same time.

The syslog does show that the heartbeat remained at 5 seconds, even though we specified 1 minute via docker swarm update:

Sep 23 07:06:17 rincewind dockerd[1331]: time="2020-09-23T07:06:17.815447239-04:00" level=debug msg="sending heartbeat to manager { } with timeout 5s" method="(*session).heartbeat" module=node/agent node.id=rvq7xnny86cq39caskqswtfnl session.id=2n81is6v4ngyorh2pou036p9i sessionID=2n81is6v4ngyorh2pou036p9i
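
To double-check what the dispatcher is actually configured with, we can at least query the manager directly, e.g.:

# Show the dispatcher heartbeat period the swarm is currently configured with
docker info | grep -A 1 "Dispatcher"

The full output follows below.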

docker info
Client:
 Debug Mode: false

Server:
 Containers: 97
  Running: 48
  Paused: 0
  Stopped: 49
 Images: 119
 Server Version: 19.03.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: rvq7xnny86cq39caskqswtfnl
  Is Manager: true
  ClusterID: mjypevygkzjasfp59oaut6wxp
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 3
  Dispatcher:
   Heartbeat Period: About a minute

May I suggest installing Prometheus, Grafana, and a decent log management solution like Loki or ELK in your environment? What about the output of the dmesg command?

Running containers in a professional setting does not make sense without proper system monitoring and log management.
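
A minimal sketch of how the two could be started as swarm services (the service names and stock public images here are just examples; a real setup needs persistent storage and a proper Prometheus configuration):

# Metrics collection and dashboards as single-replica swarm services
docker service create --name prometheus --publish 9090:9090 prom/prometheus
docker service create --name grafana --publish 3000:3000 grafana/grafana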

We do see entries like this in dmesg -T:

[Wed Sep 23 03:07:35 2020] IPVS: Creating netns size=2200 id=8176
[Wed Sep 23 03:07:35 2020] br0: port 8(veth5359) entered forwarding state
[Wed Sep 23 03:07:36 2020] br0: port 2(veth0) entered forwarding state
[Wed Sep 23 03:07:36 2020] br0: port 20(veth5370) entered forwarding state
[Wed Sep 23 03:07:36 2020] br0: port 3(veth1) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 15(veth8c355a4) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 26(vethd0cb96e) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 12(vethd51d233) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 27(vethea45d5e) entered forwarding state

But they don't correlate directly with the timestamps at which our Docker host rebuilt all of the stacks.

So far the only theory we have is that VMware is creating issues with our Docker host and the containers are marked to be recreated. We have deployed 12 different apps (Node.js, Java/Tomcat) plus MongoDB containers, and they are all stopped and recreated without a definitive reason.

We have also added memory limits on each stack deployment and have increased the RAM and CPU on the virtual host.
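
For reference, this is the kind of limit we are applying per service (the service name and values here are only illustrative):

# Add or adjust resource limits and reservations on an already deployed service
docker service update \
  --limit-memory 512M --limit-cpu 1 \
  --reserve-memory 256M --reserve-cpu 0.5 \
  mystack_myservice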

We will include Grafana to continue the search for clues.