Docker swarm rebuilds all containers at different times

Hello,

We are trying to figure out why Docker Swarm suddenly rebuilds all of our containers.

We see this entry in the log

Sep 17 02:10:00 rincewind dockerd[1328]: time="2020-09-17T02:10:00.095327640-04:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=rvq7xnny86cq39caskqswtfnl session.id=x2sca7wk6j1gr5ui1295ope19 sessionID=x2sca7wk6j1gr5ui1295ope19

But we can't figure out why exactly the stack rebuild was triggered.

docker version
Client: Docker Engine - Community
 Version:           19.03.4
 API version:       1.40
 Go version:        go1.12.10
 Git commit:        9013bf583a
 Built:             Fri Oct 18 15:53:51 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.4
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.10
  Git commit:       9013bf583a
  Built:            Fri Oct 18 15:52:23 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

And the heartbeat error above means exactly what?

That's a failed heartbeat of the consensus algorithm underneath. Might be nothing. Might be a problem with network latency and/or reliability. The Raft consensus requires a low-latency, low-jitter network for proper operation.

Metin,

Thanks for the response. We currently have a Docker Swarm with 12 stacks deployed, and on several occasions (at different times) we noticed that a stack (and all of its containers) was shut down and rebuilt, causing service disruptions, but we don't know why.

The only reference we have is to look through the syslog entries, but we can't find a definitive answer. We have increased the heartbeat from 5s to 20s, then 30s, and now 1m in order to prevent the stack from shutting down the containers.
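
For reference, this is roughly how we raised it each time (run on the manager node; 1m is the latest value we tried):

# Raise the dispatcher heartbeat period for the whole swarm
docker swarm update --dispatcher-heartbeat 1m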

We have seen a spike in CPU usage when the stack is rebuilt, but we don't know the cause.

Network bandwidth? Memory?

We deployed Docker Swarm onto a single server and had no issues for well over a year, and now we seem to experience this issue every two days.

So your solution to missing heartbeats is to make them appear less frequently? Brilliant! :wink:

The heartbeat concerns cluster membership; it does not influence the timeframe required to reach consensus amongst the manager nodes for changes. Since you seem to have only one node, this shouldn't be the problem.

Ah, by rebuild you mean redeploy. Though the stack itself shouldn't be redeployed. If the number of tasks of a service matches the number of desired replicas, there shouldn't be any redeployment. Depending on your restart policy, the death of a container created by a task will either do nothing or deploy a new task to satisfy the number of desired replicas. Are you sure your containers are not OOM-killed (see: dmesg)? Of course deployments put stress on the resources; do you expect that they don't? Bootstrapping applications usually is not a cheap task…
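
A quick way to check, as a sketch (the grep pattern is just a heuristic):

# Kernel log with human-readable timestamps, filtered for OOM killer activity
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"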

If you have a single server, consensus can't be the problem. Neither can the network.

I agree with your assessment. We are trying to figure out why, on a server under little stress, the Docker manager all of a sudden rebuilds all of the containers at the same time; the application logs don't show spikes in memory or CPU. We will begin taking a closer look at dmesg for hints of OOM issues.

Is there a log where OOM-killed containers (or services) are reported?

Is it one manager or more than one?! I feel like you are not sharing all the details.
I pretty much lose interest if the level of detail is insufficient to get a fair chance to think through the situation. Some brilliant minds try to run a swarm cluster with nodes at different locations, which would pretty much explain the situation. Others simply overprovision their cluster nodes because they didn't understand why it's imperative to set resource reservations and limits for CPUs and memory.

Before you ask again: this literally is the dmesg command.

Good luck with your troubleshooting. I will leave this one to others. I am not satisfied with the level of detail provided.

Metin,
Thanks for your help

I found this link to an issue similar to the one we are facing:

https://forums.docker.com/t/containers-rebooting-because-heartbeat-to-manager-failed/73590

In any case, we are moving to turn debug mode on and see if we can narrow down the issue further. Unfortunately, the dmesg entries do not show a driver error at the same time our Docker daemon rebuilt the entire stack of applications.
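
This is the minimal sketch we are following to enable debug logging (assuming the default /etc/docker/daemon.json location; it overwrites any existing file, so merge by hand if one is already there):

# Enable daemon debug logging
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json

# dockerd reloads a few options, including "debug", on SIGHUP,
# so the running containers are left alone
sudo kill -SIGHUP "$(pidof dockerd)"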

I've found this article that recommends increasing the Docker Swarm heartbeat when running a swarm manager on VMware (with vMotion):

https://wynandbooysen.com/posts/2019-03-28-docker-swarm-heartbeat-timeout/

Is there a reason why this recommendation is made?

Taking a closer look at a killed container, it exited with code 143 (OOMKilled: false).
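
For example (the container ID here is just a placeholder), docker inspect shows this directly:

# Exit code, OOM flag and stop time of a stopped container;
# 143 is 128 + 15 (SIGTERM), i.e. the container was told to stop, not OOM-killed
docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.FinishedAt}}' <container-id>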

The syslog does not show exactly why all of the containers (the whole set of stacks) were started at the same time.

The syslog does show that the heartbeat remained at 5 seconds, even though we specified 1 minute via docker swarm update:

Sep 23 07:06:17 rincewind dockerd[1331]: time="2020-09-23T07:06:17.815447239-04:00" level=debug msg="sending heartbeat to manager { } with timeout 5s" method="(*session).heartbeat" module=node/agent node.id=rvq7xnny86cq39caskqswtfnl session.id=2n81is6v4ngyorh2pou036p9i sessionID=2n81is6v4ngyorh2pou036p9i
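
To double-check what the dispatcher is actually configured with, we can at least query the manager directly, e.g.:

# Show the dispatcher heartbeat period the swarm is currently configured with
docker info | grep -A 1 "Dispatcher"

The full output follows below.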

docker info
Client:
 Debug Mode: false

Server:
 Containers: 97
  Running: 48
  Paused: 0
  Stopped: 49
 Images: 119
 Server Version: 19.03.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: rvq7xnny86cq39caskqswtfnl
  Is Manager: true
  ClusterID: mjypevygkzjasfp59oaut6wxp
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 3
  Dispatcher:
   Heartbeat Period: About a minute

May I suggest installing Prometheus, Grafana, and a decent log management solution like Loki or ELK in your environment? What about the output of the dmesg command?

Running containers in a professional setting does not make sense without proper system monitoring and log management.
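
A minimal sketch of how the two could be started as swarm services (the service names and stock public images here are just examples; a real setup needs persistent storage and a proper Prometheus configuration):

# Metrics collection and dashboards as single-replica swarm services
docker service create --name prometheus --publish 9090:9090 prom/prometheus
docker service create --name grafana --publish 3000:3000 grafana/grafana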

We do see entries like this in dmesg -T:

[Wed Sep 23 03:07:35 2020] IPVS: Creating netns size=2200 id=8176
[Wed Sep 23 03:07:35 2020] br0: port 8(veth5359) entered forwarding state
[Wed Sep 23 03:07:36 2020] br0: port 2(veth0) entered forwarding state
[Wed Sep 23 03:07:36 2020] br0: port 20(veth5370) entered forwarding state
[Wed Sep 23 03:07:36 2020] br0: port 3(veth1) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 15(veth8c355a4) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 26(vethd0cb96e) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 12(vethd51d233) entered forwarding state
[Wed Sep 23 03:07:36 2020] docker_gwbridge: port 27(vethea45d5e) entered forwarding state

But they don't correlate directly with the timestamps at which our Docker host rebuilt all of the stacks.

So far the only theory we have is that VMware is creating issues with our Docker host and the containers are marked to be recreated. We have deployed 12 different apps (Node.js, Java/Tomcat) plus MongoDB containers, and they are all stopped and recreated without a definitive reason.

We have also added memory limits on each stack deployment and have increased the RAM and CPU on the virtual host.
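
For reference, this is the kind of limit we are applying per service (the service name and values here are only illustrative):

# Add or adjust resource limits and reservations on an already deployed service
docker service update \
  --limit-memory 512M --limit-cpu 1 \
  --reserve-memory 256M --reserve-cpu 0.5 \
  mystack_myservice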

We will include Grafana to continue the search for clues.