Docker is shutting down all my stacks frequently

I have several stacks running on my CentOS server (I deploy everything with Swarm, but it is a single-machine deployment, not a multi-node cluster). I'm noticing that on some days all the containers go down (as you can see in the image) and then come back up, which is interrupting some of my processes without any apparent cause.

Considerations:

  • The containers come from different images with different behaviors, so it is not an error inside any single image.
  • None of the containers have error logs.
  • I noticed that some containers I started with Docker Compose (docker compose up -d) did not shut down, so I deduce that this is specifically a Docker Swarm problem.
  • Here are the logs I got with the command journalctl -u docker.service (the exact commands I can run to gather more details are listed after the excerpt):

jun 19 03:52:03 myuser dockerd[27439]: time="2023-06-19T03:52:02.867842622-03:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv session.id=notogbay6a58h88xhkt6ddvl8 sessionID=notogbay6a58h88xhkt6ddvl8
jun 19 03:52:14 myuser dockerd[27439]: time="2023-06-19T03:52:14.307251296-03:00" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:52:14 myuser dockerd[27439]: time="2023-06-19T03:52:14.319406267-03:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:52:14 myuser dockerd[27439]: time="2023-06-19T03:52:14.339781506-03:00" level=info msg="waiting 56.40044ms before registering session" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:52:16 myuser dockerd[27439]: time="2023-06-19T03:52:16.376226102-03:00" level=info msg="worker mftlnmohlt2asl8f982id3vyv was successfully registered" method="(*Dispatcher).register"
jun 19 03:52:33 myuser dockerd[27439]: time="2023-06-19T03:52:32.165209518-03:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv session.id=q7zz8xladr450zu54jnef8pfv sessionID=q7zz8xladr450zu54jnef8pfv
jun 19 03:52:33 myuser dockerd[27439]: time="2023-06-19T03:52:32.622792956-03:00" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:52:39 myuser dockerd[27439]: time="2023-06-19T03:52:37.141884729-03:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:53:08 myuser dockerd[27439]: time="2023-06-19T03:52:50.646468486-03:00" level=info msg="waiting 91.824135ms before registering session" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:53:25 myuser dockerd[27439]: time="2023-06-19T03:53:13.968378401-03:00" level=error msg="failed deregistering node after heartbeat expiration" error="node mftlnmohlt2asl8f982id3vyv is not found in local storage"
jun 19 03:53:29 myuser dockerd[27439]: time="2023-06-19T03:53:26.702734867-03:00" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:53:52 myuser dockerd[27439]: time="2023-06-19T03:53:44.234405701-03:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:53:53 myuser dockerd[27439]: time="2023-06-19T03:53:52.716848728-03:00" level=info msg="waiting 125.136975ms before registering session" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:53:56 myuser dockerd[27439]: time="2023-06-19T03:53:55.505274123-03:00" level=error msg="Attempting to transfer leadership" raft_id=7ed6a6a73ad485b0
jun 19 03:53:59 myuser dockerd[27439]: time="2023-06-19T03:53:59.104766225-03:00" level=error msg="agent: session failed" backoff=700ms error="session initiation timed out" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:53:59 myuser dockerd[27439]: time="2023-06-19T03:53:59.105003983-03:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
jun 19 03:53:59 myuser dockerd[27439]: time="2023-06-19T03:53:59.105058043-03:00" level=info msg="waiting 151.692502ms before registering session" module=node/agent node.id=mftlnmohlt2asl8f982id3vyv
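
If it helps, I can collect more details with something like the following (the service name and time window are just examples, not my real ones):

# daemon logs around the time the containers went down
journalctl -u docker.service --since "2023-06-19 03:45" --until "2023-06-19 04:00"

# task history of one affected Swarm service, with full error messages
docker service ps --no-trunc mystack_myservice

# state of the (single) Swarm node
docker node ls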

If someone could shed some light on the issue I would appreciate it.

My first guess was that the swarm node does not have enough memory or CPU resources. Then I searched for the error message

"failed deregistering node after heartbeat expiration"

and found this issue, which suggests the same: CVE-2016-6595 · Issue #25629 · moby/moby · GitHub

It was only the first result, so it might not be the same problem; you could try searching for more results. My guess is that Compose is not affected because only Swarm deregisters nodes (Docker Compose doesn't have that concept), and when a Swarm node is deregistered its services get rescheduled, which on a single-node setup would look like the containers being shut down and recreated. I am not using Swarm myself, so these are just guesses.
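
If it really is the heartbeat timing out, something that might be worth trying (I haven't tested it myself, and the 20s value is only an example) is relaxing the dispatcher heartbeat period on the manager, and checking whether the node was under memory pressure at the time:

# increase the dispatcher heartbeat period (the default is 5s), so short
# stalls on the manager are less likely to expire the node's heartbeat
docker swarm update --dispatcher-heartbeat 20s

# look for out-of-memory kills around the time of the restarts
dmesg -T | grep -i -E "out of memory|oom"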

Thanks! I’ll take a look!