Snapshots/Backups Causing service and node restarts

Hi All,

We have a 9-node Docker Swarm cluster running Docker CE on Ubuntu 20.04.6 LTS. We have an issue whereby, when we take snapshots or backups of a server, the server gets stunned (paused) and exceeds the heartbeat timeout to the manager nodes. This results in service restarts and, in some cases, node restarts or crashes. We have tried increasing the heartbeat timeout (see the command below), but that didn't help much; we still experience these restarts or crashes in some cases. Below is an example of the logs we see on the server when an event takes place.
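For reference, the timeout we adjusted is the swarm dispatcher heartbeat, roughly like this from a manager node (the 30s value here is just an example, not a recommendation):

docker swarm update --dispatcher-heartbeat 30s
docker info | grep -A 1 Dispatcher    # shows the current Heartbeat Period under the Swarm section

The default is 5 seconds, so even a much larger value only buys a few extra seconds before a stunned node is marked down.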

Error:

Sep 28 04:08:46 server1 dockerd[7780]: time="2024-09-28T04:08:46.328807090+02:00" level=warning msg="memberlist: Refuting a suspect message (from: db260e00d280)"
Sep 28 04:08:51 server1 dockerd[7780]: time="2024-09-28T04:08:51.146332472+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = NotFound desc = node not registered" method="(*session).heartbeat" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6 session.id=qt4r86c8e0laeg6m9fcotpeun sessionID=qt4r86c8e0laeg6m9fcotpeun
Sep 28 04:08:51 server1 dockerd[7780]: time="2024-09-28T04:08:51.146432360+02:00" level=error msg="agent: session failed" backoff=100ms error="node not registered" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:51 server1 dockerd[7780]: time="2024-09-28T04:08:51.146476793+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:51 server1 dockerd[7780]: time="2024-09-28T04:08:51.146518258+02:00" level=info msg="waiting 81.73657ms before registering session" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:56 server1 dockerd[7780]: time="2024-09-28T04:08:56.228575059+02:00" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:56 server1 dockerd[7780]: time="2024-09-28T04:08:56.228653052+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:56 server1 dockerd[7780]: time="2024-09-28T04:08:56.228683727+02:00" level=info msg="waiting 189.69744ms before registering session" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260326989+02:00" level=warning msg="7d3c2a689a61a2b6 stepped down to follower since quorum is not active" module=raft node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260388178+02:00" level=info msg="7d3c2a689a61a2b6 became follower at term 266159" module=raft node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260400991+02:00" level=info msg="raft.node: 7d3c2a689a61a2b6 lost leader 7d3c2a689a61a2b6 at term 266159" module=raft node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260467630+02:00" level=error msg="soft state changed, node no longer a leader, resetting and cancelling all waits" raft_id=7d3c2a689a61a2b6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260523611+02:00" level=info msg="dispatcher stopping" method="(*Dispatcher).Stop" module=dispatcher node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260581061+02:00" level=info msg="worker lnt24pajf0wi4sxx30ozjufp6 was successfully registered" method="(*Dispatcher).register"
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260680802+02:00" level=info msg="dispatcher session dropped, marking node tvjqjkqywh2gn41uvhw1mrft7 down" forwarder.id=n1gp98qhx56anqdaue7xvt7p7 method="(*Dispatcher).Session" node.id=tvjqjkqywh2gn41uvhw1mrft7 node.session=ok7q3a61rhtb56bvczzq3g5x2
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260734463+02:00" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" forwarder.id=n1gp98qhx56anqdaue7xvt7p7 method="(*Dispatcher).Session" node.id=tvjqjkqywh2gn41uvhw1mrft7 node.session=ok7q3a61rhtb56bvczzq3g5x2

I was hoping to find out whether anyone has faced a similar issue and, if so, what workaround you used.

Your help would be highly appreciated.

I have no experience with this situation, but if your snapshot or backup solution uses something like fsfreeze, that might explain the behavior.
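If you want to test that theory without waiting for the next backup window, you could reproduce the stall manually with fsfreeze on a non-critical worker. This is just a sketch, and it assumes /var/lib/docker sits on its own mount; otherwise freeze whichever mount it lives on:

# freeze writes on the filesystem backing Docker for ~30 seconds, then thaw it
sudo fsfreeze --freeze /var/lib/docker
sleep 30
sudo fsfreeze --unfreeze /var/lib/docker

If the node gets marked down and services get rescheduled during that window, the freeze/quiesce step of the backup tooling is very likely the culprit.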

Before you ask how to work around it: I have no idea.

I would guess it depends on the filesystem and backup tools you use.

We use VMware snapshots and backups.