Hi All,
We have a 9-node Docker cluster (Swarm mode) running Docker CE on Ubuntu 20.04.6 LTS. Whenever we take snapshots or backups of a server, the server gets stunned and exceeds the heartbeat timeout to the manager nodes. This results in service restarts and, in some cases, node restarts or crashes. We have tried increasing the heartbeat timeout, but that didn't help much: we still see restarts and occasional crashes. Below is an example of the logs we see on the server when such an event takes place.
Error:
Sep 28 04:08:46 server1 dockerd[7780]: time="2024-09-28T04:08:46.328807090+02:00" level=warning msg="memberlist: Refuting a suspect message (from: db260e00d280)"
Sep 28 04:08:51 server1 dockerd[7780]: time="2024-09-28T04:08:51.146332472+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = NotFound desc = node not registered" method="(*session).heartbeat" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6 session.id=qt4r86c8e0laeg6m9fcotpeun sessionID=qt4r86c8e0laeg6m9fcotpeun
Sep 28 04:08:51 server1 dockerd[7780]: time="2024-09-28T04:08:51.146432360+02:00" level=error msg="agent: session failed" backoff=100ms error="node not registered" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:51 server1 dockerd[7780]: time="2024-09-28T04:08:51.146476793+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:51 server1 dockerd[7780]: time="2024-09-28T04:08:51.146518258+02:00" level=info msg="waiting 81.73657ms before registering session" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:56 server1 dockerd[7780]: time="2024-09-28T04:08:56.228575059+02:00" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:56 server1 dockerd[7780]: time="2024-09-28T04:08:56.228653052+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:56 server1 dockerd[7780]: time="2024-09-28T04:08:56.228683727+02:00" level=info msg="waiting 189.69744ms before registering session" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260326989+02:00" level=warning msg="7d3c2a689a61a2b6 stepped down to follower since quorum is not active" module=raft node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260388178+02:00" level=info msg="7d3c2a689a61a2b6 became follower at term 266159" module=raft node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260400991+02:00" level=info msg="raft.node: 7d3c2a689a61a2b6 lost leader 7d3c2a689a61a2b6 at term 266159" module=raft node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260467630+02:00" level=error msg="soft state changed, node no longer a leader, resetting and cancelling all waits" raft_id=7d3c2a689a61a2b6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260523611+02:00" level=info msg="dispatcher stopping" method="(*Dispatcher).Stop" module=dispatcher node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260581061+02:00" level=info msg="worker lnt24pajf0wi4sxx30ozjufp6 was successfully registered" method="(*Dispatcher).register"
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260680802+02:00" level=info msg="dispatcher session dropped, marking node tvjqjkqywh2gn41uvhw1mrft7 down" forwarder.id=n1gp98qhx56anqdaue7xvt7p7 method="(*Dispatcher).Session" node.id=tvjqjkqywh2gn41uvhw1mrft7 node.session=ok7q3a61rhtb56bvczzq3g5x2
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260734463+02:00" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" forwarder.id=n1gp98qhx56anqdaue7xvt7p7 method="(*Dispatcher).Session" node.id=tvjqjkqywh2gn41uvhw1mrft7 node.session=ok7q3a61rhtb56bvczzq3g5x2
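For anyone comparing this against their own logs: below is a minimal Python sketch of one way to measure how long dockerd went quiet between entries, which may help line up a pause with the snapshot window. It assumes each log entry sits on a single line in the format shown above. The names parse_events and find_gaps and the 3-second threshold are illustrative choices, not anything Docker provides. An idle daemon also produces gaps, so a hit only suggests a stall; it does not prove one.

```python
import re
from datetime import datetime

# Logrus-style fields as they appear in the dockerd journal lines above.
LINE_RE = re.compile(
    r'time="(?P<time>[^"]+)"\s+level=(?P<level>\w+)\s+msg="(?P<msg>[^"]*)"'
)
# Splits a timestamp into base, fractional seconds, and UTC offset.
STAMP_RE = re.compile(r'^(.+?)\.(\d+)([+-]\d{2}:\d{2})$')


def parse_time(stamp):
    """Parse a dockerd timestamp, trimming nanoseconds to microseconds."""
    m = STAMP_RE.match(stamp)
    if m is None:
        return datetime.fromisoformat(stamp)
    base, frac, offset = m.groups()
    micro = frac[:6].ljust(6, "0")  # datetime has microsecond resolution
    return datetime.fromisoformat(f"{base}.{micro}{offset}")


def parse_events(lines):
    """Return (timestamp, level, msg) for every line that matches."""
    events = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            events.append((parse_time(m["time"]), m["level"], m["msg"]))
    return events


def find_gaps(events, threshold_s=3.0):
    """Return (start, end, seconds) for consecutive entries at least
    threshold_s apart. The threshold is a hypothetical default."""
    gaps = []
    for (t0, _, _), (t1, _, _) in zip(events, events[1:]):
        delta = (t1 - t0).total_seconds()
        if delta >= threshold_s:
            gaps.append((t0, t1, delta))
    return gaps


# Three entries taken from the excerpt above, one per line.
SAMPLE = '''\
Sep 28 04:08:51 server1 dockerd[7780]: time="2024-09-28T04:08:51.146432360+02:00" level=error msg="agent: session failed" backoff=100ms error="node not registered" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:56 server1 dockerd[7780]: time="2024-09-28T04:08:56.228575059+02:00" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=lnt24pajf0wi4sxx30ozjufp6
Sep 28 04:08:58 server1 dockerd[7780]: time="2024-09-28T04:08:58.260326989+02:00" level=warning msg="7d3c2a689a61a2b6 stepped down to follower since quorum is not active" module=raft node.id=lnt24pajf0wi4sxx30ozjufp6
'''

if __name__ == "__main__":
    for start, end, seconds in find_gaps(parse_events(SAMPLE.splitlines())):
        print(f"{seconds:.3f}s of silence between {start} and {end}")
```

Run against a full journal export, the reported gaps can be compared with the times the snapshot or backup job started and finished.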
I was hoping to find out whether anyone has faced a similar issue and, if so, what workaround helped.
Your help would be highly appreciated.