Containers rebooting because "heartbeat to manager failed"

Hi,

Currently, my Docker Swarm of size one (a single manager node) fails randomly (zero to two times a day), causing all three containers to restart. The dockerd daemon itself seems to keep running, but all container processes are restarted. The syslog shows an rpc error when sending a heartbeat to the manager (probably raft related). I have checked the firewalls and all connectivity is OK (the checks I ran are sketched after the log below). I also tried rebooting and upgrading Docker (currently on 18.09.5). In the syslog, I found the following events around the time the containers restarted:

Apr 23 04:43:36 moji dockerd[1028]: time="2019-04-23T04:43:36.998060034+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3 session.id=ibdf9pf8d7kqpj4sde8xe65y4 sessionID=ibdf9pf8d7kqpj4sde8xe65y4
Apr 23 04:43:40 moji dockerd[1028]: time="2019-04-23T04:43:39.656046295+02:00" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:43:40 moji dockerd[1028]: time="2019-04-23T04:43:39.656114767+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:43:40 moji dockerd[1028]: time="2019-04-23T04:43:39.656148174+02:00" level=info msg="waiting 84.528164ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:46:35 moji dockerd[1028]: time="2019-04-23T04:46:35.142610538+02:00" level=info msg="NetworkDB stats moji(c886990ac975) - netID:vyjqxsku2usz1s5afv1d8x0t4 leaving:false netPeers:1 entries:6 Queue qLen:0 netMsg/s:0"
Apr 23 04:47:03 moji dockerd[1028]: time="2019-04-23T04:46:35.143125228+02:00" level=info msg="NetworkDB stats moji(c886990ac975) - netID:t33xqei65zzgmgnmx7ihkdlyc leaving:false netPeers:1 entries:7 Queue qLen:0 netMsg/s:0"
Apr 23 04:50:36 moji dockerd[1028]: time="2019-04-23T04:50:36.806407155+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3 session.id=ibdf9pf8d7kqpj4sde8xe65y4 sessionID=ibdf9pf8d7kqpj4sde8xe65y4
Apr 23 04:50:36 moji dockerd[1028]: time="2019-04-23T04:50:36.806484088+02:00" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:36 moji dockerd[1028]: time="2019-04-23T04:50:36.806520497+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:36 moji dockerd[1028]: time="2019-04-23T04:50:36.806562415+02:00" level=info msg="waiting 2.771051ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:41.809465638+02:00" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:41.809522069+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:41.809547040+02:00" level=info msg="waiting 170.952142ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:46.980786802+02:00" level=error msg="agent: session failed" backoff=700ms error="session initiation timed out" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:46.980847514+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:46.980870905+02:00" level=info msg="waiting 590.898832ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: sync duration of 3.183702675s, expected less than 1s
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.507172566+02:00" level=info msg="worker 9xweikfkwzpb63mx5piecnyh3 was successfully registered" method="(*Dispatcher).register"
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.508033816+02:00" level=info msg="worker 9xweikfkwzpb63mx5piecnyh3 was successfully registered" method="(*Dispatcher).register"
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.508117640+02:00" level=info msg="worker 9xweikfkwzpb63mx5piecnyh3 was successfully registered" method="(*Dispatcher).register"
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509039632+02:00" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = InvalidArgument desc = session invalid" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509071441+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509090904+02:00" level=info msg="waiting 62.31046ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509127589+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = Canceled desc = context canceled" method="(*session).heartbeat" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3 session.id=aj3b9l26xygg3pg2lhr1h39ev sessionID=aj3b9l26xygg3pg2lhr1h39ev
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509156150+02:00" level=error msg="closing session after fatal error" error="rpc error: code = InvalidArgument desc = session invalid" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509169879+02:00" level=error msg="status reporter failed to report status to agent" error="rpc error: code = InvalidArgument desc = session invalid" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.618844113+02:00" level=info msg="worker 9xweikfkwzpb63mx5piecnyh3 was successfully registered" method="(*Dispatcher).register"

BTW, I have an 8-node swarm with a similar (if not identical) setup that has been running without problems for months, so I'm a bit lost here.

Thanks in advance for any replies!

Kind regards,
Jarno

This is what it looks like with debug mode enabled. The heartbeat that should come between the third and fourth log lines below (somewhere between 20:56:50 and the expiration at 20:57:08) is apparently never sent, or sent too late, for reasons unknown…
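
(For anyone wanting to reproduce this: I enabled debug logging roughly as follows. This is a sketch assuming dockerd reads /etc/docker/daemon.json; the daemon re-reads the debug flag on SIGHUP, so the containers keep running.)

Add to /etc/docker/daemon.json:

{
  "debug": true
}

Then reload the daemon configuration:

sudo kill -SIGHUP $(pidof dockerd)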

Apr 23 20:56:50 moji dockerd[1028]: time="2019-04-23T20:56:50.465729255+02:00" level=debug msg="sending heartbeat to manager { } with timeout 5s" method="(*session).heartbeat" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3 session.id=klp26vyhjkr93cx9v34krj33w sessionID=klp26vyhjkr93cx9v34krj33w
Apr 23 20:56:50 moji dockerd[1028]: time="2019-04-23T20:56:50.466309811+02:00" level=debug msg="received heartbeat from worker {[swarm-manager] rvhxsfdpr12l2uguoztrho5l3 9xweikfkwzpb63mx5piecnyh3 <nil> 10.133.97.64:2377}, expect next heartbeat in 4.968486182s" method="(*Dispatcher).Heartbeat"
Apr 23 20:56:50 moji dockerd[1028]: time="2019-04-23T20:56:50.466487732+02:00" level=debug msg="heartbeat successful to manager { }, next heartbeat period: 4.968486182s" method="(*session).heartbeat" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3 session.id=klp26vyhjkr93cx9v34krj33w sessionID=klp26vyhjkr93cx9v34krj33w
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.815815901+02:00" level=debug msg="heartbeat expiration for worker 9xweikfkwzpb63mx5piecnyh3, setting worker status to NodeStatus_DOWN "
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.815942078+02:00" level=debug msg="added node 9xweikfkwzpb63mx5piecnyh3 to down nodes list" method="(*Dispatcher).markNodeNotReady" module=dispatcher node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.815962023+02:00" level=debug msg="deleted node 9xweikfkwzpb63mx5piecnyh3 from node store" method="(*Dispatcher).markNodeNotReady" module=dispatcher node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816036561+02:00" level=debug msg="sending heartbeat to manager { } with timeout 5s" method="(*session).heartbeat" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3 session.id=klp26vyhjkr93cx9v34krj33w sessionID=klp26vyhjkr93cx9v34krj33w
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816294902+02:00" level=debug msg="received heartbeat from worker {[swarm-manager] rvhxsfdpr12l2uguoztrho5l3 9xweikfkwzpb63mx5piecnyh3 <nil> 10.133.97.64:2377}, expect next heartbeat in 0s" method="(*Dispatcher).Heartbeat"
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816316856+02:00" level=debug msg="error handling rpc" error="rpc error: code = NotFound desc = node not registered" rpc=/docker.swarmkit.v1.Dispatcher/Heartbeat
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816394605+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = NotFound desc = node not registered" method="(*session).heartbeat" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3 session.id=klp26vyhjkr93cx9v34krj33w sessionID=klp26vyhjkr93cx9v34krj33w
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816422099+02:00" level=error msg="agent: session failed" backoff=100ms error="node not registered" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816442543+02:00" level=debug msg="agent: rebuild session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816457084+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816480758+02:00" level=info msg="waiting 11.350442ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816574260+02:00" level=debug msg="error handling streaming rpc" error="context canceled" rpc=/docker.swarmkit.v1.Dispatcher/Assignments
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816611839+02:00" level=debug msg="error handling streaming rpc" error="context canceled" rpc=/docker.swarmkit.v1.Dispatcher/Session
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.816643688+02:00" level=debug msg="error handling streaming rpc" error="context canceled" rpc=/docker.swarmkit.v1.LogBroker/ListenSubscriptions
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.832545014+02:00" level=debug msg="node status updated" method="(*Dispatcher).processUpdates" module=dispatcher node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 20:57:08 moji dockerd[1028]: time="2019-04-23T20:57:08.837151840+02:00" level=debug msg="(*session).start" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 20:57:09 moji dockerd[1028]: time="2019-04-23T20:57:09.033777314+02:00" level=info msg="worker 9xweikfkwzpb63mx5piecnyh3 was successfully registered" method="(*Dispatcher).register"

Hi, have you tried this yet?


I am getting the same heartbeat issue as well.
I have a setup with 3 managers (in different geolocations, on different servers/networks) and 1 worker.
One of my managers, which resides in the same network as the worker, plays the role of a worker too. That manager sometimes panics and starts a leader election on its own (most probably because of a network reliability issue). I have just set the following to stop it from panicking randomly and restarting the swarm stack by itself. I'll observe it for a week and post an update:
docker swarm update --dispatcher-heartbeat 120s
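
In case it helps with verifying the change: on a manager node, docker info prints the current value in its Swarm section (at least on the versions I have checked), e.g.

docker info | grep -A 1 Dispatcher
 Dispatcher:
  Heartbeat Period: 5 seconds

so after the update above it should show the new 120-second value.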

Thank you for your reply, Faiaz.

The problem has been gone for about two months here. It stopped happening without any changes on my side, so I think it was indeed related to network connectivity issues. Good to know there is an option to change the heartbeat timeout; if I run into the problem again, I'll try tweaking that setting. It would be nice if Docker Swarm became more resilient to (simple) network connectivity issues over time.