Hi,
My single-node Docker Swarm (one manager node, no workers) is failing at random, between zero and two times a day, and each failure restarts all three containers. The dockerd daemon itself seems to keep running, but all container processes are restarted. The syslog shows an rpc error when sending a heartbeat to the manager (probably Raft related). I have checked the firewalls and all connectivity is OK. I have also tried rebooting the host and upgrading Docker (currently 18.09.5), without effect. In the syslog, I found the following events around the time the containers restarted:
Apr 23 04:43:36 moji dockerd[1028]: time="2019-04-23T04:43:36.998060034+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3 session.id=ibdf9pf8d7kqpj4sde8xe65y4 sessionID=ibdf9pf8d7kqpj4sde8xe65y4
Apr 23 04:43:40 moji dockerd[1028]: time="2019-04-23T04:43:39.656046295+02:00" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:43:40 moji dockerd[1028]: time="2019-04-23T04:43:39.656114767+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:43:40 moji dockerd[1028]: time="2019-04-23T04:43:39.656148174+02:00" level=info msg="waiting 84.528164ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:46:35 moji dockerd[1028]: time="2019-04-23T04:46:35.142610538+02:00" level=info msg="NetworkDB stats moji(c886990ac975) - netID:vyjqxsku2usz1s5afv1d8x0t4 leaving:false netPeers:1 entries:6 Queue qLen:0 netMsg/s:0"
Apr 23 04:47:03 moji dockerd[1028]: time="2019-04-23T04:46:35.143125228+02:00" level=info msg="NetworkDB stats moji(c886990ac975) - netID:t33xqei65zzgmgnmx7ihkdlyc leaving:false netPeers:1 entries:7 Queue qLen:0 netMsg/s:0"
Apr 23 04:50:36 moji dockerd[1028]: time="2019-04-23T04:50:36.806407155+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3 session.id=ibdf9pf8d7kqpj4sde8xe65y4 sessionID=ibdf9pf8d7kqpj4sde8xe65y4
Apr 23 04:50:36 moji dockerd[1028]: time="2019-04-23T04:50:36.806484088+02:00" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:36 moji dockerd[1028]: time="2019-04-23T04:50:36.806520497+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:36 moji dockerd[1028]: time="2019-04-23T04:50:36.806562415+02:00" level=info msg="waiting 2.771051ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:41.809465638+02:00" level=error msg="agent: session failed" backoff=300ms error="session initiation timed out" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:41.809522069+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:41.809547040+02:00" level=info msg="waiting 170.952142ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:46.980786802+02:00" level=error msg="agent: session failed" backoff=700ms error="session initiation timed out" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:46.980847514+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:46.980870905+02:00" level=info msg="waiting 590.898832ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: sync duration of 3.183702675s, expected less than 1s
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.507172566+02:00" level=info msg="worker 9xweikfkwzpb63mx5piecnyh3 was successfully registered" method="(*Dispatcher).register"
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.508033816+02:00" level=info msg="worker 9xweikfkwzpb63mx5piecnyh3 was successfully registered" method="(*Dispatcher).register"
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.508117640+02:00" level=info msg="worker 9xweikfkwzpb63mx5piecnyh3 was successfully registered" method="(*Dispatcher).register"
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509039632+02:00" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = InvalidArgument desc = session invalid" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509071441+02:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509090904+02:00" level=info msg="waiting 62.31046ms before registering session" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509127589+02:00" level=error msg="heartbeat to manager { } failed" error="rpc error: code = Canceled desc = context canceled" method="(*session).heartbeat" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3 session.id=aj3b9l26xygg3pg2lhr1h39ev sessionID=aj3b9l26xygg3pg2lhr1h39ev
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509156150+02:00" level=error msg="closing session after fatal error" error="rpc error: code = InvalidArgument desc = session invalid" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.509169879+02:00" level=error msg="status reporter failed to report status to agent" error="rpc error: code = InvalidArgument desc = session invalid" module=node/agent node.id=9xweikfkwzpb63mx5piecnyh3
Apr 23 04:50:49 moji dockerd[1028]: time="2019-04-23T04:50:49.618844113+02:00" level=info msg="worker 9xweikfkwzpb63mx5piecnyh3 was successfully registered" method="(*Dispatcher).register"
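On several of the lines above, the syslog timestamp is noticeably later than dockerd's own time= field (for example 04:47:03 versus 04:46:35.143), which might point to the daemon or the logging path being stalled at that moment. A minimal Python sketch to pull such lines out of a larger log, assuming both timestamps are in the same local timezone and fall on the same day (the function names are made up for illustration and are not part of Docker):

```python
import re
from datetime import datetime

# Captures the syslog prefix time (HH:MM:SS) and the time of day inside
# dockerd's embedded time="..." field. Lines without time= do not match.
LINE_RE = re.compile(
    r'^\w{3}\s+\d+\s+(\d{2}:\d{2}:\d{2})\s.*?'
    r'time="\d{4}-\d{2}-\d{2}T(\d{2}:\d{2}:\d{2})(?:\.\d+)?'
)


def write_lag_seconds(line):
    """Return syslog time minus dockerd time in whole seconds.

    Returns None when the line carries no time= field.
    Assumption: same timezone and same calendar day for both timestamps
    (a wrap around midnight is not handled).
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    syslog_t = datetime.strptime(m.group(1), "%H:%M:%S")
    docker_t = datetime.strptime(m.group(2), "%H:%M:%S")
    return int((syslog_t - docker_t).total_seconds())


def slow_lines(lines, threshold=2):
    """Return (lag, line) pairs whose lag is at least `threshold` seconds."""
    result = []
    for raw in lines:
        line = raw.rstrip("\n")
        lag = write_lag_seconds(line)
        if lag is not None and lag >= threshold:
            result.append((lag, line))
    return result
```

Run over the excerpt above, slow_lines would for instance report a lag of 28 seconds for the second NetworkDB stats line and 8 seconds for the first "session initiation timed out" line. The sub-second part of dockerd's timestamp is ignored, so lags are rounded up by at most one second.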
BTW, I have an 8-node swarm with a similar (or identical) setup that has been running without problems for months, so I am kind of lost here.
Thanks for your reply!
Kind regards,
Jarno