Docker Community Forums


Docker Swarm connection errors during VMware snapshots


(Grimm514) #1

Hi,

We have a Docker Swarm with three manager nodes and a few worker nodes running on RHEL 7 in VMware ESX VMs.
Because of some SLAs, we have to back up all VMs with ESX snapshots.
When ESX takes a snapshot of a swarm node, the node freezes for 1-3 seconds. Unfortunately, Docker Swarm then restarts the services running on this node, and if the node is also a manager and the swarm leader, the swarm managers start a new leader election.

The logs are full of messages like these:
[…]
level=info msg="2017/02/28 02:02:43 [INFO] memberlist: Marking manager2-45b39f968a59 as failed, suspect timeout reached\n"
level=info msg="2017/02/28 20:06:49 [INFO] memberlist: Suspect node4-460911c3b833 has failed, no acks received\n"
level=info msg="2017/02/28 20:06:52 [INFO] memberlist: Suspect node4-460911c3b833 has failed, no acks received\n"
level=info msg="2017/02/28 20:09:19 [INFO] memberlist: Suspect manager1-7bb29d4fc616 has failed, no acks received\n"
level=info msg="2017/02/28 20:10:46 [INFO] memberlist: Suspect node3-72f61ad66c1e has failed, no acks received\n"
level=info msg="2017/02/28 20:10:47 [INFO] memberlist: Marking node3-72f61ad66c1e as failed, suspect timeout reached\n"
level=info msg="2017/02/28 20:12:42 [INFO] memberlist: Suspect manager1-7bb29d4fc616 has failed, no acks received\n"
level=info msg="2017/02/28 20:12:47 [INFO] memberlist: Marking manager1-7bb29d4fc616 as failed, suspect timeout reached\n"
level=error msg="agent: session failed" error="rpc error: code = 5 desc = node not registered" module=agent
level=warning msg="2017/02/28 20:15:49 [WARN] memberlist: Refuting a suspect message (from: manager0-886908d6bcc5)\n"
level=warning msg="2017/02/28 20:18:18 [WARN] memberlist: Refuting a suspect message (from: node3-72f61ad66c1e)\n"
level=error msg="agent: session failed" error="rpc error: code = 5 desc = node not registered" module=agent
level=warning msg="2017/02/28 20:18:25 [WARN] memberlist: Refuting a suspect message (from: manager00-886908d6bcc5)\n"
[…]
level=info msg="6e77433fb955f74c is starting a new election at term 32"
level=info msg="6e77433fb955f74c became candidate at term 33"
level=info msg="6e77433fb955f74c received vote from 6e77433fb955f74c at term 33"
level=info msg="6e77433fb955f74c [logterm: 31, index: 617] sent vote request to 658b4650226314b7 at term 33"
level=info msg="6e77433fb955f74c [logterm: 31, index: 617] sent vote request to 4354d2dcd864722a at term 33"
level=info msg="6e77433fb955f74c received vote from 658b4650226314b7 at term 33"
level=info msg="6e77433fb955f74c [quorum:2] has received 2 votes and 0 vote rejections"
level=info msg="6e77433fb955f74c became leader at term 33"
[…]
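
In case it helps anyone compare these messages against their snapshot schedule, here is a rough sketch that tallies the memberlist "Suspect" and "Marking ... as failed" lines per node. The function name `tally_memberlist_events` and the regex are my own assumptions based only on the two message shapes in the excerpt above; other daemon versions may word these lines differently.

```python
import re
from collections import Counter

# Assumption: only the two message shapes from the excerpt above are matched.
PATTERN = re.compile(
    r"memberlist: (?:(Suspect) (\S+) has failed|(Marking) (\S+) as failed)"
)


def tally_memberlist_events(lines):
    """Count 'suspect' and 'failed' memberlist events per node name."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a memberlist suspect/failed line
        if match.group(1):
            counts[(match.group(2), "suspect")] += 1
        else:
            counts[(match.group(4), "failed")] += 1
    return counts


# A few lines copied from the log excerpt (raw strings keep the literal \n).
sample = [
    r'level=info msg="2017/02/28 20:09:19 [INFO] memberlist: Suspect manager1-7bb29d4fc616 has failed, no acks received\n"',
    r'level=info msg="2017/02/28 20:10:46 [INFO] memberlist: Suspect node3-72f61ad66c1e has failed, no acks received\n"',
    r'level=info msg="2017/02/28 20:10:47 [INFO] memberlist: Marking node3-72f61ad66c1e as failed, suspect timeout reached\n"',
    r'level=info msg="2017/02/28 20:12:42 [INFO] memberlist: Suspect manager1-7bb29d4fc616 has failed, no acks received\n"',
    r'level=info msg="2017/02/28 20:12:47 [INFO] memberlist: Marking manager1-7bb29d4fc616 as failed, suspect timeout reached\n"',
]

counts = tally_memberlist_events(sample)
for (node, kind), n in sorted(counts.items()):
    print(node, kind, n)
```

Feeding it the full daemon log instead of `sample` should show which nodes are flagged most often; lining those counts up with the snapshot times is left to the reader.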

We played around with --dispatcher-heartbeat but without success.

Does anybody have experience with Docker Swarm and VMware snapshots?
Is there a way to prevent services from restarting when a node freezes for 1-3 seconds?

Issue type: error - unintended behavior
OS Version/build: Red Hat Enterprise Linux 7.3, VMware ESXi 6.0.0
App version: Docker version 1.12.5, build 047e51b/1.12.5
Steps to reproduce: Build a Docker Swarm with 3 managers on VMware ESXi and take a snapshot of a VM

Any help is highly appreciated!

Thanks a lot!


(Simon Templer) #2

Did you ever find a solution or workaround for this?


(Arnydo) #3

A year and a half later and still no progress on this issue? :frowning:
I have been seeing the same thing and only just realized that it is most likely caused by the snapshots and/or migration between hosts…