Docker Swarm connection errors during VMware snapshots

Hi,

We have a Docker Swarm with three manager nodes and a few worker nodes running on RHEL 7 on VMware ESX VMs.
Due to our SLAs, we have to back up all VMs with ESX snapshots.
When ESX takes a snapshot of a swarm node, the node freezes for 1-3 seconds. Unfortunately, Docker Swarm then restarts the services running on that node, and if the node is also a manager and the current swarm leader, the managers start a new leader election.

The logs are full of these messages:
[…]
level=info msg="2017/02/28 02:02:43 [INFO] memberlist: Marking manager2-45b39f968a59 as failed, suspect timeout reached\n"
level=info msg="2017/02/28 20:06:49 [INFO] memberlist: Suspect node4-460911c3b833 has failed, no acks received\n"
level=info msg="2017/02/28 20:06:52 [INFO] memberlist: Suspect node4-460911c3b833 has failed, no acks received\n"
level=info msg="2017/02/28 20:09:19 [INFO] memberlist: Suspect manager1-7bb29d4fc616 has failed, no acks received\n"
level=info msg="2017/02/28 20:10:46 [INFO] memberlist: Suspect node3-72f61ad66c1e has failed, no acks received\n"
level=info msg="2017/02/28 20:10:47 [INFO] memberlist: Marking node3-72f61ad66c1e as failed, suspect timeout reached\n"
level=info msg="2017/02/28 20:12:42 [INFO] memberlist: Suspect manager1-7bb29d4fc616 has failed, no acks received\n"
level=info msg="2017/02/28 20:12:47 [INFO] memberlist: Marking manager1-7bb29d4fc616 as failed, suspect timeout reached\n"
level=error msg="agent: session failed" error="rpc error: code = 5 desc = node not registered" module=agent
level=warning msg="2017/02/28 20:15:49 [WARN] memberlist: Refuting a suspect message (from: manager0-886908d6bcc5)\n"
level=warning msg="2017/02/28 20:18:18 [WARN] memberlist: Refuting a suspect message (from: node3-72f61ad66c1e)\n"
level=error msg="agent: session failed" error="rpc error: code = 5 desc = node not registered" module=agent
level=warning msg="2017/02/28 20:18:25 [WARN] memberlist: Refuting a suspect message (from: manager00-886908d6bcc5)\n"
[…]
level=info msg="6e77433fb955f74c is starting a new election at term 32"
level=info msg="6e77433fb955f74c became candidate at term 33"
level=info msg="6e77433fb955f74c received vote from 6e77433fb955f74c at term 33"
level=info msg="6e77433fb955f74c [logterm: 31, index: 617] sent vote request to 658b4650226314b7 at term 33"
level=info msg="6e77433fb955f74c [logterm: 31, index: 617] sent vote request to 4354d2dcd864722a at term 33"
level=info msg="6e77433fb955f74c received vote from 658b4650226314b7 at term 33"
level=info msg="6e77433fb955f74c [quorum:2] has received 2 votes and 0 vote rejections"
level=info msg="6e77433fb955f74c became leader at term 33"
[…]

We played around with --dispatcher-heartbeat, but without success.

Does anybody have experience with Docker Swarm and VMware Snapshots?
Is there a way to prevent the services from restarting when a node freezes for 1-3 seconds?

Issue type: error - unintentional behavior
OS Version/build: Red Hat Enterprise Linux 7.3, VMware ESXi 6.0.0
App version: Docker version 1.12.5, build 047e51b/1.12.5
Steps to reproduce: Build a Docker Swarm with 3 managers on VMware ESXi and take a snapshot of a VM

Any help is highly appreciated!

Thanks a lot!

Did you ever find a solution or workaround for this?

A year and a half later and still no progress on this issue? :frowning:
I have been seeing the same thing and just now realized that it is most likely the snapshots and/or migrating between hosts…

Hey guys,
I know this was a very long time ago.

Does anyone have a hint or a solution for how to fix this behaviour?

Thanks

Does it help to increase the --dispatcher-heartbeat duration?

docker swarm update --dispatcher-heartbeat=10s

No, I have tested several settings; it is currently set to 60s and there is no change.
Yesterday I was able to figure out why the services are restarting. The main problem was that when the node that is currently the leader restarts, all services on all nodes also restart.
Now I have reduced the number of cluster managers from 4 to 3. Then I removed the leader manually and, as expected, only the services of the removed node were restarted, not all services on all nodes. For me, this has solved the problem.
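
For reference, you can check which node is currently the leader before removing or snapshotting it. A minimal sketch (run it on a manager; the node name is a placeholder, and only the current leader prints "true" - on other nodes the lookup prints nothing or fails, which is ignored here):

#!/bin/sh

# Placeholder - replace with the name of the node you want to check.
node_name="docker-swarm-node-name"

# ManagerStatus.Leader is only reported for manager nodes; lookup errors on
# workers are discarded via 2>/dev/null.
is_leader=$(docker node inspect "$node_name" --format '{{ .ManagerStatus.Leader }}' 2>/dev/null)

if [ "$is_leader" = "true" ]
then
  echo "$node_name is the current swarm leader - demote or remove it before the snapshot"
else
  echo "$node_name is not the leader"
fi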

This is the expected behavior in ESXi when creating a storage snapshot. During the snapshot process, the virtual machine goes through the Fast Suspend Resume (FSR) process and the guest operating system is unresponsive. See → VMware Knowledge Base

This is also the normal and expected behavior for Docker Swarm, because the node is unresponsive.

If you have a schedule for the snapshots of the individual guest operating systems, you could possibly also plan the maintenance for each node.

My suggestion would be a cron job that demotes and/or drains a node before taking the snapshot,
and then a cron job that promotes and/or reactivates the node from which the snapshot was taken.

Further information from the documentation.
Drain a node on the swarm: Drain a node on the swarm | Docker Docs
Promote or demote a node: Manage nodes in a swarm | Docker Docs

For example, you could use two shell scripts like these:
before_maintenance.sh

#!/bin/sh

# Name of the swarm node that is about to be snapshotted.
node_name="docker-swarm-node-name"

# Look up the node ID, depending on whether the node is currently a manager or a worker.
node_manager_id=$(docker node ls -f "name=$node_name" -f "role=manager" -q)
node_worker_id=$(docker node ls -f "name=$node_name" -f "role=worker" -q)

echo "manager id: $node_manager_id"
echo "worker id: $node_worker_id"

# A manager is demoted first so it can no longer take part in leader elections, then drained.
if [ -n "$node_manager_id" ]
then
   docker node demote "$node_manager_id"
   docker node update --availability drain "$node_manager_id"
   echo "manager node is now in maintenance mode"
fi

# A worker only needs to be drained.
if [ -n "$node_worker_id" ]
then
   docker node update --availability drain "$node_worker_id"
   echo "worker node is now in maintenance mode"
fi

after_maintenance.sh

#!/bin/sh

# Name of the node that was snapshotted and whether it should be a manager again.
node_name="docker-swarm-node-name"
is_manager=true

node_id=$(docker node ls -f "name=$node_name" -q)

echo "node id: $node_id"
echo "node is manager: $is_manager"

if [ -n "$node_id" ]
then
  # Promote the node back to a manager if it was one before the maintenance.
  if [ "$is_manager" = "true" ]
  then
    docker node promote "$node_id"
  fi
  # Make the node eligible for service tasks again.
  docker node update --availability active "$node_id"
  echo "node is active again"
fi
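
If the ESX snapshots run on a fixed schedule, the two scripts could then be triggered from cron on one of the remaining managers. Just a sketch with placeholder times and paths that would have to match your backup window:

# /etc/crontab entries on a manager node (times and paths are placeholders).
# Drain/demote the node shortly before the nightly 02:00 snapshot ...
50 1 * * * root /usr/local/bin/before_maintenance.sh >> /var/log/swarm-maintenance.log 2>&1
# ... and promote/reactivate it once the snapshot window is over.
30 2 * * * root /usr/local/bin/after_maintenance.sh >> /var/log/swarm-maintenance.log 2>&1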

Keep in mind that the scripts must be executed on a manager, and that a manager should not execute the "demote" command on itself: a manager that has been demoted to a worker can no longer set its availability to drain and also cannot make itself a manager again.
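
One way to guard against that would be a check at the top of before_maintenance.sh that compares the local node ID with the target node ID, for example (just a sketch; it assumes a Docker version where docker info supports --format):

# Refuse to demote/drain the node this script is running on.
self_id=$(docker info --format '{{.Swarm.NodeID}}')
target_id=$(docker node ls -f "name=$node_name" -q)

if [ "$self_id" = "$target_id" ]
then
  echo "refusing to demote/drain the local node - run this script on another manager" >&2
  exit 1
fi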

As I understand it, this changes the time interval between the heartbeats that the nodes send to each other. But the server can be interrupted by the snapshot at any time. If the heartbeat duration is set to 10 seconds and the interruption occurs 9 seconds after the last heartbeat, the problem exists again, even if the probability decreases the longer the interval is.
