Graceful restart of swarm manager leader

I have a 3-node swarm with all managers and I’m wondering what’s the best practice for taking the leader offline for maintenance without service disruption.

For example:

$ docker node ls
ID                            HOSTNAME   STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
xxxxxxxxxxxxxxxxxxxxxxxxx *   node1      Ready     Active         Leader           20.10.8
xxxxxxxxxxxxxxxxxxxxxxxxx     node2      Ready     Active         Reachable        20.10.8
xxxxxxxxxxxxxxxxxxxxxxxxx     node3      Ready     Active         Reachable        20.10.8

^ I want to restart node1 there.

The control plane will see no outage if one of three manager nodes is unavailable: a three-manager Raft cluster keeps quorum as long as two managers remain reachable, and the remaining two will elect a new leader.
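Before draining, you can sanity-check the manager count and node reachability. This is a minimal sketch using standard `docker` CLI commands; run it on any manager node:

```shell
# Number of managers in the swarm (should be 3 here; quorum needs 2)
docker info --format '{{ .Swarm.Managers }}'

# Confirm the other managers are Reachable before taking the leader down
docker node ls
```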

Assuming all deployed containers are swarm services that use no placement constraints unique to node1, have their volumes on globally available storage (CIFS, NFSv4, Portworx), and have an additional replica running on one of the other nodes, it should be enough to drain the node:

docker node update --availability drain node1

Make sure to wait until the last container is drained before you begin your maintenance task.
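Waiting for the drain to finish can be scripted. A minimal sketch, assuming the node name `node1` from the example above:

```shell
# Poll until node1 has no more tasks in the "running" desired state.
while [ -n "$(docker node ps node1 --filter desired-state=running -q)" ]; do
  echo "waiting for tasks to drain from node1..."
  sleep 5
done
echo "node1 is drained"
```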

Once maintenance is done, set the node active again:

docker node update --availability active node1

Global services will immediately be scheduled on node1 again. Replicated services won’t be rebalanced back to node1 until they are redeployed.
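If you want replicated services rebalanced right away rather than waiting for the next deployment, forcing an update re-runs the scheduler, which may place tasks on node1 again. A sketch (`my_service` is a placeholder name; note that `--force` restarts the service’s tasks):

```shell
# Rebalance a single service
docker service update --force my_service

# Or force-redeploy every service in the swarm
for svc in $(docker service ls -q); do
  docker service update --force "$svc"
done
```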

Finally, if you have services with a desired replica count of 1 running on node1, there is no way around a service disruption while the service is redeployed to a different node. The downtime can range from a few seconds up to about 2 minutes.
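For a single-replica service that is stateless (or keeps its state on the shared storage mentioned above), one workaround is to temporarily scale it up before the drain so a second task is already running elsewhere. A sketch with the placeholder name `my_service`:

```shell
# Start a second replica on another node before draining
docker service scale my_service=2

docker node update --availability drain node1
# ... perform maintenance on node1 ...
docker node update --availability active node1

# Scale back down afterwards
docker service scale my_service=1
```

This only avoids downtime if the service can safely run two instances at once.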
