I have a 3-node swarm where all nodes are managers, and I’m wondering what the best practice is for taking the leader offline for maintenance without service disruption.
For example:
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
xxxxxxxxxxxxxxxxxxxxxxxxx * node1 Ready Active Leader 20.10.8
xxxxxxxxxxxxxxxxxxxxxxxxx node2 Ready Active Reachable 20.10.8
xxxxxxxxxxxxxxxxxxxxxxxxx node3 Ready Active Reachable 20.10.8
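The control plane will see no outage if one of the three nodes is unavailable: with three managers the Raft quorum is two, so the remaining two managers elect a new leader among themselves.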
Assuming all deployed containers are swarm services that use no placement constraint unique to node1, keep their volumes on globally available storage (CIFS, NFSv4, Portworx), and have an additional replica running on one of the other nodes, it should be enough to drain the node:
docker node update --availability drain node1
Make sure to wait until the last container is drained before you begin your maintenance task.
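To avoid guessing when the drain has finished, something like the following could poll the node's task list until nothing is reported as running. This is only a sketch: the function names count_running_tasks and wait_for_drain are made up for illustration, and it assumes it is run on a manager with the docker CLI available.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of docker).

# Print how many tasks on the given node are currently in the "Running" state.
count_running_tasks() {
    # grep -c exits non-zero when it counts 0 matches, hence the "|| true".
    docker node ps "$1" --format '{{.CurrentState}}' | grep -c '^Running' || true
}

# Block until the node reports no running tasks.
# $1 = node name, $2 = poll interval in seconds (default 5).
wait_for_drain() {
    node=$1
    interval=${2:-5}
    while [ "$(count_running_tasks "$node")" -gt 0 ]; do
        sleep "$interval"
    done
}

# Example usage (not executed here):
#   docker node update --availability drain node1 && wait_for_drain node1
```

The loop looks at the current state rather than the desired state, since the desired state flips to shutdown as soon as the node is drained, while the containers may still be stopping.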
Once maintenance is done, set the node active again:
docker node update --availability active node1
Global services will immediately be scheduled on node1 again. Replicated services won’t be rebalanced onto node1 until they get redeployed.
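If you don't want to wait for the next regular deployment, one way to trigger a redeployment is docker service update --force, which restarts a service's tasks even when nothing in its definition changed and lets the scheduler place them again. A minimal sketch, assuming the service names are passed in by you (rebalance_services is a made-up helper name):

```shell
#!/bin/sh
# Hypothetical helper: force a redeployment of each named service so the
# scheduler can spread its tasks across all active nodes again.
rebalance_services() {
    for svc in "$@"; do
        docker service update --force "$svc"
    done
}

# Example usage (service names are placeholders):
#   rebalance_services web api
```

Keep in mind that this restarts the tasks according to each service's update configuration, so it is best done at a time when a rolling restart is acceptable.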
Finally, if you have services with a desired replica count of 1 running on node1, there is no way around a service disruption while the service is redeployed to a different node. The downtime can be anywhere from a few seconds up to 2 minutes.
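To find out beforehand which services are affected, you could list the replicated services whose desired replica count is 1. The sketch below checks the whole cluster, not just node1, and single_replica_services is an illustrative name rather than anything docker provides:

```shell
#!/bin/sh
# Hypothetical helper: print the names of replicated services whose desired
# replica count is 1. The Replicas column looks like "running/desired".
single_replica_services() {
    docker service ls --filter mode=replicated --format '{{.Name}} {{.Replicas}}' |
        awk '{ split($2, r, "/"); if (r[2] == 1) print $1 }'
}

# Example usage:
#   single_replica_services
```

If such a service tolerates running more than one instance, temporarily scaling it up before the drain (for example with docker service scale) may avoid the gap, but whether that is safe depends on the application.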