Docker Swarm - Proper Shutdown/Startup Process

fusionx86 · May 21, 2018, 2:32pm

Hello,

I’m currently playing around with a trial of Docker EE 2.0. I set up a cluster with 3 managers and 2 workers, which was very simple. It was running in VMs on a laptop that I eventually needed to shut down. I shut down the workers one at a time and then the managers one at a time.

Later I booted the managers up in reverse order, but several of the supporting containers for ucp, etcd, and others were in an unhealthy or restarting status and never recovered.

In situations where an entire cluster needs to be shut down and started up again, what should the process or order be? I looked around, but other than this, I couldn’t find any info on the subject. This is an uncommon situation in a production setting, but one I’d like to know how to handle if it comes up.

trajano · May 21, 2018, 2:44pm

TL;DR: destroy the swarm and recreate it.

Done properly, it would be:

1. Stop all the services.
2. For each worker, on a manager, set its availability to drain.
3. For each worker, have it leave the swarm.
4. For each worker, on a manager, run docker node rm.
5. For each manager, leave the swarm until there are none left.

On restart just rebuild the swarm from your scripts.

Of course saying all that isn’t really practical, but it is the cleanest way.
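To make the ordering concrete, here is a minimal sketch that only builds the list of commands for the teardown steps listed above; it does not run anything. The function name `teardown_plan` and the node names are hypothetical, and it assumes the first manager in the list is the one you issue manager-side commands from. The docker commands themselves are standard CLI calls, but treat the plan as an illustration rather than a tested procedure for UCP.

```python
def teardown_plan(workers, managers):
    """Return (node_to_run_on, command) pairs in teardown order.

    Assumption: managers[0] is the manager used for cluster-wide commands.
    """
    if not managers:
        raise ValueError("at least one manager is required")
    lead = managers[0]
    plan = []

    # 1. Stop all services (this command fails harmlessly if none exist).
    plan.append((lead, "docker service rm $(docker service ls -q)"))

    # 2. Drain every worker from a manager.
    for worker in workers:
        plan.append((lead, f"docker node update --availability drain {worker}"))

    # 3. Each worker leaves the swarm; run this on the worker itself.
    for worker in workers:
        plan.append((worker, "docker swarm leave"))

    # 4. Remove the departed workers from the node list on a manager.
    #    The node must show as Down first, which can take a moment.
    for worker in workers:
        plan.append((lead, f"docker node rm {worker}"))

    # 5. Managers leave last, the lead manager at the very end.
    #    Managers need --force to leave.
    for manager in reversed(managers):
        plan.append((manager, "docker swarm leave --force"))

    return plan


if __name__ == "__main__":
    for node, command in teardown_plan(["worker1", "worker2"],
                                       ["mgr1", "mgr2", "mgr3"]):
        print(f"[{node}] {command}")
```

Printing the plan first and running the commands by hand (or over SSH) keeps the destructive part under your control.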

For me, I would rather figure out WHY those supporting containers are failing. It is most likely some health check that needs refinement. In my earlier setups, where I had to deal with proper ordering of things, I found that the volumes were not being mounted properly, so I had to develop trajano/docker-volume-plugins (managed Docker volume plugins, on GitHub) so that they work the way I expect them to, and when. In addition, I had to alter the health checks for some containers so they would be “stricter” and not presume that any missing connections could safely result in “default” behaviour.
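As an illustration of what a “stricter” health check could look like, here is a rough sketch, not the actual checks from my setup: the container reports unhealthy unless every dependency it needs accepts a TCP connection. The function names and the host/port pairs are assumptions made up for the example.

```python
import socket


def dependency_reachable(host, port, timeout=2.0):
    """Return True only if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unresolvable: treat all as unreachable.
        return False


def health_exit_code(dependencies, timeout=2.0):
    """Return 0 (healthy) only if every dependency is reachable, else 1.

    A missing connection is never treated as acceptable "default" behaviour.
    """
    for host, port in dependencies:
        if not dependency_reachable(host, port, timeout):
            return 1
    return 0
```

A health check command inside the container could then end with something like `sys.exit(health_exit_code([("db", 5432)]))`, where `db` and `5432` are placeholders for whatever the service really depends on.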

fusionx86 · May 21, 2018, 9:43pm

Thanks for the quick reply and insight @trajano, I appreciate it. I agree on finding out why there were issues after bringing the cluster back online. I looked into the logs and did a little troubleshooting, but with the number of errors and different container issues, I finally decided to just redeploy the cluster. If I run into this again, I’ll have to spend more time and figure out what is going on.

I’ll also play around with the idea of removing all nodes and rebuilding as you suggested, and take a look at the volume plugins you created. Thanks again.
