Docker Swarm - Proper Shutdown/Startup Process

Hello,

I’m currently playing around with a trial of Docker EE 2.0. I set up a 3-manager, 2-worker cluster, which was very simple. It was running in VMs on a laptop that I eventually needed to shut down. I shut down the workers one at a time and then the managers one at a time.

Later I booted the managers back up in reverse order, but several of the supporting containers for UCP, etcd, and other components were stuck in an unhealthy or restarting state and never recovered.

In situations where an entire cluster needs to be shut down and started back up, what should the process or order be? I looked around, but other than this, I couldn’t find any info on the subject. This is an uncommon situation in a production setting, but one I’d like to know how to handle if it comes up.

TL;DR: destroy the swarm and recreate it.

The proper way would be (see the command sketch after the list):

  1. stop all the services
  2. for each worker, on a manager, set its availability to drain
  3. have each worker leave the swarm
  4. for each worker, on a manager, run docker node rm
  5. have each manager leave the swarm until none remain
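A rough sketch of those steps with the standard Docker CLI (the node names worker1 and worker2 are placeholders, and the node-level commands are assumed to be run from a manager):

```
# 1. Remove all services (we are tearing the swarm down anyway)
docker service rm $(docker service ls -q)

# 2. Drain each worker so its remaining tasks are stopped
docker node update --availability drain worker1
docker node update --availability drain worker2

# 3. On each worker host itself, leave the swarm
docker swarm leave

# 4. Back on a manager, remove the now-down worker nodes
docker node rm worker1 worker2

# 5. On each manager, leave the swarm (the last one needs --force)
docker swarm leave --force
```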

On restart just rebuild the swarm from your scripts.
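For the rebuild, a minimal sketch of what those scripts typically do (the advertise address and stack file name here are made up):

```
# On the first manager: initialize a fresh swarm
docker swarm init --advertise-addr 192.168.99.100

# Print the join commands to run on the other managers and the workers
docker swarm join-token manager
docker swarm join-token worker

# Once every node has rejoined, redeploy the stacks
docker stack deploy -c docker-stack.yml mystack
```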

Of course, doing all of that isn’t really practical, but it is the cleanest way.

Personally, I would rather figure out WHY those supporting containers are failing. It is most likely a health check that needs some refinement. In my earlier setups, where I had to deal with the proper ordering of things, I found that volumes were not being mounted properly, so I developed GitHub - trajano/docker-volume-plugins: Managed docker volume plugins so that volumes work the way, and at the time, I expect them to. In addition, I had to make the health checks for some containers “stricter” so they would not presume that a missing connection could safely result in “default” behaviour.
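As an illustration of what a “stricter” health check could look like (the service name and check endpoint below are hypothetical, not from my setup), the health flags on docker service update let a container fail its check outright instead of falling back to “default” behaviour when a dependency is missing:

```
# Hypothetical example: only report healthy when the app can actually
# reach its database, rather than assuming a missing connection is fine.
docker service update \
  --health-cmd 'curl -fsS http://localhost:8080/health/db || exit 1' \
  --health-interval 10s \
  --health-timeout 3s \
  --health-retries 3 \
  --health-start-period 30s \
  my-app
```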

Thanks for the quick reply and insight @trajano, I appreciate it. I agree on finding out why there were issues after bringing the cluster back online. I looked into the logs and did a little troubleshooting, but with the number of errors and different container issues, I finally decided to just redeploy the cluster. If I run into this again, I’ll have to spend more time and figure out what is going on.

I’ll also play around with the idea of removing all nodes and rebuilding as you suggested, and I’ll take a look at the volume plugins you created. Thanks again.