We have been facing a recurring issue with our Docker swarm cluster particularly after service updates, stack redeploys and nightly server restarts. Docker server version is 17.03.2-ce deployed on Azure Container Service. It does not happen every time, but is repetitive enough to cause concern.
We have about 30 containers(these are Spring Boot and Netflix OSS applications) hosted across a 4 node swarm(total 8 CPUs about 28GB memory). Occasionally, we find service tasks hosted on one node become unreachable from other nodes in the swarm. It is unclear why it happens - restarting the docker daemons on the manager as well as the workers usually solves the problem.
Any pointers on configuration we may be missing or similar experiences are welcome.
Thanks,
Avinash