Major problems caused by network outage

I haven’t found any other posts about this, but I would like to know if there’s a solution. I’ve been running a 4-node Docker swarm in dev continuously for some time and have run into a few problems.

  1. Firstly, if there is a network outage, something in Docker seems to go into an endless loop, creating over 100 tasks on each node and effectively overloading the node’s CPU.

  2. Secondly, when I try to reinstall Docker, I run into problems with locked files in /overlay2/ that require a reboot of the server (if I try to stop Docker, it just goes into an endless loop and never stops running).

  3. Thirdly, I had a fully working connection between node 1 and node 3 (application to MariaDB database), and after the network outage the app no longer finds the database, even with exactly the same settings I was using previously. No firewall settings were changed.

Edit: After a reinstall, the application finds the database immediately.
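
To put numbers on the task pile-up from point 1, here is a minimal sketch that counts tasks per node and state. It assumes the text comes from something like `docker service ps myservice --format '{{.Node}}|{{.CurrentState}}'` run on a manager; `myservice` is a hypothetical service name and the sample lines below are made up for illustration, not real output from my swarm.

```python
from collections import Counter


def count_tasks(ps_output: str) -> Counter:
    """Count tasks per (node, state) from lines shaped like 'node|current state'."""
    counts = Counter()
    for line in ps_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        node, _, state = line.partition("|")
        # CurrentState reads like "Running 2 hours ago"; keep only the first word.
        words = state.split()
        state_word = words[0] if words else "Unknown"
        counts[(node.strip(), state_word)] += 1
    return counts


# Made-up sample in the assumed 'node|current state' shape.
sample = """node1|Running 2 hours ago
node1|Shutdown 5 minutes ago
node1|Shutdown 4 minutes ago
node3|Failed 3 minutes ago
"""

for (node, state), n in sorted(count_tasks(sample).items()):
    print(node, state, n)
```

A large and growing count of Shutdown or Failed entries per node would suggest the scheduler is repeatedly replacing tasks rather than the containers themselves misbehaving.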

I’ve run into this same problem twice, and both times it meant completely refreshing the servers and basically starting again, which would not be an ideal solution for production.

Does anybody know how to get a Docker installation fully working again after a network outage, node overload, and locked Docker files/surplus files left behind after a docker system prune?

Hi there.

Can you define a network outage? Does that mean all nodes were isolated, or only some? How many of the 4 were managers and how many were workers?

I’d imagine that the extra tasks were the managers trying to recreate containers; however, without a quorum that wouldn’t happen…

Did the IP addresses change when the network was restored?

Are you able to reproduce the issue by taking down the network?

The entire network dropped for all the nodes, so there was no connectivity between any of them. There were 3 managers and one worker.
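
As a rough sanity check on the quorum point, here is a small sketch of the majority arithmetic that swarm managers rely on (Raft needs more than half of the managers to agree). The function names are my own, not anything from Docker.

```python
def quorum(managers: int) -> int:
    """Smallest number of managers that forms a majority."""
    return managers // 2 + 1


def has_quorum(reachable: int, managers: int) -> bool:
    """True if a manager that can reach `reachable` managers (itself included) is in a majority."""
    return reachable >= quorum(managers)


managers = 3  # this swarm: 3 managers, 1 worker

print(quorum(managers))         # majority needed out of 3
print(has_quorum(1, managers))  # a fully isolated manager only sees itself
print(has_quorum(3, managers))  # all managers reachable again
```

With 3 managers the majority is 2, so while every node was cut off, each manager could only see itself and none of them should have been able to schedule anything. That would point to the extra tasks being created after connectivity came back rather than during the outage itself, though I have not confirmed this.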

The IP addresses didn’t change. I haven’t tried to reproduce the issue by taking down the network yet.