Network down sometimes when deploying/updating a stack

ursweiss · October 1, 2019, 12:00pm

Hi

We ran into a quite weird problem two weeks ago. A week before that, we had rebuilt our three-node swarm (all managers) and installed 18.09.9 (unfortunately we couldn’t install the most current version yet). The initial deployment of our workload went perfectly fine. It consists of around 30 stacks, most of which have one app and one database service.

Then, two weeks ago, a colleague had to deploy an updated image used by one of the existing stacks (we use Jenkins to deploy them). Right after the deployment stage started (a simple docker deploy command), all network connections from outside to any container running on the swarm went down. The entry point for all network traffic to the stacks is a HAProxy container running on each of the nodes. After around 10 minutes it recovered by itself. A few minutes later, he started another deployment of the same stack, and the same thing happened again. As it is a production swarm, we couldn’t experiment any further there. The HAProxy containers were running the whole time and didn’t restart.

But then we faced the same problem on our test swarm running 18.09.0 (and a different kernel version, on RHEL 7 by the way).
The bad thing is that we couldn’t find a way to reproduce it. No matter what we do, most of the time it works just fine.

Some days ago, I updated our Jenkins image on the test swarm (docker service update --image …) on the command line and all the network connections went down again. We couldn’t access any of the services from outside. The containers were still running and SSH to the node itself was fine. 10 minutes later, everything was fine again.

The only interesting log messages we see are the following:

Oct  1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct  1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct  1 11:26:33 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct  1 11:26:33 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct  1 11:26:35 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available

But these also show up when a deployment works without the network issue.
Aside from these messages, there is absolutely nothing that would give a hint. No logs from HAProxy, Docker or the running applications. The network is just dead for around 10 minutes and then comes back. Totally weird.
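
For anyone comparing these kernel lines between a good and a bad deployment, here is a minimal Python sketch that tallies the "no destination available" messages per firewall mark. The regular expression is only inferred from the lines quoted above, and the function name count_no_destination is made up for illustration; it is not part of Docker or IPVS tooling.

```python
import re
from collections import Counter

# Pattern assumed from the kernel lines quoted in this thread, e.g.
# "IPVS: rr: FWM 4100 0x00001004 - no destination available"
IPVS_RE = re.compile(
    r"IPVS: (?P<sched>\w+): FWM (?P<fwm>\d+) "
    r"0x(?P<hex>[0-9A-Fa-f]+) - no destination available"
)


def count_no_destination(lines):
    """Return a Counter mapping firewall mark (int) to message count.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in lines:
        match = IPVS_RE.search(line)
        if match is None:
            continue  # not an IPVS "no destination" line
        fwm = int(match.group("fwm"))
        # In the quoted lines the decimal and hex fields hold the same
        # value (4100 == 0x1004); skip lines where they disagree.
        if fwm != int(match.group("hex"), 16):
            continue
        counts[fwm] += 1
    return counts
```

Passing it the lines of a saved kernel log (for example an open file object) gives one count per mark, which makes it easier to see whether a failing deployment logs different marks than a working one.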

Any help is appreciated, as I’m quite out of ideas…

Thank you
Urs

ursweiss · October 17, 2019, 6:07am

Mystery solved…
In the end it was our HAProxy creating an infinite loop as soon as a backend went down.
It seems that the DNS handling of Docker and/or HAProxy has changed over time. We never had this problem in the last two years with very similar configurations.
We have now prevented the loop by improving the check of the backend hosts.

racoder · January 14, 2020, 10:50am

I have the same problem with a host at a hosting provider:
sometimes all containers on the host go down or restart,

and there are similar lines in the log:
sudo journalctl -k | grep "IPVS"
Jan 10 17:27:18 speech kernel: IPVS: rr: FWM 47197 0x0000B85D - no destination available
Jan 10 17:27:23 speech kernel: IPVS: rr: FWM 47196 0x0000B85C - no destination available
Jan 13 17:44:57 speech kernel: IPVS: rr: FWM 43824 0x0000AB30 - no destination available
Jan 13 17:44:58 speech kernel: IPVS: rr: FWM 43824 0x0000AB30 - no destination available

csouzaf · July 26, 2021, 11:20am

Same problem. Strangely, at 2 AM all the connections to a swarm cluster running a database are dropped.
