Network down sometimes when deploying/updating a stack

Hi

We ran into a quite weird problem two weeks ago. One week before that, we rebuilt our three-node swarm (all managers) and installed Docker 18.09.9 (unfortunately we couldn’t install the most current version yet). The initial deployment of our workload went perfectly fine. We run around 30 stacks, most of which have one app and one database service.

Then, two weeks ago, a colleague had to deploy an updated image used by one of the existing stacks (we use Jenkins to deploy them). Right after the deployment stage started (a simple docker deploy command), all network connections from outside to any container running on the swarm went down. The entry point for all the network traffic to the stacks is a HAProxy container running on each of the nodes. The HAProxy containers were running the whole time and didn’t restart. After around 10 minutes the network recovered by itself. A few minutes later, he started another deployment of the same stack, and the same thing happened again. As it is a production swarm, we couldn’t experiment any further there.

But then we faced the same problem on our test swarm, which runs 18.09.0 (and a different kernel version; both are on RHEL7, by the way).
The bad thing is that we couldn’t find a way to reproduce it. No matter what we do, most of the time it works just fine.

Some days ago, I updated our Jenkins image on the test swarm (docker service update --image …) on the command line and all the network connections went down again. We couldn’t access any of the services from outside. The containers were still running and SSH to the node itself was fine. 10 minutes later, everything was fine again.

The only interesting log messages we see are the following:

Oct  1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct  1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct  1 11:26:33 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct  1 11:26:33 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct  1 11:26:35 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available

But these also show up when the deployment worked without the network issue.
Aside from these messages, there is absolutely nothing that would give a hint. No logs from HAProxy, Docker or the running applications. The network is just dead for around 10 minutes and then comes back again. Totally weird.
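
For anyone comparing their own kernel logs: in these messages "rr" is the IPVS scheduler (round robin) and the two numbers after FWM are the same firewall mark printed in decimal and in hex (4100 = 0x1004). Below is a minimal sketch, not a diagnostic tool, that counts these messages per firewall mark so bursts can be lined up with deployment times. The function name count_no_destination and the sample list are made up for illustration; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   Oct  1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
IPVS_LINE = re.compile(
    r"IPVS: (?P<sched>\w+): FWM (?P<fwm>\d+) 0x(?P<hex>[0-9A-Fa-f]+)"
    r" - no destination available"
)


def count_no_destination(lines):
    """Count 'no destination available' messages per (scheduler, firewall mark)."""
    counts = Counter()
    for line in lines:
        m = IPVS_LINE.search(line)
        if m is None:
            continue  # not an IPVS 'no destination' message
        fwm = int(m.group("fwm"))
        # The mark is printed twice (decimal and hex); ignore garbled lines
        # where the two values disagree.
        if fwm != int(m.group("hex"), 16):
            continue
        counts[(m.group("sched"), fwm)] += 1
    return counts


# Sample input: lines copied from the log excerpt in this post.
sample = [
    "Oct  1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available",
    "Oct  1 11:26:33 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available",
    "Oct  1 11:26:35 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available",
]

for (sched, fwm), n in sorted(count_no_destination(sample).items()):
    print(f"{sched} fwmark={fwm} (0x{fwm:08X}): {n} message(s)")
```

On a real node the input would be the lines of the kernel log (for example the output of journalctl -k) instead of the hard-coded sample.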

Any help is appreciated, as I’m quite out of ideas…

Thank you
Urs


Mystery solved…
In the end it was our HAProxy creating an infinite loop as soon as a backend went down.
It seems that the DNS handling of Docker and/or HAProxy has changed over time. We never had this problem within the last two years with very similar configurations.
We have now prevented the loop by improving the checks of the backend hosts.

I have the same problem with a host at a hosting provider:
sometimes all containers on the host go down or restart.

The kernel log shows similar messages:
sudo journalctl -k | grep "IPVS"
Jan 10 17:27:18 speech kernel: IPVS: rr: FWM 47197 0x0000B85D - no destination available
Jan 10 17:27:23 speech kernel: IPVS: rr: FWM 47196 0x0000B85C - no destination available
Jan 13 17:44:57 speech kernel: IPVS: rr: FWM 43824 0x0000AB30 - no destination available
Jan 13 17:44:58 speech kernel: IPVS: rr: FWM 43824 0x0000AB30 - no destination available

Same problem here. Strangely, at 2 AM all the connections to a swarm cluster running a database are dropped.