We ran into a quite weird problem two weeks ago. One week before that, we rebuilt our three-node swarm (all managers) and installed Docker 18.09.9 (unfortunately we couldn’t install the most current version yet). The initial deployment of our workload, around 30 stacks, went perfectly fine. Most stacks have one app service and one database service.
Then, two weeks ago, a colleague had to deploy an updated image used by one of the existing stacks (we use Jenkins to deploy them). Right after the deployment stage started (a simple docker deploy command), all network connections from outside to any container running on the swarm went down. The entry point for all network traffic to the stacks is an HAProxy container running on each of the nodes. After around 10 minutes the network recovered by itself. A few minutes later, he started another deployment of the same stack, and the same thing happened again. The HAProxy containers were running the whole time and didn’t restart. As it is a production swarm, we couldn’t experiment further there.
But then we faced the same problem on our test swarm, which runs 18.09.0 (and a different kernel version; both swarms are on RHEL 7, by the way).
The bad thing is that we couldn’t find a way to reproduce it. No matter what we do, most of the time it works just fine.
A few days ago, I updated our Jenkins image on the test swarm from the command line (docker service update --image …) and all the network connections went down again. We couldn’t access any of the services from outside. The containers were still running and SSH to the node itself was fine. 10 minutes later, everything was fine again.
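Since the "around 10 minutes" is only an estimate, a small probe run from a machine outside the swarm could record exactly when connections fail and recover. What follows is a minimal sketch, not something we already have in place: the function names are my own, and the host and port passed to `probe` would be whatever HAProxy publishes (hypothetical here).

```python
import socket
import time


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def probe(host, port, interval=5.0, count=10):
    """Collect (unix_timestamp, ok) samples; host and port are placeholders."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), can_connect(host, port)))
        time.sleep(interval)
    return samples


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    'start' is the first failed sample of an outage, 'end' is the first
    successful sample after it, or None if the outage was still ongoing.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows
```

Running `outage_windows(probe(...))` while a deployment is in progress would give timestamps that can be lined up against the kernel and Docker logs on the nodes.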
The only interesting log messages we see are the following:
Oct 1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct 1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct 1 11:26:33 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct 1 11:26:33 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
Oct 1 11:26:35 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
But these also show up when the deployment worked without the network issue.
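To compare a deployment that worked against one that took the network down, these messages could be tallied per second and per firewall mark. This is a minimal sketch that assumes the syslog line format shown above; the regular expression and the function name are my own, not part of any existing tool.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# Oct 1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available
IPVS_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+IPVS:\s+"
    r"(?P<sched>\w+):\s+FWM\s+(?P<fwm>\d+)\s+0x(?P<hex>[0-9A-Fa-f]+)\s+-\s+"
    r"no destination available"
)


def count_no_destination(lines):
    """Count 'no destination available' messages per (timestamp, fwmark)."""
    counts = Counter()
    for line in lines:
        match = IPVS_RE.match(line.strip())
        if match:
            counts[(match.group("ts"), int(match.group("fwm")))] += 1
    return counts


# The five lines from the log excerpt above.
SAMPLE = [
    "Oct 1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available",
    "Oct 1 11:26:32 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available",
    "Oct 1 11:26:33 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available",
    "Oct 1 11:26:33 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available",
    "Oct 1 11:26:35 server01 kernel: IPVS: rr: FWM 4100 0x00001004 - no destination available",
]

if __name__ == "__main__":
    for (timestamp, fwmark), n in sorted(count_no_destination(SAMPLE).items()):
        print(timestamp, fwmark, n)
```

If the counts or the affected firewall marks differ clearly between a good and a bad deployment, that would at least narrow down where to look.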
Aside from these messages, there is absolutely nothing that would give a hint. No logs from HAProxy, Docker or the running applications. The network is just dead for around 10 minutes and then comes back again. Totally weird.
Any help is appreciated, as I’m quite out of ideas…