Docker Swarm loses network connectivity

Hi there, I have a really strange issue with Docker Swarm: it works as expected for days or even weeks, then something happens and the cluster starts to drop packets.

For instance, I checked the Traefik ingress log, since Traefik is the entrypoint of our service, but it doesn’t even complain about timeouts when trying to send packets to the backend. It looks like the packets either don’t leave the interface or don’t arrive at the final destination.

I started thinking about an IP conflict, because the whole stack starts losing packets: it is not completely cut off, but it lags.

I’m really open to any ideas for troubleshooting, thanks.

Are you running your Swarm cluster on the latest Linux and Docker versions? Is any vSwitch/VLAN/VPN involved?

Yeah, it runs on Ubuntu 24.04. Docker is not the latest version, but a recent one; there are some updates pending. No vSwitch/VLAN/VPN is involved, the config is straightforward.

Maybe it is worth mentioning that it runs on AWS EC2 instances. I also noticed that for some reason the main interface of these instances has an MTU of 9000, but this shouldn’t be the problem.

Try a ping with payload > 9000 between the nodes to check if your network is working correctly.

Already checked, it works as expected.

In addition, the problem is not present all the time; it appears after a certain amount of time (or rather, under certain conditions). So if it were an MTU issue, it should persist all the time, I think.

An MTU issue is tricky, as it only breaks transmission of large packets. A standard ping and an HTTP request to a whoami service may go through, while only larger requests (script, image) fail.
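To make the packet-size point concrete, here is a rough sketch of the arithmetic, not a definitive diagnostic. It assumes plain IPv4 headers without options, a VXLAN-based overlay (which is what Swarm overlay networks use), and the Linux iputils `ping` flags `-M do` (don't fragment), `-s` (payload size) and `-c` (count). The helper names and the node address `10.0.1.12` are made up for illustration.

```python
# Header sizes in bytes (IPv4 without options).
IPV4_HEADER = 20
ICMP_HEADER = 8
# Per-packet VXLAN cost: outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14.
VXLAN_OVERHEAD = 50


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def overlay_mtu(underlay_mtu: int) -> int:
    """Largest inner packet a VXLAN overlay can carry without fragmentation."""
    return underlay_mtu - VXLAN_OVERHEAD


def ping_command(host: str, mtu: int) -> list[str]:
    """Build an iputils ping command with the don't-fragment bit set."""
    payload = max_icmp_payload(mtu)
    return ["ping", "-M", "do", "-c", "3", "-s", str(payload), host]


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: max ping payload {max_icmp_payload(mtu)}, "
              f"overlay MTU {overlay_mtu(mtu)}")
    # Hypothetical node address; replace with a real peer node.
    print(" ".join(ping_command("10.0.1.12", 9000)))
```

If the ping with the full-size payload and the don't-fragment bit fails between two nodes while a smaller one succeeds, some hop on that path has a smaller MTU than the interface suggests. Running the same test from inside a container on the overlay network checks the overlay path as well.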

It’s been a couple of years since I ran Swarm on AWS, but apart from the security groups required between cluster nodes, nothing special was necessary to make it work.

When you say the cluster starts to drop packets, do you mean between cluster nodes, for outgoing traffic, or for incoming traffic?

Is your cluster running in a single region, spread across multiple availability zones? If only incoming traffic is affected, are clients experiencing it when accessing the cluster through an NLB?

An IP collision could be a thing if at least one of the subnet CIDRs overlaps with a Swarm overlay network.
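A quick way to check for such an overlap is Python's standard `ipaddress` module. This is only a sketch: the function name and all CIDR values below are hypothetical placeholders. The real ones would come from your VPC subnet configuration and from the subnets reported by `docker network inspect` for each overlay network.

```python
import ipaddress


def find_overlaps(host_cidrs, overlay_cidrs):
    """Return (host, overlay) CIDR pairs whose address ranges intersect."""
    hosts = [ipaddress.ip_network(cidr) for cidr in host_cidrs]
    overlays = [ipaddress.ip_network(cidr) for cidr in overlay_cidrs]
    return [
        (str(host), str(overlay))
        for host in hosts
        for overlay in overlays
        if host.overlaps(overlay)
    ]


if __name__ == "__main__":
    # Hypothetical values; replace with your real subnets.
    vpc_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
    overlay_subnets = ["10.0.1.0/24", "10.0.9.0/24"]
    for host, overlay in find_overlaps(vpc_subnets, overlay_subnets):
        print(f"VPC subnet {host} overlaps overlay network {overlay}")
```

An empty result means none of the listed ranges intersect. Any pair that is printed is a candidate for the kind of collision described above.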