Docker overlay network issue (version 20.10.7)

We are facing Docker networking issues in our production systems. Whenever a node reboots because of an out-of-memory condition, or is rebooted by AWS because of a hardware issue, connectivity between Docker containers is lost when the node comes back up. Some of the services (running as containers) can still connect to each other, but a few cannot, because the container IPs are not resolved properly. I have gone through similar issues, and the problem was said to be fixed in version 1.12.2, but on the current 20.x version we still face these container connectivity issues.

The only way we have been able to solve this is to remove the entire cluster, restart the Docker daemon, and re-create the whole cluster. This becomes harder as the number of nodes in the cluster increases, and it also takes the application down.

Client: Docker Engine - Community
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        f0df350
 Built:             Wed Jun 2 11:56:47 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun 2 11:54:58 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Please let me know if there are any solutions for this other than removing the entire swarm and recreating it.

Just to be sure: the containers address containers of other swarm services using their service name, right?
Does this mean the DNS-name-based service discovery doesn’t resolve the IPs of the task containers?

Some applications cache DNS resolutions indefinitely (like nginx does by default) and can end up pointing to a stale IP: either that of a service VIP or, if endpoint mode dnsrr is used, that of one of the task containers.
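To check whether a client is holding on to a stale address, something like the following could be run from inside an affected container. This is only a minimal sketch: the helper names `resolve_ipv4` and `is_stale` are made up for illustration, and it assumes Python is available in the container. It only compares a cached IP against what the resolver returns right now; it does not fix anything.

```python
import socket


def resolve_ipv4(name):
    """Return the sorted IPv4 addresses the resolver currently gives for name."""
    infos = socket.getaddrinfo(
        name, 0, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (ip, port)); keep unique IPs.
    return sorted({info[4][0] for info in infos})


def is_stale(cached_ip, name):
    """True if cached_ip is no longer among the addresses returned for name."""
    return cached_ip not in resolve_ipv4(name)
```

Inside a container attached to the overlay network, `resolve_ipv4("myservice")` would return the service VIP (or the task IPs with dnsrr), and `resolve_ipv4("tasks.myservice")` the individual task IPs; `myservice` is a placeholder for your own service name. If the IP the application is actually connecting to is not in that list, the application is most likely using a cached resolution.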

Before you dissolve the swarm, you can always try removing a stack and redeploying it.

Also, please share your complete compose stack file. If you don’t use a compose stack file, please share the docker create commands used to create the networks and services involved in the situation, and consider migrating to compose stack files instead, as it makes life much easier.

Hi @meyay ,

Thanks for responding! We are using compose files for service deployments. Yes, we are using DNS-based service discovery. When a node goes down (due to a memory issue, etc.) and comes back up, these services (containers) get restarted, right? During this, Docker doesn’t resolve the service IP properly, which leads to loss of communication between different services once the node is back online.

Is there any way to overcome the DNS caching you explained? Can you elaborate on it, and is there a TTL we can set so that this cache gets cleared? What solution do we have for this stale condition?

Redeploying the stack is not a solution, as it cannot be performed in a production environment. Please let us know if there are any other best practices that can prevent this issue.

I missed out on your last response.

I’d like to see an example swarm stack to get an idea of how you are using things, so I can respond to what you actually use.