Issue type: Docker Swarm Overlay Network
OS Version: Ubuntu Xenial 16.04.3 LTS
App Version: Docker version 20.10.7, build f0df350
EDIT: This appears to be an ongoing bug that has already been reported. However, the quick fixes listed in the issue tracker (linked below) have not worked for me. Any other ideas?
I have created a custom overlay network to connect a Python Celery task queue across 9 of my machines, each running the same OS and Docker version, and each with the required ports open for Docker Swarm. One machine is set up as the manager, and the other 8 are workers.
To handle my task queue, I launch RabbitMQ on the manager within the overlay network I created. Afterwards, I launch services across my 8 worker nodes, 1 per node, to consume tasks placed into the queue. These workers consume tasks by connecting to RabbitMQ to get the next task in the queue.
Here is my issue: 5 of the 8 worker services on my worker nodes fail to connect to RabbitMQ on the manager node. The other 3 connect fine. It’s not a configuration issue specific to my application: telnet fails to connect to the RabbitMQ service when I try it from the command line inside the failing containers, but I can telnet into RabbitMQ without issue from the containers that are working.
All worker services are connecting to RabbitMQ by the service name, and I can confirm that they are all resolving to the correct IP address within the overlay network.
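For anyone who wants to reproduce the check described above without telnet, here is a minimal Python sketch that resolves a service name and then attempts a plain TCP connection to it. The service name rabbitmq and port 5672 (RabbitMQ's default AMQP port) are assumptions for illustration; the original setup may use different values. It only shows whether a TCP handshake completes from inside a container, not why it fails.

```python
import socket


def resolve_service(name):
    """Return the sorted IPv4 addresses the name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    service = "rabbitmq"  # hypothetical service name; replace with your own
    port = 5672           # assumed: RabbitMQ's default AMQP port

    addresses = resolve_service(service)
    print("resolved addresses:", addresses)
    print("connect via service name:", can_connect(service, port))
    for addr in addresses:
        print("connect via", addr, ":", can_connect(addr, port))
```

Running this in both a working and a failing container makes it easy to compare results side by side: if the name resolves to the same address in both but the connection only succeeds in one, the problem sits below name resolution.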
I’ve had this issue for 2 weeks now with no idea how to resolve it. I’ve tried leaving the swarm on the nodes giving me issues and rejoining, re-creating the overlay network, checking my ports to make sure nothing is being blocked, and manually launching the application by recreating it within an unrelated running container.
All services were created with the docker service create command and attached with the --network option so that they connect over the overlay network I made.
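To make the service setup concrete, here is a rough Python sketch that assembles a docker service create command line of the kind described. The service, network, and image names are hypothetical placeholders. The post does not say how the one-per-node placement was achieved, so the use of --mode global with a node.role==worker constraint is an assumption, not the original command.

```python
import shlex


def service_create_cmd(name, network, image, one_per_worker=True):
    """Build the argument list for a `docker service create` call.

    name, network, and image are placeholders supplied by the caller.
    """
    cmd = ["docker", "service", "create", "--name", name, "--network", network]
    if one_per_worker:
        # Assumption: global mode plus a role constraint runs exactly one
        # task on every worker node and none on the manager.
        cmd += ["--mode", "global", "--constraint", "node.role==worker"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmd = service_create_cmd("celery-worker", "my-overlay", "example/celery-worker:latest")
    # Print the command rather than running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere. To actually run it, the list could be passed to subprocess.run on a manager node.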
I’m at my wits’ end. Any help would be greatly appreciated.