I’ve searched around, but as far as I can tell no one else has reported this issue. Basically, I’m unable to get two containers running on different nodes in a swarm to communicate with each other (they can’t even ping each other). This is true whether I use the container’s name (DNS) or the IP address assigned to the container within the overlay network.
If the containers are on the same node, there’s no problem. Everything also works if I have a single-node swarm (a single node doesn’t make sense of course, but I tried it just for testing purposes).
I was able to distill the problem down to the following steps, using ping or fping to make it easier to debug:
- First, I’m running Docker on ARM hardware. The image I’m using is aarch64/ubuntu:16.04.
- I create an overlay network with the following command:
docker network create --opt encrypted --driver overlay my-network
I got this more or less straight from the Docker documentation page that explains how to set up a swarm.
- On the manager node I run this:
docker swarm init --advertise-addr 192.168.123.5
This prints a command that I have to run on each worker node that I want to join the swarm. The output looks something like this:
docker swarm join \
--token SWMTKN-1-3xynxgr4i4e6yli6mhzsxpuxmdobmwb3wps9behfk5z1gwsygc-00ebys5uvaj7ygv2v7iec4hip \
192.168.123.5:2377
- So far, this is all pretty standard per the swarm instructions on the Docker site.
- I SSH into the worker node that I want to join the swarm and run the above command there. I get a response saying that the worker has joined the swarm.
- I confirm the state of the nodes by running the following command:
docker node ls
I see that the two nodes are up and in the Ready state. Everything looks awesome.
- I run two containers based on the aarch64/ubuntu:16.04 image. I run the following commands on the manager node:
docker service create --with-registry-auth --name first --network my-network aarch64/ubuntu:16.04 sleep 99999999999999999
docker service create --with-registry-auth --name second --network my-network aarch64/ubuntu:16.04 sleep 99999999999999999
I run the sleep command because it prevents the container from exiting immediately after it starts up.
- I confirm that the first container is running on the manager node and the second container is running on the worker node with the following commands:
docker service ps first
docker service ps second
Everything looks good at this point.
- From the manager node, I connect to the first container by running the following command:
docker exec -ti first…long_name_assigned to the container… bash
- From within this bash shell, I install fping (ping would work too of course). Then I run the following commands:
ping localhost
ping first
ping second
The first two commands work (they respond with “host is alive”). The third command says that the host is unreachable. If I use the IP address that the overlay network assigned to second instead of the container name, I see the same issue.
- I can connect to the second container on the worker node and do the following:
ping localhost
ping first
ping second
Here the second command (ping first) fails. The same thing happens if I use the IP address instead.
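In case it helps anyone reproduce this, the checks from the last two steps can be scripted roughly as below. This is only a sketch: `task_container` and `check_from` are helper names I made up for illustration, it assumes fping is already installed inside the container, and it assumes the node runs only one container whose name contains the service name.

```shell
#!/bin/sh
# Hypothetical helpers for repeating the reachability test; the function
# names are illustrative and do not come from Docker itself.

# Print the name of a local container whose name contains the given
# service name. The name filter is a substring match, so this assumes
# only one such container runs on this node.
task_container() {
    docker ps --filter "name=$1" --format '{{.Names}}' | head -n 1
}

# Run fping against each target from inside the given container and
# report the result. fping exits with 0 only when the host answered.
check_from() {
    container=$1
    shift
    for target in "$@"; do
        if docker exec "$container" fping "$target" >/dev/null 2>&1; then
            echo "$target reachable"
        else
            echo "$target UNREACHABLE"
        fi
    done
}
```

On the manager node this would be used as `check_from "$(task_container first)" localhost first second`, and on the worker node with `second` in place of the first `first`.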
As I said before, if I have only the manager node, so that all containers run on a single node, the above commands work without issue.
What am I missing here? I’ve found nothing online that indicates what the problem could be.