Swarm: Containers on different nodes unable to communicate on overlay network

I’ve searched around, but as far as I can tell, no one else has reported this issue. In short, I’m unable to get two containers running on different nodes in a swarm to communicate with each other (they can’t even ping each other). Only containers on the same node can communicate. This is true whether I use the container’s name (DNS) or the IP address assigned to the container on the overlay network.

If the containers are on the same node, there’s no problem. Everything also works if I have a single-node swarm (a single node defeats the purpose, of course, but I tried it for testing).

I was able to distill the problem down to the following reproduction steps, using ping (or fping) to make debugging easier:

  • First, I’m running Docker on ARM hardware. The image I’m using is aarch64/ubuntu:16.04.

  • I create an overlay network with the following command:
    docker network create --opt encrypted --driver overlay my-network
    I got the above straight from the Docker page that explains how to set up a swarm.
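    As a quick sanity check (my own habit, not from the docs), the network should show the overlay driver with swarm scope:

docker network ls --filter driver=overlay
docker network inspect my-network --format '{{ .Driver }} / {{ .Scope }}'
# expected output: overlay / swarm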

  • On the manager node I run this:
    docker swarm init --advertise-addr 192.168.123.5
    This command outputs the join command that I need to run on each worker node I’d like to add to the swarm.
    The output looks something like this:

docker swarm join \
--token SWMTKN-1-3xynxgr4i4e6yli6mhzsxpuxmdobmwb3wps9behfk5z1gwsygc-00ebys5uvaj7ygv2v7iec4hip \
192.168.123.5:2377
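
    Side note: if this output gets lost, the manager can print the worker join command again at any time:

docker swarm join-token worker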

  • So far, this is all pretty standard per the swarm instructions on the Docker page.

  • I SSH into the worker node and run the join command above. I get a response saying that the worker has joined the swarm.

  • I confirm the state of the nodes by running the following command:
    docker node ls
    I see that the two nodes are up and in the ready state. Everything looks awesome.
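    For a more scriptable check (the node hostname below is a placeholder), the node state can be queried directly:

docker node inspect <node-hostname> --format '{{ .Status.State }}'
# expected output: ready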

  • I run two containers based on the aarch64/ubuntu:16.04 image. I run the following commands on the manager node:
    docker service create --with-registry-auth --name first --network my-network aarch64/ubuntu:16.04 sleep 99999999999999999
    docker service create --with-registry-auth --name second --network my-network aarch64/ubuntu:16.04 sleep 99999999999999999
    I run the sleep command to keep the containers from exiting immediately after they start.
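    (Aside: GNU sleep, which this ubuntu image ships, also accepts sleep infinity, which avoids the magic number:)

docker service create --with-registry-auth --name first --network my-network aarch64/ubuntu:16.04 sleep infinity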

  • I confirm that the first container is running on the manager node and the second container is running on the worker node by running the following commands:
    docker service ps first
    docker service ps second
    Everything looks good at this point.

  • From the manager node, I attach to the first container by running the following command:
    docker exec -ti first…long_name_assigned to the container… bash
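    Since the task container gets a long generated name, I find it easier to grab its ID with a filter (this assumes only one task of the service is running on this node):

docker exec -ti $(docker ps -q --filter name=first) bash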

  • From within this bash shell, I install fping (plain ping would work too, of course). Then I run the following commands:
    fping localhost
    fping first
    fping second

The first two commands work (they respond with “… is alive”). The third command says that the host is unreachable. If I substitute the IP address that the overlay network assigned to second for the container name, I see the same issue.

  • I can connect to the second container on the worker node and do the following:
    fping localhost
    fping first
    fping second

This time the second command (fping first) fails. The same thing happens if I use the IP address instead.

As I said before, if I have a single-node swarm, meaning all containers run on one node, the above commands work without issue.
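
For what it’s worth, overlay traffic travels between nodes as VXLAN on UDP port 4789, so one way to narrow this down is to watch that port on each host while running the fping (eth0 is an assumption for the inter-node interface here):

tcpdump -nn -i eth0 udp port 4789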

What am I missing here? I’ve found nothing online that seems to indicate what could be the problem.

I have the same problem!

EC2, CentOS 7 (firewalld NOT enabled!), Docker v1.12.3 experimental

I’m trying to use the following flow:

  • create a bundle from a docker-compose file
  • deploy the bundle

Steps to reproduce

  • given this compose file:
version: '2'
services:
  one:
    image: nginx:alpine
  two:
    image: nginx:alpine
    command: "sh -c 'ping one'"
  • from a manager node, on a new swarm (v1.12.3, experimental)
  • docker-compose pull one two
  • docker-compose bundle --push-images (I have already logged in / auth’d to my private registry)
  • docker deploy --with-registry-auth
  • service “one” comes up; service “two” starts and immediately stops. The logs say:
ping: bad address 'one' 
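
One diagnostic variation I can try (my own tweak, not from any docs): swap the ping for a DNS lookup plus a long sleep, so service “two” stays alive long enough to exec into:

version: '2'
services:
  one:
    image: nginx:alpine
  two:
    image: nginx:alpine
    command: "sh -c 'nslookup one; sleep 3600'"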

any pointers?
thanks
e

One piece of information I’d like to add. According to this:
issue 1429

ping won’t work against a service name (the virtual IP doesn’t answer ICMP). Instead, use something like dig or nslookup.
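
For example, from a shell inside one of the ubuntu containers (dig and nslookup come from the dnsutils package there):

apt-get update && apt-get install -y dnsutils
nslookup second                  # queries Docker's embedded DNS at 127.0.0.11
dig +short second @127.0.0.11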

That said, I still have the problem.

It is also mentioned at the very bottom of this page:
swarm networking

Hi,

Can you try opening up UDP ports 4789 and 7946?
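
For reference, the full set of ports swarm needs between nodes (per the Docker docs) is 2377/tcp, 7946/tcp and udp, and 4789/udp. A minimal sketch with firewalld on CentOS 7 (adjust for iptables or your cloud security groups as appropriate):

firewall-cmd --permanent --add-port=2377/tcp    # cluster management
firewall-cmd --permanent --add-port=7946/tcp    # node discovery
firewall-cmd --permanent --add-port=7946/udp    # node discovery
firewall-cmd --permanent --add-port=4789/udp    # VXLAN overlay data
firewall-cmd --reload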

When I did this, I can now see that nslookup <service_name> returns an IP address.

However, I’m seeing something really odd. The IP returned by nslookup is off by one compared to what ifconfig reports inside the container, which doesn’t make sense to me. The containers I’m running still can’t communicate with each other across nodes, and I wonder if this has something to do with it. I’ll need to keep investigating.

See below:

root@ec4080d2a2ca:/usr/src/app# ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:0a:0a:0a:03
          inet addr:10.10.10.3  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::42:aff:fe0a:a03/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:13 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1038 (1.0 KB)  TX bytes:648 (648.0 B)

eth1      Link encap:Ethernet  HWaddr 02:42:ac:12:00:03
          inet addr:172.18.0.3  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe12:3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1104 errors:0 dropped:0 overruns:0 frame:0
          TX packets:987 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:5443122 (5.4 MB)  TX bytes:73088 (73.0 KB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:4096  Metric:1
          RX packets:51 errors:0 dropped:0 overruns:0 frame:0
          TX packets:51 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3838 (3.8 KB)  TX bytes:3838 (3.8 KB)

root@ec4080d2a2ca:/usr/src/app# nslookup front-end
Server:         127.0.0.11
Address:        127.0.0.11#53

Non-authoritative answer:
Name:   front-end
Address: 10.10.10.2

After more investigation, I’m seeing that the service which is trying to talk to the front-end is using IP 10.10.10.2, not 10.10.10.3. It complains that it cannot reach the host at 10.10.10.2. I wonder if this is now the root of my problem.
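
My current hypothesis (and a way to check it): in swarm mode the service name resolves to a virtual IP that is allocated separately from each task’s own IP, which would explain the off-by-one. The two can be compared directly; front-end is my service, and the container ID comes from docker ps:

docker service inspect --format '{{ json .Endpoint.VirtualIPs }}' front-end
docker inspect --format '{{ range .NetworkSettings.Networks }}{{ .IPAddress }} {{ end }}' <container-id>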

Hey,

thanks for that … reading through your suggestions, I spotted my problem!! Fortunately and unfortunately, it was an RTFM issue :frowning: !!!

I’m running in AWS, and I create all the infrastructure myself. A security group rule I failed to add, which IS in the documentation, was to allow port 7946/tcp. After adding that, it all works like a charm!!
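
For anyone else automating this, the rule can also be added via the AWS CLI; the group ID and CIDR below are placeholders for your own values (and the UDP ports from the earlier post are needed too):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 7946 --cidr 10.0.0.0/16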

thanks
e

Glad that helped. I’m finding my other issues are just with the configuration of nginx.

I think for this thread we can conclude that it’s important to check that all the ports swarm needs are open between your nodes.
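
A quick way to verify from each node is a port probe with netcat (flags here assume the OpenBSD nc; other implementations differ, and UDP probes are inherently less conclusive):

nc -zv <other-node-ip> 2377     # cluster management (TCP)
nc -zv <other-node-ip> 7946     # node discovery (TCP)
nc -zvu <other-node-ip> 7946    # node discovery (UDP)
nc -zvu <other-node-ip> 4789    # VXLAN data plane (UDP)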