Tasks can't communicate via overlay network when running on different swarm nodes

Hi everyone.

When I create an attachable overlay network on my swarm, the tasks/containers attached to it can’t talk to each other when they run on different swarm nodes. If they run on the same node, everything works fine. There is no firewall between the nodes.

I have a swarm cluster of 3 manager and 3 worker nodes.
I created an attachable overlay network on a manager like this:

docker network create --driver overlay --attachable rr_dev_net

It immediately appears in the output of docker network ls on all 3 managers. docker network inspect shows this:

[
    {
        "Name": "rr_dev_net",
        "Id": "05vvkdk3qw47v22i03bvusybc",
        "Created": "2022-11-25T10:21:03.472599272+01:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.2.0/24",
                    "Gateway": "10.0.2.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "344fcdb951d421accc3ef4fd711d924f54331e60a7fc2db0d1e4c19cdcd2c56c": {
                "Name": "rrauth_rrauth.1.yo2jfq9kjhiry8jvcucass5ea",
                "EndpointID": "5adf16977caf5d637f444f4585d78ba1834b9915015e81d2d74d8fa07e9e9a14",
                "MacAddress": "02:42:0a:00:02:1e",
                "IPv4Address": "10.0.2.30/24",
                "IPv6Address": ""
            },
            "lb-rr_dev_net": {
                "Name": "rr_dev_net-endpoint",
                "EndpointID": "52cc0c238e36fee1b042cd0e4e6f447957fef034b77bfde6eae0cac53de2d71d",
                "MacAddress": "02:42:0a:00:02:1f",
                "IPv4Address": "10.0.2.31/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4098"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "c8c7a34d8840",
                "IP": "2a01:4f8:191:600c:23:40:0:72"
            }
        ]
    }
]

For the sake of a reproducible example, I deploy the well-known nginx image with the following compose file:

---
version: '3.5'

services:
  web:
    image: nginx:alpine
    networks:
      net: null
    deploy:
      replicas: 1
      restart_policy:
        condition: any
      placement:
        constraints:
          - "node.role==worker"

networks:
  net:
    name: rr_dev_net
    attachable: true
    external: true
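
I deploy it as a stack like this (the compose file name here is just what I happen to use; the stack name matches the output below):

docker stack deploy -c docker-compose.yml nginx_demo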

This hits a node named sbsdevdsm03:

# docker stack ps nginx_demo
ID             NAME                  IMAGE          NODE          DESIRED STATE   CURRENT STATE           ERROR     PORTS
5hc6b07sfzox   nginx_demo_web.1   nginx:alpine   sbsdevdsm03   Running         Running 7 seconds ago

Inspecting the service shows the IP 10.0.2.55:

# docker service inspect nginx_demo_web | jq '.[0].Endpoint.VirtualIPs'
[
  {
    "NetworkID": "05vvkdk3qw47v22i03bvusybc",
    "Addr": "10.0.2.55/24"
  }
]

Since the network is attachable, I can SSH into sbsdevdsm03, attach a container, and try to reach 10.0.2.55:80, which works fine:

sbsdevdsm03:~# docker run -it --rm --network rr_dev_net debian /bin/bash
root@4566b06b3138:/# apt update ; apt install -y curl
...
root@4566b06b3138:/# curl --connect-timeout 20 10.0.2.55
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
root@4566b06b3138:/#

But when I do the same on another worker node, say sbsdevdsm02, it does not work:

root@sbsdevdsm02:~# docker run -it --rm --network rr_dev_net debian /bin/bash
root@b244e2633998:/# apt update ; apt install -y curl
...
root@b244e2633998:/# curl --connect-timeout 20 10.0.2.55
curl: (28) Connection timed out after 20001 milliseconds

An nmap scan of port 80 on 10.0.2.55 reports the port as open from sbsdevdsm03 but filtered from sbsdevdsm02.
Ping works from sbsdevdsm03, but not from sbsdevdsm02.
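
For reference, those checks from inside the attached debian test containers looked roughly like this (package names and flags are just what I would use on that image):

apt install -y nmap iputils-ping
nmap -p 80 10.0.2.55     # reports the port as open on sbsdevdsm03, filtered on sbsdevdsm02
ping -c 3 10.0.2.55      # gets replies on sbsdevdsm03, none on sbsdevdsm02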

What am I missing?

Is there a firewall on the nodes themselves (not between them)? I would check whether firewalld or ufw is installed on the nodes. I am not a swarm user, so the problem may well be something else, but I remember having problems (not with swarm) when I had firewalld on CentOS and ufw on Ubuntu, and the default configuration did not allow connections from outside.
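
A quick way to check on each node (just a sketch; adjust for your distro):

systemctl status firewalld    # or: firewall-cmd --state
systemctl status ufw          # or: ufw status
iptables -L -n                # look for DROP/REJECT rules in the filter table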

No, there is no firewall involved. Neither between the nodes nor on any of the nodes. The nodes are Debian Bullseye.

I don’t think I get your point about attachable. Attachable means that a plain container can be attached to the network.

The documentation says:

Usually it’s either missing kernel modules or an enabled firewall on this host, or in case of a cloud compute instance at the security group level.

Have you tried whether it makes a difference if you use the service name instead of an IP? If you inspect a task, it should return its container IP. If you inspect a service, you should get the VIP. So you could try whether that makes a difference as well.
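
Roughly along these lines, with the service name and task ID taken from your output above (inspecting a task needs to run on a manager node):

docker service inspect nginx_demo_web | jq '.[0].Endpoint.VirtualIPs'        # VIP of the service
docker inspect 5hc6b07sfzox | jq '.[0].NetworksAttachments[0].Addresses'     # container IP of the task
curl --connect-timeout 20 http://nginx_demo_web                              # from the test container, by service name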

OK, this is solved. It seems I unintentionally selected the wrong interface, which caused the Docker traffic to take an external route via IPv6.
Selecting the internal vSwitch IPv4 interface makes it work.
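
In case it helps anyone hitting the same thing, here is roughly how to check which address each node actually advertises, and how to pin the swarm to the intended IPv4 interface when (re)joining (just a sketch; the placeholders are not real values):

docker info --format '{{ .Swarm.NodeAddr }}'             # address this node advertises to the swarm
docker node inspect self --format '{{ .Status.Addr }}'   # same information via the node object (run on a manager)
docker swarm init --advertise-addr <internal-vSwitch-IPv4>
docker swarm join --advertise-addr <internal-vSwitch-IPv4> --token <worker-token> <manager-IPv4>:2377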

So: Nevermind and sorry for the noise.

I’m having exactly the same problem

Selecting the internal vSwitch IPv4 interface makes it work

What did you mean by this?

In summary, I definitely had an issue where required ports were not open across our network. Getting started with swarm mode | Docker Docs lists the following ports:

TCP port 2377 for cluster management communications
TCP and UDP port 7946 for communication among nodes
UDP port 4789 for overlay network traffic

When I initiated the swarm, I specified an IPv6 address, which was also assigned to the node, but for that IP the firewall only permitted ports 2377 and 7946 and blocked 4789/UDP. When I checked whether this port worked, I used one of the node’s IPv4 addresses, for which the firewall would have permitted packets on 4789/UDP. So I tested something different from what Docker was actually using.
To sum it up: the solution was making 100% sure that communication between ALL swarm nodes via 4789/UDP works for the real IPs that are also used for the swarm setup.
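
One way to spot-check this per pair of nodes is a throwaway UDP listener (a rough sketch; netcat flags vary between variants, and the listener only binds cleanly while the node’s kernel VXLAN endpoint isn’t already holding 4789, e.g. before the node carries overlay traffic):

nc -u -l -p 4789                                   # on node A
echo ping | nc -u -w1 <node-A-swarm-IP> 4789       # on node B; "ping" should appear on node A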

I just checked manually that every port was reachable from every node. So the problem must be elsewhere…

Thank you for your help!