Unable to access swarm services from other node IP via routing mesh

Hiya, I’m trying to connect to my services running on Docker Swarm using the IP of my master node. In the past I’ve restricted each service to run on a single node so I can access it using that node’s IP, but that kind of defeats the point of Docker Swarm.

I’d like to be able to access all my services on 192.168.2.222 (the master node) regardless of which node the service is deployed on (e.g. Pi-hole running on 192.168.2.223 but accessed via 192.168.2.222).

I found out Docker Swarm has a routing mesh enabled by default that should accomplish this, but I can’t get it to work. When I inspect my Pi-hole service, it shows that its ports are indeed published in ingress mode:

node@cluster-1:~ $ docker service inspect s5g03ic95s3l --format '{{json .Endpoint.Spec.Ports}}'
[{"Protocol":"tcp","TargetPort":53,"PublishedPort":53,"PublishMode":"ingress"},{"Protocol":"udp","TargetPort":53,"PublishedPort":53,"PublishMode":"ingress"},{"Protocol":"tcp","TargetPort":80,"PublishedPort":80,"PublishMode":"ingress"}]

The only strange thing I could find is in the output of docker network inspect ingress when checking the peers:

        "Peers": [
            {
                "Name": "ca035880af34",
                "IP": "192.168.2.222"
            },
            {
                "Name": "9c3baeb45993",
                "IP": "192.168.2.16" // should be 192.168.2.223?
            }
        ]
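(As a cross-check, the address each node advertises to the swarm can be read with something like the following; cluster-2 is just the node’s hostname here:)

docker node inspect cluster-2 --format '{{ .Status.Addr }}'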

I’ve also made sure the correct ports are open using the following commands on both nodes:

node@cluster-1:~ $ sudo iptables -A INPUT -p tcp --dport 7946 -j ACCEPT
node@cluster-1:~ $ sudo iptables -A OUTPUT -p tcp --sport 7946 -j ACCEPT
node@cluster-1:~ $ sudo iptables -A INPUT -p udp --dport 7946 -j ACCEPT
node@cluster-1:~ $ sudo iptables -A OUTPUT -p udp --sport 7946 -j ACCEPT
node@cluster-1:~ $ sudo iptables -A INPUT -p udp --dport 4789 -j ACCEPT
node@cluster-1:~ $ sudo iptables -A OUTPUT -p udp --sport 4789 -j ACCEPT

and then rebooting them.
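(Side note: rules added with iptables -A only live in memory; if the host actually filters these ports, they would need to be saved to survive a reboot. A sketch for Debian-based systems, assuming the iptables-persistent package is acceptable:)

sudo apt-get install iptables-persistent
sudo netfilter-persistent save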

If the ingress routing mesh is not working, usually something is wrong with the overlay network.

If overlay traffic is not working, these are the usual suspects (see the check sketch after the list):

  • The firewall needs the following ports to be open on all nodes:
    • Port 2377 TCP for communication with and between manager nodes
    • Port 7946 TCP/UDP for overlay network node discovery
    • Port 4789 UDP (configurable) for overlay network traffic
  • The MTU size is not identical on all nodes
    • ip addr show scope global | grep mtu
  • The nodes don’t share a low latency network connection
  • Nodes are running in VMs on VMware vSphere with NSX
    • Outgoing traffic to port 4789 UDP is silently dropped as it conflicts with VMware NSX’s communication port for VXLAN
    • Re-create the swarm with a different data-path port:
      • docker swarm init --data-path-port=7789
  • Problems with checksum offloading
    • Disable checksum offloading for the network interface (eth0 is a placeholder):
    • ethtool -K eth0 tx-checksum-ip-generic off
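A minimal sketch of the corresponding checks, to be run on every node (eth0 is a placeholder for the data-path interface):

# Listening swarm ports (2377 is only expected on manager nodes)
sudo ss -tulpn | grep -E ':(2377|7946|4789)\b'

# MTU per interface; should be identical across all nodes
ip addr show scope global | grep mtu

# Current checksum offload setting of the data-path interface
sudo ethtool -k eth0 | grep tx-checksum-ip-generic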

I believe the ports are the issue; how can I resolve this? I think there are two problems: 1. 2377 isn’t open on my second node, and 2. I think tcp6 means they only listen on IPv6 (?), and I don’t think Docker Swarm can use IPv6.

node@cluster-1:~ $ sudo netstat -tuln | grep -E ':(2377|7946|4789)\s'
tcp6       0      0 :::7946                 :::*                    LISTEN
tcp6       0      0 :::2377                 :::*                    LISTEN
udp        0      0 0.0.0.0:4789            0.0.0.0:*
udp6       0      0 :::7946                 :::*

node@cluster-2:~ $ sudo netstat -tuln | grep -E ':(2377|7946|4789)\s'
tcp6       0      0 :::7946                 :::*                    LISTEN
udp        0      0 0.0.0.0:4789            0.0.0.0:*
udp6       0      0 :::7946                 :::*

To resolve 2377 not being open, can I just run sudo iptables -A INPUT -p tcp --dport 2377 -j ACCEPT?
Never mind, the second node isn’t a manager so it’s not needed :slight_smile:

The other issues don’t seem to apply:

  • MTU is 1500 on both
  • they share a 1 Gbit connection
  • they are not running in VMs
  • checksum offloading is off, and it still doesn’t work

My bad, I need to fix that in the template response I shared: port 2377 does not affect all nodes, only manager nodes. Worker nodes do not bind port 2377.

When an AF_INET6 socket is used (as the Docker daemon does), the udp6 and tcp6 listeners bind to both the IPv6 and IPv4 stacks. Only the VXLAN port 4789 is bound to IPv4 alone.
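A quick way to confirm this on a Linux host (sketch, assuming default kernel settings):

# With net.ipv6.bindv6only = 0 (the Linux default), a tcp6/udp6 wildcard listener
# such as ":::7946" also accepts IPv4 traffic on 0.0.0.0:7946.
sysctl net.ipv6.bindv6only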

What could be the problem then? Since the ports seem to be open, and the other usual suspects don’t seem to be the issue, I don’t know how to troubleshoot this further. I checked the network too: all the containers are attached to it, the peers seem to be correct, and it is in ingress mode…

node@cluster-1:~ $ nc -vzu 192.168.2.223 4789
Connection to 192.168.2.223 4789 port [udp/*] succeeded!
node@cluster-2:~ $ nc -vzu 192.168.2.222 4789
Connection to 192.168.2.222 4789 port [udp/*] succeeded!
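A UDP “succeeded” from nc is weak evidence, since UDP has no handshake. A more telling check (sketch; eth0 is a placeholder interface) would be to watch whether VXLAN packets actually arrive on the node running the container while the published port is requested via the other node:

sudo tcpdump -ni eth0 udp port 4789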

If I had another idea, I would have shared it already.

Especially on bare-metal machines, regardless of whether the firewall is disabled or the well-known ports are opened, it should work out of the box.

We recently had a case where someone used Swarm with Windows containers, where the routing mesh indeed does not work.

Maybe share the Docker Compose file of one of the services.

Sure, here is the configuration for my (custom) monitoring website. To clarify: I can access it on http://192.168.2.222:5001/ but not on http://192.168.2.223:5001/.

version: "3.8"
services:
  master: # website, runs on master (relevant)
    image: monitoring-master:latest
    ports:
      - "5001:5001"
    deploy:
      placement:
        constraints: [node.role == manager]
  worker: # sends monitoring data via /report to the master (irrelevant)
    image: 192.168.2.222:5000/monitoring-worker:latest # registry running on master
    environment:
      MASTER_URL: "http://192.168.2.222:5001/report"
      REPORT_INTERVAL: "10"
      HOST_HOSTNAME: "{{.Node.Hostname}}"

    deploy:
      mode: global

What’s the output of:

  • docker node ls
  • docker network ls
  • docker service ls
  • docker stack ls

Did you use docker stack deploy to run the Swarm stack?
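For reference, deploying a stack straight from the CLI would look something like this (stack and file names are only examples):

docker stack deploy -c docker-compose.yml monitoring-custom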

(I have added another node since my first post)

node@cluster-1:~ $ docker node ls
ID                            HOSTNAME    STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
uo3jafwp8fwy58cemqnqb5fvm *   cluster-1   Ready     Active         Leader           27.5.1
ichdgsqxlzn8x95ivyzm57rqg     cluster-2   Ready     Active                          27.5.1
obfmv8eamlxl4ok74y6g83ebh     cluster-3   Ready     Active                          27.5.1

node@cluster-1:~ $ docker network ls
NETWORK ID     NAME                        DRIVER    SCOPE
101ea8319d7d   bridge                      bridge    local
8wyr8ocupawb   cache_proxy_network         overlay   swarm
21e63d70a1b2   docker_gwbridge             bridge    local
8dc1eac058ce   host                        host      local
zr7h7jysi1yc   ingress                     overlay   swarm
57njhx7qkmjl   monitoring-custom_default   overlay   swarm
f027d8b18a46   none                        null      local
q1tu5juaf0bw   pihole_default              overlay   swarm
b11f137df8ad   pihole_pihole-backend       bridge    local
u5p6wpw8w4b0   portainer_agent_network     overlay   swarm
0o2s1zj2ibwe   registry_default            overlay   swarm
vkwg8ezk4x88   traefik-net                 overlay   swarm

node@cluster-1:~ $ docker service ls
ID             NAME                       MODE         REPLICAS   IMAGE                                         PORTS
9w5eycv9evwe   monitoring-custom_master   replicated   1/1        monitoring-master:latest                      *:5001->5001/tcp
oyaxb6sa8nrj   monitoring-custom_worker   global       3/3        192.168.2.222:5000/monitoring-worker:latest
s5g03ic95s3l   pihole_pihole              replicated   1/1        pihole/pihole:latest                          *:53->53/tcp, *:80->80/tcp, *:53->53/udp
xddxndup5x6l   pihole_unbound             replicated   1/1        klutchell/unbound:latest                      *:5335->5335/tcp, *:5335->5335/udp
c5wmpcrv2h6t   portainer_agent            global       3/3        portainer/agent:2.11.1
4b2tjmvwiw1t   portainer_portainer        replicated   1/1        portainer/portainer-ce:2.11.1                 *:8000->8000/tcp, *:9000->9000/tcp, *:9443->9443/tcp
qim9frmfid6a   registry_registry          replicated   1/1        registry:latest                               *:5000->5000/tcp
zogv3a6v2on4   squid_squid                replicated   1/1        ubuntu/squid:latest                           *:3128->3128/tcp
jo8tvl5g8ei6   vizualizer                 replicated   1/1        ajeetraina/swarm-visualizer-armv7:latest      *:8080->8080/tcp

node@cluster-1:~ $ docker stack ls
NAME                SERVICES
monitoring-custom   2
pihole              2
portainer           2
registry            1
squid               1

Now that I look at it, the networks look messed up. But some of these, like traefik-net, don’t even show up in Portainer, so I think it might be related to that.

I use Portainer to deploy stacks. I’m not sure whether it does the same thing as docker stack deploy.

If a node is a member of a Swarm cluster, Portainer should deploy Swarm stacks. If the node is a standalone Docker host, it will deploy Docker Compose projects.
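A quick way to check which mode a node is in (sketch):

docker info --format '{{ .Swarm.LocalNodeState }}'   # "active" on swarm members, "inactive" on standalone hosts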

All containers connected to an overlay network can communicate with each other across the swarm cluster nodes, while every bridge network is local to a single host.

There is nothing in your compose file that prevents the ingress routing mesh.

Same problem here with Ubuntu 24.04 after upgrading to Docker 28.0.0. Just downgrade to Docker 27.5.1 and the routing mesh works again.
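A hedged sketch of such a downgrade on an apt-based host; the exact version strings are placeholders and depend on your distribution:

# List the Docker Engine versions available from the apt repository
apt-cache madison docker-ce

# Install a specific 27.5.1 build (replace the placeholders with a version string from the list above)
sudo apt-get install --allow-downgrades docker-ce=<27.5.1-version> docker-ce-cli=<27.5.1-version>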