Docker Swarm Mesh Networking not working as expected on Ubuntu 22.04 Server

I should preface this by saying that I’m really not an expert on Docker or Docker Swarm.

  • Issue type
    Docker Swarm Mesh Networking not working as expected.

  • OS Version/build
    Ubuntu 22.04.3 - Kernel 5.15.0-86-generic

  • App version
    24.0.6

  • Steps to reproduce

  1. 5 Docker hosts (3 managers, 2 workers) - Ubuntu 22.04 Server VMs running on the same subnet, with nothing in between that would block traffic on Swarm-relevant ports.
  2. Fresh, up-to-date VMs (relevant hosts added to each node’s /etc/hosts); install Docker per the official documentation for Ubuntu and add a non-root user to the docker group.
  3. Make sure ufw is not blocking anything, then create the Swarm per the official documentation and join the managers and workers.
  4. Create an attachable overlay network.
  5. Create an nginx service to test mesh networking:
    docker service create --name my-web --network testnet --publish published=8080,target=80 --replicas 2 nginx
  6. Try to curl http://<any node IP>:8080. This only works if I curl the specific node that a replica is running on; curling other nodes results in a connection timeout.
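Step 1 above assumes nothing blocks the Swarm ports between nodes; here is a small sketch to double-check that assumption (NODE_IP is a placeholder for one of your node addresses, and the loop just prints the nc probes to run rather than running them):

```shell
# Ports Swarm needs between nodes, per Docker's Swarm tutorial:
# 2377/tcp cluster management, 7946/tcp+udp node gossip, 4789/udp VXLAN data path
NODE_IP="${NODE_IP:-10.0.0.11}"   # placeholder; set to a real node IP
for spec in 2377/tcp 7946/tcp 7946/udp 4789/udp; do
  port="${spec%/*}"
  proto="${spec#*/}"
  # -z: just scan, -u: UDP probe, -w 2: two-second timeout
  [ "$proto" = "udp" ] && u="-u" || u=""
  echo "nc -z $u -w 2 $NODE_IP $port    # probe $spec"
done
```

Run the printed commands from every node against every other node; a hang or refusal on 4789/udp in particular would break the overlay data path.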

I’ve been bashing my head against this issue for two days now. It seems like mesh networking is not working properly. Am I wrong in thinking that I should see vxlan interfaces in the output of ip a after creating an overlay network and attaching services to it? Because there are none. I also checked with ipvsadm -L -n, but it is empty too. I’ve made sure the IPVS, overlay, and vxlan kernel modules are loaded, and I’ve tried reinitializing the Swarm.
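For what it’s worth, the kernel-module part of that check can be scripted; a minimal sketch (note that “not in lsmod” can also simply mean the feature is compiled into the kernel rather than loaded as a module):

```shell
# Report whether the modules Swarm's routing mesh relies on are loaded.
# ip_vs is the module behind ipvsadm.
for m in overlay vxlan ip_vs; do
  if lsmod 2>/dev/null | grep -q "^${m}[[:space:]]"; then
    echo "$m: loaded"
  else
    echo "$m: not in lsmod"
  fi
done
```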

Scaling services works as expected, so if I scale to 5 replicas I can obviously curl any node IP:8080. I’ve also tried inter-node/inter-container communication by running an alpine debug service and trying to curl the VirtualIP:8080 from it - that doesn’t work. I’ve verified DNS functionality in the Swarm, which works as expected: running nslookup my-web gives me the address of one of the containers.
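For reference, a sketch of that debug-service test with the names from this thread (“dbg” is a hypothetical service name). One thing to note: from inside the overlay network a service is reached on its target port (80 here), not the published port, so the VIP test should use port 80:

```shell
# Throwaway debug task on the same overlay network (sketch, not verified here)
docker service create --name dbg --network testnet alpine sleep 1d
# On whichever node runs the dbg task, enter its container:
docker exec -it <dbg container ID> sh
#   nslookup my-web            # service discovery via Swarm DNS
#   wget -qO- http://my-web    # request via the service VIP on target port 80
```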

I’ve checked the nftables and iptables configuration to see if anything looks out of order, but seeing that I haven’t modified anything there, I can’t see why it would be broken. Checking the Docker service logs on one of the managers, I could see this: ...level=info msg="initialized VXLAN UDP port to 4789 "... which leads me to believe that VXLAN interfaces are created as expected.

One thing I haven’t tried yet is to set everything up on an older Ubuntu liveserver version (20.04).

I’m starting to think that I’m missing something extremely obvious. Or am I crazy and just interpreting Docker’s documentation wrong, in that you should be able to access a service published on port 8080 (target 80) via any node IP:8080?

Super thankful for any help!

Hi, exact same issue here!
For the moment, I have no idea what the problem might be!
On another cluster on Ubuntu 20.04.5 / kernel 5.4.0-149-generic, it’s working like a charm… Hoping it’s not an OS/kernel issue…
Still investigating.
Thanks for any guesses.

You can never be sure about that. For instance, if you use ESXi with NSX, then traffic on port 4789 will not reach any of the VMs.

That expectation is correct, as long as the service task containers actually bind a service on port 80.

Have you checked it like this?

network_name=testnet
# The overlay's load balancer lives in its own network namespace,
# named lb_ plus the first 9 characters of the network ID.
short_id=$(docker network ls --format '{{ slice .ID 0 9 }}' --filter name=${network_name})
# Run ipvsadm inside that namespace to inspect the mesh's IPVS rules.
sudo nsenter --net=/var/run/docker/netns/lb_${short_id} ipvsadm -l -n

Thank you for your reply, Metin! These VMs run on VMware vSphere on a VxRail. Could it be an issue similar to the ESXi one?

Regarding your last reply: I tried running the commands, which gave me this:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  262 rr
  -> 10.0.2.8:0                   Masq    1      0          0
FWM  263 rr
  -> 10.0.2.3:0                   Masq    1      0          0
  -> 10.0.2.4:0                   Masq    1      0          0
  -> 10.0.2.10:0                  Masq    1      0          0
  -> 10.0.2.11:0                  Masq    1      0          0
  -> 10.0.2.12:0                  Masq    1      0          0

I’m running the nginx service on published port 8080 and target port 80.

I am not strong when it comes to the VMware product range, but isn’t ESXi the hypervisor used in their offerings?

The output of ipvsadm only shows the FWM groups (firewall marks) and lists the IPs of the target containers.

You might want to take a look at this topic:

Hi Metin,

It was totally the data-path-port!! I changed it to 7789 and mesh networking is now working as expected!
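For anyone finding this thread later: the fix means recreating the Swarm, since the data path port can only be set at swarm init and cannot be changed afterwards. A rough sketch with a placeholder manager address:

```shell
# On every node: leave the old Swarm (this destroys the Swarm state!)
docker swarm leave --force
# On the first manager: re-init with a custom VXLAN data path port
docker swarm init --data-path-port 7789 --advertise-addr 10.0.0.11  # example IP
# Then re-join the other managers/workers with the printed join tokens,
# and allow 7789/udp instead of 4789/udp in any firewalls on the path.
```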

Thank you so much for your help,

/David


Thanks @davyd92. I had the same issue, and using a customized data-path-port fixed it as well.

However I went a bit further to investigate.

I created 2 clusters: cluster A with the default data-path-port, and cluster B with a customized one. Then I found out that cluster A is using ipv6:7946 and cluster B is using ipv4:7946. Maybe it’s a bug in Docker Swarm? Note that 7946 is not configurable. I’m wondering why a different data-path-port affects 7946. @meyay

Here are the outputs from cluster A:

$ netstat -tpln |grep 7946

tcp6 0 0 :::7946 :::* LISTEN

$ sudo netstat -anu

udp6 0 0 :::7946 :::*
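The same check can be done with ss, which replaces netstat on recent Ubuntu; a small sketch that shows whether the gossip port 7946 is bound on an IPv4 or IPv6 socket:

```shell
# Listeners on the Swarm gossip port 7946 (tcp and udp)
tcp_7946=$(ss -tln 2>/dev/null | grep 7946 || true)
udp_7946=$(ss -uln 2>/dev/null | grep 7946 || true)
printf 'tcp 7946: %s\n' "${tcp_7946:-none}"
printf 'udp 7946: %s\n' "${udp_7946:-none}"
```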