Docker Swarm: Service Resolution Fails Between Ubuntu and CentOS Nodes

I have an existing, working Swarm cluster with three nodes, and I want to add a new one.

The three existing nodes are running CentOS 7, and the new one is running Ubuntu 22.04.5 LTS. All four servers are on the same network. The existing nodes have Docker versions 25.0.0, 25.0.1, and 26.1.3. To rule out potential version conflicts, I installed Docker 26.1.3 on the new node.

The issue arises when a service running on the new node tries to access a service on one of the older nodes; it fails to resolve the service name.

For example, given this docker-compose.yml:

version: "3.8"

x-logging: &default-logging
  options:
    max-size: "10m"
    max-file: "5"
  driver: json-file

networks:
  traefik:
    external: true
  redis:
    external: true

secrets:
  stack-secrets:
    file: ${STACK_SECRETS_FILE:?STACK_SECRETS_FILE}
    name: stack-secrets_${STACK_SECRETS_MD5:?STACK_SECRETS_MD5}

services:
  redis:
    image: redis:${REDIS_VERSION:-latest} # 8
    command:
      - "redis-server"
      - "--save 300 1"
      - "--save 60 10"
      - "--appendonly yes"
    volumes:
      - redis-data:/data
    logging: *default-logging
    networks:
      - default
      - redis
    deploy:
      labels:
        - "swarmpit.service.deployment.autoredeploy=false"
      replicas: 1
      mode: replicated
      placement:
        constraints:
          - node.labels.safevolumes == true
          - node.platform.os == linux
          - node.labels.verifiednode == true
  api:
    image: ${DOCKER_REGISTRY-}/${API_NAME}:${VERSION:?}
    logging: *default-logging
    networks:
      - default
      - traefik
    deploy:
      placement:
        constraints:
          - node.platform.os == linux
          - node.labels.verifiednode == true
      replicas: ${API_REPLICAS:-1}
      mode: replicated
      restart_policy:
        condition: on-failure
  sync-worker:
    image: ${DOCKER_REGISTRY-}dopplerdock/${SERVICE_NAME}:${VERSION:?}
    environment:
      KeyFieldValueStorageSettings__ConnectionString: ${REDIS_CONNECTION_STRING:-redis}
    logging: *default-logging
    depends_on:
      - redis
    deploy:
      placement:
        constraints:
          - node.platform.os == linux
          - node.labels.verifiednode == true
      replicas: ${SERVICE_REPLICAS:-1}
      mode: replicated
      restart_policy:
        condition: on-failure
    secrets:
      - source: stack-secrets
        target: appsettings.Secret.json

volumes:
  redis-data:

The Redis service is always deployed on one of the older nodes, which handles all the volumes. When I deploy the sync-worker service, it works correctly if placed on an older node and can connect to Redis using the connection string "redis:6379,abortConnect=false". However, if the sync-worker service is deployed on the new Ubuntu node, it fails to resolve the ‘redis’ hostname.
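For completeness, this is roughly how the failure can be observed (a sketch: `<container-id>` is a placeholder for the sync-worker task's ID on the new node, and 127.0.0.11 is Docker's embedded DNS resolver inside every container):

```shell
# On the new Ubuntu node, find the failing sync-worker task
docker ps --filter name=sync-worker            # note the container ID

# Resolve the service name via normal lookup (fails on the new node)
docker exec -it <container-id> getent hosts redis

# Query Docker's embedded DNS server directly
docker exec -it <container-id> nslookup redis 127.0.0.11

# Resolve the individual task IPs, bypassing the service VIP
docker exec -it <container-id> getent hosts tasks.redis
```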

This isn’t just a Redis issue; I’m also seeing strange behavior with Swarmpit and Portainer. It seems like the Swarm cluster recognizes the new node, but the overlay network communication is failing partially.

Of course, all the necessary ports for Docker Swarm are open between the nodes. I have tried many different approaches so far with no success.
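To verify that, I probed the swarm ports from each node with a small bash helper (node IPs below are placeholders; this only covers the TCP side — 7946/udp and 4789/udp still need a separate check, e.g. with tcpdump on the receiving node):

```shell
#!/usr/bin/env bash
# Quick TCP reachability probe using bash's /dev/tcp (no nc required).
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed/filtered"
  fi
}

for node in 172.25.20.8 172.25.20.218; do   # placeholder node IPs
  check_port "$node" 2377   # cluster management (manager nodes)
  check_port "$node" 7946   # node discovery (TCP side; also uses UDP)
done
```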

Any help would be greatly appreciated.

Any VLAN involved? MTU set correctly? Try a ping with a payload larger than 2,000 bytes.


Thanks @bluepuma77

Each swarm node is in a different VM, all connected by a primary and secondary VLAN.
All nodes “see” each other (ping and telnet tests), and we’ve run connectivity tests on both sides.
MTU is set to 1500 in every node.

eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:b7:31:ef brd ff:ff:ff:ff:ff:ff
inet 172.25.20.218/22 brd 172.25.23.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:feb7:31ef/64 scope link
valid_lft forever preferred_lft forever

eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:b7:43:56 brd ff:ff:ff:ff:ff:ff
inet 10.139.15.75/27 brd 10.139.15.95 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:feb7:4356/64 scope link
valid_lft forever preferred_lft forever

Ping from the new node to another node:

ping -s 2500 172.25.20.8
PING 172.25.20.8 (172.25.20.8) 2500(2528) bytes of data.
2508 bytes from 172.25.20.8: icmp_seq=1 ttl=64 time=0.262 ms
2508 bytes from 172.25.20.8: icmp_seq=2 ttl=64 time=0.270 ms
2508 bytes from 172.25.20.8: icmp_seq=3 ttl=64 time=0.262 ms
2508 bytes from 172.25.20.8: icmp_seq=4 ttl=64 time=0.218 ms
^C
--- 172.25.20.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3054ms
rtt min/avg/max/mdev = 0.218/0.253/0.270/0.020 ms

Ping inside a testing container:

ping -s 2500 redis
PING redis (10.0.25.10): 2500 data bytes
2508 bytes from 10.0.25.10: seq=0 ttl=64 time=0.125 ms
2508 bytes from 10.0.25.10: seq=1 ttl=64 time=0.141 ms
2508 bytes from 10.0.25.10: seq=2 ttl=64 time=0.144 ms
2508 bytes from 10.0.25.10: seq=3 ttl=64 time=0.129 ms
2508 bytes from 10.0.25.10: seq=4 ttl=64 time=0.137 ms
^C
--- redis ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.125/0.135/0.144 ms

If the overlay traffic is not working, these are usually the suspects:

  • The firewall needs the following ports to be open on all nodes:
    • Port 2377 TCP for communication with and between manager nodes
    • Port 7946 TCP/UDP for overlay network node discovery
    • Port 4789 UDP (configurable) for overlay network traffic
  • The MTU size is not identical on all nodes
    • ip addr show scope global | grep mtu
  • The nodes don’t share a low-latency network connection
  • Nodes are running in VMs on VMware vSphere with NSX
    • Outgoing traffic to port 4789 UDP is silently dropped, as it conflicts with VMware NSX’s communication port for VXLAN
    • Re-create the swarm with a different data-path port:
      • docker swarm init --data-path-port=7789
  • Problems with checksum offloading
    • Disable checksum offloading for the network interface (eth0 is a placeholder):
    • ethtool -K eth0 tx-checksum-ip-generic off
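Note that the ethtool change does not survive a reboot. One way to make it persistent on a systemd-based host is a small oneshot unit (a sketch — the unit name is my own, eth0 is still a placeholder, and the ethtool path may differ per distro):

```ini
# /etc/systemd/system/disable-tx-offload.service
[Unit]
Description=Disable TX checksum offloading for Swarm overlay traffic
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K eth0 tx-checksum-ip-generic off

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now disable-tx-offload.service.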

Hello @meyay, thanks for your response!

It was indeed the checksum offloading; the issues disappeared after I disabled it.

Thanks a lot to both of you, @bluepuma77 and @meyay. I have been struggling with this for some time.

I will be doing some further testing, but everything seems to be working just fine for now.

Best regards


Odd. I recall @bretfisher talking about an experiment with Docker Swarm running WordPress and a database cluster over many nodes across the Internet. IIRC, the result was that it was slow, but it actually worked.

That’s surprising. Raft, the consensus algorithm used by Swarm, does not work reliably without a low-latency network. It would be great if you had a pointer to where that information can be found, so I could add whatever settings he configured to make it work on high-latency networks to the post template.
