Docker Swarm: Service Resolution Fails Between Ubuntu and CentOS Nodes

I have an existing, working Swarm cluster with three nodes, and I want to add a new one.

The three existing nodes are running CentOS 7, and the new one is running Ubuntu 22.04.5 LTS. All four servers are on the same network. The existing nodes have Docker versions 25.0.0, 25.0.1, and 26.1.3. To rule out potential version conflicts, I installed Docker 26.1.3 on the new node.

The issue arises when a service running on the new node tries to access a service on one of the older nodes; it fails to resolve the service name.

For example, given this docker-compose.yml:

version: "3.8"

x-logging: &default-logging
  options:
    max-size: "10m"
    max-file: "5"
  driver: json-file

networks:
  traefik:
    external: true
  redis:
    external: true

secrets:
  stack-secrets:
    file: ${STACK_SECRETS_FILE:?STACK_SECRETS_FILE}
    name: stack-secrets_${STACK_SECRETS_MD5:?STACK_SECRETS_MD5}

services:
  redis:
    image: redis:${REDIS_VERSION:-latest} # 8
    command:
      - "redis-server"
      - "--save 300 1"
      - "--save 60 10"
      - "--appendonly yes"
    volumes:
      - redis-data:/data
    logging: *default-logging
    networks:
      - default
      - redis
    deploy:
      labels:
        - "swarmpit.service.deployment.autoredeploy=false"
      replicas: 1
      mode: replicated
      placement:
        constraints:
          - node.labels.safevolumes == true
          - node.platform.os == linux
          - node.labels.verifiednode == true
  api:
    image: ${DOCKER_REGISTRY-}/${API_NAME}:${VERSION:?}
    logging: *default-logging
    networks:
      - default
      - traefik
    deploy:
      placement:
        constraints:
          - node.platform.os == linux
          - node.labels.verifiednode == true
      replicas: ${API_REPLICAS:-1}
      mode: replicated
      restart_policy:
        condition: on-failure
  sync-worker:
    image: ${DOCKER_REGISTRY-}dopplerdock/${SERVICE_NAME}:${VERSION:?}
    environment:
      KeyFieldValueStorageSettings__ConnectionString: ${REDIS_CONNECTION_STRING:-redis}
    logging: *default-logging
    depends_on:
      - redis
    deploy:
      placement:
        constraints:
          - node.platform.os == linux
          - node.labels.verifiednode == true
      replicas: ${SERVICE_REPLICAS:-1}
      mode: replicated
      restart_policy:
        condition: on-failure
    secrets:
      - source: stack-secrets
        target: appsettings.Secret.json

volumes:
  redis-data:

The Redis service is always deployed on one of the older nodes, which handles all the volumes. When I deploy the sync-worker service, it works correctly if placed on an older node and can connect to Redis using the connection string "redis:6379,abortConnect=false". However, if the sync-worker service is deployed on the new Ubuntu node, it fails to resolve the ‘redis’ hostname.
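For completeness, this is roughly how the failure can be observed (a sketch: `<container-id>` is a placeholder for the sync-worker task's ID on the new node, and 127.0.0.11 is Docker's embedded DNS resolver inside every container):

```shell
# On the new Ubuntu node, find the failing sync-worker task
docker ps --filter name=sync-worker            # note the container ID

# Resolve the service name via normal lookup (fails on the new node)
docker exec -it <container-id> getent hosts redis

# Query Docker's embedded DNS server directly
docker exec -it <container-id> nslookup redis 127.0.0.11

# Resolve the individual task IPs, bypassing the service VIP
docker exec -it <container-id> getent hosts tasks.redis
```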

This isn’t just a Redis issue; I’m also seeing strange behavior with Swarmpit and Portainer. It seems like the Swarm cluster recognizes the new node, but the overlay network communication is failing partially.

Of course, all the necessary ports for Docker Swarm are open between the nodes. I have tried many different approaches so far with no success.
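To verify that, I probed the swarm ports from each node with a small bash helper (node IPs below are placeholders; this only covers the TCP side — 7946/udp and 4789/udp still need a separate check, e.g. with tcpdump on the receiving node):

```shell
#!/usr/bin/env bash
# Quick TCP reachability probe using bash's /dev/tcp (no nc required).
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed/filtered"
  fi
}

for node in 172.25.20.8 172.25.20.218; do   # placeholder node IPs
  check_port "$node" 2377   # cluster management (manager nodes)
  check_port "$node" 7946   # node discovery (TCP side; also uses UDP)
done
```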

Any help would be greatly appreciated.

Any VLAN involved? MTU set correctly? Try a ping with a payload larger than 2,000 bytes.


Thanks @bluepuma77

Each swarm node is in a different VM, all connected by a primary and secondary VLAN.
All nodes “see” each other (ping and telnet tests), and we’ve run connectivity tests on both sides.
MTU is set to 1500 in every node.

eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:b7:31:ef brd ff:ff:ff:ff:ff:ff
inet 172.25.20.218/22 brd 172.25.23.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:feb7:31ef/64 scope link
valid_lft forever preferred_lft forever

eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:b7:43:56 brd ff:ff:ff:ff:ff:ff
inet 10.139.15.75/27 brd 10.139.15.95 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:feb7:4356/64 scope link
valid_lft forever preferred_lft forever

Ping from the new node to another node:

ping -s 2500 172.25.20.8
PING 172.25.20.8 (172.25.20.8) 2500(2528) bytes of data.
2508 bytes from 172.25.20.8: icmp_seq=1 ttl=64 time=0.262 ms
2508 bytes from 172.25.20.8: icmp_seq=2 ttl=64 time=0.270 ms
2508 bytes from 172.25.20.8: icmp_seq=3 ttl=64 time=0.262 ms
2508 bytes from 172.25.20.8: icmp_seq=4 ttl=64 time=0.218 ms
^C
--- 172.25.20.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3054ms
rtt min/avg/max/mdev = 0.218/0.253/0.270/0.020 ms

Ping inside a testing container:

ping -s 2500 redis
PING redis (10.0.25.10): 2500 data bytes
2508 bytes from 10.0.25.10: seq=0 ttl=64 time=0.125 ms
2508 bytes from 10.0.25.10: seq=1 ttl=64 time=0.141 ms
2508 bytes from 10.0.25.10: seq=2 ttl=64 time=0.144 ms
2508 bytes from 10.0.25.10: seq=3 ttl=64 time=0.129 ms
2508 bytes from 10.0.25.10: seq=4 ttl=64 time=0.137 ms
^C
--- redis ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.125/0.135/0.144 ms

If the overlay traffic is not working, these are usually the suspects:

  • The firewall needs the following ports to be open on all nodes:
    • Port 2377 TCP for communication with and between manager nodes
    • Port 7946 TCP/UDP for overlay network node discovery
    • Port 4789 UDP (configurable) for overlay network traffic
  • The MTU size is not identical on all nodes
    • ip addr show scope global | grep mtu
  • The nodes don’t share a low-latency network connection
  • Nodes are running in VMs on VMware vSphere with NSX
    • Outgoing traffic to port 4789 UDP is silently dropped, as it conflicts with VMware NSX’s communication port for VXLAN
    • Re-create the swarm with a different data-path port:
      • docker swarm init --data-path-port=7789
  • Problems with checksum offloading
    • Disable checksum offloading for the network interface (eth0 is a placeholder):
    • ethtool -K eth0 tx-checksum-ip-generic off
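Note that the ethtool change does not survive a reboot. One way to make it persistent on a systemd-based host is a small oneshot unit (a sketch — the unit name is my own, eth0 is still a placeholder, and the ethtool path may differ per distro):

```ini
# /etc/systemd/system/disable-tx-offload.service
[Unit]
Description=Disable TX checksum offloading for Swarm overlay traffic
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K eth0 tx-checksum-ip-generic off

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now disable-tx-offload.service.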

Hello @meyay, thanks for your response!

It was indeed the checksum offloading; the issues disappeared after I disabled it.

Thanks a lot to both of you, @bluepuma77 and @meyay. I have been struggling with this for some time.

I will be doing some further testing, but everything seems to be working just fine for now.

Best regards


Odd. I recall @bretfisher talking about an experiment with Docker Swarm running WordPress and a database cluster over many nodes across the Internet. IIRC, the result was that it was slow, but it actually worked.

That’s surprising. Raft, the consensus algorithm used by Swarm, does not work reliably without a low-latency network. It would be great if you had a pointer to where that information can be found, so I could add whatever settings he configured to make it work on high-latency networks to the post template.
