I have an existing, working Swarm cluster with three nodes, and I want to add a new one.
The three existing nodes are running CentOS 7, and the new one is running Ubuntu 22.04.5 LTS. All four servers are on the same network. The existing nodes have Docker versions 25.0.0, 25.0.1, and 26.1.3. To rule out potential version conflicts, I installed Docker 26.1.3 on the new node.
The issue arises when a service running on the new node tries to access a service on one of the older nodes; it fails to resolve the service name.
The Redis service is always deployed on one of the older nodes, which handles all the volumes. When I deploy the sync-worker service, it works correctly if placed on an older node and can connect to Redis using the connection string "redis:6379,abortConnect=false". However, if the sync-worker service is deployed on the new Ubuntu node, it fails to resolve the ‘redis’ hostname.
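To compare name resolution between a container on an older node and one on the Ubuntu node, a minimal Python sketch along these lines could be run inside each container. It assumes Python is available in the image; the `resolve` helper is hypothetical and not part of any Docker tooling. `tasks.redis` is the Swarm DNS name that returns the individual task IPs rather than the service VIP.

```python
import socket


def resolve(name: str) -> list[str]:
    """Return the sorted IPv4 addresses for `name`, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(
            name, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # Resolution failed (the symptom seen on the new node).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "redis" is the service name from the connection string above;
    # "tasks.redis" asks Swarm's embedded DNS for the task IPs instead of the VIP.
    for name in ("redis", "tasks.redis"):
        addresses = resolve(name)
        print(f"{name}: {addresses if addresses else 'NOT RESOLVED'}")
```

If both names resolve on the older nodes but neither resolves on the new node, that points at the embedded DNS or the overlay control plane on that node rather than at Redis itself.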
This isn’t just a Redis issue; I’m also seeing strange behavior with Swarmpit and Portainer. It seems like the Swarm cluster recognizes the new node, but the overlay network communication is failing partially.
Of course, all the necessary ports for Docker Swarm are open between the nodes. I have tried many different approaches so far with no success.
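For reference, Docker documents the Swarm ports as 2377/tcp (cluster management, managers only), 7946/tcp and 7946/udp (node-to-node gossip), and 4789/udp (VXLAN overlay traffic). The sketch below is one hedged way to re-check the TCP ports from the new node; the helper name `tcp_port_open` is an assumption, and the peer address is the node pinged later in this post. It cannot verify the UDP ports, because a plain TCP connect says nothing about UDP reachability.

```python
import socket

# TCP ports used by Swarm; 7946/udp and 4789/udp are NOT covered by this check.
SWARM_TCP_PORTS = {
    2377: "cluster management (managers only)",
    7946: "node-to-node gossip",
}


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    peer = "172.25.20.8"  # one of the older nodes; adjust as needed
    for port, purpose in SWARM_TCP_PORTS.items():
        state = "open" if tcp_port_open(peer, port) else "CLOSED/FILTERED"
        print(f"{peer}:{port} ({purpose}): {state}")
```

A closed 2377 on a worker-only node is expected, since only managers listen on that port.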
Each swarm node is in a different VM, all connected by a primary and secondary VLAN.
All nodes “see” each other (ping and telnet tests), and we’ve run connectivity tests on both sides.
MTU is set to 1500 on every node.
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b7:31:ef brd ff:ff:ff:ff:ff:ff
    inet 172.25.20.218/22 brd 172.25.23.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:feb7:31ef/64 scope link
       valid_lft forever preferred_lft forever
eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b7:43:56 brd ff:ff:ff:ff:ff:ff
    inet 10.139.15.75/27 brd 10.139.15.95 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:feb7:4356/64 scope link
       valid_lft forever preferred_lft forever
Ping from the new node to one of the older nodes:
ping -s 2500 172.25.20.8
PING 172.25.20.8 (172.25.20.8) 2500(2528) bytes of data.
2508 bytes from 172.25.20.8: icmp_seq=1 ttl=64 time=0.262 ms
2508 bytes from 172.25.20.8: icmp_seq=2 ttl=64 time=0.270 ms
2508 bytes from 172.25.20.8: icmp_seq=3 ttl=64 time=0.262 ms
2508 bytes from 172.25.20.8: icmp_seq=4 ttl=64 time=0.218 ms
^C
--- 172.25.20.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3054ms
rtt min/avg/max/mdev = 0.218/0.253/0.270/0.020 ms
Ping inside a testing container:
ping -s 2500 redis
PING redis (10.0.25.10): 2500 data bytes
2508 bytes from 10.0.25.10: seq=0 ttl=64 time=0.125 ms
2508 bytes from 10.0.25.10: seq=1 ttl=64 time=0.141 ms
2508 bytes from 10.0.25.10: seq=2 ttl=64 time=0.144 ms
2508 bytes from 10.0.25.10: seq=3 ttl=64 time=0.129 ms
2508 bytes from 10.0.25.10: seq=4 ttl=64 time=0.137 ms
^C
--- redis ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.125/0.135/0.144 ms
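One caveat on the tests above: a 2500-byte payload is larger than a 1500-byte MTU allows in a single packet, so those pings succeed through IP fragmentation and do not show the largest packet that passes unfragmented. The sketch below works out the payload sizes to use with the don't-fragment bit set (for example `ping -M do -s <size>` with iputils ping on Linux). It assumes the standard 50-byte VXLAN encapsulation overhead that overlay networks add; the function names are hypothetical.

```python
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header
# VXLAN encapsulation: outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14)
VXLAN_OVERHEAD = 50


def max_ping_payload(mtu: int) -> int:
    """Largest `ping -s` value that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def overlay_mtu(underlay_mtu: int) -> int:
    """MTU left for traffic inside a VXLAN overlay on the given underlay MTU."""
    return underlay_mtu - VXLAN_OVERHEAD


def needs_fragmentation(payload: int, mtu: int) -> bool:
    """True if a ping with this payload cannot fit in a single packet."""
    return payload > max_ping_payload(mtu)


if __name__ == "__main__":
    underlay = 1500  # MTU reported by eth0/eth1 above
    print("host-to-host payload:", max_ping_payload(underlay))
    print("overlay MTU:", overlay_mtu(underlay))
    print("container-to-container payload:", max_ping_payload(overlay_mtu(underlay)))
    print("2500-byte ping fragments:", needs_fragmentation(2500, underlay))
```

If a host-to-host ping with the don't-fragment bit set fails at a payload well below the computed value, something on the path has a smaller MTU than the interfaces report.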
Odd, I recall @bretfisher talking about an experiment with Docker Swarm running a WordPress and database cluster over many nodes across the Internet. IIRC the result was that it was slow, but it actually worked.
That’s surprising. Raft, the consensus algorithm used by Swarm, does not work reliably without a low-latency network. It would be great if you had a pointer to where that information can be found, so I could add whatever settings he configured to make it work on high-latency networks to the post template.