Docker swarm services cannot communicate across nodes

Hi!

For the past couple of days, I have been trying to set up my swarm with some more nodes (we have been running swarm on a single node for about a year).

However, I have been having a lot of issues with networking being both broken and unstable. I know this issue has been brought up a lot (e.g. Service is not DNS resolvable from another one if containers run on different nodes · Issue #1429 · moby/swarmkit · GitHub), but there doesn’t seem to be a universal fix for these issues, so I thought I’d ask here.

Setup

The setup consists of two VPSes running on hetzner cloud. For communication in the swarm, we are using a private network (also managed by hetzner). As seen below, the ports have been triple-checked as being open. I have also tried just disabling the firewall on all hosts with no changes.
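For completeness, this is roughly what the firewall rules look like (a sketch assuming ufw; 10.1.0.0/24 is our private subnet, adjust as needed):

```shell
# Allow swarm traffic only from the private network (10.1.0.0/24 here)
ufw allow from 10.1.0.0/24 to any port 2377 proto tcp  # cluster management (managers)
ufw allow from 10.1.0.0/24 to any port 7946 proto tcp  # node-to-node gossip
ufw allow from 10.1.0.0/24 to any port 7946 proto udp  # node-to-node gossip
ufw allow from 10.1.0.0/24 to any port 4789 proto udp  # VXLAN overlay data path
ufw reload
```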

This is also the ~3rd time I have recreated the entire test setup (infrastructure, OS, fresh install of Docker, new network), and the result is the same every time (mostly).

Debug info

Manager

# uname -a
Linux staging 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

# docker version
Client: Docker Engine - Community
 Version:           20.10.19
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        d85ef84
 Built:             Thu Oct 13 16:46:17 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.19
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       c964641
  Built:            Thu Oct 13 16:44:09 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

# docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 24
  Running: 23
  Paused: 0
  Stopped: 1
 Images: 29
 Server Version: 20.10.19
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: loki
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: kc8k7g021p7707oqfi92p0dp8
  Is Manager: true
  ClusterID: zvlqhppclv6tal9a97pabkh7p
  Managers: 1
  Nodes: 2
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.1.0.3
  Manager Addresses:
   159.69.215.2:2377
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-125-generic
 Operating System: Ubuntu 20.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.586GiB
 Name: staging
 ID: 4HNF:CGNV:T5Y4:WMOH:RPVY:QR7U:UCTM:UL3I:CQUV:L6IC:DJSE:3MEN
 Docker Root Dir: /mnt/HC_Volume_6311166/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Worker

# uname -a
Linux staging-worker-node-1 5.15.0-50-generic #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

# docker version
Client: Docker Engine - Community
 Version:           20.10.19
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        d85ef84
 Built:             Thu Oct 13 16:46:58 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.19
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       c964641
  Built:            Thu Oct 13 16:44:47 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0

# On manager
docker swarm init --force-new-cluster --advertise-addr 10.1.0.3
# Note: the IP in the printed join command is the public IP, NOT the one
# specified here. I don't know whether this matters and the output is just
# sloppy, or whether it is an actual bug in Docker.

# Then on worker
docker swarm join --advertise-addr 10.1.0.21 ...

The worker is connected to the internal network in hetzner:
Subnet: 10.1.0.0/32
Manager: 10.1.0.3/32
Worker: 10.1.0.21/32

Validation of network access between nodes

# Manager
nc -vz -u 10.1.0.21 7946
Connection to 10.1.0.21 7946 port [udp/*] succeeded!

nc -vz -u 10.1.0.21 4789
Connection to 10.1.0.21 4789 port [udp/*] succeeded!

nc -vz 10.1.0.21 7946
Connection to 10.1.0.21 7946 port [tcp/*] succeeded!

# Worker
nc -vz  10.1.0.3 2377
Connection to 10.1.0.3 2377 port [tcp/*] succeeded!

nc -vz -u 10.1.0.3 2377
Connection to 10.1.0.3 2377 port [udp/*] succeeded!

nc -vz  10.1.0.3 7946
Connection to 10.1.0.3 7946 port [tcp/*] succeeded!

nc -vz -u 10.1.0.3 4789
Connection to 10.1.0.3 4789 port [udp/*] succeeded!

# etc.. you get the point

Testing inter-node communication

# On manager
docker network create --driver overlay --attachable test

docker service create --name whoami --constraint "node.role==worker" --network test traefik/whoami
docker service create --name whoamiman --constraint "node.role==manager" --network test traefik/whoami

So now we have one overlay network, test, spanning both nodes. It contains two services with one container each:

  • whoamiman running on the manager
  • whoami running on the worker
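To double-check the placement, something like this should work (a sketch using the service names above):

```shell
# Which node is each task scheduled on?
docker service ps whoami --format '{{.Name}} -> {{.Node}} ({{.CurrentState}})'
docker service ps whoamiman --format '{{.Name}} -> {{.Node}} ({{.CurrentState}})'

# Containers attached to the overlay (only shows containers on the local node)
docker network inspect test --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'
```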

Trying to communicate with the containers on other nodes is where things break:

# Manager
# Trying to connect to the service running on the worker node:
docker run -it --rm --network test nginx curl whoami
curl: (6) Could not resolve host: whoami

# The service running on the same node:
docker run -it --rm --network test nginx curl whoamiman
Hostname: 5491774d6de0
IP: 127.0.0.1
IP: 10.0.2.2
IP: 172.18.0.16
RemoteAddr: 10.0.2.3:50882
GET / HTTP/1.1
Host: whoamiman
User-Agent: curl/7.74.0
Accept: */*

# Works fine!

# -------------
# Worker, same thing
docker run -it --rm --network test nginx curl whoamiman
curl: (6) Could not resolve host: whoamiman


docker run -it --rm --network test nginx curl whoami
Hostname: cfc29d904c1d
IP: 127.0.0.1
IP: 10.0.2.6
IP: 172.18.0.4
RemoteAddr: 10.0.2.11:36618
GET / HTTP/1.1
Host: whoami
User-Agent: curl/7.74.0
Accept: */*

I can also mention that at one point I did manage to get DNS working between nodes, but then connections between containers on different hosts would simply time out instead. I’m having trouble getting back to that state.

Usually either one of those causes problems:

You already addressed the first and last point.

What is this supposed to be? A 32-bit CIDR netmask would result in a network with a single IP.

Oops, I goofed on that one, yeah (in the post, not the actual network). The subnet has a 24-bit mask, not 32.

Running the script shows that all the required features are present. The manager host only has these missing:

CONFIG_MEMCG_SWAP_ENABLED: missing
...
CONFIG_RT_GROUP_SCHED: missing

And the worker has these missing:

CONFIG_RT_GROUP_SCHED: missing

Regarding the low-latency network: both hosts run in the same datacenter, on a private network local to the cloud provider’s region, so I would be very surprised if that’s not low-latency enough.

Those kernel modules should be irrelevant for the overlay network to work.
Indeed, it would be a surprise if hosts in the same DC didn’t have a low-latency connection.

Usually availability zones in the same region of a cloud provider are low-latency as well. Some people try to run swarm over WAN connections or across cloud provider regions, which is anything but recommended. That’s why I mentioned it.

Honestly, I have never had problems running a swarm on Ubuntu machines using docker-ce from Docker’s repos.

Are you sure there are no ip-range conflicts between your networks and the overlay network?
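A quick way to compare the ranges (a sketch, run on each node):

```shell
# Subnets assigned to the overlay and ingress networks
docker network inspect test --format '{{range .IPAM.Config}}{{.Subnet}}{{"\n"}}{{end}}'
docker network inspect ingress --format '{{range .IPAM.Config}}{{.Subnet}}{{"\n"}}{{end}}'

# Host-side routes/ranges to compare against (your 10.1.0.0/24, docker0, etc.)
ip -4 route show
```

With the default address pool 10.0.0.0/8 shown in your docker info, overlay subnets are carved out of the same 10.x range as your private network, so it's worth checking.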

UPDATE
I managed to fix the DNS issues. I tried creating a new swarm on a completely fresh machine.

It’s not very clear from the post above, but in the test I had initialized the swarm using

docker swarm init --force-new-cluster --advertise-addr 10.1.0.3

It seems that using --force-new-cluster and changing the advertise-addr at the same time does not work, which is why it prints the public IP in the join-token command (this should probably be documented and/or fixed).

What did work, was deleting the entire swarm, and recreating it with the correct advertise-addr from the start:

docker swarm leave --force
docker swarm init --advertise-addr <internal_ip>
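To verify that the new swarm actually advertises the private IP (sketch):

```shell
# The address this node advertises to the swarm -- should be the private IP
docker info --format '{{.Swarm.NodeAddr}}'

# The printed join command should now also contain the private IP
docker swarm join-token worker
```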

I think that solves all my issues?

I had a lot of issues with network requests timing out between nodes earlier as well, but that might also have been a stability issue fixed by recreating everything (things seem okay for now).

Now all my remaining swarm issues seem to be with Traefik :slight_smile:
Scratch that. I tried creating a service with 1 replica running just on the worker node. This container is not discoverable by any other container, regardless of the host it’s running on (i.e. the same issue), but I can resolve and connect to all other services from any container on any host.

Glad you found the culprit :slight_smile:

Shouldn’t be too hard to sort out Traefik. In case you want to use Let’s Encrypt certificates: Traefik v1.x had support for sharing issued Let’s Encrypt certificates amongst the nodes using Consul; in v2.x this became an enterprise-only feature. That’s why I still use Traefik v1.7.34 in my private swarm homelab.