Docker swarm services cannot communicate across nodes

Hi!

For the past couple of days, I have been trying to set up my swarm with some more nodes (we have been running swarm on a single node for about a year).

However, I have been having a lot of issues with networking being both broken and unstable. I know this issue has been brought up a lot (e.g. Service is not DNS resolvable from another one if containers run on different nodes · Issue #1429 · moby/swarmkit · GitHub), but there doesn’t seem to be a universal fix for these issues, so I thought I’d ask here.

Setup

The setup consists of two VPSes running on hetzner cloud. For communication in the swarm, we are using a private network (also managed by hetzner). As seen below, the ports have been triple-checked as being open. I have also tried just disabling the firewall on all hosts with no changes.
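For completeness, this is roughly what the firewall rules look like (a sketch assuming ufw; 10.1.0.0/24 is our private subnet, adjust as needed):

```shell
# Allow swarm traffic only from the private network (10.1.0.0/24 here)
ufw allow from 10.1.0.0/24 to any port 2377 proto tcp  # cluster management (managers)
ufw allow from 10.1.0.0/24 to any port 7946 proto tcp  # node-to-node gossip
ufw allow from 10.1.0.0/24 to any port 7946 proto udp  # node-to-node gossip
ufw allow from 10.1.0.0/24 to any port 4789 proto udp  # VXLAN overlay data path
ufw reload
```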

This is also the ~3rd time I have recreated the entire test setup (infrastructure, OS, fresh install of Docker, new network), and the result is the same every time (mostly).

Debug info

Manager

# uname -a
Linux staging 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

# docker version
Client: Docker Engine - Community
 Version:           20.10.19
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        d85ef84
 Built:             Thu Oct 13 16:46:17 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.19
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       c964641
  Built:            Thu Oct 13 16:44:09 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

# docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 24
  Running: 23
  Paused: 0
  Stopped: 1
 Images: 29
 Server Version: 20.10.19
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: loki
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: kc8k7g021p7707oqfi92p0dp8
  Is Manager: true
  ClusterID: zvlqhppclv6tal9a97pabkh7p
  Managers: 1
  Nodes: 2
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.1.0.3
  Manager Addresses:
   159.69.215.2:2377
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-125-generic
 Operating System: Ubuntu 20.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.586GiB
 Name: staging
 ID: 4HNF:CGNV:T5Y4:WMOH:RPVY:QR7U:UCTM:UL3I:CQUV:L6IC:DJSE:3MEN
 Docker Root Dir: /mnt/HC_Volume_6311166/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Worker

# uname -a
Linux staging-worker-node-1 5.15.0-50-generic #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

# docker version
Client: Docker Engine - Community
 Version:           20.10.19
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        d85ef84
 Built:             Thu Oct 13 16:46:58 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.19
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       c964641
  Built:            Thu Oct 13 16:44:47 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0

# On manager
docker swarm init --force-new-cluster --advertise-addr 10.1.0.3
# Note: the IP in the printed join command is the public IP, NOT the one
# specified here. I don't know whether this matters and the output is just
# sloppy, or whether it is an actual bug in Docker.

# Then on worker
docker swarm join --advertise-addr 10.1.0.21 ...

The worker is connected to the internal network in hetzner:
Subnet: 10.1.0.0/32
Manager: 10.1.0.3/32
Worker: 10.1.0.21/32

Validation of network access between nodes

# Manager
nc -vz -u 10.1.0.21 7946
Connection to 10.1.0.21 7946 port [udp/*] succeeded!

nc -vz -u 10.1.0.21 4789
Connection to 10.1.0.21 4789 port [udp/*] succeeded!

nc -vz 10.1.0.21 7946
Connection to 10.1.0.21 7946 port [tcp/*] succeeded!

# Worker
nc -vz  10.1.0.3 2377
Connection to 10.1.0.3 2377 port [tcp/*] succeeded!

nc -vz -u 10.1.0.3 2377
Connection to 10.1.0.3 2377 port [udp/*] succeeded!

nc -vz  10.1.0.3 7946
Connection to 10.1.0.3 7946 port [tcp/*] succeeded!

nc -vz -u 10.1.0.3 4789
Connection to 10.1.0.3 4789 port [udp/*] succeeded!

# etc.. you get the point

Testing inter-node communication

# On manager
docker network create --driver overlay --attachable test

docker service create --name whoami --constraint "node.role==worker" --network test traefik/whoami
docker service create --name whoamiman --constraint "node.role==manager" --network test traefik/whoami

So now we have one overlay network, test, spanning both nodes. It contains two services with one container each:

  • whoamiman running on the manager
  • whoami running on the worker
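To double-check the placement, something like this should work (a sketch using the service names above):

```shell
# Which node is each task scheduled on?
docker service ps whoami --format '{{.Name}} -> {{.Node}} ({{.CurrentState}})'
docker service ps whoamiman --format '{{.Name}} -> {{.Node}} ({{.CurrentState}})'

# Containers attached to the overlay (only shows containers on the local node)
docker network inspect test --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'
```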

Trying to communicate with the containers on other nodes is where things break:

# Manager
# Trying to connect to the service running on the worker node:
docker run -it --rm --network test nginx curl whoami
curl: (6) Could not resolve host: whoami

# The service running on the same node:
docker run -it --rm --network test nginx curl whoamiman
Hostname: 5491774d6de0
IP: 127.0.0.1
IP: 10.0.2.2
IP: 172.18.0.16
RemoteAddr: 10.0.2.3:50882
GET / HTTP/1.1
Host: whoamiman
User-Agent: curl/7.74.0
Accept: */*

# Works fine!

# -------------
# Worker, same thing
docker run -it --rm --network test nginx curl whoamiman
curl: (6) Could not resolve host: whoamiman


docker run -it --rm --network test nginx curl whoami
Hostname: cfc29d904c1d
IP: 127.0.0.1
IP: 10.0.2.6
IP: 172.18.0.4
RemoteAddr: 10.0.2.11:36618
GET / HTTP/1.1
Host: whoami
User-Agent: curl/7.74.0
Accept: */*

I can also mention that at one point I did manage to get DNS working between nodes, but then connections between containers on different hosts would simply time out instead. I’m having trouble getting back to that state.

Usually either one of those causes problems:

You already addressed the first and last point.

What is this supposed to be? A 32-bit CIDR netmask would result in a network with a single IP.

Oops, I goofed on that one, yeah (in the post, not the actual network). The subnet has a 24-bit mask, not 32.

Running the script shows that all the required features are present. The manager host only has these missing:

CONFIG_MEMCG_SWAP_ENABLED: missing
...
CONFIG_RT_GROUP_SCHED: missing

And the worker has these missing:

CONFIG_RT_GROUP_SCHED: missing

Regarding the low-latency network: both hosts run in the same datacenter, on a private network local to the cloud provider’s region, so I would be very surprised if that’s not low-latency enough.

Those kernel modules should be irrelevant for the overlay network to work.
Indeed, it would be a surprise if hosts in the same DC didn’t have a low-latency connection.

Usually availability zones in the same region of a cloud provider are low-latency as well. Some people try to run swarm over WAN connections or across cloud provider regions, which is anything but recommended. That’s why I mentioned it.

Honestly, I have never had problems running a swarm on Ubuntu machines using docker-ce from Docker’s repos.

Are you sure there are no ip-range conflicts between your networks and the overlay network?
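A quick way to compare the ranges (a sketch, run on each node):

```shell
# Subnets assigned to the overlay and ingress networks
docker network inspect test --format '{{range .IPAM.Config}}{{.Subnet}}{{"\n"}}{{end}}'
docker network inspect ingress --format '{{range .IPAM.Config}}{{.Subnet}}{{"\n"}}{{end}}'

# Host-side routes/ranges to compare against (your 10.1.0.0/24, docker0, etc.)
ip -4 route show
```

With the default address pool 10.0.0.0/8 shown in your docker info, overlay subnets are carved out of the same 10.x range as your private network, so it's worth checking.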

UPDATE
I managed to fix the DNS issues. I tried creating a new swarm on a completely fresh machine.

It’s not very clear from the post above, but in the test I had initialized the swarm using

docker swarm init --force-new-cluster --advertise-addr 10.1.0.3

It seems that using --force-new-cluster and changing the advertise-addr at the same time does not work, which is why it prints the public IP in the join-token command (this should probably be documented and/or fixed).

What did work, was deleting the entire swarm, and recreating it with the correct advertise-addr from the start:

docker swarm leave --force
docker swarm init --advertise-addr <internal_ip>
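To verify that the new swarm actually advertises the private IP (sketch):

```shell
# The address this node advertises to the swarm -- should be the private IP
docker info --format '{{.Swarm.NodeAddr}}'

# The printed join command should now also contain the private IP
docker swarm join-token worker
```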

I think that solves all my issues?

I had a lot of issues with network requests timing out between nodes earlier as well, but that might also have been a stability issue fixed by recreating everything (things seem okay for now).

Now all my remaining swarm issues seem to be with Traefik :slight_smile:
Scratch that. I tried creating a service with 1 replica running just on the worker node. This container is not discoverable by any other container, regardless of the host it’s running on (i.e. the same issue), but I can resolve and connect to all other services from any container on any host.

Glad you found the culprit :slight_smile:

Shouldn’t be too hard to sort out Traefik. In case you want to use Let’s Encrypt certificates: Traefik v1.x had support for sharing issued Let’s Encrypt certificates amongst the nodes using Consul; in v2.x this became an enterprise-only feature. That’s why I still use Traefik v1.7.34 in my private swarm homelab.