Hi!
For the past couple of days, I have been trying to set up my swarm with more nodes (we have been running swarm on a single node for about a year).
However, I have run into a lot of networking issues: cross-node traffic either does not work at all or is unstable. I know this has been brought up a lot (e.g. "Service is not DNS resolvable from another one if containers run on different nodes", Issue #1429 in moby/swarmkit on GitHub), but there doesn't seem to be a universal fix, so I thought I'd ask here.
Setup
The setup consists of two VPSes running on Hetzner Cloud. For communication within the swarm, we use a private network (also managed by Hetzner). As shown below, I have triple-checked that the required ports are open. I have also tried disabling the firewall on all hosts entirely, with no change.
This is also roughly the third time I have recreated the entire test (infrastructure, OS, fresh install of Docker, new network), and the result is (mostly) the same every time.
Debug info
Manager
# uname -a
Linux staging 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
# docker version
Client: Docker Engine - Community
Version: 20.10.19
API version: 1.41
Go version: go1.18.7
Git commit: d85ef84
Built: Thu Oct 13 16:46:17 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.19
API version: 1.41 (minimum version 1.12)
Go version: go1.18.7
Git commit: c964641
Built: Thu Oct 13 16:44:09 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.8
GitCommit: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
# docker info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
scan: Docker Scan (Docker Inc., v0.17.0)
Server:
Containers: 24
Running: 23
Paused: 0
Stopped: 1
Images: 29
Server Version: 20.10.19
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: loki
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: kc8k7g021p7707oqfi92p0dp8
Is Manager: true
ClusterID: zvlqhppclv6tal9a97pabkh7p
Managers: 1
Nodes: 2
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.1.0.3
Manager Addresses:
159.69.215.2:2377
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.4.0-125-generic
Operating System: Ubuntu 20.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.586GiB
Name: staging
ID: 4HNF:CGNV:T5Y4:WMOH:RPVY:QR7U:UCTM:UL3I:CQUV:L6IC:DJSE:3MEN
Docker Root Dir: /mnt/HC_Volume_6311166/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Worker
# uname -a
Linux staging-worker-node-1 5.15.0-50-generic #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
# docker version
Client: Docker Engine - Community
Version: 20.10.19
API version: 1.41
Go version: go1.18.7
Git commit: d85ef84
Built: Thu Oct 13 16:46:58 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.19
API version: 1.41 (minimum version 1.12)
Go version: go1.18.7
Git commit: c964641
Built: Thu Oct 13 16:44:47 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.8
GitCommit: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
# On manager
docker swarm init --force-new-cluster --advertise-addr 10.1.0.3
# Note: the IP in the generated join command is the public IP and NOT the one specified above. I don't know whether this matters or whether the output is just sloppy; if it does matter, it looks like a bug in Docker.
# Then on worker
docker swarm join --advertise-addr 10.1.0.21 ...
Both nodes are connected to the internal network in Hetzner:
Subnet: 10.1.0.0/32
Manager: 10.1.0.3/32
Worker: 10.1.0.21/32
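Since the `docker info` output above lists a private Node Address (10.1.0.3) but a public Manager Address (159.69.215.2:2377), it may be worth checking which addresses the swarm actually recorded. The following is only a rough sketch: the `10.1.0.` prefix is an assumption taken from the addresses in this post, and the helper names are made up for illustration.

```shell
#!/bin/sh
# Assumption: the Hetzner private network uses the 10.1.0.x range shown above.
PRIVATE_PREFIX="10.1.0."

# Succeeds if the address (optionally followed by :port) is in the private range.
is_private_addr() {
    case "$1" in
        "$PRIVATE_PREFIX"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print each address passed as an argument with a private/PUBLIC label.
classify_addrs() {
    for addr in "$@"; do
        if is_private_addr "$addr"; then
            echo "$addr private"
        else
            echo "$addr PUBLIC"
        fi
    done
}

# Possible usage on a swarm node (not executed here):
#   classify_addrs "$(docker info --format '{{ .Swarm.NodeAddr }}')"
#   classify_addrs $(docker info --format '{{ range .Swarm.RemoteManagers }}{{ .Addr }} {{ end }}')
```

If a manager address comes back as PUBLIC, the nodes may be using the public interface for control traffic. Both `docker swarm init` and `docker swarm join` also accept `--data-path-addr` to pin overlay traffic to a specific address; whether that helps in this setup is untested.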
Validation of network access between nodes
# Manager
nc -vz -u 10.1.0.21 7946
Connection to 10.1.0.21 7946 port [udp/*] succeeded!
nc -vz -u 10.1.0.21 4789
Connection to 10.1.0.21 4789 port [udp/*] succeeded!
nc -vz 10.1.0.21 7946
Connection to 10.1.0.21 7946 port [tcp/*] succeeded!
# Worker
nc -vz 10.1.0.3 2377
Connection to 10.1.0.3 2377 port [tcp/*] succeeded!
nc -vz -u 10.1.0.3 2377
Connection to 10.1.0.3 2377 port [udp/*] succeeded!
nc -vz 10.1.0.3 7946
Connection to 10.1.0.3 7946 port [tcp/*] succeeded!
nc -vz -u 10.1.0.3 4789
Connection to 10.1.0.3 4789 port [udp/*] succeeded!
# etc.. you get the point
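One caveat about the checks above: for UDP, `nc -vz -u` reports success whenever no ICMP "port unreachable" comes back, so it does not prove that the packets actually arrived. The ports Swarm needs are 2377/tcp, 7946/tcp, 7946/udp and 4789/udp (2377 is TCP only). A small sketch that generates the matching checks for a peer; the function names are hypothetical:

```shell
#!/bin/sh
# Ports Docker Swarm uses between nodes:
#   2377/tcp            cluster management (manager nodes)
#   7946/tcp, 7946/udp  node-to-node gossip
#   4789/udp            VXLAN overlay data path
swarm_ports() {
    printf '%s\n' 2377/tcp 7946/tcp 7946/udp 4789/udp
}

# Print one nc command per port for the given peer address.
# Nothing is executed; review the output, then pipe it to sh if desired.
nc_checks() {
    peer=$1
    swarm_ports | while IFS=/ read -r port proto; do
        if [ "$proto" = udp ]; then
            echo "nc -vz -u $peer $port"
        else
            echo "nc -vz $peer $port"
        fi
    done
}
```

For example, `nc_checks 10.1.0.21 | sh` would run the checks from the manager. To confirm that UDP traffic really arrives, a packet capture on the receiving node (for instance `tcpdump -ni any udp port 4789`) while the other node sends is more reliable than the nc result.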
Testing inter-node communication
# On manager
docker network create --driver overlay --attachable test
docker service create --name whoami --constraint "node.role=worker" --network test traefik/whoami
docker service create --name whoamiman --constraint "node.role=manager" --network test traefik/whoami
So now we have one overlay network, test, spanning both nodes. This network contains two services with one container each:
- whoamiman, running on the manager
- whoami, running on the worker
Communicating with containers on the other node is where things break:
# Manager
# Trying to connect to the service running on the worker node:
docker run -it --rm --network test nginx curl whoami
curl: (6) Could not resolve host: whoami
# The service running on the same node:
docker run -it --rm --network test nginx curl whoamiman
Hostname: 5491774d6de0
IP: 127.0.0.1
IP: 10.0.2.2
IP: 172.18.0.16
RemoteAddr: 10.0.2.3:50882
GET / HTTP/1.1
Host: whoamiman
User-Agent: curl/7.74.0
Accept: */*
# Works fine!
# -------------
# Worker, same thing
docker run -it --rm --network test nginx curl whoamiman
curl: (6) Could not resolve host: whoamiman
docker run -it --rm --network test nginx curl whoami
Hostname: cfc29d904c1d
IP: 127.0.0.1
IP: 10.0.2.6
IP: 172.18.0.4
RemoteAddr: 10.0.2.11:36618
GET / HTTP/1.1
Host: whoami
User-Agent: curl/7.74.0
Accept: */*
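When name resolution only works for services on the same node, one thing that can be checked is whether each node sees the other as a peer on the overlay network. The sketch below assumes that `docker network inspect` exposes a `Peers` list for the overlay network (it is normally only populated on nodes that have a container attached to it); `missing_peers` is a made-up helper name.

```shell
#!/bin/sh
# Print every expected peer IP that is absent from the list of seen peers.
#   $1           space-separated IPs reported for the network
#   remaining    the IPs that are expected to be present
missing_peers() {
    seen=" $1 "
    shift
    for ip in "$@"; do
        case "$seen" in
            *" $ip "*) ;;          # peer is known, nothing to report
            *) echo "$ip" ;;       # peer is missing
        esac
    done
}

# Possible usage on either node (not executed here):
#   peers=$(docker network inspect test --format '{{ range .Peers }}{{ .IP }} {{ end }}')
#   missing_peers "$peers" 10.1.0.3 10.1.0.21
```

An empty result would mean both nodes are listed as peers; a printed IP would suggest the gossip layer (port 7946) never connected the two nodes for this network.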
I should also mention that at one point I managed to get DNS working between nodes, but then connections between containers on different hosts would simply time out. I have not been able to get back to that state since.
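For timeouts of this kind, a commonly reported cause is an MTU mismatch: VXLAN adds 50 bytes of overhead, so the overlay MTU has to be at least 50 bytes below the MTU of the interface that carries it. Whether that applies here is an assumption, not something the output above confirms. A minimal sketch, where the interface name `ens10` in the usage comment is hypothetical:

```shell
#!/bin/sh
# VXLAN over IPv4 adds 50 bytes (outer IP + UDP + VXLAN + inner Ethernet headers).
VXLAN_OVERHEAD=50

# Largest overlay MTU that fits inside an underlay with the given MTU.
overlay_mtu() {
    echo $(( $1 - VXLAN_OVERHEAD ))
}

# Read the MTU of a network interface from sysfs.
iface_mtu() {
    cat "/sys/class/net/$1/mtu"
}

# Possible usage (not executed here; ens10 is a hypothetical interface name):
#   mtu=$(overlay_mtu "$(iface_mtu ens10)")
#   docker network create --driver overlay --attachable \
#     --opt com.docker.network.driver.mtu="$mtu" test
```

If the private interface reports 1450, this would yield an overlay MTU of 1400, which is lower than the value Docker picks for overlay networks by default.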