Swarm join problems: Timeout was reached before node joined

Hello community,

I’m struggling to join one specific worker to the swarm. I followed the typical resolution advice and verified whether something is blocking the relevant TCP/UDP traffic on the usual swarm ports, but the results indicate this may not be the problem:

  1. On the manager node (IP 172.27.32.18):
j4care@pacslb2:~$ docker swarm init --advertise-addr 172.27.32.18:2377
  2. On the worker node (IP 172.27.32.16):
j4care@pacsvwr2:~$ docker swarm join --token SWMTKN-1-5hb4nwggy2stt09me27d3jusxmwybjep7us1dbk7ljikqm7fm0-f3jzsk1bnltms7ch1foge7386 172.27.32.18:2377
Error response from daemon: Timeout was reached before node joined. The attempt to join the swarm will continue in the background.
  3. Result on the manager:
j4care@pacslb2:~$ docker node ls
ID                            HOSTNAME   STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
3vvd52294czmefaredb1i8hjt     pacsvwr2   Ready     Active                          26.0.0

However, it is not possible to deploy services:

docker service create --constraint node.hostname==pacsvwr2 nginx

… never finishes. In the Docker logs on the worker I see:

Mar 22 19:43:37 pacsvwr2 dockerd[78392]: time="2024-03-22T19:43:37.517077033+05:00" level=info msg="manager selected by agent for new session: { 172.27.32.18:2377}" module=node/agent node.id=3vvd52294czmefaredb1i8hjt
Mar 22 19:43:37 pacsvwr2 dockerd[78392]: time="2024-03-22T19:43:37.517399652+05:00" level=info msg="waiting 0s before registering session" module=node/agent node.id=3vvd52294czmefaredb1i8hjt
Mar 22 19:43:37 pacsvwr2 dockerd[78392]: time="2024-03-22T19:43:37.517486544+05:00" level=debug msg="(*session).start" module=node/agent node.id=3vvd52294czmefaredb1i8hjt
Mar 22 19:43:42 pacsvwr2 dockerd[78392]: time="2024-03-22T19:43:42.518528714+05:00" level=error msg="agent: session failed" backoff=100ms error="session initiation timed out" module=node/agent node.id=3vvd52294czmefaredb1i8hjt
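
(As a side note, the stuck deployment can also be inspected from the manager; a small sketch, where <service-id> is a placeholder for whatever docker service ls reports for the new service — docker service ps should show in which state the task is stuck:)

docker service ls
docker service ps --no-trunc <service-id>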

However, the connection seems OK:

worker$ telnet 172.27.32.18 2377
Trying 172.27.32.18...
Connected to 172.27.32.18.

Also, the request seems to arrive at the manager, as I can see in the manager logs:

Mar 22 19:55:04 pacslb2 dockerd[92544]: time="2024-03-22T19:55:04.298506119+05:00" level=info msg="worker 3vvd52294czmefaredb1i8hjt was successfully registered" method="(*Dispatcher).register"
Mar 22 19:55:09 pacslb2 dockerd[92544]: time="2024-03-22T19:55:09.736424292+05:00" level=debug msg="node status updated" method="(*Dispatcher).processUpdates" module=dispatcher node.id=3vvd52294czmefaredb1i8hjt
Mar 22 19:55:09 pacslb2 dockerd[92544]: time="2024-03-22T19:55:09.738386505+05:00" level=info msg="worker 3vvd52294czmefaredb1i8hjt was successfully registered" method="(*Dispatcher).register"
Mar 22 19:55:18 pacslb2 dockerd[92544]: time="2024-03-22T19:55:18.801240679+05:00" level=debug msg="node status updated" method="(*Dispatcher).processUpdates" module=dispatcher node.id=3vvd52294czmefaredb1i8hjt
Mar 22 19:55:18 pacslb2 dockerd[92544]: time="2024-03-22T19:55:18.804435260+05:00" level=info msg="worker 3vvd52294czmefaredb1i8hjt was successfully registered" method="(*Dispatcher).register"
Mar 22 19:51:58 pacslb2 dockerd[92544]: time="2024-03-22T19:51:58.646783000+05:00" level=debug msg="error handling streaming rpc" error="rpc error: code = Unavailable desc = node 3vvd52294czmefaredb1i8hjt exceeded rate limit count of registrations" rpc=/docker.swarmkit.v1.Dispatcher/Session

What I also see in the manager logs, however, is this:

Mar 22 19:21:04 pacslb2 dockerd[92544]: time="2024-03-22T19:21:04.992107150+05:00" level=debug msg="error handling streaming rpc" error="context canceled" rpc=/docker.swarmkit.v1.Dispatcher/Session

which is likely the session that times out. However, as I’m not an expert, I’m not sure what the underlying problem may be. I followed the advice to check the ports typically used by Docker Swarm, but it seems the traffic is not blocked anywhere on the network level:

  1. Started TCP/UDP listeners on both nodes, e.g.:
docker run --rm -d -p 0.0.0.0:2377:2377 -e TCP_PORT=2377 --name tcp-listener-2377 cjimti/go-echo
docker run --rm -d -p 0.0.0.0:7946:7946 -e TCP_PORT=7946 --name tcp-listener-7946 cjimti/go-echo
docker run --rm -d -p 0.0.0.0:7946:7946/udp -e UDPPORT=7946 --name udp-listener-7946 mendhak/udp-listener
docker run --rm -d -p 0.0.0.0:4789:4789/udp -e UDPPORT=4789 --name udp-listener-4789  mendhak/udp-listener
  2. Tested access mutually (see also the note after the commands below):

… from manager to worker

telnet 172.27.32.16 2377
telnet 172.27.32.16 7946
nc -u 172.27.32.16 7946
nc -u 172.27.32.16 4789

… from worker to manager

telnet 172.27.32.18 2377
telnet 172.27.32.18 7946
nc -u 172.27.32.18 7946
nc -u 172.27.32.18 4789
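
A further sanity check, to make sure the UDP datagrams actually arrive (nc -u can appear to succeed even when packets are dropped), is to read the listener output on the receiving node; a rough sketch, assuming the listener containers print whatever they receive to stdout:

# on the sending node
echo ping-test | nc -u -w1 172.27.32.16 7946
# on the receiving node (172.27.32.16)
docker logs udp-listener-7946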

All of these tests pass. docker info on the worker:

j4care@pacsvwr2:~$ docker info
Client: Docker Engine - Community
 Version:    26.0.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 1
  Running: 0
  Paused: 0
  Stopped: 1
 Images: 7
 Server Version: 26.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: pending
  NodeID: 3vvd52294czmefaredb1i8hjt
  Is Manager: false
  Node Address: 172.27.32.16
  Manager Addresses:
   172.27.32.18:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dd1e886e55dd695541fdcd67420c2888645a495
 runc version: v1.1.10-0-g18a0cb0
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-97-generic
 Operating System: Ubuntu 22.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.34GiB
 Name: pacsvwr2
 ID: R26Z:QLMJ:AOAY:ZICF:Z7N4:7IAW:BEBK:V76A:DGIU:GEFQ:T2MK:JFGR
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 27
  Goroutines: 80
  System Time: 2024-03-22T20:01:37.481520282+05:00
  EventsListeners: 0
 Experimental: false
 Insecure Registries:
  docker.j4care.com
  127.0.0.0/8
 Live Restore Enabled: false

… manager:

j4care@pacslb2:~$ docker info
Client: Docker Engine - Community
 Version:    26.0.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 5
  Running: 5
  Paused: 0
  Stopped: 0
 Images: 8
 Server Version: 26.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: yo1n6sawn8kqpiorqfgz6fwv9
  Is Manager: true
  ClusterID: yrd0zw3zc1edpg7yxrargoaj1
  Managers: 1
  Nodes: 4
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 172.27.32.18
  Manager Addresses:
   172.27.32.18:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dd1e886e55dd695541fdcd67420c2888645a495
 runc version: v1.1.10-0-g18a0cb0
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-97-generic
 Operating System: Ubuntu 22.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.61GiB
 Name: pacslb2
 ID: DMD2:TFQV:U2AI:P6KP:G5OW:PK2F:DNJW:DLYT:MFGD:XRMX:CP4B:MLWJ
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 94
  Goroutines: 275
  System Time: 2024-03-22T20:02:04.347856714+05:00
  EventsListeners: 6
 Experimental: false
 Insecure Registries:
  docker.j4care.com
  127.0.0.0/8
 Live Restore Enabled: false

Do you have any idea what the underlying problem may be? Thanks a lot.

How are your nodes connected? Via a VPN or VLAN? Sometimes there are issues with MTU size. Try a ping both ways with a payload > 1500 bytes.
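
For example, something along these lines, run in both directions between the manager and the problematic worker (<other-node> is a placeholder for the peer’s IP or hostname; -M do forbids fragmentation, and 1472 bytes of payload plus 28 bytes of headers is exactly a 1500-byte packet):

ping -M do -s 1472 <other-node>   # should pass on a plain 1500-MTU path
ping -M do -s 2000 <other-node>   # only gets through if every hop supports jumbo frames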


Hello, thanks for the reply. The issue started after moving the VM to another VM host. I checked the network interfaces (same on the manager and the worker):

j4care@pacslb2:~$ ip addr | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
3: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
4: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 

and I verified the transmission (same results in both directions):

# size = 1500 - works
j4care@pacslb2:~$ ping -s 1472 172.27.32.16
PING 172.27.32.16 (172.27.32.16) 1472(1500) bytes of data.
1480 bytes from 172.27.32.16: icmp_seq=1 ttl=64 time=0.577 ms

# size > 1500 - fails
j4care@pacslb2:~$ ping -s 1500 172.27.32.16
PING 172.27.32.16 (172.27.32.16) 1500(1528) bytes of data.
^C
--- 172.27.32.16 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1010ms

I also verified against another worker that can join the swarm without issues - and the behavior differs!

# size > 1500 - works
j4care@pacslb2:~$ ping -s 1500 172.27.32.12
PING 172.27.32.12 (172.27.32.12) 1500(1528) bytes of data.
1508 bytes from 172.27.32.12: icmp_seq=1 ttl=64 time=0.659 ms

So there is a difference, but I’m not sure whether it is significant with respect to the problem… I tried configuring

... /usr/bin/dockerd --mtu 1400

on both affected hosts, but it did not help. Next I’ll try to find the cause of the 1500-byte limit on the affected worker host and increase it. Do you have any other tips? Thanks!
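
A rough sketch of what I plan to look at next (my-overlay is just a placeholder name, and I’m assuming the standard setup in which dockerd --mtu only affects the default bridge network, while the swarm data path runs as a VXLAN overlay on UDP 4789):

# find out where along the route the MTU drops to 1500
tracepath 172.27.32.16

# overlay networks can get an explicit MTU when they are created:
docker network create -d overlay -o com.docker.network.driver.mtu=1400 my-overlay

# the built-in ingress network would have to be recreated to change its MTU
# (only possible while no services are published through it):
docker network rm ingress
docker network create -d overlay --ingress -o com.docker.network.driver.mtu=1400 ingress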

The issue was resolved by adjusting the MTU in the networking configuration of the underlying infrastructure. The critical change in behavior:

… before the change, a mutual ping with more than 1500 bytes fails between the manager and the affected worker:

j4care@pacslb2:~$ ping -s 1500 pacsvwr2
.. 3 packets transmitted, 0 received, 100% packet loss

j4care@pacsvwr2:~$ ping -s 1500 pacslb2
.. 3 packets transmitted, 0 received, 100% packet loss

… after the change:

j4care@pacslb2:~$ ping -s 1500 pacsvwr2
... 2 packets transmitted, 2 received, 0% packet loss

j4care@pacsvwr2:~$ ping -s 1500 pacslb2
... 2 packets transmitted, 2 received, 0% packet loss

After this the join works:

j4care@pacsvwr2:~$ docker swarm join --token SWMTKN-1-xxx 172.27.32.18:2377
This node joined a swarm as a worker.
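
and the service deployment from earlier can now be verified as well; a quick check (web is just a placeholder service name):

docker service create --name web --constraint node.hostname==pacsvwr2 nginx
docker service ps web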

Thanks a lot again for the help!