I’ve recently tried to deploy a new swarm (Docker version 27.1.1, build 6312585) on guest machines hosted, for now, on a single bare-metal host.
The virtual machines (KVM/QEMU) are running Debian 12 (Bookworm).
They are configured with bridged networking; here is the relevant snippet of the host’s /etc/network/interfaces:
auto br0
iface br0 inet static
address <HOST_IP>
... #(LAN CONFIG)
bridge_ports <HOST_INTERFACE>
bridge_stp off # disable Spanning Tree Protocol
bridge_waitport 0 # no delay before a port becomes available
bridge_fd 0 # no forwarding delay
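For reference, the bridge wiring can be double-checked on the host with iproute2. This is just a guarded sketch: it prints a note instead of failing if br0 doesn’t exist on the machine where you run it.

```shell
#!/bin/sh
# Sanity-check the host bridge: does br0 exist, and are the guest
# interfaces (tap/vnet devices) actually enslaved to it?
if ip link show br0 >/dev/null 2>&1; then
  ip -br addr show br0            # bridge state and its address
  bridge link show | grep br0 || true   # ports enslaved to br0
else
  echo "bridge br0 not present on this machine"
fi
```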
Because parts of the default swarm address pool (10.0.0.0/8) are used in our WAN, I needed to change that pool (see the command below).
Here is the command I used to initialize the swarm:
docker swarm init --advertise-addr <MASTER_IP> --default-addr-pool 192.168.0.0/16 --default-addr-pool-mask-length 24
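After the init, I verified that the custom pool was actually applied. A guarded sketch (it degrades to a note when no daemon is reachable; I’m assuming here that recent docker info versions print “Default Address Pool” and “SubnetSize” lines under the Swarm section when run on a manager):

```shell
#!/bin/sh
# Confirm the swarm took the custom address pool instead of 10.0.0.0/8.
if docker info >/dev/null 2>&1; then
  docker info 2>/dev/null | grep -iE 'default address pool|subnetsize' \
    || echo "no address-pool lines found (node may not be a swarm manager)"
else
  echo "docker daemon not reachable from here"
fi
```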
And here is the command I used to join a worker to said swarm:
docker swarm join --token <WORKER_INV_TKN> <MASTER_IP>:2377 --advertise-addr <WORKER_IP> --listen-addr <WORKER_IP>
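After the join, membership can be confirmed from the manager (guarded the same way; the format fields are the standard ones accepted by docker node ls):

```shell
#!/bin/sh
# List swarm members as seen by the manager; every node should be Ready/Active.
if docker node ls >/dev/null 2>&1; then
  docker node ls --format '{{.Hostname}}: {{.Status}} / {{.Availability}}'
else
  echo "not a reachable swarm manager"
fi
```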
The worker seems to join the swarm (containers are dispatched to it).
But I’m facing timeouts when containers running on the master try to communicate with containers running on the worker, even when they are on the same overlay network.
For diagnostic purposes, the firewall is disabled on both guests.
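Even with the guests’ firewalls down, the host or hypervisor could still be filtering the swarm ports, so I probed them explicitly. Assumptions in this sketch: OpenBSD-style nc is installed, and TARGET is set to the real manager IP (the default below is a TEST-NET placeholder). The UDP results are only indicative, since a UDP probe can’t distinguish “open” from “silently dropped”.

```shell
#!/bin/sh
# Probe the ports Swarm needs between every pair of nodes:
#   2377/tcp      cluster management (managers only)
#   7946/tcp+udp  node-to-node gossip
#   4789/udp      VXLAN overlay data path (usual culprit for container timeouts)
TARGET="${TARGET:-192.0.2.10}"   # placeholder (TEST-NET); set TARGET=<MASTER_IP>
command -v nc >/dev/null 2>&1 || { echo "nc not installed"; exit 0; }
for p in 2377 7946; do
  nc -z -w 2 "$TARGET" "$p" && echo "tcp/$p open" || echo "tcp/$p unreachable"
done
for p in 7946 4789; do
  nc -u -z -w 2 "$TARGET" "$p" && echo "udp/$p no error (indicative only)" \
    || echo "udp/$p unreachable"
done
```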
Here is the output of traceroute in both directions (no intermediate hops, from what I can tell).
worker => master:
traceroute to <MASTER_IP> (<MASTER_IP>), 30 hops max, 60 byte packets
1 <MASTER_FQDN> (<MASTER_IP>) 0.794 ms 0.659 ms 0.590 ms
master => worker
traceroute to <WORKER_IP> (<WORKER_IP>), 30 hops max, 60 byte packets
1 <WORKER_FQDN> (<WORKER_IP>) 0.704 ms 0.562 ms 0.492 ms
I ran docker node inspect <WORKER_ID> --pretty
on the master guest, and the output was unexpected:
...
Status:
State: Ready
Availability: Active
Address: <HOST_IP>
...
Note the “<HOST_IP>”, which I didn’t expect to appear in any of the swarm config / settings.
The same thing happens when joining a worker node from AND to another bare-metal machine (the worker always ends up with this same <HOST_IP> as Status.Address in all three cases, which is very weird).
I would expect the following output instead:
...
Status:
State: Ready
Availability: Active
Address: <WORKER_IP>
...
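To compare what the manager has recorded for every node in one shot, the field behind the pretty “Status / Address” line can be pulled with a Go template (a guarded sketch; .Status.Addr and .Description.Hostname are the JSON fields shown by a plain docker node inspect):

```shell
#!/bin/sh
# Print the address the manager has recorded for each node
# (the value rendered as "Status / Address" by --pretty).
if docker node ls >/dev/null 2>&1; then
  for id in $(docker node ls -q); do
    docker node inspect "$id" \
      --format '{{.Description.Hostname}} -> {{.Status.Addr}}'
  done
else
  echo "not a reachable swarm manager"
fi
```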
When joining the other guest as a second manager, I obtain the following result:
...
Status:
State: Ready
Availability: Active
Address: <HOST_IP>
Manager Status:
Address: <OTHER_MASTER_IP>:2377
Raft Status: Reachable
Leader: No
...
This is even weirder: the “Manager Status.Address” is correct, but the “Status.Address” is still wrong.
The inspection output for the first master node seems fine:
...
Status:
State: Ready
Availability: Active
Address: <MASTER_IP>
Manager Status:
Address: <MASTER_IP>:2377
Raft Status: Reachable
Leader: Yes
...
I’m pretty sure this bridge config is messing things up, but another factor could also be in play:
the hosting bare-metal machine (<HOST_IP>) is currently also participating in another swarm for the time being.
I’ve tried to be as brief as possible, so ask if you need more info to help with the diagnosis.
Thank you for your time.