Replicated service only available on one host

We have run into a strange situation that we cannot resolve.

We found that our proxy could not connect over the swarm to an nginx container on another node; the connection timed out. When the proxy was accessed on the node where the nginx container was running, it seemed to work fine.

We then ran the test the other way around. With a replicated service on port 80, connections between containers on different nodes always time out, but when you exec into the container and run telnet yourself, everything seems to be fine.

So it happens vice versa as well, or maybe not; we cannot really log it.

We tried this with caddy and traefik with 1 and 2 replicas, and also with nginx; the same thing happens every time: timeouts.

Our Swarm runs over an internal vSwitch, which seems to be unblocked. The ISP is Hetzner (Cloud).

Maybe someone has an idea what could go wrong here.

Help us to help you: which OS and architecture, and which Docker Engine version? Is a firewall enabled on the nodes, or whatever the Hetzner Cloud equivalent of Security Groups is? If so, which ports did you open for traffic among the nodes?
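For reference, Docker's documentation lists the ports a swarm needs open between its nodes: 2377/tcp for cluster management, 7946/tcp and 7946/udp for communication among nodes, and 4789/udp for overlay network traffic. Below is a minimal sketch for probing the TCP ones from one node to another. The function names are made up for illustration, the IP address in the usage comment is a placeholder, and it assumes bash and the coreutils `timeout` command are installed. UDP ports cannot be verified this way.

```shell
# Ports a swarm needs open between its nodes (per the Docker docs).
swarm_ports() {
  printf '%s\n' 2377/tcp 7946/tcp 7946/udp 4789/udp
}

# check_tcp HOST PORT (hypothetical helper): succeeds if a TCP connection
# to HOST:PORT can be opened within 3 seconds. Relies on bash's /dev/tcp.
check_tcp() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example, run on one node against the other node's private IP (placeholder):
#   for p in 2377 7946; do
#     check_tcp 10.0.0.3 "$p" && echo "$p open" || echo "$p closed"
#   done
```

If the TCP ports answer but overlay traffic still times out, that would point towards UDP 4789 (VXLAN) rather than the control plane.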

Both nodes run a fully upgraded Ubuntu 21.04 with the following Docker version:

# docker version
Client: Docker Engine - Community
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        f0df350
 Built:             Wed Jun  2 11:56:38 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:54:50 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Firewalls are off. The only iptables rules are the ones managed by the Docker setup; there are no external firewalls on the network or configured at Hetzner.

What I do see is that one node has two rules under Chain INPUT (policy ACCEPT) that the other node doesn’t have. The rest is the same. I wonder why one of the two nodes lacks these lines: both are managers, both nodes are active, and the cluster is active.

# iptables -L
# Warning: iptables-legacy tables present, use iptables-legacy to see them
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere             policy match dir in pol ipsec udp dpt:4789 u32 "0x0>>0x16&0x3c@0xc&0xffffff00=0x100700"
DROP       udp  --  anywhere             anywhere             udp dpt:4789 u32 "0x0>>0x16&0x3c@0xc&0xffffff00=0x100700"

Chain FORWARD (policy DROP)
target     prot opt source               destination
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-INGRESS  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain DOCKER (4 references)
target     prot opt source               destination

Chain DOCKER-INGRESS (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:https
ACCEPT     tcp  --  anywhere             anywhere             state RELATED,ESTABLISHED tcp spt:https
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:http
ACCEPT     tcp  --  anywhere             anywhere             state RELATED,ESTABLISHED tcp spt:http
RETURN     all  --  anywhere             anywhere

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target     prot opt source               destination
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-ISOLATION-STAGE-2 (4 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-USER (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere
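To pin down exactly which rules differ between the two nodes, the INPUT chain can be extracted from each listing and compared. A rough sketch, assuming the `iptables -L` output format shown above; `chain_rules` is a hypothetical helper name, not part of iptables:

```shell
# chain_rules CHAIN: reads `iptables -L` output on stdin and prints only
# the rule lines of the named chain (no chain header, no column header).
chain_rules() {
  awk -v chain="$1" '
    $1 == "Chain" { in_chain = ($2 == chain); next }  # track the current chain
    in_chain && NF && $1 != "target" { print }        # skip blanks and column header
  '
}

# Usage on each node:
#   iptables -L | chain_rules INPUT > "input-$(hostname).txt"
# then copy both files to one machine and run diff on them.
```

`iptables -S INPUT` gives similar information in rule-specification form, which can be easier to compare by eye.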

You are right. According to https://docs.hetzner.com/cloud/networks/connect-dedi-vswitch/ there is no such thing as security groups at the cloud provider level, and all devices should be able to communicate freely.

I have no experience with how Docker behaves on Ubuntu 21.04. At least in the early days of Ubuntu 20.04, people had the same problem, caused by nftables being the default. Though your quoted output gives the impression that you use iptables-legacy (which I assume is the good old iptables).
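One way to check which backend is actually in use: since iptables 1.8, `iptables --version` includes the backend in parentheses, for example `iptables v1.8.7 (nf_tables)` or `(legacy)`. As far as I know, the warning about iptables-legacy tables in the quoted output is printed by the nf_tables variant, so the version string is worth checking on both nodes. A small sketch (the function name is hypothetical):

```shell
# iptables_backend: reads the output of `iptables --version` on stdin and
# prints nf_tables, legacy, or unknown (versions before 1.8 name no backend).
iptables_backend() {
  case "$(cat)" in
    *nf_tables*) echo nf_tables ;;
    *legacy*)    echo legacy ;;
    *)           echo unknown ;;
  esac
}

# Usage on a node:
#   iptables --version | iptables_backend
```

On Ubuntu, `update-alternatives --display iptables` should also show which variant the `iptables` command currently points to.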

I am afraid this one is for somebody else to answer.

Thanks for the update. Good to hear it has happened before; that is something to investigate! Let’s see what others come up with. Thanks again!

I have created a bug report: No traffic between Swarm nodes for containers -> timeout · Issue #1265 · docker/for-linux · GitHub