Docker Swarm: Master and Worker not communicating properly

Hello,
I have a customer with a SUSE 12 SP5 deployment running Docker 19.03.15. The firewall appears to be disabled and there is no proxy in place.

They have 2 nodes in the swarm:
HostA (Master)
HostB (Worker)

They deploy a container like NGINX as a service in the swarm using a YML like:

version: "3.3"
services:
  nginx:
    image: nginx:latest
    deploy:
      replicas: 1
    ports:
      - "9066:80"

The problem is that they can only reach NGINX on the node on which the container is actually running.

If “docker ps” shows the container running on HostA, then:
an HTTP request to HostA:9066 succeeds
an HTTP request to HostB:9066 times out.

If “docker ps” shows the container running on HostB, then:
an HTTP request to HostB:9066 succeeds
an HTTP request to HostA:9066 times out.

There appear to be no obvious errors in /var/log/messages or in dmesg…

We’ve checked the following (a rough sketch of the commands is included after the list):

  • IPv4 port forwarding is enabled
  • the swarm was initialized and joined using --advertise-addr (no change in behavior)
  • the worker node was promoted to manager (no change in behavior)
  • the latest version of Docker was tested (no change in behavior)
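
For reference, this is roughly what those checks looked like (the manager IP and worker hostname are placeholders):

# verify IPv4 forwarding on both hosts
sysctl net.ipv4.ip_forward          # expect: net.ipv4.ip_forward = 1

# re-initialize the swarm with an explicit advertise address (on the manager)
docker swarm init --advertise-addr <manager-ip>

# promote the worker to a manager (run on a manager node)
docker node promote <worker-hostname>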

Something appears to be blocking the swarm’s internal networking from reaching services that are running on the other machine.

Can anyone recommend steps to debug this issue?

Please share how you deploy this compose file. Additionally, you might want to check whether the required kernel modules are available on the hosts: https://raw.githubusercontent.com/moby/moby/master/contrib/check-config.sh.
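
If curl is available on the hosts, fetching and running the script could look like this:

curl -fsSL https://raw.githubusercontent.com/moby/moby/master/contrib/check-config.sh -o check-config.sh
chmod +x check-config.sh
./check-config.sh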

N.B.: with two nodes, please avoid making both of them managers! That would render the cluster headless whenever either node is unavailable. With two nodes, a single manager plus a worker is recommended. With 3 nodes, I would always prefer a manager-only cluster, as it can tolerate the outage of one node.

To deploy the compose file we executed the following:

Master node:
docker swarm init

Worker node:
docker swarm join …

Master node:
docker stack deploy -c nginx_compose.yml mystack
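
Afterwards, the state of the swarm and of the stack can be verified with standard commands like these (mystack_nginx is the service name Docker derives from the stack and service names above):

# on the manager: confirm both nodes are part of the swarm and ready
docker node ls

# confirm the service is running and on which node its task was scheduled
docker service ls
docker service ps mystack_nginx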

Thanks for the link to the check-config.sh file. I will run that and see the output!

Thank you!
Shyam

Hi Meyay,
I have run the check-config.sh script, here is the output:

MYMACHINE:/scratch/docker # ./check-config.sh
info: reading kernel config from /proc/config.gz ...

Generally Necessary:
- cgroup hierarchy: properly mounted [/sys/fs/cgroup]
- apparmor: enabled and tools installed
- CONFIG_NAMESPACES: enabled
- CONFIG_NET_NS: enabled
- CONFIG_PID_NS: enabled
- CONFIG_IPC_NS: enabled
- CONFIG_UTS_NS: enabled
- CONFIG_CGROUPS: enabled
- CONFIG_CGROUP_CPUACCT: enabled
- CONFIG_CGROUP_DEVICE: enabled
- CONFIG_CGROUP_FREEZER: enabled
- CONFIG_CGROUP_SCHED: enabled
- CONFIG_CPUSETS: enabled
- CONFIG_MEMCG: enabled
- CONFIG_KEYS: enabled
- CONFIG_VETH: enabled (as module)
- CONFIG_BRIDGE: enabled (as module)
- CONFIG_BRIDGE_NETFILTER: enabled (as module)
- CONFIG_IP_NF_FILTER: enabled (as module)
- CONFIG_IP_NF_TARGET_MASQUERADE: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_CONNTRACK: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_IPVS: enabled (as module)
- CONFIG_NETFILTER_XT_MARK: enabled (as module)
- CONFIG_IP_NF_NAT: enabled (as module)
- CONFIG_NF_NAT: enabled (as module)
- CONFIG_POSIX_MQUEUE: enabled
- CONFIG_NF_NAT_IPV4: enabled (as module)
- CONFIG_NF_NAT_NEEDED: enabled

Optional Features:
- CONFIG_USER_NS: enabled
- CONFIG_SECCOMP: enabled
- CONFIG_SECCOMP_FILTER: enabled
- CONFIG_CGROUP_PIDS: enabled
- CONFIG_MEMCG_SWAP: enabled
- CONFIG_MEMCG_SWAP_ENABLED: missing
    (cgroup swap accounting is currently not enabled, you can enable it by setting boot option "swapaccount=1")
- CONFIG_LEGACY_VSYSCALL_EMULATE: enabled
- CONFIG_IOSCHED_CFQ: enabled
- CONFIG_CFQ_GROUP_IOSCHED: enabled
- CONFIG_BLK_CGROUP: enabled
- CONFIG_BLK_DEV_THROTTLING: enabled
- CONFIG_CGROUP_PERF: enabled
- CONFIG_CGROUP_HUGETLB: enabled
- CONFIG_NET_CLS_CGROUP: enabled (as module)
- CONFIG_CGROUP_NET_PRIO: enabled
- CONFIG_CFS_BANDWIDTH: enabled
- CONFIG_FAIR_GROUP_SCHED: enabled
- CONFIG_RT_GROUP_SCHED: enabled
- CONFIG_IP_NF_TARGET_REDIRECT: enabled (as module)
- CONFIG_IP_VS: enabled (as module)
- CONFIG_IP_VS_NFCT: enabled
- CONFIG_IP_VS_PROTO_TCP: enabled
- CONFIG_IP_VS_PROTO_UDP: enabled
- CONFIG_IP_VS_RR: enabled (as module)
- CONFIG_SECURITY_SELINUX: enabled
- CONFIG_SECURITY_APPARMOR: enabled
- CONFIG_EXT4_FS: enabled (as module)
- CONFIG_EXT4_FS_POSIX_ACL: enabled
- CONFIG_EXT4_FS_SECURITY: enabled
- Network Drivers:
  - "overlay":
    - CONFIG_VXLAN: enabled (as module)
    - CONFIG_BRIDGE_VLAN_FILTERING: enabled
      Optional (for encrypted networks):
      - CONFIG_CRYPTO: enabled
      - CONFIG_CRYPTO_AEAD: enabled
      - CONFIG_CRYPTO_GCM: enabled (as module)
      - CONFIG_CRYPTO_SEQIV: enabled
      - CONFIG_CRYPTO_GHASH: enabled (as module)
      - CONFIG_XFRM: enabled
      - CONFIG_XFRM_USER: enabled (as module)
      - CONFIG_XFRM_ALGO: enabled (as module)
      - CONFIG_INET_ESP: enabled (as module)
      - CONFIG_INET_XFRM_MODE_TRANSPORT: enabled (as module)
  - "ipvlan":
    - CONFIG_IPVLAN: enabled (as module)
  - "macvlan":
    - CONFIG_MACVLAN: enabled (as module)
    - CONFIG_DUMMY: enabled (as module)
  - "ftp,tftp client in container":
    - CONFIG_NF_NAT_FTP: enabled (as module)
    - CONFIG_NF_CONNTRACK_FTP: enabled (as module)
    - CONFIG_NF_NAT_TFTP: enabled (as module)
    - CONFIG_NF_CONNTRACK_TFTP: enabled (as module)
- Storage Drivers:
  - "aufs":
    - CONFIG_AUFS_FS: missing
  - "btrfs":
    - CONFIG_BTRFS_FS: enabled (as module)
    - CONFIG_BTRFS_FS_POSIX_ACL: enabled
  - "devicemapper":
    - CONFIG_BLK_DEV_DM: enabled (as module)
    - CONFIG_DM_THIN_PROVISIONING: enabled (as module)
  - "overlay":
    - CONFIG_OVERLAY_FS: enabled (as module)
  - "zfs":
    - /dev/zfs: missing
    - zfs command: missing
    - zpool command: missing

Limits:
- /proc/sys/kernel/keys/root_maxkeys: 1000000

The kernel modules look fine.
All modules underneath “overlay” are present → so it doesn’t seem the kernel is the issue.

Make sure the overlay network created by the deployment does not overlap with networks in your environment.
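
To compare the subnets, you could inspect the Docker-managed networks and the host routes, for example:

# subnets used by the swarm-related networks
docker network inspect ingress --format '{{json .IPAM.Config}}'
docker network inspect docker_gwbridge --format '{{json .IPAM.Config}}'

# subnets/routes already present on the host
ip route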

As a next step, I would create a compose file and check whether container-to-container communication over their shared overlay network works.

Something like this should do the trick:

version: "3.8"
services:
  netshoot1:
    image:  nicolaka/netshoot:latest
    deploy:
      replicas: 1
      placement:
        constraints:
          - "node.role==manager"
  netshoot2:
    image: nicolaka/netshoot:latest
    deploy:
      replicas: 1
      placement:
        constraints:
          - "node.role==worker"

Then exec into the netshoot1 container and try to ping netshoot2, and vice versa (see the sketch below). If this doesn’t work, make sure that the swarm ports (2377/tcp, 7946/tcp and udp, 4789/udp) are not blocked between the nodes.
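
Assuming the stack was deployed as mystack, the test could look like this (the container name will differ on your system):

# on the manager node: find the running netshoot1 task container
docker ps --filter name=netshoot1 --format '{{.Names}}'

# exec into it and ping the tasks of the other service across the overlay network
# (tasks.<service> resolves to the task container IPs instead of the service VIP)
docker exec -it <netshoot1-container-name> ping -c 3 tasks.mystack_netshoot2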

If the container-to-container test worked, undeploy all stacks and containers that depend on the ingress network, inspect the ingress network to note its parameters, delete the existing ingress network, and create a new one using the parameters you noted before.
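
Noting the parameters and removing the old ingress network could look like this (the subnet and gateway further down are only example values, use whatever the inspect output shows for your environment):

# note the subnet, gateway and other options of the current ingress network
docker network inspect ingress

# remove it (only possible once no service uses it anymore)
docker network rm ingress

The creation of the new ingress network should then look something like this: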

 docker network create \
  --driver overlay \
  --ingress \
  --subnet=10.11.0.0/16 \
  --gateway=10.11.0.2 \
  ingress

Then try to deploy your stack again and see if it works.

Hello @meyay, thanks so much for your prompt responses!

We have tested the scenario you described - indeed, containers running on the manager cannot ping containers running on the worker.

I have requested that the networking admins double-check that those 3 ports are fully open between the two machines.
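
In the meantime, something like this could be used to spot-check reachability from each node (run the mirror-image commands from HostB towards HostA; the UDP checks are only indicative, since nc cannot reliably verify UDP ports):

# from HostA towards HostB
nc -zv HostB 2377       # cluster management traffic (TCP)
nc -zv HostB 7946       # node-to-node communication (TCP)
nc -zvu HostB 7946      # node-to-node communication (UDP)
nc -zvu HostB 4789      # overlay/VXLAN data traffic (UDP)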

Thank you!!!
Shyam

Hello @meyay,

We’ve confirmed that both machines have SuSEfirewall2 disabled.

Both VMs are in the same VLAN, and there is no firewall within the VLAN.
The internal Linux firewall is also disabled.

MASTER:~ # systemctl status SuSEfirewall2
● SuSEfirewall2.service - SuSEfirewall2 phase 2
Loaded: loaded (/usr/lib/systemd/system/SuSEfirewall2.service; disabled; vendor preset: disabled)
Active: inactive (dead)

WORKER:~ # systemctl status SuSEfirewall2
● SuSEfirewall2.service - SuSEfirewall2 phase 2
Loaded: loaded (/usr/lib/systemd/system/SuSEfirewall2.service; disabled; vendor preset: disabled)
Active: inactive (dead)

Any other suggestions?

Thanks,
Shyam

There is not much left to suggest…

  • make sure the nodes are connected over a low-latency network connection
  • make sure there is no additional firewall on the VM level
  • check whether the overlay networks collide with a local LAN

For whatever reason, I remember having problems in the past with missing settings on vSwitches/port groups on ESXi, which affected network functionality in VMs. I had to allow promiscuous mode, forged transmits and MAC address changes in order to make something work, but I don’t recall whether it was for overlay networks or something else. Actually, I would expect VXLAN to encapsulate the traffic of a Docker overlay network so that it is completely hidden from the physical network layer.
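
If you want to verify whether the overlay traffic actually leaves one node and reaches the other, you could capture the VXLAN traffic on both hosts while repeating the ping test (eth0 is an assumption, substitute the actual NIC name):

# watch for swarm overlay (VXLAN) traffic on the default data port
tcpdump -nn -i eth0 udp port 4789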