Docker swarm timeout gateway

Hi guys, I would appreciate your help.
I have this scenario:
VMWARE:

  • Two VMs

    • Ubuntu1 acts as the Docker Swarm manager and as a router VM
      • Features:
      • It has two NICs (two port groups created on the VMware ESXi host, in the same vSwitch)
      • One NIC with its own public IPv4 address
      • The second NIC has the private IP 192.168.10.254 and acts as the gateway for the second VM
    • The Grafana VM acts as a worker node
      • It has one NIC with a private IP
  • Docker Swarm

    • manager node advertises with private IP 192.168.10.254
    • worker node joins as 192.168.10.1
    • Two stacks
      • traefik: traefik network (overlay)
        • traefik container running on the manager node. traefik web UI reachable from internet
        • traefik as reverse proxy for grafana
      • grafana: traefik network (overlay)
        • container running in worker node
  • Firewall

    • ufw allow 2377, 7946, 4789, 3000 on vxlan (the Grafana UI listens on port 3000)

PROBLEM: The Grafana web UI is not accessible from the traefik container on the manager (curl 192.168.10.1:3000), nor from outside.
Errors seen: GATEWAY TIMEOUT, 502
IMPORTANT: From the traefik container we can ping the Grafana container, but we can’t reach the service: curl grafana_container_ip:3000 fails.

Does anyone have a clue what is happening? I’ve also tried disabling ufw, but I still get the error.
Thanks in advance

If overlay traffic is not working, these are the usual suspects:

  • The firewall needs the following ports to be open on all nodes:
    • Port 2377 TCP for communication with and between manager nodes
    • Port 7946 TCP/UDP for overlay network node discovery
    • Port 4789 UDP (configurable) for overlay network traffic
  • The MTU size is not identical on all nodes
    • ip addr show scope global | grep mtu
  • The nodes don’t share a low-latency network connection
  • Nodes are running in VMs on VMware vSphere with NSX
    • Outgoing traffic to port 4789 UDP is silently dropped as it conflicts with VMware NSX’s communication port for VXLAN
    • Re-create the swarm with a different data-port:
      • docker swarm init --data-path-port=7789
  • Problems with checksum offloading
    • Disable checksum offloading for the network interface (eth0 is a placeholder):
    • ethtool -K eth0 tx-checksum-ip-generic off
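
For reference, the firewall item above could be written as ufw rules roughly like this. It is only a sketch: the 192.168.10.0/24 subnet is an assumption based on the addresses in the question (the netmask was never stated), and the function name is made up for illustration. It prints the rules instead of applying them, so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: print ufw rules for the swarm ports listed above.
# Nothing is applied; review the output, then run the lines with sudo.

# Assumption: both nodes sit in 192.168.10.0/24 (netmask not stated in the thread).
SUBNET="${SUBNET:-192.168.10.0/24}"
# 4789 is the default data path port; set DATA_PORT=7789 if the swarm
# was created with --data-path-port=7789.
DATA_PORT="${DATA_PORT:-4789}"

# Hypothetical helper name, not part of ufw or docker.
swarm_ufw_rules() {
    echo "ufw allow from $SUBNET to any port 2377 proto tcp"
    echo "ufw allow from $SUBNET to any port 7946 proto tcp"
    echo "ufw allow from $SUBNET to any port 7946 proto udp"
    echo "ufw allow from $SUBNET to any port $DATA_PORT proto udp"
}

swarm_ufw_rules
```

The same rules are needed on every node, managers and workers alike.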

Hi Meyay,

  • We have rules allowing all of those ports on the manager and worker nodes.
  • I’ve checked the MTU, and it is the same size on both nodes.
  • We don’t use vSphere; the VMs are running on the same ESXi host.
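
For anyone repeating the MTU check, a small sketch that reduces `ip -o link show` output to "interface mtu" pairs, which makes the values easier to compare across nodes. The helper name is made up, and the parsing assumes the usual iproute2 one-line output format.

```shell
#!/bin/sh
# Sketch: list "interface mtu" pairs from `ip -o link show` output on stdin.

# Hypothetical helper name. Scans each line for the word "mtu" and prints
# the interface name (field 2, trailing colon removed) with the value after it.
mtu_list() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "mtu") {
                name = $2
                sub(/:$/, "", name)
                print name, $(i + 1)
            }
    }'
}

# Typical use on each node (requires iproute2):
#   ip -o link show | mtu_list
```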

I think the main problem is at layer 4, since we can ping the Grafana container (worker node) from the traefik container (manager node). They share the same overlay network (traefik-network), but we can’t curl IP:3000.
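
To make the "ping works, TCP doesn’t" comparison repeatable, something like the following sketch could be used. It assumes bash and coreutils `timeout` are available wherever it is run; the helper name is made up, and the address and port in the comments are the ones from the question.

```shell
#!/usr/bin/env bash
# Sketch: tell "host answers ping" apart from "TCP port accepts connections".

# Hypothetical helper name. Uses bash's /dev/tcp pseudo-device and coreutils
# `timeout`; prints "open" if the connection is established within 3 seconds,
# otherwise "closed-or-filtered".
tcp_check() {
    host="$1"
    port="$2"
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "open"
    else
        echo "closed-or-filtered"
    fi
}

# Example with the worker address and Grafana port from the question:
#   ping -c 3 192.168.10.1        # ICMP reachability
#   tcp_check 192.168.10.1 3000   # TCP reachability
```

If ping succeeds while tcp_check reports closed-or-filtered, ICMP is getting through and TCP is not, which is the symptom described here.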

Update: in the end, ethtool -K eth0 tx-checksum-ip-generic off solved the problem. Thanks a lot!
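
To confirm the setting took effect, the feature list from `ethtool -k` (lowercase k only shows features, uppercase K changes them) can be filtered as in this sketch. The helper name is made up, and eth0 is still a placeholder.

```shell
#!/bin/sh
# Sketch: read `ethtool -k <iface>` output on stdin and print the state of
# tx-checksum-ip-generic (for example "on", "off", or "off [fixed]").

# Hypothetical helper name, not part of ethtool.
offload_state() {
    awk -F': *' '$1 ~ /tx-checksum-ip-generic$/ { print $2 }'
}

# Typical use (eth0 is a placeholder):
#   ethtool -k eth0 | offload_state
```

As far as I know, a change made with ethtool -K is not kept across reboots, so it may need to be reapplied at boot.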

To be honest, it’s a forum effort.

I just shared a text template where I gathered all the root causes and solutions related to swarm networking. :)