Service High Availability connectivity issue, Docker Swarm (3 managers, 1 worker)

Greetings,
We need help with Docker networking, specifically with Docker Swarm.

We aim to set up a highly available (HA) Traefik Proxy as the entry point to our system from the public network over ports 80 and 443 (HTTP/HTTPS).
Traefik will be deployed on three manager nodes, and we will use a load balancer in front of our network to distribute traffic. If one manager node goes down, the others should handle incoming requests.

Our current VM setup is Docker Swarm with 3 manager nodes (Traefik) and 1 worker node (nginx, for now).

On the manager nodes we deployed Traefik as a Docker Swarm service in global deployment mode, and on the worker node we deployed nginx, which will serve our page.
Traefik on all nodes has ports 80 and 443 exposed to the host, so incoming HTTPS requests are passed to Traefik, which then forwards them to the nginx service on the worker node.

The problem we are facing is that only one Traefik instance successfully accepts connections and forwards them to the nginx service.
If we route all HTTPS requests to the Traefik instance on manager node A, everything works as expected. However, if we route requests to the Traefik instance on manager node B, the requests are not forwarded to the nginx service on the worker node, resulting in a “Bad Gateway” error.

I can ping the nginx service from all Traefik containers, but if I try to use telnet to connect to ports 80/443, the connection works only from the Traefik instance on manager node A.
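
Ping only proves that ICMP gets through, so the TCP check is the one worth repeating on every manager. A minimal sketch of how that could be scripted, assuming the Traefik image ships BusyBox wget and that the container name starts with traefik_traefik (the stack and service name used here); the helper name probe_from_traefik is made up for illustration:

```shell
# Send one HTTP request to the nginx task from inside the local Traefik container.
# Assumption: the Traefik image provides BusyBox wget (-q quiet, -T timeout, -O output file).
# Note: wget also reports failure for HTTP error responses, not only for TCP failures.
probe_from_traefik() {
  target=$1   # overlay IP and port of the nginx task, e.g. 10.0.9.8:80
  cid=$(docker ps -q --filter name=traefik_traefik | head -n 1)
  if [ -z "$cid" ]; then
    echo "no Traefik container found on this node"
    return 1
  fi
  if docker exec "$cid" wget -q -T 3 -O /dev/null "http://$target/"; then
    echo "HTTP request to $target succeeded"
  else
    echo "HTTP request to $target failed"
  fi
}

# Example, run on each manager node:
#   probe_from_traefik 10.0.9.8:80
```

If the request succeeds on manager A and fails on managers B and C while ping works everywhere, that would point at TCP traffic on the overlay network rather than at the Traefik configuration.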

Stack file:

version: '3.3'
services:
  traefik:
    image: traefik:v3.2.0
    ports:
      - mode: host
        protocol: tcp
        published: 443
        target: 443
      - mode: host
        protocol: tcp
        published: 80
        target: 80
      - mode: host
        protocol: tcp
        published: 5432
        target: 5432
    deploy:
      mode: global
      placement:
        constraints:
          - "node.role==manager"
      labels:
        - traefik.enable=true
        - traefik.docker.network=traefik_traefik-public
        - traefik.http.middlewares.https-redirect.redirectscheme.scheme=https
        - traefik.http.middlewares.https-redirect.redirectscheme.permanent=true
        - traefik.http.routers.traefik-public-http.rule=Host(`traefik.dev.si`)
        - traefik.http.routers.traefik-public-http.entrypoints=http
        - traefik.http.routers.traefik-public-http.middlewares=https-redirect
        - traefik.http.routers.traefik-public-https.rule=Host(`traefik.dev.si`)
        - traefik.http.routers.traefik-public-https.entrypoints=https
        - traefik.http.routers.traefik-public-https.tls=true
        - traefik.http.routers.traefik-public-https.service=api@internal
        - traefik.http.services.traefik-public.loadbalancer.server.port=8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - traefik-public

networks:
  traefik-public:
    driver: overlay
    ipam:
      config:
        - subnet: 10.0.10.0/24

Nginx stack file:

version: '3.3'
services:
  nginx:
    image: nginx
    networks:
      - traefik-public
    container_name: nginx
    volumes:
      - app-tmp-public:/var/www/gms/current/public/tmp
    logging:
      driver: json-file
      options:
        max-size: 50m
        max-file: 3
    deploy:
      placement:
        constraints:
          - "node.role!=manager"
      labels:
        traefik.enable: 'true'
        traefik.http.routers.dev-web.rule: 'Host(`test.dev.net`)'
        traefik.http.services.dev-web.loadbalancer.server.port: '80'
        traefik.http.routers.dev-web.entrypoints: 'https'
        traefik.http.routers.dev-web.tls: 'true'
      update_config:
        order: start-first
        failure_action: rollback
        delay: 5s

Docker Swarm Cluster setup
Manager A (Traefik :80 :443)
Manager B (Traefik :80 :443)
Manager C (Traefik :80 :443)
Worker A (nginx)

Overlay network: traefik-public

Works
Public → Manager A Traefik → Worker A nginx
Doesn’t work
Public → Manager B Traefik → Worker A nginx

Already tried, but none of it worked:

  • create overlay network outside of the stack
  • map ports as ingress to host
  • completely new VMs for managers
  • ping MTU

Some screenshots of my testing.
On the left side of the screenshot is manager A, where connectivity works, and on the right side is manager B, where connectivity doesn’t work.
The ping, telnet, and traceroute commands were executed inside the Traefik containers on manager nodes A and B.

Is your setup with VMs the final one, or do you just use VMs for testing?

We are trying to set this up on our development cluster first, so we can change the VM settings.

Please do not share text content as a screenshot, as it makes it unnecessarily hard to read. Even if I wanted to, I cannot read anything on my 14" 2880x1800 pixel screen, so I have no idea what the screenshots show. Please use code blocks instead, like you did for your compose files!

The Traefik compose file looks fine. Is it possible you modified the configuration of the traefik-public network after creation? If this is the case, then you might have an inconsistent configuration amongst the nodes.

Does it generally work to access Traefik using the host IPs and ports on each of the manager nodes?
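
Something along these lines could answer that from a machine outside the swarm. It is only a sketch: the helper name probe_traefik and the 192.0.2.x addresses are placeholders, and test.dev.net is the host from your nginx router rule. The --resolve option pins the hostname to one node IP, so the Host header and SNI still match the router:

```shell
# Print "<node-ip> <http-status>" for the Traefik instance listening on one node.
# -k skips certificate verification, since Traefik may serve its default certificate.
# A status of 000 means curl could not complete the request at all.
probe_traefik() {
  probe_ip=$1     # host IP of the manager node to test
  probe_host=$2   # hostname used in the router rule
  probe_code=$(curl -sk -o /dev/null --max-time 5 \
    --resolve "$probe_host:443:$probe_ip" \
    -w '%{http_code}' "https://$probe_host/")
  printf '%s %s\n' "$probe_ip" "$probe_code"
}

# Example (replace the placeholder IPs with the host IPs of managers A, B and C):
#   for ip in 192.0.2.11 192.0.2.12 192.0.2.13; do probe_traefik "$ip" test.dev.net; done
```

A 502 from a node would mean Traefik itself is reachable there and the failure is on the way to the backend.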

I shared such a big screenshot because I’m a new user and can share only one file, so I merged everything together.
The screenshot content is shared as text snippets below.

Docker network inspect Manager A:

[
    {
        "Name": "traefik-proxy",
        "Id": "u4qoeqty6t31zujakfgqsgzxj",
        "Created": "2024-12-16T07:17:47.885975405+01:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.9.0/24",
                    "Gateway": "10.0.9.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "61bf4da41b2d4b65f0729d327fa44ff9bc90dc8ae0c17b59854b585c4ffc69b6": {
                "Name": "traefik_traefik.wkoismyy4uwj8j3fmyo3oyiti.lm5b6ggp8blhonotcfgfsiuzp",
                "EndpointID": "4f3014c1cbc08a89898375e3896b04dcc1cabfe0158f09e8fadb04643c37207e",
                "MacAddress": "02:42:0a:00:09:05",
                "IPv4Address": "10.0.9.5/24",
                "IPv6Address": ""
            },
            "e298a35656736d8519b8de70253a906d85d19c02a4fce0186781f456365b5f98": {
                "Name": "mol_dev_nginx.1.kic9tolqrrd14gyf4gb8pshhg",
                "EndpointID": "86f237f4359e8ee5317f18eeb3426ae3f464fe5ce704f6c91f673ab2fc389296",
                "MacAddress": "02:42:0a:00:09:33",
                "IPv4Address": "10.0.9.51/24",
                "IPv6Address": ""
            },
            "lb-traefik-proxy": {
                "Name": "traefik-proxy-endpoint",
                "EndpointID": "acf0ba0cc8ecf83f5bfaf57f78d8030b776ad8336939991e2fe89e701711cbd4",
                "MacAddress": "02:42:0a:00:09:06",
                "IPv4Address": "10.0.9.6/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4108"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "fe1bf02c3697",
                "IP": "192.168.17.224"
            },
            {
                "Name": "a04d96cfec00",
                "IP": "192.168.16.176"
            },
            {
                "Name": "43a9eaf08e81",
                "IP": "192.168.17.223"
            }
        ]
    }
]

Docker network inspect Manager B

[
    {
        "Name": "traefik-proxy",
        "Id": "u4qoeqty6t31zujakfgqsgzxj",
        "Created": "2024-12-13T11:29:40.382389485Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.9.0/24",
                    "Gateway": "10.0.9.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "9bdf7a21ac30864a5d0491fc2982a2f8553f1eb02d8b83b0fc50208cd802fb30": {
                "Name": "traefik_traefik.iis711gwu7d0nmks750zx65l0.2k1on9luprvv9mjoz3p57gb0k",
                "EndpointID": "ccd3ec7db1bb25fe2b12f4b3ae9ee5997374db4b3bf6c0becd4987a9291f94a5",
                "MacAddress": "02:42:0a:00:09:0c",
                "IPv4Address": "10.0.9.12/24",
                "IPv6Address": ""
            },
            "lb-traefik-proxy": {
                "Name": "traefik-proxy-endpoint",
                "EndpointID": "4d608ca5e5b4cf48ac431f6ed9c587ebd853e5d2df835548991aac52ef7cd958",
                "MacAddress": "02:42:0a:00:09:0d",
                "IPv4Address": "10.0.9.13/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4108"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "43a9eaf08e81",
                "IP": "192.168.17.223"
            },
            {
                "Name": "a04d96cfec00",
                "IP": "192.168.16.176"
            },
            {
                "Name": "fe1bf02c3697",
                "IP": "192.168.17.224"
            }
        ]
    }
]

Docker network inspect Worker A

[
    {
        "Name": "traefik-proxy",
        "Id": "u4qoeqty6t31zujakfgqsgzxj",
        "Created": "2024-12-13T12:11:40.060596999+01:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.9.0/24",
                    "Gateway": "10.0.9.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "62c1385c46cca308a6d4c8a401aad0ab2dfd717faa2a4e3f6e8c0a8fa353a4ae": {
                "Name": "flycom_dev_nginx.1.8nwyxgy1nrk1y16cfhn5w3jwj",
                "EndpointID": "073167c9aa696aa9e85db8d659f9525963b7d45df0718c05296794e114fbf739",
                "MacAddress": "02:42:0a:00:09:08",
                "IPv4Address": "10.0.9.8/24",
                "IPv6Address": ""
            },
            "lb-traefik-proxy": {
                "Name": "traefik-proxy-endpoint",
                "EndpointID": "ba77dcd4c39c8a1d707dc16333d4598352ad5b5388fad0e4e847660eaebdfc07",
                "MacAddress": "02:42:0a:00:09:09",
                "IPv4Address": "10.0.9.9/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4108"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "a04d96cfec00",
                "IP": "192.168.16.176"
            },
            {
                "Name": "43a9eaf08e81",
                "IP": "192.168.17.223"
            },
            {
                "Name": "fe1bf02c3697",
                "IP": "192.168.17.224"
            }
        ]
    }
]

Test commands Manager A

/ # traceroute 10.0.9.8
traceroute to 10.0.9.8 (10.0.9.8), 30 hops max, 46 byte packets
 1  flycom_dev_nginx.1.8nwyxgy1nrk1y16cfhn5w3jwj.traefik-proxy (10.0.9.8)  0.168 ms  0.112 ms  0.177 ms
/ # ping 10.0.9.8 -s 2000
PING 10.0.9.8 (10.0.9.8): 2000 data bytes
2008 bytes from 10.0.9.8: seq=0 ttl=64 time=0.206 ms
2008 bytes from 10.0.9.8: seq=1 ttl=64 time=0.259 ms
2008 bytes from 10.0.9.8: seq=2 ttl=64 time=0.288 ms
2008 bytes from 10.0.9.8: seq=3 ttl=64 time=0.306 ms
^C
--- 10.0.9.8 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.206/0.264/0.306 ms
/ # telnet 10.0.9.8 80
Connected to 10.0.9.8

Thank you for sharing the content as text.

Forget the remark in my last post about inconsistent network configuration: overlay networks with swarm scope can’t be inconsistent. Though, the part about the config being immutable remains.

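I am missing the answer to this question:

Does it generally work to access Traefik using the host IPs and ports on each of the manager nodes?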

I want to understand whether your problem is with reaching Traefik on each node, or is in the overlay network between Traefik and the target nginx container.

I can reach Traefik on each node.

When I make a request to Traefik on manager B, I see the request in the Traefik log, but it is not forwarded to nginx and the response is Bad Gateway.
So I assume the problem is with the overlay network.

If the overlay traffic is not working, these are the usual suspects:

  • The firewall needs the following ports to be open on all nodes:
    • Port 2377 TCP for communication with and between manager nodes
    • Port 7946 TCP/UDP for overlay network node discovery
    • Port 4789 UDP (configurable) for overlay network traffic
  • The MTU size is not identical on all nodes
    • ip addr show scope global | grep mtu
  • The nodes don’t share a low latency network connection
  • Nodes are running in VMs on VMware vSphere with NSX
    • Outgoing traffic to port 4789 UDP is silently dropped as it conflicts with VMware NSX’s communication port for VXLAN
    • Re-create the swarm with a different data-port:
      • docker swarm init --data-path-port=7789
  • Problems with checksum offloading
    • Disable checksum offloading for the network interface (eth0 is a placeholder):
    • ethtool -K eth0 tx-checksum-ip-generic off
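
The MTU and TCP port checks from the list can be scripted roughly like this. It is a sketch under assumptions: the helper names extract_mtu and check_tcp_port are made up, a netcat that supports -z and -w has to be installed on the hosts, and 192.168.17.223 is just one of the peer IPs from your inspect output. The UDP ports (7946 and 4789) cannot be verified reliably this way.

```shell
# Print "<interface> <mtu>" for every interface line of `ip addr show` output read from stdin.
extract_mtu() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i == "mtu") { name = $2; sub(/:$/, "", name); print name, $(i + 1) }
  }'
}

# Report whether a TCP port on another node accepts connections.
# Assumption: the installed netcat supports -z (connect only) and -w (timeout in seconds).
check_tcp_port() {
  if nc -z -w 3 "$1" "$2" 2>/dev/null; then
    printf '%s:%s open\n' "$1" "$2"
  else
    printf '%s:%s unreachable\n' "$1" "$2"
  fi
}

# Example, run on each node and compare the results between nodes:
#   ip addr show scope global | extract_mtu
#   check_tcp_port 192.168.17.223 2377   # only manager nodes listen on 2377
#   check_tcp_port 192.168.17.223 7946
```

If the MTU values differ between nodes, or a port shows as unreachable from one manager only, that narrows down which suspect to look at first.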