Overlay network ping works, but HTTP requests only work within the same swarm node. To another node they hang, as if the packets were dropped

Hi…

I hope it’s OK to ask a Docker Swarm question involving Traefik here. The core issue seems to be with Docker Swarm rather than Traefik, so I hope it fits, and the similar threads I found here seem a bit different from mine.

Summary:
On my overlay network, I can ping all containers across Docker nodes, but over HTTP I can only reach the containers on the local node. The local ufw firewall has been disabled to rule it out as the cause.

Shouldn’t containers on the same overlay network have full network connectivity to each other? What is wrong with my setup?

Details:
I’m setting up 3 Docker nodes on Ubuntu Jammy and have joined them into a swarm with 3 managers. I don’t have much load, but I’d like to be able to take a node down for maintenance. So all three nodes are managers, leaving enough managers to still elect a leader when one is down, and I plan to run the workloads on the manager nodes as well.
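Swarm managers agree on cluster state via Raft, so the swarm keeps working only while a majority of managers is reachable. A quick sanity check of the "take one node down" plan, assuming the usual majority rule (quorum is ⌊N/2⌋+1, so at most ⌊(N−1)/2⌋ managers can be lost); the helper names are just for illustration:

```shell
#!/bin/sh
# Raft majority needed for N swarm managers: floor(N/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

# Managers that can be lost while a leader can still be elected: floor((N-1)/2)
tolerated() { echo $(( ($1 - 1) / 2 )); }

for n in 1 3 5; do
  echo "managers=$n quorum=$(quorum "$n") tolerated=$(tolerated "$n")"
done
```

With three managers one can be down for maintenance. A fourth manager would still only tolerate one failure, which is why odd manager counts are usually recommended.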

I’m trying to set up Traefik so that each swarm service is exposed under a subdomain of a common top-level domain: foobar.mydomain.com and otherservice.mydomain.com both have A records pointing to the Traefik service, and Traefik routes each request to the correct service based on the HTTP Host header.
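Since Traefik picks the router from the Host header, a rule can be exercised without relying on DNS by pinning the hostname to a node with curl's `--resolve` option. A small sketch; the function names and the documentation address 192.0.2.10 (standing in for the Traefik node's IP) are hypothetical:

```shell
#!/bin/sh
# Build curl's --resolve value (host:port:address) so a request for the
# public hostname goes straight to the given node, bypassing DNS.
resolve_arg() { echo "$1:$2:$3"; }

# Send one request through Traefik for HOST via NODE_IP on port 80,
# giving up after 5 seconds instead of hanging.
check_route() {
  curl --max-time 5 --resolve "$(resolve_arg "$1" 80 "$2")" "http://$1/"
}

# Example (hypothetical address of the Traefik node):
#   check_route foobar.mydomain.com 192.0.2.10
```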

I’ve created an overlay network called traefik-public that Traefik and all my services join, so Traefik can reach them. This is where the issue starts.
I’ve only added an A record for one of my nodes, and I’ve set a node label forcing Traefik to run there. I’m using the whoami service as a simple web server to verify connectivity. Additionally, I’ve set up Ubuntu containers as extra services, so I have an environment on the overlay network where I can install tools.

  • If the whoami container runs on the Traefik node, Traefik works and I can send requests from outside. If it runs on another node, it doesn’t.
  • From my Ubuntu service, I can curl tasks.whoami (or its IP directly) if the Ubuntu container is on the same host as whoami; otherwise the request just hangs, as if the traffic is being dropped. This is true regardless of whether whoami runs on the Traefik node or not.
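An immediate refusal and a silent hang point to different problems, so it can help to give curl a hard time limit and look at its exit status. A sketch using curl's documented exit codes (6 = could not resolve host, 7 = failed to connect, 28 = operation timed out); the function names and messages are my own:

```shell
#!/bin/sh
# Map a curl exit status to a rough diagnosis.
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "DNS lookup failed" ;;
    7)  echo "connection refused or unreachable" ;;
    28) echo "timed out: packets are probably being dropped" ;;
    *)  echo "other curl error ($1)" ;;
  esac
}

# Probe a URL with a time limit (default 5 s) instead of hanging indefinitely.
probe() {
  curl -s -o /dev/null --max-time "${2:-5}" "$1"
  classify_curl_exit "$?"
}

# Example, from inside a debug container on the overlay network:
#   probe http://tasks.whoami/
```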

Some basics of how I set it up:

docker network create --driver=overlay traefik-public
docker stack deploy -c traefik.yaml traefik

traefik.yaml

version: '3.3'

services:
  traefik:
    image: traefik:v2.8
    command:
      - --api.insecure=true
      - --providers.docker
      - --providers.docker.exposedbydefault=false
      - --entrypoints.web.address=:80
      - --providers.docker.swarmMode=true
    ports:
      - 80:80
      - 8080:8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    deploy:
      placement:
        constraints:
          - node.labels.traefik-public.has-certificates == true
      labels:
        - traefik.enable=true
        - traefik.docker.network=traefik-public
        - traefik.http.services.traefik-public.loadbalancer.server.port=51242
    networks:
      - traefik-public

  whoami:
    image: "traefik/whoami"
    deploy:
      placement:
        constraints:
          - node.labels.traefik-public.has-certificates == true
      labels:
        - traefik.enable=true
        - traefik.docker.network=traefik-public
        - traefik.http.routers.whoami.rule=Host(`foobar.mydomain.com`)
        - traefik.http.routers.whoami.entrypoints=web
        - traefik.http.routers.whoami.service=whoami
        - traefik.http.services.whoami.loadbalancer.server.port=80
    networks:
      - traefik-public

  debugnix:
    image: "ubuntu"
    tty: true
    command: sh
    deploy:
      placement:
        constraints:
          - node.labels.traefik-public.has-certificates == true
    networks:
      - traefik-public

  debugnix2:
    image: "ubuntu"
    tty: true
    command: sh
    deploy:
      placement:
        constraints:
          - node.labels.traefik-public.has-certificates == false
    networks:
      - traefik-public

networks:
  traefik-public:
    external: true

Results of requests when debugging:

traefik-node# docker container ls | grep debugnix | cut -d' ' -f1 | xargs -o -I {} docker exec -it {} /bin/bash
curl http request to tasks.whoami (real command is recognized as URL and blocked in forum :/)
Hostname: 480c5c46f063
IP: 127.0.0.1
IP: 10.0.6.106
IP: 172.18.0.5
RemoteAddr: 10.0.6.102:53732
GET / HTTP/1.1
Host: tasks.whoami
User-Agent: curl/7.81.0
Accept: */*

other-node# docker container ls | grep debugnix | cut -d' ' -f1 | xargs -o -I {} docker exec -it {} /bin/bash
root@7c034d32cd0f:/# host -t A tasks.whoami
tasks.whoami has address 10.0.6.106
root@7c034d32cd0f:/# curl http request to tasks.whoami (real command is recognized as URL and blocked in forum :/)
  # Doesn't return. I guess the default timeout is infinite and the packets are dropped somewhere.
root@7c034d32cd0f:/# ping tasks.whoami
PING tasks.whoami (10.0.6.106) 56(84) bytes of data.
...
--- tasks.whoami ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2054ms
rtt min/avg/max/mdev = 0.213/0.251/0.277/0.027 ms
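Ping succeeding while TCP hangs suggests the encapsulated overlay packets do reach the other node (so the VXLAN port is open), but the TCP segments inside are being discarded. One way to check, assuming Swarm's default VXLAN data port of UDP 4789 (configurable with `docker swarm init --data-path-port`), is to watch that traffic on the receiving node while curling from the other one. The helper names are hypothetical, and reading checksum verdicts from `tcpdump -vv` output is an assumption about how your tcpdump version decodes VXLAN:

```shell
#!/bin/sh
# Swarm carries overlay traffic as VXLAN over UDP; 4789 is the default port.
VXLAN_PORT="${VXLAN_PORT:-4789}"

# Capture filter for overlay traffic on the node's main interface.
vxlan_filter() { echo "udp port ${1:-$VXLAN_PORT}"; }

# Watch overlay packets on IFACE (needs root). -vv makes tcpdump verify
# checksums, so damaged inner TCP segments should be flagged "incorrect".
# The filter is left unquoted on purpose so it splits into separate words.
watch_overlay() {
  tcpdump -ni "$1" -vv $(vxlan_filter)
}

# Example, on the node running whoami while curling from the other node:
#   watch_overlay ens160
```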

traefik-node# iptables -L -n
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ufw-before-logging-input  all  --  0.0.0.0/0            0.0.0.0/0
ufw-before-input  all  --  0.0.0.0/0            0.0.0.0/0
ufw-after-input  all  --  0.0.0.0/0            0.0.0.0/0
ufw-after-logging-input  all  --  0.0.0.0/0            0.0.0.0/0
ufw-reject-input  all  --  0.0.0.0/0            0.0.0.0/0
ufw-track-input  all  --  0.0.0.0/0            0.0.0.0/0

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DOCKER-USER  all  --  0.0.0.0/0            0.0.0.0/0
DOCKER-INGRESS  all  --  0.0.0.0/0            0.0.0.0/0
DOCKER-ISOLATION-STAGE-1  all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
ufw-before-logging-forward  all  --  0.0.0.0/0            0.0.0.0/0
ufw-before-forward  all  --  0.0.0.0/0            0.0.0.0/0
ufw-after-forward  all  --  0.0.0.0/0            0.0.0.0/0
ufw-after-logging-forward  all  --  0.0.0.0/0            0.0.0.0/0
ufw-reject-forward  all  --  0.0.0.0/0            0.0.0.0/0
ufw-track-forward  all  --  0.0.0.0/0            0.0.0.0/0
DROP       all  --  0.0.0.0/0            0.0.0.0/0

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ufw-before-logging-output  all  --  0.0.0.0/0            0.0.0.0/0
ufw-before-output  all  --  0.0.0.0/0            0.0.0.0/0
ufw-after-output  all  --  0.0.0.0/0            0.0.0.0/0
ufw-after-logging-output  all  --  0.0.0.0/0            0.0.0.0/0
ufw-reject-output  all  --  0.0.0.0/0            0.0.0.0/0
ufw-track-output  all  --  0.0.0.0/0            0.0.0.0/0

Chain DOCKER (3 references)
target     prot opt source               destination

Chain DOCKER-INGRESS (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:8080
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:80
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:80
RETURN     all  --  0.0.0.0/0            0.0.0.0/0

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target     prot opt source               destination
DOCKER-ISOLATION-STAGE-2  all  --  0.0.0.0/0            0.0.0.0/0
DOCKER-ISOLATION-STAGE-2  all  --  0.0.0.0/0            0.0.0.0/0
DOCKER-ISOLATION-STAGE-2  all  --  0.0.0.0/0            0.0.0.0/0
RETURN     all  --  0.0.0.0/0            0.0.0.0/0

Chain DOCKER-ISOLATION-STAGE-2 (3 references)
target     prot opt source               destination
DROP       all  --  0.0.0.0/0            0.0.0.0/0
DROP       all  --  0.0.0.0/0            0.0.0.0/0
DROP       all  --  0.0.0.0/0            0.0.0.0/0
RETURN     all  --  0.0.0.0/0            0.0.0.0/0

Chain DOCKER-USER (1 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0

Chain ufw-after-forward (1 references)
target     prot opt source               destination

Chain ufw-after-input (1 references)
target     prot opt source               destination

Chain ufw-after-logging-forward (1 references)
target     prot opt source               destination

Chain ufw-after-logging-input (1 references)
target     prot opt source               destination

Chain ufw-after-logging-output (1 references)
target     prot opt source               destination

Chain ufw-after-output (1 references)
target     prot opt source               destination

Chain ufw-before-forward (1 references)
target     prot opt source               destination

Chain ufw-before-input (1 references)
target     prot opt source               destination

Chain ufw-before-logging-forward (1 references)
target     prot opt source               destination

Chain ufw-before-logging-input (1 references)
target     prot opt source               destination

Chain ufw-before-logging-output (1 references)
target     prot opt source               destination

Chain ufw-before-output (1 references)
target     prot opt source               destination

Chain ufw-reject-forward (1 references)
target     prot opt source               destination

Chain ufw-reject-input (1 references)
target     prot opt source               destination

Chain ufw-reject-output (1 references)
target     prot opt source               destination

Chain ufw-track-forward (1 references)
target     prot opt source               destination

Chain ufw-track-input (1 references)
target     prot opt source               destination

Chain ufw-track-output (1 references)
target     prot opt source               destination

Actually, I managed to find the root cause myself. In the similar-looking issues I found online, people had not set up the firewall rules correctly, had misconfigured something, or were using some custom service, so nothing matched until I found:

Basically, I had to run

ethtool -K ens160 tx-checksum-ip-generic off

on all my Docker nodes, and then it worked. (My device name was a bit different from jmcombs’; use the name of the main interface that connects the nodes, outside Docker.)
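To apply the fix and confirm it took effect on each node, a small sketch around the same `ethtool` command; the function names are hypothetical. Note that `ethtool -K` changes are runtime settings and are lost on reboot, so they need to be reapplied (for example from a boot-time script) on every node:

```shell
#!/bin/sh
# Read `ethtool -k` output on stdin and print the state ("on"/"off") of the
# tx-checksum-ip-generic offload feature. awk's default field splitting
# skips the leading tab that ethtool uses for sub-features.
tx_csum_state() {
  awk '$1 == "tx-checksum-ip-generic:" { print $2 }'
}

# Turn the offload off on IFACE (needs root) and print the resulting state.
disable_tx_csum() {
  ethtool -K "$1" tx-checksum-ip-generic off &&
    ethtool -k "$1" | tx_csum_state
}

# Example, on every swarm node (the interface name differs per setup):
#   disable_tx_csum ens160
```

As far as I understand it, with TX checksum offload the kernel leaves the checksum for the NIC to fill in; if that goes wrong for VXLAN-encapsulated packets, the receiving kernel discards the TCP segments, which would fit ping working while HTTP hangs.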

I’m setting up Docker on an Ubuntu image that someone has tried to harden against attacks, so I guess it might have added some non-default setting that verifies checksums… Though it seems strange that Docker Swarm should rely on being able to send packets with what I assume are broken checksums. Not that I know anything about that feature, so maybe it’s not what it sounds like…


Actually, it’s not related to the hardened image; it’s an Ubuntu / Linux issue on kernel 5.10+ or so:
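Since the cause seems to depend on the kernel and the NIC driver rather than the image, it may be worth recording both on each node before comparing with other reports. A sketch using `uname -r` and `ethtool -i`; the function names are hypothetical, and while interface names like ens160 are common on VMware guests (vmxnet3), that is worth checking rather than assuming:

```shell
#!/bin/sh
# Read `ethtool -i` output on stdin and print the driver name.
nic_driver() {
  awk '$1 == "driver:" { print $2 }'
}

# Print the kernel release and the driver behind IFACE.
node_report() {
  echo "kernel: $(uname -r)"
  echo "driver: $(ethtool -i "$1" | nic_driver)"
}

# Example:  node_report ens160
```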

You saved me!
I spent a whole day struggling with this problem 🥲