Containers on swarm cannot communicate between nodes and connect to host network

Hi,
I’m using Docker Swarm and deploying services with docker stack deploy.
I have two problems, which are probably connected to each other.

I cannot expose every detail of my configuration, so I created a separate stack for test purposes, where the problem is the same.

services:
  nginx-no-expose:
    image: nginx
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.labels.env.node == 1
    volumes:
      - /home/docker/templates:/etc/nginx/templates
    networks:
      - test-network
    environment:
      - NGINX_HOST=foobar.com
      - NGINX_PORT=80

  nginx-exposed:
    image: nginx
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.labels.env.node == 1
    volumes:
      - /home/docker/templates:/etc/nginx/templates
    ports:
      - "8080:81"
    networks:
      - test-network
    environment:
      - NGINX_HOST=foobar.com
      - NGINX_PORT=81

networks:
  test-network:
    driver: overlay
    attachable: true
    external: true

I created test-network first:

docker network create --driver overlay --attachable test-network

Then I deployed the stack:

docker stack deploy --detach=true --with-registry-auth --compose-file test-stack.yml test && watch -n2 docker service ls

180.0.0.4 is my database (MySQL) VIP.

After deploying the stack I cannot connect to the database from within the service that publishes ports (nginx-exposed). nginx-no-expose works properly. A snippet from my tests is below.

root@env:/home/docker# docker container ls
CONTAINER ID   IMAGE          COMMAND                  CREATED         STATUS         PORTS                                       NAMES
dec7fe825b65   nginx:latest   "/docker-entrypoint.…"   2 minutes ago   Up 2 minutes   80/tcp                                      test_nginx-no-expose.1.zi5x4g6z9x4j7ag2m36so5pgm
8ce2433d6631   nginx:latest   "/docker-entrypoint.…"   2 minutes ago   Up 2 minutes   80/tcp                                      test_nginx-exposed.1.5jm9nch6a2i3j57vxu1d5a81l
root@env:/home/docker# docker exec -it dec7fe825b65 /bin/bash
root@dec7fe825b65:/# ping 180.0.0.4
PING 180.0.0.4 (180.0.0.4) 56(84) bytes of data.
64 bytes from 180.0.0.4: icmp_seq=1 ttl=63 time=0.363 ms
64 bytes from 180.0.0.4: icmp_seq=2 ttl=63 time=0.268 ms
64 bytes from 180.0.0.4: icmp_seq=3 ttl=63 time=0.400 ms
64 bytes from 180.0.0.4: icmp_seq=4 ttl=63 time=0.695 ms
^C
--- 180.0.0.4 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3069ms
rtt min/avg/max/mdev = 0.268/0.431/0.695/0.159 ms
root@dec7fe825b65:/# exit
exit
root@env:/home/docker# docker exec -it 8ce2433d6631 /bin/bash
root@8ce2433d6631:/# ping 180.0.0.4
PING 180.0.0.4 (180.0.0.4) 56(84) bytes of data.
From 180.0.0.40 icmp_seq=1 Destination Host Unreachable
From 180.0.0.40 icmp_seq=2 Destination Host Unreachable
From 180.0.0.40 icmp_seq=3 Destination Host Unreachable
^C
--- 180.0.0.4 ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4093ms
pipe 3
root@8ce2433d6631:/#

The second problem (which is probably connected to the first) is the very common routing mesh problem.

When I deploy services on different machines using global mode, they cannot communicate with each other.

My machines are:

Machine A - 180.0.0.1
Machine B - 180.0.0.2
Machine C - 180.0.0.3
Database (MySQL) - 180.0.0.4

services:
  nginx-exposed:
    image: nginx
    deploy:
      mode: global
    volumes:
      - /home/docker/templates:/etc/nginx/templates
    ports:
      - "8567:80"
    networks:
      - test-network
    environment:
      - NGINX_HOST=foobar.com
      - NGINX_PORT=80

  nginx-no-expose:
    image: nginx
    deploy:
      mode: global
    volumes:
      - /home/docker/templates:/etc/nginx/templates
    networks:
      - test-network
    environment:
      - NGINX_HOST=foobar.com
      - NGINX_PORT=81

networks:
  test-network:
    driver: overlay
    attachable: true
    external: true

Request (curl) from machine A to machine A:

root@env:/home/docker# curl http://180.0.0.1:8567/hello
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.27.0</center>
</body>
</html>

Request (curl) from machine B to machine A:

root@env2:/home/docker# curl http://180.0.0.1:8567/hello
(HANGS FOREVER)

Request (curl) from machine B to machine B:

root@env2:/home/docker# curl http://180.0.0.2:8567/hello
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.27.0</center>
</body>
</html>
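To avoid requests that hang forever and to see which node pairs fail, the same curl check can be run with a timeout against every node. This is only a rough sketch: probe and probe_all are made-up helper names, the 5-second timeout is an arbitrary choice, and the node addresses and port 8567 are the ones listed above. A 404 from nginx still counts as reachable here, because it proves the request got through.

```shell
# Probe one node's published port; the timeout turns a hang into a failure.
probe() {
  if curl --silent --output /dev/null --max-time 5 "http://$1:$2/"; then
    echo "$1:$2 reachable"
  else
    # curl exit code 28 means timeout, 7 means the connection was refused.
    echo "$1:$2 FAILED (curl exit code $?)"
  fi
}

# Probe every swarm node on the published port (hypothetical helper).
probe_all() {
  for node in 180.0.0.1 180.0.0.2 180.0.0.3; do
    probe "$node" 8567
  done
}
```

Running probe_all on machine A, B and C in turn gives a small matrix of which source node can reach which target node through the routing mesh.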

I tested whether this is an overlay network problem, and it does not appear to be. I deployed a container outside of the swarm on machine A:

docker run --rm -p "4032:80" --network=test-network --env="NGINX_PORT=80" -it nginx

From machine B I can curl port 4032 on machine A and it works like a charm.

So my guess is that the problem is not the overlay network itself but the ingress network, which clashes with the network between my machines.
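One way to test that guess is to compare the subnets Docker assigned to ingress and docker_gwbridge with the LAN the machines sit on. The "From 180.0.0.40" line in the failing ping suggests that the container itself holds an address in the same range as the hosts, which is what an overlap would look like. The sketch below is only an illustration: the helper names (ip_to_int, cidr_overlap, docker_subnets, check_network) are made up, and 180.0.0.0/24 in the usage note is an assumed prefix, since the real prefix length of the host network is not stated here.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer (runs in a subshell
# so the IFS change does not leak).
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Print "overlap" if two IPv4 CIDR ranges share any address, else "ok".
cidr_overlap() {
  net1=${1%/*}; len1=${1#*/}
  net2=${2%/*}; len2=${2#*/}
  # Two CIDR blocks overlap exactly when they agree on the shorter prefix.
  if [ "$len1" -lt "$len2" ]; then len=$len1; else len=$len2; fi
  mask=$(( (0xFFFFFFFF << (32 - $len)) & 0xFFFFFFFF ))
  a=$(( $(ip_to_int "$net1") & $mask ))
  b=$(( $(ip_to_int "$net2") & $mask ))
  if [ "$a" -eq "$b" ]; then echo overlap; else echo ok; fi
}

# Print the subnet(s) Docker assigned to a network (needs a running daemon).
docker_subnets() {
  docker network inspect --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}' "$1"
}

# Compare each subnet of a Docker network against the host LAN.
check_network() {
  for subnet in $(docker_subnets "$1"); do
    echo "$1 $subnet vs $2: $(cidr_overlap "$subnet" "$2")"
  done
}
```

On each node this could be used as check_network ingress 180.0.0.0/24 and check_network docker_gwbridge 180.0.0.0/24. If either reports an overlap, Docker's swarm networking documentation describes removing the ingress network and recreating it with docker network create --driver overlay --ingress and a custom --subnet; that requires first removing any services that publish ports.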

I also have podman installed on my machine. So far I cannot find any information on whether that should concern me. I stopped the service for a moment, but the problem remained. Unfortunately, I cannot uninstall it because other people in this environment use it.

What I’ve tried:

  • checking firewalls (ufw inactive)
  • clearing iptables
  • checking ports required for docker swarm using netcat (all are open between machines)
  • setting a custom --data-path-port with docker swarm init
  • re-initializing the swarm a couple of times
  • disabling TX checksum offloading on every node with the commands below
    sudo ethtool -K <iface> tx-checksum-ip-generic off
    sudo ethtool -K <iface> tx off
  • stopping podman service on my machine
  • stopping zabbix and nginx on my machine
  • changing MTU to 1450 or 1400 on my network

Docker version:

Client: Docker Engine - Community
 Version:           27.0.3
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        7d4bcd8
 Built:             Sat Jun 29 00:02:33 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:02:33 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.1.13
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Ubuntu version:

Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

What can I do to debug where the problem is?

Are you pinning any IPs to fixed values? Usually you can just use a service/container name to connect via Docker DNS.

Use docker inspect <c-id> and docker network inspect <n-id> to check the IPs of your 3 containers.
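For example, the two inspect commands can be narrowed down with --format so that only the addresses are printed. This is a sketch, not the only way to do it: container_ips and network_summary are made-up wrapper names, while the template fields (.NetworkSettings.Networks, .IPAM.Config, .Containers) are the ones found in the JSON that docker inspect and docker network inspect print.

```shell
# List each network a container is attached to, with its IP on that network.
container_ips() {
  docker inspect --format \
    '{{range $name, $net := .NetworkSettings.Networks}}{{$name}} {{$net.IPAddress}}{{"\n"}}{{end}}' "$1"
}

# List a network's subnet(s) and the containers attached to it on this node.
network_summary() {
  docker network inspect --format \
    '{{range .IPAM.Config}}subnet {{.Subnet}}{{"\n"}}{{end}}{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}' "$1"
}
```

With the IDs from the output above, container_ips 8ce2433d6631 should show which network hands out the 180.0.0.40 address, and network_summary ingress or network_summary test-network shows the subnet that address comes from.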