I cannot access my services in worker mode

My Docker Swarm setup is as follows:

10.10.10.101 - Manager 1 - Ubuntu 22.04
10.10.10.102 - Manager 2 - Ubuntu 22.04
10.10.10.111 - Worker 1 - Ubuntu 22.04
10.10.10.112 - Worker 2 - Ubuntu 22.04
10.10.10.131 - NFS server - Ubuntu 22.04

I set the deploy mode of my two Postgres services and Portainer to replicated with a worker placement constraint, and deploy the stack with "docker stack deploy -c docker-compose.yml traefik".
But I cannot access these services in any way.

When I change the deploy mode to global with a manager placement constraint, I can access them.

I get a TCP dial error in the Traefik logs.

When I look at the Portainer interface:

a-) I see my manager and worker servers.
b-) For example, when I set customer1 as global and manager and set customer2 as replicated and worker in Portainer, I see what is shown in these screenshots.

Although I have spent a very long time on this, I have not been able to solve the problem. Thank you very much in advance to anyone who can help.


Please format your post according to the following guide: How to format your forum posts
In short: please use the </> button to share code, terminal output, error messages, or anything else that can contain special characters which would be interpreted by the Markdown filter. Use the preview feature to make sure your text is formatted as you expect, and check your post after you have sent it so you can still fix it.

Example code block:

```
echo "I am a code."
echo "An athletic one, and I wanna run."
```

Sorry and thank you @meyay

my docker-compose.yaml file:

version: '3.9'

services:
  traefik:
    image: 'traefik:v3.1'
    ports:
      - "80:80"
      - "443:443"
      - "5432:5432"
    deploy:
      mode: global
      placement:
        constraints:
          - node.role==manager
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "traefik-certificates:/certificates"
    command:
      - "--api.dashboard=true"
      - "--log.level=INFO"
      - "--accesslog=true"
      - "--providers.docker.network=proxy"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.swarm.endpoint=unix:///var/run/docker.sock"
      - "--entrypoints.http.address=:80"
      - "--entrypoints.https.address=:443"
      - "--entrypoints.postgres.address=:5432"
      - "--certificatesresolvers.stagingresolver.acme.email=berk.xxxxx@gmail.com"
      - "--certificatesresolvers.stagingresolver.acme.tlschallenge=true"
      - "--certificatesresolvers.stagingresolver.acme.storage=/certificates/acme.json"
      - "--certificatesresolvers.stagingresolver.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"
    networks:
      - proxy
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"
      - "traefik.http.middlewares.https-redirect.redirectscheme.scheme=https"
      - "traefik.http.middlewares.https-redirect.redirectscheme.permanent=true"
      - "traefik.http.routers.traefik-public-http.rule=Host(`traefik.example.com`)"
      - "traefik.http.routers.traefik-public-http.entrypoints=http"
      - "traefik.http.routers.traefik-public-http.middlewares=https-redirect"
      - "traefik.http.routers.traefik-public-https.rule=Host(`traefik.example.com`)"
      - "traefik.http.routers.traefik-public-https.entrypoints=https"
      - "traefik.http.routers.traefik-public-https.tls=true"
      - "traefik.http.routers.traefik-public-https.service=api@internal"
      - "traefik.http.routers.traefik-public-https.tls.certresolver=stagingresolver"
      - "traefik.http.services.traefik-public.loadbalancer.server.port=80"

  portainer:
    image: portainer/portainer-ce:latest
    command: -H unix:///var/run/docker.sock
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
      - "portainer_data:/data"
    networks:
      - proxy
    deploy:
      mode: global
      placement:
        constraints:
          - node.role==manager
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.portainer.rule=Host(`portainer.example.com`)"
        - "traefik.http.routers.portainer.entrypoints=http"
        - "traefik.http.middlewares.portainer-https-redirect.redirectscheme.scheme=https"
        - "traefik.http.routers.portainer.middlewares=portainer-https-redirect"
        - "traefik.http.routers.portainer-secured.rule=Host(`portainer.example.com`)"
        - "traefik.http.routers.portainer-secured.entrypoints=https"
        - "traefik.http.routers.portainer-secured.tls=true"
        - "traefik.http.routers.portainer-secured.tls.certresolver=stagingresolver"
        - "traefik.http.services.portainer.loadbalancer.server.port=9000"

  customer_000001_postgres:
    image: postgres:latest
    environment:
      POSTGRES_DB: postgres
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres001
    volumes:
      - customer_000001:/var/lib/postgresql/data
    networks:
      - proxy
    deploy:
      mode: global
      placement:
        constraints:
          - node.role==manager
      labels:
        - "traefik.enable=true"
        - "traefik.tcp.routers.customer_000001_postgres.entrypoints=postgres"
        - "traefik.tcp.routers.customer_000001_postgres.rule=HostSNI(`customer1.example.com`)"
        - "traefik.tcp.routers.customer_000001_postgres.tls=true"
        - "traefik.tcp.routers.customer_000001_postgres.tls.certresolver=stagingresolver"
        - "traefik.tcp.services.customer_000001_postgres.loadbalancer.server.port=5432"

  customer_000002_postgres:
    image: postgres:latest
    environment:
      POSTGRES_DB: postgres
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres002
    volumes:
      - customer_000002:/var/lib/postgresql/data
    networks:
      - proxy
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.role==worker
      labels:
        - "traefik.enable=true"
        - "traefik.tcp.routers.customer_000002_postgres.entrypoints=postgres"
        - "traefik.tcp.routers.customer_000002_postgres.rule=HostSNI(`customer2.example.com`)"
        - "traefik.tcp.routers.customer_000002_postgres.tls=true"
        - "traefik.tcp.routers.customer_000002_postgres.tls.certresolver=stagingresolver"
        - "traefik.tcp.services.customer_000002_postgres.loadbalancer.server.port=5432"

volumes:
  traefik-certificates:
    driver: local
    driver_opts:
      type: nfs
      o: addr=10.10.10.131,nfsvers=4
      device: ":/mnt/nfsdisk/certificates"
  portainer_data:
    driver: local
  customer_000001:
    driver: local
    driver_opts:
      type: nfs
      o: addr=10.10.10.131,nfsvers=4
      device: ":/mnt/nfsdisk/customer_000001/postgres_data"
  customer_000002:
    driver: local
    driver_opts:
      type: nfs
      o: addr=10.10.10.131,nfsvers=4
      device: ":/mnt/nfsdisk/customer_000002/postgres_data"

networks:
  proxy:
    name: proxy
    driver: overlay
    attachable: true
    driver_opts:
      com.docker.network.driver.mtu: 1400

So your containers are not able to communicate through the overlay network?

Typical reasons for overlay network communication to fail are:

  • missing open ports on the firewall
  • the nodes do not share the same MTU size setting
  • the nodes are running on ESXi
  • the nodes do not share a low-latency network connection (e.g. nodes are connected through a WAN or are in other regions)

Note: the ingress routing mesh used by the published Traefik ports already uses an overlay network.

Additional observations:

  • an even number of manager nodes is worse than running one fewer. With two manager nodes, losing either one leaves the cluster without a quorum, i.e. headless.
  • make sure to use endpoint_mode: dnsrr for databases so that long-running connections of a database connection pool do not break after 900 seconds of idle time
  • if you want to retain source IPs in Traefik, publish the ports with mode: host (= no ingress mesh)
  • Portainer is supposed to have a single replica; the agents (which you don’t use at all?!) are supposed to be deployed in global mode.
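
To make the last three points concrete, here is a rough sketch of how they could look in the stack file from this thread. It is a fragment to merge into the existing file, not a complete or tested configuration: unchanged keys (labels, environment, volumes, Traefik command flags) are left out, and the `proxy` network and `portainer_data` volume are the ones already defined in the original file. The `agent` service and the `agent_network` name are assumptions modeled on Portainer's published agent stack; they do not appear in the original post.

```yaml
# Sketch only: merge into the existing stack file, do not deploy as-is.
services:
  traefik:
    image: 'traefik:v3.1'
    ports:
      # Long port syntax with mode: host binds the port directly on every
      # node that runs a Traefik task and bypasses the ingress routing mesh,
      # so Traefik sees the real client source IP.
      - target: 80
        published: 80
        protocol: tcp
        mode: host
      - target: 443
        published: 443
        protocol: tcp
        mode: host
      - target: 5432
        published: 5432
        protocol: tcp
        mode: host
    deploy:
      mode: global
      placement:
        constraints:
          - node.role==manager

  customer_000002_postgres:
    image: postgres:latest
    deploy:
      # DNS round robin instead of the default virtual IP (vip) endpoint.
      endpoint_mode: dnsrr
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.role==worker

  # Assumption: one Portainer agent per node (global mode).
  agent:
    image: portainer/agent:latest
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
      - "/var/lib/docker/volumes:/var/lib/docker/volumes"
    networks:
      - agent_network
    deploy:
      mode: global

  # A single Portainer replica that talks to the agents instead of the
  # local Docker socket.
  portainer:
    image: portainer/portainer-ce:latest
    command: -H tcp://tasks.agent:9001 --tlsskipverify
    volumes:
      - "portainer_data:/data"
    networks:
      - proxy
      - agent_network
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.role==manager

networks:
  # Hypothetical name for the network shared by Portainer and its agents.
  agent_network:
    driver: overlay
```

With host-mode publishing a port is only reachable on nodes that actually run a Traefik task, which matches the global/manager placement used here.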

Note about volumes and networks: both are immutable once created on a node. Configuration changes in the compose file will never be applied to an existing volume or network. They must be deleted (on each node!) and re-created to pick up the updated configuration.

As @meyay stated, it’s a bad idea to have two manager nodes.

Note that manager nodes can be used for worker tasks, too. Just make sure to constrain the CPU usage of the workloads to not take up 100%.
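
As a minimal sketch of such a constraint, a resource limit can be added under `deploy.resources` of a service that is allowed to run on a manager. The CPU and memory numbers below are placeholder values chosen for illustration, not recommendations from this thread; they would need to be sized for the actual workload.

```yaml
# Sketch only: shows the deploy.resources keys on one of the existing services.
services:
  customer_000001_postgres:
    image: postgres:latest
    deploy:
      mode: global
      placement:
        constraints:
          - node.role==manager
      resources:
        limits:
          # Hypothetical values: at most half a CPU core and 512 MB of RAM,
          # so the manager keeps headroom for swarm management tasks.
          cpus: '0.50'
          memory: 512M
        reservations:
          # Hypothetical value: the scheduler only places the task on a node
          # that still has this much CPU unreserved.
          cpus: '0.25'
```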

Furthermore, running a database on a network file system is also not best practice. Databases are usually built for fast, direct access to local files.

You probably want the underlying data to stay available when a node breaks and the container is moved to a different node. But for databases this is usually done with a primary/secondary setup or a DB cluster, not with a shared file system.

Thank you for your valuable feedback and help. The cause of the problem that I have been struggling with for days turned out to be a port conflict in the overlay network on my ESXi servers. I did some research when I saw the ESXi note in @meyay’s notes.

In order to help others:

1-) If your servers are running on ESXi, the overlay network you set up in swarm mode can run into a port conflict.

Setting up my swarm cluster this way solved my general problem:

docker swarm init --advertise-addr 10.10.10.101 --data-path-port=9789

I would like to thank Mertcan Gökgöz for this important informative article.

2-) I also noticed the article “VMware and Swarm routing” at Known issues with VMware.

It says that Ubuntu servers may also have problems.

I thought it would be useful to try:

ethtool -K [network] tx-checksum-ip-generic off

3-) Apart from this, I continue to update my production setup in line with the very valuable advice of @meyay and @bluepuma77, and to build the system I want in the best possible way.

The setup I have reached has been working for the last 6 hours without any error.

My Node.js testConnection.js function, which had constant connection problems that I could not stabilize (I could connect with pgAdmin, but not from Node.js or Python), has now started to work reliably.

Thank you again and again, I wish you a healthy day.