Unable to create service on docker swarm managers, only on workers

Hi all
I have a fresh install of docker swarm on 8 nodes - 3 managers and 5 workers. All seems ok. I can spin up containers on the worker nodes but not on any manager node. That is mostly fine, as I want services to run only on worker nodes. However, I want Traefik to run on the manager(s) only, as I believe that’s where it should be.
I’ve tried using various placement constraints and labels, to no avail. The message is always the same: “network sandbox join failed: …”. This only happens when I try to run the service on any manager node. When I run the same service on one or all workers it runs without a hitch.
What I have noticed is that when the number of replicas is less than or equal to the number of worker nodes, and no placement constraints are present, the replicas are automatically and randomly spread around the worker nodes, and they work perfectly. When I create more replicas than there are workers, they fail. I realised it’s because docker tries to put the remainder of the replicas on the manager nodes, after placing one on each of the worker nodes. But the manager nodes just won’t run containers/services. Even a basic Nginx service fails on the manager nodes.
I hope someone can help me just to get traefik on a manager node as I don’t know what I’m doing wrong.
Thank you in advance
Edwin

I am afraid a description alone is not enough to actually get an exact understanding of the situation.
Please share the compose file content for the Traefik deployment + the command used to deploy them.

Hi meyay
Thank you for getting back to me. My concern is not that Traefik won’t run but that I cannot create any service on the manager nodes. Even “hello-world” fails. Below is the Traefik docker compose file. I deploy it with docker stack deploy -c traefik.yml traefik. An overlay network named “swarmnetwork” was created beforehand.

services:
  traefik:
    image: "traefik:2.9.9"
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - "node.role==manager"
    command:
      #
      - "--api=true"
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.webhttps.address=:443"
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.traefik.rule=Host(`traefik.mydomain.com`)"
      - "traefik.http.routers.traefik.entrypoints=webhttps"
      - "traefik.http.routers.traefik.service=api@internal"
      - "traefik.http.routers.traefik.middlewares=traefik-basic-auth"
      - "traefik.http.middlewares.traefik-basic-auth.basicauth.users=username:<password>"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    ports:
      - "443:443"
    networks:
      - "swarmnetwork"

  whoami:
    image: "traefik/whoami"
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.whoami.rule=Host(`whoami.mydomain.com`)"
      - "traefik.http.routers.whoami.entrypoints=webhttps"
    networks:
      - "swarmnetwork"

networks:
  swarmnetwork:
    external: true

volumes:
  ssl:
    external: true

Your compose file looks about right. I don’t see anything that should prevent traefik from being deployed correctly.

Please share the output of these commands:

# on any manager node
docker service ps traefik_traefik --no-trunc
# from every manager node
netstat -tpln | awk '/:443 / { split($7,x,"/"); system("ps -f --pid " x[1])}'

I want to understand whether traefik is running, and if not, get more details. The 2nd command shows which application binds port 443 on your manager nodes.

Oh, and the most important thing: I would also like to see the output of docker network inspect swarmnetwork to make sure the network was created using the overlay driver and is swarm scoped.
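
If the full JSON is too noisy, something along these lines should narrow it down to the two fields that matter. This is only a sketch: check_overlay is a made-up helper name, and it assumes the network is called swarmnetwork as in the compose file above.

```shell
# Hypothetical helper (not part of Docker): compares a driver/scope pair
# against what a swarm overlay network is expected to report.
check_overlay() {
  local driver="$1" scope="$2"
  if [ "$driver" = "overlay" ] && [ "$scope" = "swarm" ]; then
    echo "ok"
  else
    echo "unexpected: driver=${driver} scope=${scope}"
  fi
}

# On a manager node: print only Driver and Scope and feed them to the helper.
# The whole section is skipped when docker or the network is not available.
if command -v docker >/dev/null 2>&1; then
  if out=$(docker network inspect swarmnetwork --format '{{.Driver}} {{.Scope}}' 2>/dev/null); then
    # Intentionally unquoted so the two fields become two arguments.
    check_overlay $out
  fi
fi
```

Anything other than "ok" would mean the network was not created as a swarm-scoped overlay network.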

Hi meyay
Thanks for getting back to me
Just so you know, this problem only occurs on Raspberry Pi nodes. I tried three Odroid N2’s with exactly the same setup and on the same network, and they work perfectly. I can deploy services on them as managers.
Here are your answers:

# on any manager node
docker service ps traefik_traefik --no-trunc

y18zke8xwvyiky8q51y8sile6 traefik_traefik.1 traefik:2.9.9@sha256:a6462879a1fce98fdd21ef4688a7af32b30e156a30411c5b2f4d5877b9a0bf91 mngr21 Ready Rejected 2 seconds ago "network sandbox join failed: subnet sandbox join failed for "10.0.1.0/24": error creating vxlan interface: operation not supported"

# from every manager node
netstat -tpln | awk '/:443 / { split($7,x,"/"); system("ps -f --pid " x[1])}'

I get nothing

docker network inspect swarmnetwork

[
    {
        "Name": "swarmnetwork",
        "Id": "x9230ibahrsy4i1gobdtawe3t",
        "Created": "2023-03-29T21:12:28.145269474Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.1.0/24",
                    "Gateway": "10.0.1.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4097"
        },
        "Labels": null
    }
]

Thanx again

The network swarmnetwork looks good to me. Also, no other process binds the port on the node you shared the information from. The service creates no container, as it is not able to join the network.

I still don’t see why traefik is not able to join the network. There must be a reason it can not find it.

Can you try if it makes a difference, if the network is declared like this:

networks:
  swarmnetwork:
    external: true
    name: swarmnetwork

Hi meyay and thanx again
I was running Ubuntu 22.04 raspberry pi version. Yesterday I downloaded Raspberry Pi OS Lite, installed it on all three raspberry pi nodes and added them as managers. Three Odroid N2 nodes as workers. And everything just works. I can create services (including Traefik) on the raspberry pi manager nodes, as well as on the odroid worker nodes and it all just works. I briefly went back to Ubuntu on one of the raspberry pi managers, just to make sure, and the problem resurfaced. Ubuntu has been my OS of choice for many years and I’m not sure what I’m doing wrong as others are running it on Rpi’s without issue.
So I’m not sure what the issue is. Until I find the issue I will have to stick with Raspberry Pi OS.
Thank you very much for your kind assistance.

Hi Meyay,

I am experiencing the same exact problem as described here.

I have a 3-node cluster on Ubuntu 22.04 on x64 VMs and I wanted to extend the cluster with a physical ARM device, on which I installed the Ubuntu 22.04 distro for the Pi. I am having the exact same problem, so this should be reproducible by anyone.

Is moving to Raspberry Pi OS Lite the only solution to resolving this?

Note that I did try to install linux-modules-extra-raspi as some forums suggested, but the situation did not improve.

I don’t own a RPi, so I can’t really say anything about it.

You can check if all required kernel modules are available: curl -L https://raw.githubusercontent.com/moby/moby/master/contrib/check-config.sh | bash

And you can check if the firewall restricts the communication. Execute this on a manager node:

node_ids=$(docker node ls -q)
check_ips=""
for node in ${node_ids}; do
  check_ips="${check_ips} $(docker node inspect ${node} --format '{{.Status.Addr}}')"
done
cat << EOF
# execute this on each node:
check_ips="${check_ips}"
for _ip in \${check_ips}; do
  echo "## ip: \${_ip}"
  nc -zv \${_ip} -t 2377 7946
  nc -zv \${_ip} -u 7946 4789
done
EOF

It will create a script-snippet that needs to be run on each of your nodes. The worker nodes don’t bind port 2377/tcp - it is normal that nothing is reachable there.

Hi Meyay,

The check-config script shows all green. No issues or errors.

The script you provided also looks fine:

# execute this on each node:
check_ips="192.168.213.210 192.168.213.211 192.168.213.212 192.168.213.216"
for _ip in ${check_ips}; do
  echo "## ip: ${_ip}"
  nc -zv ${_ip} -t 2377 7946
  nc -zv ${_ip} -u 7946 4789
done

## ip: 192.168.213.210
Connection to 192.168.213.210 2377 port [tcp/*] succeeded!
Connection to 192.168.213.210 7946 port [tcp/*] succeeded!
Connection to 192.168.213.210 7946 port [udp/*] succeeded!
Connection to 192.168.213.210 4789 port [udp/*] succeeded!
## ip: 192.168.213.211
Connection to 192.168.213.211 2377 port [tcp/*] succeeded!
Connection to 192.168.213.211 7946 port [tcp/*] succeeded!
Connection to 192.168.213.211 7946 port [udp/*] succeeded!
Connection to 192.168.213.211 4789 port [udp/*] succeeded!
## ip: 192.168.213.212
Connection to 192.168.213.212 2377 port [tcp/*] succeeded!
Connection to 192.168.213.212 7946 port [tcp/*] succeeded!
Connection to 192.168.213.212 7946 port [udp/*] succeeded!
Connection to 192.168.213.212 4789 port [udp/*] succeeded!
## ip: 192.168.213.216
nc: connect to 192.168.213.216 port 2377 (tcp) failed: Connection refused
Connection to 192.168.213.216 7946 port [tcp/*] succeeded!
Connection to 192.168.213.216 7946 port [udp/*] succeeded!

In my case 210, 211 and 212 are managers and 216 is a worker node, so it should not be listening on 2377.

Anyone else willing to try docker on raspberry pi with ubuntu 22.04?

Just to be sure, the result of the script has the same output on all nodes?

If it’s the case, then I have no idea: it should work.

Years ago, I had a RPi 3b running with Ubuntu core 18.04 and joined as 3rd swarm manager node. It worked generally. But I ended up replacing it with an older i3 Intel nuc to have a gbit connection and a ssd.

You did install docker-ce from the docker repositories and not the snap package or docker.io from the Ubuntu repositories, right?

Update: is the line Connection to 192.168.213.216 4789 port [udp/*] missing, or did you just forget to include it when you pasted it?

Just to be sure, the result of the script has the same output on all nodes?

Yes

You did install docker-ce from the docker repositories and not the snap package or docker.io from the Ubuntu repositories, right?

Docker.io - same way I installed the x64 nodes - should I have done this differently?

Update: is the line Connection to 192.168.213.216 4789 port [udp/*] missing, or did you just forget to include it when you pasted it?

Missing, the script did not return this line for the problematic node. No firewall is running.

That’s odd, as the same command that checks 7946/udp and 4789/udp returns a result for the other nodes. While for that specific node you only get output for 7946/udp and no output for 4789/udp at all. It should have printed either failed or succeeded for port 4789/udp. Don’t ask me why it didn’t list it, I can only say it should have been listed.

Well, that’s the Docker distribution of Ubuntu, which is maintained and supported by Ubuntu and not by Docker (and this forum).

It is up to you which distribution you want to use. Though, if you want vanilla docker experience, I would suggest uninstalling the docker.io package and installing it from docker’s repositories.
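
To see which of the two packages is actually installed on a node, a rough check like the following might help. It is a sketch only: docker_flavor is a made-up helper name, it looks only at the docker-ce and docker.io deb packages, and it does not detect the snap package (snap list docker would be the place to look for that).

```shell
# Hypothetical helper: reads "<status> <package>" lines on stdin (the format
# produced by the dpkg-query call below) and reports which Docker package is
# in the installed ("ii") state.
docker_flavor() {
  awk '
    $1 == "ii" && $2 == "docker-ce" { ce = 1 }
    $1 == "ii" && $2 == "docker.io" { io = 1 }
    END {
      if (ce)      print "docker-ce (Docker repository)"
      else if (io) print "docker.io (Ubuntu repository)"
      else         print "neither package installed"
    }'
}

# On a Debian/Ubuntu node: list package states and classify them.
if command -v dpkg-query >/dev/null 2>&1; then
  dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' 2>/dev/null | docker_flavor
fi
```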

@meyay

Same here, not sure why the 3rd port is not showing.

Correction, I meant to say docker.com

# cat /etc/apt/sources.list.d/docker.list
deb [arch=arm64 signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable

No one from this forum ever ran docker swarm with a node running ubuntu 22.04 on a raspberry pi?

Thanks.

If you are on a Raspberry Pi with Ubuntu, you may just be missing the necessary kernel modules. You can install them with:

sudo apt install linux-modules-extra-raspi
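
Since the error earlier in this topic was "error creating vxlan interface: operation not supported", it may be worth checking whether the vxlan module is actually loaded after installing the package. A minimal sketch, assuming vxlan is built as a loadable module on that kernel; has_module is a made-up helper name:

```shell
# Hypothetical helper: reads lsmod-style output on stdin and succeeds
# if the module named in $1 appears in the first column.
has_module() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# On the affected node: report whether vxlan is currently loaded.
# "modinfo vxlan" additionally shows whether the module exists on disk for
# the running kernel; if the package pulled in modules for a newer kernel,
# a reboot may be needed before modprobe can find it.
if command -v lsmod >/dev/null 2>&1; then
  if lsmod | has_module vxlan; then
    echo "vxlan: loaded"
  else
    echo "vxlan: not loaded (try: sudo modprobe vxlan)"
  fi
fi
```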

There is also the possibility of an MTU size mismatch between the nodes.

I always forget about the MTU size, but the other day it was the solution in another topic about swarm overlay problems.
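
To compare the MTU between nodes, something like this could be run on each of them. It is only a sketch: list_mtus is a made-up helper name, and the 1400 in the comment is an arbitrary example value, not a recommendation for this setup.

```shell
# Hypothetical helper: prints "<interface> <mtu>" for every interface found
# under a sysfs-style directory (defaults to /sys/class/net).
list_mtus() {
  local base="${1:-/sys/class/net}" f
  for f in "$base"/*/mtu; do
    [ -r "$f" ] || continue
    printf '%s %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
  done
}

# Run on every node and compare the values of the interface swarm uses.
list_mtus

# VXLAN adds roughly 50 bytes of overhead per packet, so if the underlying
# network has an MTU below 1500, the overlay may need a smaller MTU too,
# for example (assumed value):
#   docker network create -d overlay --opt com.docker.network.driver.mtu=1400 swarmnetwork
```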