Limit of 116 services with published ports on Docker Swarm?

Hi,

We are facing an issue that I cannot find documented (it may well be, but we cannot find it).

We have a swarm cluster with two overlay networks (running JupyterHub, but that is irrelevant as it happens with any container/service) with around 100-200 services (spread across 11 hosts).

The problem:
We cannot create more than 116 services WITH published ports; after this, only services without published ports can be created.

There is no error log, and when inspecting the service it looks OK, but it hangs on creation (desired state running).
Is it possible to lift this limitation, or is there any documentation describing why it is in place?

Regards
Jonas

Weird limit. What’s the error message you’re getting when trying to bring up a new container with published ports?

Hi,

I can run a container with published ports (on the manager node), but I cannot create a service with the same port mapping. So it is definitely something related to the swarm, but I am not sure what.
I do not get (or cannot find) any error logs/messages; it just hangs on creation of the service.

If I run docker service ps it says the desired state is “Running” and the current state “New 2 hours ago”.
docker service inspect looks good and no obvious errors.

The Docker version in the cluster is 27.0.1 (four nodes have 27.2.0 as they were recently upgraded).

Do you manually set the published port for every service? Can you share an example?

Please provide an example with all the required information to understand the situation, including node labels, placement constraints, and the commands used to create the services (preferably as a compose file for a stack deployment).

Furthermore, have you checked whether it makes a difference if the published port uses ingress or host mode (host mode can be restricted to specific nodes with node labels and placement constraints)? And whether it makes a difference if the endpoint mode vip or dnsrr is used.
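
For reference, the variations could be tried roughly like this (a sketch; the service names, the port and the image are placeholders, and node.labels.ssh-edge is a hypothetical label, not something from your setup).

Ingress mode publishing with the vip endpoint mode (the defaults):

docker service create --name probe-ingress --publish published=7777,target=7777 nginx

Host mode publishing, restricted to certain nodes via a node label and placement constraint:

docker service create --name probe-host --publish published=7777,target=7777,mode=host --constraint 'node.labels.ssh-edge == true' nginx

The dnsrr endpoint mode (no virtual IP; ports can then only be published in host mode):

docker service create --name probe-dnsrr --endpoint-mode dnsrr --publish published=7777,target=7777,mode=host nginx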

Are we dealing with layer 7 HTTP(S) traffic? If so, wouldn’t a reverse proxy with domain-based forwarding make more sense?

Hi,

I will try to explain the system, but please ping me if I am unclear.

We have 11 nodes; we use no node labels, but we do use a single engine label to differentiate between GPU and non-GPU nodes.
Example:

"Spec": {
            "Labels": {},
            "Role": "worker",
            "Availability": "active"
        },
        "Description": {
            "Hostname": "carl-gpu1",
            "Platform": {
                "Architecture": "x86_64",
                "OS": "linux"
            },
            "Resources": {
                "NanoCPUs": 20000000000,
                "MemoryBytes": 67096875008,
                "GenericResources": [
                    {
                        "NamedResourceSpec": {
                            "Kind": "GPU2080",
                            "Value": "GPU-542a9b92-8d14-6059-bca0-d67c7531f2fb"
                        }
                    },
                    {
                        "NamedResourceSpec": {
                            "Kind": "GPU2080",
                            "Value": "GPU-5c1c83b1-5ea2-8d7e-7ba1-b50153a327ac"
                        }
                    }
                ]
            },
            "Engine": {
                "EngineVersion": "27.2.0",
                "Labels": {
                    "worker-type": "GPU"
                },
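
The engine label is only consumed as a placement constraint when services are scheduled onto the GPU nodes, roughly like this (a simplified illustration, not our exact spawner invocation):

docker service create --name gpu-notebook --constraint 'engine.labels.worker-type == GPU' --generic-resource 'GPU2080=1' docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04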

The swarm is used to host a JupyterHub “cluster”; the docker compose file (example-docker-compose.yml) and our custom code can be found here: csma / jupyterhub · GitLab
The user notebooks are spawned by SwarmSpawner: GitHub - jupyterhub/dockerspawner: Spawns JupyterHub single user servers in Docker containers

The system works fine until 116 services are created, but then it stops working.
Example of manually launching a service:

# docker service ls |grep ' 1/1 '|wc -l
116
# docker service create --name noport  docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04
knatjfwpw6c1z8mi8xct35b6x
overall progress: 1 out of 1 tasks 
1/1: running   [==================================================>] 
verify: Service knatjfwpw6c1z8mi8xct35b6x converged 
# docker service create --name withport --publish 777:777  docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04
vz1ihff9108mxau048b6o9gpm
overall progress: 0 out of 1 tasks 
1/1: new       [=====>                                             ] 
^COperation continuing in background.
Use `docker service ps vz1ihff9108mxau048b6o9gpm` to check progress.
# docker service ls |grep port
knatjfwpw6c1   noport                 replicated   1/1        docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04   
vz1ihff9108m   withport               replicated   0/1        docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04

It also works fine to manually launch a container (on the manager node) with published ports, e.g.
docker run --rm -d -p 778:777 --name withportrun docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04

The ports are published for SSH access, so domain-based forwarding does not work (maybe something like sshpiper could work, but we have not had time to test it yet).

Since it is a production system, we have reverted the “SSH access” feature and restarted the system, but next week I will reproduce the issue on our development system and can do further analysis then.

Thanks for the feedback
Jonas

Indeed, if you need TCP or UDP ports, a reverse proxy is not going to help reduce the number of published ports. If it’s common to wrap the protocol’s traffic in TLS, you could leverage SNI to identify the target. Though, AFAIK, SSH is not a suitable candidate for this.

This service indeed has no running tasks. Something must be preventing them from scheduling, starting, or running. Have you checked the output of docker service ps withport --no-trunc? It usually helps with the first two situations.
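
If that output stays empty as well, the daemon logs on a manager node might be worth a look, since scheduling and allocation problems tend to surface there rather than in the service status. A sketch, assuming the nodes run systemd (the grep pattern is just a starting point):

journalctl -u docker.service --since "2 hours ago" | grep -iE 'error|alloc'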

Yes, I ran docker service ps, but the output only said that the desired state was “Running” and the current state “New 2 hours ago” (after 2 hours, of course) and nothing more (no node assigned).

I have not tried anything other than the default settings, so I guess that is ingress and vip. I can do more checking next week on our development system and try different modes.

Thanks

Hi,

We use ingress and vip, as dnsrr can only be used with “host” mode publishing, and that will not work in our setting.
I tested today on our dev environment and the results are puzzling.
On the dev environment (with 3 nodes in total) I can create around 100 services with published ports.

I wonder if it has something to do with the /24 of the ingress network, but the numbers seem odd.
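
Back-of-the-envelope, assuming (unverified on my side) that the gateway takes one address, each node’s ingress sandbox takes one, and each service with a published port takes one VIP plus one address per task: a /24 gives 254 usable addresses, so with 11 nodes and single-task services the ceiling would be around (254 - 1 - 11) / 2 = 121, which is at least in the same neighbourhood as 116.

If the /24 really is the cap, the documented way out would be to recreate the ingress network with a larger subnet (a sketch; note that all services with published ports have to be removed first, so this needs a maintenance window):

docker network rm ingress
docker network create --driver overlay --ingress --subnet 10.0.0.0/16 --gateway 10.0.0.1 ingress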