Limit of 116 services with published ports on Docker Swarm?

Hi,

We are facing an issue that I cannot find documented (it may well be, but we cannot find it).

We have a swarm cluster with two overlay networks (running JupyterHub, but that is irrelevant as it happens with any container/service) with around 100-200 services (spread across 11 hosts).

The problem:
We cannot create more than 116 services WITH published ports; after this, only services without published ports can be created.

There is no error log, and when inspecting the service it looks OK, but it hangs on creation (desired state running).
Is it possible to lift this limitation, or is there any documentation describing why it is in place?

Regards
Jonas

Weird limit. What’s the error message you’re getting when trying to bring up a new container with published ports?

Hi,

I can run a container with published ports (on the manager node), but I cannot create a service with the same port mapping. So it is definitely something related to the swarm, but I am not sure what.
I do not get (or cannot find) any error logs/messages; it just hangs on creation of the service.

If I run docker service ps it says the desired state is “Running” and the current state “New 2 hours ago”.
docker service inspect looks good and no obvious errors.

The Docker version in the cluster is 27.0.1 (four nodes have 27.2.0 as they were recently upgraded).

Do you manually set the published port for every service? Can you share an example?

Please provide an example with all the required information to understand the situation, including node labels, placement constraints, and the commands used to create the services (preferably as a compose file for a stack deployment).

Furthermore, have you checked whether it makes a difference if the published port uses ingress or host mode (host mode can be restricted to specific nodes with node labels and placement constraints)? And whether it makes a difference if the endpoint mode vip or dnsrr is used.
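
For reference, the variations could be tried roughly like this (a sketch; the service names, the port and the image are placeholders, and node.labels.ssh-edge is a hypothetical label, not something from your setup).

Ingress mode publishing with the vip endpoint mode (the defaults):

docker service create --name probe-ingress --publish published=7777,target=7777 nginx

Host mode publishing, restricted to certain nodes via a node label and placement constraint:

docker service create --name probe-host --publish published=7777,target=7777,mode=host --constraint 'node.labels.ssh-edge == true' nginx

The dnsrr endpoint mode (no virtual IP; ports can then only be published in host mode):

docker service create --name probe-dnsrr --endpoint-mode dnsrr --publish published=7777,target=7777,mode=host nginx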

Are we dealing with layer 7 HTTP(S) traffic? If so, wouldn’t a reverse proxy with domain-based forwarding make more sense?

Hi,

I will try to explain the system, but please ping me if I am unclear.

We have 11 nodes; we use no node labels, but we do use a single engine label to differentiate between GPU and non-GPU nodes.
Example:

"Spec": {
            "Labels": {},
            "Role": "worker",
            "Availability": "active"
        },
        "Description": {
            "Hostname": "carl-gpu1",
            "Platform": {
                "Architecture": "x86_64",
                "OS": "linux"
            },
            "Resources": {
                "NanoCPUs": 20000000000,
                "MemoryBytes": 67096875008,
                "GenericResources": [
                    {
                        "NamedResourceSpec": {
                            "Kind": "GPU2080",
                            "Value": "GPU-542a9b92-8d14-6059-bca0-d67c7531f2fb"
                        }
                    },
                    {
                        "NamedResourceSpec": {
                            "Kind": "GPU2080",
                            "Value": "GPU-5c1c83b1-5ea2-8d7e-7ba1-b50153a327ac"
                        }
                    }
                ]
            },
            "Engine": {
                "EngineVersion": "27.2.0",
                "Labels": {
                    "worker-type": "GPU"
                },
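
The engine label is only consumed as a placement constraint when services are scheduled onto the GPU nodes, roughly like this (a simplified illustration, not our exact spawner invocation):

docker service create --name gpu-notebook --constraint 'engine.labels.worker-type == GPU' --generic-resource 'GPU2080=1' docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04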

The swarm is used to host a JupyterHub “cluster”; the docker compose file (example-docker-compose.yml) and our custom code can be found here: csma / jupyterhub · GitLab
The user notebooks are spawned by SwarmSpawner: GitHub - jupyterhub/dockerspawner: Spawns JupyterHub single user servers in Docker containers

The system works fine until 116 services are created, but then it stops working.
Example of manually launching a service:

# docker service ls |grep ' 1/1 '|wc -l
116
# docker service create --name noport  docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04
knatjfwpw6c1z8mi8xct35b6x
overall progress: 1 out of 1 tasks 
1/1: running   [==================================================>] 
verify: Service knatjfwpw6c1z8mi8xct35b6x converged 
# docker service create --name withport --publish 777:777  docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04
vz1ihff9108mxau048b6o9gpm
overall progress: 0 out of 1 tasks 
1/1: new       [=====>                                             ] 
^COperation continuing in background.
Use `docker service ps vz1ihff9108mxau048b6o9gpm` to check progress.
# docker service ls |grep port
knatjfwpw6c1   noport                 replicated   1/1        docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04   
vz1ihff9108m   withport               replicated   0/1        docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04

It also works fine to manually launch a container (on the manager node) with published ports, e.g.
docker run --rm -d -p 778:777 --name withportrun docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04

The ports are published for SSH access, so domain-based forwarding does not work (maybe something like sshpiper could work, but we have not had time to test it yet).

Since it is a production system, we have reverted the “SSH access” feature and restarted the system, but next week I will reproduce the issue on our development system and can do further analysis then.

Thanks for the feedback
Jonas

Indeed, if you need TCP or UDP ports, a reverse proxy is not going to help reduce the number of published ports. If it’s common to wrap the protocol’s traffic in TLS, you could leverage SNI to identify the target. Though, AFAIK, SSH is not a suitable candidate for this.

This service indeed has no running tasks. Something must be preventing them from scheduling, starting, or running. Have you checked the output of docker service ps withport --no-trunc? It usually helps with the first two situations.
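
If that output stays empty as well, the daemon logs on a manager node might be worth a look, since scheduling and allocation problems tend to surface there rather than in the service status. A sketch, assuming the nodes run systemd (the grep pattern is just a starting point):

journalctl -u docker.service --since "2 hours ago" | grep -iE 'error|alloc'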

Yes, I ran docker service ps, but the output only said that the desired state was “Running” and the current state “New 2 hours ago” (after 2 hours, of course) and nothing more (no node assigned).

I have not tried anything other than the default settings, so I guess that is ingress and vip. I can do more checking next week on our development system and try different modes.

Thanks

Hi,

We use ingress and vip, as dnsrr can only be used with “host” mode publishing, and that will not work in our setting.
I tested today on our dev environment and the results are puzzling.
On the dev environment (with 3 nodes in total) I can create around 100 services with published ports.

I wonder if it has something to do with the /24 of the ingress network, but the numbers seem odd.
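
Back-of-the-envelope, assuming (unverified on my side) that the gateway takes one address, each node’s ingress sandbox takes one, and each service with a published port takes one VIP plus one address per task: a /24 gives 254 usable addresses, so with 11 nodes and single-task services the ceiling would be around (254 - 1 - 11) / 2 = 121, which is at least in the same neighbourhood as 116.

If the /24 really is the cap, the documented way out would be to recreate the ingress network with a larger subnet (a sketch; note that all services with published ports have to be removed first, so this needs a maintenance window):

docker network rm ingress
docker network create --driver overlay --ingress --subnet 10.0.0.0/16 --gateway 10.0.0.1 ingress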