Hi,
I will try to explain the system, but please ping me if anything is unclear.
We have 11 nodes. We do not use node labels, but we use a single engine label to differentiate between GPU and non-GPU nodes.
Example:
"Spec": {
    "Labels": {},
    "Role": "worker",
    "Availability": "active"
},
"Description": {
    "Hostname": "carl-gpu1",
    "Platform": {
        "Architecture": "x86_64",
        "OS": "linux"
    },
    "Resources": {
        "NanoCPUs": 20000000000,
        "MemoryBytes": 67096875008,
        "GenericResources": [
            {
                "NamedResourceSpec": {
                    "Kind": "GPU2080",
                    "Value": "GPU-542a9b92-8d14-6059-bca0-d67c7531f2fb"
                }
            },
            {
                "NamedResourceSpec": {
                    "Kind": "GPU2080",
                    "Value": "GPU-5c1c83b1-5ea2-8d7e-7ba1-b50153a327ac"
                }
            }
        ]
    },
    "Engine": {
        "EngineVersion": "27.2.0",
        "Labels": {
            "worker-type": "GPU"
        },
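To illustrate how such an engine label can be used, here is a minimal sketch of pinning a service to the GPU nodes. It is not our actual spawner configuration: the helper name create_gpu_service, the service name in the usage line, and the choice to reserve one GPU2080 are assumptions for illustration only.

```shell
# Hypothetical helper (not from our setup): create a service that is only
# scheduled on nodes whose engine label worker-type is GPU and that reserves
# one GPU2080 generic resource.
# DOCKER defaults to "docker"; set DOCKER=echo to print the command instead.
create_gpu_service() {
  name="$1"    # service name (assumed, for illustration)
  image="$2"   # image to run
  "${DOCKER:-docker}" service create \
    --name "$name" \
    --constraint 'engine.labels.worker-type==GPU' \
    --generic-resource 'GPU2080=1' \
    "$image"
}
```

Usage would be something like: create_gpu_service gpu-test docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04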
The swarm hosts a JupyterHub “cluster”; the Docker Compose file (example-docker-compose.yml) and our custom code can be found here: csma / jupyterhub · GitLab
The user notebooks are spawned by SwarmSpawner: GitHub - jupyterhub/dockerspawner: Spawns JupyterHub single user servers in Docker containers
The system works fine until 116 services are running, but after that new services with published ports no longer start (services without published ports still do).
Example when I try to launch a service manually:
# docker service ls |grep ' 1/1 '|wc -l
116
# docker service create --name noport docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04
knatjfwpw6c1z8mi8xct35b6x
overall progress: 1 out of 1 tasks
1/1: running [==================================================>]
verify: Service knatjfwpw6c1z8mi8xct35b6x converged
# docker service create --name withport --publish 777:777 docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04
vz1ihff9108mxau048b6o9gpm
overall progress: 0 out of 1 tasks
1/1: new [=====> ]
^COperation continuing in background.
Use `docker service ps vz1ihff9108mxau048b6o9gpm` to check progress.
# docker service ls |grep port
knatjfwpw6c1 noport replicated 1/1 docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04
vz1ihff9108m withport replicated 0/1 docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04
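For the further analysis, commands along the following lines could show what the stuck task reports and how many services currently publish ports. This is only a sketch; the helper names inspect_stuck_service and count_published are made up for illustration, and I have not yet run this on the affected system.

```shell
# Hypothetical helper: show the full task state/error text and the published
# ports of one service. DOCKER defaults to "docker"; DOCKER=echo is a dry run.
inspect_stuck_service() {
  svc="$1"
  "${DOCKER:-docker}" service ps --no-trunc "$svc"
  "${DOCKER:-docker}" service inspect --format '{{json .Endpoint.Ports}}' "$svc"
}

# Hypothetical helper: count services with at least one published port.
# Reads lines of the form "NAME PORTS" on stdin; a service without published
# ports has an empty PORTS column, so its line has a single field.
count_published() {
  awk 'NF > 1 { n++ } END { print n + 0 }'
}
```

Usage would be something like: inspect_stuck_service withport, and docker service ls --format '{{.Name}} {{.Ports}}' | count_published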
It also works fine to manually launch a container (on the manager node) with published ports, e.g.:
docker run --rm -d -p778:777 --name withportrun docker.cs.kau.se/csma/jupyterhub/datascience-notebook:2024-09-04
The ports are published for SSH access, so domain-based forwarding does not work (maybe something like sshpiper could work, but we have not had time to test it yet).
Since it is a production system, we have reverted the “ssh access” feature and restarted the system, but next week I will reproduce the issue on our development system and can do further analysis then.
Thanks for the feedback
Jonas