How to make docker swarm more robust by allowing random generic-resource-matching?

fangqq · March 25, 2023, 8:01pm

I run a Docker swarm made of 4-5 Linux hosts, each providing a list of GPUs. Users can submit jobs to this swarm; upon receiving a request, the following command creates a new job:

docker service create --user ... \
   --restart-condition=none \
   --mount ... --name '...' \
   --generic-resource "NVIDIA_GPU=1" "dockerimage" "commands"

The hope is that Docker swarm will automatically find an idle GPU and launch the job on it.

It works mostly fine, but I found a robustness issue. Whenever one of the nodes has a problem, such as the GPU driver being accidentally removed or updated, the node still shows as “active” in the docker node ls list and accepts jobs. Any job sent to it fails, but this does not stop Docker swarm from continuing to give it jobs. It also seems that Docker swarm matches GPUs sequentially, in the order of my docker node ls output, so it keeps getting stuck on a node with a GPU failure.
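One way to keep a broken node from receiving further jobs is to take it out of scheduling until it is repaired. The sketch below is only an illustration of that idea, not something proposed in this thread: it is meant to run on a manager, and it assumes key-based ssh from the manager to each node and that nvidia-smi exits non-zero when the driver is missing or broken. The function names and the GPU_PROBE and DOCKER variables are hypothetical; docker node update --availability pause is the standard CLI call.

```shell
#!/bin/sh
# Sketch: pause swarm nodes whose GPU probe fails. Run on a manager node.

# Probe a node's GPU stack over ssh (assumption: key-based ssh works and
# nvidia-smi fails when the driver is missing or broken).
# -n keeps ssh from swallowing the node list when called inside a read loop.
probe_over_ssh() {
    ssh -n -o BatchMode=yes "$1" nvidia-smi >/dev/null 2>&1
}

# Both commands are overridable so the logic can be tried without ssh or docker.
GPU_PROBE=${GPU_PROBE:-probe_over_ssh}
DOCKER=${DOCKER:-docker}

# Pause one node if its probe fails. "pause" stops new tasks from being
# scheduled on the node but leaves tasks that are already running alone.
pause_if_unhealthy() {
    if ! "$GPU_PROBE" "$1"; then
        "$DOCKER" node update --availability pause "$1"
    fi
}

# Apply the check to every node name read from stdin, one per line.
pause_unhealthy_nodes() {
    while IFS= read -r node; do
        if [ -n "$node" ]; then
            pause_if_unhealthy "$node"
        fi
    done
}
```

It could be fed with docker node ls --format '{{.Hostname}}' | pause_unhealthy_nodes, for example from cron. Re-activating a repaired node (docker node update --availability active) is deliberately left as a manual step, so the script never overrides a node that an administrator paused or drained on purpose.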

I would like to know if there is a way to make this more robust. For example, can I ask Docker swarm to pick a GPU at random instead of matching the resources sequentially?
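Random placement can be approximated on the submitting side by choosing a node at random and pinning the job to it with a placement constraint. The following is a minimal sketch under that assumption, not a built-in swarm feature: the function names and the DOCKER variable are hypothetical, while --constraint node.hostname==... and docker node ls --format are standard CLI features. The remaining service options, image, and command are passed through unchanged as arguments.

```shell
#!/bin/sh
# Sketch: submit a one-shot GPU job to a randomly chosen schedulable node.

# Overridable so the logic can be tried without a real swarm.
DOCKER=${DOCKER:-docker}

# Print hostnames of nodes that are reachable and accept new tasks.
list_active_nodes() {
    "$DOCKER" node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' |
        awk 'tolower($2) == "ready" && tolower($3) == "active" { print $1 }'
}

# Pick one candidate at random and pin the new service to it.
# Arguments: any extra service options, then the image and its command.
submit_to_random_node() {
    node=$(list_active_nodes | shuf -n 1)
    if [ -z "$node" ]; then
        echo "no active node found" >&2
        return 1
    fi
    "$DOCKER" service create \
        --restart-condition=none \
        --constraint "node.hostname==$node" \
        --generic-resource "NVIDIA_GPU=1" \
        "$@"
}
```

The trade-off is that the constraint removes the scheduler's freedom to choose: if every GPU on the chosen node is already reserved, the job waits for that node instead of moving to an idle one elsewhere.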

Any other suggestions will be appreciated!

Thanks

meyay · March 25, 2023, 8:47pm

I moved your post to “Open Source Project / Swarm”.

As this question is quite special, chances are high that it stays unanswered or at least not answered immediately.

@neersighted can you pitch in?

fangqq · March 26, 2023, 2:51am

As this question is quite special, chances are high that it stays unanswered or at least not answered immediately.

For some reason, I thought this would be a common question. At the least, I hope to understand how Docker assigns jobs to generic resources. My observation is that it assigns them in the order of the node ls output. Is this the expected behavior?

fangqq · March 27, 2023, 3:20pm

Let’s put generic-resource assignment aside. Does anyone know how Docker swarm’s dispatcher assigns jobs to nodes? Does it search for the next available node via the node ls list, or are there other ways it finds nodes?

Also, how does it know if a node is busy?
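The thread does not answer how the scheduler decides, but the information a manager has recorded about a node can at least be inspected. The sketch below, with a hypothetical function name and DOCKER variable, prints the generic resources a node advertises and the tasks swarm currently wants running on it. It shows the inputs only and says nothing about how they are weighed.

```shell
#!/bin/sh
# Sketch: show what a manager has recorded about one node.

# Overridable so the function can be tried without a real swarm.
DOCKER=${DOCKER:-docker}

# $1 = node hostname or ID as shown by "docker node ls".
node_gpu_report() {
    echo "== $1: advertised generic resources"
    "$DOCKER" node inspect \
        --format '{{json .Description.Resources.GenericResources}}' "$1"
    echo "== $1: tasks with desired state running"
    "$DOCKER" node ps --filter desired-state=running "$1"
}
```

The first command lists what the node advertises in total, not what is currently free, so comparing it with the task list gives only a rough picture of how many GPUs remain unreserved.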
