How to make Docker Swarm more robust by allowing random generic-resource matching?

I run a Docker Swarm made of 4-5 Linux hosts, each providing a number of GPUs. I allow users to submit jobs to this swarm; upon receiving a request, the following command creates a new job:

docker service create --user ... \
   --restart-condition=none \
   --mount ... --name '...' \
   --generic-resource "NVIDIA_GPU=1" "dockerimage" "commands"

The hope is that Docker Swarm will automatically find an idle GPU and launch the job on it.

It works mostly fine, but I found a robustness issue. Whenever one of the nodes has a problem - for example, the GPU driver was accidentally removed or updated - the node still shows as “active” in the docker node ls list and accepts jobs, yet any job sent to it fails. That does not stop Docker Swarm from continuing to hand jobs to it. It also seems that Docker Swarm matches GPUs in sequential order based on my docker node ls output, so it keeps getting stuck on the node with the broken GPU.
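The only manual mitigation I can think of is to pause the broken node so the scheduler stops placing new tasks on it, and reactivate it once the driver is fixed (gpu-node-3 is just a placeholder hostname):

# stop the scheduler from placing new tasks on the broken node; existing tasks keep running
docker node update --availability pause gpu-node-3

# ... fix the GPU driver on that host ...

# return the node to the scheduling pool
docker node update --availability active gpu-node-3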

I would like to know if there is a way to make this more robust. For example, can I ask Docker Swarm to pick a GPU randomly instead of matching the resources sequentially?
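One direction I have been considering (just a sketch; the gpu_ok label is made up) is to gate scheduling on a node label, instead of relying on generic-resource matching alone:

# mark a node whose GPU driver is known to be healthy (gpu_ok is a made-up label)
docker node update --label-add gpu_ok=true node1

# only schedule onto labelled nodes
docker service create --user ... \
   --restart-condition=none \
   --mount ... --name '...' \
   --constraint 'node.labels.gpu_ok == true' \
   --generic-resource "NVIDIA_GPU=1" "dockerimage" "commands"

But that still needs something external to notice the failure and remove the label, so I would prefer a scheduler-side solution if one exists.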

Any other suggestions will be appreciated!

thanks

I moved your post to “Open Source Project / Swarm”.

As this question is quite specialized, chances are high that it stays unanswered, or at least not answered immediately.

@neersighted can you pitch in?

> As this question is quite specialized, chances are high that it stays unanswered, or at least not answered immediately.

For some reason, I thought this would be a common question. At the very least, I would like to understand how Docker assigns jobs to generic resources. My observation is that it assigns them in the order of the docker node ls output - is this the expected behavior?
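For what it's worth, my understanding is that the scheduler only sees whatever each node advertises through node-generic-resources in /etc/docker/daemon.json, roughly like this (the GPU UUIDs below are placeholders), so it has no way of knowing that the devices behind those entries no longer work:

{
  "node-generic-resources": [
    "NVIDIA_GPU=GPU-45cbf7b3",
    "NVIDIA_GPU=GPU-d5ce9af3"
  ]
}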

Putting generic-resource assignment aside, does anyone know how Docker Swarm's dispatcher assigns jobs to nodes? Does it simply pick the next available node from the node ls list, or does it find nodes some other way?

Also, how does it know whether a node is busy?
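In case it is relevant, the only per-node state I know how to look at is the list of tasks placed on a node and the resources it advertises (gpu-node-3 is again a placeholder hostname):

# tasks the scheduler has placed on a given node
docker node ps gpu-node-3

# CPU, memory, and generic resources the node advertises to the scheduler
docker node inspect gpu-node-3 --format '{{json .Description.Resources}}'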