I run a Docker Swarm cluster of 4-5 Linux hosts, each providing a number of GPUs. I allow users to submit jobs to this swarm; upon receiving a request, the following command creates a new job:
docker service create --user ... \
  --restart-condition=none \
  --mount ... \
  --name '...' \
  --generic-resource "NVIDIA_GPU=1" \
  "dockerimage" "commands"
The hope is that Docker Swarm will automatically find an idle GPU and schedule the job onto it.
It works mostly fine, but I found a robustness issue. Whenever one of the nodes has a problem - such as the GPU driver being accidentally removed or upgraded - the node still shows as "active" in the docker node ls output and keeps accepting jobs. Every job scheduled to it fails, but that does not stop Docker Swarm from continuing to send jobs its way. It also seems that Docker Swarm matches GPU resources sequentially, in the order nodes appear in my docker node ls output, so scheduling kept getting stuck on the node with the failed GPU.
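The only mitigation I can think of so far is a periodic health check on each node that drains it from the swarm when the driver stops responding. A rough sketch of what I mean is below - the node name, the helper names, and the cron-style one-shot design are all my own assumptions, and it has to run somewhere with manager-level docker access:

```shell
#!/bin/sh
# Hypothetical one-shot GPU health check, meant to be run periodically
# (e.g. from cron) with manager-level access to the swarm.

# gpu_ok: succeeds only when the NVIDIA driver still answers.
gpu_ok() {
  nvidia-smi >/dev/null 2>&1
}

# pick_availability: maps a health-check command to a swarm availability.
pick_availability() {
  if "$@"; then echo active; else echo drain; fi
}

NODE="${1:-$(hostname)}"
# Guarded so the sketch is a no-op where there is no manager access.
if docker node ls >/dev/null 2>&1; then
  docker node update --availability "$(pick_availability gpu_ok)" "$NODE"
fi
```

Draining this way would at least stop a broken node from eating every job, but it feels like something the scheduler should be able to handle on its own.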
I would like to know if there is a way to make this more robust. For example, can I ask Docker Swarm to pick a GPU at random instead of matching the resources sequentially?
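To illustrate what I mean: today I could approximate random placement by picking a ready node myself and pinning the service to it. This is only a sketch - the pick_ready helper is made up, and docker node ls has to run on a manager:

```shell
#!/bin/sh
# Hypothetical sketch of manual random placement, run on a manager node.

# pick_ready: reads "hostname status" lines, prints one Ready host at random.
pick_ready() {
  awk '$2 == "Ready" {print $1}' | shuf -n 1
}

# Guarded so the sketch is a no-op where there is no manager access.
if docker node ls >/dev/null 2>&1; then
  NODE=$(docker node ls --format '{{.Hostname}} {{.Status}}' | pick_ready)
  docker service create --restart-condition=none \
    --constraint "node.hostname==$NODE" \
    --generic-resource "NVIDIA_GPU=1" \
    "dockerimage" "commands"
fi
```

But pinning by hostname like this sidesteps the per-GPU bookkeeping that --generic-resource gives me, which is why I would prefer the swarm scheduler to do the random pick itself.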
Any other suggestions would be appreciated!