Manage GPUs in a docker swarm

Hi everyone,

I am new to Docker and curious about how to manage GPUs in a Docker swarm. I have three Linux servers, and each machine is equipped with two NVIDIA GPUs (GPU 0 and GPU 1).

What I have done so far:
1. I created a swarm consisting of one manager and two workers;
2. I followed the instructions in this post Instructions for Docker swarm with GPUs · GitHub and exposed the GPU resources on the worker nodes;
3. Now I can create Docker services that use the GPUs distributed across this swarm (a rough sketch of the setup is below).
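
In case it helps, here is roughly what that setup looks like on each worker and on the manager. Treat it as a sketch: the GPU UUIDs, the service name, and the image name are placeholders, not values copied from my machines.

```bash
# On each worker: advertise the GPUs as generic resources in /etc/docker/daemon.json.
# The UUIDs below are placeholders; list the real ones with `nvidia-smi -L`.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "NVIDIA-GPU=GPU-aaaaaaaa",
    "NVIDIA-GPU=GPU-bbbbbbbb"
  ]
}
EOF

# Uncomment the swarm-resource line in /etc/nvidia-container-runtime/config.toml:
#   swarm-resource = "DOCKER_RESOURCE_GPU"

# Restart Docker so the node advertises its GPUs to the swarm.
sudo systemctl restart docker

# On the manager: each replica reserves one generic GPU resource.
docker service create \
  --name gpu-service \
  --replicas 6 \
  --generic-resource "NVIDIA-GPU=1" \
  my-gpu-image:latest
```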

Problems I am facing:
1. It seems that only GPU 0 of each node is used to run the services, probably because each of my service containers uses a single GPU, so GPU 1 sits idle and is wasted.
2. Each container actually has access to all of the GPUs, even though I expose only one GPU to it (a quick check illustrating this is below).
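
For reference, a check along these lines shows the mismatch (it assumes nvidia-smi is available inside the image and reuses the placeholder service name from the sketch above):

```bash
# On a worker, find one of the service's task containers.
docker ps --filter "name=gpu-service"

# The reservation that swarm passed down to the container (a single GPU)...
docker exec <container-id> env | grep DOCKER_RESOURCE

# ...versus the devices the container can actually see (every GPU on the node).
docker exec <container-id> nvidia-smi -L
```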

I wonder how to set up the swarm so that each container gets its own single GPU and I can make the best use of all the GPUs. Any comments and suggestions would be appreciated.

Thank you!
Shijie

Hello Shijie,

I am having the exact same issue. Each container in the swarm is told to use a specific GPU, yet all of the GPUs are exposed to it at the same time, which results in every container trying to use only the first GPU instead of the GPU ID it was assigned.

Were you able to find a solution to this issue?

Well, after many days I have finally found the solution:

1. Use the complete GPU UUIDs (as listed by nvidia-smi -L) when advertising the GPUs as node generic resources, not the shortened form.
2. In /etc/nvidia-container-runtime/config.toml, change “DOCKER_RESOURCE_GPU” to “DOCKER_RESOURCE_NVIDIA-GPU” so that the swarm-resource name matches the “NVIDIA-GPU” resource you advertise.

After making those changes, the Docker swarm is able to select the correct GPU on machines with multiple GPUs.
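
To make the two changes concrete, this is roughly what it boils down to on each node; the UUIDs below are placeholders for the full values reported by nvidia-smi -L:

```bash
# 1. Get the full UUID of every GPU on the node.
nvidia-smi -L

# 2. In /etc/docker/daemon.json, advertise the GPUs with their complete UUIDs:
#      "node-generic-resources": [
#        "NVIDIA-GPU=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
#        "NVIDIA-GPU=GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy"
#      ]

# 3. In /etc/nvidia-container-runtime/config.toml, make the key match the
#    resource name used above:
#      swarm-resource = "DOCKER_RESOURCE_NVIDIA-GPU"

# 4. Restart Docker on the node so both changes take effect.
sudo systemctl restart docker
```

As far as I understand it, swarm injects the reserved UUID into the container through an environment variable named after the resource (DOCKER_RESOURCE_NVIDIA-GPU here), and the nvidia-container-runtime then exposes only the matching GPU once the swarm-resource key points at that variable.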
