Swarm with GPU MIGs

Hello,

I am deploying a service in Swarm mode. The service uses a GPU, one container per GPU.

It currently works fine on instances with only one GPU, using the generic-resource setup shared on this forum.

But when I try to run the service on an instance with multiple MIG devices (4 MIGs), only one container starts; the others fail with the following message:

“no suitable node (insufficient resources on 1 node)”

Just to share our current config (with a verification command after it):

  • daemon.json

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "node-generic-resources": ["NVIDIA-GPU=all"],
  "default-shm-size": "1G",
  "default-ulimits": {
    "memlock": { "name": "memlock", "soft": -1, "hard": -1 },
    "stack": { "name": "stack", "soft": 67108864, "hard": 67108864 }
  }
}
  • /etc/nvidia-container-runtime/config.toml
swarm-resource = "DOCKER_RESOURCE_GPU"

  • Stack service definition
  worker-service:
    image: image:tag
    deploy:
      replicas: 4
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'NVIDIA-GPU' 
                value: 1
    command: >
      bash -c "
      cd apps/inferno &&
      python3 -m launch_bare_metal
      "

So, is Swarm able to get the GPU information and schedule the workload accordingly?

Thank you!

As I always say in these topics, I am not a Swarm user; I'm just sharing some ideas.

When you list GPUs on the node, can you see multiple GPU instances?
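
For example, something along these lines should list the physical GPUs and, when MIG mode is enabled, each MIG device with its own UUID (assuming the NVIDIA driver and nvidia-smi are available on the node):

# List GPUs; with MIG enabled, each MIG device appears with a MIG-... UUID
nvidia-smi -L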

Are you also sure all the requirements described here are met?

For example the supported GPU types, and required dependencies like the NVIDIA Container Toolkit.
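
If you want to double-check on the node itself, something like this should show whether the nvidia runtime is registered and set as the default (assuming the daemon.json above is actually loaded):

# Show the registered runtimes and the default runtime of the Docker daemon
docker info --format '{{ json .Runtimes }}'
docker info --format '{{ .DefaultRuntime }}'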

Since I have never used NVIDIA MIG, I can't tell you exactly what you might be missing, or whether Swarm supports it at all.
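
One idea, purely a guess since I have not tried MIG myself: as far as I understand node-generic-resources, a non-numeric value such as NVIDIA-GPU=all is treated as a single named resource, so the node would only advertise one schedulable GPU, which would match the error you see. If the MIG devices show up with their own UUIDs in nvidia-smi -L, it might be worth advertising each of them explicitly in daemon.json, roughly like this (the UUID entries are placeholders you would replace with the real ones):

{
  "node-generic-resources": [
    "NVIDIA-GPU=<MIG-UUID-1>",
    "NVIDIA-GPU=<MIG-UUID-2>",
    "NVIDIA-GPU=<MIG-UUID-3>",
    "NVIDIA-GPU=<MIG-UUID-4>"
  ]
}

After editing daemon.json you would need to restart the Docker daemon so the node re-advertises its resources. Again, this is only how multi-GPU (non-MIG) nodes are usually advertised; whether the nvidia runtime resolves MIG UUIDs the same way is something I cannot confirm.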