Swarm with GPU MIGs

Hello,

I am deploying a service in Swarm mode. This service uses GPUs, one container per GPU.

It currently works fine on instances with a single GPU. We used a generic-resource setup as shared in this forum.

But when I try to run my service on an instance with multiple MIG devices (4 MIGs), only one container starts; the other containers fail to start with the following message:

“no suitable node (insufficient resources on 1 node)”

Just to share our current config:

  • daemon.json

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "node-generic-resources": ["NVIDIA-GPU=all"],
    "default-shm-size": "1G",
    "default-ulimits": {
        "memlock": { "name": "memlock", "soft": -1, "hard": -1 },
        "stack": { "name": "stack", "soft": 67108864, "hard": 67108864 }
    }
}
  • /etc/nvidia-container-runtime/config.toml
swarm-resource = "DOCKER_RESOURCE_GPU"
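For reference, the per-device UUIDs (useful if you later want to advertise individual devices in node-generic-resources instead of `all`) can be listed with nvidia-smi; on a MIG-enabled node this should also list the MIG device UUIDs under their parent GPU:

```shell
# List physical GPU and MIG device UUIDs on the node.
# MIG devices appear as MIG-<uuid> entries below their parent GPU.
nvidia-smi -L
```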

  • Stack service definition
  worker-service:
    image: image:tag
    deploy:
      replicas: 4
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'NVIDIA-GPU' 
                value: 1
    command: >
      bash -c "
      cd apps/inferno &&
      python3 -m launch_bare_metal
      "

So, is Swarm able to obtain GPU information and schedule workloads accordingly?

Thank you!

As I always say in these topics: I am not a Swarm user, just sharing some ideas.

When you list GPUs on the node, can you see multiple GPU instances?

Are you also sure all the requirements described here are met?

For example, GPU types and required dependencies like the NVIDIA Container Toolkit.

Since I have never used NVIDIA MIG, I can't tell you exactly what you could be missing, or whether Swarm supports it at all.

Hello,
thank you. I have been doing a lot of testing lately, and I realized this has nothing to do with MIG devices. I tried with a server with two full GPUs and still had trouble scheduling the containers onto the proper GPUs (one per service).

Here is what I did so far.

The new daemon.json now looks like:

{
    "default-runtime": "nvidia",
    "default-shm-size": "1G",
    "default-ulimits": {
        "memlock": {
            "hard": -1,
            "name": "memlock",
            "soft": -1
        },
        "stack": {
            "hard": 67108864,
            "name": "stack",
            "soft": 67108864
        }
    },
    "node-generic-resources": [
        "NVIDIA-GPU=GPU-f2ba9cd4-6f6b-860f-3c78-4a6639e4b5db",
        "NVIDIA-GPU=GPU-f12d79c4-d485-2fd1-ca2d-cd5eef76fe40"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
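To double-check that the advertised devices actually reached the manager, the node description can be inspected (command sketch; the Go-template path follows the Swarm node API, and dockerd must be restarted after editing daemon.json for the change to apply):

```shell
# Show the generic resources this node advertises to the swarm.
docker node inspect <node-name> \
  --format '{{json .Description.Resources.GenericResources}}'
```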
  • starting a service with Swarm:
  worker-service:
    image: image:tag
    deploy:
      replicas: 2
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'NVIDIA-GPU' 
                value: 1
    command: >
      bash -c "
      python3 -m launch_bare_metal
      "

After deploying it, I can see Swarm schedule two containers as requested, but both services are started on GPU 0, which fails because of memory constraints:

Wed Jan 21 15:45:11 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:01:00.0 Off |                    0 |
| N/A   30C    P0            116W /  700W |   52037MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:02:00.0 Off |                    0 |
| N/A   33C    P0             71W /  700W |       4MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           24783      C   python3                               26010MiB |
|    0   N/A  N/A           25442      C   python3                               26012MiB |
+-----------------------------------------------------------------------------------------+

Looking at what Swarm did, I inspected both containers. Container 1 was started with the following env var:
"DOCKER_RESOURCE_NVIDIA-GPU=GPU-f2ba9cd4-6f6b-860f-3c78-4a6639e4b5db",

Container 2 was started with the following env var:
"DOCKER_RESOURCE_NVIDIA-GPU=GPU-f12d79c4-d485-2fd1-ca2d-cd5eef76fe40",

So this looks good to me on that side, but I don't get why both containers end up on the same GPU after all.
If someone has come across this, any help would be appreciated :slight_smile:
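As a defensive workaround while debugging, the workload itself can honour the env var Swarm injects. A minimal Python sketch (the function names are mine, not from any library) that re-exports the assigned UUID as CUDA_VISIBLE_DEVICES, which CUDA accepts in UUID form:

```python
import os

def gpu_from_swarm_env(environ=os.environ):
    # Swarm injects the reserved device as DOCKER_RESOURCE_<kind>;
    # the prefix comes from swarm-resource in config.toml.
    return environ.get("DOCKER_RESOURCE_NVIDIA-GPU")

def pin_gpu(environ=os.environ):
    # CUDA_VISIBLE_DEVICES accepts GPU UUIDs, so re-exporting the
    # assigned UUID restricts CUDA to the device Swarm reserved.
    uuid = gpu_from_swarm_env(environ)
    if uuid is not None:
        environ["CUDA_VISIBLE_DEVICES"] = uuid
    return uuid
```

Calling `pin_gpu()` at the very top of the entrypoint (before any CUDA initialization, e.g. before launch_bare_metal imports torch) would at least make the misplacement visible.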

Thanks for reading

Maybe this is the missing piece?

Update: never mind, your config.toml probably already looks like this, otherwise Swarm wouldn't be able to schedule service tasks that use a GPU.

Update2:
I just noticed now that you already shared the relevant line from your config.toml:

Apparently, whatever follows the prefix DOCKER_RESOURCE_ needs to match both the key of the entries in node-generic-resources in daemon.json and the value of kind in discrete_resource_spec.
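To make that relationship concrete, here is how the three pieces have to line up, using the names from this thread (a sketch, not a complete config):

```
# /etc/nvidia-container-runtime/config.toml
swarm-resource = "DOCKER_RESOURCE_NVIDIA-GPU"

# daemon.json (the key before '=' must match the suffix above)
"node-generic-resources": ["NVIDIA-GPU=GPU-<uuid>"]

# stack file (kind must match that same key)
discrete_resource_spec:
  kind: 'NVIDIA-GPU'
  value: 1
```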

Hello,

So yes, I had an error in config.toml, which I have fixed.
But I still had the issue. I opened a ticket on nvidia-container-toolkit and they quickly acknowledged it. See the technical details here:

Setting the nvidia-container-toolkit mode to legacy fixed this behaviour.
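If anyone else hits this before a fix lands, forcing legacy mode is a one-line change in config.toml (the section name follows the toolkit's config layout; please verify against your installed version):

```
# /etc/nvidia-container-runtime/config.toml
swarm-resource = "DOCKER_RESOURCE_NVIDIA-GPU"

[nvidia-container-runtime]
mode = "legacy"   # default is "auto"; legacy restored per-task GPU pinning for us
```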

thanks for your help!
