Hello,
I am deploying a service in Swarm mode. This service uses GPUs, one container per GPU.
It is currently working fine on GPU instances containing only one GPU. We used a generic resource setup as shared in this forum.
But when I try to run my service on an instance with multiple MIG devices (4 MIG devices), it only starts one container; the other containers fail to start with the following message:
“no suitable node (insufficient resources on 1 node)”
Just to share our current config:
- daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "node-generic-resources": ["NVIDIA-GPU=all"],
  "default-shm-size": "1G",
  "default-ulimits": {
    "memlock": { "name": "memlock", "soft": -1, "hard": -1 },
    "stack": { "name": "stack", "soft": 67108864, "hard": 67108864 }
  }
}
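(As a side note, the dockerd documentation also shows a per-device form of node-generic-resources where each GPU is advertised by its UUID, as reported by nvidia-smi -L. I am not sure whether the four MIG devices would need to be listed individually like this; the UUIDs below are placeholders:)

  "node-generic-resources": [
    "NVIDIA-GPU=UUID-OF-MIG-DEVICE-0",
    "NVIDIA-GPU=UUID-OF-MIG-DEVICE-1",
    "NVIDIA-GPU=UUID-OF-MIG-DEVICE-2",
    "NVIDIA-GPU=UUID-OF-MIG-DEVICE-3"
  ]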
- /etc/nvidia-container-runtime/config.toml
swarm-resource = "DOCKER_RESOURCE_GPU"
- Stack service definition
worker-service:
  image: image:tag
  deploy:
    replicas: 4
    resources:
      reservations:
        generic_resources:
          - discrete_resource_spec:
              kind: 'NVIDIA-GPU'
              value: 1
  command: >
    bash -c "
    cd apps/inferno &&
    python3 -m launch_bare_metal
    "
So, is Swarm able to get the GPU information and schedule the workload accordingly?
Thank you!