
Swarm - attach to network and run with gpu

Hi all. I want to start a service in swarm with a GPU resource (using the nvidia runtime) and a custom overlay network. When I start the service like this

docker service create --with-registry-auth --generic-resource "gpu=1" --name=test --constraint=node.id==50pbc33tbompfiiu1n61khyc5 --network=myinternal busybox:latest sh -c "while true; do echo Hello; sleep 2; done"

I get the error "node is missing network attachments, ip addresses may be exhausted", followed by "assigned node no longer meets constraints":

ID                          NAME                IMAGE                                                                                    NODE                DESIRED STATE       CURRENT STATE             ERROR                                                                  PORTS
yonzgcjx8793nxf2jbuvpdukq    \_ test.1     busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a   node-4             Shutdown            Rejected 19 seconds ago   "assigned node no longer meets constraints"
3a3wrspme0m5ureu69dd9wpju    \_ test.1     busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a   node-4             Shutdown            Rejected 19 seconds ago   "node is missing network attachments, ip addresses may be exhausted"

The service starts fine if I remove either --network or --generic-resource. The overlay network myinternal is empty (there are no other services or containers in it), and I can't understand how it could be exhausted. Network inspect:

docker network inspect e0fs28o8t7pq
[
    {
        "Name": "myinternal",
        "Id": "e0fs28o8t7pqgc5p2jusa662g",
        "Created": "2020-10-08T06:46:38.851827933Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.9.8.1/16",
                    "Gateway": "10.9.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4096"
        },
        "Labels": null
    }
]
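
If it helps, a manager node can also list the swarm-scoped allocations on the network (I believe the --verbose flag shows which services and tasks hold addresses), which should rule out real exhaustion:

docker network inspect --verbose myinternal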

Where is my mistake?

Start by installing the appropriate NVidia drivers. Then continue to install NVidia Docker.

Verify with docker run --gpus all,capabilities=utility nvidia/cuda:10.0-base nvidia-smi.

Configuring Docker to work with your GPU(s)
The first step is to identify the GPU(s) available on your system. Docker will expose these as ‘resources’ to the swarm. This allows other nodes to place services (swarm-managed container deployments) on your machine.

These steps are currently for NVidia GPUs.

Docker identifies your GPU by its Universally Unique IDentifier (UUID). Find the GPU UUID for the GPU(s) in your machine.

nvidia-smi -a
A typical UUID looks like GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1. Now, only take the first two dash-separated parts, e.g.: GPU-45cbf7b3.
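
If you only want the UUIDs and not the full report, nvidia-smi's query interface should give them directly (you still need to trim the result to the first two dash-separated parts yourself):

nvidia-smi --query-gpu=uuid --format=csv,noheader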

Open up the Docker engine configuration file, typically at /etc/docker/daemon.json.

Add the GPU ID to node-generic-resources. Make sure the nvidia runtime is present and set default-runtime to it. Keep any other configuration options that are already there. Take care with the JSON syntax, which is not forgiving of single quotes and trailing commas.

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia",
    "node-generic-resources": [
        "gpu=GPU-45cbf7b3"
    ]
}
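
Before restarting, it is worth checking that the file still parses as valid JSON; any validator works, for example:

python3 -m json.tool /etc/docker/daemon.json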
Now, make sure to enable GPU resource advertising by adding or uncommenting the following line in /etc/nvidia-container-runtime/config.toml:

swarm-resource = "DOCKER_RESOURCE_GPU"
Restart the service.

sudo systemctl restart docker.service
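
Once the daemon is back up, the node should advertise the GPU as a generic resource. From a manager you should be able to confirm it with something along the lines of:

docker node inspect <node-name> --format '{{json .Description.Resources.GenericResources}}'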

Thanks for your answer. But I already have the nvidia runtime set up, as I wrote in my first message. I can start nvidia-specific containers like nvidia/cuda:10.0-base and everything works as intended. Problems only appear when I start a service with a GPU and an overlay network at the same time. This way it works:

docker service create --with-registry-auth --name=test --constraint=node.id==50pbc33tbompfiiu1n61khyc5 --network=myinternal busybox:latest sh -c "while true; do echo Hello; sleep 2; done"

and this way it works

docker service create --with-registry-auth --generic-resource "gpu=1" --name=test --constraint=node.id==50pbc33tbompfiiu1n61khyc5 busybox:latest sh -c "while true; do echo Hello; sleep 2; done"

but not this

docker service create --with-registry-auth --generic-resource "gpu=1" --name=test --constraint=node.id==50pbc33tbompfiiu1n61khyc5 --network=myinternal busybox:latest sh -c "while true; do echo Hello; sleep 2; done"

(note the --network and --generic-resource parameters).

You can safely ignore lewish95’s responses. It’s a bot! You can find all of them by googling them yourself. The last one is taken from https://gist.github.com/tomlankhorst/33da3c4b9edbde5c83fc1244f010815c.

The mere existence of this bot is proof that the whole Docker forum is completely unmoderated ^^

@muxlevator, did you manage to solve this? I’m hitting this issue as well.
However, I hit this error whenever I use generic resources; it does not depend on whether the service has a network attached or not.

I tried a lot of things but can’t figure out why this is happening.
I haven’t taken the time to look at the code yet.

I tried to set this up with NVidia drivers 450 and 455, on Ubuntu 18.04 and 20.04, without success. I also tried using only node-generic-resources in my daemon.json (without installing nvidia-docker).
I feel like it is a regression, since I used to be able to do this on another machine. Or maybe I missed some config somewhere…

Anyway, if you have any findings feel free to share; I’ll post here if I find something. Cheers.

PS: one thing I didn’t mention is that I use GPU passthrough before using it with Docker.

@nokidev, I resolved my issue by removing --generic-resource. This way you get GPU support (nvidia-runtime magic?) AND the network. I think poor swarm support from NVIDIA is to blame. You can check GPU availability from Docker by running nvidia-smi in a nvidia/cuda:10.0-base container.
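
For example, something like this (just my earlier command with the cuda image swapped in; adjust the node ID and network to yours):

docker service create --with-registry-auth --name=test --constraint=node.id==50pbc33tbompfiiu1n61khyc5 --network=myinternal nvidia/cuda:10.0-base sh -c "nvidia-smi; while true; do sleep 2; done"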

And I don’t know a thing about gpu passthrough.

Thanks @muxlevator,
I gave the nvidia-runtime black magic a shot and it worked.
I’m pretty sad because generic-resources used to work.

I also had a few issues with passthrough, but was able to solve them and now everything works normally inside Docker.

Thanks again,
Cheers