Docker Community Forums

Swarm - attach to network and run with gpu

Hi all. I want to start a service in swarm with a GPU resource (using the nvidia runtime) and a custom overlay network. When I start the service like this:

docker service create --with-registry-auth --generic-resource "gpu=1" --name=test --constraint=node.id==50pbc33tbompfiiu1n61khyc5 --network=myinternal busybox:latest sh -c "while true; do echo Hello; sleep 2; done"

I get the error "node is missing network attachments, ip addresses may be exhausted", followed by "assigned node no longer meets constraints":

ID                          NAME                IMAGE                                                                                    NODE                DESIRED STATE       CURRENT STATE             ERROR                                                                  PORTS
yonzgcjx8793nxf2jbuvpdukq    \_ test.1     busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a   node-4             Shutdown            Rejected 19 seconds ago   "assigned node no longer meets constraints"
3a3wrspme0m5ureu69dd9wpju    \_ test.1     busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a   node-4             Shutdown            Rejected 19 seconds ago   "node is missing network attachments, ip addresses may be exhausted"

The service starts fine if I remove either --network or --generic-resource. The overlay network myinternal is empty (there are no other services or containers in it), so I can’t understand how its addresses could be exhausted. Network inspect:

docker network inspect e0fs28o8t7pq
[
    {
        "Name": "myinternal",
        "Id": "e0fs28o8t7pqgc5p2jusa662g",
        "Created": "2020-10-08T06:46:38.851827933Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.9.8.1/16",
                    "Gateway": "10.9.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4096"
        },
        "Labels": null
    }
]

Where is my mistake?

Start by installing the appropriate NVIDIA drivers. Then install NVIDIA Docker.

Verify with docker run --gpus all,capabilities=utility nvidia/cuda:10.0-base nvidia-smi.

Configuring Docker to work with your GPU(s)
The first step is to identify the GPU(s) available on your system. Docker will expose these as ‘resources’ to the swarm. This allows other nodes to place services (swarm-managed container deployments) on your machine.

These steps are currently for NVidia GPUs.

Docker identifies your GPU by its Universally Unique IDentifier (UUID). Find the GPU UUID for the GPU(s) in your machine.

nvidia-smi -a
A typical UUID looks like GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1. Now, only take the first two dash-separated parts, e.g.: GPU-45cbf7b3.
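
The shortening step described above can be sketched in shell. The function name short_gpu_id is made up for this illustration; it only does the string handling and does not query the GPU itself.

```shell
# Hypothetical helper: keep only the first two dash-separated
# fields of a GPU UUID (e.g. "GPU" and "45cbf7b3").
short_gpu_id() {
  printf '%s\n' "$1" | cut -d- -f1,2
}

# Example with the UUID quoted above.
short_gpu_id "GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1"

# On a machine with the NVIDIA driver installed, the UUIDs could be fed in
# directly (not run here, since it needs nvidia-smi):
#   nvidia-smi --query-gpu=uuid --format=csv,noheader |
#     while read -r uuid; do short_gpu_id "$uuid"; done
```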

Open up the Docker engine configuration file, typically at /etc/docker/daemon.json.

Add the GPU ID to node-generic-resources. Make sure that the nvidia runtime is present and set default-runtime to it. Keep any other configuration options that are already there. Take care with the JSON syntax, which does not tolerate single quotes or trailing commas.

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia",
    "node-generic-resources": [
        "gpu=GPU-45cbf7b3"
    ]
}
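
Since a stray quote or comma in daemon.json can stop the daemon from starting, it may be worth checking the file before restarting. A minimal sketch, assuming python3 is installed (json.tool is part of its standard library); the function name check_json is hypothetical.

```shell
# Hypothetical helper: succeed only if the given file parses as JSON.
# Assumes python3 is available on the host.
check_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1
}
```

It could be used as a guard, for example: check_json /etc/docker/daemon.json && echo "daemon.json parses". This only checks syntax, not whether Docker accepts the individual keys.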
Now, make sure to enable GPU resource advertising by adding or uncommenting the following in /etc/nvidia-container-runtime/config.toml:

swarm-resource = "DOCKER_RESOURCE_GPU"
Restart the service.

sudo systemctl restart docker.service
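
After the restart, one way to confirm that the node actually advertises the GPU to the swarm is to read its generic resources back from a manager node. A rough sketch; the function name node_gpu_resources is made up, and "self" refers to the node the command runs on.

```shell
# Hypothetical helper: print the generic resources a swarm node advertises.
# $1 = node name or ID (defaults to "self"; must be run on a manager node).
node_gpu_resources() {
  docker node inspect "${1:-self}" \
    --format '{{json .Description.Resources.GenericResources}}'
}
```

Running node_gpu_resources node-4 on a manager should list an entry of kind gpu if the configuration above took effect; an empty or null result suggests the daemon did not pick up node-generic-resources.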

Thanks for your answer. But I already have the nvidia runtime set up, as I wrote in my first message. I can start nvidia-specific containers like nvidia/cuda:10.0-base and everything works as intended. The problem only appears when I start a service with a GPU and an overlay network at the same time. This way it works:

docker service create --with-registry-auth --name=test --constraint=node.id==50pbc33tbompfiiu1n61khyc5 --network=myinternal busybox:latest sh -c "while true; do echo Hello; sleep 2; done"

and this way it works too:

docker service create --with-registry-auth --generic-resource "gpu=1" --name=test --constraint=node.id==50pbc33tbompfiiu1n61khyc5 busybox:latest sh -c "while true; do echo Hello; sleep 2; done"

but not this:

docker service create --with-registry-auth --generic-resource "gpu=1" --name=test --constraint=node.id==50pbc33tbompfiiu1n61khyc5 --network=myinternal busybox:latest sh -c "while true; do echo Hello; sleep 2; done"

(note the --network and --generic-resource parameters).
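
The three variants above can be generated from one place, which makes it easier to repeat the comparison. This is only a dry-run sketch: variant_cmd is a hypothetical helper that prints the commands rather than running them, the service names are invented, and the container command is shortened to sleep 3600 instead of the echo loop.

```shell
# Defaults taken from the post; override via environment if needed.
NODE_ID="${NODE_ID:-50pbc33tbompfiiu1n61khyc5}"
NET="${NET:-myinternal}"

# Hypothetical helper: print the service-create command for one variant.
# $1 = service name, $2 = "gpu" or "", $3 = "net" or ""
variant_cmd() {
  cmd="docker service create --name=$1 --constraint=node.id==$NODE_ID"
  if [ "$2" = "gpu" ]; then
    cmd="$cmd --generic-resource gpu=1"
  fi
  if [ "$3" = "net" ]; then
    cmd="$cmd --network=$NET"
  fi
  printf '%s busybox:latest sleep 3600\n' "$cmd"
}

variant_cmd test-net  ""  net   # network only (reported to work)
variant_cmd test-gpu  gpu ""    # GPU only (reported to work)
variant_cmd test-both gpu net   # both (reported to fail)
```

After running a printed command, docker service ps --no-trunc with the service name shows the full, untruncated task error for that variant.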

You can safely ignore lewish95’s responses. It’s a bot! You can find the source of every one of them by googling it yourself. The last one is taken from https://gist.github.com/tomlankhorst/33da3c4b9edbde5c83fc1244f010815c.

The mere existence of this bot is proof that the whole Docker forum is completely unmoderated ^^