Title: Frequent loss of GPU access and “Failed to initialize NVML: Unknown Error” when using nvidia-smi in Docker container
Description:
I am hitting a recurring issue where a running container loses access to the GPU: nvidia-smi inside the container starts failing with “Failed to initialize NVML: Unknown Error”. The container is created with the following command:
docker run --name=test -ti -d --gpus 'all' -m 256gb --shm-size 20G --net host --runtime nvidia nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04
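nvidia-smi works inside the container right after it starts; once the GPU is "lost", the same check fails. Something like the following reproduces it (the container name test matches the run command above):

# works right after the container starts, later returns only the NVML error
docker exec -it test nvidia-smi
# Failed to initialize NVML: Unknown Error

Since the run command passes --runtime nvidia, the NVIDIA runtime is registered with the Docker daemon. Below is a sketch of the stock registration as documented for nvidia-container-toolkit (my local /etc/docker/daemon.json may differ slightly):

# /etc/docker/daemon.json (default NVIDIA runtime registration; sketch)
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}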
My system details are as follows:
- Docker version: 23.0.5, build bc4487a
- OS: Ubuntu 22.04.2 LTS (jammy)
- NVIDIA driver version: 535.161.07
- CUDA version: 12.2
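The versions above were read off the host with standard commands; a sketch, in case the exact invocations matter:

docker --version   # Docker version 23.0.5, build bc4487a
lsb_release -a     # Ubuntu 22.04.2 LTS, codename jammy
nvidia-smi         # banner reports driver 535.161.07, CUDA 12.2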
Has anyone else encountered this issue and found a solution? Any help would be greatly appreciated.