Title: Frequent loss of GPU access and “Failed to initialize NVML: Unknown Error” when using nvidia-smi in Docker container

Description:
I am experiencing a recurring issue where I lose access to the GPU inside a Docker container, and running nvidia-smi there fails with “Failed to initialize NVML: Unknown Error”. I create the container with the following command:

docker run --name=test -ti -d --gpus all -m 256gb --shm-size 20G --net host --runtime nvidia nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04
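For context, this is a minimal sketch of how I check GPU access from the host (assuming the container is named test as above):

# works right after the container starts
docker exec test nvidia-smi

# after the container has been running for a while, the same command fails with:
# Failed to initialize NVML: Unknown Error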

My system details are as follows:

  • Docker version: 23.0.5, build bc4487a
  • OS: Ubuntu 22.04.2 LTS
  • Release: 22.04
  • Codename: jammy
  • NVIDIA driver version: 535.161.07
  • CUDA Version: 12.2

Has anyone else encountered this issue and found a solution? Any help would be greatly appreciated.

It looks like Docker Engine is now at version 26. Maybe update to the latest release and try again.
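If you want to try that, here is a minimal sketch of the upgrade on Ubuntu 22.04 (assuming Docker was installed from Docker's official apt repository; package names differ for snap or other install methods):

sudo apt-get update
# upgrade only the Docker packages already installed from the docker-ce repo
sudo apt-get install --only-upgrade docker-ce docker-ce-cli containerd.io
docker --version   # should now report a 26.x build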