Failed to initialize NVML: Unknown Error

Title: Frequent loss of GPU access and “Failed to initialize NVML: Unknown Error” when using nvidia-smi in Docker container

I am experiencing a recurring issue where I lose access to the GPU and receive the error “Failed to initialize NVML: Unknown Error” when using the nvidia-smi command within a Docker container. I am using the following command to create the container:

docker run --name=test -ti -d --gpus 'all' -m 256gb --shm-size 20G --net host --runtime nvidia nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04

My system details are as follows:

  • Docker version: 23.0.5, build bc4487a
  • OS: Ubuntu 22.04.2 LTS
  • Release: 22.04
  • Codename: jammy
  • NVIDIA-SMI Driver Version: 535.161.07
  • CUDA Version: 12.2

Has anyone else encountered this issue and found a solution? Any help would be greatly appreciated.
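To narrow down when the GPU disappears, it can help to poll nvidia-smi inside the container and log the first failure. The sketch below is only an illustration: the function names nvml_failed and watch_gpu are hypothetical, the container name "test" is taken from the command above, and the 60-second default interval is an arbitrary assumption.

```shell
# Hypothetical helper: succeeds (exit 0) if the given text contains
# the NVML initialization failure message from this thread.
nvml_failed() {
  case "$1" in
    *"Failed to initialize NVML"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: run nvidia-smi inside the named container at a
# fixed interval and stop at the first NVML failure, printing a timestamp.
# Arguments: container name (default "test"), interval in seconds (default 60).
watch_gpu() {
  wg_name="${1:-test}"
  wg_interval="${2:-60}"
  while true; do
    wg_out="$(docker exec "$wg_name" nvidia-smi 2>&1)"
    if nvml_failed "$wg_out"; then
      echo "$(date '+%Y-%m-%dT%H:%M:%S%z') GPU access lost in container '$wg_name'"
      return 1
    fi
    sleep "$wg_interval"
  done
}
```

Running `watch_gpu test 60` on the host prints a timestamp once the error appears, which can then be compared against host logs from around that time.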

Docker Engine is now at version 26. It may be worth updating and trying again.
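A quick way to confirm which engine version is actually running, before and after an update, is to ask the daemon directly. This is a minimal sketch: docker_major and engine_version are hypothetical helper names, and engine_version assumes the Docker daemon is reachable from the current shell.

```shell
# Hypothetical helper: print the major component of a version string,
# e.g. "23.0.5" gives "23".
docker_major() {
  printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: print the running Docker Engine (server) version.
# Requires a reachable Docker daemon.
engine_version() {
  docker version --format '{{.Server.Version}}'
}
```

For example, `docker_major "$(engine_version)"` prints only the major version, which makes it easy to see whether the host is still on 23 or has moved to a newer release.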

I’ve been running into something similar. seems relevant, and mentions possible workarounds.