Docker container GPU passthrough: NVIDIA GPU on ZF ProAI

Hello all,

I am working on a device called ZF ProAI, which uses an NVIDIA Xavier SoC (8-core CPU @ 2.1 GHz, Volta GPU with 4 TPCs) running Linux tegra-ubuntu 4.14.78-rt44-tegra.
The hardware is sold with this OS and CUDA 10.1 preinstalled for AI development.
A standalone Python application for object detection works fine on this hardware:

Retinanet_resnet50_fpn model + python3.7 + Conda environment

Now I want to containerize this application, but I am unable to find an exact base Docker image for my hardware on Docker Hub, so I built my image from the closest match I could find:

# Dockerfile 
FROM nvidia/cuda:11.2.1-base-ubuntu18.04

I then set up the NVIDIA Container Toolkit (GitHub - NVIDIA/nvidia-docker: Build and run Docker containers leveraging NVIDIA GPUs). The usage of the toolkit and GPU access via the “--gpus all” flag is explained in How to Use the GPU within a Docker Container.
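
The installation itself was roughly the following (a sketch; the apt repository configuration from the nvidia-docker README is omitted, and the exact package name may differ on this board):

# NVIDIA Container Toolkit setup (rough sketch)
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

After that I ran my image with GPU access: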

nvidia@tegra-ubuntu: docker run --gpus all gpu-nvidia-test

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
ERRO[0001] error waiting for container: context canceled

After many unsuccessful tries with the nvidia-container toolkit, I tried to narrow down the problem: I created a simple Python application to check whether the Docker container can use the GPU at all.

# Simple Python application.
import torch
import time

while True:
    print("gpu usage =", torch.cuda.is_available())  # prints True if the GPU is usable
    time.sleep(1)
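
For completeness, a slightly more detailed check can also be run inside the container (this just prints the standard torch version attributes):

# Extra diagnostics inside the container (sketch)
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"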

I stopped using the “--gpus all” flag and instead tried volume mounts: I mounted the resources needed by the Python application shown above into the container as Docker volumes.

# Container creation using volume
nvidia@tegra-ubuntu:~$ sudo docker run -v '/usr/local:/usr/local' -v '/usr/lib:/usr/lib' -v '/usr/share:/usr/share' -e LD_LIBRARY_PATH='/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH' -e Path='/usr/local/cuda-10.1/bin' -it  gpu-nvidia-test

Even after mounting these volumes, the container running the Python application is not able to use the GPU. It shows the following error:

/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
gpu usage = False
gpu usage = False
gpu usage = False
gpu usage = False
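
My next guess is that the GPU device nodes also need to be passed into the container explicitly; a sketch of what I mean (device paths assumed from a generic Tegra/L4T system, not verified on the ProAI):

# Sketch: volume mounts plus explicit GPU device nodes (paths assumed from a generic Tegra/L4T setup)
sudo docker run -it \
  -v /usr/local/cuda-10.1:/usr/local/cuda-10.1 \
  --device /dev/nvhost-ctrl \
  --device /dev/nvhost-ctrl-gpu \
  --device /dev/nvhost-prof-gpu \
  --device /dev/nvmap \
  --device /dev/nvhost-gpu \
  --device /dev/nvhost-as-gpu \
  gpu-nvidia-test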

Can someone help me with this problem? Thank you in advance.

Sorry if I missed something, but how did you install the NVIDIA driver? This is from the short README of nvidia-docker: GitHub - NVIDIA/nvidia-docker at 51d3c9e22b2b891773ab9525eaf7b3ce1c014ab1

Make sure you have installed the NVIDIA driver and Docker engine for your Linux distribution. Note that you do not need to install the CUDA Toolkit on the host system, but the NVIDIA driver needs to be installed.
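
If the board runs a standard L4T image, the GPU driver should already be part of the BSP; something like the following should show which release is installed (file path and package names assumed from a typical Jetson/L4T install, so they may differ on the ProAI):

# Check the installed L4T/driver release (sketch; paths assumed from a typical L4T install)
cat /etc/nv_tegra_release
dpkg -l | grep nvidia-l4t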

Thank you very much for your reply.

The NVIDIA driver was preinstalled when the hardware was provided to me. I tried to reinstall it using the package manager apt-get:

nvidia@tegra-ubuntu:~$ sudo apt-cache search nvidia-driver-*
[sudo] password for nvidia: 
xserver-xorg-video-nvidia-465 - NVIDIA binary Xorg driver
nvidia-driver-465 - NVIDIA driver metapackage
nvidia-headless-no-dkms-465 - NVIDIA headless metapackage - no DKMS
nvidia-headless-465 - NVIDIA headless metapackage
nvidia@tegra-ubuntu:~$ sudo apt install nvidia-driver-465
Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-driver-465 is already the newest version (465.19.01-0ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.

But when I run the docker container that contains the simple python code shown below,

# Simple Python application.
import torch
import time

while True:
    print("gpu usage =", torch.cuda.is_available())  # prints True if the GPU is usable
    time.sleep(1)

it shows the following error:

nvidia@tegra-ubuntu:~$ docker run -it --runtime nvidia l4t-nvcrio
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
ERRO[0001] error waiting for container: context canceled
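
For reference, my understanding is that “--runtime nvidia” only works if that runtime is registered with Docker in /etc/docker/daemon.json; the entry below is a sketch based on the nvidia-container-runtime documentation, not a copy of my actual file:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}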

Please let me know if further info is needed.