Hello all,
I am working on a device called ZF ProAI, which is based on an NVIDIA Xavier SoC (8 CPU cores @ 2.1 GHz, Volta GPU with 4 TPCs) and runs Linux tegra-ubuntu 4.14.78-rt44-tegra.
The hardware ships with this OS and CUDA 10.1 preinstalled for AI development.
A standalone Python application for object detection works fine on this hardware (RetinaNet ResNet50-FPN model, Python 3.7, Conda environment).
Now I want to containerize this application, but I could not find an exact base Docker image for this hardware on Docker Hub, so I built my image from the closest match I found:
# Dockerfile
FROM nvidia/cuda:11.2.1-base-ubuntu18.04
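For context, the image was assembled roughly like this (a sketch: the package versions and the `gpu_test.py` filename are my reconstruction; only the FROM line above is verbatim):

```dockerfile
# Sketch of the test image; only the FROM line is taken from the original post.
FROM nvidia/cuda:11.2.1-base-ubuntu18.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.7 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# PyTorch wheels for aarch64/Jetson normally come from NVIDIA, not PyPI;
# this line is a placeholder for whichever wheel was actually installed.
RUN python3.7 -m pip install torch

COPY gpu_test.py /app/gpu_test.py
CMD ["python3.7", "/app/gpu_test.py"]
```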
I then set up the NVIDIA Container Toolkit (GitHub - NVIDIA/nvidia-docker: Build and run Docker containers leveraging NVIDIA GPUs) and tried GPU access with `--gpus all`, as explained in How to Use the GPU within a Docker Container:
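For reference, the toolkit setup I followed was essentially the standard Ubuntu steps from NVIDIA's documentation (on Jetson/L4T boards the runtime usually ships with JetPack instead, so the exact repository may differ):

```shell
# Standard NVIDIA Container Toolkit install on Ubuntu (per NVIDIA's docs).
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Restart the Docker daemon so it picks up the new runtime hook.
sudo systemctl restart docker
```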
nvidia@tegra-ubuntu: docker run --gpus all gpu-nvidia-test
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
ERRO[0001] error waiting for container: context canceled
After many unsuccessful attempts with the NVIDIA Container Toolkit, I tried to narrow down the problem: I created a simple Python application to check whether a Docker container can use the GPU at all.
# Simple Python application.
import time

import torch

while True:
    # Prints True if the GPU is available
    print("gpu usage =", torch.cuda.is_available())
    time.sleep(1)
I then dropped the `--gpus all` flag and instead mounted the resources this Python application needs into the container as Docker volumes:
# Container creation using volume mounts
nvidia@tegra-ubuntu:~$ sudo docker run -v /usr/local:/usr/local -v /usr/lib:/usr/lib -v /usr/share:/usr/share -e LD_LIBRARY_PATH="/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH" -e PATH="/usr/local/cuda-10.1/bin:$PATH" -it gpu-nvidia-test
Even with these mounts, the container still cannot use the GPU. The Python application reports the following:
/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
gpu usage = False
gpu usage = False
gpu usage = False
gpu usage = False
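To gather more context when this happens, I could run a small diagnostic inside the container. This is a sketch I put together; `collect_cuda_env` is a hypothetical helper, not part of the original application:

```python
# diagnose_cuda.py - gather environment facts relevant to "CUDA unknown error".
import os

def collect_cuda_env():
    """Collect environment variables and device nodes relevant to CUDA on Tegra."""
    return {
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", "<unset>"),
        "PATH": os.environ.get("PATH", "<unset>"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"),
        # On Jetson/Tegra boards the GPU is exposed through /dev/nvhost-* nodes
        # rather than /dev/nvidia* as on discrete GPUs; an empty list here means
        # the container was started without access to those device nodes.
        "nvhost_devices": sorted(
            d for d in os.listdir("/dev") if d.startswith("nvhost")
        ) if os.path.isdir("/dev") else [],
    }

if __name__ == "__main__":
    for key, value in collect_cuda_env().items():
        print(f"{key} = {value}")
```

Comparing this output between the host and the container should show which paths or device nodes are missing inside the container.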
Can someone help me with this problem? Thank you in advance.