nemlehet
(nemlehet4)
April 14, 2026, 1:07pm
1
Hey Guys,
It's probably a simple thing, but I'm pretty new to Docker and Linux and I couldn't find a good explanation of how to do this.
I have an RTX 3090 in my home server. It works with Docker with no problem. However, after the server restarts, the containers using the GPU fail to start. Based on the logs, it takes roughly 45 seconds for the NVIDIA CDI to become available. It's a bit annoying to always go in and start them manually, especially when I'm not at home… (there is some power fluctuation in my area, so the server unfortunately restarts a few times a week)
I want to either add a delay to the Docker service or add some kind of health check as a start condition for my containers.
My problem is that I can see how to ping an endpoint with a healthcheck in Compose, or how to run scripts, but I don't see any tutorial on how to check whether a service is running…
I found this for waiting until a mount is available:
[Unit]
#ExecStartPre =/bin/sleep 30
RequiresMountsFor=/media/localadmin/FILES /media/localadmin/PHOTOS
I would need something like this, but for nvidia CDI.
rimelek
(Ákos Takács)
April 14, 2026, 7:49pm
2
It was some time ago when I configured Docker or Docker containers for nvidia GPUs, so I don’t remember if docker run fails immediately or the process fails in the container. If the process fails in the container and the container stops, you could simply use
restart: always
in Compose, or the --restart flag on the command line: https://docs.docker.com/reference/cli/docker/container/run/#restart
But checking the NVIDIA docs, I see this systemd service: nvidia-cdi-refresh.service, so you could try something like this in the Docker unit:
[Unit]
After=nvidia-cdi-refresh.service
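One way to apply that ordering without editing the packaged unit file is a systemd drop-in. The following is only a sketch: it assumes the Docker unit is called docker.service, and the drop-in file name, the `write_cdi_dropin` function and the `DROPIN_DIR` override are hypothetical names chosen for illustration. The `Wants=` line is an optional extra, not something from the NVIDIA docs.

```shell
#!/bin/sh
# Sketch: write a drop-in that orders docker.service after nvidia-cdi-refresh.service.
# Assumptions: the unit is named docker.service; the file name and DROPIN_DIR
# override below are illustrative, not taken from any official documentation.

write_cdi_dropin() {
    dropin_dir="${DROPIN_DIR:-/etc/systemd/system/docker.service.d}"
    mkdir -p "$dropin_dir" || return 1
    # The heredoc body and its terminator must start at column 0.
    cat > "$dropin_dir/10-wait-for-nvidia-cdi.conf" <<'EOF'
[Unit]
# Ordering only: start Docker after the CDI refresh unit has finished.
After=nvidia-cdi-refresh.service
# Optional: also pull the refresh unit in when Docker is started.
Wants=nvidia-cdi-refresh.service
EOF
}
```

Run as root, something like `write_cdi_dropin && systemctl daemon-reload` should make the change visible to systemd; `systemctl edit docker.service` would be an equivalent interactive route. Note that `After=` only affects ordering, so it does not by itself guarantee that the refresh succeeded.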
I also found an open issue on GitHub mentioning that this service can also fail:
opened 02:14PM - 24 Jan 26 UTC
Since updating CTK today, I'm seeing red in the logs at boot:
```
> journalctl … -b -u nvidia-cdi-refresh
Jan 25 00:45:43 Pallas systemd[1]: Starting Refresh NVIDIA CDI specification file...
Jan 25 00:45:43 Pallas nvidia-ctk[1600]: time="2026-01-25T00:45:43+11:00" level=info msg="Using /usr/lib64/libnvidia-ml.so.580.126.09"
Jan 25 00:45:43 Pallas nvidia-ctk[1600]: time="2026-01-25T00:45:43+11:00" level=info msg="Using /usr/lib64/libnvidia-sandboxutils.so.580.126.09"
Jan 25 00:45:43 Pallas nvidia-ctk[1600]: ERROR: init 243 result=9time="2026-01-25T00:45:43+11:00" level=warning msg="Failed to init nvsandboxutils: ERROR_NVML_LIB_CALL; ignoring"
Jan 25 00:45:43 Pallas nvidia-ctk[1600]: time="2026-01-25T00:45:43+11:00" level=warning msg="Could not determine driver version: libnvsandboxutils is not available\nfailed to initialize nvml: Driver Not Loaded"
Jan 25 00:45:43 Pallas nvidia-ctk[1600]: time="2026-01-25T00:45:43+11:00" level=info msg="Auto-detected mode as 'nvml'"
Jan 25 00:45:43 Pallas nvidia-ctk[1600]: time="2026-01-25T00:45:43+11:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to construct device spec generators: failed to initialize NVML: Driver Not Loaded"
Jan 25 00:45:43 Pallas systemd[1]: nvidia-cdi-refresh.service: Main process exited, code=exited, status=1/FAILURE
Jan 25 00:45:43 Pallas systemd[1]: nvidia-cdi-refresh.service: Failed with result 'exit-code'.
Jan 25 00:45:43 Pallas systemd[1]: Failed to start Refresh NVIDIA CDI specification file.
Jan 25 00:45:43 Pallas systemd[1]: nvidia-cdi-refresh.service: Consumed 105ms CPU time, 79.8M memory peak.
Jan 25 00:45:44 Pallas systemd[1]: nvidia-cdi-refresh.service: Scheduled restart job, restart counter is at 1.
Jan 25 00:45:44 Pallas systemd[1]: Starting Refresh NVIDIA CDI specification file...
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Using /usr/lib64/libnvidia-ml.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Using /usr/lib64/libnvidia-sandboxutils.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Auto-detected mode as 'nvml'"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Using driver version 580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /dev/nvidia-modeset as /dev/nvidia-modeset"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /dev/nvidia-uvm-tools as /dev/nvidia-uvm-tools"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /dev/nvidia-uvm as /dev/nvidia-uvm"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /dev/nvidiactl as /dev/nvidiactl"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-egl-gbm.so.1.1.2 as /usr/lib64/libnvidia-egl-gbm.so.1.1.2"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-egl-wayland.so.1.1.21 as /usr/lib64/libnvidia-egl-wayland.so.1.1.21"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-allocator.so.580.126.09 as /usr/lib64/libnvidia-allocator.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=warning msg="Could not locate libnvidia-vulkan-producer.so.580.126.09: pattern libnvidia-vulkan-producer.so.580.126.09 not found\nlibnvidia-vulkan-producer.so.580.126.09: not found"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/xorg/modules/drivers/nvidia_drv.so as /usr/lib64/xorg/modules/drivers/nvidia_drv.so"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.580.126.09 as /usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/share/glvnd/egl_vendor.d/10_nvidia.json as /usr/share/glvnd/egl_vendor.d/10_nvidia.json"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json as /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json as /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/share/nvidia/nvoptix.bin as /usr/share/nvidia/nvoptix.bin"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=warning msg="Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=warning msg="Could not locate X11/xorg.conf.d/nvidia-drm-outputclass.conf: pattern X11/xorg.conf.d/nvidia-drm-outputclass.conf not found"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=warning msg="Could not locate vulkan/icd.d/nvidia_icd.json: pattern vulkan/icd.d/nvidia_icd.json not found\npattern vulkan/icd.d/nvidia_icd.json not found"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=warning msg="Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found\npattern vulkan/icd.d/nvidia_layers.json not found"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/share/vulkan/implicit_layer.d/nvidia_layers.json as /etc/vulkan/implicit_layer.d/nvidia_layers.json"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json as /etc/vulkan/icd.d/nvidia_icd.x86_64.json"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libEGL_nvidia.so.580.126.09 as /usr/lib64/libEGL_nvidia.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libGLESv1_CM_nvidia.so.580.126.09 as /usr/lib64/libGLESv1_CM_nvidia.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libGLESv2_nvidia.so.580.126.09 as /usr/lib64/libGLESv2_nvidia.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libGLX_nvidia.so.580.126.09 as /usr/lib64/libGLX_nvidia.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libcuda.so.580.126.09 as /usr/lib64/libcuda.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libcudadebugger.so.580.126.09 as /usr/lib64/libcudadebugger.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvcuvid.so.580.126.09 as /usr/lib64/libnvcuvid.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-allocator.so.580.126.09 as /usr/lib64/libnvidia-allocator.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-cfg.so.580.126.09 as /usr/lib64/libnvidia-cfg.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-eglcore.so.580.126.09 as /usr/lib64/libnvidia-eglcore.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-encode.so.580.126.09 as /usr/lib64/libnvidia-encode.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-fbc.so.580.126.09 as /usr/lib64/libnvidia-fbc.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-glcore.so.580.126.09 as /usr/lib64/libnvidia-glcore.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-glsi.so.580.126.09 as /usr/lib64/libnvidia-glsi.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-glvkspirv.so.580.126.09 as /usr/lib64/libnvidia-glvkspirv.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-gpucomp.so.580.126.09 as /usr/lib64/libnvidia-gpucomp.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-ml.so.580.126.09 as /usr/lib64/libnvidia-ml.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-ngx.so.580.126.09 as /usr/lib64/libnvidia-ngx.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-nvvm.so.580.126.09 as /usr/lib64/libnvidia-nvvm.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-opencl.so.580.126.09 as /usr/lib64/libnvidia-opencl.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-opticalflow.so.580.126.09 as /usr/lib64/libnvidia-opticalflow.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-pkcs11-openssl3.so.580.126.09 as /usr/lib64/libnvidia-pkcs11-openssl3.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-present.so.580.126.09 as /usr/lib64/libnvidia-present.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-ptxjitcompiler.so.580.126.09 as /usr/lib64/libnvidia-ptxjitcompiler.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-rtcore.so.580.126.09 as /usr/lib64/libnvidia-rtcore.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-sandboxutils.so.580.126.09 as /usr/lib64/libnvidia-sandboxutils.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-tls.so.580.126.09 as /usr/lib64/libnvidia-tls.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvidia-vksc-core.so.580.126.09 as /usr/lib64/libnvidia-vksc-core.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/libnvoptix.so.580.126.09 as /usr/lib64/libnvoptix.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/lib64/vdpau/libvdpau_nvidia.so.580.126.09 as /usr/lib64/vdpau/libvdpau_nvidia.so.580.126.09"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /run/nvidia-persistenced/socket as /run/nvidia-persistenced/socket"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=warning msg="Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=warning msg="Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /lib/firmware/nvidia/580.126.09/gsp_ga10x.bin as /lib/firmware/nvidia/580.126.09/gsp_ga10x.bin"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /lib/firmware/nvidia/580.126.09/gsp_tu10x.bin as /lib/firmware/nvidia/580.126.09/gsp_tu10x.bin"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/bin/nvidia-smi as /usr/bin/nvidia-smi"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/bin/nvidia-debugdump as /usr/bin/nvidia-debugdump"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/bin/nvidia-persistenced as /usr/bin/nvidia-persistenced"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/bin/nvidia-cuda-mps-control as /usr/bin/nvidia-cuda-mps-control"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Selecting /usr/bin/nvidia-cuda-mps-server as /usr/bin/nvidia-cuda-mps-server"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=warning msg="Could not locate nvidia-imex: pattern nvidia-imex not found"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=warning msg="Could not locate nvidia-imex-ctl: pattern nvidia-imex-ctl not found"
Jan 25 00:45:44 Pallas nvidia-ctk[2390]: time="2026-01-25T00:45:44+11:00" level=info msg="Generated CDI spec with version 1.1.0"
Jan 25 00:45:44 Pallas systemd[1]: nvidia-cdi-refresh.service: Deactivated successfully.
Jan 25 00:45:44 Pallas systemd[1]: Finished Refresh NVIDIA CDI specification file.
```
This service start and error is occurring very early in the boot sequence, for example, before network is up, disks are checked, or crash kernel is loaded.
This seems to fix it:
```
After=multi-user.target
```
I just picked that as an ordering dependency based on the `WantedBy` so perhaps there might be something better.
So that dependency definition alone might not solve everything, but you can try.
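If ordering alone turns out not to be enough, a fallback could be to poll until the CDI devices are actually listed before Docker starts. This is a rough sketch, not a tested recipe: it assumes `nvidia-ctk cdi list` prints entries such as `nvidia.com/gpu=0` once the spec exists, and the function name `wait_for_cdi`, the `CDI_LIST_CMD` override, the 120 second timeout and the 5 second interval are all made up for illustration.

```shell
#!/bin/sh
# Sketch: block until a CDI device named nvidia.com/gpu=... is listed, or give up.
# Assumptions: CDI_LIST_CMD, the timeout and the interval are illustrative defaults.

wait_for_cdi() {
    timeout="${1:-120}"    # maximum number of seconds to keep trying
    interval="${2:-5}"     # seconds to sleep between attempts
    list_cmd="${CDI_LIST_CMD:-nvidia-ctk cdi list}"
    elapsed=0
    while :; do
        # $list_cmd is left unquoted on purpose so it splits into command and arguments.
        if $list_cmd 2>&1 | grep -q 'nvidia\.com/gpu='; then
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}
```

Saved as a script that ends with a call like `wait_for_cdi 120 5`, this could be referenced from an `ExecStartPre=` line in a drop-in for the Docker unit. That line belongs in the `[Service]` section rather than `[Unit]`.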
nemlehet
(nemlehet4)
April 14, 2026, 7:53pm
3
Hey!
restart: always was the first thing I tried (changed from unless-stopped), but it doesn't help.
Not sure why it's not trying to restart.
The second solution you suggested is a really good one; I will try it as soon as I get home.
I don't think the CDI refresh is failing, since after ~45 seconds I can start the containers manually without another restart.