Docker Fails to Launch GPU Containers with NVIDIA Runtime, but Podman Works

Hi team,

I was trying to run Docker with the NVIDIA Container Toolkit to enable GPU access in a container:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

However I got the following error:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI 
runtime create failed: runc create failed: unable to start container process: error during container init: error 
running prestart hook #0: signal: killed, stdout: , stderr: Auto-detected mode as 'legacy': unknown

Run 'docker run --help' for more information

As you can see, there is no meaningful debug information in stdout or stderr. I went through a lot of forums and GitHub issues and didn't find any similar cases. I also tried Podman in rootless mode:

 podman run --rm --security-opt=label=disable \
   --device=nvidia.com/gpu=all \
   ubuntu nvidia-smi

And it worked:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S-24Q                On  | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P8              N/A /  N/A |     51MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Here is my docker info:

docker info
Client: Docker Engine - Community
 Version:    28.1.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.35.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 28.1.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.5-0-g59923ef
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-138-generic
 Operating System: Ubuntu 22.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 116.2GiB
 Name: rs-l-r9yh49
 ID: cbd02c7b-b0d7-41ad-9593-a2079a60350b
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false

Do you think this could be an issue with my Docker setup?

Try removing --runtime=nvidia from your command. The NVIDIA container runtime is an archived project; --gpus all should be enough. You can find examples on NVIDIA's website too.

Their docs show the --runtime option as an alternative, but the recommended way is just the --gpus option.
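For example, based on your original command (just a sketch; swap in whatever image you actually want to test):

docker run --rm --gpus all ubuntu nvidia-smi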

And there are examples for Docker Compose too, if you need them.
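Something along these lines should work in a compose file (a minimal sketch, untested here; the service name and image are placeholders):

services:
  gpu-test:
    image: ubuntu
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all        # request every GPU; use an integer to limit it
              capabilities: [gpu]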

I just experienced this too, which I found very strange since I hadn't changed anything since the last time I started up my Docker container that uses the NVIDIA container runtime. I discovered that, sadly, the computer had updated itself (and my version holds/locks were inadequate).

Docker (a vLLM-for-Blackwell image) would no longer start. nvidia-smi was telling me I had a version mismatch with, I believe, NVML. I rebooted, and now nvidia-smi only tells me it "couldn't communicate with the driver."
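If you land in the same state, a few quick checks should tell you whether the kernel module was rebuilt and loaded after the update (just a sketch; adjust for your driver branch):

dkms status | grep -i nvidia    # was the module rebuilt for the new kernel?
lsmod | grep nvidia             # is it currently loaded?
sudo dmesg | grep -i nvrm       # NVRM messages, e.g. API/client version mismatch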

I found that /usr/bin/unattended-upgrade ran this morning during the 6 AM hour (Eastern/NY/USA), installing libnvidia-compute-570:amd64 (570.133.07-0ubuntu0.24.10.1, automatic) and upgrading libnvidia-compute-560 (560.35.03-0ubuntu5, 560.35.03-0ubuntu5.2) as well.
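For anyone who wants to check the same thing on their own machine, the stock Ubuntu logs should show it (default paths):

grep -i nvidia /var/log/apt/history.log
grep -i nvidia /var/log/unattended-upgrades/unattended-upgrades.log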

WAG here, but those two updates seem likely to have caused the unexpected failure to start Docker with NVIDIA. I also see several other serious updates that could play into it: linux-headers, linux-tools, linux-modules, linux-image.
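If you want to keep unattended-upgrades away from the driver in the future, holding the packages should do it (package names here are examples for the 560 branch; substitute your own):

sudo apt-mark hold nvidia-driver-560 libnvidia-compute-560
apt-mark showhold    # confirm the holds are in place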

Hoo boy, just when you think using the container will isolate you from ugly surprises like this, nope. Ugh.

I suppose I'm off to figure out how to run vLLM (on a Blackwell GPU) without the vLLM Docker image, since the container pretty much just presents a different set of issues.

Given Python virtual environments, container issues are not a set of issues I particularly care to learn more about for my purpose of local inferencing. Time to dig (back) in on trying to run vLLM + Blackwell without the container.

Please, oh please, everything, be production-ready today!

Hopefully the Blackwell stuff is production-ready by now and building won’t be so delicate.

Hi everyone,

Thank you all for your responses!

For my problem, it turns out that when I ran docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi, it pulled the latest CUDA image. However, the CUDA toolkit in that image was not aligned with the host machine's GPU driver version. After running the correct CUDA image, it works:

sudo docker run --rm --runtime=nvidia  --gpus all nvcr.io/nvidia/cuda:12.2.2-runtime-ubuntu22.04 nvidia-smi

The cuda:12.2.2-runtime-ubuntu22.04 image includes CUDA 12.2, which is compatible with my GPU driver version 535.216.01.

I guess the lesson here is: whenever you run a container that ships its own CUDA runtime, make sure it is compatible with the host's driver.
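A quick way to sanity-check this before pulling an image (a sketch; the tag is just the one that happened to work for me):

nvidia-smi --query-gpu=driver_version --format=csv,noheader    # host driver, e.g. 535.216.01
nvidia-smi | grep "CUDA Version"                               # highest CUDA version the driver supports
sudo docker run --rm --gpus all nvcr.io/nvidia/cuda:12.2.2-runtime-ubuntu22.04 nvidia-smi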
