Docker run fails but only when run sequentially after a previously successful docker run

Hi all,

I’ve used Google Cloud Platform to create an auto-scaling compute cluster using Slurm. When a job is submitted, a compute node boots up and its startup script runs, which pulls a Docker image from Google’s Artifact Registry. A lengthy simulation is then run through the container in GPU mode, and after around 2 hours of computation the process finishes. This works with no problem; the issue arises if another job is then submitted directly to the same compute node that was previously running, with no node reboot in between. I then get this error message when I try to run my Docker container:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
time="2024-05-03T15:49:31Z" level=error msg="error waiting for container: context canceled"

Clearly, my Docker image itself is fine, since the same job runs successfully when jobs are submitted to separate compute nodes. The issue only arises when Docker tries to run on a compute node that has just finished a previous job and hasn’t fully shut down / rebooted in between.
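For reference, the mismatch mentioned in the error can apparently be checked on the node by comparing the loaded kernel module against the driver files on disk, along these lines (I haven’t captured this from a failing node yet, so I can’t confirm the versions actually differ):

# Version of the NVIDIA kernel module currently loaded
head -n1 /proc/driver/nvidia/version
# Version of the module on disk (what the user-space libraries now expect)
modinfo -F version nvidia
# nvidia-smi fails with "driver/library version mismatch" when these differ
nvidia-smi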

I have tried adding Docker cleanup commands to the start and end of my submission script, but this isn’t helping. For example:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=compute
#SBATCH --output {stdout_file}
#SBATCH --error {stderr_file}
#SBATCH --open-mode=append


# Stop and remove any existing containers
CONTAINER_IDS=$(docker ps -a -q)
if [[ -n "$CONTAINER_IDS" ]]; then
    # Stop all running Docker containers
    if [[ -n "$(docker ps -q)" ]]; then
        docker stop $(docker ps -q)
    fi

    # Remove all Docker containers
    docker rm -f $CONTAINER_IDS
else
    echo "No Docker containers to clean up."
fi

# Remove unused Docker volumes and networks
docker volume prune -f
docker network prune -f

# Use the Docker container to run the Python script
export GOOGLE_APPLICATION_CREDENTIALS={google_application_credentials}
gcloud auth configure-docker us-central1-docker.pkg.dev --quiet
docker pull us-central1-docker.pkg.dev/{project_name}/{docker_repo}/{docker_image}:{docker_tag}
docker run --gpus all --entrypoint /bin/bash --rm -v {login_mount_point}:{compute_mount_point} us-central1-docker.pkg.dev/{project_name}/{docker_repo}/{docker_image}:{docker_tag} \
    -c "begin running code that runs the process..."

If anyone has any ideas about what is causing this, any help would be appreciated.

Hello,
The "driver/library version mismatch" NVML error means the NVIDIA kernel module that is still loaded on the node no longer matches the user-space driver libraries on disk. This is residual state from the previous job: it typically happens when a driver package update runs on the node (for example via unattended upgrades, or a startup script reinstalling the driver) while the first job is still running, so the old module stays loaded but the libraries are replaced, and the next `docker run --gpus` then fails in the NVIDIA container hook. Rebooting nodes between jobs clears this. Alternatively, reload the NVIDIA kernel modules before the next job, or pin the driver packages so they can’t be updated underneath a running node. It is also worth verifying that your Docker and NVIDIA Container Toolkit configurations are correct and consistent with the installed driver.