Docker run fails but only when run sequentially after a previously successful docker run

Hi all,

I’ve used Google Cloud Platform to create an auto-scaling compute cluster using Slurm. When a job is submitted, a compute node boots up and its startup script runs, which pulls a Docker image from Google’s Artifact Registry. A lengthy simulation is then run through the container in GPU mode, and after around 2 hours of computation the process finishes. This works with no problem; the issue arises if another job is then submitted directly to the same compute node that was previously running, with no node reboot in between. I then get this error message when I try to run my Docker container:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.
time="2024-05-03T15:49:31Z" level=error msg="error waiting for container: context canceled"

Clearly, my Docker image itself is fine, since the same job runs successfully when jobs are submitted to separate compute nodes. The issue only arises when Docker tries to run on a compute node that has just finished a previous job and hasn’t fully shut down / rebooted in between.
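For reference, the mismatch mentioned in the error can apparently be checked on the node by comparing the loaded kernel module against the driver files on disk, along these lines (I haven’t captured this from a failing node yet, so I can’t confirm the versions actually differ):

# Version of the NVIDIA kernel module currently loaded
head -n1 /proc/driver/nvidia/version
# Version of the module on disk (what the user-space libraries now expect)
modinfo -F version nvidia
# nvidia-smi fails with "driver/library version mismatch" when these differ
nvidia-smi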

I have tried adding Docker cleanup commands to the start and end of my submission script, but this isn’t helping. For example:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=compute
#SBATCH --output {stdout_file}
#SBATCH --error {stderr_file}
#SBATCH --open-mode=append


# Stop and remove any existing containers
CONTAINER_IDS=$(docker ps -a -q)
if [[ -n "$CONTAINER_IDS" ]]; then
    # Stop all running Docker containers
    if [[ -n "$(docker ps -q)" ]]; then
        docker stop $(docker ps -q)
    fi

    # Remove all Docker containers
    docker rm -f $CONTAINER_IDS
else
    echo "No Docker containers to clean up."
fi

# Remove unused Docker volumes and networks
docker volume prune -f
docker network prune -f

# Use the Docker container to run the Python script
export GOOGLE_APPLICATION_CREDENTIALS={google_application_credentials}
gcloud auth configure-docker us-central1-docker.pkg.dev --quiet
docker pull us-central1-docker.pkg.dev/{project_name}/{docker_repo}/{docker_image}:{docker_tag}
docker run --gpus all --entrypoint /bin/bash --rm -v {login_mount_point}:{compute_mount_point} us-central1-docker.pkg.dev/{project_name}/{docker_repo}/{docker_image}:{docker_tag} \
    -c "begin running code that runs the process..."

If anyone has any ideas about what is causing this, any help would be appreciated.

Hello,
The "driver/library version mismatch" NVML error means the NVIDIA kernel module that is still loaded on the node no longer matches the user-space driver libraries on disk. This is residual state from the previous job: it typically happens when a driver package update runs on the node (for example via unattended upgrades, or a startup script reinstalling the driver) while the first job is still running, so the old module stays loaded but the libraries are replaced, and the next `docker run --gpus` then fails in the NVIDIA container hook. Rebooting nodes between jobs clears this. Alternatively, reload the NVIDIA kernel modules before the next job, or pin the driver packages so they can’t be updated underneath a running node. It is also worth verifying that your Docker and NVIDIA Container Toolkit configurations are correct and consistent with the installed driver.