Many complaints during MPI within container

Issue type

The MPI job appears to complete successfully but complains voluminously along the way.

TL;DR

I think the various processes running inside the container are having trouble learning each other's identities for communication, or some such. Best I can tell, this is happening around a wait4 call. Is there a directory I can mount or a flag I can use that will help?

OS Version/build

Inside container: python:3.7.5-buster plus many additions, mostly via apt-get

Outside container: Pop!_OS 19.10 with Docker version 19.03.6, build 369ce74a3c

App version

A very custom, home-grown image based on the Buster image given above.

The app most likely giving/causing the error: mpirun (Open MPI) 3.1.3

Steps to reproduce

I’m hoping no one needs to do this because it’s a lot of work.

Overview:

  1. Start with Buster.

  2. Add everything necessary to run the image as a compute node in a Slurm cluster.

  3. Add everything needed to compile and run AMBER (ambermd.org) on that compute node.

  4. Build AMBER into the image.

  5. Set up a system for running an MD simulation using sander.MPI and mount the directory containing the simulation files.

  6. Start up the container using docker-compose (a rough sketch of the compose file and the commands from these steps appears after this list). The Slurm daemon is the container's main process, but we’re not using it in this test. It’s just there.

  7. Attach to the container with /bin/bash and become a non-root user (for convenience).

  8. Run the simulation by prepending "mpirun -np 4 " to the rest of the (lengthy) command line. The ‘4’ can change to ‘2’, but the errors disappear if it is ‘1’.

  9. Watch many, many lines like this fly past:

    [gw-slurm-amber:00053] Read -1, expected 34728, errno = 1
    [gw-slurm-amber:00052] Read -1, expected 17856, errno = 1
    [gw-slurm-amber:00052] Read -1, expected 34728, errno = 1
    [gw-slurm-amber:00054] Read -1, expected 17544, errno = 1
    [gw-slurm-amber:00054] Read -1, expected 35472, errno = 1
    [gw-slurm-amber:00055] Read -1, expected 35472, errno = 1
    [gw-slurm-amber:00052] Read -1, expected 17856, errno = 1
    [gw-slurm-amber:00052] Read -1, expected 34728, errno = 1
    [gw-slurm-amber:00053] Read -1, expected 17616, errno = 1

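For reference, steps 5 through 8 look roughly like this. The image name, mount path, user name, and the sander.MPI arguments below are illustrative placeholders rather than the exact values from my setup:

    # docker-compose.yml (sketch)
    version: "3.7"
    services:
      slurm-amber:
        image: local/slurm-amber:latest   # the custom Buster-based image from steps 1-4
        hostname: gw-slurm-amber          # matches the hostname in the log lines above
        volumes:
          - ./simulation:/data            # directory holding the simulation files (step 5)
        command: ["slurmd", "-D"]         # Slurm daemon as the main process (present but unused here)

And the commands from steps 6-8:

    docker-compose up -d                         # step 6: start the container (from the host)
    docker-compose exec slurm-amber /bin/bash    # step 7: attach with a shell
    # ...inside the container now:
    su - simuser                                 # become a non-root user (placeholder name)
    cd /data
    mpirun -np 4 sander.MPI -O -i mdin -o mdout -p prmtop -c inpcrd   # step 8; the real command line is much longer
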
Some notes:

I tried searching on “Read -1 expected errno = 1” and got “Your search - Read -1 expected errno = 1 - did not match any documents.” None. In all of the Goracle’s infinite knowledge. This must be my superpower.

I ran strace on the whole thing because I had no idea what else to do. Apparently, this is happening around a wait4 system call. That call (wait4) doesn't seem to come from the simulation software itself. My wild guess is that mpirun is having trouble getting consistent information about its child processes.
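
In case it helps, the trace was gathered along these lines (the output file name and the sander.MPI arguments are placeholders; the real command line is longer):

    # Follow mpirun and every child it forks, writing the full trace to a file
    strace -f -o mpi-trace.txt mpirun -np 4 sander.MPI -O -i mdin -o mdout -p prmtop -c inpcrd

    # The wait4 calls can then be picked out of the trace afterwards
    grep -n 'wait4' mpi-trace.txt | head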

This is not being run in parallel across multiple Docker containers; it is using 2 or 4 processors inside a single Docker container.

If I run with “-np 1” (use mpirun and the MPI executable, but only one processor), then there are no messages. With 2 or 4 processors, there are messages.

There are more messages - about 4x more - when np is 4 versus when np is 2.

I’ve run many varieties of AMBER simulations and have never seen this.

If no one has any ideas, I’ll just try rebuilding everything, in case something went sideways somewhere along the way.