"failed to update store for object type *libnetwork.endpointCnt: Key not found in store." and "Cannot connect to the Docker daemon"

Hi, I have been trying to get some software running that uses Docker: a Nextflow pipeline to analyze PacBio HiFi full-length 16S data. Unfortunately, it’s important that we get this running ASAP.

I have the software (Docker version 20.10.21, build baeda1f) installed in rootless mode on our CentOS 8 server and compute nodes, and am just trying to complete the testing phase, at the paragraph starting with “To test the pipeline, run the example below”. You can see the issues I’ve been having starting at this comment.
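
For reference, rootless mode was set up per user following the standard rootless-install steps, roughly like this on each node (paths are the defaults, nothing custom):

dockerd-rootless-setuptool.sh install                      # run as my own (non-root) user
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock   # point the client at the rootless socket
systemctl --user enable --now docker                       # run the daemon as a user service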

What I’ve noticed is that after I execute the nextflow command I always get a “Cannot connect to the Docker daemon” error, and after that I can’t even docker run hello-world without restarting the docker service (systemctl --user restart docker); otherwise it throws the “Key not found in store” error mentioned in the subject.

Once the docker service is restarted, hello-world runs fine, but I still receive the “Cannot connect to the Docker daemon” error when running the nextflow command. Then hello-world stops working until docker is restarted again. Any idea what keeps causing the daemon to become unreachable and require constant restarts, and how to stop that?
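
To make the cycle concrete, a typical session looks like this (paraphrased; <pipeline> stands in for the actual nextflow test command from the pipeline docs):

nextflow run <pipeline> ...         # fails: "Cannot connect to the Docker daemon"
docker run hello-world              # now also fails with the endpointCnt "Key not found in store" error
systemctl --user restart docker     # restart the rootless daemon
docker run hello-world              # works again...
nextflow run <pipeline> ...         # ...but this still fails, and hello-world is broken again until the next restart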

Further, I’ve also learned that dockerd is having problems on some of the nodes in our cluster. We have a head node + 4 compute nodes. I’m attempting to troubleshoot further by starting dockerd in the foreground: first stopping the background service (systemctl --user stop docker) and then running dockerd-rootless.sh. This works fine on the head node and compute node n010, but for some reason dockerd-rootless.sh fails to start on the remaining compute nodes n011-n013; the last lines of the output are shown below. This is especially puzzling to me because all 4 compute nodes boot from the same image, and so are essentially identical systems. Unfortunately, I couldn’t find anyone else out there with similar problems and a solution…

WARN[2022-12-12T17:18:10.897937785-07:00] could not use snapshotter devmapper in metadata plugin  error="devmapper not configured"
INFO[2022-12-12T17:18:10.897945545-07:00] metadata content store policy set             policy=shared
WARN[2022-12-12T17:18:11.856348550-07:00] grpc: addrConn.createTransport failed to connect to {unix:///run/user/10063/docker/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///run/user/10063/docker/containerd/containerd.sock: timeout". Reconnecting...  module=grpc
WARN[2022-12-12T17:18:14.723191667-07:00] grpc: addrConn.createTransport failed to connect to {unix:///run/user/10063/docker/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///run/user/10063/docker/containerd/containerd.sock: timeout". Reconnecting...  module=grpc
WARN[2022-12-12T17:18:19.111939509-07:00] grpc: addrConn.createTransport failed to connect to {unix:///run/user/10063/docker/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///run/user/10063/docker/containerd/containerd.sock: timeout". Reconnecting...  module=grpc
WARN[2022-12-12T17:18:20.898041903-07:00] waiting for response from boltdb open         plugin=bolt
WARN[2022-12-12T17:18:24.874256415-07:00] grpc: addrConn.createTransport failed to connect to {unix:///run/user/10063/docker/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///run/user/10063/docker/containerd/containerd.sock: timeout". Reconnecting...  module=grpc
failed to start containerd: timeout waiting for containerd to start
[rootlesskit:child ] error: command [/usr/bin/dockerd-rootless.sh] exited: exit status 1
[rootlesskit:parent] error: child exited: exit status 1
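
For completeness, this is the exact sequence I run on each node to get the foreground daemon (same two commands everywhere), plus a quick check of the containerd socket that the grpc errors above complain about:

systemctl --user stop docker        # stop the background user service first
dockerd-rootless.sh                 # start the daemon in the foreground
ls -l /run/user/10063/docker/containerd/containerd.sock    # does containerd ever create its socket?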

Okay I’m just getting weird results all over…

What I’ve observed are strange things such as:

  1. If I stop the service (systemctl --user stop docker) on the compute node it’s working on (n010), then the service will start normally without issues on another compute node (n011 for example), and hello-world runs there.
  2. But now hello-world doesn’t run on the head node anymore until I restart docker. Further, docker fails to start at all on n010.
  3. It seems I can only successfully start docker on the head node plus one compute node, but I can only actually run hello-world in one of those places, and doing so corrupts the other running instance.

Obviously, then, it’s something to do with the clustered environment and/or shared resources. Here are the relevant resources I can think of that are shared between the nodes:

  • Users’ home directories, which includes ~/.config/docker and ~/.config/systemd (which has the docker service scripts)
  • Docker data-root
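
For the second point: in rootless mode the data-root defaults to ~/.local/share/docker, i.e. inside the shared home. A quick way to confirm where a daemon is actually keeping its data:

docker info --format '{{ .DockerRootDir }}'               # prints the active data-root
df -h "$(docker info --format '{{ .DockerRootDir }}')"    # shows which filesystem/mount it lives on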

So what’s the proper way to let users run docker (rootless mode) in a clustered environment? Is this even possible/supported? I’ll be researching this in the meantime… If not, then I’ll have to get the software running another way…

Maybe docker swarm mode is what we need? I’m looking into setting up and deploying this now…

Never share the docker data root between nodes! That directory contains everything, including the state of your containers. If you want to share the home directories, don’t use rootless mode. If you used rootless mode for security reasons, you can use user namespaces instead.
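
For example, user namespace remapping is a setting on the normal (rootful) daemon; something like this on each node (a sketch, untested here; merge it into any existing daemon.json rather than overwriting, and "default" makes Docker create and use a dockremap user):

sudo tee /etc/docker/daemon.json <<'EOF'
{
  "userns-remap": "default"
}
EOF
sudo systemctl restart docker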

If you want the users to be able to run their own Docker daemon, you could try running Docker in Docker and give every user one container with their home directory mounted inside. I have never done that, and I’m not sure I would, but it could work (in theory).
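
Just to illustrate the idea with the official docker:dind image (a sketch, untested; the user name and path are examples, and --privileged is required, which has its own security trade-offs):

# one daemon-in-a-container per user, with their home directory mounted inside
docker run -d --privileged --name dind-alice -v /home/alice:/home/alice docker:dind

# the user then runs containers against that inner daemon
docker exec -it dind-alice docker run hello-world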

Another idea: you could move the docker data dir out of the home dir, to something like /var/lib/docker-rootless/<USERNAME>, and create a symbolic link in the home directory pointing to that location. I am not sure if it works.
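
Something along these lines, per user (a sketch; ~/.local/share/docker is the default rootless data-root, the daemon must be stopped first, and the target directory has to exist on node-local storage on every node, since the symlink in $HOME is shared):

sudo install -d -o "$USER" /var/lib/docker-rootless/"$USER"   # needs root: node-local dir owned by the user
systemctl --user stop docker                                  # as the user: stop the rootless daemon
mv ~/.local/share/docker /var/lib/docker-rootless/"$USER"/docker
ln -s /var/lib/docker-rootless/"$USER"/docker ~/.local/share/docker
systemctl --user start docker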

I don’t know whether you can run Swarm in rootless mode, but Swarm would not work with a shared home folder in rootless mode either. I have only heard of Nextflow, so I don’t know what environment it needs. If it needs Swarm, you can try it, but that alone would not help.


Yeah, I guess rootless mode was causing the issues. Swarm wasn’t working either, and the more I read about it, the less sense it made, since we already have Slurm set up on the cluster and this particular Nextflow pipeline supports using Slurm.
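
(For reference, the generic way to point Nextflow at Slurm is the executor setting in a nextflow.config in the launch directory; the partition name below is a placeholder, and the pipeline may also ship its own profile for this.)

cat > nextflow.config <<'EOF'
process {
    executor = 'slurm'      // submit each process to Slurm instead of running it locally
    queue    = 'compute'    // placeholder partition name
}
EOF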

The developers of the pipeline have successfully run it through docker, so I’m guessing the underlying issue here is with rootless mode.

Rather than continue down this path, I just tried the pipeline’s other configuration methods, which can be run without Docker, and that ended up working.

If I find myself in a similar situation in the future, I’ll be sure to check out user namespaces as well. Thanks!