"failed to update store for object type *libnetwork.endpointCnt: Key not found in store." and "Cannot connect to the Docker daemon"

chandlesagi · December 13, 2022, 9:56pm

Okay I’m just getting weird results all over…

I’ve observed strange things such as:

If I stop the service (systemctl --user stop docker) on the compute node where it’s working (n010), the service then starts without issues on another compute node (n011, for example), and hello-world runs there.
After that, hello-world no longer runs on the head node until I restart docker there. Further, docker fails to start at all on n010.
It seems I can only successfully start docker on the head node plus one compute node, and I can only actually run hello-world in one of those places; doing so corrupts the other running instance.

Obviously, then, it’s something to do with the clustered environment and/or shared resources. Here are the relevant resources I can think of that are shared between the nodes:

Users’ home directories, which include ~/.config/docker and ~/.config/systemd (which has the docker service scripts)
Docker data-root
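
To confirm which of these paths really sit on shared storage, a check along these lines could be run on each node. This is only a rough sketch: the helper names is_shared_fstype and report_path are made up for illustration, the list of network filesystem types is an assumption and certainly incomplete, and ~/.local/share/docker is the default rootless data-root, which may not match what is configured here.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the filesystem type looks like a
# network/cluster filesystem. The list is an assumption; extend as needed.
is_shared_fstype() {
    case "$1" in
        nfs|nfs4|lustre|gpfs|beegfs|cifs|ceph|fuse.glusterfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print the filesystem type behind a path and whether
# it looks shared. Uses findmnt (util-linux) to resolve the backing mount.
report_path() {
    path=$1
    if [ ! -e "$path" ]; then
        echo "$path: missing"
        return 0
    fi
    fstype=$(findmnt -n -o FSTYPE -T "$path" 2>/dev/null) || fstype=
    fstype=${fstype:-unknown}
    if is_shared_fstype "$fstype"; then
        echo "$path: $fstype (looks shared between nodes)"
    else
        echo "$path: $fstype (node-local or unknown)"
    fi
}

# Paths named in the post, plus the default rootless data-root (assumed).
report_path "$HOME/.config/docker"
report_path "$HOME/.config/systemd"
report_path "${XDG_DATA_HOME:-$HOME/.local/share}/docker"

# If a daemon is reachable, also check the data-root it actually reports.
if command -v docker >/dev/null 2>&1; then
    root=$(docker info --format '{{.DockerRootDir}}' 2>/dev/null) || root=
    if [ -n "$root" ]; then
        report_path "$root"
    fi
fi
```

If the data-root shows up as shared, every node's daemon is reading and writing the same state directory, which would be consistent with one instance corrupting another as described above.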

So what’s the proper way to let users run docker (rootless mode) in a clustered environment? Is this even possible/supported? I’ll be researching this in the meantime… If not, then I’ll have to get the software running another way…