Docker Container Restart Problems

Our team has been running a service in a Docker container, and the host device restarts occasionally, sometimes a few times a day. Most of the time the service and the container restart without a problem, but sometimes the container exits with 0 on boot, so our service fails to restart.

Our current restart policy is set to “on-failure”.
Of course, we can set the restart policy to “always”, but I’m trying to understand why the container sometimes exits with 0. Our code never exits with 0, and its threads are always actively running.
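
For reference, this is roughly how we check an affected device after a bad boot (“our-service” is a placeholder for our container name):

# last known state and exit code of the container
docker inspect -f '{{.State.Status}} {{.State.ExitCode}}' our-service
# the restart policy that is actually applied to the container
docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' our-service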

I’ve found that in these cases a weird log message shows up:

Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.819697640Z" level=error msg="7f33f3820431187ad77e5d4b4da207ccb855e307a09856d0549769c5672ba6dd cleanup: failed to delete container from containerd: no such container"

Why is Docker trying to delete a container we want to keep? The container is still there, by the way; it has just exited with code 0. I’m having a hard time reproducing this error, though I have been trying. If anyone has any ideas of how this can be reproduced, please let me know. I’ve been trying all kinds of reboots: from the power button, from the terminal, and by unplugging the power.
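
For completeness, these are the terminal-initiated reboot variants I have been trying, roughly in increasing harshness (the last one assumes kernel.sysrq permits the “b” command):

# clean reboot: systemd stops services, including dockerd, in order
sudo systemctl reboot
# forced reboot: skips stopping services, processes are just killed
sudo systemctl reboot --force
# immediate reset via sysrq: nothing is stopped or synced at all
echo b | sudo tee /proc/sysrq-trigger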

I am not sure about the stopped container with exit code 0, because I don’t know exactly how containers stop in case of a system reboot. I can imagine that sometimes a container manages to stop properly, so there is no error code. Correct me if you have more information.

If I understand correctly, it is part of the cleanup process that runs when the Docker daemon cannot start a stopped container on boot. Here is why I think that:

1 ) I could find the error message in daemon/start.go on line 264, which is in the Cleanup method.
2 ) This is called on line 133 in the containerStart method with a comment on line 119:

// if we encounter an error during start we need to ensure that any other
// setup has been cleaned up properly

3 ) The Cleanup method calls the containerd Delete method, but that probably does not mean Docker will forget about the container. My guess (based on some research) is that containerd stops the processes and unmounts the container’s root filesystem, so you will not see it in the list of mounted directories (mount | grep overlay), but the root filesystem itself remains untouched. You can actually delete a running container by killing some processes like containerd-shim and still see the container as running when you execute docker ps. Of course, you would not be able to exec any command inside it. I tried it :) (a rough sketch of the experiment follows after this list)

4 ) Since the error message says the container was not found, I guess the daemon tried to clean up based on metadata that was left incorrect by the system reboot.
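
If you want to repeat the experiment from point 3, it was roughly this (the PID and the container name are placeholders; use whatever belongs to your test container):

mount | grep overlay            # the container's root filesystem is mounted while it runs
ps ax | grep containerd-shim    # find the shim process that belongs to the container
sudo kill -9 <shim-pid>         # <shim-pid> stands for the PID found above
docker ps                       # the container can still show up as running...
docker exec mycontainer ls      # ...but exec no longer works inside it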

I haven’t relied on the restart policy of Docker for a long time. I would suggest using Docker Swarm, for example, to make sure containers restart on boot.
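
A single-node swarm is already enough to get a reconciliation loop instead of a one-shot restart policy. A minimal sketch, with placeholder names:

docker swarm init
# restart-condition "any" (the default) restarts the task no matter how it exited
docker service create --name myservice --restart-condition any myimage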

Thanks for the quick reply, Akos!
I find it weird that my container still exits with 0 even though there is a line in the code for this case: when an error happens during start and the container is detected with exit code zero, the exit code is set to 128. That was not the case for me, though; when I checked my containers, the exit code was 0.

In response to your comment about containerd stopping processes: I have observed the shim being disconnected in the containerd logs, but that’s likely normal during a shutdown (I could be wrong). Even when this shows up, it doesn’t always end in my exit code 0 problem.

Oct 20 17:02:37 amazon containerd[996]: time="2021-10-20T17:02:37.730777709Z" level=info msg="shim disconnected" id=84d717d17ef01c13a7c8d010a0f6d098c904d44ac267ecb423e3261bec17524e
Oct 20 17:02:37 amazon containerd[996]: time="2021-10-20T17:02:37.730817732Z" level=error msg="copy shim log" error="read /proc/self/fd/12: file already closed"

I would attach the whole daemon log file, but since I’m a new user I can’t upload.
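
For reference, I pulled the snippets from the systemd journal; something like this should work on similar setups:

journalctl -u docker.service --since "2021-10-17 09:02" --until "2021-10-17 09:04"
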
Here’s the snippet of the engine shutting down and rebooting that produced the error message I shared in my first post:

Oct 17 09:02:50 amazon systemd[1]: Stopping Docker Application Container Engine...
Oct 17 09:03:00 amazon dockerd[1219]: time="2021-10-17T09:03:00.529311102Z" level=info msg="Container 7f33f3820431187ad77e5d4b4da207ccb855e307a09856d0549769c5672ba6dd failed to exit within 10 seconds of signal 15 - using the force"
Oct 17 09:03:00 amazon dockerd[1219]: time="2021-10-17T09:03:00.656214834Z" level=info msg="ignoring event" container=7f33f3820431187ad77e5d4b4da207ccb855e307a09856d0549769c5672ba6dd module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
-- Reboot --
Oct 17 09:03:49 amazon systemd[1]: Starting Docker Application Container Engine...
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.712208488Z" level=info msg="Starting up"
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.747076549Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.747099106Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.747114491Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.747125866Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.752794860Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.752826418Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.752842997Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.752850269Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.784221034Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.806295629Z" level=warning msg="Your kernel does not support swap memory limit"
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.806315057Z" level=warning msg="Your kernel does not support CPU realtime scheduler"
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.806319334Z" level=warning msg="Your kernel does not support cgroup blkio weight"
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.806322879Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.806826331Z" level=info msg="Loading containers: start."
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.819697640Z" level=error msg="7f33f3820431187ad77e5d4b4da207ccb855e307a09856d0549769c5672ba6dd cleanup: failed to delete container from containerd: no such container"
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.951032416Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Oct 17 09:03:49 amazon dockerd[1214]: time="2021-10-17T09:03:49.989878591Z" level=info msg="Loading containers: done."
Oct 17 09:03:50 amazon dockerd[1214]: time="2021-10-17T09:03:50.076058251Z" level=info msg="Docker daemon" commit=75249d8 graphdriver(s)=overlay2 version=20.10.8
Oct 17 09:03:50 amazon dockerd[1214]: time="2021-10-17T09:03:50.080070825Z" level=info msg="Daemon has completed initialization"
Oct 17 09:03:50 amazon systemd[1]: Started Docker Application Container Engine.
Oct 17 09:03:50 amazon dockerd[1214]: time="2021-10-17T09:03:50.122605118Z" level=info msg="API listen on /run/docker.sock"

Yes, I saw that line. This is why I wrote that I am not sure about the stopped container with exit 0.

I would like to help you because 1 ) I like to solve mysteries and 2 ) I had a similar problem, and I still have it; I just did not have time to work on it. In my case not only could the containers not start, the Docker daemon could not start either, and the reason seemed to be some kind of timeout after similar error messages.

If someone has more ideas, I would like to hear them too :) Until then, if I have any new idea, I will get back to you. I am too tired now, but I will read the logs you quoted later.

Thank you so much for helping me out! I haven’t observed any Docker daemon problems, and this problem is easily worked around with a “docker container start”, but we can’t afford to require manual intervention every time this happens.
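
Our manual workaround looks roughly like this (“our-service” is again a placeholder):

# spot the container that stayed down after the reboot
docker ps -a --filter status=exited --format '{{.Names}} {{.Status}}'
# start it again by hand - works every time, but needs a human
docker container start our-service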

Our application runs on a handful of devices (we are constantly adding more), so having Docker successfully restart containers on reboot is crucial for smooth operations.

I know you wrote that you could set the restart policy to “always” but would rather understand why the container exits with 0. But if your goal is to make sure the containers restart on boot, maybe you should change the policy to “always” until you find out the reason.

Unfortunately I don’t see more clues to the exit code problem, but I noticed this line in your logs:

Oct 17 09:03:00 amazon dockerd[1219]: time="2021-10-17T09:03:00.529311102Z" level=info msg="Container 7f33f3820431187ad77e5d4b4da207ccb855e307a09856d0549769c5672ba6dd failed to exit within 10 seconds of signal 15 - using the force"

This means the container received the stop signal and did not handle it in time. 10 seconds is the default timeout, which you can increase using the --stop-timeout parameter of docker run, or stop_grace_period with Docker Compose. If the daemon has to kill the container before it can stop gracefully, that could corrupt your data. The reason for this error is either that your app needs more time to close connections and open files, or that the stop signal is not handled correctly, which is often caused by a misconfigured start command built from SHELL, ENTRYPOINT and CMD in the Dockerfile. I have an example project that shows incorrect and correct configurations: itsziget/tutorial-linux-signals
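
For example, to give a container 30 seconds instead of the default 10 (names are placeholders):

docker run -d --name myservice --stop-timeout 30 myimage
# the Docker Compose equivalent is: stop_grace_period: 30s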

I made it for my YouTube tutorial video, which I won’t link here since it is in Hungarian, so you would not understand it. tini and supervisor can help, but I also have some bash examples that handle signals correctly. This is important if you want to be sure that stopping and restarting containers works as you expect within a reasonable time.
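
A minimal sketch of the bash pattern I mean (“my-service” is a placeholder for the real process; the script has to be started in exec form, e.g. ENTRYPOINT ["/entrypoint.sh"], so it runs as PID 1 and actually receives the signal):

#!/bin/bash
# entrypoint.sh - forward SIGTERM to the service and wait for it to finish

term_handler() {
  echo "SIGTERM received, shutting down gracefully"
  kill -TERM "$child" 2>/dev/null   # pass the signal on to the service
  wait "$child"                     # let it finish before PID 1 exits
  exit 0
}

trap term_handler SIGTERM

my-service &     # start the real service in the background
child=$!
wait "$child"    # keep PID 1 alive until the service exits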

The next part of your log shows this:

Oct 17 09:03:00 amazon dockerd[1219]: time="2021-10-17T09:03:00.656214834Z" level=info msg="ignoring event" container=7f33f3820431187ad77e5d4b4da207ccb855e307a09856d0549769c5672ba6dd module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

It contains the same container ID, so there might be a connection between forcibly killing the container and trying to delete it later.

I am not sure if on-failure is even supposed to restart a service on boot. The name doesn’t imply that it should; it only reacts to a non-zero exit code. I believe that you randomly “benefited” accidentally from forceful kills of your containers.
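
A quick way to convince yourself of the on-failure semantics, with throwaway containers (busybox’s true/false just stand in for a clean vs. failing exit):

docker run -d --name exit-zero --restart on-failure busybox true    # exits 0, is not restarted
docker run -d --name exit-one --restart on-failure:3 busybox false  # exits 1, is restarted up to 3 times
docker ps -a --filter name=exit- --format '{{.Names}} {{.Status}}'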

I can only second rimelek’s suggestion to change the restart policy to always.

We can expect the Docker container to exit with a non-zero code when it is forcibly killed. I did a test with a basic image, and when I restarted Docker, the container exited with code 137 (128 + 9, i.e. SIGKILL).
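
The test looked roughly like this (image and name are arbitrary; no restart policy, so the exit code stays visible afterwards):

# sleep runs as PID 1 and does not handle SIGTERM, so the daemon has to SIGKILL it
docker run -d --name sigtest alpine sleep 3600
sudo systemctl restart docker
# shows "Exited (137) ..." = 128 + 9 (SIGKILL)
docker ps -a --filter name=sigtest --format '{{.Names}} {{.Status}}'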

The logs showed the same “failed to exit within 10 seconds of signal 15 - using the force” message.

This explains why the on-failure restarts normally work: every single time we reboot, this force kill happens and the container exits non-zero. However, the reason why Docker attempts to delete the container is still a mystery to me, and that only happens in the rare cases that come with an exit code of 0.

Yes, the force kill happens on every reboot, but it shouldn’t. So this is what you should solve first, in my opinion. If the containers need more time to stop, you can increase the stop timeout. If the containers do not handle the stop signal, then you should fix that.