Best practice for debugging failed containers

Hi! Just learning Docker and trying to grasp the best practice for debugging when container execution fails.

E.g. consider the simple Dockerfile below.

FROM debian:bookworm
RUN apt-get update -y && apt-get install -y \
    python-is-python3
CMD python -c "raise Exception"

My questions are:

  1. Is there a way to do something like docker run --rm, but keep the container in case it fails? Searching for an option like this, I’ve found reports that --rm should indeed keep a failed container (link), but that doesn’t seem to be true anymore / in my example; see the snippet below.

  2. Is there a way to enter a failed container / a container after it has exited? I’ve found that it’s possible to do docker commit <failed_container_id> postmortem-debug and then docker run -it postmortem-debug /bin/bash, but I wonder if there’s a more direct way to do it.

Snippet:

docker build -t hello-world-debian .
docker run --rm hello-world-debian
# Exception

docker ps -a
# No containers left.

--rm was never for keeping but for removing, as its name suggests: rm = remove. The linked issue indicates containers were not removed when they failed, but that was 13 years ago, basically at the very beginning of Docker.

If you want to check the logs, do not use --rm. If you want to debug the command that would run when starting the container, start the container with an interactive shell and execute the same command interactively.

docker run --rm hello-world-debian bash

and then run any command you would like to test.

If the build itself fails, remove the failing line and everything after it from the Dockerfile, then run those commands interactively.
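A minimal sketch of the no---rm workflow, assuming the image name from the question (the container name postmortem is an arbitrary choice):

```shell
# Run without --rm so the exited container is kept for inspection
docker run --name postmortem hello-world-debian

# The container now shows up as Exited; check its logs and exit code
docker ps -a --filter name=postmortem
docker logs postmortem
docker inspect --format '{{.State.ExitCode}}' postmortem

# Clean up manually when done
docker rm postmortem
```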


Thanks for the clarification. I guess it’s still possible to run the container with something like CONTAINER_ID=$(docker run -d hello-world-debian) && docker attach $CONTAINER_ID && docker rm $CONTAINER_ID || echo "Error running container $CONTAINER_ID", to remove it unless it fails.

Any ideas about the one below - is docker commit the way to do it or there’s something else?

Have you considered overriding the entrypoint script with sh or bash (if the latter is available in the image → container)?

You could inspect the image to find out what the original entrypoint and/or command of the image is, then start a container with overridden entrypoint script from the image, and start your troubleshooting steps from there.

Something like this:

docker run -ti --rm --entrypoint sh <image>
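To find the original entrypoint and command mentioned above, docker image inspect with a Go template should work (a sketch; hello-world-debian stands in for your image):

```shell
# Show the image's default entrypoint and command
docker image inspect \
  --format 'Entrypoint: {{json .Config.Entrypoint}}, Cmd: {{json .Config.Cmd}}' \
  hello-world-debian
```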

Overriding the entrypoint and running commands manually works, but I was thinking that the more common situation is when a user needs to investigate a failed container right after it failed, since starting the container again with -it and rerunning the commands manually can be time consuming.

Debugging is often time consuming, but it has to be done. There is no automated solution here. Just like without containers, you still have to check logs and find out what caused the error. If you have a well-built image with parameters and an entrypoint that warns you when a parameter is used incorrectly, there is less chance of running into an issue that takes a lot of time to solve. But during development, you will have to deal with possible errors.

Note that a container is basically a process running on the host, but isolated. Once the process stops, that means the “container stopped”. The only way to keep it running is to have an init process that keeps running even when the app that the container was made for has already stopped. But that is a virtual-machine way of working, as it also requires commands to be available in the container. Containers often contain only a single binary, or at least very few commands, which also makes the container less vulnerable.

The app in the container should return good error messages so you don’t need to run anything in the container.
But the solution recommended by @meyay (and, by the way, by me, except that I recommended changing the command, while sometimes you need to change the entrypoint) can be used to run the exact same commands you originally ran in the container, except that you have a shell, so you can more easily see all the files in the container when the process fails, in case the process saved logs to files and didn’t send them to the standard output or error stream.

It is probably not something that many users need; that is why it is not an existing feature. When you run a container in attached mode (the default), you have all the error and status logs in your terminal. If you also add -it, you get an interactive terminal (which I forgot to add to my previous post running bash, but I’m going to fix it). Whenever you have generated data, you need to use some kind of volume or bind-mounted folder so it will be kept even after the container is automatically removed.

When you run a container in the background (in detached mode), you probably don’t want to automatically remove it when it stops, so you can check the logs. Containers running in detached mode are often set to always restart, so even when one fails, it will be restarted. If it fails constantly, it will be restarted indefinitely unless there is a limit set with the “on-failure” restart policy. Having informative log messages is important when using containers. It would always be important, but since containers are intentionally isolated and more difficult to debug, you need good logs to recognize the problem.
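The “on-failure” limit mentioned above can be sketched like this (the container name and retry count are arbitrary choices):

```shell
# Restart on failure, but give up after 5 attempts
docker run -d --name myapp --restart on-failure:5 hello-world-debian

# See how many times Docker has restarted it so far
docker inspect --format '{{.RestartCount}}' myapp
```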

When checking the logs is not enough, you can try to reproduce the issue interactively as we recommended. Or you can run the container without --rm during development in attached mode, but that would only give you the logs. You should have everything on mounted folders or volumes, so there is no reason to keep the container even when it fails, unless you run it in detached mode. Even then, you can change the default logging driver on Linux with Docker CE (it won’t work with Docker Desktop) and send logs to journald. That way the container logs are kept even when the container is removed.
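A sketch of the journald setup (Docker CE on Linux only; the container name is arbitrary): set the default logging driver in /etc/docker/daemon.json, restart the daemon, and the logs survive container removal:

```shell
# /etc/docker/daemon.json should contain:
#   { "log-driver": "journald" }
sudo systemctl restart docker

# Even with --rm, the logs remain queryable after the container is gone
docker run --rm --name failing-app hello-world-debian
journalctl CONTAINER_NAME=failing-app
```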

If you think Docker CE could have some good features to help you with debugging, you could ask for those features in the roadmap:


Thank you, it gives me more perspective on the general workflows for this.

I’ll try to apply it in practice; maybe I was overestimating the need to inspect the failed container, and either inspecting the base environment with run -it or inspecting the volumes / mounted folders should indeed cover it.