Looking for suggestions to identify the problem causing Docker engine hangs

Hi all,

I have an Ubuntu 20.04 LTS server running the Docker Engine service, and it’s used as a production environment. It worked well for a long time, but something strange happened last night. The Docker daemon stopped responding to some requests: all docker run commands were stuck indefinitely (no failure, no timeout, no warning). Yet docker ps could still show the old running containers, and containers that had finished were unable to exit. I noticed the issue after 12 hours and restarted the Docker daemon service, after which everything was back to normal. Before restarting the service, I didn’t see any resource usage on the server reaching its limit.

I checked the system logs to see what happened, but nothing was recorded for docker.service or containerd.service. The logs from that time are below. Nothing was logged for 12 hours after Nov 12 at 00:06, until I manually restarted the service on Nov 12 at 12:48, and the daemon had to force a shutdown in order to restart.

Nov 11 23:53:42 orchtool.strayos.com dockerd[1116]: time="2024-11-11T23:53:42.319442151Z" level=info msg="ignoring event" container=dad00bccd486e90f6fc0dc94f10958c44f9d94a6d773fbb2b5f3eae65f025e7c module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 12 00:06:41 orchtool.strayos.com dockerd[1116]: time="2024-11-12T00:06:41.174392203Z" level=info msg="ignoring event" container=19d535ea4a9778da6729f8a94439faeb6f42d95eb6e6dce287648177aee55453 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 12 12:48:14 orchtool.strayos.com systemd[1]: Stopping Docker Application Container Engine...
Nov 12 12:48:14 orchtool.strayos.com dockerd[1116]: time="2024-11-12T12:48:14.638534155Z" level=info msg="Processing signal 'terminated'"
Nov 12 12:48:24 orchtool.strayos.com dockerd[1116]: time="2024-11-12T12:48:24.725989449Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=f9d12f05d57a1ab647ec7868baf3cd229c91fad02126dc303981b10a75c0ea3c
Nov 12 12:48:24 orchtool.strayos.com dockerd[1116]: time="2024-11-12T12:48:24.725989449Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=19ca0fedfecb540d9d5b2d3f2e39f27b1861828884bc902cb0063175392930d9
Nov 12 12:48:24 orchtool.strayos.com dockerd[1116]: time="2024-11-12T12:48:24.725999249Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=262917a8c48a4ca3775a4ed668de3e754607a90544585c3c04a25c39b31b887d
Nov 12 12:48:24 orchtool.strayos.com dockerd[1116]: time="2024-11-12T12:48:24.899560080Z" level=info msg="ignoring event" container=f9d12f05d57a1ab647ec7868baf3cd229c91fad02126dc303981b10a75c0ea3c module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 12 12:48:24 orchtool.strayos.com dockerd[1116]: time="2024-11-12T12:48:24.901529882Z" level=info msg="ignoring event" container=19ca0fedfecb540d9d5b2d3f2e39f27b1861828884bc902cb0063175392930d9 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 12 12:48:24 orchtool.strayos.com dockerd[1116]: time="2024-11-12T12:48:24.907118690Z" level=info msg="ignoring event" container=262917a8c48a4ca3775a4ed668de3e754607a90544585c3c04a25c39b31b887d module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 12 12:48:29 orchtool.strayos.com dockerd[1116]: time="2024-11-12T12:48:29.660244722Z" level=error msg="Force shutdown daemon"
Nov 12 12:48:29 orchtool.strayos.com dockerd[1116]: time="2024-11-12T12:48:29.660419422Z" level=info msg="Daemon shutdown complete"
Nov 12 12:48:29 orchtool.strayos.com systemd[1]: docker.service: Succeeded.
Nov 12 12:48:29 orchtool.strayos.com systemd[1]: Stopped Docker Application Container Engine.
Nov 12 12:48:29 orchtool.strayos.com systemd[1]: Starting Docker Application Container Engine...

I’m not familiar with low-level stuff within Docker. If anyone could suggest where else I could investigate, it would be much appreciated. I just want to see if I can prevent or figure out a way to monitor this in the future.

Many thanks in advance

Depending on what the containers were doing and for how long, and if we rule out CPU and memory issues, I could imagine the containers completely filling the disk, either on the container’s filesystem or in a temp folder that is cleaned after a reboot. Another possibility is an IO problem. I recently experienced one with Docker containers running in virtual machines, where the host machine of the VMs basically stopped doing anything. I’m not sure whether that was related to Docker. I wouldn’t think so, but it was almost certainly a disk IO issue, after which the machine was in a state it could not recover from.
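
If it helps, here is a rough sketch of a disk check that could be run periodically so that a full disk is noticed before any restart cleans it up. It only uses Python's standard shutil.disk_usage. The paths and the 90% threshold are my assumptions: /var/lib/docker is Docker's default data directory on Linux, but yours may be configured differently.

```python
import shutil

# Assumed paths: Docker's default data root on Linux, plus /tmp.
# Adjust these if the daemon uses a custom data directory.
WATCHED_PATHS = ["/var/lib/docker", "/tmp"]


def usage_percent(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def nearly_full(paths, threshold=90.0):
    """Return {path: percent} for every path at or above `threshold`.

    Paths that do not exist or cannot be read are skipped.
    """
    report = {}
    for path in paths:
        try:
            pct = usage_percent(path)
        except OSError:
            continue
        if pct >= threshold:
            report[path] = pct
    return report
```

Calling nearly_full(WATCHED_PATHS) from a scheduled job and logging the result would at least leave a trace of the disk state around the time of the next hang.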

Thanks for the insight!
I don’t see any storage limit reached either, but if something was cleaned up after the restart, I wouldn’t know. Narrowing down the problem seems difficult given that the server is hosted on a cloud VM. I guess for now I will just have a cron job create and delete a dummy container regularly to detect when something is wrong.
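
A minimal sketch of what I have in mind for that canary check, in Python. The key point is the timeout, because the hang I saw never produced an error on its own. The hello-world image, the 60-second limit, and the function names are my own assumptions, and the image should already be pulled so that a slow registry is not mistaken for a daemon hang.

```python
import subprocess
import sys

# Assumption: a tiny image that is already present locally and exits at once.
CANARY_CMD = ["docker", "run", "--rm", "hello-world"]


def probe(cmd, timeout_s):
    """Run `cmd` and classify the outcome.

    Returns "ok" on exit code 0, "failed" on a non-zero exit or a missing
    binary, and "hung" if the command does not finish within `timeout_s`.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # subprocess.run kills the child on timeout
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


def check_docker(timeout_s=60):
    """Start and remove a throwaway container, reporting the status on stderr."""
    status = probe(CANARY_CMD, timeout_s)
    print(f"docker canary: {status}", file=sys.stderr)
    return status
```

A cron wrapper would call check_docker() and exit non-zero (or send an alert) whenever the result is not "ok". A "hung" result is the one that matches what happened here, since the docker CLI just waits forever when the daemon stops answering.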