I’m running a dozen or so applications in containers with Docker in swarm mode. Over the weekend, all of the containers exited and restarted at exactly the same time (22:57 local). I cannot find any errors in the logs of the exited containers or in the logs of any of their services, and there is nothing in the docker events output for that time frame. None of the relevant system logs show any sign of a system issue at that time. The Docker daemon has been running uninterrupted since sometime last week. I’m at a loss as to what caused this and am therefore unable to prevent it from happening again. I’m posting here in the hope that someone can point me in the right direction for debugging this problem.
OS: RHEL 7.9
If all of your containers restarted, the container logs probably won’t help you much. Check the logs of Docker and containerd.
journalctl -xe -u docker or
journalctl -xe -u containerd can help. Usually I do this:
- try to find the docker and containerd logs
- try to find other logs of events that happened before the issue. I use journald for this too, or syslog.
- check the dmesg logs, which can also give you some information
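The steps above can be sketched as a small shell helper. This is only a sketch under a few assumptions: the function names `incident_window` and `show_incident_logs` are made up for illustration, it needs GNU `date`, and it assumes a systemd host where Docker and containerd run as the `docker` and `containerd` units.

```shell
#!/bin/sh
# Hypothetical helper: print two lines, the start and end of a window
# N minutes either side of an incident time. The input time is
# interpreted in the current time zone (GNU date assumed).
incident_window() {
    ts=$1
    minutes=${2:-10}
    epoch=$(date -d "$ts" +%s) || return 1
    date -d "@$((epoch - minutes * 60))" '+%Y-%m-%d %H:%M:%S'
    date -d "@$((epoch + minutes * 60))" '+%Y-%m-%d %H:%M:%S'
}

# Hypothetical wrapper: show docker, containerd and kernel messages
# from the journal for that window.
show_incident_logs() {
    window=$(incident_window "$1" "${2:-10}") || return 1
    start=$(printf '%s\n' "$window" | sed -n 1p)
    end=$(printf '%s\n' "$window" | sed -n 2p)
    for unit in docker containerd; do
        journalctl -u "$unit" --since "$start" --until "$end" --no-pager
    done
    # -k limits the output to kernel messages (what dmesg would show)
    journalctl -k --since "$start" --until "$end" --no-pager
}
```

For example, `show_incident_logs '2024-01-06 22:57:00' 5` would show five minutes either side of 22:57. The date here is a placeholder, use the actual date of the incident.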
You mentioned you don’t see issues with anything else, but “uptime” can tell you if the machine rebooted. Maybe there was a memory overload or something. I don’t use Docker Swarm, though, so maybe someone else can give you better advice or an explanation.
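A quick way to check both ideas (reboot and memory pressure) might look like the following. It is a rough sketch: `oom_lines` is a made-up name, and the patterns are only the kernel messages that commonly show up when memory runs out, so an empty result does not prove memory was fine.

```shell
#!/bin/sh
# Hypothetical filter: keep only kernel log lines that commonly
# accompany memory exhaustion. The pattern list is an assumption
# and is not exhaustive.
oom_lines() {
    grep -i -E 'out of memory|oom-killer|page allocation failure'
}

# Usage on the host:
#   uptime -s                              # time of the last boot
#   journalctl -k --no-pager | oom_lines   # kernel messages, current boot
#   dmesg -T | oom_lines                   # same, with readable timestamps
```

If `uptime -s` shows a boot time earlier than the incident, the host itself did not reboot.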
I’m following up here in case someone else runs across this thread. Ultimately the problem was caused by a failure of IPVS, which in turn was the result of the host running out of physical memory. The log advice pointed in the right direction, but the search was complicated by some timestamps being in UTC and others in local time. Neither dockerd nor the host restarted, but the IPVS failure caused all of the container healthchecks to fail, which resulted in the restarts. Fortunately, the container restarts also freed up enough memory that Docker and the rest of the system could resume functioning properly. Now to address the resource allocations…
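For anyone hitting the same UTC versus local confusion, a small helper like this can make it easier to line the logs up. It is a sketch that assumes GNU `date`. The name `to_utc` and the example date and time zone are placeholders, not values from this incident.

```shell
#!/bin/sh
# Hypothetical helper: convert a wall-clock time to UTC.
#   $1 = timestamp, e.g. '2024-01-06 22:57:00'
#   $2 = optional time zone to interpret it in (default: the system's)
to_utc() {
    if [ -n "${2:-}" ]; then
        epoch=$(TZ=$2 date -d "$1" +%s) || return 1
    else
        epoch=$(date -d "$1" +%s) || return 1
    fi
    # Format the epoch seconds as UTC
    date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC'
}
```

For example, `to_utc '2024-01-06 22:57:00' EST5` gives the UTC time to look for in logs that are written in UTC. journalctl also has a `--utc` flag that prints its own timestamps in UTC, which can save the conversion on that side.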
Actually, I have just had almost the same problem, except the components were different since it was in Kubernetes. I had totally forgotten about this topic until I came back today.
Thanks for sharing your observation. I can now confirm that this can happen.
I actually knew it before…
So it’s a shame I didn’t remember it.