Hi All,
We are having a strange issue where containers are getting killed sporadically, even though resource utilisation is well below 100%. We do not have any health checks or resource constraints configured, and the server has 1.5 TB of RAM.
Below are our server details:
OS: Ubuntu
Version: 22.04
Kernel version: 5.15.0-79-generic
Docker version: 24.0.7, build afdd53b
Containerd version: containerd containerd.io 1.6.24 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
For a container “8cc939f1f1bbf2” that got killed, we see the following in /var/log/syslog:
Oct 31 19:56:47 nl-live containerd[3084]: time="2023-10-31T19:56:47.940688981+01:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf pid=2814699 runtime=io.containerd.runc.v2
Oct 31 19:56:47 nl-live systemd[1]: Started libcontainer container 8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf.
Nov 1 09:07:56 nl-live dockerd[3522]: time="2023-11-01T09:07:56.486950326+01:00" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf
Nov 1 09:08:06 nl-live dockerd[3522]: time="2023-11-01T09:08:06.498492876+01:00" level=error msg="Container failed to exit within 10s of kill - trying direct SIGKILL" container=8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf error="context deadline exceeded"
Nov 1 09:08:10 nl-live dockerd[3522]: time="2023-11-01T09:08:10.499969384+01:00" level=error msg="error killing container: context deadline exceeded" container=8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf error="tried to kill container, but did not receive an exit event"
Nov 1 09:08:10 nl-live dockerd[3522]: time="2023-11-01T09:08:10.500078615+01:00" level=error msg="Handler for POST /v1.43/containers/8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf/stop returned error: cannot stop container: 8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf: tried to kill container, but did not receive an exit event"
Nov 1 09:08:19 nl-live systemd[1]: docker-8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf.scope: Deactivated successfully.
Nov 1 09:08:19 nl-live systemd[1]: docker-8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf.scope: Consumed 5h 48min 14.021s CPU time.
Nov 1 09:08:19 nl-live containerd[3084]: time="2023-11-01T09:08:19.520235133+01:00" level=info msg="shim disconnected" id=8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf
Nov 1 09:08:19 nl-live containerd[3084]: time="2023-11-01T09:08:19.520272323+01:00" level=warning msg="cleaning up after shim disconnected" id=8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4c namespace=moby
Nov 1 09:08:19 nl-live dockerd[3522]: time="2023-11-01T09:08:19.520295703+01:00" level=info msg="ignoring event" container=8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 1 09:08:19 nl-live dockerd[3522]: time="2023-11-01T09:08:19.528556489+01:00" level=warning msg="failed to close stdin: task 8cc939f1f1bbf293738331d2c8661fa61882fd71c357de451299d07b3684a4cf not found: not found"
I am checking OS logs, system logs, etc., but I am not finding a root cause for why the containers are getting killed sporadically. It looks like dockerd sent the container SIGTERM, the container did not respond to it within 10 seconds, and dockerd then sent SIGKILL.
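In case it helps anyone reproduce the filtering, here is a minimal Python sketch that pulls the dockerd/containerd entries for one container out of syslog and flags the lines that look like a stop request arriving over the Docker API. The log above contains a "Handler for POST /v1.43/containers/.../stop" entry, which may mean that some API client (CLI, script, orchestrator) asked for the stop rather than dockerd deciding on its own, but that is only a guess. The function names (`parse_fields`, `container_timeline`, `api_stop_requests`) are made up for illustration, and the patterns are assumptions based only on the log format shown above.

```python
import re

# key=value or key="quoted value" pairs, as seen in the dockerd/containerd lines above.
FIELD_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
# The "name[pid]:" tag that syslog puts in front of each message.
DAEMON_RE = re.compile(r'\s([\w.-]+)\[\d+\]:')


def parse_fields(line):
    """Return the key=value fields of one syslog line as a dict."""
    fields = {}
    for key, value in FIELD_RE.findall(line):
        # Strip the surrounding double quotes from quoted values.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return fields


def container_timeline(lines, container_id):
    """Collect (daemon, time, level, msg) for lines mentioning container_id.

    container_id may be the short prefix (e.g. "8cc939f1f1bbf2").
    Lines without a msg= field (such as the systemd ones) are skipped.
    """
    timeline = []
    for line in lines:
        if container_id not in line:
            continue
        fields = parse_fields(line)
        if "msg" not in fields:
            continue
        match = DAEMON_RE.search(line)
        daemon = match.group(1) if match else "unknown"
        timeline.append(
            (daemon, fields.get("time", ""), fields.get("level", ""), fields["msg"])
        )
    return timeline


def api_stop_requests(timeline):
    """Return the entries that look like an API stop call.

    This only matches the 'Handler for POST .../stop' wording seen in the
    log above; other wordings would need their own pattern.
    """
    return [e for e in timeline if "Handler for POST" in e[3] and "/stop" in e[3]]
```

To use it, read the lines of /var/log/syslog (root access is usually needed), pass them to `container_timeline` together with the container ID, and then pass the result to `api_stop_requests`. Each returned tuple carries the timestamp of the request, which can then be compared against cron jobs, deployment tooling, or anything else that talks to the Docker socket.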
Any help identifying what could be going wrong here, or how to find the root cause, would be greatly appreciated.