Cannot determine reason for Docker Engine segfault

Hello everyone,

I am running a small application on an inexpensive, general-purpose Linux virtual machine on Azure (a B2MS, to be specific). The application is really tiny, so I am not using Swarm, just Docker CE and Docker Compose on a server with the following OS and software installed:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.3 LTS
Release:        16.04
Codename:       xenial

$ sudo docker --version
Docker version 17.09.0-ce, build afdb6d4

$ sudo docker-compose --version
docker-compose version 1.16.1, build 6d1ac21

Even during peaks, the server seems to have plenty of memory at its disposal:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7964         628        4184          81        3150        6836
Swap:             0           0           0

$ docker stats
CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
9d579b74ab96        0.00%               2.168MiB / 512MiB   0.42%               1.23kB / 0B         0B / 0B             5
cff6e06b79de        0.05%               105.5MiB / 512MiB   20.60%              1.14kB / 0B         7.57MB / 21.1MB     30
114e0e0d92b6        0.08%               7.02MiB / 512MiB    1.37%               238kB / 24.9kB      0B / 0B             4
9f94ad22b1c4        0.00%               14.62MiB / 512MiB   2.86%               25.9kB / 237kB      0B / 0B             1
88e31890689d        0.00%               10.2MiB / 512MiB    1.99%               1.01kB / 0B         0B / 0B             3
eaace4db3e05        0.00%               2.273MiB / 512MiB   0.44%               1.01kB / 0B         0B / 8.19kB         3

Yet for some reason the Docker Engine keeps restarting at random intervals, apparently due to segmentation faults, and I can’t figure out why:

$ less /var/log/syslog
Nov  6 08:04:00 usr systemd[1]: docker.service: Main process exited, code=killed, status=11/SEGV
Nov  6 08:04:00 usr systemd[1]: docker.service: Unit entered failed state.
Nov  6 08:04:00 usr systemd[1]: docker.service: Failed with result 'signal'.
Nov  6 08:04:00 usr systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Nov  6 08:04:00 usr systemd[1]: Stopped Docker Application Container Engine.
Nov  6 08:04:00 usr systemd[1]: Closed Docker Socket for the API.
Nov  6 08:04:00 usr systemd[1]: Stopping Docker Socket for the API.
Nov  6 08:04:00 usr systemd[1]: Starting Docker Socket for the API.
Nov  6 08:04:00 usr systemd[1]: Listening on Docker Socket for the API.
Nov  6 08:04:00 usr systemd[1]: Starting Docker Application Container Engine...

As a result, all of the containers are stopped: a couple of them terminate with exit code 137, another one with exit code 2, and the rest with 0. The container logs contain nothing relevant, and I’m not sure where to go from here.
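For context, I’ve only been checking the exit codes and logs with the usual commands, nothing fancy (<container> below is just a placeholder):

$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'
$ docker inspect --format '{{.State.ExitCode}}' <container>
$ docker logs <container>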

So my main question is: are there any further logs I can look at, or anything else I can do in general, to get to the bottom of this issue?
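For what it’s worth, these are the only other places I’ve thought of checking so far, though I’m not sure they will show anything beyond what syslog already does:

$ journalctl -u docker.service --no-pager | tail -n 100
$ dmesg | grep -iE 'segfault|oom|docker'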

P.S. The app ran for months on a less powerful server (at another vendor) without any issues.

P.P.S. I also experimented with setting the container restart policy to “always”, but things got even worse. Although that did keep all the containers running, any call to the docker command resulted in Cannot connect to the Docker daemon. Is 'docker -d' running on this host?, and service docker status indicated that the engine was stuck in “activating” mode (the compose snippet I used is shown after the status output below):

$ service docker status
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: activating (start) since Sat 2017-11-04 17:36:57 UTC; 1 day 13h ago
     Docs: https://docs.docker.com
 Main PID: 124334 (dockerd)
    Tasks: 77
   Memory: 33.9M
      CPU: 3min 44.293s
   CGroup: /system.slice/docker.service
           ├─ 79162 docker-containerd-shim bd05db6078631136e4a2dad79dc9f95bf73cd443ad65990a18b82eab648f4b44 /var/run/docker/libcontainerd/bd05db6078631136e4a2dad79dc9f95bf73cd443ad65990a18b82eab648f4b44 dock
           ├─ 79228 docker-containerd-shim 718585c9c40b0cde14b33bc845f305ee8756a4ea331e00553c2f7a5e4cba8967 /var/run/docker/libcontainerd/718585c9c40b0cde14b33bc845f305ee8756a4ea331e00553c2f7a5e4cba8967 dock
           ├─ 79229 docker-containerd-shim 7e3141dcd3db8bc665ddcc924ffd19bbae054fa7d66d8e78a0f5a9d881b32e47 /var/run/docker/libcontainerd/7e3141dcd3db8bc665ddcc924ffd19bbae054fa7d66d8e78a0f5a9d881b32e47 dock
           ├─ 79422 docker-containerd-shim aba6bc499ae1bb4ecbaac4d466af3b2e2b7847f07a5ddab3f7f59c6bbd835fe7 /var/run/docker/libcontainerd/aba6bc499ae1bb4ecbaac4d466af3b2e2b7847f07a5ddab3f7f59c6bbd835fe7 dock
           ├─ 79517 docker-containerd-shim 0478d8e90de0839e94d11f5d692bc465aa1c70db7b4c96a929532c40f8992482 /var/run/docker/libcontainerd/0478d8e90de0839e94d11f5d692bc465aa1c70db7b4c96a929532c40f8992482 dock
           ├─ 79533 docker-containerd-shim 7af68612e0818d8b9253c3200d73ff1c858aedac7f28ddd7819a6d7757aba5fe /var/run/docker/libcontainerd/7af68612e0818d8b9253c3200d73ff1c858aedac7f28ddd7819a6d7757aba5fe dock
           ├─124334 /usr/bin/dockerd -H fd://
           └─124340 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker
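For completeness, the restart policy experiment was just the standard Compose setting, roughly like this (the service and image names here are placeholders):

version: '3'
services:
  app:
    image: myapp:latest
    restart: always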

Hi @mbezhanov,
I think I ran into the same problem. What is your current kernel version?
There is a known issue with the linux-azure 4.11.0-1011 version:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1719045
I think it’s fixed in 4.11.0-1013.13, but I haven’t tried it yet.
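If you do decide to try the newer kernel, I believe upgrading the linux-azure metapackage and rebooting should pull it in (assuming your image uses that metapackage; I haven’t verified this myself yet):

$ sudo apt-get update
$ sudo apt-get install linux-azure
$ sudo reboot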

The kernel version is the older one that you mentioned:

$ uname -r
4.11.0-1011-azure

Thanks for letting me know about this issue!