Physical host goes into panic mode after running Docker for 8 hours

qiangkewei · October 10, 2017, 4:01pm

We are using the Spotify docker-client API (Maven version 8.8.2) to run a Docker container. Four threads run the container on a single host, each one launching it every 5 minutes.
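For anyone trying to reproduce this, the workload described above can be approximated with a sketch like the one below. It is written in Python rather than with the Spotify Java client, and `start_workers`, `run_container`, and the `rounds` parameter are hypothetical names introduced for illustration only, not the original code.

```python
import threading
import time


def start_workers(run_container, workers=4, interval_s=300.0, rounds=None):
    """Start `workers` threads that each call run_container(worker_id)
    once every `interval_s` seconds.

    rounds=None loops forever; a finite value is useful for a dry run.
    Returns the list of started threads.
    """
    def loop(worker_id):
        done = 0
        while rounds is None or done < rounds:
            run_container(worker_id)  # assumed to block until the container exits
            done += 1
            if rounds is None or done < rounds:
                time.sleep(interval_s)

    threads = [
        threading.Thread(target=loop, args=(i,), daemon=True)
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    return threads
```

Here `run_container` would be whatever actually launches the container, for example a small function that shells out to `docker run --rm` with your own image. Passing a stub first makes it easy to check the scheduling without touching Docker.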

Here is the host environment:
CentOS release 6.5 (Final), Kernel 2.6.32-431.el6.x86_64, physical host, Docker version 1.7.1, build 786b29d/1.7.1

We found that after running for 8 hours, the physical host went into panic mode. This behavior is reproducible. However, we did not see this issue when using a VM with the same environment settings.

Could this be a known issue with the older Docker version?
We are planning to upgrade to CentOS 7 so that we can install the latest Docker version on it. But we still don’t know whether this issue will happen again with the latest Docker version.

bscott13 · October 10, 2017, 6:01pm

We have been seeing similar issues.

Our setup:

3 identical servers running: Red Hat Enterprise Linux Server release 7.3 (Maipo)
Kernel version: kernel-3.10.0-514.26.2.el7.x86_64
Docker version:

Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:20:36 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:21:56 2017
 OS/Arch:      linux/amd64
 Experimental: false

On a fairly consistent basis, one of the nodes locks up and can only be recovered by pressing the power button to reset it. We are using Docker Swarm to join the 3 servers, which run 9 different Docker containers with various software packages. We see the lockup when we ssh from the manager node into the other nodes and run “docker swarm leave --force” to force them off the swarm. It does not happen every time, but roughly one in five attempts results in a lockup on one of the nodes. The amount of time that Docker has been running on these machines before a lockup varies from 10 minutes to days.

Recently we saw the lockup occur after the systems had been running for 30 minutes, even though we had not run “docker swarm leave --force”. The containers were just chugging along and a node locked up.

We thought it might be related to the kernel version we were running, and thought that upgrading to the version listed above would fix it. We still ran into the lockup.

We are currently looking into the behavior of the node list. We have noticed that “docker node list” shows a node in “Down” status that is a duplicate of one of the nodes in “Ready” status. We are curious whether this down node is somehow causing problems with the Docker network.
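To make the duplicate entries easier to spot, something like the following could be run against the text printed by the node listing. This is only a rough sketch: it assumes the default table layout (ID, HOSTNAME, STATUS as the first three columns, with a `*` marking the current node), and `find_stale_duplicates` is a hypothetical helper name, not part of Docker.

```python
from collections import defaultdict


def find_stale_duplicates(node_ls_output):
    """Return {hostname: [IDs of Down entries]} for every hostname that
    has both a Ready entry and at least one Down entry."""
    by_host = defaultdict(list)
    # Skip the header row, then read ID, HOSTNAME, STATUS from each line.
    for line in node_ls_output.strip().splitlines()[1:]:
        # "*" after the ID marks the node the command was run on; drop it.
        fields = [f for f in line.split() if f != "*"]
        if len(fields) < 3:
            continue
        node_id, hostname, status = fields[0], fields[1], fields[2]
        by_host[hostname].append((node_id, status))

    stale = {}
    for hostname, entries in by_host.items():
        statuses = {status for _, status in entries}
        if "Ready" in statuses and "Down" in statuses:
            stale[hostname] = [nid for nid, status in entries if status == "Down"]
    return stale
```

The function only reports which hostnames appear twice with mixed status. It does not tell us whether the stale entry is related to the lockups.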

Any help in this realm would be greatly appreciated.

Ben