Docker swarm periodically restarts all services

Hi all,

I am trying to get some experience with Docker and Docker Swarm on a single machine.

So far experimenting works well, but I am seeing strange behaviour in swarm mode: periodically, though with no fixed regularity, all services restart for no apparent reason. For example, the Gitea container originally created at 02:09 stopped, was restarted at 22:06, stopped again just two minutes later, and has been running since 22:08.

Every single docker-compose file includes the following snippet:

version: "3"
services:
  main:
    image: gitea/gitea:latest
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

So, I first thought that the program crashed inside the container (conveniently ignoring that all programs crashed at the same time :-)) and the above policy forced a restart. However, the container logs look like this:

10.255.0.2 - - [22/Feb/2019:09:48:54 +0000] "GET / HTTP/1.1" 500 1672
10.255.0.2 - - [22/Feb/2019:10:14:13 +0000] "GET / HTTP/1.0" 500 1655
10.255.0.2 - - [22/Feb/2019:15:38:35 +0000] "\x03" 400 226
10.255.0.2 - - [22/Feb/2019:15:38:35 +0000] "\x03" 400 226
10.255.0.2 - - [22/Feb/2019:17:03:31 +0000] "GET / HTTP/1.1" 200 5099
10.255.0.2 - - [22/Feb/2019:17:11:38 +0000] "GET / HTTP/1.1" 500 1666
Caugth signal SIGTERM, passing it to child processes...
Caugth signal SIGTERM, passing it to child processes...
[Fri Feb 22 21:07:04.893407 2019] [mpm_prefork:notice] [pid 122] AH00169: caught SIGTERM, shutting down

So, it looks like the Docker daemon sent the SIGTERM.

The question is: why?

The system details (docker info output) are:

Containers: 21
Running: 7
Paused: 0
Stopped: 14
Images: 7
Server Version: 18.09.2
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: fb0bk1eyp70chxwtnmbczdzew
Is Manager: true
ClusterID: yvke64ipr0le72r6lfja9m7lo
Managers: 1
Nodes: 1
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 178.33.26.29
Manager Addresses:
178.33.26.29:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 09c8266bf2fcf9519a651b04ae54c967b9ab86ec
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.0-8-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.8GiB
Name: menkisyscloudsrv29
ID: KKDM:55YP:BSAM:FNOZ:AA55:O2IJ:LTIK:MCEI:X47U:DGJ3:AK5U:EZT3
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Do you have any idea what’s happening? Where can I get more detailed logs about what dockerd is doing?

Thanks!

Did you check the service logs and your operating system’s syslog for hints?
Also, docker service ps {servicename} might provide some insight into the reason for the error.
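
For example (using gitea_main as the service name, which is just a guess based on the rest of this thread), something like this shows the full task history, the service output, and the daemon log:

docker service ps gitea_main --no-trunc       # task history with untruncated error messages
docker service logs gitea_main                # stdout/stderr of the service's tasks
journalctl -u docker --since "1 hour ago"     # what the daemon itself logged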

If your containers consume too much RAM, the operating system will start killing processes (the OOM killer) to free up memory. Docker also doesn’t like it when the storage has no space left.
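
In swarm mode you can also cap a service’s memory so a single runaway container cannot exhaust the host, and it is worth confirming that the Docker data directory still has free space. A rough sketch (the service name and the limit are only examples):

docker service update --limit-memory 512M gitea_main   # hard memory cap for this service
df -h /var/lib/docker                                   # free space where Docker stores its data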

If you run Java containers, you might want to take a closer look at their resource usage.
Java 10 is the first version that is fully aware of being run in a container. Java 9 has options to at least respect cgroup limits, and those options were backported in Java 8 update 131. Everything before that is not remotely aware of cgroups or of being run in a container at all.
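
For those older versions the cgroup awareness has to be switched on explicitly. A hedged example (the jar name is just a placeholder):

# Java 8u131+ / Java 9: make the JVM respect the container's cgroup memory limit
java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForJVM -jar app.jar
# Java 10 and later: container support is enabled by default (-XX:+UseContainerSupport)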

Hi Metin,

thanks for your quick reply. Sticking with my Gitea example, I ran your command and got the following output:

ID                  NAME                IMAGE                NODE                 DESIRED STATE       CURRENT STATE           ERROR               PORTS
pdeipoqd6bvs        gitea_main.1        gitea/gitea:latest   menkisyscloudsrv29   Running             Running 22 hours ago
wzlciwqaqiqo         \_ gitea_main.1    gitea/gitea:latest   menkisyscloudsrv29   Shutdown            Shutdown 22 hours ago
tcdnlv21mb22         \_ gitea_main.1    gitea/gitea:latest   menkisyscloudsrv29   Shutdown            Shutdown 22 hours ago
pieodcryhv3c         \_ gitea_main.1    gitea/gitea:latest   menkisyscloudsrv29   Shutdown            Shutdown 42 hours ago
wle26krbmw49         \_ gitea_main.1    gitea/gitea:latest   menkisyscloudsrv29   Shutdown            Shutdown 42 hours ago

This is true for all services; they all show the same behavior.

I checked my virtual server for RAM and disk shortages. Everything looks fine:

Filesystem      Size  Used Avail Use% Mounted on
udev            3.9G     0  3.9G   0% /dev
tmpfs           799M   80M  720M  10% /run
/dev/sda1        49G   26G   21G  56% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
overlay          49G   26G   21G  56% /var/lib/docker/overlay2/c2c6032c02932a1df93920d0fe1aef4c86215e7b87333bec528ab726c996b2f0/merged
overlay          49G   26G   21G  56% /var/lib/docker/overlay2/c21c4312d93de8aeb5553e8f5740da805f8ecf45a8ed7a9f78773a0cc354c3d5/merged
overlay          49G   26G   21G  56% /var/lib/docker/overlay2/77e86f6dffe1f3c5907901ed164c4acaab06f546b760252b8a030e2df7c17015/merged
overlay          49G   26G   21G  56% /var/lib/docker/overlay2/db686f28b6d0e43ae6f92372864ed7d5bf4df3eac97c3136c950ca47b24e4767/merged
overlay          49G   26G   21G  56% /var/lib/docker/overlay2/43d9ca082d15b9e7c4dd652fe4a1b43b077e51088868b82c2684fe1805a81659/merged
overlay          49G   26G   21G  56% /var/lib/docker/overlay2/1b9d85053b4fe71cdaa7dd98a7ed232a8c597227bc007403abc19460f0bbf4d6/merged
shm              64M     0   64M   0% /var/lib/docker/containers/ee19075f23661f1b19bd1dc077df78a2eca6136c3f6b25576ed47d93a0964af9/mounts/shm
overlay          49G   26G   21G  56% /var/lib/docker/overlay2/62e5b358a9ea8a21cc3f7a4e7d234120b8f485ca4c6d15ae0608baac85680d14/merged
shm              64M     0   64M   0% /var/lib/docker/containers/4828835509abdfcc0559588df0fdbc697a237ba63865aa4695b2861a4ed7a914/mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/d1868570bb5952a9d9e9bcaf0ea33658ab9f631f57b48961c4f97e4cdca8516b/mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/a924bfefba1f71aa1c89bc927a77967796aea546d179eead81afd920ebde159d/mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/cd11b392f7f1ed6cdc24ab2a3bfc9bd552762226c2f5624716d46f591ec2edbe/mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/a3dc6f0484fa07854bd321b7caa020329bb12deecb87fe84666a55fc99790843/mounts/shm
shm              64M     0   64M   0% /var/lib/docker/containers/1cfcce261b479f67d26d39a31214f77a838d80d600d18154d4e8e93bcf595669/mounts/shm
tmpfs           799M     0  799M   0% /run/user/0

It would be strange if, to free up RAM, all Docker processes were killed at the same time while non-Docker services were left untouched.

None of my Docker images use Java. Furthermore, I only use official vendor builds from Docker Hub.

Where do I find the Docker daemon logs?

journalctl -u docker

and

tail -f /var/log/daemon.log

do not show anything related to Docker (I can post the content if you like).
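
One thing I might still try is turning on debug logging for dockerd itself; a minimal sketch, assuming systemd and that "debug": true can be added to /etc/docker/daemon.json:

cat /etc/docker/daemon.json     # check whether the file already exists
# add "debug": true to the JSON, e.g.  { "debug": true }
systemctl restart docker
journalctl -u docker -f         # follow the now much more verbose daemon log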

Thanks again.

The block device has sufficient space.

In htop, the green bar for memory is the accumulated memory usage of all processes. Sometimes there are also blue and yellow bars, which indicate buffers and caches. Your RAM is close to being maxed out.
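
free -h shows the same breakdown on the command line; the "available" column is the number that matters, because buffers and cache can be reclaimed:

free -h     # "available" already accounts for reclaimable buffers/cache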

Well, on RHEL/CentOS the details are present in dmesg.
Though, I have no idea if this works for Ubuntu as well.
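
The kernel ring buffer works the same way on Debian/Ubuntu, so something along these lines should reveal OOM kills there as well:

dmesg -T | grep -iE "out of memory|oom-killer|killed process"
journalctl -k | grep -i oom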

@theriddler82 Did you ever discover a root cause? I’m having a very similar problem and haven’t been able to figure out the reason yet.

Sorry for the late response; I only just got the notification that somebody had answered.

The reason was indeed the RAM. I have migrated to a new VPS with much better memory management and the issue is gone. Since the migration (two weeks ago), I have had no issues at all. Before it was an ESX host; now it’s Hyper-V with dynamic memory allocation, which did the trick.

I had the same problem, but my RAM usage was not even 1/3 of the total.
I fixed it by changing the MACAddressPolicy option, as described here:

I hope it can be useful to someone else…
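
In case that reference is not available, here is a rough sketch of what this kind of change typically looks like on a systemd-based system; the paths and the exact policy value are assumptions and may differ per distribution:

# copy the shipped default link policy so it can be overridden locally
# (on some distributions it lives under /lib/systemd/network instead)
cp /usr/lib/systemd/network/99-default.link /etc/systemd/network/99-default.link
# then edit the copy and, in the [Link] section, change
# MACAddressPolicy=persistent to MACAddressPolicy=none
# finally reboot (or reload udev) so the new policy takes effect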