Docker swarm cluster with 60+ nodes automatically restarts all containers

I have a Docker swarm cluster with 60+ nodes (1 manager) running in a production environment. We have hit a very difficult problem: from time to time, all the containers are restarted automatically. Could someone give some guidance? I appreciate your help.

Output of journalctl -u docker.service on the manager node:

Dec 09 11:32:13 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:13.881052082+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39336\": EOF" module=grpc
Dec 09 11:32:15 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:15.581259315+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:39798"
Dec 09 11:32:18 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:18.881458249+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39394\": EOF" module=grpc
Dec 09 11:32:20 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:20.582011748+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:47059"
Dec 09 11:32:23 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:23.580730960+08:00" level=warning msg="Health check for container 6a42a166a86914b92e724fce2fc1fd2e8a7174e1965047e711e054d3fcb2c8a9 error: context deadline exceeded"
Dec 09 11:32:23 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:23.878335604+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39438\": EOF" module=grpc
Dec 09 11:32:25 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:25.587366233+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:51345"
Dec 09 11:32:26 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:26.618406623+08:00" level=info msg="NetworkDB stats host-172-17-28-136(36fd4f7dcb85) - netID:eg6ios0n42rolpm8xffdvxj39 leaving:false netPeers:43 entries:84 Queue qLen:0 netMsg/s:1"
Dec 09 11:32:28 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:28.884212941+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39462\": EOF" module=grpc
Dec 09 11:32:30 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:30.588788615+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:53479"
Dec 09 11:32:33 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:33.880829064+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39482\": EOF" module=grpc
Dec 09 11:32:35 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:35.594238965+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:56144"
Dec 09 11:32:38 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:38.880164807+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39500\": EOF" module=grpc
Dec 09 11:32:40 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:40.594545896+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:33954"
Dec 09 11:32:43 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:43.878232310+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39518\": EOF" module=grpc
Dec 09 11:32:45 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:45.594935631+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:38073"
Dec 09 11:32:48 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:48.878437023+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39534\": EOF" module=grpc
Dec 09 11:32:50 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:50.600089293+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:42235"
Dec 09 11:32:53 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:53.878770607+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39550\": EOF" module=grpc
Dec 09 11:32:55 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:55.542443102+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:42143"
Dec 09 11:32:58 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:32:58.880799252+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39562\": EOF" module=grpc
Dec 09 11:33:00 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:00.546895778+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:36005"
Dec 09 11:33:03 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:03.878459214+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39576\": EOF" module=grpc
Dec 09 11:33:05 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:05.551876945+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:33857"
Dec 09 11:33:08 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:08.878753916+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39590\": EOF" module=grpc
Dec 09 11:33:10 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:10.557064636+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:35743"
Dec 09 11:33:13 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:13.878226424+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39604\": EOF" module=grpc
Dec 09 11:33:15 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:15.561920134+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:49116"
Dec 09 11:33:18 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:18.881451168+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39620\": EOF" module=grpc
Dec 09 11:33:20 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:20.566874367+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:37224"
Dec 09 11:33:23 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:23.626283600+08:00" level=warning msg="Health check for container 6a42a166a86914b92e724fce2fc1fd2e8a7174e1965047e711e054d3fcb2c8a9 error: context deadline exceeded"
Dec 09 11:33:23 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:23.885073517+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39632\": EOF" module=grpc
Dec 09 11:33:25 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:25.569279528+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:58914"
Dec 09 11:33:28 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:28.925749108+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39646\": EOF" module=grpc
Dec 09 11:33:30 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:30.570819619+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:48191"
Dec 09 11:33:33 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:33.924628462+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39664\": EOF" module=grpc
Dec 09 11:33:35 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:35.573992066+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:50646"
Dec 09 11:33:38 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:38.880401328+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39678\": EOF" module=grpc
Dec 09 11:33:40 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:40.579199371+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:55638"
Dec 09 11:33:43 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:43.924627631+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39692\": EOF" module=grpc
Dec 09 11:33:45 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:45.584391307+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:53806"
Dec 09 11:33:48 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:48.922577843+08:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"172.17.28.168:39710\": EOF" module=grpc
Dec 09 11:33:50 host-172-17-28-136 dockerd[806]: time="2019-12-09T11:33:50.586868904+08:00" level=error msg="[resolver] more than 100 concurrent queries from 172.19.0.2:48380"

Output of docker info on the manager node:

[root@host-172-17-28-136 ~]# docker info
    Containers: 13
     Running: 13
     Paused: 0
     Stopped: 0
    Images: 13
    Server Version: 18.09.8
    Storage Driver: overlay2
     Backing Filesystem: xfs
     Supports d_type: true
     Native Overlay Diff: true
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
    Plugins:
     Volume: local
     Network: bridge host macvlan null overlay
     Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
    Swarm: active
     NodeID: q2tagzsa3p0vghuohyru14nwe
     Is Manager: true
     ClusterID: vxkafmvt1n25qsyupx4e46fa6
     Managers: 1
     Nodes: 44
     Default Address Pool: 10.0.0.0/8  
     SubnetSize: 24
     Orchestration:
      Task History Retention Limit: 5
     Raft:
      Snapshot Interval: 10000
      Number of Old Snapshots to Retain: 0
      Heartbeat Tick: 1
      Election Tick: 10
     Dispatcher:
      Heartbeat Period: 5 seconds
     CA Configuration:
      Expiry Duration: 3 months
      Force Rotate: 9
     Autolock Managers: false
     Root Rotation In Progress: false
     Node Address: 172.17.28.136
     Manager Addresses:
      172.17.28.136:2377
    Runtimes: runc
    Default Runtime: runc
    Init Binary: docker-init
    containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
    runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
    init version: fec3683
    Security Options:
     seccomp
      Profile: default
    Kernel Version: 3.10.0-862.el7.x86_64
    Operating System: CentOS Linux 7 (Core)
    OSType: linux
    Architecture: x86_64
    CPUs: 8
    Total Memory: 15.51GiB
    Name: host-172-17-28-136
    ID: FZKU:BOVV:HRTB:DNLU:N5C5:PSOI:7KHC:GTOA:RD7G:ZYTN:ANSB:N4B3
    Docker Root Dir: /var/lib/docker
    Debug Mode (client): false
    Debug Mode (server): false
    Registry: https://index.docker.io/v1/
    Labels:
    Experimental: false
    Insecure Registries:
     172.17.28.136
     127.0.0.0/8
    Live Restore Enabled: false
    Product License: Community Engine

    WARNING: API is accessible on http://0.0.0.0:2375 without encryption.
             Access to the remote API is equivalent to root access on the host. Refer
             to the 'Docker daemon attack surface' section in the documentation for
             more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface

I resolved this problem by taking the two actions below:

  1. disable NetworkManager
  2. increase docker swarm heartbeat interval from default 5s to 60s
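
For readers who want to try the same two actions, here is a minimal sketch. It is an assumption about how they could be carried out on a systemd host such as the CentOS 7 manager shown above, not a record of the exact commands that were used. The run helper and the APPLY variable are hypothetical names introduced only for this example: by default the script prints the commands, and it executes them only when APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run by default: each command is printed instead of executed.
# Set APPLY=1 to really run them (needs root; step 2 must run on a manager node).
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# 1. Stop NetworkManager now and keep it from starting at boot.
#    Assumes a systemd host where another service (for example the legacy
#    "network" service on CentOS 7) manages the interfaces afterwards.
run systemctl stop NetworkManager
run systemctl disable NetworkManager

# 2. Raise the swarm dispatcher heartbeat period from the default 5s to 60s.
run docker swarm update --dispatcher-heartbeat 60s
```

After applying step 2, the new value should show up under "Dispatcher: Heartbeat Period" in docker info on the manager. Check that the host's network configuration does not depend on NetworkManager before disabling it, otherwise the node may lose connectivity after a reboot.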

Hi, thanks for the information. Could you please explain how you disabled “NetworkManager”? Why did you need to do that?