Docker swarm suddenly broke... I need help!

docker info

Containers: 13
Running: 3
Paused: 0
Stopped: 10
Images: 32
Server Version: 1.13.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 147
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: nuta2re8ttqhcr01ajc15yxuh
Is Manager: true
ClusterID: nig649rempakpdeb54uz3mc3i
Managers: 1
Nodes: 2
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 18 years
Node Address: 172.20.200.200
Manager Addresses:
172.20.200.200:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.13.0-32-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.701 GiB
Name: HTBDN1
ID: IZRI:SABD:5ZWX:FS35:RXRM:LGYX:GTE5:DG7O:K62F:DY74:J2D3:LW7R
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: hdservicedocker
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
172.20.200.200:18500
127.0.0.0/8
Live Restore Enabled: false

/var/log/syslog

Feb 6 15:53:55 HTBDN1 dockerd[12801]: time="2018-02-06T15:53:50.264763830+09:00" level=warning msg="memberlist: Was able to reach HTBDN2-2497123b02fb via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP"
Feb 6 15:54:30 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:29.510961992+09:00" level=info msg="memberlist: Suspect HTBDN2-2497123b02fb has failed, no acks received"
Feb 6 15:54:30 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:30.982661944+09:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Feb 6 15:54:33 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:33.155841610+09:00" level=info msg="libcontainerd: new containerd process, pid: 7707"
Feb 6 15:54:34 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:34.130519992+09:00" level=info msg="libcontainerd: new containerd process, pid: 7708"
Feb 6 15:54:34 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:34.739700223+09:00" level=warning msg="memberlist: Failed to send TCP ack: write tcp 172.20.200.200:7946->172.20.200.201:42102: i/o timeout from=172.20.200.201:42102"
Feb 6 15:54:35 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:35.183269149+09:00" level=warning msg="memberlist: Failed to send TCP ack: write tcp 172.20.200.200:7946->172.20.200.201:42106: i/o timeout from=172.20.200.201:42106"
Feb 6 15:54:35 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:35.183408243+09:00" level=info msg="memberlist: Suspect HTBDN2-2497123b02fb has failed, no acks received"
Feb 6 15:54:35 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:35.304819189+09:00" level=info msg="memberlist: Marking HTBDN2-2497123b02fb as failed, suspect timeout reached"
Feb 6 15:54:35 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:35.644181981+09:00" level=info msg="libcontainerd: new containerd process, pid: 7712"
Feb 6 15:54:35 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:35.972572963+09:00" level=warning msg="memberlist: Refuting a suspect message (from: HTBDN2-2497123b02fb)"
Feb 6 15:54:36 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:36.865855863+09:00" level=info msg="libcontainerd: new containerd process, pid: 7718"
Feb 6 15:54:38 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:38.418388462+09:00" level=info msg="libcontainerd: new containerd process, pid: 7727"
Feb 6 15:54:39 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:39.866231188+09:00" level=info msg="libcontainerd: new containerd process, pid: 7736"
Feb 6 15:54:40 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:40.777534015+09:00" level=warning msg="failed to retrieve containerd version: rpc error: code = 14 desc = grpc: the connection is unavailable"
Feb 6 15:54:41 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:41.366004862+09:00" level=info msg="libcontainerd: new containerd process, pid: 7752"
Feb 6 15:54:42 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:42.866572088+09:00" level=info msg="libcontainerd: new containerd process, pid: 7765"
Feb 6 15:54:43 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:43.114158404+09:00" level=warning msg="failed to retrieve containerd version: rpc error: code = 14 desc = grpc: the connection is unavailable"
Feb 6 15:54:44 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:44.365966022+09:00" level=info msg="libcontainerd: new containerd process, pid: 7781"
Feb 6 15:54:45 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:45.504952262+09:00" level=error msg="agent: session failed" error="rpc error: code = 4 desc = context deadline exceeded" module="node/agent"
Feb 6 15:54:45 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:45.505021818+09:00" level=error msg="agent: session failed" error="rpc error: code = 5 desc = node not registered" module="node/agent"
Feb 6 15:54:49 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:49.878464534+09:00" level=error msg="failed to deactivate service binding for container ht-homepage-webapp.1.wi20c1kqx315ipwxra92lmzt3" error="No such container: ht-homepage-webapp.1.wi20c1kqx315ipwxra92lmzt3" module="node/agent"
Feb 6 15:54:49 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:49.902267597+09:00" level=error msg="failed to deactivate service binding for container ht-monitoring-webapp.1.lchisks5fj4y0x16z6xhtwpra" error="No such container: ht-monitoring-webapp.1.lchisks5fj4y0x16z6xhtwpra" module="node/agent"
Feb 6 15:54:50 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:49.902308248+09:00" level=error msg="failed to deactivate service binding for container ht-core-oauth.1.wvfsnkxz4bepavra86p2su98v" error="No such container: ht-core-oauth.1.wvfsnkxz4bepavra86p2su98v" module="node/agent"
Feb 6 15:54:50 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:49.902391539+09:00" level=error msg="failed to deactivate service binding for container patch-server.1.vq28d9ijvgs3wfttrfirjbn6g" error="No such container: patch-server.1.vq28d9ijvgs3wfttrfirjbn6g" module="node/agent"
Feb 6 15:54:50 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:49.942055955+09:00" level=error msg="failed to deactivate service binding for container ht-application.1.4eq6zoh3ijhhx7ekdu9yl0atd" error="No such container: ht-application.1.4eq6zoh3ijhhx7ekdu9yl0atd" module="node/agent"
Feb 6 15:54:50 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:49.942177594+09:00" level=error msg="failed to deactivate service binding for container ht-eachupdate-webapp.1.osfemyum4m6822g6enz27t2u0" error="No such container: ht-eachupdate-webapp.1.osfemyum4m6822g6enz27t2u0" module="node/agent"
Feb 6 15:54:50 HTBDN1 dockerd[12801]: time="2018-02-06T15:54:50.002385777+09:00" level=error msg="failed to deactivate service binding for container ht-core-framework.1.82jz8kqllr6xp3w3sfdm1xzzn" error="No such container: ht-core-framework.1.82jz8kqllr6xp3w3sfdm1xzzn" module="node/agent"

Docker suddenly broke.

All services had their containers running on node 1. After the breakdown, all containers maintained by swarm moved to node 2.

Is there any possible solution?

Hi,
Have you seen the first warning line in your log?

Was able to reach HTBDN2-2497123b02fb via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP

Have you checked the UDP connectivity? Maybe SELinux, AppArmor, or iptables is blocking it?

nc -vz -u NODE_IP 7946
nc -vz -u NODE_IP 4789

nc -vz -u XXXXXXXXXXX 4789
Connection to XXXXXXXXXXX 4789 port [udp/*] succeeded!
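
If it helps, here is a small sketch that runs the same kind of nc probe against every port swarm mode uses between nodes (2377/tcp for cluster management, 7946/tcp and 7946/udp for gossip, 4789/udp for overlay traffic). The function names `swarm_port_checks` and `check_swarm_ports` are made up for this example, and the 3-second timeout is an arbitrary choice. Keep in mind that a UDP probe with `nc -z -u` is not conclusive: as far as I know it reports success unless an ICMP "port unreachable" comes back, so a firewall that silently drops packets can still show "succeeded".

```shell
#!/bin/sh
# Ports swarm mode needs open between nodes:
#   2377/tcp               cluster management
#   7946/tcp and 7946/udp  node-to-node gossip (memberlist)
#   4789/udp               overlay network traffic (VXLAN)
swarm_port_checks() {
  printf '%s\n' 'tcp 2377' 'tcp 7946' 'udp 7946' 'udp 4789'
}

# check_swarm_ports is a hypothetical helper: it runs one nc probe per port
# against the node IP given as its first argument.
check_swarm_ports() {
  node_ip="$1"
  swarm_port_checks | while read -r proto port; do
    if [ "$proto" = udp ]; then
      # UDP probe: "succeeded" only means no ICMP error was received.
      nc -vz -u -w 3 "$node_ip" "$port" < /dev/null
    else
      nc -vz -w 3 "$node_ip" "$port" < /dev/null
    fi
  done
}

# Usage (run on each node, pointing at the other one), for example:
#   check_swarm_ports 172.20.200.201
```

Run it from both nodes, since the warning in the log is about bidirectional UDP.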

Regards

I checked the connection from both nodes, and it succeeded in both directions.

This breakdown happens when there is some sort of resource overload.
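
One way to test the overload suspicion is to count the relevant events in the log around the time of the failure. The sketch below is only a starting point: `count_matches` and `summarize_incident` are hypothetical helper names, the log path defaults to the /var/log/syslog file quoted above, and the OOM pattern assumes the usual Linux kernel wording ("invoked oom-killer", "Out of memory"), which may differ on other kernels.

```shell
#!/bin/sh
# count_matches is a hypothetical helper: it prints how many lines of a log
# file match an extended regular expression (case-insensitive).
count_matches() {
  pattern="$1"
  logfile="$2"
  # grep -c prints 0 and exits non-zero when nothing matches; ignore that.
  grep -ciE "$pattern" "$logfile" || true
}

# summarize_incident prints one count per symptom seen in this thread.
summarize_incident() {
  logfile="${1:-/var/log/syslog}"
  echo "containerd restarts: $(count_matches 'libcontainerd: new containerd process' "$logfile")"
  echo "kernel OOM events: $(count_matches 'oom-killer|out of memory' "$logfile")"
  echo "memberlist failures: $(count_matches 'memberlist: (suspect|marking)' "$logfile")"
}

# Usage:
#   summarize_incident /var/log/syslog
```

If the OOM count is non-zero near the timestamps of the containerd restarts, memory pressure would be a plausible trigger; if it is zero, CPU or I/O load would be the next thing to look at.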