Docker Community Forums

Nodes unable to rejoin swarm on restart


(Alan Pitts) #1

I have created a swarm with UCP-Beta, using VirtualBox as a working lab. It went pretty smoothly. However, I noticed that after a restart, some nodes (seemingly at random) do not rejoin the swarm. All the boxes in the lab are running:

Linux ucp-master 4.2.0-23-generic #28~14.04.1-Ubuntu SMP Thu Dec 31 13:40:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I noticed the following in the ucp-swarm-manager logs:

time="2016-01-19T22:21:02Z" level=info msg="Registered Engine ucp-slave2 at 192.168.100.12:12376" 
time="2016-01-19T22:21:02Z" level=info msg="Registered Engine ucp-slave1 at 192.168.100.11:12376" 
time="2016-01-19T22:21:03Z" level=info msg="Registered Engine ucp-master at 192.168.100.10:12376" 
time="2016-01-19T22:21:03Z" level=info msg="Registered Engine ucp-reg at 192.168.100.13:12376" 
time="2016-01-20T17:54:00Z" level=info msg="Listening for HTTP" addr=":2375" proto=tcp 
time="2016-01-20T17:54:00Z" level=info msg="Leader Election: Cluster leadership lost" 
time="2016-01-20T17:54:03Z" level=error msg="Leader Election: watch leader channel closed, the store may be unavailable..." 
time="2016-01-20T17:54:03Z" level=error msg="Discovery error: client: etcd cluster is unavailable or misconfigured" 
time="2016-01-20T17:54:03Z" level=info msg="Leader Election: Cluster leadership acquired" 
time="2016-01-20T17:54:03Z" level=error msg="Discovery error: 102: Not a file (/docker/swarm/nodes) [847]" 
time="2016-01-20T17:54:19Z" level=info msg="Registered Engine ucp-master at 192.168.100.10:12376" 
time="2016-01-20T17:54:21Z" level=info msg="Registered Engine ucp-slave1 at 192.168.100.11:12376" 
time="2016-01-20T17:54:43Z" level=error msg="Get https://192.168.100.12:12376/v1.15/info: dial tcp 192.168.100.12:12376: getsockopt: connection refused" 
time="2016-01-20T17:55:05Z" level=error msg="Get https://192.168.100.13:12376/v1.15/info: dial tcp 192.168.100.13:12376: getsockopt: connection refused" 
time="2016-01-20T17:57:51Z" level=info msg="Listening for HTTP" addr=":2375" proto=tcp 
time="2016-01-20T17:57:51Z" level=info msg="Leader Election: Cluster leadership lost" 
time="2016-01-20T17:57:54Z" level=error msg="Discovery error: client: etcd cluster is unavailable or misconfigured" 
time="2016-01-20T17:57:54Z" level=error msg="Leader Election: watch leader channel closed, the store may be unavailable..." 
time="2016-01-20T17:57:56Z" level=error msg="Discovery error: 102: Not a file (/docker/swarm/nodes) [905]" 
time="2016-01-20T17:57:56Z" level=info msg="Leader Election: Cluster leadership acquired" 
time="2016-01-20T17:57:57Z" level=info msg="Registered Engine ucp-master at 192.168.100.10:12376" 
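
To make the pattern in the log above easier to see, here is a minimal Python sketch that lists which engines did not re-register after each manager start. It assumes that every "Listening for HTTP" line marks a manager (re)start, which is an inference from the log rather than documented behavior, and the function names are hypothetical.

```python
import re

# Registration and start lines copied from the manager log above.
LOG = '''\
time="2016-01-19T22:21:02Z" level=info msg="Registered Engine ucp-slave2 at 192.168.100.12:12376"
time="2016-01-19T22:21:02Z" level=info msg="Registered Engine ucp-slave1 at 192.168.100.11:12376"
time="2016-01-19T22:21:03Z" level=info msg="Registered Engine ucp-master at 192.168.100.10:12376"
time="2016-01-19T22:21:03Z" level=info msg="Registered Engine ucp-reg at 192.168.100.13:12376"
time="2016-01-20T17:54:00Z" level=info msg="Listening for HTTP" addr=":2375" proto=tcp
time="2016-01-20T17:54:19Z" level=info msg="Registered Engine ucp-master at 192.168.100.10:12376"
time="2016-01-20T17:54:21Z" level=info msg="Registered Engine ucp-slave1 at 192.168.100.11:12376"
time="2016-01-20T17:57:51Z" level=info msg="Listening for HTTP" addr=":2375" proto=tcp
time="2016-01-20T17:57:57Z" level=info msg="Registered Engine ucp-master at 192.168.100.10:12376"
'''

REGISTERED = re.compile(r'msg="Registered Engine ([^"\s]+) at ([^"\s]+)"')
# Assumption: this message is logged once per manager start.
STARTED = re.compile(r'msg="Listening for HTTP"')


def engines_per_start(lines):
    """Group registered engine names by manager start (hypothetical helper)."""
    runs = [set()]  # first bucket holds registrations seen before any start line
    for line in lines:
        if STARTED.search(line):
            runs.append(set())
            continue
        match = REGISTERED.search(line)
        if match:
            runs[-1].add(match.group(1))
    return runs


def missing_after_restart(lines):
    """For each run, list engines seen anywhere in the log but not in that run."""
    runs = engines_per_start(lines)
    known = set().union(*runs)
    return [sorted(known - run) for run in runs]


if __name__ == "__main__":
    for index, missing in enumerate(missing_after_restart(LOG.splitlines())):
        print(f"run {index}: missing {missing}")
```

On the excerpt above, the second run is missing ucp-reg and ucp-slave2, and the third run is missing every engine except ucp-master, which matches the "connection refused" errors in the full log.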

If I run install --fresh-install on the master and join --fresh-install on the nodes, everything comes back OK. Whatever was running on the nodes is still visible afterwards.

Any ideas on stability across restarts? Cert problems, maybe?

Thanks


(Vivek Saraswat) #2

Hi Alan,

Thanks for sending this. Stability after a restart is a known issue. We’ve made some corrections that should help in the next beta version (coming soon!)


(Alan Pitts) #3

Good to know. Thanks for the quick response!