Docker-machine state suddenly shows Error... :s

Hello everybody,

I’ve set up a Docker UCP test environment, which I deployed from a deployment machine with docker-machine.
UCP is working fine and docker-machine ls on the deployment machine was also working, but I ran into the following problem:

NAME                    ACTIVE   DRIVER          STATE     URL                        SWARM   DOCKER    ERRORS
devstackdockerengine1   -        generic         Running   tcp://192.168.123.1:2376           v1.11.1
esxdockerengine1        -        vmwarevsphere   Error                                        Unknown
esxdockerengine2        -        vmwarevsphere   Error                                        Unknown
esxdockerengine3        -        vmwarevsphere   Error                                        Unknown

As you can see, the esxdockerengine machines are showing the state Error. This is quite strange and I want to know how I can fix it. Note: UCP and my swarm are still working just fine (the swarm manager is located on the esxdockerengine1 node).
Since they are in the Error state, I can no longer ssh into my nodes with docker-machine :frowning:

I have no idea why it broke. Here is the output of the docker info command:

[root@localhost ~]# docker info
Containers: 24
 Running: 23
 Paused: 0
 Stopped: 1
Images: 53
Server Version: swarm/1.1.3
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 4
 devstackdockerengine1: 192.168.123.1:12376
  └ Status: Healthy
  └ Containers: 3
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 8.11 GiB
  └ Labels: executiondriver=, kernelversion=3.16.0-4-amd64, location=on_premise_BE, operatingsystem=Debian GNU/Linux 8 (jessie), provider=generic, storagedriver=aufs, target=apps, type=devstacker
  └ Error: (none)
  └ UpdatedAt: 2016-05-23T10:19:53Z
 esxdockerengine1: 192.168.123.14:12376
  └ Status: Healthy
  └ Containers: 10
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 64.42 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.1.19-boot2docker, location=on_premise_BE, operatingsystem=Boot2Docker 1.10.3 (TCL 6.4.1); master : 625117e - Thu Mar 10 22:09:02 UTC 2016, provider=vmwarevsphere, storagedriver=aufs, target=apps, type=controllers
  └ Error: (none)
  └ UpdatedAt: 2016-05-23T10:19:36Z
 esxdockerengine2: 192.168.123.15:12376
  └ Status: Healthy
  └ Containers: 6
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 64.42 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.1.19-boot2docker, location=on_premise_BE, operatingsystem=Boot2Docker 1.10.3 (TCL 6.4.1); master : 625117e - Thu Mar 10 22:09:02 UTC 2016, provider=vmwarevsphere, storagedriver=aufs, target=apps, type=secondary
  └ Error: (none)
  └ UpdatedAt: 2016-05-23T10:19:54Z
 esxdockerengine3: 192.168.123.39:12376
  └ Status: Healthy
  └ Containers: 5
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 64.42 GiB
  └ Labels: executiondriver=, kernelversion=4.1.19-boot2docker, location=on_premise_BE, operatingsystem=Boot2Docker 1.11.0 (TCL 7.0); HEAD : 32ee7e9 - Wed Apr 13 20:06:49 UTC 2016, provider=vmwarevsphere, storagedriver=aufs, target=loadbalancer, type=loadbalancing
  └ Error: (none)
  └ UpdatedAt: 2016-05-23T10:19:45Z
Cluster Managers: 1
 192.168.123.14: Healthy
  └ Orca Controller: https://192.168.123.14:443
  └ Swarm Manager: tcp://192.168.123.14:3376
  └ KV: etcd://192.168.123.14:12379
Plugins:
 Volume:
 Network:
Kernel Version: 4.1.19-boot2docker
Operating System: linux
Architecture: amd64
CPUs: 28
Total Memory: 201.4 GiB
Name: ucp-controller-esxdockerengine1
ID: 4DAZ:FR3E:32PA:N2IG:MHHC:AXO3:L4MH:C2WQ:NL7S:IFRK:JLLA:WONP

PS: I asked the same question in Docker machine shows suddenly error · Issue #3449 · docker/machine · GitHub

EDIT:
When I try to set the environment to a machine manually, I get the following error:

eval "$(docker-machine env esxdockerengine2)"
Error checking TLS connection: Host is not running

But it is running, and I can even ping the address.
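As a workaround I can point my client at that engine by hand, since the certificates are still on disk. A rough sketch (the cert path and the 2376 port are assumptions based on the default docker-machine layout):

# roughly what "docker-machine env esxdockerengine2" would normally export
export DOCKER_TLS_VERIFY=1
export DOCKER_HOST=tcp://192.168.123.15:2376
export DOCKER_CERT_PATH=~/.docker/machine/machines/esxdockerengine2
export DOCKER_MACHINE_NAME=esxdockerengine2
docker ps   # talks to the engine directly, bypassing docker-machine env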
Whenever I execute the restart command, I get stuck at “waiting for ssh to be available”

docker-machine -D restart esxdockerengine2
Docker Machine Version:  0.6.0, build e27fb87
Found binary path at /usr/local/bin/docker-machine
Launching plugin server for driver vmwarevsphere
Plugin server listening at address 127.0.0.1:43493
() Calling .GetVersion
Using API Version  1
() Calling .SetConfigRaw
() Calling .GetMachineName
command=restart machine=esxdockerengine2
Restarting "esxdockerengine2"...
(esxdockerengine2) Calling .GetState
Error getting machine state: vm 'esxdockerengine2' not found
(esxdockerengine2) Calling .GetState
Error getting machine state: vm 'esxdockerengine2' not found
Waiting for SSH to be available...
Getting to WaitForSSH function...
(esxdockerengine2) Calling .GetSSHHostname
Error getting ssh command 'exit 0' : Host is not running
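The “vm 'esxdockerengine2' not found” message seems to come from the vSphere driver, so one sanity check is to compare the machine name docker-machine has stored locally with the VM name in vCenter (a sketch, assuming the default ~/.docker/machine layout):

docker-machine inspect esxdockerengine2                              # shows the driver config, including the VM name it looks for
cat ~/.docker/machine/machines/esxdockerengine2/config.json         # the raw config the driver reads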

I can confirm that ssh is running because when I do ssh -lroot 192.168.123.14, I get prompted for a password.

Since I was using docker-machine version 0.6.0, I updated to 0.7.0 by issuing the following command:

curl -L https://github.com/docker/machine/releases/download/v0.7.0/docker-machine-$(uname -s)-$(uname -m) > /usr/local/bin/docker-machine && chmod +x /usr/local/bin/docker-machine
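To confirm the new binary is picked up, I checked the version afterwards:

docker-machine version   # should now report 0.7.0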

But I still get the same error…

But it is running, and I can even ping the address.
If the STATE is “Error”, the machine is not running.

When I do docker ps, it shows all the containers running on the hosts that are in the Error state.
Also, Docker UCP (which runs on the “error” nodes) is still online and I can still launch containers from UCP. I also see them in the swarm, so the nodes are NOT down.

Is there a way to manually ssh into the machines with the certificates in .docker/machine/machines/esxdockerengine* ?
Edit: there is, and it’s ssh -ldocker -i id_rsa 192.168.123.15. It worked, so the machine is NOT down.
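For the record, the key sits in the machine directory, so the full form is something like this (path assumed from the default docker-machine layout):

ssh -i ~/.docker/machine/machines/esxdockerengine2/id_rsa docker@192.168.123.15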

The docker ps command only lists the running containers on the active machine. Has the machine listed with the Error state been made the active machine?

I don’t know what you’re trying to say, but the machines in the Error state are working just fine. I can ssh into them (not by using docker-machine, but just normal ssh) and the containers on the error nodes are also running just fine. I can start/stop/create containers on the ‘Error’ nodes.

I’ve never heard of ‘The Active machine’

Edit: by active machine you probably mean the * in front of a machine in the “docker-machine ls” output. I’m not using that, since I’ve set up the environment variables and am talking to the swarm directly.
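For reference, which machine is currently “active” can be checked with docker-machine active; it matches the DOCKER_HOST environment variable against the machines it knows and prints the matching name (or an error if none is active):

docker-machine active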

Is Docker Machine used to provision Docker Swarm?

Yes, Docker Machine was used to provision the machines on a VMware ESX setup. The state was showing fine before, but now it shows ‘Error’. The nodes continue to work, however (I can still deploy containers on them, view the web GUI that’s hosted on them, etc.).

If the state shows Error, there has to be some error somewhere.

Yeah, but when I do docker info on that same machine, I get an overview of my nodes and it shows ‘Error: (none)’ for the machines that show an error in docker-machine. Very strange.

[root@localhost ~]# docker info
Containers: 38
 Running: 32
 Paused: 0
 Stopped: 6
Images: 53
Server Version: swarm/1.1.3
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 4
 devstackdockerengine1: 192.168.123.1:12376
  └ Status: Healthy
  └ Containers: 6
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 8.11 GiB
  └ Labels: executiondriver=, kernelversion=3.16.0-4-amd64, location=on_premise_BE, operatingsystem=Debian GNU/Linux 8 (jessie), provider=generic, storagedriver=aufs, target=apps, type=devstacker
  └ Error: (none)
  └ UpdatedAt: 2016-05-30T13:12:19Z
 esxdockerengine1: 192.168.123.14:12376
  └ Status: Healthy
  └ Containers: 12
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 64.42 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.1.19-boot2docker, location=on_premise_BE, operatingsystem=Boot2Docker 1.10.3 (TCL 6.4.1); master : 625117e - Thu Mar 10 22:09:02 UTC 2016, provider=vmwarevsphere, storagedriver=aufs, target=apps, type=controllers
  └ Error: (none)
  └ UpdatedAt: 2016-05-30T13:12:22Z
 esxdockerengine2: 192.168.123.15:12376
  └ Status: Healthy
  └ Containers: 14
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 64.42 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.1.19-boot2docker, location=on_premise_BE, operatingsystem=Boot2Docker 1.10.3 (TCL 6.4.1); master : 625117e - Thu Mar 10 22:09:02 UTC 2016, provider=vmwarevsphere, storagedriver=aufs, target=apps, type=secondary
  └ Error: (none)
  └ UpdatedAt: 2016-05-30T13:12:44Z
 esxdockerengine3: 192.168.123.39:12376
  └ Status: Healthy
  └ Containers: 6
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 64.42 GiB
  └ Labels: executiondriver=, kernelversion=4.1.19-boot2docker, location=on_premise_BE, operatingsystem=Boot2Docker 1.11.0 (TCL 7.0); HEAD : 32ee7e9 - Wed Apr 13 20:06:49 UTC 2016, provider=vmwarevsphere, storagedriver=aufs, target=loadbalancer, type=loadbalancing
  └ Error: (none)
  └ UpdatedAt: 2016-05-30T13:12:59Z
Cluster Managers: 1
 192.168.123.14: Healthy
  └ Orca Controller: https://192.168.123.14:443
  └ Swarm Manager: tcp://192.168.123.14:3376
  └ KV: etcd://192.168.123.14:12379
Plugins:
 Volume:
 Network:
Kernel Version: 4.1.19-boot2docker
Operating System: linux
Architecture: amd64
CPUs: 28
Total Memory: 201.4 GiB
Name: ucp-controller-esxdockerengine1

The following command is not required.
eval "$(docker-machine env esxdockerengine2)"

To run the docker command against a particular Docker machine, that machine first has to be made active, and the preceding command is what makes a machine active.

The command to connect with a Swarm manager is different.
eval "$(docker-machine env --swarm swarm-manager)"
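Before it is evaluated, that command prints something like the following (values are illustrative):

export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://<swarm-manager-ip>:3376"
export DOCKER_CERT_PATH="/home/<user>/.docker/machine/machines/swarm-manager"
export DOCKER_MACHINE_NAME="swarm-manager"
# Run this command to configure your shell:
# eval $(docker-machine env --swarm swarm-manager)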

Refer to the section
Connect node environments with Machine

I didn’t run that command. I’m using UCP and I set the environment variables by executing the env.sh file from the client bundle. I (almost) never switch to a machine manually, and if I do, I ssh into it.
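Roughly, the env.sh from the UCP client bundle just exports the usual client variables pointing at the UCP controller (a sketch from memory; the exact contents depend on the bundle):

export DOCKER_TLS_VERIFY=1
export DOCKER_CERT_PATH="$(pwd)"                # the directory the client bundle was unzipped into
export DOCKER_HOST=tcp://192.168.123.14:443     # the UCP controller in my setup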

But that still doesn’t explain why some of my nodes are in the Error state…

The env.sh procedure is not described in the documentation.

That’s because you’re looking at the swarm documentation and not at the UCP documentation…

But isn’t Docker Machine used for Swarm? Please provide a link to the documentation you are referring to.

It’s inside the UCP documentation: https://docs.docker.com/ucp/install-sandbox/
I’m not doing anything wrong, because it worked before without a problem…

Is the same environment used as in the documentation?

This evaluation installs UCP on top of the open source software version of Docker Engine inside of a VirtualBox VM which is running the small-footprint boot2docker.iso Linux. Such a configuration is not supported for UCP in production.