Trouble with discovery and TLS

I have been trying to get swarm working for a week or so now.

I am not using Virtualbox, but rather Hyper-V. I have 3 Ubuntu 14.04 VMs.
VM1: swarm manager, node, CA
VM2: swarm node
VM3: swarm node

All three are configured for TLS (following the swarm TLS instructions for self-signed certs). I can connect to them all from my Windows host after copying my client certs into my .docker folder. I can also connect to the swarm. Further, all three VMs can connect to each other (when I pass the client certs into the docker command). They all also have hostnames set.

Another important point is that they use a Hyper-V NAT network on my Windows machine, and all internet access is via an NTLM proxy (so not everything that needs to connect out to the internet works, which is why I have not tried token discovery).

My issue is discovery.
If I try file discovery or a node list, docker info gives me the following output:

$ docker info
Containers: 0
Images: 0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
 (unknown): 192.168.137.2:2376
  └ Status: Pending
  └ Containers: 0
  └ Reserved CPUs: 0 / 0
  └ Reserved Memory: 0 B / 0 B
  └ Labels:
  └ Error: (none)
  └ UpdatedAt: 2016-03-17T11:19:19Z
 (unknown): 192.168.137.3:2376
  └ Status: Pending
  └ Containers: 0
  └ Reserved CPUs: 0 / 0
  └ Reserved Memory: 0 B / 0 B
  └ Labels:
  └ Error: (none)
  └ UpdatedAt: 2016-03-17T11:19:19Z
 (unknown): 192.168.137.4:2376
  └ Status: Pending
  └ Containers: 0
  └ Reserved CPUs: 0 / 0
  └ Reserved Memory: 0 B / 0 B
  └ Labels:
  └ Error: (none)
  └ UpdatedAt: 2016-03-17T11:19:19Z
Kernel Version: 4.2.0-27-generic
Operating System: linux
CPUs: 0
Total Memory: 0 B
Name: 5409c13c32bd

Running the manager with the following command:

sudo docker run -i -t -p 3376:3376 -v /home/administrator/.certs:/certs:ro --name=SwarmManager --restart=always swarm:1.1.0 manage --tlsverify --tlscacert=/certs/ca.pem --tlscert=/certs/cert.pem --tlskey=/certs/key.pem --host=0.0.0.0:3376 nodes://192.168.137.[2:4]:2376

gives me the following output:

INFO[0000] Listening for HTTP                            addr=0.0.0.0:3376 proto=tcp

and running a node with:

sudo docker run -i -t -v /home/administrator/.certs:/certs:ro --name=SwarmNode --restart=always swarm:1.1.0 join --ttl "180s" --discovery-opt kv.cacertfile=/certs/ca.pem --discovery-opt kv.certfile=/certs/cert.pem --discovery-opt kv.keyfile=/certs/key.pem --advertise 192.168.137.3:2376 nodes://192.168.137.[2:4]:2376

gives this output:

time="2016-03-17T11:25:20Z" level=info msg="Registering on the discovery service every 1m0s..." addr="192.168.137.3:2376" discovery="nodes://192.168.137.[2:4]:2376"
time="2016-03-17T11:25:20Z" level=error msg="not implemented in this discovery service"
time="2016-03-17T11:26:20Z" level=info msg="Registering on the discovery service every 1m0s..." addr="192.168.137.3:2376" discovery="nodes://192.168.137.[2:4]:2376"
time="2016-03-17T11:26:20Z" level=error msg="not implemented in this discovery service"
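
For reference, my understanding is that the [2:4] range in the nodes:// URL above is expanded into one address per node, which matches the three entries docker info lists. A minimal sketch of that expansion follows; expand_nodes is a hypothetical helper written only for illustration and is not part of swarm, and it assumes a single [start:end] range in the pattern.

```shell
#!/bin/sh
# expand_nodes is a hypothetical helper, not a swarm command.
# It expands one "[start:end]" range inside an address pattern.
expand_nodes() {
    prefix="${1%%\[*}"      # text before the "[", e.g. 192.168.137.
    rest="${1#*\[}"         # text after the "[", e.g. 2:4]:2376
    range="${rest%%\]*}"    # the range itself, e.g. 2:4
    suffix="${rest#*\]}"    # text after the "]", e.g. :2376
    start="${range%%:*}"
    end="${range##*:}"

    i="$start"
    while [ "$i" -le "$end" ]; do
        echo "${prefix}${i}${suffix}"
        i=$((i + 1))
    done
}

expand_nodes "192.168.137.[2:4]:2376"
```

With the pattern from my command this prints 192.168.137.2:2376, 192.168.137.3:2376 and 192.168.137.4:2376, one per line.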

The thing is, I was only trying the node discovery because I was having issues with consul.

With consul, the error messages I was getting suggested that I needed to set up consul for TLS too, so I did that with this config file:

{
 "ca_file": "/certs/ca.pem",
 "cert_file": "/certs/cert.pem",
 "key_file": "/certs/key.pem",
 "verify_incoming": true,
 "verify_outgoing": true,
 "Client_addr": "0.0.0.0",
 "addresses": {
   "https": "0.0.0.0"
 },
 "ports": {
   "https": 8080
 }
}

Running consul interactively, everything looks fine:

# docker start --attach Discovery
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting raft data migration...
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
     Node name: '9346b8cbd86e'
    Datacenter: 'dc1'
        Server: true (bootstrap: true)
   Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: 8080, DNS: 8600, RPC: 8400)
  Cluster Addr: 172.17.0.2 (LAN: 8301, WAN: 8302)
Gossip encrypt: false, RPC-TLS: true, TLS-Incoming: true
         Atlas: <disabled>

==> Log data will now stream in as it occurs:

2016/03/17 08:57:53 [INFO] serf: EventMemberJoin: 9346b8cbd86e 172.17.0.2
2016/03/17 08:57:53 [INFO] serf: EventMemberJoin: 9346b8cbd86e.dc1 172.17.0.2
2016/03/17 08:57:53 [INFO] raft: Node at 172.17.0.2:8300 [Follower] entering Follower state
2016/03/17 08:57:53 [INFO] consul: adding server 9346b8cbd86e (Addr: 172.17.0.2:8300) (DC: dc1)
2016/03/17 08:57:53 [INFO] consul: adding server 9346b8cbd86e.dc1 (Addr: 172.17.0.2:8300) (DC: dc1)
2016/03/17 08:57:53 [ERR] agent: failed to sync remote state: No cluster leader
2016/03/17 08:57:55 [WARN] raft: Heartbeat timeout reached, starting election
2016/03/17 08:57:55 [INFO] raft: Node at 172.17.0.2:8300 [Candidate] entering Candidate state
2016/03/17 08:57:55 [INFO] raft: Election won. Tally: 1
2016/03/17 08:57:55 [INFO] raft: Node at 172.17.0.2:8300 [Leader] entering Leader state
2016/03/17 08:57:55 [INFO] consul: cluster leadership acquired
2016/03/17 08:57:55 [INFO] consul: New leader elected: 9346b8cbd86e
2016/03/17 08:57:55 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2016/03/17 08:57:55 [INFO] raft: Added peer 172.17.0.3:8300, starting replication
2016/03/17 08:57:55 [ERR] raft: Failed to AppendEntries to 172.17.0.3:8300: dial tcp 172.17.0.3:8300: connection refused
2016/03/17 08:57:55 [INFO] raft: Removed peer 172.17.0.3:8300, stopping replication (Index: 20)
2016/03/17 08:57:55 [ERR] raft: Failed to AppendEntries to 172.17.0.3:8300: dial tcp 172.17.0.3:8300: connection refused
2016/03/17 08:57:55 [INFO] consul: member '9346b8cbd86e' joined, marking health alive
2016/03/17 08:57:55 [INFO] consul: member '271af29fda42' reaped, deregistering
2016/03/17 08:57:56 [INFO] agent: Synced service 'consul'

but when I run the manager:

# docker run -i -t -p 3376:3376 -v /home/administrator/.certs:/certs:ro --name=SwarmManager --restart=always swarm:1.1.0 manage --tlsverify --tlscacert=/certs/ca.pem --tlscert=/certs/cert.pem --tlskey=/certs/key.pem --discovery-opt kv.cacertfile=/certs/ca.pem --discovery-opt kv.certfile=/certs/cert.pem --discovery-opt kv.keyfile=/certs/key.pem --host=0.0.0.0:3376 consul://swarm:8080
INFO[0000] Initializing discovery with TLS
INFO[0000] Listening for HTTP                            addr=0.0.0.0:3376 proto=tcp
ERRO[0000] Discovery error: Get https://swarm:8080/v1/kv/docker/swarm/nodes?consistent=: remote error: handshake failure
ERRO[0000] Discovery error: Put https://swarm:8080/v1/kv/docker/swarm/nodes: remote error: handshake failure
ERRO[0000] Discovery error: Unexpected watch error
ERRO[0060] Discovery error: Get https://swarm:8080/v1/kv/docker/swarm/nodes?consistent=: remote error: handshake failure
ERRO[0060] Discovery error: Put https://swarm:8080/v1/kv/docker/swarm/nodes: remote error: handshake failure
ERRO[0060] Discovery error: Unexpected watch error

and if I run docker info:

$ docker info
Containers: 0
Images: 0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 0
Kernel Version: 4.2.0-27-generic
Operating System: linux
CPUs: 0
Total Memory: 0 B
Name: 161897188e4e

I feel I am very close here; I just can't understand why discovery is not working for me. I did have it briefly working before I enabled TLS, but even then, before I made any changes, I was still seeing the (pending) issue with docker info on the swarm. I have been able to confirm that TLS 1.2 is available in the consul image, but I can't get bash or sh in the swarm image to confirm the TLS version there. (The swarm image is 1.1.0 because I was planning to try to join a Windows container server to the swarm too, and the only swarm image I could find for that was 1.1.0.)

Am I missing something here?

Update

In /var/log/upstart/docker.log I am seeing a lot of TLS handshake errors: remote error: bad certificate.

I generated the certificates following the instructions found here: https://docs.docker.com/swarm/configure-tls/

The exact commands I used to create them are:

openssl genrsa -out ca-priv-key.pem 2048
openssl req -config /usr/lib/ssl/openssl.cnf -new -key ca-priv-key.pem -x509 -days 1825 -out ca.pem

openssl genrsa -out swarm-priv-key.pem 2048
openssl req -subj "/CN=swarm" -new -key swarm-priv-key.pem -out swarm.csr
openssl x509 -req -days 1825 -in swarm.csr -CA ca.pem -CAkey ca-priv-key.pem -CAcreateserial -out swarm-cert.pem -extensions v3_req -extfile /usr/lib/ssl/openssl.cnf
openssl rsa -in swarm-priv-key.pem -out swarm-priv-key.pem

openssl genrsa -out node1-priv-key.pem 2048
openssl req -subj "/CN=swarm" -new -key node1-priv-key.pem -out node1.csr
openssl x509 -req -days 1825 -in node1.csr -CA ca.pem -CAkey ca-priv-key.pem -CAcreateserial -out node1-cert.pem -extensions v3_req -extfile /usr/lib/ssl/openssl.cnf
openssl rsa -in node1-priv-key.pem -out node1-priv-key.pem


openssl genrsa -out node2-priv-key.pem 2048
openssl req -subj "/CN=swarm" -new -key node2-priv-key.pem -out node2.csr
openssl x509 -req -days 1825 -in node2.csr -CA ca.pem -CAkey ca-priv-key.pem -CAcreateserial -out node2-cert.pem -extensions v3_req -extfile /usr/lib/ssl/openssl.cnf
openssl rsa -in node2-priv-key.pem -out node2-priv-key.pem

openssl genrsa -out client-priv-key.pem 2048
openssl req -subj "/CN=swarm" -new -key client-priv-key.pem -out client.csr
openssl x509 -req -days 1825 -in client.csr -CA ca.pem -CAkey ca-priv-key.pem -CAcreateserial -out client-cert.pem -extensions v3_req -extfile /usr/lib/ssl/openssl.cnf
openssl rsa -in client-priv-key.pem -out client-priv-key.pem

I have also added this line:

subjectAltName = IP:192.168.137.2

under the v3_req section of my /usr/lib/ssl/openssl.cnf file, but I have since updated the /etc/hosts files on each machine and I am now addressing machines by hostname rather than IP.

I have also just checked, and all the servers are set to the exact same time, so I am not sure what is bad about this certificate.
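
One way to see what actually ended up in each certificate is to check it against the CA and print its subject and extensions. The following is only a sketch: inspect_cert is a hypothetical helper name, and the file names in the usage line are the ones from my commands above. It relies only on the standard openssl verify and openssl x509 subcommands.

```shell
#!/bin/sh
# inspect_cert is a hypothetical helper, not part of docker or swarm.
# Usage: inspect_cert <ca.pem> <cert.pem>
inspect_cert() {
    ca_file="$1"
    cert_file="$2"

    # Does the certificate chain back to the CA?
    # Prints "<file>: OK" on success.
    openssl verify -CAfile "$ca_file" "$cert_file"

    # Which subject (CN) was the certificate issued for?
    openssl x509 -in "$cert_file" -noout -subject

    # Which alternative names and extended key usages are present?
    # grep prints nothing if neither extension is in the certificate.
    openssl x509 -in "$cert_file" -noout -text |
        grep -A1 -E "Subject Alternative Name|Extended Key Usage"
}
```

For example, inspect_cert ca.pem swarm-cert.pem would show whether swarm-cert.pem verifies against ca.pem and which names (IP or DNS) it lists, which can then be compared with the name or address used to reach that machine.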