I have been trying to get swarm working for a week or so now.
I am not using VirtualBox, but rather Hyper-V. I have 3 Ubuntu 14.04 VMs:
VM1: swarm manager, node, CA
VM2: swarm node
VM3: swarm node
All three are configured for TLS (following the Swarm TLS instructions for self-signed certs). I can connect to them all from my Windows host after copying my client certs into my .docker folder. I can also connect to the swarm. Further, all three VMs can connect to each other (when I pass the client certs into the docker command). They all also have hostnames set.
Another important point is that they use a Hyper-V NAT network on my Windows machine, and all internet access is via an NTLM proxy (so not everything that needs to connect out to the internet works, which is why I have not tried token discovery).
My issue is discovery.
If I try file discovery or a static node list, docker info gives me the following output:
$ docker info
Containers: 0
Images: 0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
(unknown): 192.168.137.2:2376
└ Status: Pending
└ Containers: 0
└ Reserved CPUs: 0 / 0
└ Reserved Memory: 0 B / 0 B
└ Labels:
└ Error: (none)
└ UpdatedAt: 2016-03-17T11:19:19Z
(unknown): 192.168.137.3:2376
└ Status: Pending
└ Containers: 0
└ Reserved CPUs: 0 / 0
└ Reserved Memory: 0 B / 0 B
└ Labels:
└ Error: (none)
└ UpdatedAt: 2016-03-17T11:19:19Z
(unknown): 192.168.137.4:2376
└ Status: Pending
└ Containers: 0
└ Reserved CPUs: 0 / 0
└ Reserved Memory: 0 B / 0 B
└ Labels:
└ Error: (none)
└ UpdatedAt: 2016-03-17T11:19:19Z
Kernel Version: 4.2.0-27-generic
Operating System: linux
CPUs: 0
Total Memory: 0 B
Name: 5409c13c32bd
Running the manager with the following command:
sudo docker run -i -t -p 3376:3376 -v /home/administrator/.certs:/certs:ro --name=SwarmManager --restart=always swarm:1.1.0 manage --tlsverify --tlscacert=/certs/ca.pem --tlscert=/certs/cert.pem --tlskey=/certs/key.pem --host=0.0.0.0:3376 nodes://192.168.137.[2:4]:2376
gives me the following output:
INFO[0000] Listening for HTTP addr=0.0.0.0:3376 proto=tcp
and running a node with:
sudo docker run -i -t -v /home/administrator/.certs:/certs:ro --name=SwarmNode --restart=always swarm:1.1.0 join --ttl "180s" --discovery-opt kv.cacertfile=/certs/ca.pem --discovery-opt kv.certfile=/certs/cert.pem --discovery-opt kv.keyfile=/certs/key.pem --advertise 192.168.137.3:2376 nodes://192.168.137.[2:4]:2376
gives this output:
…
time="2016-03-17T11:25:20Z" level=info msg="Registering on the discovery service every 1m0s..." addr="192.168.137.3:2376" discovery="nodes://192.168.137.[2:4]:2376"
time="2016-03-17T11:25:20Z" level=error msg="not implemented in this discovery service"
time="2016-03-17T11:26:20Z" level=info msg="Registering on the discovery service every 1m0s..." addr="192.168.137.3:2376" discovery="nodes://192.168.137.[2:4]:2376"
time="2016-03-17T11:26:20Z" level=error msg="not implemented in this discovery service"
…
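My reading of the "not implemented" error is that a static nodes:// list has nothing for join to register against, so the join container may simply be unnecessary for this discovery type (I have not confirmed this). For comparison with file discovery, here is a minimal sketch that expands the same range into one address per line. The helper name expand_nodes is hypothetical and not part of swarm.

```shell
#!/bin/sh
# expand_nodes is a hypothetical helper, not part of swarm.
# It expands a pattern such as 192.168.137.[2:4]:2376 into one address per line.
expand_nodes() {
  pattern=$1
  case $pattern in
    *'['*:*']'*)
      prefix=${pattern%%\[*}   # text before the opening bracket
      rest=${pattern#*\[}      # text after the opening bracket
      range=${rest%%\]*}       # the "first:last" part
      suffix=${rest#*\]}       # text after the closing bracket
      first=${range%%:*}
      last=${range##*:}
      i=$first
      while [ "$i" -le "$last" ]; do
        printf '%s%s%s\n' "$prefix" "$i" "$suffix"
        i=$((i + 1))
      done
      ;;
    *)
      # No range in the pattern: pass it through unchanged.
      printf '%s\n' "$pattern"
      ;;
  esac
}

expand_nodes '192.168.137.[2:4]:2376'
```

Redirecting the output to a file (the path is up to you) gives a list in the form I understand file:// discovery to read.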
The thing is, I was only trying the node discovery because I was having issues with Consul.
With Consul, the error messages I was getting suggested that I needed to set up Consul for TLS too, so I did that with this config file:
{
"ca_file": "/certs/ca.pem",
"cert_file": "/certs/cert.pem",
"key_file": "/certs/key.pem",
"verify_incoming": true,
"verify_outgoing": true,
"Client_addr": "0.0.0.0",
"addresses": {
"https": "0.0.0.0"
},
"ports": {
"https": 8080
}
}
Running Consul interactively, everything looks fine:
# docker start --attach Discovery
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting raft data migration...
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
Node name: '9346b8cbd86e'
Datacenter: 'dc1'
Server: true (bootstrap: true)
Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: 8080, DNS: 8600, RPC: 8400)
Cluster Addr: 172.17.0.2 (LAN: 8301, WAN: 8302)
Gossip encrypt: false, RPC-TLS: true, TLS-Incoming: true
Atlas: <disabled>
==> Log data will now stream in as it occurs:
2016/03/17 08:57:53 [INFO] serf: EventMemberJoin: 9346b8cbd86e 172.17.0.2
2016/03/17 08:57:53 [INFO] serf: EventMemberJoin: 9346b8cbd86e.dc1 172.17.0.2
2016/03/17 08:57:53 [INFO] raft: Node at 172.17.0.2:8300 [Follower] entering Follower state
2016/03/17 08:57:53 [INFO] consul: adding server 9346b8cbd86e (Addr: 172.17.0.2:8300) (DC: dc1)
2016/03/17 08:57:53 [INFO] consul: adding server 9346b8cbd86e.dc1 (Addr: 172.17.0.2:8300) (DC: dc1)
2016/03/17 08:57:53 [ERR] agent: failed to sync remote state: No cluster leader
2016/03/17 08:57:55 [WARN] raft: Heartbeat timeout reached, starting election
2016/03/17 08:57:55 [INFO] raft: Node at 172.17.0.2:8300 [Candidate] entering Candidate state
2016/03/17 08:57:55 [INFO] raft: Election won. Tally: 1
2016/03/17 08:57:55 [INFO] raft: Node at 172.17.0.2:8300 [Leader] entering Leader state
2016/03/17 08:57:55 [INFO] consul: cluster leadership acquired
2016/03/17 08:57:55 [INFO] consul: New leader elected: 9346b8cbd86e
2016/03/17 08:57:55 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2016/03/17 08:57:55 [INFO] raft: Added peer 172.17.0.3:8300, starting replication
2016/03/17 08:57:55 [ERR] raft: Failed to AppendEntries to 172.17.0.3:8300: dial tcp 172.17.0.3:8300: connection refused
2016/03/17 08:57:55 [INFO] raft: Removed peer 172.17.0.3:8300, stopping replication (Index: 20)
2016/03/17 08:57:55 [ERR] raft: Failed to AppendEntries to 172.17.0.3:8300: dial tcp 172.17.0.3:8300: connection refused
2016/03/17 08:57:55 [INFO] consul: member '9346b8cbd86e' joined, marking health alive
2016/03/17 08:57:55 [INFO] consul: member '271af29fda42' reaped, deregistering
2016/03/17 08:57:56 [INFO] agent: Synced service 'consul'
but when I run the manager:
# docker run -i -t -p 3376:3376 -v /home/administrator/.certs:/certs:ro --name=SwarmManager --restart=always swarm:1.1.0 manage --tlsverify --tlscacert=/certs/ca.pem --tlscert=/certs/cert.pem --tlskey=/certs/key.pem --discovery-opt kv.cacertfile=/certs/ca.pem --discovery-opt kv.certfile=/certs/cert.pem --discovery-opt kv.keyfile=/certs/key.pem --host=0.0.0.0:3376 consul://swarm:8080
INFO[0000] Initializing discovery with TLS
INFO[0000] Listening for HTTP addr=0.0.0.0:3376 proto=tcp
ERRO[0000] Discovery error: Get https://swarm:8080/v1/kv/docker/swarm/nodes?consistent=: remote error: handshake failure
ERRO[0000] Discovery error: Put https://swarm:8080/v1/kv/docker/swarm/nodes: remote error: handshake failure
ERRO[0000] Discovery error: Unexpected watch error
ERRO[0060] Discovery error: Get https://swarm:8080/v1/kv/docker/swarm/nodes?consistent=: remote error: handshake failure
ERRO[0060] Discovery error: Put https://swarm:8080/v1/kv/docker/swarm/nodes: remote error: handshake failure
ERRO[0060] Discovery error: Unexpected watch error
and if I run docker info:
$ docker info
Containers: 0
Images: 0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 0
Kernel Version: 4.2.0-27-generic
Operating System: linux
CPUs: 0
Total Memory: 0 B
Name: 161897188e4e
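Since I cannot get a shell inside the swarm image, one way to reproduce the handshake failure outside of swarm would be to probe the Consul HTTPS port from the VM itself with the same certificates. This is only a sketch: the helper name probe_tls is hypothetical, and it assumes the cert files are named ca.pem, cert.pem and key.pem as in my mounted directory.

```shell
#!/bin/sh
# probe_tls is a hypothetical helper: it attempts a TLS 1.2 handshake against
# host:port, presenting the client certificate found in certdir.
probe_tls() {
  host=$1 port=$2 certdir=$3
  # </dev/null makes s_client exit once the handshake has finished.
  openssl s_client -connect "$host:$port" -tls1_2 \
    -CAfile "$certdir/ca.pem" \
    -cert "$certdir/cert.pem" \
    -key "$certdir/key.pem" </dev/null
}

# Example, run on the VM rather than inside a container:
# probe_tls swarm 8080 /home/administrator/.certs
```

If the forced TLS 1.2 handshake succeeds here, the protocol version is probably not the problem, and the verification output from s_client should say more about which certificate is being rejected.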
I feel I am very close here; I just can’t understand why discovery is not working for me. I did have it briefly working before I enabled TLS, but even before I made any changes I was still seeing the Pending status in docker info on the swarm. I have been able to confirm that TLS 1.2 is available in the consul image, but I can’t get bash or sh in the swarm image to confirm the TLS version there (the swarm image is 1.1.0 because I was planning to try to join a Windows container server to the swarm too, and the only swarm image I could find for that was 1.1.0).
Am I missing something here?
Update
In /var/log/upstart/docker.log I am seeing a lot of TLS handshake errors: remote error: bad certificate.
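To see which machines are being rejected, the log can be summarised by remote address. This is a rough sketch: the helper name handshake_sources is hypothetical, and it assumes the entries contain the usual "TLS handshake error from IP:port" text.

```shell
#!/bin/sh
# handshake_sources is a hypothetical helper: it reads log text on stdin and
# prints "count ip" for each remote address that failed the TLS handshake.
handshake_sources() {
  grep 'TLS handshake error from' |
    sed -E 's/.*TLS handshake error from ([0-9.]+):[0-9]+.*/\1/' |
    sort | uniq -c |
    sed -E 's/^ *([0-9]+) +/\1 /'   # strip the padding added by uniq -c
}

# Example:
# handshake_sources < /var/log/upstart/docker.log
```

If all of the errors come from a single address, that would point at one machine's certificate rather than at the CA.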
I generated the certificates following the instructions at https://docs.docker.com/swarm/configure-tls/
The exact commands I used to create them are:
openssl genrsa -out ca-priv-key.pem 2048
openssl req -config /usr/lib/ssl/openssl.cnf -new -key ca-priv-key.pem -x509 -days 1825 -out ca.pem
openssl genrsa -out swarm-priv-key.pem 2048
openssl req -subj "/CN=swarm" -new -key swarm-priv-key.pem -out swarm.csr
openssl x509 -req -days 1825 -in swarm.csr -CA ca.pem -CAkey ca-priv-key.pem -CAcreateserial -out swarm-cert.pem -extensions v3_req -extfile /usr/lib/ssl/openssl.cnf
openssl rsa -in swarm-priv-key.pem -out swarm-priv-key.pem
openssl genrsa -out node1-priv-key.pem 2048
openssl req -subj "/CN=swarm" -new -key node1-priv-key.pem -out node1.csr
openssl x509 -req -days 1825 -in node1.csr -CA ca.pem -CAkey ca-priv-key.pem -CAcreateserial -out node1-cert.pem -extensions v3_req -extfile /usr/lib/ssl/openssl.cnf
openssl rsa -in node1-priv-key.pem -out node1-priv-key.pem
openssl genrsa -out node2-priv-key.pem 2048
openssl req -subj "/CN=swarm" -new -key node2-priv-key.pem -out node2.csr
openssl x509 -req -days 1825 -in node2.csr -CA ca.pem -CAkey ca-priv-key.pem -CAcreateserial -out node2-cert.pem -extensions v3_req -extfile /usr/lib/ssl/openssl.cnf
openssl rsa -in node2-priv-key.pem -out node2-priv-key.pem
openssl genrsa -out client-priv-key.pem 2048
openssl req -subj "/CN=swarm" -new -key client-priv-key.pem -out client.csr
openssl x509 -req -days 1825 -in client.csr -CA ca.pem -CAkey ca-priv-key.pem -CAcreateserial -out client-cert.pem -extensions v3_req -extfile /usr/lib/ssl/openssl.cnf
openssl rsa -in client-priv-key.pem -out client-priv-key.pem
I have also added this line:
subjectAltName = IP:192.168.137.2
under the v3_req section of my /usr/lib/ssl/openssl.cnf file, but I have since updated the /etc/hosts files on each machine and I am now addressing machines by hostname rather than IP.
I have also just checked, and all the servers are set to the exact same time, so I am not sure what is bad about this certificate.
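One thing I have not ruled out is that every certificate shares CN=swarm and the same single IP SAN, and that none of them explicitly allows both server and client use. Here is a minimal sketch of a per-host extension file that would address that. The helper name write_extfile is hypothetical, and pairing the name swarm with 192.168.137.2 is an assumption made for illustration.

```shell
#!/bin/sh
# write_extfile is a hypothetical helper: it prints an OpenSSL extension
# section for one host, listing its name and IP as SANs and allowing the
# certificate to be used for both server and client authentication.
write_extfile() {
  name=$1 ip=$2
  cat <<EOF
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth, clientAuth
subjectAltName = DNS:$name, IP:$ip
EOF
}

# Assumed pairing of host name and address, for illustration only.
write_extfile swarm 192.168.137.2
```

The output could be saved to a file and passed to the existing openssl x509 -req commands with -extfile and -extensions v3_req, instead of editing the shared openssl.cnf. Running openssl x509 -noout -text -in on the resulting certificate shows whether the SAN and extended key usage entries actually made it in.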