Expected behavior
expect swarm to be resilient with multiple manager nodes
Actual behavior
The swarm dies a lot (every couple of hours / days it seems). Apparently If a heartbeat message to leader is lost, a reelection is triggered, but votes get lost due to some kind of deadlock or something similar. That means the swarm is down - it is impossible to get any information about whatever happens inside the swarm, as all commands simply complain that there’s no master.
As I had 3 Manager nodes, I tried to start 2 at a time as per [1.12rc4] Occasionally Swarm cluster does not work with 2 of 3 manager nodes online · Issue #25055 · moby/moby · GitHub, but to no avail. I tried all 3 possible combinations (stopping all 3, starting 2 of them)
I then started a new swarm as described in Administer and maintain a swarm of Docker Engines | Docker Docs. I can’t join the other 2 manager nodes to that swarm, because I can’t get them to leave their swarm:
docker swarm leave --force
Error response from daemon: context deadline exceeded
I can’t even force-init a new cluster on those nodes, with the same error message.
Also my new 1-manager cluster is absolutely useless - while I can see my services and tasks, and even the nodes, all nodes are failing their heartbeat check. I assume that’s happening due to the token being changed when re-initializing the swarm…
Questions
A) how can I recover from this state? Is it possible at all? My only lead would be to change the tokens in dynamodb to the values produced by the new init, stop the two other managers and start 2 new ones (they should join the swarm using the tokens from dynamodb). Then start the ssh daemon container on the worker nodes manually with -h worker_node_host and join the swarm with their respective docker-services (that way the maybe still running services should be integrated - alternatively just scale up a few new worker nodes and then destroy the old ones)
Is there an easier way? Actually creating a new stack and tearing down the old seems easier than my solution…
B) As per github link above, this is going to be fixed in docker 1.12.1. When is it going to be released / integrated in the AMI for the cloud formation template (beta5?)?
C) Is a leaderless swarm still restarting containers? Or does that mean everything is down, not just the CLI?
Additional Information
This is a huge blocker for our usage of docker-swarm mode on AWS… I hoped that the rough edges would be more in the cloud-formation template - but I did expect docker 1.12 to be actually stable with its swarm implementation…
What exactly is causing this deadlock / endless reelection cycle? Could anything I did change have something to do with it? As per my other posts in here, I added a VPC Peering to default VPC and the according routing table entries.
so 172.31.0.0/16 is forwarded to my default VPC.
Except for that, I was deploying like 100+ containers total, about half of which still had errors and were restarting (fighting with the change from docker-compose to service format). I was using micro instances for manager nodes, but load was low (single digit cpu/memory).
Before that 3 manager node attempt I was having a 5 Manager (c4.xlarge) with 3 worker nodes (c4.xlarge as well). This worked with one service (50 tasks) over the weekend. On monday I tried to add additional services (again, with errors =>restarting). I was using the manager nodes for containers as well, so load was a bit higher, but shouldn’t have been too high (lower double-digit if at all). After a few hours the swarm died the same way as it did with my 3-manager swarm. I thought maybe the problem was that the other containers took too much resources and the manager(s) got disturbed, so I destroyed the swarm and created a new one (the 3 manager-swarm from above)
