AWS beta 4
Expected behavior
The swarm manager(s) should recover after an instance reboot, or after an instance is terminated and replaced.
Actual behavior
When a stack with a single manager has the manager replaced, or even just rebooted, the manager does not recover. If you ssh into the manager and run docker info, the output shows that the instance is no longer a manager.
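For anyone who wants to script that check instead of reading the docker info output by eye, here is a minimal sketch. It assumes the Docker version on the instance supports `docker info --format '{{json .Swarm}}'` and that the Swarm section exposes the `LocalNodeState` and `ControlAvailable` fields. The function names `swarm_role` and `read_swarm_info` are made up for illustration.

```python
import json
import subprocess


def swarm_role(swarm_info: dict) -> str:
    """Classify a node from the Swarm section of `docker info`.

    Returns "manager", "worker", or "not in swarm".
    """
    # A node that has left (or lost) the swarm is not reported as "active".
    if swarm_info.get("LocalNodeState") != "active":
        return "not in swarm"
    # ControlAvailable is true only on nodes acting as managers.
    return "manager" if swarm_info.get("ControlAvailable") else "worker"


def read_swarm_info() -> dict:
    """Ask the local Docker daemon for its Swarm state as JSON.

    Assumption: the docker CLI is on PATH and accepts --format for `info`.
    """
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Swarm}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

On the rebooted instance, `print(swarm_role(read_swarm_info()))` would be expected to print something other than `manager` if the behavior described above is reproduced.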
Additional Information
Is it possible for a stack to recover from the termination and replacement of a manager? Or of a worker?
Steps to reproduce the behavior
Launch a stack with a single manager
Terminate the manager instance
The ASG replaces the manager instance with a new one
I don’t know if this is possible with a single manager. A single manager setup isn’t recommended for anything besides local development and testing. Because it is only one manager, we can’t expect it to work correctly for failover reasons.
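To make the failover point concrete, here is a small sketch of the usual majority-quorum arithmetic that swarm managers follow (they use Raft consensus). The helper names are only illustrative.

```python
def quorum(managers: int) -> int:
    """Smallest number of managers that forms a majority."""
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many managers can be lost while a majority still remains."""
    return (managers - 1) // 2


for n in (1, 3, 5):
    print(f"{n} manager(s): quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

With one manager the tolerated failure count is 0, so losing that single instance loses the quorum. Three managers tolerate one loss and five tolerate two.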
I hear you, it is hard to recover from a complete failure like that. But these things do happen in production. In our own clusters we have had to implement automatic disaster recovery for cases such as every master in a ZooKeeper ensemble failing. This can happen for fairly innocent reasons, like when an AWS stack is updated, which may cause the ASG to be replaced with a new one rather than updated in place. That causes all the instances in the ASG to be terminated and replaced.
So whatever the answer is, we will at least need a way to do disaster recovery of a set of swarm managers to recover from the worst cases.
I have not yet tried replacing a single instance in a 3-manager swarm.
@luv2code are you using only 1 manager, or 3 or 5? If you are using only one manager, then you might need to delete the stack and recreate it. There might be a way to recover, but I am not 100% sure.
For a 1-manager cluster, I don’t think swarm could recover by itself after the only manager is terminated. When the EC2 instance is terminated, its root volume is also deleted, so all data on the swarm manager node is lost. The swarm manager data is not replicated or backed up to the swarm worker nodes. Unless Docker for AWS backs up the swarm manager state somewhere else, how could it be possible to recover after the single manager is terminated?
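One possible direction, purely as a sketch and not something Docker for AWS is known to do: archive the manager's swarm state directory somewhere that outlives the instance. This assumes the state lives in `/var/lib/docker/swarm` (the default location on a standard Docker install) and that the archive is later copied to durable storage. The function name `backup_swarm_state` is hypothetical.

```python
import tarfile
from pathlib import Path


def backup_swarm_state(state_dir: str, archive_path: str) -> str:
    """Write a gzip-compressed tar archive of the swarm state directory.

    Assumption: state_dir is the manager's swarm directory, for example
    /var/lib/docker/swarm, and Docker is stopped while it is copied so
    the files are consistent.
    """
    src = Path(state_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"swarm state directory not found: {src}")
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under the directory's own name ("swarm/...").
        tar.add(str(src), arcname=src.name)
    return archive_path
```

A replacement instance would then need that directory restored before Docker starts, followed by `docker swarm init --force-new-cluster` to bring the manager back as a single-node cluster. Whether that can be wired into the ASG launch path is an open question here.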