AWS EC2 Instance restart recovery

Expected behavior

AWS beta 4
Expect the swarm manager(s) to recover after an instance reboot, or an instance terminate/replace.

Actual behavior

When a stack with a single manager has the manager replaced, or even just rebooted, the manager does not recover. If you ssh into the manager and run docker info, the output shows that the instance is no longer a manager.
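The docker info check described above can be scripted. This is a minimal sketch, assuming the Docker CLI is on the PATH and that `docker info --format '{{json .Swarm}}'` prints the swarm section as JSON with the `LocalNodeState` and `ControlAvailable` fields. The helper names `parse_swarm_role` and `local_swarm_role` are made up for illustration.

```python
import json
import subprocess


def parse_swarm_role(swarm_json: str) -> str:
    """Classify a node from the JSON printed by `docker info --format '{{json .Swarm}}'`."""
    swarm = json.loads(swarm_json)
    # LocalNodeState is "active" when the node is part of a swarm.
    if swarm.get("LocalNodeState") != "active":
        return "not in a swarm"
    # ControlAvailable is true only on manager nodes.
    return "manager" if swarm.get("ControlAvailable") else "worker"


def local_swarm_role() -> str:
    """Ask the local Docker daemon for its swarm state (requires the docker CLI)."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Swarm}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_swarm_role(result.stdout)
```

On the rebooted or replaced instance from this report, calling `local_swarm_role()` would be expected to return something other than "manager".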

Additional Information

Is it possible for a stack to recover from the termination and replacement of a manager? Or of a worker?

Steps to reproduce the behavior

  1. Launch a stack with a single manager
  2. Terminate the manager instance
  3. The ASG replaces the manager instance with a new one
  4. The new manager instance is not a manager.

Thanks for reporting this - we’re currently working on re-designing this part of Docker for AWS so we may not pursue a fix of the current design.

Please keep the feedback coming!

I don’t know if this is possible with a single manager. A single manager setup isn’t recommended for anything besides local development and testing. Because there is only one manager, there is nothing to fail over to, so we can’t expect it to recover correctly.

Does the same thing happen on a 3 manager swarm?

I hear you, it is hard to recover from a complete failure like that. But these things do happen in production. In our own clusters we have had to implement an automatic disaster recovery when every master in a zookeeper ensemble fails (for example). This can happen for fairly innocent reasons, like when an AWS stack is updated, which may cause the ASG to be replaced with a new one, rather than updating the existing one. And this causes all the instances in the ASG to be terminated and replaced.

So whatever the answer is, we will at least need a way to do disaster recovery of a set of swarm managers to recover from the worst cases.

I have not tried replacing a single instance on a 3 manager swarm yet.

It should work on a 3 or 5 manager swarm. If it doesn’t, please let us know.
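The reason a 3 or 5 manager swarm should survive losing one instance, while a single manager cannot, comes down to majority arithmetic: swarm managers keep their state with Raft, which needs a majority of managers to stay available. A small sketch of that arithmetic (the function names are illustrative, not part of any Docker API):

```python
def quorum(managers: int) -> int:
    """Smallest number of managers that forms a majority."""
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many managers can be lost while a majority remains."""
    return (managers - 1) // 2


for n in (1, 3, 5):
    print(f"{n} manager(s): quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

With 1 manager the tolerated failure count is 0, which matches the behavior reported at the top of this thread. With 3 managers it is 1, and with 5 it is 2.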

I’ve run into the same issue. Do I have to delete the stack and recreate it? Is there some way to get the node promoted so that I have a manager again?

@luv2code are you using only 1 manager, or 3 or 5? If you are only using one manager, then you might need to delete the stack and recreate it. There might be a way to recover, but I’m not 100% sure.

Yes, only using 1. My manager asked me to spin the instances down over the weekend.

That’s not something that we’ve tested, but I’ll add it as something to look into!

For production, you should run 3 or 5 manager nodes.

For a 1 manager cluster, I don’t think swarm could recover by itself after the only manager is terminated. When the EC2 instance is terminated, its root volume is also deleted, so all data on the swarm manager node is lost. The swarm manager data is not replicated or backed up to the swarm worker nodes. Unless Docker for AWS backs up the swarm manager state somewhere else, how could it be possible to recover after the single manager is terminated?
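For anyone who wants to experiment with the "back it up somewhere else" idea, here is a rough sketch of archiving a manager's swarm state. It assumes the state lives in /var/lib/docker/swarm (the default location on a Linux manager) and that the Docker daemon is stopped while the copy is made so the files are consistent. The function name `backup_swarm_state` is hypothetical, and nothing in this thread says Docker for AWS does anything like this itself.

```python
import tarfile
from pathlib import Path


def backup_swarm_state(swarm_dir: str, archive_path: str) -> str:
    """Pack a swarm state directory into a gzip-compressed tar archive."""
    src = Path(swarm_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"swarm state directory not found: {src}")
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under "swarm/" so the archive unpacks to one directory.
        tar.add(src, arcname="swarm")
    return archive_path
```

A call might look like `backup_swarm_state("/var/lib/docker/swarm", "/tmp/swarm-backup.tar.gz")`, with the archive then copied off the instance (for example to S3) so it survives termination. Restoring would mean unpacking it on the replacement instance and running `docker swarm init --force-new-cluster`, which I have not tested on Docker for AWS.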