AWS EC2 Instance restart recovery

dlaidlaw · August 9, 2016, 2:35pm

Expected behavior

AWS beta 4
Expect the swarm manager(s) to recover after an instance reboot, or an instance terminate/replace.

Actual behavior

When a stack with a single manager has the manager replaced, or even just rebooted, the manager does not recover. You can ssh into the manager and run docker info, which shows that the instance is no longer a manager.
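The check described above can be scripted. This is a minimal sketch, not part of Docker for AWS: it assumes the docker CLI is on the PATH of the instance and reads the Swarm section of docker info as JSON. The function names are made up for illustration; LocalNodeState and ControlAvailable are the fields docker info reports for swarm status.

```python
import json
import subprocess


def is_swarm_manager(swarm_info: dict) -> bool:
    """Return True if the Swarm section of `docker info` describes an active manager."""
    return (
        swarm_info.get("LocalNodeState") == "active"
        and bool(swarm_info.get("ControlAvailable"))
    )


def read_swarm_info() -> dict:
    """Ask the local Docker daemon for its Swarm status as a dict.

    Assumes the `docker` CLI is installed and the daemon is reachable.
    """
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Swarm}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

On the instance itself, is_swarm_manager(read_swarm_info()) would be expected to return False in the situation described here, since the node no longer reports itself as a manager.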

Additional Information

Is it possible for a stack to recover from the termination and replacement of a manager? Or of a worker?

Steps to reproduce the behavior

Launch a stack with a single manager
Terminate the manager instance
The ASG replaces the manager instance with a new one
The new manager instance is not a manager.

friism · August 12, 2016, 4:21am

Thanks for reporting this - we’re currently working on re-designing this part of Docker for AWS so we may not pursue a fix of the current design.

Please keep the feedback coming!
Michael

kencochrane1 · August 12, 2016, 1:03pm

I don’t know if this is possible with a single manager. A single manager setup isn’t recommended for anything besides local development and testing. Because it is only one manager, we can’t expect it to work correctly for failover reasons.

Does the same thing happen on a 3 manager swarm?

dlaidlaw · August 12, 2016, 7:20pm

I hear you, it is hard to recover from a complete failure like that. But these things do happen in production. In our own clusters we have had to implement an automatic disaster recovery when every master in a zookeeper ensemble fails (for example). This can happen for fairly innocent reasons, like when an AWS stack is updated, which may cause the ASG to be replaced with a new one, rather than updating the existing one. And this causes all the instances in the ASG to be terminated and replaced.

So whatever the answer is, we will at least need a way to do disaster recovery of a set of swarm managers to recover from the worst cases.

I have not tried replacing a single instance on a 3 manager swarm yet.

kencochrane1 · August 12, 2016, 7:30pm

It should work on a 3 or 5 manager swarm, if it doesn’t please let us know.
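For context on why 3 or 5 managers behave differently from 1: swarm managers use Raft consensus, which needs a majority of managers to be available. The sketch below is only the majority arithmetic, with illustrative function names; it says nothing about how Docker for AWS replaces instances.

```python
def quorum(managers: int) -> int:
    """Smallest number of managers that forms a majority."""
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many managers can be lost while a majority remains."""
    return (managers - 1) // 2


for n in (1, 3, 5):
    print(f"{n} manager(s): quorum={quorum(n)}, can lose {tolerated_failures(n)}")
```

With a single manager, no loss can be tolerated, which matches the behavior reported in this thread. A 3 manager swarm can lose one manager and a 5 manager swarm can lose two.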

luv2code · August 15, 2016, 7:05pm

I’ve run into the same issue. Do I have to delete the stack and recreate? Is there some way to get the node promoted so that I have a manager again?

kencochrane1 · August 15, 2016, 8:26pm

@luv2code are you using only 1 manager, or 3 or 5? If you are using only one manager, then you might need to delete the stack and recreate it. There might be a way to recover, but I'm not 100% sure.

luv2code · August 15, 2016, 8:27pm

Yes, only using 1. My manager asked me to spin the instances down over the weekend.

friism · August 16, 2016, 12:50am

That’s not something that we’ve tested, but I’ll add it as something to look into!

junius · September 15, 2017, 8:19pm

For production, you should run 3 or 5 manager nodes.

For a 1 manager cluster, I don’t think swarm could recover by itself after the only manager is terminated. When the EC2 instance is terminated, the root volume of the EC2 instance is also deleted, so all data on the swarm manager node is lost. The swarm manager data is not replicated or backed up to the swarm worker nodes. Unless Docker for AWS backs up the swarm manager state somewhere else, how could it be possible to recover after the single manager is terminated?
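One possible answer is to back up the manager state yourself. The sketch below is an assumption-laden illustration, not something Docker for AWS is known to do: it archives a swarm state directory (on Linux the default location is /var/lib/docker/swarm) so the archive can be copied off the instance. The function name is hypothetical, and Docker's own guidance is to stop the Docker daemon on the manager before copying this directory.

```python
import tarfile
from pathlib import Path


def backup_swarm_state(state_dir: str, archive_path: str) -> str:
    """Archive a swarm state directory into a gzip-compressed tarball.

    state_dir is assumed to be the manager's swarm directory
    (for example /var/lib/docker/swarm), read while Docker is stopped.
    """
    src = Path(state_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"swarm state directory not found: {src}")
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under the directory's own name, not the full path.
        tar.add(str(src), arcname=src.name)
    return archive_path
```

Restoring would mean unpacking the archive into the same location on a replacement instance and then running docker swarm init --force-new-cluster there. Whether that fits into the way the ASG launches new instances is untested here.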
