Is the manager ASG supposed to heal itself?

Expected behavior

When I terminate a manager instance, the ASG spawns a new instance to get back up to the desired number.

I expect that instance should join the swarm as a manager.

Actual behavior

The instance boots, but doesn’t join the swarm at all.

Additional Information

Using Docker for AWS 1.13.0-rc2 (beta12)

Steps to reproduce the behavior

  1. Terminate a manager
  2. Wait for the ASG to spawn a new instance
  3. Watch docker node ls, and see the new instance not appear…

It should automatically join the swarm. A few questions:

  1. How are you terminating the instance? Also, why are you terminating the instance? Did it crash?
  2. Are you terminating the leader manager, or a secondary manager? How many managers do you have?
  3. How long are you waiting before typing docker node ls? It can take a couple of minutes for the manager to join after the EC2 console considers it up.

Hi Ken,

  1. How are you terminating the instance? Also, why are you terminating the instance? Did it crash?

I terminated it through the AWS console manually. Why? Because I wanted to see what happened - it didn’t crash :wink: (I’m working on putting together a talk for a Docker Meetup, so wanted to explore failure modes)

  1. Are you terminating the leader manager, or a secondary manager? How many managers do you have?

3 managers, I killed the leader.

  1. How long are you waiting before typing docker node ls? It can take a couple of minutes for the manager to join after the EC2 console considers it up.

I did a watch docker node ls --filter role=manager for well over an hour.

One thing I did do (which maybe I shouldn’t have) is, after a few minutes, run docker node rm on the dead instance (after demoting it).

Thanks!
-Dave

I terminated it through the AWS console manually. Why? Because I wanted to see what happened - it didn’t crash :wink: (I’m working on putting together a talk for a Docker Meetup, so wanted to explore failure modes)

Ok, that is good that it didn’t crash. When you kill it via the console, no signal is sent to the manager, so it isn’t able to do any cleanup before it shuts down, which can leave things in a bad state. We are still working on ways to minimize this risk, but we aren’t 100% there yet.

3 managers, I killed the leader.

Ok, that is what I thought. When you killed the leader, that left Docker for AWS in a bad state. We store the leader’s IP in DynamoDB, and new nodes (managers and workers) use that IP for the swarm join command. Because you killed the server the way you did, it wasn’t able to do any cleanup before it went away (which would have updated the entry with another manager’s IP), so the IP in DynamoDB was out of sync. When a new node tried to connect to that IP, nothing happened because the server was no longer there.
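If you want to see what the stack currently has recorded as the leader, you can read that entry directly. A rough sketch (the table name and key/attribute names below are placeholders; the real ones are listed under your CloudFormation stack’s resources):

```
# Look up the leader entry the stack stored in DynamoDB.
# <stack-swarm-table> and the key name are placeholders for this example.
aws dynamodb get-item \
  --table-name <stack-swarm-table> \
  --key '{"node_type": {"S": "primary_manager"}}'
```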

I’m working on a way to make it recover better when the leader suddenly goes away, but I haven’t finished it yet; hopefully we can get it into one of the next couple of betas. If you do the same with a manager that isn’t the leader, it should recover nicely. So until then I would say, “Don’t manually kill the leader node” :slight_smile:

I hope that helps; sorry for any trouble it might have caused.

Ok, cool :slight_smile:

So in theory if I cause a secondary manager to die (through some mechanism other than a manual terminate), it should recover?

I’ll test that scenario and see how it behaves.

FWIW, the remaining managers did re-elect a new leader and the swarm remained operational. I probably could’ve manually joined the new manager.

So in theory if I cause a secondary manager to die (through some mechanism other than a manual terminate), it should recover?

Yes, that should work today with no issues. If it doesn’t, please let us know.

FWIW, the remaining managers did re-elect a new leader and the swarm remained operational. I probably could’ve manually joined the new manager.

Yeah, that part is solid; the missing piece is updating the DynamoDB table with the new leader’s IP. We should be able to fix that, we just need to get the code written and tested.

To fix it manually, you could have updated the leader IP in the DynamoDB table, and it would have healed itself after that. It might have taken a little while for the first node to time out, but after that it should have worked.
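As a sketch of that manual fix (again, the table, key, and attribute names here are placeholders rather than the stack’s actual schema):

```
# Point the stored leader IP at one of the surviving managers.
aws dynamodb update-item \
  --table-name <stack-swarm-table> \
  --key '{"node_type": {"S": "primary_manager"}}' \
  --update-expression "SET #ip = :ip" \
  --expression-attribute-names '{"#ip": "ip"}' \
  --expression-attribute-values '{":ip": {"S": "<surviving-manager-ip>"}}'
```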

Yes, that should work today with no issues. If it doesn’t, please let us know.

Yup - confirmed. I deleted a secondary manager’s main disk partition, then rebooted it, and it died a horrible death. The ASG fired up a new instance and that one joined as a manager. The old node hung around in the docker node ls output, though, which I guess means it’s still in the raft peer set.

After this I killed the leader in the same way, and the swarm stopped responding - I’m guessing this is because there were now 4 nodes listed as managers, with 2 unresponsive, so consensus would’ve been impossible. Oops. :wink:
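(For the math: swarm managers need a Raft majority of floor(N/2)+1 to keep working, so with 4 nodes listed as managers and only 2 responsive, the required quorum of 3 couldn’t be reached.)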

If I had been patient for a few more minutes, would the first dead node have gotten cleared out of the swarm? Or is the expectation that some form of alerting should occur so someone can SSH in and clean it up manually?

Sorry, missed the reply.

If a node crashes, we can’t assume it won’t come back into the swarm, since the swarm managers don’t know whether it just lost its connection and will be right back, or whether it crashed hard. In these cases we can’t clean up the node automatically, since we don’t want to break things, so you would need to clean those up manually.
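For the manual cleanup itself, it’s just the standard swarm commands, run from one of the healthy managers:

```
# <node-id> is the entry stuck in the Down state in `docker node ls`.
docker node demote <node-id>   # only needed if the dead node was a manager
docker node rm <node-id>
```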

It would be nice if there were some sort of alerting to let you know when this happens; maybe something built into CloudWatch could do this?
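A rough sketch of what that could look like with the AWS CLI (the ASG name and SNS topic ARN are placeholders for whatever your stack uses):

```
# Alarm when the manager ASG has fewer than 3 in-service instances for 3 minutes.
# Group metrics collection needs to be enabled on the ASG for this metric to exist.
aws cloudwatch put-metric-alarm \
  --alarm-name swarm-manager-count \
  --namespace AWS/AutoScaling \
  --metric-name GroupInServiceInstances \
  --dimensions Name=AutoScalingGroupName,Value=<manager-asg-name> \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 3 \
  --comparison-operator LessThanThreshold \
  --alarm-actions <sns-topic-arn>
```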

We ran into similar issues running aws-v1.13.1-ga-2 with three managers, where our managers were forcibly restarted in succession, a couple of minutes apart.

@kencochrane1 Do you have an update on when the improved recovery functionality will be available?


I have some code that is currently getting tested. If it looks good, it will be included in the 17.05 CE edge release.

Hey @kencochrane1 - did that code make it into 17.05? or will it be in 17.06?

@hairyhenderson yes, that code should have been included in 17.05. Is there a reason why you ask? Is it not working?

@kencochrane1 The only reason is that I think we ran into the issue on a swarm running 17.04 - AWS terminated managers 3 times in the span of a few hours (due to failed ELB health checks) over a weekend, and we didn’t notice until it was too late.

We ended up with what looked like a raft peer set of 6 managers (3 of which were the old terminated ones), and consensus would no longer have been possible, so the swarm stopped operating entirely. We ended up having to redeploy. But, we updated to 17.05 in the process, so in theory this won’t happen again :wink: