Is the manager ASG supposed to heal itself?

Expected behavior

When I terminate a manager instance, the ASG spawns a new instance to get back up to the desired number.

I expect that instance should join the swarm as a manager.

Actual behavior

The instance boots, but doesn’t join the swarm at all.

Additional Information

Using Docker for AWS 1.13.0-rc2 (beta12)

Steps to reproduce the behavior

  1. Terminate a manager
  2. Wait for the ASG to spawn a new instance
  3. Watch docker node ls, and see the new instance not appear…

It should automatically join the swarm. A few questions:

  1. How are you terminating the instance? Also, why are you terminating the instance? Did it crash?
  2. Are you terminating the leader manager, or a secondary manager? How many managers do you have?
  3. How long are you waiting before typing docker node ls? It can take a couple of minutes for the manager to join after the EC2 console considers it up.

Hi Ken,

  1. How are you terminating the instance? Also, why are you terminating the instance? Did it crash?

I terminated it through the AWS console manually. Why? Because I wanted to see what happened - it didn’t crash :wink: (I’m working on putting together a talk for a Docker Meetup, so wanted to explore failure modes)

  1. Are you terminating the leader manager, or a secondary manager? How many managers do you have?

3 managers, I killed the leader.

  1. How long are you waiting before typing docker node ls? It can take a couple of minutes for the manager to join after the EC2 console considers it up.

I did a watch docker node ls --filter role=manager for well over an hour.

One thing I did do (which maybe I shouldn’t have) is, after a few minutes, run docker node rm on the dead instance (after demoting it).

Thanks!
-Dave

I terminated it through the AWS console manually. Why? Because I wanted to see what happened - it didn’t crash :wink: (I’m working on putting together a talk for a Docker Meetup, so wanted to explore failure modes)

Ok, that is good that it didn’t crash. When you kill it via the console, no signal is sent to the manager, so it isn’t able to do any cleanup before it shuts down, which can leave things in a bad state. We are still working on ways to minimize this risk, but we aren’t 100% there yet.

3 managers, I killed the leader.

Ok, that is what I thought. When you killed the leader, that left Docker for AWS in a bad state. We store the leader’s IP in DynamoDB, and new nodes (managers and workers) use that IP for the swarm join command. Because you killed the server the way you did, it wasn’t able to do any cleanup before it went away (which would have updated the entry with another manager’s IP), so the IP in DynamoDB was out of sync. When a new node tried to connect to that IP, nothing happened because the server was no longer there.
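If you want to see what the stack currently has recorded as the leader, you can read that entry directly. A rough sketch (the table name and key/attribute names below are placeholders; the real ones are listed under your CloudFormation stack’s resources):

```
# Look up the leader entry the stack stored in DynamoDB.
# <stack-swarm-table> and the key name are placeholders for this example.
aws dynamodb get-item \
  --table-name <stack-swarm-table> \
  --key '{"node_type": {"S": "primary_manager"}}'
```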

I’m working on a way to make it recover better when the leader suddenly goes away, but I haven’t finished it yet; hopefully we can get it into one of the next couple of betas. If you do the same with a manager that isn’t the leader, it should recover nicely. So until then I would say, “Don’t manually kill the leader node” :slight_smile:

I hope that helps; sorry for any trouble it might have caused.

Ok, cool :slight_smile:

So in theory if I cause a secondary manager to die (through some mechanism other than a manual terminate), it should recover?

I’ll test that scenario and see how it behaves.

FWIW, the remaining managers did re-elect a new leader and the swarm remained operational. I probably could’ve manually joined the new manager.

So in theory if I cause a secondary manager to die (through some mechanism other than a manual terminate), it should recover?

Yes, that should work today with no issues. If it doesn’t, please let us know.

FWIW, the remaining managers did re-elect a new leader and the swarm remained operational. I probably could’ve manually joined the new manager.

Yeah, that part is solid; the missing piece is updating the DynamoDB table with the new leader’s IP. We should be able to fix that, we just need to get the code written and tested.

To fix it manually, you could have updated the leader IP in the DynamoDB table, and it would have healed itself after that. It might have taken a little while for the first node to time out, but after that it should have worked.
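As a sketch of that manual fix (again, the table, key, and attribute names here are placeholders rather than the stack’s actual schema):

```
# Point the stored leader IP at one of the surviving managers.
aws dynamodb update-item \
  --table-name <stack-swarm-table> \
  --key '{"node_type": {"S": "primary_manager"}}' \
  --update-expression "SET #ip = :ip" \
  --expression-attribute-names '{"#ip": "ip"}' \
  --expression-attribute-values '{":ip": {"S": "<surviving-manager-ip>"}}'
```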

Yes, that should work today with no issues. If it doesn’t, please let us know.

Yup - confirmed. I deleted a secondary manager’s main disk partition, then rebooted it, and it died a horrible death. The ASG fired up a new instance and that one joined as a manager. The old node hung around in the docker node ls output, though, which I guess means it’s still in the raft peer set.

After this I killed the leader in the same way, and the swarm stopped responding - I’m guessing this is because there were now 4 nodes listed as managers, with 2 unresponsive, so consensus would’ve been impossible. Oops. :wink:
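(For the math: swarm managers need a Raft majority of floor(N/2)+1 to keep working, so with 4 nodes listed as managers and only 2 responsive, the required quorum of 3 couldn’t be reached.)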

If I had been patient for a few more minutes, would the first dead node have gotten cleared out of the swarm? Or is the expectation that some form of alerting should occur so someone can SSH in and clean it up manually?

Sorry, missed the reply.

If a node crashes, we can’t assume it won’t come back into the swarm, since the swarm managers don’t know whether it just lost its connection and will be right back, or whether it crashed hard. In these cases we can’t clean up the node automatically, since we don’t want to break things, so you would need to clean those up manually.
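For the manual cleanup itself, it’s just the standard swarm commands, run from one of the healthy managers:

```
# <node-id> is the entry stuck in the Down state in `docker node ls`.
docker node demote <node-id>   # only needed if the dead node was a manager
docker node rm <node-id>
```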

It would be nice if there were some sort of alerting to let you know when this happens; maybe something built into CloudWatch could do this?
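A rough sketch of what that could look like with the AWS CLI (the ASG name and SNS topic ARN are placeholders for whatever your stack uses):

```
# Alarm when the manager ASG has fewer than 3 in-service instances for 3 minutes.
# Group metrics collection needs to be enabled on the ASG for this metric to exist.
aws cloudwatch put-metric-alarm \
  --alarm-name swarm-manager-count \
  --namespace AWS/AutoScaling \
  --metric-name GroupInServiceInstances \
  --dimensions Name=AutoScalingGroupName,Value=<manager-asg-name> \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 3 \
  --comparison-operator LessThanThreshold \
  --alarm-actions <sns-topic-arn>
```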

We ran into similar issues running aws-v1.13.1-ga-2 with three managers, where our managers were forcibly restarted in succession, a couple of minutes apart.

@kencochrane1 Do you have an update on when the improved recovery functionality will be available?


I have some code that is currently getting tested. If it looks good, it will be included in the 17.05 CE edge release.

Hey @kencochrane1 - did that code make it into 17.05? or will it be in 17.06?

@hairyhenderson yes, that code should have been included in 17.05. Is there a reason why you ask? Is it not working?

@kencochrane1 The only reason is that I think we ran into the issue on a swarm running 17.04 - AWS terminated managers 3 times in the span of a few hours (due to failed ELB health checks) over a weekend, and we didn’t notice until it was too late.

We ended up with what looked like a raft peer set of 6 managers (3 of which were the old terminated ones), and consensus would no longer have been possible, so the swarm stopped operating entirely. We ended up having to redeploy. But, we updated to 17.05 in the process, so in theory this won’t happen again :wink: