I terminated it through the AWS console manually. Why? Because I wanted to see what happened, and it didn't crash. (I'm working on putting together a talk for a Docker Meetup, so I wanted to explore failure modes.)
Ok, it's good that it didn't crash. When you kill the instance via the console, no signal is sent to the manager, so it isn't able to do any cleanup before it shuts down, which can leave things in a bad state. We are still working on ways to minimize this risk, but we aren't 100% there yet.
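To illustrate the difference, here is a rough sketch (not our actual code, just the general idea) of a signal-based cleanup hook. A normal stop delivers SIGTERM and the handler runs; terminating the instance from the console never delivers that signal, so the cleanup is skipped entirely:

```python
import signal
import sys

def cleanup(signum, frame):
    # On a graceful shutdown the process receives SIGTERM and can hand off
    # its state before exiting. A hard terminate from the AWS console never
    # delivers this signal, so none of this ever runs.
    print("received SIGTERM, handing off state before exit...")
    sys.exit(0)

signal.signal(signal.SIGTERM, cleanup)
signal.pause()  # stand-in for the long-running manager process
```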
3 managers, I killed the leader.
Ok, that is what I thought. When you killed the leader, that left Docker for AWS in a bad state. We store the leader IP in DynamoDB, and new nodes (managers and workers) use that IP for the swarm join command. Since you killed the server the way you did, it wasn't able to do any cleanup before it went away (which would have updated the entry with another manager's IP), so the IP in DynamoDB was out of sync. When a new node tried to connect to that IP, nothing happened, because the server was no longer there.
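To make that more concrete, the join flow looks roughly like the sketch below (the table and attribute names here are made up for illustration; the real Docker for AWS schema differs):

```python
import subprocess
import boto3

# Hypothetical names for illustration only -- not the real Docker for AWS schema.
TABLE_NAME = "swarm-state"
KEY = {"node_type": "primary-manager"}

def get_leader_ip():
    """Read the leader IP that was stored in DynamoDB when the stack came up."""
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    item = table.get_item(Key=KEY).get("Item", {})
    return item.get("ip")

def join_swarm(join_token):
    """New managers/workers join whatever IP is stored. If the leader was
    hard-killed, this IP still points at the dead instance, so the join
    never succeeds."""
    leader_ip = get_leader_ip()
    subprocess.run(
        ["docker", "swarm", "join", "--token", join_token, f"{leader_ip}:2377"],
        check=True,
    )
```

A clean shutdown would have rewritten that DynamoDB entry with another manager's IP before the old leader went away; that's the step a console terminate skips.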
I'm working on a way to make it recover better when the leader suddenly goes away, but I haven't finished it yet; hopefully we can get it into one of the next couple of betas. If you do the same thing to a manager that isn't the leader, it should recover nicely. So until then I would say: don't manually kill the leader node.
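If you really need to take the leader out on purpose, one approach that should be safer (my suggestion, not an official procedure we've tested) is to demote it first so the remaining managers elect a new leader, and only then terminate the instance like any non-leader node:

```python
import subprocess

def drain_leadership(node_id):
    """Demote the current leader so another manager takes over.

    After the demotion (and a new leader election), terminating this
    instance is no different from killing a non-leader node, which the
    stack recovers from much more gracefully.
    """
    subprocess.run(["docker", "node", "demote", node_id], check=True)
```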
I hope that helps; sorry for any issues it might have caused.