Expected: nodes can still be added to the cluster after the initial manager node is lost.
Actual: nodes that are brought online by the Auto Scaling Groups fail to join.
The issue seems to occur because the first manager node that comes online registers its IP address in DynamoDB, and that entry is never updated, even if the first manager node is lost and replaced by the Auto Scaling Group. When new instances are launched by the Auto Scaling Groups, they pull the first manager’s IP address from DynamoDB and try to join the cluster using it. If that instance is not accessible, or has been replaced and has a new IP address, the new nodes fail to join the cluster.
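The suspected failure mode boils down to a stale-address check: the IP stored in DynamoDB is no longer among the IPs of the managers that are actually running. Below is a minimal shell sketch of that check, assuming the stored IP and the live manager IPs have already been fetched by some other means. `is_manager_ip_stale` is a hypothetical helper and is not part of Docker for AWS; the addresses are made up.

```shell
# Hypothetical helper, not part of Docker for AWS.
# Usage: is_manager_ip_stale STORED_IP [LIVE_IP...]
# Returns 0 (stale) when STORED_IP matches none of the live manager IPs,
# and 1 (still valid) when it matches at least one of them.
is_manager_ip_stale() {
  stored_ip="$1"
  shift
  for live_ip in "$@"; do
    if [ "$live_ip" = "$stored_ip" ]; then
      return 1
    fi
  done
  return 0
}

# Example with made-up addresses: 10.0.0.5 was registered first,
# but the only manager now running is 10.0.0.9.
if is_manager_ip_stale 10.0.0.5 10.0.0.9; then
  echo "stored manager IP is stale"
fi
```

A check like this, run before the join attempt, would at least let a new node report the problem clearly instead of trying to join through an address that no longer exists.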
Steps to reproduce the behavior
- Launch the CloudFormation stack
- Find the IP address of the initial manager by inspecting the DynamoDB table
- Terminate the instance that has that IP address
- Wait until the manager Auto Scaling Group replaces it
- Once the new manager is online, verify that it is not part of the swarm:
docker node ls
- Confirm that it attempted to join using the now-terminated instance’s IP:
docker logs $(docker ps --all --filter ancestor=docker4x/init-aws:aws-v1.12.0-beta4 --quiet)
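The table lookup and instance termination steps above can also be done from the AWS CLI. This is only a sketch and rests on several assumptions: the CLI is configured for the stack's region, the address stored in DynamoDB is the instance's private IP, and `TABLE_NAME` is a placeholder for whatever table the stack created. The function names are hypothetical.

```shell
# Placeholder: replace with the DynamoDB table created by the stack.
TABLE_NAME="${TABLE_NAME:-my-swarm-table}"

# Dump the whole table so the initial manager's IP can be read off.
show_manager_table() {
  aws dynamodb scan --table-name "$TABLE_NAME"
}

# Print the ID of the instance that owns the given private IP
# (assumes the stored address is a private IP).
instance_id_for_ip() {
  aws ec2 describe-instances \
    --filters "Name=private-ip-address,Values=$1" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}

# Terminate the instance that owns the given private IP.
terminate_instance_with_ip() {
  instance_id="$(instance_id_for_ip "$1")"
  if [ -z "$instance_id" ]; then
    echo "no instance found for $1" >&2
    return 1
  fi
  aws ec2 terminate-instances --instance-ids "$instance_id"
}
```

For example, after reading the IP from the output of `show_manager_table`, running `terminate_instance_with_ip` with that IP should trigger the manager Auto Scaling Group to launch a replacement.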