Does a swarm survive VM stop/start (re-launches)? Doesn't look like it

I’ve noticed one thing… If swarm VMs are stopped and then started (which results in private and public IP changes), the swarm is not ‘re-established’. Can someone confirm this is the expected behavior?

If this functionality is not yet implemented, are there any plans to add it? (E.g., have startup scripts running on the swarm VMs which interrogate the currently running EC2 instances and execute ‘docker swarm init/join’ on the swarm VMs if quorum is lost.)

No, this is not a current design use case. Do you have more details on why you stop and then start VMs that are part of a swarm?

Once it was human error. Another time it seemed to be the result of an outage in AWS (a number of VMs got stopped and then started up again automatically). In theory, I wouldn’t rule out a VM core-dumping on occasion either…

I have seen some unexpected behavior too. When I restart a Docker VM, whether a worker or a manager, a new VM gets launched in my cluster, and sometimes that new instance does not join the swarm. It already happened once on a Friday: I restarted some VMs and EC2 went into a loop of VM re-creation until Monday. It was a crazy mess.

It’s worse than that: swarm services don’t come back.

Well, I’m getting even weirder behavior now… For some reason, it looks like the swarm just resets itself on a fairly regular basis. Not quite sure whether AWS is to blame, but it’s annoying as hell… Basically, the VM names remain the same, but all IPs and FQDNs change, and when I log in to the manager I see that the swarm has not been initialized. The autoscaling group survives intact, unlike the VMs…

Is it possible that the health check on the autoscaling group is set up too aggressively, so the autoscaler thinks the VMs are non-responsive and forcefully restarts them? But why is the swarm not reinstated in that case? Can it be related to the type of VMs I’m using (t2.large / t2.medium)?

This particular issue is a real show stopper for prod adoption, since we can’t afford to have the swarm dissipate like that without any reason… Has anyone experienced the same issue? It looks like the VMs just get relaunched for some reason…

OK, the mystery is kind of solved… The autoscaling group has an audit record that says the ELB health check (HC) has failed:

   Cause: At 2017-07-17T13:29:20Z an instance was taken out of service in response to a ELB system health check failure.

So it looks like overly aggressive LB rules are what bring the manager node down… The problem, though, is that as soon as the swarm manager VM is re-launched, the swarm is destroyed… Has anyone experienced these issues in the past?

The current HC probes port 44554, and in my case it was the manager node that was brought down… But I’m not sure whether a worker node would have been able to re-join the swarm either…
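To put numbers on "overly aggressive": below is a minimal sketch of relaxing a classic ELB health check with boto3 so that a briefly unresponsive node is not replaced straight away. Only port 44554 comes from the thread; the TCP protocol, the threshold values, and the helper names are assumptions for illustration, so check what your stack actually configures before changing anything.

```python
def seconds_until_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Rough time a node can stay unresponsive before the ELB marks it unhealthy."""
    return interval_s * unhealthy_threshold


def relaxed_health_check(port: int = 44554) -> dict:
    """Build a classic ELB HealthCheck structure with lenient thresholds.

    The values are illustrative assumptions, not Docker for AWS defaults.
    """
    return {
        "Target": f"TCP:{port}",    # protocol is an assumption; only the port is from the thread
        "Interval": 30,             # seconds between probes
        "Timeout": 10,              # must be smaller than Interval
        "UnhealthyThreshold": 10,   # consecutive failures before "unhealthy"
        "HealthyThreshold": 2,      # consecutive successes before "healthy"
    }


def apply_health_check(load_balancer_name: str, health_check: dict) -> dict:
    """Push the health check to a classic ELB (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the helpers above work without it

    elb = boto3.client("elb")
    return elb.configure_health_check(
        LoadBalancerName=load_balancer_name,
        HealthCheck=health_check,
    )
```

With these values a node gets roughly `seconds_until_unhealthy(30, 10)` = 300 seconds of failed probes before it is taken out of service. This only reduces spurious replacements; it does not make the swarm survive a manager re-launch.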

The mystery continues… Now, once in a while, the worker nodes drop (I have 1 manager and 2 worker nodes), which leaves the manager node to take on all the load. I’ve created simple shell scripts to ‘reinstate’ the cluster (master + nodes), which query the existing EC2 infrastructure, hunting for the EC2 instances by stack tag name. Would it help to incorporate them as startup scripts on the manager node? I’m not sure if it’s only me who is having these issues, but so far the swarm drops every 2-3 days for different reasons… Not sure whether I would be able to use it in a prod setup in its current form…
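For anyone wanting to try the same idea, here is a rough Python sketch of what such a startup script could look like. It is not the shell scripts mentioned above. The tag key `aws:cloudformation:stack-name`, the "manager inits only when no other manager is visible" policy, and all function names are assumptions; fetching a join token and actually running `docker swarm init/join` are left out on purpose.

```python
import subprocess


def stack_filters(stack_name):
    """EC2 filters selecting running instances of one CloudFormation stack (assumed tag key)."""
    return [
        {"Name": "tag:aws:cloudformation:stack-name", "Values": [stack_name]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]


def private_ips(reservations):
    """Flatten the 'Reservations' list from describe_instances into sorted private IPs."""
    return sorted(
        inst["PrivateIpAddress"]
        for res in reservations
        for inst in res.get("Instances", [])
        if "PrivateIpAddress" in inst
    )


def find_stack_ips(stack_name):
    """Query EC2 for the stack's running instances (needs boto3; ignores pagination)."""
    import boto3  # third-party; imported lazily

    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances(Filters=stack_filters(stack_name))
    return private_ips(resp["Reservations"])


def local_swarm_state():
    """Return this node's swarm state as reported by `docker info`, e.g. 'active' or 'inactive'."""
    out = subprocess.run(
        ["docker", "info", "--format", "{{.Swarm.LocalNodeState}}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def plan(local_state, role, my_ip, manager_ips):
    """Decide what this node should do at boot (hypothetical policy)."""
    if local_state == "active":
        return ("noop", None)            # already part of a swarm
    others = [ip for ip in manager_ips if ip != my_ip]
    if others:
        return ("join", others[0])       # join via an existing manager
    if role == "manager":
        return ("init", my_ip)           # no other manager visible: start a new swarm
    return ("wait", None)                # worker with no manager to join yet
```

Keep in mind that an "init" here creates a brand-new swarm, so services from the old one still have to be redeployed, which matches what was reported earlier in this thread.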

I have one more issue, take a look @volenin: "Swarm Manager at EC2 instance gone die suddenly and unexpectedly"