Hi, I want to set up disaster recovery for my Swarm cluster hosted on Azure.
Consider the case where the entire region hosting my Swarm cluster goes down and we cannot access any of the nodes. How can we achieve disaster recovery in this scenario?
Is there a way to back up the whole swarm, including the containers running on it, so that I can restore it in a different region and get my containers up and running?
Hi @rahulishu1993, for stateless services you can do the following:
1. Save the parameters used to deploy the ARM template initially, and use the same set of parameters to redeploy in another region. This gives you a clone of your initial setup in terms of the configuration of the nodes in the cluster. Note that if you scale the number of nodes up or down, or update the configuration after deployment, those changes need to be reflected in the parameter set used to redeploy.
2. Deploy your containers as swarm services using a Docker Compose file. If you back up the Compose file and use it to redeploy your containers, you will end up with all the containers you had running before the outage.
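As a rough sketch of keeping the saved parameter set in sync with post-deployment changes, something like the following could work. It assumes the standard ARM parameter file layout (`parameters` mapping each name to `{"value": ...}`). The parameter names `managerCount` and `workerCount` are hypothetical placeholders, so substitute whatever your template actually defines.

```python
import copy
import json


def with_overrides(param_file, overrides):
    """Return a copy of an ARM parameter file dict with some values replaced.

    The original dict is left untouched so the initial parameter set
    can still be kept as a reference.
    """
    updated = copy.deepcopy(param_file)
    params = updated.setdefault("parameters", {})
    for name, value in overrides.items():
        params[name] = {"value": value}
    return updated


# Parameter set saved at initial deployment (names are hypothetical).
saved = {
    "contentVersion": "1.0.0.0",
    "parameters": {
        "managerCount": {"value": 3},
        "workerCount": {"value": 5},
    },
}

# The cluster was later scaled to 8 workers, so record that change
# in the parameter set that will be used for a redeploy.
current = with_overrides(saved, {"workerCount": 8})

print(json.dumps(current, indent=2))
```

Storing the resulting file somewhere outside the primary region (along with the Compose file) means it is still reachable when that region is down.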
For stateful services using cloudstor volumes, you will also need to back up the storage account, specifically the File Storage objects used by cloudstor, to another storage account. When you need to restore, configure cloudstor to use the backup storage account to get your data back.
Thanks @ddebroy, that was the first solution that came to mind for handling a region failure (in case of disaster). But it involves a lot of manual work before and after the failure, as well as downtime.
We are planning to deploy a cross-region cluster to solve the problem. We want to keep half of the nodes in one region and the other half in another region, and connect the two VNets using a VNet-to-VNet connection in Azure. Each region will have a sufficient number of manager and worker nodes, so that whenever one region goes down the other region can handle the load.
We would just need to scale the nodes and services when one region goes down.
Will this work in a production environment? Kindly review this approach, and if you find anything wrong with it, please tell us.
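One thing worth checking with a two-region layout like this is manager quorum. Swarm managers use Raft, which needs a majority of managers (more than half) to be reachable before the cluster can accept changes such as scaling services. The sketch below is only an illustration of that arithmetic; the region names and manager counts are made-up examples, not a recommendation.

```python
def quorum(managers):
    """Smallest number of managers that forms a Raft majority."""
    return managers // 2 + 1


def survives_region_loss(managers_per_region):
    """For each region, report whether quorum is kept if that region is lost."""
    total = sum(managers_per_region.values())
    needed = quorum(total)
    return {
        region: (total - count) >= needed
        for region, count in managers_per_region.items()
    }


# Hypothetical layouts: managers split evenly vs. unevenly across two regions.
even_split = survives_region_loss({"region-a": 3, "region-b": 3})
uneven_split = survives_region_loss({"region-a": 3, "region-b": 2})

print(even_split)
print(uneven_split)
```

With managers split across exactly two regions, at least one of the regions always holds half or more of the managers, so losing that region leaves the survivors without a majority. In that situation, scaling nodes and services would not be possible until quorum is restored, for example by running `docker swarm init --force-new-cluster` on a surviving manager.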