Production - Updating

Anyone updating their production D4AWS environment? How are you doing it? I am currently on 17.12 CE… two versions behind.

I have updated D4AWS twice while in development, following the instructions at https://docs.docker.com/docker-for-aws/upgrade/ - 17.06.1 CE to 17.09 CE and 17.09 CE to 17.12 CE. Both times I lost shared data (data in EFS), relocatable volumes weren't created properly, and container transfer was buggy. Basically, the update process would have lost production data if this had been production. I have absolutely zero confidence in updating the D4AWS CloudFormation templates using the process outlined in their documentation.
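
For what it's worth, here is the kind of pre-upgrade step I would now insist on. A minimal sketch only, run from any instance in the VPC that can mount the filesystem; the filesystem ID, region, mount point, and bucket name are placeholders, and the instance's IAM role would need write access to the bucket:

# Mount the stack's EFS filesystem and copy its contents to S3 before updating
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
docker run --rm -v /mnt/efs:/data:ro amazon/aws-cli s3 sync /data s3://my-d4aws-backups/pre-upgrade/
# Record the relocatable (cloudstor) volumes so they can be checked after the upgrade
docker volume ls --filter driver=cloudstor:aws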

I am proposing to my management that we create a new D4AWS environment running the latest D4AWS version, then manually transfer data from the old D4AWS env to the new one. There will be some downtime due to reloading the swarm resources (configurations, secrets, volumes, etc.) and transferring the production data from the old EFS to the new EFS.
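
In case it helps anyone planning the same thing, this is roughly how I picture the EFS transfer step; a sketch only, run from a temporary EC2 instance that can reach both filesystems (the filesystem IDs, region, and mount points are placeholders):

# Mount the old and new EFS filesystems side by side and copy the data across
sudo mkdir -p /mnt/efs-old /mnt/efs-new
sudo mount -t nfs4 -o nfsvers=4.1 fs-OLD12345.efs.us-west-2.amazonaws.com:/ /mnt/efs-old
sudo mount -t nfs4 -o nfsvers=4.1 fs-NEW67890.efs.us-west-2.amazonaws.com:/ /mnt/efs-new
sudo rsync -a --delete /mnt/efs-old/ /mnt/efs-new/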

Hey jkhongusc,

I haven’t tried a production update myself, but it is super concerning to hear that you have had these upgrade issues, as I had assumed that the upgrade process would be solid.

Have you customized the resources in your Swarm at all? I'm wondering whether some custom changes you made are interfering with the upgrade. I am on the verge of making some changes of my own to the ELB and autoscaling configurations to improve security and stability of the stack, and am now a little worried this could interfere with upgrades.

I wish I could provide more help with your upgrade issues, but I haven't attempted one yet. As a fellow user of Docker for AWS in production, though, I'd be glad to share experiences with it.

I haven’t tried a production update myself, but it is super concerning to hear that you have had these upgrade issues, as I had assumed that the upgrade process would be solid.

I do not recommend updating using Docker's standard update process… unless you can verify that the process works for an identical test/dev environment. Read the release notes too; some releases require shutting down containers. The problems I have encountered are all due to limitations/issues with the D4AWS CloudFormation templates. They just do not work well (WRT updating)… so test before trying it on a production environment.

Have you customized the resources in your Swarm at all?

I use the Docker CloudFormation template without any modifications to the Docker-created resources. I have a second CloudFormation template that adds my application's required resources: RDS, EC2, S3, CloudFront, EFS, EBS, etc. It has input parameters to insert those resources into the Docker-created VPC, subnets, and security groups.
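
As a rough illustration of how the second template plugs in, this is more or less how I launch it; the stack name, template file, and parameter names below are just examples, not the real ones:

# Launch the application stack, passing in the IDs of the Docker-created VPC/subnets/SG
aws cloudformation create-stack \
  --stack-name myapp-resources \
  --template-body file://myapp-resources.yml \
  --parameters \
    ParameterKey=VpcId,ParameterValue=vpc-0abc1234 \
    ParameterKey=SubnetIds,ParameterValue='subnet-0aaa1111\,subnet-0bbb2222' \
    ParameterKey=SwarmNodeSecurityGroup,ParameterValue=sg-0ccc3333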

I am on the verge of making some changes of my own to the ELB and autoscaling configurations to improve security and stability of the stack, and am now a little worried this could interfere with upgrades.

If you made any (manual) changes to the Docker resources, those changes will be lost when you update Docker. If you do make manual changes, document them so they can be re-applied after an update.

Good luck to you. I wish I had been able to convince my management to use ECS instead of D4AWS (I tried)… but now I am already down the D4AWS path.

Thanks for the tips on upgrading. I'll be sure to test thoroughly in our staging environment before trying anything in Prod.

I have a need to modify the load balancer that comes as part of the default CloudFormation template in order to restrict access to our internal company network. I was then going to set up a second 'public' load balancer so that we can achieve network-level isolation between public and private services.

My approach thus far has been to use Ansible to query EC2 for the docker swarm resources and modify them in place, or add to them as necessary. My thought was that this would be portable between upgrade versions, as long as the D4AWS topology did not change significantly between versions. In any event, this would at least serve as documentation to reapply changes after an update or reproduce in another environment.
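
For context, the kind of lookup I mean is essentially the AWS CLI equivalent of this (the stack name is a placeholder; the tag is the one CloudFormation itself adds to the instances it creates):

# Find the swarm nodes that belong to the D4AWS CloudFormation stack
aws ec2 describe-instances \
  --filters "Name=tag:aws:cloudformation:stack-name,Values=my-d4aws-stack" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{Id:InstanceId,PrivateIp:PrivateIpAddress}'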

Your approach of a second cloudformation template is intriguing though. Have you found that to be portable between upgrades? Does it modify existing D4AWS resources at all or is it solely for separate resources that the swarm integrates with? Are you happy with that approach overall?

So far I have been loving D4AWS for its ease of use, and we've integrated some really nice workflow enhancers with docker-flow-monitor and docker-flow-proxy along with Jenkins CI/CD, so it's very powerful. But as I'm digging in, I'm growing concerned about the robustness of the platform. After all, the gains in ease of use won't be very helpful if the platform turns out to be unstable.

My approach thus far has been to use Ansible to query EC2 for the docker swarm resources and modify them in place, or add to them as necessary. My thought was that this would be portable between upgrade versions, as long as the D4AWS topology did not change significantly between versions.

When updating D4AWS, it creates new EC2 instances for the master/worker nodes and entirely new volumes (EBS and EFS). D4AWS will attempt to migrate your containers and volumes over… but as I mentioned, that was buggy. EFS data is lost entirely.

Your approach of a second cloudformation template is intriguing though. Have you found that to be portable between upgrades?

This part worked well because the VPC, subnets, and security groups did not change between D4AWS updates. All my added AWS resources were still running and able to access the new Docker Swarm (and vice versa).

I have a need to modify the load balancer that comes as part of the default CloudFormation template in order to restrict access to our internal company network.

I forgot that I did make two small changes to D4AWS resources. I manually removed the Docker default ELB security group and added my own SGs that only allow CloudFront or my company's subnets to access the ELB. I had to re-apply that change after the upgrade.
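
Re-applying it is a one-liner; a sketch, with the load balancer name and security group ID as placeholders (D4AWS creates a classic ELB, so the classic elb command should apply):

# Swap the Docker default ELB security group for my restricted one after an upgrade
aws elb apply-security-groups-to-load-balancer \
  --load-balancer-name my-d4aws-stack-ELB \
  --security-groups sg-0restricted123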

I do agree about D4AWS ease-of-use. I have run into a few Docker Swarm bugs, which is why I need to update at some point. I am using CodePipeline for CI/CD. CodePipeline is limited in functionality compared to Jenkins, but it was really easy to set up. Every git update is immediately deployed (if tests pass) to our test Docker container. The problem with D4AWS 17.12 CE is that the new containers do not respond for 10 minutes after an update. On the previous two versions of D4AWS there was no downtime at all. My guess is a mesh routing issue; my Docker experience is too limited to debug further.

WRT robustness… so far so good. But my current production system is a low-traffic WordPress site. The plan is to deploy more systems to AWS, but I will be recommending the ECS or EKS route.

I forgot that I did make two small changes to D4AWS resources. I manually removed the Docker default ELB security group and added my own SGs that only allow CloudFront or my company's subnets to access the ELB. I had to re-apply that change after the upgrade.

That’s great to know. It sounds like my Ansible approach will just have to be reapplied after an upgrade, which is perfectly fine by me.

The problem with D4AWS 17.12 CE is that the new containers do not respond for 10 minutes after an update. On the previous two versions of D4AWS there was no downtime at all. My guess is a mesh routing issue; my Docker experience is too limited to debug further.

That’s odd to hear. We are using 17.12 CE as well and haven’t had an issue like that. Upgrading stacks typically takes less than a minute in my experience. Must be due to some specific interaction in your environment, but you probably already suspect that.

Good to hear that it’s robust for you so far. I am only a couple of months into working with D4AWS, and we currently don’t have any bona fide production services in it, but our intention is to use it for some webpages as well as a scalable IoT telemetry data gateway, so we’ll definitely need robustness.

Sounds like the real pain point is upgradeability. I’m thinking we will invest in thorough backups to make the upgrade process less destructive when we do want to move to the next version.

That’s odd to hear. We are using 17.12 CE as well and haven’t had an issue like that. Upgrading stacks typically takes less than a minute in my experience. Must be due to some specific interaction in your environment, but you probably already suspect that.

This is how I do container updates. I have CodePipeline call a lambda function… which runs an ssh command on a master node:

docker service update -d --force --image username/imagename myproductionservice

My Docker Compose file runs two replicas and updates one replica at a time. I can see in CloudWatch that both of my containers are running after “docker service update”, but HTTPS requests don’t get routed to the containers until 10 minutes later. Your comment makes me want to try a different way to update containers. Thanks.
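
One thing I am going to experiment with, in case the mesh is simply routing to containers that are not ready yet, is adding a health check and starting the new task before stopping the old one. A sketch only, assuming the image has curl and serves on port 80:

# Same update, but with a health check and start-first ordering
docker service update -d \
  --health-cmd "curl -fs http://localhost/ || exit 1" \
  --health-interval 10s --health-retries 3 \
  --update-order start-first \
  --force --image username/imagename myproductionservice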

Ah, I see. We have a Jenkins agent running on a swarm manager node that runs something like this:
docker stack deploy --with-registry-auth --prune -c stackname.yml stackname

I don’t have anything with rolling updates yet, so that could have something to do with it. Also, I was under the impression that you had to use docker stack instead of docker service if you wanted the containers to be spread around the cluster, rather than running only on the node you ran the command on. Not sure if that’s actually related to your issue, but hopefully it helps.
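
Either way, Docker will tell you where the replicas actually land, which is a quick way to check (service and stack names taken from the commands above):

# Shows which node each task is running on, however the service was deployed
docker stack ps stackname
docker service ps myproductionservice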

I haven’t read this whole thread (TL;DR), but upgrading production (and even non-prod) environments is a common problem. There are two major parts here, of course. One is recycling your infrastructure components (EC2 instances, and possibly other resources such as your launch config, in your case), and the other is the applications deployed as containers across your cluster. The second of these is what your container management tool is for, and it forms part of the self-healing capability: if a cluster host becomes unhealthy for whatever reason (including if you deliberately terminate it as part of a rolling upgrade), the scheduler will spot that desired state is not maintained and start new containers on healthy hosts.

If a host recycle is a planned event, you can control that process yourself: de-register a node from the cluster, the scheduler starts replacement containers on another node, infrastructure desired state is reconciled (ASG) so a new host appears, the new host auto-registers to the cluster, the de-registered node is terminated, wait until all replacement containers report as healthy… rinse and repeat for all other nodes in the cluster. There may be many reasons for replacing nodes, such as an upgrade to Docker, but the process is the same regardless. BTW, in-place upgrades are an anti-pattern IMO, so stick with immutable infrastructure in the same way you do for containers and you won’t go far wrong.
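
To make that concrete, one cycle of that node-replacement loop could look roughly like this; a sketch only, with the node name and instance ID as placeholders (a manager node would need a docker node demote first):

# Drain the node so the scheduler reschedules its tasks onto healthy hosts
docker node update --availability drain worker-node-1
# Wait until docker service ls shows all services back at their desired replica counts
docker service ls
# Terminate the instance; the ASG reconciles desired state and brings up a fresh node
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
# Once the replacement has auto-registered, remove the old (now Down) node from the swarm
docker node rm worker-node-1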

If the above is too risky for business acceptance, or you have some oddities like your EFS mounts becoming corrupt or disappearing, then you can always go for a more redundant approach such as full-on blue/green and manage the DNS switch-over (weighted draining or a direct cutover, depending on whether you need to worry about in-flight transactions or other conditions like that).
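
For the DNS part, the weighted switch is just two records on the same name with different weights; a sketch, with the hosted zone ID, record name, and ELB DNS names all made up:

# Shift traffic from the old (blue) stack's ELB to the new (green) one
aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE --change-batch '{
  "Changes": [
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "app.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "blue", "Weight": 0,
      "ResourceRecords": [{"Value": "old-stack-elb.us-west-2.elb.amazonaws.com"}]}},
    {"Action": "UPSERT", "ResourceRecordSet": {
      "Name": "app.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "green", "Weight": 100,
      "ResourceRecords": [{"Value": "new-stack-elb.us-west-2.elb.amazonaws.com"}]}}
  ]
}'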

Just to update my progress… I decided that a blue-green deployment was the best scenario for us. Luckily our application (WordPress) didn't have much data, and we were able to tell our editors not to make any changes during the migration. The migrations (dev, test, then prod environments) went very smoothly.

Lessons learned

1. How long it takes to create a completely new environment: about 3 hours, covering multiple CFT stacks (D4AWS, multiple AWS resource stacks, CI/CD, monitoring, Lambda, etc.) and creating/loading the Docker data (secrets, configs, volumes, scripts). I could script more of this to automate creation and reduce the time.
2. Docker Swarm 18.03 CE - ran into new problems, sigh. I noticed that the D4AWS DynamoDB table (primary_manager) value was changing weekly, switching between manager nodes (IP addresses). This should not be happening. I also found that the cloudstor plugin was disabling itself, so we lost our connection to the EFS instances (a quick check for this is sketched below).
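
The cloudstor check I settled on is just a quick look at the plugin state on each node; the plugin name is as it ships with D4AWS:

# Verify the plugin is still enabled; re-enable it if it has switched itself off
docker plugin ls
docker plugin enable cloudstor:aws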

FYI - I’ve recommended to my management to migrate our environment from D4AWS to ECS/Fargate.