Post Mortem For Docker Hub Automated Build Issue Nov 16th - Nov 17th

Summary: Docker Hub’s Automated Build Service issues on Nov 16th - Nov 17th.

During the early morning PST of Nov 16th, the Docker Hub Automated Build service experienced a high rate of failures. The service recovered about 24 hours later, around 9am on Nov 17th. Here, we want to explain what happened, describe the steps we have taken to improve the service, and apologize to our users for the disruption.

What happened

Over the prior weeks, we had been testing and rolling out a new backend system for automated builds. The main goals of the new build system are security (isolation of different users’ build execution) and QoS predictability (a build in the legacy system could complete anywhere from 2 minutes to 2 hours). When we rolled out the new build system to 100% of builds on Nov 9th, it handled the load well. It absorbed the regular load from our users, as well as a storm of rebuild requests as we updated almost all of the official images on Docker Hub. At its peak, the new system scaled beyond the maximum daily load ever put on the legacy system, so we were confident the new backend for automated builds was working and scaling as designed.

During the rollout, we were affected by two issues.

On the morning of Nov 16th, we noticed that some builds had started to fail with authentication errors. Only builds under organization accounts were affected. At around 1pm, we found that authentication for organizations was failing against the V1 registry due to fallbacks from V2. The build system uses ephemeral account access keys, a new authentication mechanism, to access the registry, and we discovered that this access key authentication had a bug against the V1 registry for organization accounts. The fix was ready by 6pm that same day and was rolled out to production. This issue impacted a few users, but it was not widespread.
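
For context, the V2 token handshake and the V1 fallback path look roughly like the sketch below. This is a hedged illustration only, not our internal build-worker code: the endpoint constants follow the public Docker Hub registry and auth services, and the function names, repository, and credential variables are placeholders.

```python
import requests

# Illustrative sketch only. Shows how an ephemeral access key could be
# exchanged for a short-lived V2 registry token, and the V1 index check
# that organization repositories fell back to during the incident.
AUTH_SERVICE = "https://auth.docker.io/token"
INDEX_V1 = "https://index.docker.io/v1"


def v2_push_token(repo, username, ephemeral_key):
    """Exchange an ephemeral access key for a short-lived V2 registry token."""
    resp = requests.get(
        AUTH_SERVICE,
        params={
            "service": "registry.docker.io",
            "scope": "repository:%s:pull,push" % repo,
        },
        auth=(username, ephemeral_key),
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["token"]


def v1_login_ok(username, ephemeral_key):
    """V1 fallback path: a plain basic-auth check against the index.
    This is the path where ephemeral keys failed for organization accounts."""
    resp = requests.get(INDEX_V1 + "/users/", auth=(username, ephemeral_key), timeout=10)
    return resp.status_code == 200
```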

Around the same time, we started noticing a significant number of push failures to the V2 registry once builds completed. These were due to networking timeouts between our build nodes and the CDN. At 7pm, we confirmed that this was not a global issue within the Docker Hub registry, but was isolated to the cloud environment the build system ran in. (The networking and routing problem is also the suspected cause of the V1 registry fallbacks described above.) We worked with our CDN provider to debug the issue and ultimately switched back to the legacy build system, hosted on a different infrastructure provider, to resume service by 8am the following day.
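
To give a sense of how an environment-specific network problem can be separated from a global registry outage, here is a minimal, hypothetical probe, not the actual tooling we used: it times TCP and TLS handshakes to the registry endpoints, so running it from a build node and from an unrelated host makes an isolated timeout pattern visible.

```python
import socket
import ssl
import time

# Hedged sketch: time TCP + TLS handshakes to the registry endpoints.
# Comparing results from a build node and from a host in another network
# helps show whether timeouts are isolated to one environment.
HOSTS = ["registry-1.docker.io", "index.docker.io"]  # assumed targets


def probe(host, port=443, timeout=10.0):
    """Return handshake latency in seconds, or None on timeout/error."""
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(sock, server_hostname=host):
                return time.time() - start
    except (socket.timeout, OSError):
        return None


if __name__ == "__main__":
    for host in HOSTS:
        latency = probe(host)
        print("%s: %s" % (host, "timeout" if latency is None else "%.3fs" % latency))
```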

What we learned

We have since been adding more tests for automated builds to cover a broader range of failure scenarios. We are working closely with our infrastructure provider to resolve the connectivity and routing issues between the automated build nodes and the CDN, and we continue to monitor the situation while we work to resolve the root cause. In addition, we are putting more redundancy in place to minimize the impact on our users in case of a future infrastructure outage.

Again, we apologize for the service interruption and will continue to work hard to provide a world-class service for automated builds and all of the Docker cloud services. We sincerely appreciate your continued feedback, which will ultimately result in a better service for all.

Docker Hub Team