Docker Community Forums

Post Mortem for Automated Build Service Disruption on Aug 30th


(Ryan Kennedy) #1

Summary: Post Mortem for Automated Build Service Disruption on Aug 30th

Exposure period: 30 Aug 14:00 - 23:40 PDT

During the early afternoon (PDT) of Aug 30th, the Automated Build service used by Docker Hub and Docker Cloud received higher-than-normal traffic, which severely degraded its performance. The service recovered around 11:40 pm and is now fully operational. We would like to explain what happened, describe the steps that we’ve taken to improve our service, and apologize to users for the disruption.

The Automated Build Service used by Docker Hub and Docker Cloud is a scalable, distributed build-scheduler service designed for security and reliability.

Around 2 pm on Aug 30th, the automated build system received an unusually large storm of rebuild requests when several official images were updated on Docker Hub, which caused severe performance degradation of the build service. The build service was scaled up to reduce API response times; however, the new deployments did not address the issue. Even after scaling, the volume of builds in the backlog was an order of magnitude larger than the number of distinct image:tag combinations. That discrepancy led to the discovery of a bug in the build trigger logic that was creating a significant number of copies of the same build request.
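For readers who want to see the signal described above in concrete terms, here is a minimal Python sketch. It is not the build service's actual code: the `BuildRequest` type, its fields, and both helper functions are hypothetical, and it assumes a backlog can be read as a simple list. It compares the backlog size with the number of distinct image:tag combinations and lists the most-repeated targets.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class BuildRequest(NamedTuple):
    # Hypothetical shape of a queued build request; the field names are assumptions.
    request_id: int
    image: str
    tag: str


def duplication_ratio(backlog: List[BuildRequest]) -> float:
    """Return backlog size divided by the number of distinct image:tag pairs."""
    if not backlog:
        return 1.0  # An empty backlog has no duplication.
    distinct = {(r.image, r.tag) for r in backlog}
    return len(backlog) / len(distinct)


def most_duplicated(backlog: List[BuildRequest], top: int = 3) -> List[Tuple[str, int]]:
    """Return the image:tag targets that appear most often in the backlog."""
    counts = Counter(f"{r.image}:{r.tag}" for r in backlog)
    return counts.most_common(top)
```

A ratio near 1 means each image:tag is queued about once. A ratio around 10 or more would match the "order of magnitude" gap described in this post and suggests that requests are being duplicated upstream of the queue.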

The team began purging the extraneous build requests. Soon after that process was complete, service was back to normal, with a manageable backlog of builds running on a fresh deployment. As of 10:40 pm, builds for all users who were not parallelism-constrained were going through as normal.
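As a rough illustration of the kind of purge described above, the following Python sketch keeps the first queued request for each image:tag and sets the rest aside. It is an assumption-laden example, not the procedure the team ran: `BuildRequest` and `purge_duplicates` are hypothetical names, and it treats image:tag alone as the identity of a build, which a real system might refine (for example, by source commit).

```python
from typing import Iterable, List, NamedTuple, Tuple


class BuildRequest(NamedTuple):
    # Hypothetical shape of a queued build request; the field names are assumptions.
    request_id: int
    image: str
    tag: str


def purge_duplicates(
    backlog: Iterable[BuildRequest],
) -> Tuple[List[BuildRequest], List[BuildRequest]]:
    """Keep the first request per image:tag; return (kept, purged) in queue order."""
    seen = set()
    kept: List[BuildRequest] = []
    purged: List[BuildRequest] = []
    for request in backlog:
        key = (request.image, request.tag)
        if key in seen:
            purged.append(request)  # Extraneous copy of an already-queued build.
        else:
            seen.add(key)
            kept.append(request)
    return kept, purged
```

Returning the purged requests instead of discarding them silently makes it possible to log or audit what was removed before the queue is rewritten.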

The team is actively working on a fix to ensure the issue causing extraneous build requests does not happen again. We are also introducing a series of improvements to the build API in order to improve overall system performance and reliability.

Again, we apologize for the service interruption. We will continue to work hard to provide a world-class autobuild service, along with all of the Docker Cloud services. We sincerely appreciate your continued feedback, which will ultimately result in a better service for all.

-The Docker Team


(Ryan Kennedy) #2