Postmortem for Registry Service Disruption on Oct 3rd

pkennedyr · October 7, 2016, 4:52pm

Summary: Postmortem for Docker Hub Registry Service Disruption on Oct 3rd, 2016

Outage period: 2016/10/03 14:12 to 15:12 PDT (60 minutes)

During the early afternoon PDT of Oct 3rd, the hosted registry service backing Docker Hub and Docker Cloud experienced higher than normal latency after an infrastructure migration. This caused a service disruption for approximately 60 minutes.

The registry service recovered at approximately 15:12 (3:12 PM) PDT and was fully operational shortly thereafter. In the interests of transparency with the community, we’d like to explain what happened, share the steps we’ve taken to improve our service, and apologize to users for the inconvenience caused by the service disruption.

The hosted Docker Registry that provides image repository services for Docker Cloud, Docker Hub, and Docker Store is a Docker image storage and distribution service designed for security, reliability, and high scalability, with over 6 billion pulls to date.

Around 14:12 (2:12 PM) PDT on Oct 3rd, the team began a planned infrastructure migration for the registry service. During this migration, the registry started experiencing higher than normal request latency on new instances, and operations such as docker login, push, and pull stopped functioning as expected. The team investigated, but was ultimately unable to determine the root cause due to constraints in monitoring granularity, and moved to restore service rather than continue investigating. While restoring service, we were forced to make the registry unavailable while we provisioned new registry nodes, which led to additional downtime. The registry service was scaled up, and by 15:12 (3:12 PM) PDT it had been restored and all users were able to push and pull images as usual.

To improve overall system performance and reliability, the team is actively taking steps to ensure that this does not happen again. Unfortunately, we still do not know the root cause of this outage. However, while investigating, we discovered gaps in our infrastructure processes, including monitoring and reporting, which we are now working to fix. Once we have better visibility into our infrastructure components, we’ll be better able to detect, report on, and resolve failures before they affect our users.

Again, we apologize for the service interruption, and are working hard to provide a world-class experience for all of the Docker hosted services. We sincerely appreciate your continued feedback, which will ultimately result in a better service for all.

-The Docker Team