Docker Community Forums

Share and learn in the Docker community.


(Ryan Marr) #71

Mine has not recovered yet.

(Czechjiri) #72

same here, some functionality is back, but its a bit flaky

(Borja Burgos) #73

Apologies to everyone who was affected by this DDoS attack on our DNS service. While our application was architected such that Cloudflare’s Virtual DNS service would have protected our customers’ applications from any disruption in service in such an event, this clearly was not the case.

Again, we apologize for the issues. Please be reassured that the team has already begun working on additional measures to prevent this from happening again. One such measure will be to migrate public/external endpoints for containers and services to an external DNS provider for added redundancy.

As some of you may know, Service Discovery is the only “customer runtime” service that remains centralized in Docker Cloud. Moving forward, this will no longer be the case. We’ve been working over the past few months to decentralize it, and as Docker Cloud support for Docker 1.12 is introduced, Service Discovery will no longer be centralized. This is important, as your applications will no longer be affected in the event that Docker Cloud itself were to go down or fail, as you experienced today.

Thanks again for your ongoing support.

(Fehguy) #74

Thanks @borja. This was an eye-opener for us. Do you have any sense of a timeframe for removing the runtime dependency on Docker Cloud? I had always been under the impression that the runtime did not depend on the cloud service. That was clearly wrong, and it raised a lot of questions.

(Borja Burgos) #75

That has always been the goal: that our customers’ applications can continue to run without any dependencies on Docker Cloud.

The changes necessary for Docker Cloud to make use of Docker 1.12+ and its new features around clustering, networking, service discovery, etc. aren’t trivial. But at this time, I can tell you it is a top priority for the team, and hope to have something to share with our customers soon. Unfortunately, given that Docker 1.12 still has not been released, and the additional time required to bring its features to Docker Cloud, it’ll be a few months before we’re able to have something ready for you to try. Hope this helps.

Please do note that we’ll also be adding redundancy and preventive mechanisms in the immediate future to avoid issues such as the ones we experienced today. Thanks!

(Sebastianvilla) #76

I’d like to understand what happened today as well and what steps we can take as users to protect ourselves from this in the future.

Our apps on Docker Cloud failed terribly today, but the issue didn’t seem to be a DNS issue. Linked containers (i.e. database containers) in the same stack were unreachable internally. In one stack with two services, for example, database and web, both services were running properly: the web container would respond to HTTP requests just fine, and the database container was accessible with all data available. However, the web container could not reach the database container regardless of what we tried (we also tried redeploying the stack and individual services numerous times).

The same happened for other containers in different stacks. Docker Cloud’s UI showed the linked-to and linked-from containers as running, with no indication of any errors. The Docker CLI on the host showed all containers running as well; however, checking for linked containers came back empty (an empty array):

docker inspect -f "{{ .HostConfig.Links }}" <container>
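(For reference, the same check can be run across every container on the host at once — a minimal sketch, assuming the standard Docker CLI; container names are whatever `docker ps` reports:)

```shell
# List the links of every running container on this host.
# An empty result for a container that should have links
# reproduces the symptom described above.
for c in $(docker ps --format '{{.Names}}'); do
  echo "$c -> $(docker inspect -f '{{ .HostConfig.Links }}' "$c")"
done
```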

I would have imagined that since containers run on our own infrastructure, our services would remain operational even in the event of a Docker Cloud failure (as long as we weren’t using Docker Cloud’s DNS, since we just point to the IPs of the host machines), but today’s issue indicates this isn’t the case. Upon submitting a support ticket, we were advised this was due to the DNS issue.

Docker Cloud adds a lot of value over just using docker-compose on our own infrastructure, for example, but it must be reliable and resilient. We come from years of using OpenShift for many apps and never came across an issue like this (and with OpenShift Online we do rely on their DNS and CNAMEs as the only choice).

Again, I’d like to better understand (as I’m sure everyone here does) what went on today and why it affected service discovery. Events like this are fine in development, but a real nightmare in production, with clients reaching out to us concerned that their apps are down, which leaves us wondering whether Docker Cloud is ready for production.

(Aaronjudd) #77

If this was resolved at some point, it appears to be back with a vengeance. Our sites just went down as well.

(Andrelackmann) #78

We are still seriously affected at this stage, with a complete outage for a large part of our business day. The status page seems to suggest there is still an issue. I’d have expected Docker to be more communicative here (or elsewhere) about progress on the problem.

(Dialonce) #79

Same here. Our production is down and Docker Cloud status page is ‘Green’.

No answer from support. We have very high latency on all our services, and they constantly lose connections to each other; load cannot be balanced because of the high latency either.

An answer or a status update would be greatly appreciated, since we have had no news for more than 10 hours and it is getting worse. If it will remain down for an indefinite duration, we need to know now so we can move production elsewhere and get it working. I hope you understand.

(Borja Burgos) #80

@sebastianvilla as per my post yesterday, service discovery (the internal hostname resolution used by your services) is centralized in Docker Cloud today; that is, your “web” container resolving “db” to get an internal private IP on the overlay. That name resolution was affected by the DDoS attack yesterday. While the applications continued to run, if you relied on links/name resolution, then they were unable to find each other (resolve an IP). There are ongoing efforts (top priority) to delegate this to each of your clusters to eliminate any and all dependencies on a centralized service running in Docker Cloud.
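As a quick sanity check, that lookup can be exercised directly from the host. A minimal sketch, using the example names “web” and “db” from above (and assuming the image ships getent):

```shell
# Resolve the linked name "db" from inside the "web" container.
# If service discovery is healthy this prints an internal overlay IP;
# a non-zero exit means the name did not resolve.
docker exec web getent hosts db
```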

@dialonce The high network latency issues you are experiencing are not related to the DNS attack that we suffered yesterday. DNS availability has been back at 100% since the second attack concluded last night (ET). The overlay network itself runs in your infrastructure and is independent of whether Docker Cloud is available or not.

@andrelackmann how are you affected?

We continue to work together with Cloudflare to better understand what happened and why the Virtual DNS didn’t function as expected/planned and will provide more in-depth information as soon as possible. Thanks for your ongoing support.

(Dialonce) #81

@borja so we had no issues for several months, all of our timeout issues started last night, and you’re suggesting it is a coincidence?

I checked the servers after your message; it seems that our haproxy containers are all at 100% CPU, and this is what is causing the lag, because the host servers have a struggling weave process.

We don’t have more traffic than yesterday. Do you really think the two aren’t related?

I still have many issues on Docker Cloud (can’t update stacks, can’t pull images).

Random new error:
ERROR: lb-eu-central-3: Get dial tcp: lookup no such host
(I can reproduce it at will)

Are you sure your DNS is fine?

(Borja Burgos) #82

@dialonce Yes, those issues are unrelated to the issue being discussed in the thread.

I have asked support to take a look at your weave (overlay network) containers and restart them if necessary, as it’s likely there’s an issue if they are at 100% CPU. The lookup failure is due to your host’s inability to resolve the public hostname of the Docker Registry, and again it is unrelated to the DNS service that Docker Cloud provides for service discovery and public-facing container/service endpoints. Hope this helps.

(Dialonce) #83

@borja I will investigate further and see what is hiding there.

My first guess: yesterday’s downtime took our services offline, and now a lot of users want to push their overnight usage; they keep retrying at our door, and there are more and more of them. Now we have 1k conn./sec on each LB, it can’t scale, and everything has very high timeouts.

I’ll update this post (even if it’s not related, since I posted here) when I get more info.

(BTW, on the host machine directly I can wget just fine.)

Edit: it seems to be related to weave. One or two machines/services can’t handle the load we’ve had since the DoS interruption, and adding more services/machines makes weave climb to 100% CPU all the time. We can’t run our production; still trying to find the issue. Will open another topic for the weave problem.
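For anyone hitting the same thing, a rough sketch of what we’re checking — this assumes the weave container is literally named “weave” on your host (verify with `docker ps` first):

```shell
# Take a single snapshot of per-container CPU usage and look for weave.
docker stats --no-stream | grep -i weave

# If it sits near 100% CPU, restarting the container often clears it.
docker restart weave
```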

(Marcinw) #84

We’ve started seeing an issue where, pointing to the internal Docker Cloud IP of a linked container, we get timeouts. Is this related, or perhaps yet another incident?

(Bencollier) #85

Any word from Cloudflare on why this wasn’t mitigated? They’re usually really good…

We felt a little helpless during the outage, and it seemed that no one at Docker was aware of the issue. What steps are being put in place to improve communications and detection?

Glad to hear steps are being made to decouple the internal services from Docker Cloud.


(Mwaaas) #86

Still getting this issue with links.

(Czechjiri) #87

I’m suddenly having issues pinging hosts (internally between stacks as well as within them)… it’s up and down like a yo-yo… am I the only one?

(Fernando Mayo) #88

There are no known issues about the DNS service at the moment. Could you share some logs? Perhaps it’s an issue with the overlay network.

(Czechjiri) #89

This happened Friday afternoon around 3pm Pacific time, for maybe 15–20 minutes (it was working before and it’s working now). At first I was not able to resolve DNS between two services in one stack; then DNS started working, but ping could not reach the IP. This was on existing AWS containers; I tried recreating the service and even re-pushing the stack with the same code, and got the same result. I don’t use links.

It’s possible (and now that I think about it, very probable) that it was the overlay network. Do you have any canned set of tests users can run to help narrow down issues and share logs? It might be a good starting point.
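In the meantime, something along these lines could be a starting point — a sketch assuming the Docker CLI on the host and example service names web/db (getent and ping must exist in the image; real names will differ):

```shell
#!/bin/sh
# Minimal connectivity check between two services on the same host.
# SRC/DST default to the example names "web" and "db".
SRC=${1:-web}
DST=${2:-db}

echo "== DNS resolution of $DST from inside $SRC =="
docker exec "$SRC" getent hosts "$DST" || echo "FAIL: $DST does not resolve"

echo "== ICMP reachability of $DST from inside $SRC =="
docker exec "$SRC" ping -c 3 -W 2 "$DST" || echo "FAIL: $DST unreachable"
```

Running it with no arguments checks web → db; passing two names checks any other pair.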

(Andrelackmann) #90

@borja we had a total outage across all our nodes. The Docker Cloud GUI showed all containers as stopped, and we were unable to redeploy or start any services. We eventually determined that the Docker agent service had crashed in some way and entered a race condition (?), effectively using 100% of one CPU on each node.

Once we determined this, restarting the node(s) cleared the issue. We use a BYON configuration, so it’s slightly non-standard, but it seems the DNS issue triggered a crash of some sort in the agent as a secondary effect.

If you need more information to debug further, happy to help.