Docker Cloud Overlay Network issues

I’m about to give up on Docker Cloud. Network errors and poor performance between containers mean our customers are seeing errors. We’ve been coding around the problems, building in retries and the like, which is good for our code long-term, but sometimes it’s just not enough.

It’s a shame, because I like the way Docker Cloud is configured. It works the right way. If the network worked, and DNS lookups worked, we’d be sticking around for years.

Has anyone moved off Docker Cloud and has advice on a similar service with a) Docker and b) easy configuration?


@evanp same here. It sucks. We have 27 nodes across our clusters and over 50 services, and we are growing. It’s causing us way too many problems.

We are looking at alternatives right now. Rancher has been pretty good in our initial testing. Not as polished in some UI respects, but a lot more powerful and configurable in others. The networking has been stable so far…

We are facing the same issues, and I looked into every solution I could find. Rancher certainly looks the most promising for now, but it has a few drawbacks that, for me, do not justify the amount of work that would be involved in migrating:

  • Services are not editable (except for a few basic things). I know there is an upgrade-service function, but it isn’t as convenient and fast as, say, adding an environment variable and clicking redeploy in Docker Cloud.
  • Rancher’s integrated load balancer is garbage compared to dockercloud-haproxy (roughly the pattern sketched after this list). It doesn’t update automatically, it is nowhere near as feature-rich, and I did not find a way to add a new service without complete downtime for every service behind it, which is not acceptable to me.
  • But by far the most critical drawback is that it is self-hosted. It is another point of failure that we would have to look after. If the Rancher server fails, you are lost without SSH-ing in and server-side tinkering, and setting up an HA Rancher server is not easy, as you would also need to cluster a database behind it. This is also not something we want to deal with: we are software developers and want to care as little as possible about the operations side.
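For comparison, this is roughly what the dockercloud-haproxy pattern looks like in a Docker Cloud stack file. It’s only a sketch: the web service, its image, and the port are placeholders, and your stack will differ.

```yaml
# Sketch of a Docker Cloud stack file fronting an app with dockercloud-haproxy.
# "web" and "myorg/web:latest" are hypothetical; only dockercloud/haproxy is real.
web:
  image: myorg/web:latest
  target_num_containers: 2

lb:
  image: dockercloud/haproxy
  links:
    - web                 # haproxy watches linked services and reconfigures itself
  ports:
    - "80:80"
  roles:
    - global              # API role the proxy needs to track changes to linked services
```

Because the proxy reconfigures itself as linked services scale or redeploy, adding a backend doesn’t require taking the other services down, which is exactly the behaviour I couldn’t get out of the Rancher balancer.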

I also looked into Docker 1.12 swarm mode, but it isn’t there yet. The API isn’t complete yet ( https://github.com/docker/docker/issues/25303 ), but I hope 1.13 makes it more production-ready, and that it doesn’t take Docker Cloud too long to make the jump (as mentioned here: Passing DNS search setting to containers).

Until then, if you are not dependent on the service DNS round-robin and you do not recreate your services often, try switching to the container IPs (10.7.0.0/16) directly, or add them manually to the containers’ hosts files using the extra_hosts parameter in your stack file. At least in my tests, the root of all evil seems to be the DNS, not the internal network itself.
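As a rough sketch of that extra_hosts workaround (the service names and addresses below are placeholders; substitute the IPs your containers actually received on the overlay):

```yaml
# Pin dependencies' overlay IPs in /etc/hosts instead of relying on internal DNS.
# "api", "redis", "postgres" and the 10.7.0.x addresses are examples only.
api:
  image: myorg/api:latest
  extra_hosts:
    - "redis:10.7.0.12"
    - "postgres:10.7.0.23"
```

The obvious caveat is that these entries go stale whenever a container is recreated and gets a new IP, which is why this only makes sense if you rarely redeploy.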

Same problem here. Poor performance and instability of the overlay network are forcing us to move to another orchestration solution.
Rancher is the best one so far; its overlay network is quite stable and very fast. We’ve encountered minor problems with interface binding; the workaround is to sleep for several seconds before binding the interface. The same applies to ethwe on Docker Cloud, however.
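For example, a crude stack-file version of that sleep workaround just delays the start command until the overlay interface is up. Everything here except the idea of sleeping first is a placeholder (service name, image, delay, and the flags of the hypothetical worker binary):

```yaml
# Give the overlay interface (ethwe) a few seconds to appear before the
# process binds to it. Service name, image, delay and flags are examples.
worker:
  image: myorg/worker:latest
  command: sh -c "sleep 5 && exec ./worker --bind 0.0.0.0:8080"
```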

Rancher itself is quite an important single point of failure. We’ve tested the high-availability cluster, but it seemed pretty unstable and introduced more availability issues than the single-node deployment.

The Rancher team works very well with the community and responds to issues on GitHub. For us, it is clearly the way to go. Docker Cloud has been very foggy about when we can expect changes in networking and what those changes will be.

If you are looking for fully managed Docker hosting, try https://sloppy.io. It’s for users who don’t want to care about VMs, networking, storage, etc., but just want to deploy their containers.

Hello everyone, Borja from Docker here. First of all, I want to apologize for any issues you may be experiencing, and I’d like to invite everyone having these problems to reach out to our customer support for help troubleshooting them.

As you may all know, an overlay network isn’t a trivial piece of technology. When it comes to networking, it has always been our goal to give our users the simplest, most performant out-of-the-box experience that works for all your networking needs.

The reality is that since we introduced Docker 1.11 to nodes, we’ve seen an increase in DNS lookup latency. That said, raw networking performance has not been affected. Because Weave runs in user space, overlay network performance depends on CPU resources, so it may be slow at times, or possibly break if there aren’t enough CPU resources available on the host. Today, Docker Cloud runs Weave in user-space mode on any insecure private network, which is to say anywhere but AWS.

With the introduction of Docker 1.12, the networking stack in Docker is now able to satisfy Docker Cloud’s requirements. We have begun work to switch to Docker 1.12’s embedded DNS and swarm overlay network driver, and to rely on their performance and robustness for our users’ applications and networking.
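For anyone who wants to try the native stack ahead of that switch, a compose-format file along these lines exercises the built-in overlay driver and embedded DNS. This is only a sketch under assumptions: the service and network names and the images are placeholders, and depending on your engine setup the overlay driver needs either swarm mode or an external key-value store behind it.

```yaml
# Compose v2-format sketch: two services on a native overlay network,
# resolving each other by service name through Docker's embedded DNS.
# "web", "db", "appnet" and the images are placeholders.
version: "2"
services:
  web:
    image: myorg/web:latest
    networks:
      - appnet            # can reach the database simply as "db"
  db:
    image: postgres:9.5
    networks:
      - appnet
networks:
  appnet:
    driver: overlay       # needs swarm mode or a KV-store-backed engine
```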

We plan to provide more insight about this in a future post. Please continue to raise issues as you face them, and work with our support team to troubleshoot them. Your concerns are not going unheard and the team is working hard to provide you with the simplest and most performant way to run Docker in the cloud. Thank you!
