Docker Community Forums


How can we automatically recover from "unreachable" nodes?

(Tai Lee) #1

This page says nodes can go unreachable if containers running on that node use too much memory. This seems like a major loophole and a serious issue for anyone wanting to use Docker Cloud for production deployments. If a rogue container has a memory leak and is allowed to use up all resources on the host node causing it to become unreachable, then ALL containers/services hosted on that node will go down, and Docker Cloud is unable to re-provision a new node and re-deploy services automatically.

Am I missing something? Are we required to divide up the amount of memory on the host node by the number of containers we run and manually specify resource limits for all containers in order to reserve enough memory for the Docker Cloud Agent to continue running even if a container goes out of control?
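For illustration, the arithmetic I'm describing would look something like the sketch below (the 256 MB agent reservation is my own guess, not a documented figure):

```python
def per_container_limit(node_mem_mb, n_containers, agent_reserve_mb=256):
    """Split a node's memory evenly across containers, holding back a
    reservation for the Docker Cloud Agent (the 256 MB default is an
    assumption, not a documented value)."""
    usable = node_mem_mb - agent_reserve_mb
    if usable <= 0:
        raise ValueError("node too small for the requested reservation")
    return usable // n_containers

# e.g. a 4 GB node running 6 containers:
print(per_container_limit(4096, 6))  # -> 640
```

The resulting limits would then have to be applied per container, e.g. via `mem_limit` in each stack file, and recalculated every time a container is added or removed.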

This would make it more difficult to add new containers to existing nodes (as the resource limits for all containers would need to be adjusted) or redeploy to a new node cluster with more memory available to each node.

(Michael Clifford) #2

I think it is more about proper monitoring and instrumentation on your part.

At any time your underlying architecture can and will fail. Docker Cloud doesn’t (yet) have the ability to scale on its own, but most (if not all) of the underlying providers expose the necessary data so you can implement failover and scaling policies yourself (and Docker Cloud exposes the web hooks you would need).

I think it would be difficult for Docker Cloud to prevent over-provisioning, especially if an individual container went rogue and ran away with system resources. There are a number of reasons why Docker Cloud would not be able to reach your nodes, and in some cases it may not mean the node is unavailable. Systems like New Relic, Datadog, and others should be involved so you are aware of it as it is happening, or even before it happens.

I do think there are better ways Docker Cloud could assist with this, but for now it is possible to protect yourself without native support.

(Tai Lee) #3

I don’t think AWS provides data on Docker Cloud Agent running on Docker Cloud provisioned nodes. If all Docker Cloud users on AWS have to implement their own monitoring of the Docker Cloud Agent, that is a major blow to the otherwise stellar end-to-end simplicity of deploying to Docker Cloud and not having to worry or care about the infrastructure.

The specific problem I have experienced is Docker Cloud nodes becoming unreachable for apparently no reason. The only explanation I have been able to find is that I must have accidentally provisioned one too many container services to the node, which apparently takes down Docker Cloud Agent and therefore makes all containers/services on that node invisible to Docker Cloud.

Couldn’t Docker Cloud Agent configure and start Docker Engine with a global resource limit that reserves a small allocation of CPU/memory (just enough for Docker Cloud Agent itself to continue running)?

Surely it’s of paramount importance that Docker Cloud is able to contact its agents in order to ascertain the status of containers running on that node and be able to implement the autoredeploy policy for those containers when they misbehave?

At the very minimum, Docker Cloud could be configured to terminate and re-deploy to a new node if an agent becomes unreachable. That’s a fairly inelegant, brute-force approach, but it seems that is what I have to do manually for now.

(Tai Lee) #4

For anyone else struggling with this, I found a script that runs on AWS lambda that terminates unreachable nodes and redeploys to a new reachable node:
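For reference, the core logic of such a script might look something like the sketch below. The “Unreachable” state name matches what the Docker Cloud UI shows, but the terminate/scale calls stand in for real Docker Cloud API requests, and the endpoint path in the comment is an assumption on my part:

```python
def unreachable_uuids(nodes):
    """Pick out the UUIDs of nodes whose reported state is 'Unreachable'."""
    return [n["uuid"] for n in nodes if n.get("state") == "Unreachable"]

def recover(nodes, terminate, scale_up):
    """Terminate each unreachable node, then add a replacement to the cluster."""
    for uuid in unreachable_uuids(nodes):
        terminate(uuid)  # e.g. a DELETE against the node's API URL (hypothetical)
        scale_up()       # bump the node cluster's target size back up

# Example with stubbed-out API calls:
nodes = [{"uuid": "a", "state": "Deployed"},
         {"uuid": "b", "state": "Unreachable"}]
actions = []
recover(nodes, actions.append, lambda: actions.append("scale"))
print(actions)  # -> ['b', 'scale']
```

A Lambda function would wrap this in real HTTP calls to the Docker Cloud API and run it on a timer.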

There should be a configurable option (for each node cluster, perhaps) that causes Docker Cloud to do this for us.

(Michael Clifford) #5

I think what you are encountering are issues when all memory is being used on an instance and no swap is available (most EC2 instances aren’t configured with swap space). I experience that occasionally on one of my farms but am usually alerted to it. Generally what happens is I shut down the instance via AWS, since the instance won’t respond to anything other than a shutdown. At that point I have to scale my service in/out on Docker Cloud in order to ensure the correct number of containers are running. It ends up being a bit of a hassle. I agree that there is certainly room for improvement.

(Tai Lee) #6

That sounds exactly like the problem. Can Docker Cloud not create EC2 instances with swap enabled, to prevent this from happening? Or do we have to manually create our own EC2 instances with swap enabled, install Docker Cloud Agent on them, and use them as BYON?
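For the BYON route, enabling swap on a Linux instance is just the standard swapfile procedure (the 2 GB size is a placeholder; adjust to taste):

```shell
# Create and activate a swapfile (requires root).
sudo fallocate -l 2G /swapfile      # or: dd if=/dev/zero of=/swapfile bs=1M count=2048
sudo chmod 600 /swapfile            # swap files must not be world-readable
sudo mkswap /swapfile
sudo swapon /swapfile

# Persist across reboots:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```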

The latter would be much less convenient, as we won’t be able to take advantage of Docker Cloud’s quick and easy node cluster scaling, and will have to manage the infrastructure ourselves (which we wanted to avoid by using Docker Cloud).

(Michael Clifford) #7

Creating nodes with swap will certainly help but that creates a new set of problems. High swap usage or large swap volumes can get costly and it still doesn’t solve the scenario where a service uses all memory and swap.

I’ve found the BYON option to be nice for testing something quick but as you’ve pointed out it removes a lot of the reasons why we want Docker Cloud in the first place.

(Tai Lee) #8

I’d much rather have a node, and some or all of the services in containers on that node, run a bit slower (or even a lot slower) and rely on my own performance monitoring to report this and help me troubleshoot it, than have the entire node and all of its services become completely unreachable because a newly provisioned container tipped memory usage over the edge. The latter requires me to manually log in to the AWS console and find and reboot the affected node, or to write (or rely on) custom monitoring scripts to do it for me.

To add some additional context: we are seeing this problem on a “staging” node cluster to which we frequently deploy new staging sites for several clients, while we evaluate Docker Cloud for production use.

With just Docker Cloud and AWS alone, it’s very difficult to even ascertain how much memory is being used on a node. AWS doesn’t report this, only CPU. Docker Cloud doesn’t report this, even though it has an agent running on the node that probably could report it.

New deploys are done by several people who won’t necessarily be aware of the memory requirements of all other stacks and services deployed to a given node, and redeploys happen automatically on post-commit (build/test/push/redeploy).

We also don’t even explicitly know which nodes will be running which containers. They are deployed and redeployed according to the deployment strategy listed in each stack file.

So, I think this is a really important issue for Docker Cloud to solve. It worries me greatly that production nodes could easily lock up when they are running close to their memory limits. I think Docker Cloud could:

  1. Provision nodes with swap enabled – this would allow services to keep running, albeit more slowly.

  2. Configure and run Docker Engine on each node with memory resource limits in place such that Docker Cloud Agent will continue to run, even if a rogue container or all containers on that node collectively are using all available memory – this would allow users to more easily recover from within the Docker Cloud web UI or Docker Cloud API, without having to manually restart the node via AWS console.

  3. Docker Cloud Agent could report in the Docker Cloud web UI how much memory is being used by individual and all containers collectively for a given node – this could allow users to configure warning or scaling events based on resource utilisation, for example to scale up the node cluster and redeploy across the cluster according to the deployment strategy, when a node hits 80% memory utilisation.
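To sketch what the check in point 3 might look like, assuming per-container usage figures are available (in practice they could come from something like `docker stats --no-stream` or the cgroup memory accounting files; the 80% threshold is just the example above):

```python
def node_mem_utilisation(container_usage_mb, node_total_mb):
    """Fraction of a node's memory used by all containers together."""
    return sum(container_usage_mb) / node_total_mb

def should_scale(container_usage_mb, node_total_mb, threshold=0.8):
    """True once collective usage crosses the threshold (80% by default),
    i.e. the point at which a warning or scale-up event could fire."""
    return node_mem_utilisation(container_usage_mb, node_total_mb) >= threshold

# e.g. three containers on a 4096 MB node:
print(should_scale([1200, 1400, 900], 4096))  # 3500/4096 ≈ 0.85 -> True
```

Docker Cloud Agent already sits on the node, so it could report exactly these numbers without any extra tooling on the user's side.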

It seems like Docker Cloud is attempting to abstract away the management of underlying infrastructure. Docker Cloud should therefore provide enough baseline monitoring and health checks on that infrastructure to respond to failures automatically, instead of requiring every user to implement their own custom infrastructure management on the side, which would likely couple them tightly to one infrastructure provider and make it more difficult to, say, switch from AWS to Digital Ocean, or to run services across multiple infrastructure providers.