I’d much rather have a node, and some or all of the containerised services on it, run a bit slower (or even a lot slower) and rely on my own performance monitoring to report this and help me troubleshoot it, than have the entire node and all of its services become completely unreachable because a newly provisioned container tipped memory usage over the edge. The latter requires me to manually log in to the AWS console and find and reboot the affected node, or to write or rely on custom monitoring scripts to do it for me.
To add some context: we are seeing this problem on a “staging” node cluster to which we frequently deploy new staging sites for clients, while we evaluate Docker Cloud for production use.
With Docker Cloud and AWS alone, it’s difficult even to ascertain how much memory is being used on a node. AWS reports CPU by default but not memory. Docker Cloud doesn’t report it either, even though it has an agent running on the node that presumably could.
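As a stopgap, assuming SSH access to the node, memory usage can be checked by hand. The `docker stats` line below is a hypothetical follow-up and is left commented out because it only works when the Docker daemon is reachable:

```shell
# Overall node memory in MB (total / used / free / available)
free -m

# If the Docker daemon is reachable, per-container usage could be
# inspected with something like:
# docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"
```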
New deploys are done by several people, who won’t necessarily be aware of the memory requirements of all the other stacks and services deployed to a given node, and redeploys happen automatically post-commit (build/test/push/redeploy).
We also don’t know in advance which nodes will be running which containers; they are deployed and redeployed according to the deployment strategy listed in each stack file.
So, I think this is a really important issue for Docker Cloud to solve. It worries me greatly that production nodes could easily lock up when they are running close to their memory limits. I think Docker Cloud could:
Provision nodes with swap enabled – this would allow services to keep running, albeit more slowly.
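For illustration, a minimal sketch of what swap-at-provisioning might look like, assuming the node image runs cloud-init (the file path and 2 GB size are illustrative, not anything Docker Cloud does today):

```yaml
#cloud-config
# Hypothetical user-data for node provisioning: create and enable
# a 2 GB swap file so memory pressure degrades performance
# gracefully instead of locking the node up.
swap:
  filename: /swapfile
  size: 2G
  maxsize: 2G
```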
Configure and run Docker Engine on each node with memory resource limits in place, so that the Docker Cloud Agent keeps running even if a rogue container – or all containers on the node collectively – use up all available memory. This would let users recover from within the Docker Cloud web UI or Docker Cloud API, without having to manually restart the node via the AWS console.
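In the meantime, per-service memory caps can at least contain a single rogue container. If I’m reading the stack file reference correctly, a `mem_limit` can be set per service – the service name and limit below are illustrative:

```yaml
# Illustrative stack file snippet: cap the "web" service so one
# runaway container cannot consume the whole node's memory.
web:
  image: example/web:latest
  mem_limit: 512m
  deployment_strategy: high_availability
```

This doesn’t protect against many small containers collectively exhausting the node, which is why engine-level limits that protect the agent would still be needed.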
Docker Cloud Agent could report, in the Docker Cloud web UI, how much memory each container is using and the collective total for a given node – this would allow users to configure warning or scaling events based on resource utilisation: for example, scaling up the node cluster and redeploying across it according to the deployment strategy when a node hits 80% memory utilisation.
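Until something like that exists, here is a rough sketch of the kind of check I mean, reading `/proc/meminfo` on the node (the 80% threshold matches the example figure above; a real version would trigger an alert or scale-up rather than just echoing):

```shell
#!/bin/sh
# Rough watchdog sketch: compute node memory utilisation from
# /proc/meminfo and warn when it crosses a threshold.
THRESHOLD=80
total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
used_pct=$(( (total - avail) * 100 / total ))
echo "node memory utilisation: ${used_pct}%"
if [ "$used_pct" -ge "$THRESHOLD" ]; then
  echo "WARNING: over ${THRESHOLD}% - scale the cluster and redeploy"
fi
```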
It seems like Docker Cloud is attempting to abstract away management of the underlying infrastructure. It should therefore provide enough baseline monitoring and health checks on that infrastructure to respond to failures automatically, instead of requiring every user to implement their own custom infrastructure management on the side – which would likely couple them tightly to one infrastructure provider, making it harder to, say, switch from AWS to DigitalOcean, or to run services across multiple infrastructure providers.