Resource management with Swarm in production

Hi guys,

I’m currently building an infrastructure based on Docker and Swarm on top of AWS, and I wonder how to efficiently manage resources and performance in a cluster.

Docker introduced new orchestration methods, adopted by Swarm, Kubernetes, Rancher… They all share automated container scheduling. It looks very promising, but can anybody provide feedback on large clusters at scale? I mean, a production setup living longer than a 15-minute demonstration?

Weaknesses I see in the modern container lifecycle:

  • In a web stack, resource consumption is not flat: day vs night, weekday vs weekend, December vs January (in e-commerce)… => Scheduling strategies don’t take into account that available CPU and RAM change over time.
  • Scheduling 42 single-threaded Node.js containers on a 32-CPU host => context switching, swap, inefficiency.
  • It’s hard to know which service should be manually scaled when a VM (hosting random containers) consumes 100% CPU.
  • A service can run slower because co-hosted containers eat its resources.
  • After a node failure, containers are rescheduled onto other VMs, but that causes much more context switching and can slow down the whole app. Why not keep those containers stopped until a new host is up, affecting only a few microservices’ load and response time?
  • What about databases? They often need dedicated resources and should not share CPU/RAM with other containers. What happens to the infrastructure’s average response time when a VM dies and Swarm moves the DB onto a host already running random containers (maybe even another DB)?
  • When an autoscaling group detects that load is going up, it starts a new instance. But how do you reschedule some containers onto it?
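For the record, Swarm can at least account for declared resources at scheduling time: reservations and limits can be set per service. A minimal compose v3 sketch (service name, image, and values are all made up):

```yaml
version: "3.8"
services:
  api:                          # hypothetical service
    image: myorg/api:latest     # placeholder image
    deploy:
      replicas: 4
      resources:
        reservations:           # scheduler only places a task on a node with this much free
          cpus: "0.50"
          memory: 256M
        limits:                 # hard cap enforced at runtime
          cpus: "1.00"
          memory: 512M
```

Note that reservations only guide initial placement, and they are static values, so they don’t address the day/night variability above.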

Starting more VMs than needed is the easiest work-around, but it’s definitely a waste and a bad practice!!
Should we kill containers on a regular basis to rebalance them across “available” VMs?

Did anybody try to solve these kinds of issues?

I have not tested all the orchestration options, but with Kubernetes, CPU requests and limits can be specified and autoscaling configured for varying loads. Some references:
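For illustration, a minimal Pod spec with requests and limits (names and values are placeholders, not a recommendation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                     # hypothetical name
spec:
  containers:
  - name: web
    image: myorg/web:latest     # placeholder image
    resources:
      requests:                 # what the scheduler reserves on a node
        cpu: 250m
        memory: 256Mi
      limits:                   # hard cap; throttled / OOM-killed beyond this
        cpu: 500m
        memory: 512Mi
```

The scheduler places the pod on a node with enough unreserved capacity for the requests, and a HorizontalPodAutoscaler can add replicas when observed CPU rises relative to those requests.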

I didn’t try it yet, but randomness should be addressable using labels on nodes and services.
I think you can label a group of nodes and create services using those labels to restrict which services run on which nodes.
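A sketch of that idea with Swarm placement constraints, assuming a made-up node label `tier` (compose v3 syntax):

```yaml
version: "3.8"
services:
  db:
    image: postgres:13          # example image
    deploy:
      placement:
        constraints:
          # only schedule on nodes previously labelled with:
          #   docker node update --label-add tier=db <node-name>
          - node.labels.tier == db
```

That would at least keep a database on dedicated nodes, removing some of the randomness.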

How can I auto-migrate / rebalance containers of an already-running service to a newly added worker node?
Do I really need to kill the running containers?!
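As far as I know there is no automatic rebalancing, but since Docker 1.13 you can force a rolling reschedule of a service’s tasks, and the scheduler will then consider the new worker (service name is a placeholder):

```shell
# Force all tasks of the service to be replaced one by one;
# replacements are scheduled across all available nodes,
# including a freshly joined worker.
docker service update --force my_service
```

Tasks are still stopped and restarted, though, so it’s a rolling kill, not a live migration.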

@dvohra K8s seems to offer more options for optimisations :clap:

@robymes Yes, but it does not solve most of the problems I mentioned :frowning:

sl4dy Same problem here. Docker can pause containers, but Swarm is still not able to hot-migrate them the way we used to with virtual machines. Need confirmation on that.