Hi guys,
I’m currently building an infrastructure based on Docker and Swarm on top of AWS, and I wonder how to manage resources and performance efficiently in a cluster.
Docker introduced new orchestration methods, adopted by Swarm, Kubernetes, Rancher… They all have automated container scheduling in common. It looks very promising, but can anybody provide feedback on a large cluster, at scale? I mean, a production setup that lives longer than a 15-minute demonstration?
Weaknesses I see in the modern container lifecycle:
- In a web stack, resource consumption is not flat: day vs. night, weekday vs. weekend, December vs. January (in e-commerce)… => Scheduling strategies don’t take into account that the CPU and RAM a service actually needs change over time. (Scheduled scaling is one crude workaround; see the first sketch after this list.)
- Scheduling 42 single-threaded Node.js containers on a 32-CPU host => context switching, swapping, inefficiency. (Resource reservations can stop the scheduler from over-packing like this; see the second sketch below.)
- It’s hard to know which service should be manually scaled when a VM (hosting random containers) consumes 100% of its CPU.
- A service can run slower because co-hosted containers eat its resources.
- After a node failure, containers are rescheduled onto the surviving VMs, but that causes much more context switching and can slow down the whole app. Why not keep those containers stopped until a new host is up, affecting only a few microservices’ load and response time?
- What about databases? They often need dedicated resources and should not share CPU/RAM with other containers. What happens to the infrastructure’s average response time when a VM dies and Swarm moves the database onto a host already running random containers (maybe even another database)? (Placement constraints can pin databases to dedicated nodes; see the third sketch below.)
- When an autoscaling group detects that load is going up, it starts a new instance. But how do you reschedule some containers onto it? (A forced service update can rebalance; see the last sketch further down.)
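For the day/night pattern, the crudest workaround I can think of is scheduled scaling driven by cron on a manager node. This is only a sketch: the service name `web` and the replica counts are made-up examples, and it assumes the `docker service scale` command (Docker 1.13+).

```
# crontab on a Swarm manager: follow the known traffic pattern
# (hypothetical service name "web"; replica counts are examples)
0 8  * * * docker service scale web=20   # scale up for the daytime peak
0 22 * * * docker service scale web=5    # scale down for the overnight trough
```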
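On the over-packing point (42 single-threaded tasks on 32 CPUs): Swarm’s scheduler does honour resource reservations, so if every task reserves a full CPU it will refuse to place more tasks than a node can hold, leaving the overflow pending instead of thrashing the host. A minimal sketch, assuming hypothetical service and image names:

```
# reserve one full CPU per task; the scheduler won't place a task
# on a node whose CPU reservations are already exhausted
docker service create --name node-app \
  --replicas 42 \
  --reserve-cpu 1 \
  --reserve-memory 512M \
  --limit-cpu 1 \
  my-node-image
```

The trade-off is that unplaceable tasks sit in Pending until capacity shows up, which at least makes the shortage visible instead of silently degrading everything.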
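For databases, placement constraints on node labels can give them dedicated nodes, and the inverse constraint keeps everything else away. Again a sketch; the label, node name, and service names are assumptions:

```
# label a node as dedicated to databases
docker node update --label-add dedicated=db db-node-1

# pin the database service onto it, with a memory reservation
docker service create --name postgres \
  --constraint 'node.labels.dedicated == db' \
  --reserve-memory 4G \
  postgres:9.6

# keep stateless services off the database node
docker service create --name web \
  --constraint 'node.labels.dedicated != db' \
  my-web-image
```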
Starting more VMs than needed is the easiest workaround, but it’s definitely a waste and a bad practice!
Should we kill containers on a regular basis to rebalance them across the “available” VMs?
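As far as I know, Swarm doesn’t move running tasks when a new node joins; only newly created tasks land on it. Rather than killing containers blindly, forcing a no-op rolling update reschedules a service’s tasks so the spread strategy can use the new node (Docker 1.13+; `web` is a placeholder):

```
# after the autoscaled instance joins the swarm, force a rolling
# update: tasks are rescheduled one at a time, with a delay, and
# spread across nodes, including the new one
docker service update --force \
  --update-parallelism 1 \
  --update-delay 10s \
  web
```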
Has anybody tried to solve these kinds of issues?