Hi guys,
I’m currently building an infrastructure based on Docker and Swarm on top of AWS, and I wonder how to manage resources and performance efficiently in a cluster.
Docker introduced new orchestration methods, adopted by Swarm, Kubernetes, Rancher… They all have automated container scheduling in common. It looks very promising, but can anybody provide feedback on a large cluster, at scale? I mean, a production setup that lives longer than a 15-minute demonstration?
Weaknesses I see in the modern container lifecycle:
- In a web stack, resource consumption is not flat: day vs. night, weekday vs. weekend, December vs. January (in e-commerce)… => Scheduling strategies don’t take into account that the CPU and RAM a service actually needs change over time. (Scheduled scaling is one crude workaround; see the first sketch after this list.)
- Scheduling 42 single-threaded Node.js containers on a 32-CPU host => context switching, swapping, inefficiency. (Resource reservations can stop the scheduler from over-packing like this; see the second sketch below.)
- It’s hard to know which service should be manually scaled when a VM (hosting random containers) consumes 100% of its CPU.
- A service can run slower because co-hosted containers eat its resources.
- After a node failure, containers are rescheduled onto the surviving VMs, but that causes much more context switching and can slow down the whole app. Why not keep those containers stopped until a new host is up, affecting only a few microservices’ load and response time?
- What about databases? They often need dedicated resources and should not share CPU/RAM with other containers. What happens to the infrastructure’s average response time when a VM dies and Swarm moves the database onto a host already running random containers (maybe even another database)? (Placement constraints can pin databases to dedicated nodes; see the third sketch below.)
- When an autoscaling group detects that load is going up, it starts a new instance. But how do you reschedule some containers onto it? (A forced service update can rebalance; see the last sketch further down.)
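For the day/night pattern, the crudest workaround I can think of is scheduled scaling driven by cron on a manager node. This is only a sketch: the service name `web` and the replica counts are made-up examples, and it assumes the `docker service scale` command (Docker 1.13+).

```
# crontab on a Swarm manager: follow the known traffic pattern
# (hypothetical service name "web"; replica counts are examples)
0 8  * * * docker service scale web=20   # scale up for the daytime peak
0 22 * * * docker service scale web=5    # scale down for the overnight trough
```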
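On the over-packing point (42 single-threaded tasks on 32 CPUs): Swarm’s scheduler does honour resource reservations, so if every task reserves a full CPU it will refuse to place more tasks than a node can hold, leaving the overflow pending instead of thrashing the host. A minimal sketch, assuming hypothetical service and image names:

```
# reserve one full CPU per task; the scheduler won't place a task
# on a node whose CPU reservations are already exhausted
docker service create --name node-app \
  --replicas 42 \
  --reserve-cpu 1 \
  --reserve-memory 512M \
  --limit-cpu 1 \
  my-node-image
```

The trade-off is that unplaceable tasks sit in Pending until capacity shows up, which at least makes the shortage visible instead of silently degrading everything.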
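For databases, placement constraints on node labels can give them dedicated nodes, and the inverse constraint keeps everything else away. Again a sketch; the label, node name, and service names are assumptions:

```
# label a node as dedicated to databases
docker node update --label-add dedicated=db db-node-1

# pin the database service onto it, with a memory reservation
docker service create --name postgres \
  --constraint 'node.labels.dedicated == db' \
  --reserve-memory 4G \
  postgres:9.6

# keep stateless services off the database node
docker service create --name web \
  --constraint 'node.labels.dedicated != db' \
  my-web-image
```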
Starting more VMs than needed is the easiest workaround, but it’s definitely a waste and a bad practice!
Should we kill containers on a regular basis to rebalance them across the “available” VMs?
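As far as I know, Swarm doesn’t move running tasks when a new node joins; only newly created tasks land on it. Rather than killing containers blindly, forcing a no-op rolling update reschedules a service’s tasks so the spread strategy can use the new node (Docker 1.13+; `web` is a placeholder):

```
# after the autoscaled instance joins the swarm, force a rolling
# update: tasks are rescheduled one at a time, with a delay, and
# spread across nodes, including the new one
docker service update --force \
  --update-parallelism 1 \
  --update-delay 10s \
  web
```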
Has anybody tried to solve these kinds of issues?