Swarm instability

OS: CentOS
Docker CE version: usually latest or close to latest, currently 18.09.3

I've been dreading moving to K8s as it's a big project and will take time we don't have right now. Overall, Docker Swarm works well for me most of the time.

One issue: we have 6 workers and 3 managers, and it seems that over time things fall apart. Worker nodes end up stuck with ~100 MB of free memory (all nodes are 4 vCPU / 16 GB), become borderline inaccessible, and eventually show as “down”.

The nodes have nothing running on them besides some corporate monitoring software and Docker CE (private/internal VMs, not public cloud).

The only way to recover is to restart the node. Is it Docker Swarm managing memory badly as it stops and starts containers all the time? It's a relatively high-load environment: ~13 .NET Core apps running in the cluster, lots of traffic, and currently lots of stop/starts as new versions are continuously deployed.

I've read various things about bad OOM handling and so on.

What is the recommended setup here, or am I doing something wrong?

Did you set up resource limits and reservations? If not, what prevents each of the containers you operate from consuming all the RAM (which might result in all containers fighting for their share)?

You definitely want to control the resources available to your containers:
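For example, in a stack file (compose file format v3) this would look roughly like the following; the service name, image, and numbers are just placeholders:

```yaml
version: "3.7"

services:
  myapp:                                      # placeholder service name
    image: registry.example.com/myapp:latest  # placeholder image
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: "1.0"      # task gets throttled above this CPU share
          memory: 1024M    # task gets OOM-killed if it exceeds this
        reservations:
          cpus: "0.25"     # scheduler only places the task on nodes with this much unreserved CPU...
          memory: 256M     # ...and this much unreserved memory
```

After `docker stack deploy -c stack.yml mystack`, the values show up in `docker service inspect` under `Spec.TaskTemplate.Resources`, so you can verify they were actually applied.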


Hi -

Yes, for each of the .NET Core apps I have a memory reservation of 1024 MB or less, depending on the expected demand, and a limit of 4096 MB.
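Roughly, the deploy section for each service looks like this (simplified, service name omitted):

```yaml
    deploy:
      resources:
        reservations:
          memory: 1024M   # or less, depending on expected demand
        limits:
          memory: 4096M
```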

My understanding is that Swarm then won't schedule a container for that app on a node that doesn't have at least 1024 MB of unreserved memory, and that each container won't be allowed to consume more than 4096 MB.

In reality these apps use MUCH less RAM than these limits.

The issue is just that things seem to deteriorate over time: a node ends up stuck with ~100 MB free, doesn't schedule anything, or falls over.

Hmm, that's odd. The Swarm scheduler should make sure a potential target node meets the deployment constraints and reservations before a task is scheduled onto it.

You could use a monitoring stack to collect metrics from your nodes and containers, for example Prometheus/Grafana (https://github.com/stefanprodan/swarmprom). It helps a lot to actually see what your containers are really doing.
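Getting it running on the swarm is roughly the following; the admin credentials and variable names are taken from the repo's README, so double-check there before deploying:

```sh
git clone https://github.com/stefanprodan/swarmprom.git
cd swarmprom

# Run on a manager node; ADMIN_USER/ADMIN_PASSWORD protect the web UIs
# (variable names as documented in the repo's README - check the current one)
ADMIN_USER=admin ADMIN_PASSWORD=changeme \
  docker stack deploy -c docker-compose.yml mon
```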

I would put my money on something that puts load on the system. The last time we had such a problem, we had functions that used an outer transaction and an inner transaction for database operations, where the inner transaction took so long that the connection pool was exhausted, which then prevented the outer transaction from picking up its work and finishing. Making the pool bigger made it even worse. Once we fixed the cause, the overall load dropped from an insane 20-70 to reasonable values between close to 0 and 4.

A huge load could prevent the applications inside your containers from doing the housekeeping tasks that would normally free up memory.
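A few quick checks on an affected worker should tell you whether it's the OOM killer or just load (a sketch only; the container name at the end is a placeholder):

```sh
# Load averages - values persistently well above the CPU count (4 here) are suspicious
uptime

# Did the kernel OOM killer kick in?
dmesg -T | grep -iE 'oom|out of memory'

# Docker-level OOM events over the last day (--until makes it return instead of streaming)
docker events --filter event=oom --since 24h --until "$(date +%s)"

# Was a specific container OOM-killed on its last exit? (placeholder name)
docker inspect -f '{{ .State.OOMKilled }}' some_container
```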

Thanks - yes, I have used swarmprom; I stopped it because the cluster had fewer resources back then (all nodes were 1 vCPU / 4 GB, now 4 vCPU / 16 GB).

That said, my issue isn't about container memory per se - it's more about the actual host becoming unresponsive after a while when it's only running Docker… if that makes sense.

Did you check the system load with top/htop?

Yeah, that's the problem… I am referring to system load, free memory, etc. It's absurdly low on a 4 vCPU / 16 GB server with nothing else running besides a monitoring agent, a Puppet agent, and Docker Swarm.
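This is roughly what I'm looking at on an affected node (just standard tools, nothing fancy):

```sh
# Load averages and uptime
uptime

# Memory overview - the "available" column matters more than "free",
# since buffers/cache are reclaimable
free -h

# Top memory consumers on the host
ps aux --sort=-rss | head -n 15
```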