Docker Swarm overloads the manager but not the worker nodes?

Hi

I have 1 Swarm manager node (Intel Xeon, 6 cores, 16 GB RAM) and 3 Swarm worker nodes (Intel Xeon, 12 cores, 32 GB RAM each).

I’m trying to spin up about 200-600 containers in the swarm, each running a small Python script. Each container image is about 1 GB in size.

My problem is that the manager doesn’t seem to distribute the load evenly. It takes on almost 90% of the work itself as a node and pushes only a little to the other 3 nodes. The result is that the manager runs all cores at 100% CPU and I can’t even type docker node or service commands on the command line to check status - it’s practically dead - while the 3 worker nodes idle with maybe 10 containers each running at the same time.

Am I doing something wrong? I can’t find anything in the docs about how to weight the nodes so that the manager only ever gets 30% of the load and the other 3 worker nodes get the remaining 70%.

Thx.

Please share the compose file used to deploy your stack, so we can get an idea of what you actually do.

Swarm service tasks should be spread evenly across the nodes that fulfill the resource and placement constraints.
You could use a placement constraint to only deploy a service to nodes with specific attributes or node labels.

E.g. use a placement constraint to only deploy the payload to the worker nodes:

version: "3.8"
services:
  myservice:
    ...
    deploy:
      placement:
        constraints:
          - "node.role==worker"
    ...

See https://docs.docker.com/engine/reference/commandline/service_create/#constraint for available constraints.
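If you are not deploying via a stack file, the same constraint can be passed on the CLI with the --constraint flag of docker service create. A minimal sketch (the service and image names here are just placeholders):

docker service create \
  --name myservice \
  --constraint node.role==worker \
  --replicas 10 \
  myregistry/myimage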

Um, I’m not using a compose file?

I pulled an image on the manager node and I’m running:

docker service create --replicas 600 --update-parallelism 5 --with-registry-auth --name mytestscript registry.blahblah.com/myregistry/myimage

It then spins up, but at around 200 replicas it dies, because the manager host is now at 100% CPU on all cores: roughly 150 containers are running on the manager itself and only about 50 were sent to the other nodes.

BUT, I did notice something now. I left it to run for a few hours, and for some reason I don’t understand, after about 2 hours it redistributed all the load and has now spread it equally across the workers. It’s weird that it doesn’t do this upfront but does it much later.
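As far as I know, Swarm only places tasks when they are (re)created and does not proactively rebalance running tasks; a redistribution like the one you saw is typically triggered by tasks failing or restarting. If you want to trigger it yourself, something along these lines (using the service name from your command) should reschedule all tasks and spread them again:

docker service update --force mytestscript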

We run only small loads on Swarm managers; all heavy load is explicitly limited to the workers.
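One way to keep heavy workloads off the manager entirely is to drain it so it no longer receives tasks. A sketch, assuming the manager’s node name is manager1 (replace it with the name shown by docker node ls):

# Stop scheduling new tasks on the manager and move its existing tasks to other nodes
docker node update --availability drain manager1

# Verify: the AVAILABILITY column for the manager should now read "Drain"
docker node ls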

Docker Swarm likes to have 3+ manager nodes for redundancy.

You can set CPU and memory reservations and limits per service container; maybe try that.
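For example, docker service update can reserve and cap CPU/memory per container; the numbers below are only illustrative values for your mytestscript service:

docker service update \
  --reserve-cpu 0.25 \
  --reserve-memory 256M \
  --limit-cpu 0.5 \
  --limit-memory 512M \
  mytestscript

Reservations matter here in particular, because the scheduler only places a task on a node that still has the reserved resources available, which keeps any single node from being overloaded.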

Also check your network: are there any VLANs, vSwitches, or VPNs in between? A wrong MTU can be an issue, as requests larger than roughly 1400 bytes fail.
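A quick way to test the path MTU between two nodes on Linux (the worker IP is a placeholder):

# 1472-byte payload + 28 bytes of ICMP/IP headers = 1500 bytes, the standard Ethernet MTU.
# The "don't fragment" flag makes the ping fail instead of fragmenting if the path MTU is smaller.
ping -M do -s 1472 <worker-node-ip>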