Docker swarm scheduling help

Hello,

I currently have a Docker swarm with 1 manager and 4 workers (each with 4 cores, for a total of 16 cores), built using Docker version 18.03.1. More info is further below. I am trying to run many short, one-time jobs on the swarm: I create 100 two-minute jobs, each with 1 replica and restart-condition set to none. An example of one such job (which just sleeps for 120 seconds for now) is shown below:

$ docker service create --name test_job0  --detach --restart-condition none --replicas 1 --reserve-cpu 1 --limit-cpu 1 --reserve-memory 1GB manager:5000/code-docker-image sleep 120

PROBLEM: Only 4 services run simultaneously, even though 16 cores are available. In fact, I noticed that jobs are scheduled only on the last worker node in the swarm (hadoop4's availability goes to “Active”, while the other workers remain in “Drain”) and never on the other workers. I would like to have 16 services run simultaneously.

I have tried several different settings, such as specifying constraints, but none of them worked. The service’s --no-trunc output reports “no suitable node (4 nodes not available for new tasks; insufficient resources on 1 node)”.

Any suggestions for getting 16 services to run simultaneously are appreciated.


Output showing pending services not running due to “no suitable node” error:

ID                  NAME                IMAGE                                                      NODE                DESIRED STATE       CURRENT STATE                     ERROR                              PORTS
lwxadszrv344        test_job8.1         manager:5000/code-docker-image:latest                       Running             Pending 5 seconds ago             "no suitable node (4 nodes not…"   
n1erpzp05zb8        test_job7.1         manager:5000/code-docker-image:latest                       Running             Pending 5 seconds ago             "no suitable node (4 nodes not…"   
49hk53l67yxv        test_job6.1         manager:5000/code-docker-image:latest                       Running             Pending 5 seconds ago             "no suitable node (4 nodes not…"   
pyhtgvigu8dx        test_job5.1         manager:5000/code-docker-image:latest                       Running             Pending 6 seconds ago             "no suitable node (4 nodes not…"   
l54a5yaxkzej        test_job4.1         manager:5000/code-docker-image:latest                       Running             Pending 6 seconds ago             "no suitable node (4 nodes not…"   
idxhyxamf43l        test_job3.1         manager:5000/code-docker-image:latest   hadoop4             Running             Accepted less than a second ago                                      
0aeecexapcex        test_job2.1         manager:5000/code-docker-image:latest                       Running             Pending 7 seconds ago             "no suitable node (4 nodes not…"   
39a9fmm654n6        test_job1.1         manager:5000/code-docker-image:latest   hadoop4             Running             Assigned less than a second ago                                      
h9g259hbqwj1        test_job0.1         manager:5000/code-docker-image:latest   hadoop4             Running             Assigned less than a second ago                                      

Output showing list of nodes (while some of the jobs were running):

$ docker node ls
ID                            HOSTNAME                     STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
n48mil4rvzj3k2eix38v3eph4     hadoop1                      Ready               Drain                                   18.03.1-ce
wqihfh77so5k1q6hej1lkneaa     hadoop2                      Ready               Drain                                   18.03.1-ce
ep8rl3yrwmv3pqlb96j75ol5z     hadoop3                      Ready               Drain                                   18.03.1-ce
gqlgvflky2li104d4st2i10jc     hadoop4                      Ready               Active                                  18.03.1-ce
znwmm4cmrz7a4b0yby11yncw1 *   manager                      Ready               Drain               Leader              18.03.1-ce

Any suggestions for getting 16 services running on this swarm are appreciated.

Manually changing the node availability from drain to active fixed the issue for me:

$ docker node update --availability active hadoop3
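
To do the same for every drained worker in one go, something like the following might work. This is only a sketch: the function names drained_nodes and activate_drained_workers are made up for illustration, and it assumes that it is run on the manager and that the manager itself should stay drained.

```shell
# Print the hostnames whose availability column is "Drain".
# Input on stdin: one "<hostname> <availability>" pair per line.
drained_nodes() {
  awk '$2 == "Drain" { print $1 }'
}

# Hypothetical helper: set every drained worker back to active.
# Only workers are listed (--filter role=worker), so the manager is untouched.
activate_drained_workers() {
  docker node ls --filter role=worker \
      --format '{{.Hostname}} {{.Availability}}' \
    | drained_nodes \
    | while read -r node; do
        docker node update --availability active "$node"
      done
}

# Run on the manager node:
# activate_drained_workers
```

Afterwards, docker node ls should list the workers as Active, and pending tasks should be scheduled on them as long as each node has enough unreserved CPU and memory.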

You might want to dig deeper into the topic: