Service update does not resolve conflicts on generic resources

Hello!

I have a swarm cluster that uses generic resources (GPUs). All nodes have several slots; to simplify, let it be two nodes with two slots each. One of the services consumes a lot of resources, so I use --replicas-max-per-node=1 for it (to be more precise, I use the serviceSpec.TaskTemplate.Placement.MaxReplicas field of the golang API, but I don't think that really matters). I also use the start-first update order to minimise downtime (it sometimes takes a long time to pull images). Today I faced the following situation while updating that service (i.e. service_a):

  1. node_1
    a. Slot 1 - service_a:previous_version
    b. Slot 2 - empty
  2. node_2
    a. Slot 3 - service_b
    b. Slot 4 - service_c

The update stalled with the reason “no suitable node”. I solved it with docker service update --force service_b, which rescheduled service_b and left Slot 3 on node_2 empty. Is there a way to resolve such conflicts automatically? Or is writing some automation myself the only way?
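
For context, this is roughly how such a service could be created via the CLI (a sketch only; the service name, image, and the generic resource key gpu are placeholders for whatever the nodes actually advertise):

```
docker service create \
  --name service_a \
  --replicas 1 \
  --replicas-max-per-node 1 \
  --generic-resource "gpu=1" \
  --update-order start-first \
  registry.example.com/service_a:latest
```

The same settings map to ServiceSpec.TaskTemplate.Placement.MaxReplicas, TaskTemplate.Resources.Reservations.GenericResources and UpdateConfig.Order in the golang API.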

Interesting, I thought devices like GPUs are not supported in Docker Swarm (issue).

When we update Traefik, which uses the “scarce” resources ports 80 and 443, we usually use stop-first so that we don't run into a resource conflict.
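
For comparison, this is what that looks like in a stack file (a sketch, trimmed to the relevant parts; host-mode publishing is what makes the ports a per-node scarce resource):

```yaml
services:
  traefik:
    image: traefik:v2.11
    ports:
      - target: 80
        published: 80
        mode: host
      - target: 443
        published: 443
        mode: host
    deploy:
      update_config:
        order: stop-first   # free the ports before the new task starts
```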

You are right - there is no official support for GPUs. However, using the NVIDIA Container Toolkit + generic resources solves that problem. When I researched this, I didn't find a complete, up-to-date guide, but I can provide you with mine if you need it.
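
The core of the setup, in case it is useful (a sketch from my notes; the resource name NVIDIA-GPU is a convention rather than a requirement, and the GPU UUIDs come from nvidia-smi -L):

```
# /etc/docker/daemon.json on each GPU node
{
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "NVIDIA-GPU=GPU-<uuid-of-gpu-0>",
    "NVIDIA-GPU=GPU-<uuid-of-gpu-1>"
  ]
}

# /etc/nvidia-container-runtime/config.toml: uncomment this line so the
# runtime maps the reservation onto an actual device inside the container
swarm-resource = "DOCKER_RESOURCE_NVIDIA-GPU"
```

After restarting the daemon, a service can reserve a GPU with --generic-resource "NVIDIA-GPU=1".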

Using stop-first could solve the conflict here, but in my case it creates windows when there is no running instance of the service at all (when we update model weights, that changes large docker layers, up to 1.5-2 GiB, so pulls take a while). And the situation in the original post does not actually have a conflict - there are enough resources for all running instances plus one deploying instance; the conflict is caused by the greedy resource reservation algorithm. In this situation the conflict can be easily discovered and resolved by hand, but I am afraid that if a conflict appears after an automatic redeployment (e.g. after a node outage), it would be quite challenging even to discover it.
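
For the record, the kind of automation I have in mind would look roughly like this (a sketch against the github.com/docker/docker Go client; error handling is trimmed, and the victim-selection policy, which is the hard part, is left out - "service_b" below is just a placeholder):

```go
package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// Find tasks the scheduler cannot place anywhere.
	tasks, err := cli.TaskList(ctx, types.TaskListOptions{})
	if err != nil {
		panic(err)
	}
	for _, t := range tasks {
		if t.Status.State == swarm.TaskStatePending &&
			strings.Contains(t.Status.Err, "no suitable node") {
			fmt.Printf("stuck task %s of service %s: %s\n", t.ID, t.ServiceID, t.Status.Err)
			// Pick a service holding a slot the stuck task could use and
			// force it to reschedule ("service_b" is a placeholder here).
			forceUpdate(ctx, cli, "service_b")
		}
	}
}

// forceUpdate is the API equivalent of `docker service update --force`.
func forceUpdate(ctx context.Context, cli *client.Client, service string) {
	svc, _, err := cli.ServiceInspectWithRaw(ctx, service, types.ServiceInspectOptions{})
	if err != nil {
		panic(err)
	}
	spec := svc.Spec
	spec.TaskTemplate.ForceUpdate++ // bumping the counter triggers a reschedule
	if _, err := cli.ServiceUpdate(ctx, service, svc.Version, spec, types.ServiceUpdateOptions{}); err != nil {
		panic(err)
	}
}
```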