Mesh routing and service lifecycles

midnightw · August 26, 2020, 10:48am

I am running docker stacks on a swarm cluster.

What I expect when I update a service is the following sequence for each container that needs updating:

1. Route new traffic to the other functioning containers; no new traffic is routed to the stopping container.
2. Wait X seconds before sending the signal to stop the container.
3. Start a new container with the updated image.
4. Wait another Y seconds to make sure the new container is stable.
5. Start routing new traffic to the updated (new) container.

The documentation I found is not quite clear about how mesh routing and the service lifecycle interact.
My guess is that Y goes to:

deploy:
  update_config:
    monitor: Y

Is this correct?
Or does the value there not affect mesh routing at all, and only affect the container lifecycle?
Also, I can’t seem to find where to specify X.
Could you tell me where?
Or, if I am missing something, could you tell me how mesh routing actually works in this situation, or where I can find the related info?

meyay · September 19, 2020, 12:12pm

You did read the docker compose reference, though, didn’t you?

I feel deploy.update_config.order: start-first is what you need, though it is not exactly what you want: its processing sequence is different from yours. It will first start the new container and then unregister the old container, with an overlap. What you want also requires that your image or swarm deployment declares a health check that only returns success if the service is actually responding (instead of a simple “the port can be reached” type of check). deploy.update_config.monitor declares the time window in which the first positive health check result is expected on the new container; if the update fails within that window, the action you defined in failure_action is started.
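As a rough sketch of how the settings mentioned above might fit together in a stack file (the service name web, the image example/web:latest, port 8080, the /health endpoint, and every duration are assumptions for illustration; curl is assumed to be present in the image):

```yaml
version: "3.8"
services:
  web:
    image: example/web:latest          # hypothetical image name
    healthcheck:
      # Assumed endpoint: should succeed only when the app really answers requests.
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 15s                # startup time during which failures are not counted
    deploy:
      replicas: 2
      update_config:
        parallelism: 1                 # update one task at a time
        order: start-first             # start the new task before stopping the old one
        monitor: 30s                   # how long to watch each updated task for failure
        failure_action: rollback       # what to do if the update fails
```

With order: start-first, the old and new tasks overlap for a short time, so the node needs enough spare resources to run both at once.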

midnightw · September 19, 2020, 1:53pm

Hello
Thanks a lot @meyay .

The mechanism is different from what I expected, so at the time I could not easily find the solution from where I stood.
In docker, there are multiple ways to do things; I was a bit confused at first (most tutorials don’t use docker stacks).

I did read the reference, and I tried many things until I found out about the health check thing. What could be improved, IMO, is to mention health checks on the routing mesh page.

I implemented the health check with netcat (nc -vz) and was quite happy with the result for some time.
However, only half of my issues are solved.
The issue related to starting up seems to be solved, but the one related to shutting down is not.
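For illustration, a netcat-based health check like the one described above might look roughly like this in the stack file (the service name, image, and port 3000 are assumptions, and nc must exist in the image). Note that it only verifies that the port accepts TCP connections, not that the application answers requests:

```yaml
services:
  web:
    image: example/web:latest          # hypothetical image; must contain nc
    healthcheck:
      # Passes as soon as something accepts TCP connections on the assumed port.
      test: ["CMD-SHELL", "nc -vz localhost 3000 || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 3
```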

In some environments (like nodejs applications), it is hard to handle the shutdown signal.

Specific to nodejs, it provides process.on to trap signals. But in practice, trapping the signal and waiting for every pending request to finish is quite a challenge. The problem lies with the waiting step, not with intercepting the signal.

From the docker perspective, I totally agree that this is the application’s fault, but it is a quirk I don’t know a way around (yet).

Does docker provide some mechanism to add a delay between accepting the last TCP request and sending SIGTERM? Or rather, how do you ensure every pending request is finished before the container stops?

meyay · September 19, 2020, 7:13pm

You should make it a habit ^^ Back in the day, when I was working with swarm stacks daily, this was most often my entry point when I had to look up details.

Not really. Though, you can determine which signal is sent to terminate the container, and you can define the stop grace period, which is the delay between the terminate signal and a SIGKILL if the container hasn’t stopped before the grace period has passed.

Other than that: tracking active requests is not a container problem - it exists regardless of whether your nodejs app runs in a container or natively on the host.
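A minimal sketch of the stop signal and stop grace period mentioned above, assuming a hypothetical service named web and an illustrative 30s window (whether stop_signal is honoured by docker stack deploy may depend on the Docker version, so treat that line as an assumption to verify):

```yaml
services:
  web:
    image: example/web:latest          # hypothetical image name
    stop_signal: SIGTERM               # signal sent first; assumed to be honoured by your Docker version
    stop_grace_period: 30s             # time allowed after the stop signal before SIGKILL is sent
```

This does not delay the stop signal itself. It only gives the application more time to finish in-flight requests after receiving it, so the application still has to stop accepting new work and drain on its own within that window.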
