When setting a node to drain, how to get swarm to actually wait for the original container to stop before starting the new one

X-Post from docker subreddit

I have some containers that run long jobs which take longer than 60s to complete, and which can interfere if another instance of the container starts up on another node. If I set stop_grace_period:, the original container stays up until either the job finishes or it reaches the grace period, but no matter what setting I try, the new container gets started after 60s. I thought it might respect one of the delay: options under deploy:, but nothing I put there seemed to change that timing, and nothing else in the docs suggests a way to do this.
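
For reference, a trimmed sketch of the kind of configuration I am describing (the service name, image, and values are illustrative):

    services:
      worker:                    # placeholder name
        image: example/worker    # placeholder image
        stop_grace_period: 15m   # keep the old container up while the job finishes
        deploy:
          replicas: 1
          update_config:
            order: stop-first
          rollback_config:
            order: stop-first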

I could keep a lock file (like a .pid file) and remove it after the job is finished, but that just kicks the can down the road: if the container gets killed or the host fails before the job is finished, the new container could be waiting indefinitely (and this seems kludgy to begin with, since unlike processes that run in the same environment, I can’t check whether the pidfile still points to the ‘right’ process).

My google-fu is failing me here, or is this just something not supported by swarm?

I must be missing something, as I am not quite sure if I understood your expected outcome.

If stop_grace_period: 60 is used and the process inside the container does not act on SIGTERM, it will take up to 60 seconds until the process receives a SIGKILL. As a result the process is killed, which ends the evicted service task; the scheduler should then detect a drift between current state and desired state and schedule the start of a new service task to remedy the drift.
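
For illustration, a minimal entrypoint that acts on SIGTERM, so the task can end before the SIGKILL arrives (run-job is a placeholder for the actual workload):

    #!/bin/sh
    # Forward SIGTERM to the workload so it can shut down gracefully
    # instead of being SIGKILLed at the end of the grace period.
    /usr/local/bin/run-job &   # placeholder workload
    JOB_PID=$!
    trap 'kill -TERM "$JOB_PID"' TERM
    wait "$JOB_PID"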

Regardless, have you tried to tweak the restart policy? Depending on how you look at it, a node drain stops the service task in a controlled way, so it might be considered a restart. I doubt that any setting underneath update_config will influence the behavior, as it only applies when the configuration of a service task gets updated, which is not the case on a node drain.
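
For reference, these knobs live under deploy.restart_policy; a sketch with illustrative values:

    deploy:
      restart_policy:
        condition: any   # any | on-failure | none
        delay: 15m       # time to wait between restart attempts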

I must be missing something, as I am not quite sure if I understood your expected outcome.

  • Task is running on node1.
    • stop_grace_period is 15m
    • all of the settings under deploy (update_config, rollback_config, etc.) are set to order: stop-first
  • docker node update node1 --availability drain is run
  • task running on node1 correctly gets SIGTERM and begins shutting down gracefully.
  • a new task is prepared on node<n> and is in “Ready State”
  • 60s after it is in “Ready State”, the task on node<n> starts, regardless of the fact that the task on node1 is still shutting down, has not finished, and has not exceeded the stop_grace_period yet. This timing never changes: it is always 60s after the new task has finished preparation and is in Ready State, no matter that the original task has not finished its stop yet, and no matter what values I specify under deploy:<update_config|rollback_config|restart_policy>:delay.

My expectation is that, when told to stop-first, the new task should actually wait for the first task to stop before starting.
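
For completeness, the task state transitions above can be watched with docker service ps; for example (service name is a placeholder):

    watch -n 1 docker service ps my_service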

Regardless, have you tried to tweak the restart policy?

Yes, this is what I meant by “I thought it might respect one of the delay: options under deploy:, but nothing I put there seemed to change that timing, and nothing else in the docs suggests a way to do this.”

Thank you for filling in the gaps :slight_smile:

It makes sense that neither update_config nor rollback_config applies, as a node drain does not update the service configuration.

Apparently it is not considered a restart, as I thought it might be: the scheduler already schedules the new service task even before the old service task has exited.

I doubt there is any configuration in the compose file specs that supports your use case.

You could raise a feature request in the Github Issues of the swarmkit project: https://github.com/moby/swarmkit/issues

As a workaround, you could implement the lock file approach you mentioned in your entrypoint script (a sketch follows the list):

  • check if the lock file exists
  • if it does: wait in your entrypoint script until the file is older than x minutes
  • then carry on with the task
  • and clean up the lock file before the container is exited.
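
A minimal sketch of such an entrypoint, assuming the lock file lives on a volume every node can reach (the /shared path, the 15-minute threshold, and the run-job command are all placeholders):

    #!/bin/sh
    # Hypothetical entrypoint; adjust paths and thresholds to your setup.
    LOCK_FILE=/shared/job.lock   # assumes a volume visible from every node
    MAX_AGE=900                  # "x minutes" staleness threshold, in seconds

    # Wait until any existing lock disappears or becomes stale.
    while [ -f "$LOCK_FILE" ]; do
      age=$(( $(date +%s) - $(stat -c %Y "$LOCK_FILE") ))
      [ "$age" -ge "$MAX_AGE" ] && break   # stale: the previous task likely died
      sleep 5
    done

    touch "$LOCK_FILE"

    # Run the job in the background so SIGTERM can be forwarded to it,
    # letting it finish gracefully within the stop_grace_period.
    /usr/local/bin/run-job "$@" &   # placeholder for the actual workload
    JOB_PID=$!
    trap 'kill -TERM "$JOB_PID" 2>/dev/null' TERM INT
    wait "$JOB_PID"                 # returns early if the trap fires
    wait "$JOB_PID"; STATUS=$?      # reap the job and capture its exit status
    rm -f "$LOCK_FILE"              # clean up the lock before the container exits
    exit "$STATUS"

Note that a job running longer than the staleness threshold would make its own lock look stale to the next task; a small background loop that periodically touches the lock file while the job runs would cover that case.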