First of all, excuse my English; it's not my native language.
We're seeing strange errors and behaviour in some apps we've dockerized. We think we've pinpointed the origin, but we don't know how, or whether, we can change this behaviour.
The problem happens when a worker node loses connectivity with the managers.
We have some services that are constrained to run only on one particular node. They're also created with --network host.
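For reference, the services are created with something along these lines (the service name, image and constraint here are placeholders, not our exact values):

docker service create \
  --name adg-dev \
  --constraint 'node.hostname == worker-1' \
  --network host \
  my-registry/adg-dev:latest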
When this node loses connectivity with the managers (basically, the Internet sometimes goes down for a short period), the containers keep running on the node, but the node appears as down to the managers. Then, when connectivity is restored, Docker "kills" the running containers and starts them again.
The problems we're having seem to stem from the fact that Docker appears to deploy the new tasks before killing the old ones. This leads to sockets still being in use, the newly deployed services connecting to the old ones shortly before they die, and so on.
To test this, I've used iptables to block communication with the managers so the node is marked as down, and then allowed it again.
You can see how the "old" containers are still running when the new ones are deployed (the lines were too long, so I've trimmed most of the output):
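Roughly like this (MANAGER_IP is a placeholder for the manager's address):

# Make the node appear down by dropping traffic to/from the manager
iptables -I OUTPUT -d MANAGER_IP -j DROP
iptables -I INPUT -s MANAGER_IP -j DROP

# Restore connectivity by removing the rules again
iptables -D OUTPUT -d MANAGER_IP -j DROP
iptables -D INPUT -s MANAGER_IP -j DROP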
jue ago 22 08:56:04 CEST 2019
CREATED STATUS PORTS NAMES
About a minute ago Up About a minute 8080/tcp adg-dev.1. ...
2 minutes ago Up 2 minutes lb. ...
2 minutes ago Up 2 minutes net-doc. ...
2 minutes ago Up 2 minutes 8080/tcp adg-dev.2. ...
2 minutes ago Up 2 minutes websock-dev.1. ...
2 minutes ago Up 2 minutes cp-dev.1. ...
jue ago 22 08:56:05 CEST 2019
CREATED STATUS PORTS NAMES
3 seconds ago Up Less than a second net-doc. ...
3 seconds ago Up Less than a second lb. ...
3 seconds ago Up Less than a second 8080/tcp adg-dev.2. ...
3 seconds ago Up Less than a second cp-dev.1. ...
3 seconds ago Up Less than a second 8080/tcp adg-dev.1. ...
3 seconds ago Up Less than a second websock-dev.1. ...
About a minute ago Up About a minute 8080/tcp adg-dev.1. ...
2 minutes ago Up 2 minutes lb. ...
2 minutes ago Up 2 minutes net-doc. ...
2 minutes ago Up 2 minutes 8080/tcp adg-dev.2. ...
2 minutes ago Up 2 minutes websock-dev.1. ...
2 minutes ago Up 2 minutes cp-dev.1. ...
The old containers are gradually killed until, a few seconds later, they're all gone:
jue ago 22 08:56:16 CEST 2019
CREATED STATUS PORTS NAMES
13 seconds ago Up 11 seconds net-doc. ...
13 seconds ago Up 11 seconds lb. ...
13 seconds ago Up 10 seconds 8080/tcp adg-dev.2. ...
13 seconds ago Up 11 seconds cp-dev.1. ...
13 seconds ago Up 10 seconds 8080/tcp adg-dev.1. ...
13 seconds ago Up 11 seconds websock-dev.1. ...
2 minutes ago Up 2 minutes lb. ...
2 minutes ago Up 2 minutes net-doc. ...
jue ago 22 08:56:17 CEST 2019
CREATED STATUS PORTS NAMES
14 seconds ago Up 12 seconds net-doc. ...
14 seconds ago Up 12 seconds lb. ...
14 seconds ago Up 11 seconds 8080/tcp adg-dev.2. ...
14 seconds ago Up 12 seconds cp-dev.1. ...
14 seconds ago Up 12 seconds 8080/tcp adg-dev.1. ...
14 seconds ago Up 12 seconds websock-dev.1. ...
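In case it's useful, the output above was captured with a simple loop, something like this:

while true; do
  date
  docker ps
  sleep 1
done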
We’d like to know if it’s possible to either:
· When a node that was down becomes ready again and is already running the tasks that are about to be deployed to it, prevent Docker from killing and redeploying them, and just keep them running.
· If that's not possible, have Docker stop the old containers before starting the new ones.
As I've said, we've been unable to find how to do either. The services are already created with --update-order=stop-first and --rollback-order=stop-first, and with an update and rollback parallelism of 1. We've also seen that there's a --stop-grace-period option that defaults to 10s, but we don't want to set it to zero because some tasks require cleanup before they're shut down and need some time for it.
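For reference, this is roughly how those options are set on our services (the service name is just an example):

docker service update \
  --update-order stop-first \
  --rollback-order stop-first \
  --update-parallelism 1 \
  --rollback-parallelism 1 \
  --stop-grace-period 10s \
  adg-dev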
All nodes and managers are running docker 19.03.1.
Thank you!