Thank you for maintaining such a great forum. I consistently find answers to my questions here.
I’m using Docker Swarm to provide my service across different regions. For the sake of simplicity, let’s consider a scenario with two nodes: a manager node in the EU and a worker node in the US.
Suppose there is a one-replica service running on the US node: one container that serves long-lived TCP streams for users.
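For illustration, a minimal compose sketch of such a setup (service name, image and port are placeholders; the label value matches the constraint mentioned below):

```yaml
version: "3.8"

services:
  us-service:
    image: myorg/us-service:latest      # placeholder image
    ports:
      - "9000:9000"                     # long-lived TCP streams for local users
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.region == us    # keep the task on the US node
```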
The US node loses connection with the manager (network issues between the US and EU) but continues to operate in the US normally and serve local users.
The manager in EU marks the US node as “down”.
The manager attempts to reschedule this service on another node but can't find a suitable one because of the regional constraint region=us.
At this point, on the manager:
The “us-service” shows replicas “0/1”
The “us-service” target state is Running, but the current state is Down
However, on the worker node the service continues to operate and serve users.
After the connection between nodes is restored:
Expected behaviour: The manager recognises that the service is already running on the US node, reconciles the state, marks the replication status as “1/1”, and continues normal operation.
Actual behaviour: The manager launches a new instance (since the old one was marked as Down). It then sees that two instances are running, the replication status is “2/1”, so it shuts down the old one to get back to “1/1”.
For me this is the problem, because my service should offer reliable long-lived TCP streams to my clients. Every such restart means downtime and a mass reconnect for all the clients in the region.
I know that swarm clusters are not an optimal choice for flaky, high latency networks.
I really love the low operational complexity, ease of overlay networking and secrets management.
But the scheduling gives me trouble with multi-region deployments.
Is there any way of improving the situation here? Any configuration tweaks, or a way of reorganizing the stacks? Maybe even an alternative to swarm mode?
I am afraid you already know the answer, as you have already summed up all there is to know. You could test whether Kubernetes or Nomad behave differently when it comes to which replica is deleted after a reconciliation. Though both of them use the Raft consensus algorithm as well, and will suffer from high-latency networks just like swarm mode does.
Let’s take Raft out of the equation for a minute, as your situation could happen in a low-latency network as well. After recovery from a split-brain scenario, the reconciliation results in the oldest replica being deleted first. You could open a feature request in the https://github.com/moby/swarmkit repository and illustrate why a setting is needed that makes the reconciliation remove the newest replica first.
I just thought about the issue, and realized that there might be solutions to this:
you could deploy the service in global mode, which deploys one service task per node that satisfies the placement constraint. It would not create a new service task in a split-brain scenario, and it should detect that a task is already running on the node after recovery from the split-brain scenario (a short compose sketch follows after this list).
if global mode is not an option, you could leverage node labels and placement constraints to pin the service to specific nodes. This allows pinning the service to nodes in a region, so that in a split-brain scenario the task can’t be scheduled in another region, as the placement constraint is not satisfied. Though, if more than one node carries the label, and the node that still runs the task recovers from the split-brain scenario later than another node with the same label, you might still end up with a new task being deployed and the old one being removed once the node becomes available.
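A minimal sketch of the global-mode variant, reusing the region label (image and label value are placeholders):

```yaml
version: "3.8"

services:
  us-service:
    image: myorg/us-service:latest     # placeholder image
    deploy:
      mode: global                     # one task per matching node, nothing to reschedule elsewhere
      placement:
        constraints:
          - node.labels.region == us   # only labeled nodes run a task
```

The second option uses the same label mechanism, just applied to fewer nodes, e.g. with docker node update --label-add region=us <node-name> on a manager.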
When I read about long-lived TCP connections, is it safe to assume that you use endpoint_mode: dnsrr for the service? endpoint_mode: vip will close idle connections after 900 seconds (at least that’s the threshold I remember).
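For reference, that is the per-service setting I mean; a minimal sketch (service name and port are placeholders). Note that with dnsrr the routing mesh is not used, so published ports have to be in host mode:

```yaml
services:
  us-service:
    image: myorg/us-service:latest   # placeholder image
    deploy:
      endpoint_mode: dnsrr           # DNS round robin instead of the routing-mesh VIP
    ports:
      - target: 9000
        published: 9000
        protocol: tcp
        mode: host                   # dnsrr can't be combined with ingress-published ports
```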
Well, the point is to have a service with multiple replicas to handle the load. Otherwise it doesn’t make sense for us.
Yes, that’s exactly how it is now.
Suppose there is one worker node with region=US label.
We deploy us-service with N replicas and a region=US constraint.
After a split brain, the manager marks the US node as Down and creates Pending tasks for us-service.
Scheduling fails: there are no US nodes left from the manager’s point of view, so the tasks remain Pending.
After connectivity is restored, the Pending tasks are executed, starting N new replicas despite the fact that N replicas are already present on the worker node.
So yeah, your assumption is right here: it still reschedules.
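For anyone who wants to reproduce it, the setup boils down to roughly this (node name, image and replica count are made up):

```bash
# Label the single US worker so the constraint can match it
docker node update --label-add region=US us-worker-1

# Deploy the service pinned to that region
docker service create \
  --name us-service \
  --replicas 3 \
  --constraint 'node.labels.region == US' \
  myorg/us-service:latest

# During and after the split brain, this shows the Pending / newly started tasks
docker service ps us-service
```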
Alright, for those who may struggle with the same problem in the future:
Nomad’s lost_after setting is useful for edge deployments, or for scenarios where operators want zero on-client downtime due to node connectivity issues.
I am sure k8s has something similar, but we decided against both, since they are overkill for us.
We have settled on a “mini swarm cluster per region” scheme.
So instead of one big cluster we have five, one per region.
This loses the benefit of a single overlay network spanning all regions and adds some automation overhead, but generally it suits us fine.
We are still able to scale horizontally inside one region and keep service discovery and rolling updates.
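If it helps anyone, the per-region deploys can be scripted from one place with Docker contexts; a rough sketch (context names, hosts and stack file are made up):

```bash
# One context per regional swarm manager
docker context create swarm-eu --docker "host=ssh://ops@eu-manager.example.com"
docker context create swarm-us --docker "host=ssh://ops@us-manager.example.com"

# Deploy the same stack file to each regional cluster
docker --context swarm-eu stack deploy -c stack.yml app
docker --context swarm-us stack deploy -c stack.yml app
```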
Another thing we considered is raising the --dispatcher-heartbeat from the default 5s to minutes. It is not best practice, and while it would help with transient failures, it still would not prevent the restarts completely, so we decided against it.
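For completeness, the tweak itself would have been a one-liner on a manager (the 2m value is only an example):

```bash
# Workers report in less often, so short network blips are less likely to mark a node as down
docker swarm update --dispatcher-heartbeat 2m
```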