I'm currently in the process of figuring out how to run Elasticsearch (ES) on top of docker swarm. I've been thinking about this for a while and thought I might as well run my current thinking by the community. So, apologies for what will be a longish post. The context is that we are using elasticsearch in a way that requires it to stay available 24x7, and we want to be able to regularly deploy new versions.
All the examples I've seen on this topic are pretty much unsuitable for production usage and leave a lot of non-trivial stuff as an exercise for the reader. One thing I'm struggling with in particular, which I think is actually a deal breaker for running ES well on any kind of autoscaling cloud environment, is doing rolling updates in a responsible way (i.e. without the ES cluster going red).
The problem here is that (re)starting an elasticsearch node and an elasticsearch cluster being fully available are two different things. During a rolling restart, simply restarting nodes one by one will kill your cluster, make it unavailable, and potentially cause data loss as well. The last thing you want here is something like docker swarm blindly restarting nodes without checking anything in between. This is basically true for most clustered software, by the way.
The pattern for a rolling restart in ES is simple: 1) bring an instance down, 2) wait for the cluster to stabilize (this can take a long time), 3) restart the node, 4) wait for the cluster to stabilize again, 5) proceed to the next node. The problem is that the waiting can take long, hours if you have lots of data. Also, any of these steps can fail, in which case you may need to take manual action. If you get it wrong, your ES cluster goes red, which means data is unavailable, indexing is impossible, and you may lose data when you try to bring the same nodes back up; or the cluster may not recover at all without manual intervention. Red is a really bad state to be in for an ES cluster.
Ideally you go from a green state to a yellow state and then back to a green state as you shut down nodes. Yellow means all data is available but not fully replicated because some nodes are down. Shutting down more nodes while the cluster is yellow is a really bad idea. ES clusters try to re-replicate and rebalance shards when they go yellow and go green once that process completes. This takes time. You can turn shard allocation off, and during a rolling restart you probably should, and turn it back on afterwards. Rebalancing also happens when new nodes get added. Finally, when nodes restart they need to check the data they have before they can rejoin the cluster properly. All this takes a lot of time. So, orchestrating this requires polling the cluster for its current state, as in the sketch below.
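To make that concrete, this is roughly the per-node dance as I understand it, using the cluster settings and health APIs. A minimal sketch, assuming an ES endpoint reachable at $ES (the endpoint and the surrounding scripting are mine):

```
ES=http://localhost:9200

# block until the cluster reports the requested status (or better)
wait_for_status() {
  until curl -s "$ES/_cluster/health?wait_for_status=$1&timeout=60s" \
      | grep -q '"timed_out":false'; do
    echo "still waiting for $1 ..."
  done
}

# 1) stop shard allocation so the cluster doesn't start shuffling data
#    around the moment the node disappears
curl -s -XPUT "$ES/_cluster/settings" \
  -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'

# 2) stop and restart the node here (outside the scope of this snippet)

# 3) re-enable allocation and wait for green before touching the next node
curl -s -XPUT "$ES/_cluster/settings" \
  -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
wait_for_status green
```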
So, in short, relying on docker swarm rolling restarts is basically a really bad idea for elasticsearch until there is a way to have some sanity checks in between node restarts. Simply waiting X minutes/hours/days is not good enough.
Luckily, not all elasticsearch nodes are equal: you have master nodes, data nodes and query nodes (and, with the upcoming version, ingest nodes). Out of the box each node can fulfill all roles, but you can choose to specialize nodes to be e.g. just a master or just a query node. The master nodes are similar to swarm manager nodes in the sense that they manage ES cluster state. Just like with swarm, you want to put these nodes on docker hosts that are isolated from the nodes that do all the heavy lifting. The worst thing that can happen in ES is master nodes being unable to talk to each other due to heavy garbage collection resulting from e.g. data or query spikes.
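Specializing a node is just a matter of toggling two settings. A sketch using the official elasticsearch image with 2.x-style command-line settings (the exact flag syntax may differ per version, so treat this as illustrative):

```
# master-only: manages cluster state, holds no data
docker run -d elasticsearch:2 elasticsearch \
  -Des.node.master=true -Des.node.data=false

# data-only: holds shards, never eligible as master
docker run -d elasticsearch:2 elasticsearch \
  -Des.node.master=false -Des.node.data=true

# query/client node: coordinates searches, no data, not master-eligible
docker run -d elasticsearch:2 elasticsearch \
  -Des.node.master=false -Des.node.data=false
```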
So, my plan: use docker labels to mark three different instance types: esmaster (3x), esdata (>=2x), esquery (>=2x). I'm actually considering putting the ES masters on the swarm masters since both are sort of light on memory/CPU requirements and can probably be co-hosted safely to save some cost. Esdata nodes will need a docker volume to store their data and a bit of CPU but should be pretty OK on memory. Esquery does all of the query heavy lifting, so it will need loads of memory and CPU.
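Labeling the swarm nodes would look something like this (the label key and host names are made up):

```
docker node update --label-add es.role=master swarm-manager-1
docker node update --label-add es.role=master swarm-manager-2
docker node update --label-add es.role=master swarm-manager-3
docker node update --label-add es.role=data   worker-1
docker node update --label-add es.role=data   worker-2
docker node update --label-add es.role=query  worker-3
docker node update --label-add es.role=query  worker-4
```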
Then I can manage the master and query nodes as two docker services and do rolling restarts pretty easily for those, since both are stateless. I'd probably space their restarts a minute or two apart for safety, but restarting those should not trigger rebalancing. We can use labels to ensure the containers end up on the right docker hosts, and any non-ES containers can simply talk to the esquery service and rely on swarm load balancing.
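Roughly like this, combining the labels from above with a conservative update policy (again a sketch; names and numbers are mine):

```
docker service create --name esmaster --replicas 3 \
  --constraint 'node.labels.es.role == master' \
  --update-parallelism 1 --update-delay 2m \
  elasticsearch:2 elasticsearch \
  -Des.node.master=true -Des.node.data=false

docker service create --name esquery --replicas 2 \
  --constraint 'node.labels.es.role == query' \
  --update-parallelism 1 --update-delay 2m \
  elasticsearch:2 elasticsearch \
  -Des.node.master=false -Des.node.data=false
```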
However, for internal cluster communication this won't work. The problem is that each ES container needs to know the hostname or IP of at least one other node in the cluster for discovery. Typically what you do is list the three (or five) master node IPs explicitly on each node in the cluster as the cluster addresses; nodes sort of discover all other nodes once they connect to any one of the masters.
In swarm, we need the master nodes to know about each other explicitly. The other ES nodes could just talk to the esmaster service and rely on swarm load balancing to find a master to talk to, but the master containers themselves need a way of finding out what the other master nodes in the swarm are. Somebody will probably write an elasticsearch plugin for this at some point, but for now I don't see any other way than somehow hardcoding this when firing up the master docker service. This should be fairly easy to script using the docker API or a docker node command with a filter on the esmaster label, along the lines of the sketch below.
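I'm not sure the filter syntax is identical across docker versions, so take this with a grain of salt, but the idea would be:

```
# collect the addresses of all nodes labeled as ES masters ...
MASTERS=$(docker node inspect --format '{{ .Status.Addr }}' \
  $(docker node ls -q --filter 'node.label=es.role=master') \
  | paste -sd, -)

# ... and hand them to the master containers as the unicast host list, e.g.
#   -Des.discovery.zen.ping.unicast.hosts=$MASTERS
echo "$MASTERS"
```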
I don't want to use docker services for managing the esdata nodes, though, since every rolling restart of those will need to be coordinated and monitored (this is where all the cluster rebalancing will happen). So the easiest option is to simply start elasticsearch containers with docker run, one by one, and explicitly tell each container which node to run on. That way I can restart them one by one in a controlled way, as sketched below.
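Combined with the health-polling function from the earlier sketch, the controlled restart loop would look something like this (container names are made up, and each docker restart would need to target the right docker host):

```
for node in esdata-1 esdata-2; do
  # freeze shard allocation while the node is down
  curl -s -XPUT "$ES/_cluster/settings" \
    -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'

  docker restart "$node"

  # let the cluster recover fully before touching the next node
  curl -s -XPUT "$ES/_cluster/settings" \
    -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
  wait_for_status green
done
```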
Finally, I'd like to use docker networks to ensure I can control access. Elasticsearch uses two ports: one for internal cluster traffic and one for its REST API. The latter is what needs to be exposed to other containers in the swarm, but only on esquery nodes. Additionally, there is no need for anything to talk to elasticsearch from outside the swarm. So, I'm thinking that only certain nodes in the swarm would have outside connectivity and none of those should be running elasticsearch. Furthermore, I want to use two docker networks (escluster, esquery), where anything that needs access to the esquery service in the swarm needs to be in esquery and all ES containers need to be in the escluster network.
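Setting that up would be something like the following, assuming the esquery containers can be attached to both networks, and an attachable overlay so the docker run data containers can join it:

```
# internal cluster traffic (port 9300 by default) stays on escluster
docker network create --driver overlay --attachable escluster

# the REST API (port 9200 by default) is only reachable for
# application containers that are also on the esquery network
docker network create --driver overlay esquery
```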
I'd love some feedback on any aspect of this. I'm relatively new to swarm, but we have been using docker for our stateless services for a while.