I want to run RabbitMQ on Docker Swarm
It requires that each replica always starts on the same Docker node, because RabbitMQ persists the hostname in its database.
Each Docker node has local volume only.
In essence I want to pin .Task.Slot to a particular Docker host.
How to do that?
My current example stack is below
It does not work because .Task.Slot cannot be resolved in placement constraints.
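For clarity, the idea was roughly like this (a trimmed illustration rather than the full stack; the nodeNum label name is just an example):

```yaml
services:
  rabbitmq:
    image: rabbitmq:3-management
    deploy:
      replicas: 3
      placement:
        constraints:
          # this is what fails: swarm does not expand templates
          # like {{.Task.Slot}} inside placement constraints
          - node.labels.nodeNum == {{.Task.Slot}}
```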
There are a couple of ways to pin a container to a node:

1. Use global instead of replicated deployments. BUT global deployments don’t have {{.Task.Slot}}, so it will not work in your case. You could combine it with a placement constraint based on a node tag, to make sure the global deployment is only deployed to specific nodes.
2. Create a separate service for each instance, pinning it to a node using a node tag for placement.
3. If the service allows overriding the config and data directories via environment settings, use a volume backed by nfsv4 and configure the config and data folders as separate subfolders per .Task.Slot on the same volume.
Note: a volume is immutable. Once a volume is created (e.g. with a .Task.Slot in its name), it will never be updated again, unless it’s manually removed and re-created. Also, the definition of a volume is local to a node - if a container using it was run on multiple nodes, it needs to be removed on each of those nodes.
This is why putting the .Task.Slot in the external name might look like a good idea, but actually does not really help. On second thought: it might help if the volumes are backed by nfsv4 - you would end up with up to all three named volumes on a node, but only one of them would be mounted to the container with the matching .Task.Slot. In combination with max_replicas_per_node, it might be a feasible solution as well, if combined with node tags as a placement constraint to restrict the nodes it can be placed on.
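For illustration, a rough sketch of what option 3 could look like (the NFS address and export path are placeholders, and the templated environment value assumes swarm expands placeholders in environment settings):

```yaml
version: "3.8"   # max_replicas_per_node needs compose file format 3.8

services:
  rabbitmq:
    image: rabbitmq:3-management
    environment:
      # per-slot data directory on the shared volume
      RABBITMQ_MNESIA_BASE: /data/slot-{{.Task.Slot}}/mnesia
    volumes:
      - rabbitmq-nfs:/data
    deploy:
      replicas: 3
      placement:
        max_replicas_per_node: 1

volumes:
  rabbitmq-nfs:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=10.0.0.10,rw,nfsvers=4"
      device: ":/export/rabbitmq"
```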
This is exactly what I just did. It requires setting up 3 services and creates unwanted duplication in the yml file.
I think this requires mapping ports differently. Rabbit uses 5672 and 15672, so to avoid conflicts I did as below.
Is that what you were suggesting?
I’m using a custom node label nodeNum, and I also specified a replica count of 1 on each service.
By including .Task.Slot in the name, my intention was to work around the limitation, but .Task.Slot cannot be resolved there.
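One of the three services looks roughly like this (a trimmed reconstruction rather than the exact yml; the second and third services publish offset host ports such as 5673/15673 and 5674/15674):

```yaml
  rabbitmq-1:
    image: rabbitmq:3-management
    hostname: rabbitmq-1
    ports:
      - "5672:5672"
      - "15672:15672"
    volumes:
      - rabbitmq-data-1:/var/lib/rabbitmq
    deploy:
      replicas: 1
      placement:
        constraints:
          # custom node label applied beforehand with: docker node update --label-add nodeNum=1 <node>
          - node.labels.nodeNum == 1
```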
I haven’t explored 3) yet, but it would solve the issue
It would actually, but you would have no way to assign fixed hostnames based on the .Task.Slot placeholder. The identity of each task would be identical, but the data stored in the local volume would differ.
Due to the indistinguishable hostnames, I considered this as non-working in your case.
Just a thought: it might be possible to pull this one off if the hostname uses the placeholder for the node’s hostname and the port is published with mode: host (see further below). Then you could rely on volume data being stored locally.
No need to map the ports differently; make sure to use the port publishing long syntax with mode: host to bypass the ingress proxy. As a result, the host port will be bound directly to the container on the node.
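For example, for the two RabbitMQ ports:

```yaml
    ports:
      - target: 5672
        published: 5672
        protocol: tcp
        mode: host
      - target: 15672
        published: 15672
        protocol: tcp
        mode: host
```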
Number 3 requires the volumes to be backed by nfsv4 (which must be reachable from all nodes a replica could potentially be deployed to). I know that it seems counterintuitive, as rabbit already replicates its data under the hood, so ideally you really just need the container’s volume data to be local. It is still a convenient and flexible solution.
This is invaluable information and allows me to move forward.
I implemented mode: host already and it works as expected.
The good part of it is that the application side does not have to concern itself with different ports.
I agree 3) has many advantages, one of which is being able to move replicas flexibly.
Thanks for the suggestion with environment variables.
RabbitMQ has very good support for configuration via environment variables.
I will give it another go and try to reduce the duplication in the yml now.
Please keep us posted about the approach you finally implemented, so others that stumble across this post don’t just find our discussion, but also find your configuration as an example for their own implementation.
If you consider using approach number 2, you could leverage extension fields to de-duplicate repeating configuration elements. If you are curious, you can get a deep dive in this blog post.
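A quick sketch of the idea (extension fields need compose file format 3.4 or newer; the image and label names here are just examples):

```yaml
x-rabbitmq-common: &rabbitmq-common
  image: rabbitmq:3-management
  environment:
    RABBITMQ_ERLANG_COOKIE: changeme

services:
  rabbitmq-1:
    # merge the shared settings, then add the per-instance bits
    <<: *rabbitmq-common
    deploy:
      placement:
        constraints:
          - node.labels.nodeNum == 1

  rabbitmq-2:
    <<: *rabbitmq-common
    deploy:
      placement:
        constraints:
          - node.labels.nodeNum == 2
```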
Good to know about extension fields.
This may come in handy.
In the meantime I managed to work out improvements to the approach used previously.
In essence the only field I can reliably inject into docker compose to indicate the machine the replica is running on is Hostname. Node labels would be nice but they don’t get expanded.
Also the trick is to ditch the .Task.Slot altogether
My VMs have names vm-node1, vm-node2, vm-node3
Each VM mounts all volumes, but uses only one.
In addition, a key point is to override RABBITMQ_MNESIA_BASE.
Thanks for steering me in the right direction.
Regardless which replica lands on the node, it will always use the same local volume.
The downside is that Docker Swarm will create each set of volumes on each node, so in my case that will be 9 volumes in total, although only a third of them are actually needed.
The final result is a stack with only one service and simpler compose file.
A new node can be added just by adding another volume.
I don’t think I can improve on it any further at this point.
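In outline, it looks something like this (trimmed and with paths simplified, so not the exact file; the templated hostname and RABBITMQ_MNESIA_BASE values are the key parts):

```yaml
version: "3.7"

services:
  rabbitmq:
    image: rabbitmq:3-management
    # the node's hostname is the one placeholder that reliably identifies
    # the machine, so the RabbitMQ node identity follows the VM
    hostname: "{{.Node.Hostname}}"
    environment:
      # point the data directory at the sub-path backed by this node's volume
      RABBITMQ_MNESIA_BASE: /data/{{.Node.Hostname}}/mnesia
    ports:
      - target: 5672
        published: 5672
        mode: host
      - target: 15672
        published: 15672
        mode: host
    volumes:
      # every node creates all three volumes, but only the one matching
      # the node's hostname is actually written to
      - rabbitmq-vm-node1:/data/vm-node1
      - rabbitmq-vm-node2:/data/vm-node2
      - rabbitmq-vm-node3:/data/vm-node3
    deploy:
      replicas: 3

volumes:
  rabbitmq-vm-node1:
  rabbitmq-vm-node2:
  rabbitmq-vm-node3:
```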
Are you using a volume backed by nfsv4? I am not sure what type of consensus and quorum RabbitMQ uses to replicate its data amongst its instances, but if it’s a majority-based one, you don’t want to use locally stored volumes the way your compose file does at the moment: it may happen that when you lose quorum you can’t recover it, as the local volumes might hold different states on different nodes. Your current approach REALLY needs a volume backed by a remote file share. If you already use such a volume, then there is no need to define a different volume per service, as the environment variable already takes care of letting each replica write into its own dedicated subfolder.
Note: your approach already results in at most one container per node due to the ports published in mode: host. You might want to make it more explicit by using max_replicas_per_node; it saves the scheduler trial-and-error cycles when trying to schedule a new replica in case of an error. Or you could simply switch from a replicated to a global deployment.
I’m not using nfsv4, only local volumes
Replication in RabbitMQ depends on the setup and the queue types used. For quorum queues and 3 nodes it means 3 copies, with a tolerance of 1 node failure.
I don’t think the current compose differs much from having 3 bare-metal servers with RabbitMQ installed as a service and clustered manually.
I agree, however, that nfsv4 may be the way to go to simplify the configuration, although I think it adds another network hop and latency into the equation.
For production setup this may be more sensible.
Thanks for the reminder about max_replicas_per_node and global.
I had max_replicas_per_node in place previously and took it out for debugging the problem.
Global should also work now considering I’m not using .Task.Slot
Regarding the consensus: it seems to be what I thought - either Raft, Paxos or something similar. This is a good reason to modify your approach and use mode: global to make sure the container that runs on a node always has the same configuration and uses the same volume (identity-wise). Once this is done, it really doesn’t differ much from the bare-metal setup you described.
If you use mode: global, you don’t need max_replicas_per_node as by definition in global mode one instance will be placed on each node that satisfies the deployment constraint.
On second thought: you already achieved that a container is always created on a node with the same configuration and volume with your current compose file.
I still would make it mode: global and just use the “same” local volume (each node’s volume would only hold the data of the instance running on that node). With global you can even drop the RABBITMQ_MNESIA_BASE environment variable, as it really is only useful if your volume is on a remote share used by all nodes.
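Roughly like this, showing only the relevant bits (the image tag and volume name are just examples):

```yaml
  rabbitmq:
    image: rabbitmq:3-management
    hostname: "{{.Node.Hostname}}"
    volumes:
      # same volume name everywhere, but each node's local copy of the
      # volume only ever holds the data of the instance on that node
      - rabbitmq-data:/var/lib/rabbitmq
    deploy:
      mode: global
```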
Yes, it’s raft base consensus.
I tested with max_replicas_per_node and it works as expected.
The reason it wasn’t working previously was that my yml specified version 3.7, not 3.8.
Now I have updated it to global and it works as expected, and likely with a faster startup time for the reasons you mentioned.
RABBITMQ_MNESIA_BASE allows me to force RabbitMQ to always go to the same mounted volume on the node, so I think it needs to be there.
Without it, it would use the default /var/lib/rabbitmq/mnesia, which would be a local Docker volume.
Am I missing something?
Thanks again for your suggestions. It’s been great.
Daniel
True, if your volume were backed by a remote share.
Each node mounts /var/lib/docker/volumes/{volume name}/_data into the target path in the container. Since the data is local to the node, only the current node’s instance will write into that folder. Swarm doesn’t replicate volume data amongst nodes.
There is no harm in leaving the volumes the way you do, but at the same time it is not necessary.
You can verify my claim using ls -R /var/lib/docker/volumes/rabbitmq-data*/_data (as root) on each node and see for yourself what’s inside
I think I got it. I was staring at it trying to understand and it finally clicked that I don’t really need 3 volumes
My current compose below
The nice side-effect is that there are no more duplicated volumes created, so I have only 3 volumes in total, one per VM.
One thing I could improve further is getting rid of that hardcoded cookie and using Docker secrets.
This probably means I need to hardcode the uid/gid, which I have had bad experiences with in the past.
Thank you very much again. Much appreciated
Daniel
You can drop max_replicas_per_node: 1, as mode: global only deploys a single instance per node. It is only useful if mode: replicated is used.
Does the image support passing the RABBITMQ_ERLANG_COOKIE as a secret? Secrets are mounted as files into the container using a tmpfs filesystem. The entrypoint script needs to read the file and set the environment variable, or render it into the config by itself. The long syntax allows setting the target to an absolute path, even though the documentation does not mention it.
Thanks, I forgot to remove max_replicas_per_node when I switched to global.
The official image page on Docker Hub says it’s enough to use it like this:
docker service create ... --secret source=my-erlang-cookie,target=/var/lib/rabbitmq/.erlang.cookie ... rabbitmq
since that target location is where RabbitMQ reads the cookie from. But setting a uid/gid may be necessary, which is my experience from the past. I also think the service needs user: set to the same value.
I’m not sure where the IDs are coming from. I’ve seen examples with 1000/1001.
There are also potentially some complications with uid/gid mapping from the host?
I need to educate myself on that topic.
It really depends on how the image is designed - if the rabbitmq processes are started with an unprivileged user, you might need to tinker around with user: and set a uid/gid for the secret - you just need to make sure the user: declaration and the owner of the secret align. You should have no problem with uid/gid when using a fresh volume - though it might be a problem if pre-existing data is owned by a different uid/gid. You could enter a running container and use ps to check which id is used to run the process - if it’s 0, there is no need to tinker around - if it’s >0, you indeed might want to set user: and set the uid/gid for the secret.
The complication you refer to is the typical problem when a bind (where a host path is mapped to a container path) is used, where the owner of the host path must align with the uid/gid of the process inside the container.
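Something along these lines (the uid/gid of 999 is just an example; check what the rabbitmq process inside the image actually runs as):

```yaml
services:
  rabbitmq:
    image: rabbitmq:3-management
    # run the process as the same uid/gid that owns the secret file
    user: "999:999"
    secrets:
      - source: my-erlang-cookie
        target: /var/lib/rabbitmq/.erlang.cookie
        uid: "999"
        gid: "999"
        mode: 0600

secrets:
  my-erlang-cookie:
    external: true
```

The secret itself would be created beforehand, e.g. with docker secret create my-erlang-cookie cookie.txt.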
@ivenhov lovely compose file, would you please share your config? I have a RabbitMQ cluster setup with Consul as the service discovery, but I’ve been having issues with the discovery and have been looking at other solutions.
@ivenhov are you having issues with long-running connections when the connection is idle for more than 15 minutes? In my mixed Windows/Linux environment, the NAT VFP forgets the connection after 4 minutes when I use mode: global and ports: mode: host.