RabbitMQ service pinning using Docker Swarm placement

Hi

I want to run RabbitMQ on Docker Swarm.
It requires that each replica always starts on the same Docker node, because RabbitMQ persists the hostname in its database.
Each Docker node only has local volumes.
In essence I want to pin each .Task.Slot to a particular Docker host.
How can I do that?

My current example stack is below.
It does not work because .Task.Slot cannot be resolved in the placement constraint.

version: '3.7'

services:
  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Task.Slot}}"
    networks:
      - rabbitmq-swarm-net
    ports:
      - 5672:5672
      - 15672:15672
    volumes:
      - rabbitmq-data:/var/lib/rabbitmq
    configs:
      - source: rabbitmq-config-1x
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE="my-cookie"
    deploy:
      replicas: 3
      restart_policy:
        condition: any
      placement:
        constraints:
          - node.labels.nodeNum == {{.Task.Slot}}

networks:
  rabbitmq-swarm-net:

volumes:
  rabbitmq-data:
    name: 'rabbitmq-vol-node-{{.Task.Slot}}'
    external: true
    
configs:
  rabbitmq-config-1x:
    external: true

Regards
Daniel

There are a couple of ways to pin a container to a node.

  1. use global instead of replicated deployments, BUT global deployments don’t have {{.Task.Slot}}, so it will not work in your case. You could combine it with a placement constraint based on a node label to make sure the global deployment is only deployed to specific nodes.
  2. create a separate service for each instance, pinning each one to a node using a node label as placement constraint
  3. if the service supports overriding the config and data directories via environment variables, use a volume backed by NFSv4 and configure the config and data folders as separate subfolders per .Task.Slot on the same volume.

Note: a volume is immutable. Once a volume is created (e.g. with a .Task.Slot in its name), it will never be updated again, unless it is manually removed and re-created. Also, the definition of a volume is local to a node - if a container using it was run on multiple nodes, it needs to be removed on each of those nodes.

This is why putting the .Task.Slot in the external name might look like a good idea, but actually does not really help. On second thought: it might help if the volumes are backed by NFSv4 - you would end up having up to all three named volumes on a node, but only the one matching the task’s .Task.Slot would be mounted into the container. In combination with max_replicas_per_node, it might be a feasible solution as well, if combined with node labels as placement constraints to restrict the nodes it can be placed on.
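
For instance, the external volumes would have to be created manually, up front, on each node where a matching task can run - e.g. (using the intended names from your stack):

docker volume create rabbitmq-vol-node-1   # on the node labeled nodeNum=1
docker volume create rabbitmq-vol-node-2   # on the node labeled nodeNum=2
docker volume create rabbitmq-vol-node-3   # on the node labeled nodeNum=3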

Thanks @meyay

I was considering both 1) and 2), however:

  1. as you mentioned, it does not prevent randomness

  2. this is exactly what I just did. It requires setting up 3 services and creates unwanted duplication in the YAML file.

I think this requires mapping the ports differently. RabbitMQ uses 5672 and 15672, so to avoid conflicts I did as below.
Is that what you were suggesting?
I’m using a custom label nodeNum and specified a replica count of 1 on each service.
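
For reference, the nodeNum labels are assigned on a manager node with commands along these lines (the node names are placeholders for my actual VM names):

docker node update --label-add nodeNum=1 vm-node1
docker node update --label-add nodeNum=2 vm-node2
docker node update --label-add nodeNum=3 vm-node3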

version: '3.7'

services:
  rabbitmq-1:
    image: rabbitmq:3-management
    hostname: "rabbitmq-1"
    networks:
      - rabbitmq-swarm-net
    ports:
      - 5671:5672
      - 15671:15672
    volumes:
      - rabbitmq-data-1:/var/lib/rabbitmq
    configs:
      - source: rabbitmq-config
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE="G5Q3szkyCXk8fDLd3z8e5rX8PyJ"
    deploy:
      replicas: 1
      restart_policy:
        condition: any
      placement:
        constraints:
          - node.labels.nodeNum == 1
  rabbitmq-2:
    image: rabbitmq:3-management
    hostname: "rabbitmq-2"
    networks:
      - rabbitmq-swarm-net
    ports:
      - 5672:5672
      - 15672:15672
    volumes:
      - rabbitmq-data-2:/var/lib/rabbitmq
    configs:
      - source: rabbitmq-config
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE="G5Q3szkyCXk8fDLd3z8e5rX8PyJ"
    deploy:
      replicas: 1
      restart_policy:
        condition: any
      placement:
        constraints:
          - node.labels.nodeNum == 2
  rabbitmq-3:
    image: rabbitmq:3-management
    hostname: "rabbitmq-3"
    networks:
      - rabbitmq-swarm-net
    ports:
      - 5673:5672
      - 15673:15672
    volumes:
      - rabbitmq-data-3:/var/lib/rabbitmq
    configs:
      - source: rabbitmq-config
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE="G5Q3szkyCXk8fDLd3z8e5rX8PyJ"
    deploy:
      replicas: 1
      restart_policy:
        condition: any
      placement:
        constraints:
          - node.labels.nodeNum == 3
networks:
  rabbitmq-swarm-net:

volumes:
  rabbitmq-data-1:
    external: true
  rabbitmq-data-2:
    external: true
  rabbitmq-data-3:
    external: true
    
configs:
  rabbitmq-config:
    external: true

By including .Task.Slot in the volume name my intention was to work around the limitation, but .Task.Slot cannot be resolved there.
I haven’t explored 3) yet, but it would solve the issue.

It would actually, but you would have no way to assign fixed hostnames based on the .Task.Slot placeholder. The identity of each task would be identical, but the data stored in the local volumes would differ.
Due to the indistinguishable hostnames, I considered this as not working in your case.

Just a thought: it might be possible to pull this one off if the hostname uses the placeholder for the node hostname and the ports are published with mode: host (see further below). Then you could rely on the volume data being stored locally.

No need to map the ports differently - make sure to use the long syntax for port publishing with mode: host, to bypass the ingress proxy. As a result, the host port will be bound directly to the container on the node.
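
For example, for the AMQP port the long syntax would look like this (5672 is just the RabbitMQ default):

    ports:
      - target: 5672
        published: 5672
        protocol: tcp
        mode: host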

Number 3 requires the volumes to be backed by NFSv4 (which must be reachable from all nodes a replica can potentially be deployed to). I know it seems counter-intuitive, as RabbitMQ already replicates its data under the hood, so ideally you really just need the container’s volume data to be local. It is still a convenient and flexible solution.
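
A sketch of what an NFSv4-backed named volume could look like - the server address and export path are placeholders you would need to adjust:

volumes:
  rabbitmq-data:
    driver: local
    driver_opts:
      type: nfs
      # addr and the export path below are placeholders for your NFS server
      o: "addr=10.0.0.10,rw,nfsvers=4"
      device: ":/exports/rabbitmq"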

Thank you again for your reply

This is invaluable information and allows me to move forward.

I implemented mode: host already and it works as expected.
The good part is that the application side does not have to concern itself with different ports.

I agree 3) has many advantages, one of which is the flexibility to move replicas around.
Thanks for the suggestion regarding environment variables.
RabbitMQ has very good support for configuration via environment variables.

I will give it another go and try to reduce the duplication in the YAML now.

Please keep us posted about the approach you finally implemented, so others that stumble across this post don’t just find our discussion, but also your configuration as an example for their own implementation.

If you consider using approach number 2, you could leverage extension fields to de-duplicate repeating configuration elements. If you are curious, you can get a deep dive in this blog post.
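
A minimal sketch of the idea, using an extension field plus a YAML anchor to share the common settings (service names and labels follow your earlier stack):

version: '3.7'

# shared settings, referenced below via the YAML anchor
x-rabbitmq-common: &rabbitmq-common
  image: rabbitmq:3-management
  networks:
    - rabbitmq-swarm-net
  configs:
    - source: rabbitmq-config
      target: /etc/rabbitmq/rabbitmq.conf

services:
  rabbitmq-1:
    <<: *rabbitmq-common
    hostname: "rabbitmq-1"
    volumes:
      - rabbitmq-data-1:/var/lib/rabbitmq
    deploy:
      placement:
        constraints:
          - node.labels.nodeNum == 1
  # rabbitmq-2 and rabbitmq-3 follow the same pattern with their own volume and nodeNum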

Good to know about extension fields.
This may come in handy.

In the meantime I managed to work out improvements to the approach used previously.
In essence, the only field I can reliably inject into the compose file to indicate which machine a replica is running on is .Node.Hostname. Node labels would be nice, but they don’t get expanded.
The trick is also to ditch .Task.Slot altogether.
My compose file now

version: '3.7'

services:
  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Node.Hostname}}"
    networks:
      - rabbitmq-swarm-net
    ports:
      - target: 5672
        published: 5672
        protocol: tcp
        mode: host
      - target: 15672
        published: 15672
        protocol: tcp
        mode: host
    volumes:
      - rabbitmq-data-1:/var/lib/rabbitmq/vm-node1/mnesia
      - rabbitmq-data-2:/var/lib/rabbitmq/vm-node2/mnesia
      - rabbitmq-data-3:/var/lib/rabbitmq/vm-node3/mnesia
    configs:
      - source: rabbitmq-config
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE=123G5Q3szkyCXk8fDLd3z8e5rX8PyJ
      - RABBITMQ_MNESIA_BASE=/var/lib/rabbitmq/{{.Node.Hostname}}/mnesia
    deploy:
      replicas: 3
      restart_policy:
        condition: any
      placement:
        constraints:
          - node.labels.nodeSet_1_2_3 == true
          
networks:
  rabbitmq-swarm-net:

volumes:
  rabbitmq-data-1:
    external: true
  rabbitmq-data-2:
    external: true
  rabbitmq-data-3:
    external: true
    
configs:
  rabbitmq-config:
    name: "rabbitmq-config-nodes-1_2_3"
    external: true

My rabbitmq config file for reference

vm_memory_high_watermark.absolute = 1024MiB
disk_free_limit.absolute = 500MB

loopback_users = none

cluster_partition_handling = pause_minority

# Peer discovery
cluster_formation.peer_discovery_backend = classic_config

cluster_formation.classic_config.nodes.1 = rabbit@rabbitmq-vm-node1
cluster_formation.classic_config.nodes.2 = rabbit@rabbitmq-vm-node2
cluster_formation.classic_config.nodes.3 = rabbit@rabbitmq-vm-node3

My VMs have names vm-node1, vm-node2, vm-node3.
Each VM mounts all three volumes but only uses one of them.
In addition, the key is to override RABBITMQ_MNESIA_BASE.
Thanks for steering me in the right direction.

Regardless of which replica lands on a node, it will always use the same local volume.
The downside is that Docker Swarm creates each set of volumes on each node, so in my case that is 9 volumes in total, although only a third of them are actually needed.

The final result is a stack with only one service and a simpler compose file.
A new node can be added just by adding another volume.

I don’t think I can improve on it any further at this point.

Regards
Daniel

Thanks for sharing your current state!

Are you using a volume backed by NFSv4? I am not sure what type of consensus and quorum RabbitMQ uses to replicate its data amongst its instances, but if it is a majority-based one, you don’t want to use locally stored volumes the way your compose file does at the moment: if you lose quorum, it may happen that you can’t recover it, as the local volumes might all hold different states of the data. Your current approach REALLY needs a volume backed by a remote file share. If you already use such a volume, then there is no need to define a different volume per service, as the environment variable already takes care of letting each replica write into its own dedicated subfolder.

Note: your approach does result in one container per node due to the ports published in mode: host. You might want to make this explicit by using max_replicas_per_node; it saves the scheduler trial-and-error cycles when trying to schedule a new replica in case of an error. Or you simply switch from a replicated to a global deployment.

I’m not using NFSv4, only local volumes.
Replication in RabbitMQ depends on the setup and the queue types used. For quorum queues and 3 nodes it means 3 copies with a tolerance of 1 node failure.
I don’t think the current compose file differs much from having 3 bare-metal servers with RabbitMQ installed as a service and clustered manually.
I agree, however, that NFSv4 may be the way to go to simplify the configuration. I think it adds another network hop, and latency, into the equation.
For a production setup this may be the more sensible choice.

Thanks for the reminder about max_replicas_per_node and global.
I had max_replicas_per_node in place previously and took it out while debugging the problem.
Global should also work now, considering I’m not using .Task.Slot.

Daniel

Regarding the consensus: it seems to be what I thought - either Raft, Paxos or something similar. This is a good reason to modify your approach and use mode: global to make sure the container that runs on a node always has the same configuration and uses the same volume (identity-wise). Once this is done, it really doesn’t differ much.

If you use mode: global, you don’t need max_replicas_per_node, as by definition in global mode one instance will be placed on each node that satisfies the deployment constraints.

On second thought: with your current compose file you already achieved that a container is always created on a node with the same configuration and volume.

I would still make it mode: global and just use the “same” local volume (each node’s volume would only hold the data of the instance running on that node). With global you can even drop the RABBITMQ_MNESIA_BASE environment variable, as it is really only useful if your volume is on a remote share used by all nodes.

Yes, it’s a Raft-based consensus.
I tested with max_replicas_per_node and it works as expected.
The reason it wasn’t working previously was that my YAML specified version 3.7, not 3.8.
Now I updated it to global and it works as expected, likely with a faster startup time for the reasons you mentioned.
RABBITMQ_MNESIA_BASE allows me to force RabbitMQ to always use the same mounted volume on the node, so I think it needs to be there.
Without it, the default /var/lib/rabbitmq/mnesia would be used, which would be a local Docker volume.
Am I missing something?

Thanks again for your suggestions, it’s been great.
Daniel

True, if your volume were backed by a remote share.
Each node mounts /var/lib/docker/volumes/{volume name}/_data into the target path in the container. Since the data is local to the node, only the instance on the current node will write into that folder. Swarm does not replicate volume data amongst nodes.

There is no harm in leaving the volumes the way you do, but at the same time it is not necessary.

You can verify my claim using ls -R /var/lib/docker/volumes/rabbitmq-data*/_data (as root) on each node and see for yourself what’s inside :slight_smile:

I think I got it. I was staring at it trying to understand, and it finally clicked that I don’t really need 3 volumes.
My current compose file is below.
The nice side effect is that there are no more duplicated volumes created, so I have only 3 volumes in total, one per VM.

version: '3.8'

services:
  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Node.Hostname}}"
    networks:
      - rabbitmq-swarm-net
    ports:
      - target: 5672
        published: 5672
        protocol: tcp
        mode: host
      - target: 15672
        published: 15672
        protocol: tcp
        mode: host
    volumes:
      - rabbitmq-data:/var/lib/rabbitmq
    configs:
      - source: rabbitmq-config
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE=123G5Q3szkyCXk8fDLd3z8e5rX8PyJ
    deploy:
      mode: global
      restart_policy:
        condition: any
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.rabbitNode == true
          
networks:
  rabbitmq-swarm-net:

volumes:
  rabbitmq-data:

configs:
  rabbitmq-config:
    name: "rabbitmq-config-nodes-1_2_4"
    external: true

One thing I could improve further is getting rid of the hardcoded cookie and using Docker secrets.
This probably means I need to hardcode a uid/gid, which I’ve had bad experiences with in the past.

Thank you very much again. Much appreciated
Daniel

You can drop max_replicas_per_node: 1, as mode: global only deploys a single instance per node. It is only useful when mode: replicated is used.

Does the image support passing the RABBITMQ_ERLANG_COOKIE as a secret? Secrets are mounted as files into the container filesystem using tmpfs. The entrypoint script needs to read the file and either set the environment variable or render it into the config by itself. The long syntax allows setting the target to an absolute path, even though the documentation does not mention it.

Your compose file now shows good craftsmanship :slight_smile:

Thanks, I forgot to remove max_replicas_per_node when I switched to global.

The official image description on Docker Hub says it’s enough to use it like this:

docker service create ... --secret source=my-erlang-cookie,target=/var/lib/rabbitmq/.erlang.cookie ... rabbitmq

since that target location is where RabbitMQ reads the cookie from. But setting a uid/gid may be necessary, which matches my experience in the past. Also, I think the service needs user: set to the same value.
I’m not sure where the IDs are coming from. I’ve seen examples with 1000/1001.
There are also potentially some complications with the mapping from the hosts?
I need to educate myself on that topic.
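
In compose terms, I think the long secret syntax would look something like this - the uid/gid values are just my guess and would need to match the user the RabbitMQ process runs as:

services:
  rabbitmq:
    secrets:
      - source: my-erlang-cookie
        target: /var/lib/rabbitmq/.erlang.cookie
        # guessed values - must match the uid/gid of the rabbitmq process
        uid: "999"
        gid: "999"
        mode: 0400

secrets:
  my-erlang-cookie:
    external: true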

Thanks for the heads up on the long syntax

Daniel

It really depends on how the image is designed - if the RabbitMQ processes are started with an unprivileged user, you might need to tinker around with user: and set a uid/gid for the secret - you just need to make sure the user: declaration and the owner of the secret align. You should have no problem with the uid/gid when using a fresh volume - though it might be a problem if pre-existing data is owned by a different uid/gid. You could enter a running container and use ps to check which id is used to run the process - if it’s 0, there is no need to tinker around - if it’s >0, you indeed might want to set user: and the uid/gid for the secret.
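
One way to check without entering the container is docker top, which lists the processes of a running container together with the user they run as (the name filter below is just an assumption):

docker top $(docker ps -q --filter name=rabbitmq)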

The complications you refer to are the typical problem when a bind mount is used (a host path mapped to a container path), where the owner of the host path must align with the uid/gid of the process inside the container.
