RabbitMQ service pinning using Docker Swarm placement

Hi

I want to run RabbitMQ on Docker Swarm.
It requires that each replica always starts on the same Docker node, because RabbitMQ persists the hostname in its database.
Each Docker node only has local volumes.
In essence I want to pin each .Task.Slot to a particular Docker host.
How can I do that?

My current example stack is below.
It does not work because .Task.Slot cannot be resolved in the placement constraint.

version: '3.7'

services:
  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Task.Slot}}"
    networks:
      - rabbitmq-swarm-net
    ports:
      - 5672:5672
      - 15672:15672
    volumes:
      - rabbitmq-data:/var/lib/rabbitmq
    configs:
      - source: rabbitmq-config-1x
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE="my-cookie"
    deploy:
      replicas: 3
      restart_policy:
        condition: any
      placement:
        constraints:
          - node.labels.nodeNum == {{.Task.Slot}}

networks:
  rabbitmq-swarm-net:

volumes:
  rabbitmq-data:
    name: 'rabbitmq-vol-node-{{.Task.Slot}}'
    external: true
    
configs:
  rabbitmq-config-1x:
    external: true

Regards
Daniel

There are a couple of ways to pin a container to a node.

  1. use global instead of replicated deployments, BUT global deployments don’t have {{.Task.Slot}}, so it will not work in your case. You could combine it with a placement constraint based on a node label to make sure the global deployment is only deployed to specific nodes.
  2. create a separate service for each instance, pinning each one to a node using a node label as placement constraint
  3. if the service supports overriding the config and data directories via environment variables, use a volume backed by NFSv4 and configure the config and data folders as separate subfolders per .Task.Slot on the same volume.

Note: a volume is immutable. Once a volume is created (e.g. with a .Task.Slot in its name), it will never be updated again, unless it is manually removed and re-created. Also, the definition of a volume is local to a node - if a container using it was run on multiple nodes, it needs to be removed on each of those nodes.

This is why putting the .Task.Slot in the external name might look like a good idea, but actually does not really help. On second thought: it might help if the volumes are backed by NFSv4 - you would end up having up to all three named volumes on a node, but only the one matching the task’s .Task.Slot would be mounted into the container. In combination with max_replicas_per_node, it might be a feasible solution as well, if combined with node labels as placement constraints to restrict the nodes it can be placed on.
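
For instance, the external volumes would have to be created manually, up front, on each node where a matching task can run - e.g. (using the intended names from your stack):

docker volume create rabbitmq-vol-node-1   # on the node labeled nodeNum=1
docker volume create rabbitmq-vol-node-2   # on the node labeled nodeNum=2
docker volume create rabbitmq-vol-node-3   # on the node labeled nodeNum=3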

Thanks @meyay

I was considering both 1) and 2), however:

  1. as you mentioned, it does not prevent randomness

  2. this is exactly what I just did. It requires setting up 3 services and creates unwanted duplication in the YAML file.

I think this requires mapping the ports differently. RabbitMQ uses 5672 and 15672, so to avoid conflicts I did as below.
Is that what you were suggesting?
I’m using a custom label nodeNum and specified a replica count of 1 on each service.
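
For reference, the nodeNum labels are assigned on a manager node with commands along these lines (the node names are placeholders for my actual VM names):

docker node update --label-add nodeNum=1 vm-node1
docker node update --label-add nodeNum=2 vm-node2
docker node update --label-add nodeNum=3 vm-node3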

version: '3.7'

services:
  rabbitmq-1:
    image: rabbitmq:3-management
    hostname: "rabbitmq-1"
    networks:
      - rabbitmq-swarm-net
    ports:
      - 5671:5672
      - 15671:15672
    volumes:
      - rabbitmq-data-1:/var/lib/rabbitmq
    configs:
      - source: rabbitmq-config
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE="G5Q3szkyCXk8fDLd3z8e5rX8PyJ"
    deploy:
      replicas: 1
      restart_policy:
        condition: any
      placement:
        constraints:
          - node.labels.nodeNum == 1
  rabbitmq-2:
    image: rabbitmq:3-management
    hostname: "rabbitmq-2"
    networks:
      - rabbitmq-swarm-net
    ports:
      - 5672:5672
      - 15672:15672
    volumes:
      - rabbitmq-data-2:/var/lib/rabbitmq
    configs:
      - source: rabbitmq-config
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE="G5Q3szkyCXk8fDLd3z8e5rX8PyJ"
    deploy:
      replicas: 1
      restart_policy:
        condition: any
      placement:
        constraints:
          - node.labels.nodeNum == 2
  rabbitmq-3:
    image: rabbitmq:3-management
    hostname: "rabbitmq-3"
    networks:
      - rabbitmq-swarm-net
    ports:
      - 5673:5672
      - 15673:15672
    volumes:
      - rabbitmq-data-3:/var/lib/rabbitmq
    configs:
      - source: rabbitmq-config
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE="G5Q3szkyCXk8fDLd3z8e5rX8PyJ"
    deploy:
      replicas: 1
      restart_policy:
        condition: any
      placement:
        constraints:
          - node.labels.nodeNum == 3
networks:
  rabbitmq-swarm-net:

volumes:
  rabbitmq-data-1:
    external: true
  rabbitmq-data-2:
    external: true
  rabbitmq-data-3:
    external: true
    
configs:
  rabbitmq-config:
    external: true

By including .Task.Slot in the volume name my intention was to work around the limitation, but .Task.Slot cannot be resolved there.
I haven’t explored 3) yet, but it would solve the issue.

It would actually, but you would have no way to assign fixed hostnames based on the .Task.Slot placeholder. The identity of each task would be identical, but the data stored in the local volumes would differ.
Due to the indistinguishable hostnames, I considered this as not working in your case.

Just a thought: it might be possible to pull this one off if the hostname uses the placeholder for the node hostname and the ports are published with mode: host (see further below). Then you could rely on the volume data being stored locally.

No need to map the ports differently - make sure to use the long syntax for port publishing with mode: host, to bypass the ingress proxy. As a result, the host port will be bound directly to the container on the node.
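
For example, for the AMQP port the long syntax would look like this (5672 is just the RabbitMQ default):

    ports:
      - target: 5672
        published: 5672
        protocol: tcp
        mode: host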

Number 3 requires the volumes to be backed by NFSv4 (which must be reachable from all nodes a replica can potentially be deployed to). I know it seems counter-intuitive, as RabbitMQ already replicates its data under the hood, so ideally you really just need the container’s volume data to be local. It is still a convenient and flexible solution.
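
A sketch of what an NFSv4-backed named volume could look like - the server address and export path are placeholders you would need to adjust:

volumes:
  rabbitmq-data:
    driver: local
    driver_opts:
      type: nfs
      # addr and the export path below are placeholders for your NFS server
      o: "addr=10.0.0.10,rw,nfsvers=4"
      device: ":/exports/rabbitmq"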

Thank you again for your reply

This is invaluable information and allows me to move forward.

I implemented mode: host already and it works as expected.
The good part is that the application side does not have to concern itself with different ports.

I agree 3) has many advantages, one of which is the flexibility to move replicas around.
Thanks for the suggestion regarding environment variables.
RabbitMQ has very good support for configuration via environment variables.

I will give it another go and try to reduce the duplication in the YAML now.

Please keep us posted about the approach you finally implemented, so others that stumble across this post don’t just find our discussion, but also your configuration as an example for their own implementation.

If you consider using approach number 2, you could leverage extension fields to de-duplicate repeating configuration elements. If you are curious, you can get a deep dive in this blog post.
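
A minimal sketch of the idea, using an extension field plus a YAML anchor to share the common settings (service names and labels follow your earlier stack):

version: '3.7'

# shared settings, referenced below via the YAML anchor
x-rabbitmq-common: &rabbitmq-common
  image: rabbitmq:3-management
  networks:
    - rabbitmq-swarm-net
  configs:
    - source: rabbitmq-config
      target: /etc/rabbitmq/rabbitmq.conf

services:
  rabbitmq-1:
    <<: *rabbitmq-common
    hostname: "rabbitmq-1"
    volumes:
      - rabbitmq-data-1:/var/lib/rabbitmq
    deploy:
      placement:
        constraints:
          - node.labels.nodeNum == 1
  # rabbitmq-2 and rabbitmq-3 follow the same pattern with their own volume and nodeNum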

Good to know about extension fields.
This may come in handy.

In the meantime I managed to work out improvements to the approach used previously.
In essence, the only field I can reliably inject into the compose file to indicate which machine a replica is running on is .Node.Hostname. Node labels would be nice, but they don’t get expanded.
The trick is also to ditch .Task.Slot altogether.
My compose file now

version: '3.7'

services:
  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Node.Hostname}}"
    networks:
      - rabbitmq-swarm-net
    ports:
      - target: 5672
        published: 5672
        protocol: tcp
        mode: host
      - target: 15672
        published: 15672
        protocol: tcp
        mode: host
    volumes:
      - rabbitmq-data-1:/var/lib/rabbitmq/vm-node1/mnesia
      - rabbitmq-data-2:/var/lib/rabbitmq/vm-node2/mnesia
      - rabbitmq-data-3:/var/lib/rabbitmq/vm-node3/mnesia
    configs:
      - source: rabbitmq-config
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE=123G5Q3szkyCXk8fDLd3z8e5rX8PyJ
      - RABBITMQ_MNESIA_BASE=/var/lib/rabbitmq/{{.Node.Hostname}}/mnesia
    deploy:
      replicas: 3
      restart_policy:
        condition: any
      placement:
        constraints:
          - node.labels.nodeSet_1_2_3 == true
          
networks:
  rabbitmq-swarm-net:

volumes:
  rabbitmq-data-1:
    external: true
  rabbitmq-data-2:
    external: true
  rabbitmq-data-3:
    external: true
    
configs:
  rabbitmq-config:
    name: "rabbitmq-config-nodes-1_2_3"
    external: true

My rabbitmq config file for reference

vm_memory_high_watermark.absolute = 1024MiB
disk_free_limit.absolute = 500MB

loopback_users = none

cluster_partition_handling = pause_minority

# Peer discovery
cluster_formation.peer_discovery_backend = classic_config

cluster_formation.classic_config.nodes.1 = rabbit@rabbitmq-vm-node1
cluster_formation.classic_config.nodes.2 = rabbit@rabbitmq-vm-node2
cluster_formation.classic_config.nodes.3 = rabbit@rabbitmq-vm-node3

My VMs have names vm-node1, vm-node2, vm-node3.
Each VM mounts all three volumes but only uses one of them.
In addition, the key is to override RABBITMQ_MNESIA_BASE.
Thanks for steering me in the right direction.

Regardless of which replica lands on a node, it will always use the same local volume.
The downside is that Docker Swarm creates each set of volumes on each node, so in my case that is 9 volumes in total, although only a third of them are actually needed.

The final result is a stack with only one service and a simpler compose file.
A new node can be added just by adding another volume.

I don’t think I can improve on it any further at this point.

Regards
Daniel

Thanks for sharing your current state!

Are you using a volume backed by NFSv4? I am not sure what type of consensus and quorum RabbitMQ uses to replicate its data amongst its instances, but if it is a majority-based one, you don’t want to use locally stored volumes the way your compose file does at the moment: if you lose quorum, it may happen that you can’t recover it, as the local volumes might all hold different states of the data. Your current approach REALLY needs a volume backed by a remote file share. If you already use such a volume, then there is no need to define a different volume per service, as the environment variable already takes care of letting each replica write into its own dedicated subfolder.

Note: your approach does result in one container per node due to the ports published in mode: host. You might want to make this explicit by using max_replicas_per_node; it saves the scheduler trial-and-error cycles when trying to schedule a new replica in case of an error. Or you simply switch from a replicated to a global deployment.

I’m not using NFSv4, only local volumes.
Replication in RabbitMQ depends on the setup and the queue types used. For quorum queues and 3 nodes it means 3 copies with a tolerance of 1 node failure.
I don’t think the current compose file differs much from having 3 bare-metal servers with RabbitMQ installed as a service and clustered manually.
I agree, however, that NFSv4 may be the way to go to simplify the configuration. I think it adds another network hop, and latency, into the equation.
For a production setup this may be the more sensible choice.

Thanks for the reminder about max_replicas_per_node and global.
I had max_replicas_per_node in place previously and took it out while debugging the problem.
Global should also work now, considering I’m not using .Task.Slot.

Daniel

Regarding the consensus: it seems to be what I thought - either Raft, Paxos or something similar. This is a good reason to modify your approach and use mode: global to make sure the container that runs on a node always has the same configuration and uses the same volume (identity-wise). Once this is done, it really doesn’t differ much.

If you use mode: global, you don’t need max_replicas_per_node, as by definition in global mode one instance will be placed on each node that satisfies the deployment constraints.

On second thought: with your current compose file you already achieved that a container is always created on a node with the same configuration and volume.

I would still make it mode: global and just use the “same” local volume (each node’s volume would only hold the data of the instance running on that node). With global you can even drop the RABBITMQ_MNESIA_BASE environment variable, as it is really only useful if your volume is on a remote share used by all nodes.

Yes, it’s a Raft-based consensus.
I tested with max_replicas_per_node and it works as expected.
The reason it wasn’t working previously was that my YAML specified version 3.7, not 3.8.
Now I updated it to global and it works as expected, likely with a faster startup time for the reasons you mentioned.
RABBITMQ_MNESIA_BASE allows me to force RabbitMQ to always use the same mounted volume on the node, so I think it needs to be there.
Without it, the default /var/lib/rabbitmq/mnesia would be used, which would be a local Docker volume.
Am I missing something?

Thanks again for your suggestions, it’s been great.
Daniel

True, if your volume were backed by a remote share.
Each node mounts /var/lib/docker/volumes/{volume name}/_data into the target path in the container. Since the data is local to the node, only the instance on the current node will write into that folder. Swarm does not replicate volume data amongst nodes.

There is no harm in leaving the volumes the way you do, but at the same time it is not necessary.

You can verify my claim using ls -R /var/lib/docker/volumes/rabbitmq-data*/_data (as root) on each node and see for yourself what’s inside :slight_smile:

I think I got it. I was staring at it trying to understand, and it finally clicked that I don’t really need 3 volumes.
My current compose file is below.
The nice side effect is that there are no more duplicated volumes created, so I have only 3 volumes in total, one per VM.

version: '3.8'

services:
  rabbitmq:
    image: rabbitmq:3-management
    hostname: "rabbitmq-{{.Node.Hostname}}"
    networks:
      - rabbitmq-swarm-net
    ports:
      - target: 5672
        published: 5672
        protocol: tcp
        mode: host
      - target: 15672
        published: 15672
        protocol: tcp
        mode: host
    volumes:
      - rabbitmq-data:/var/lib/rabbitmq
    configs:
      - source: rabbitmq-config
        target: /etc/rabbitmq/rabbitmq.conf
    environment:
      - RABBITMQ_ERLANG_COOKIE=123G5Q3szkyCXk8fDLd3z8e5rX8PyJ
    deploy:
      mode: global
      restart_policy:
        condition: any
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.rabbitNode == true
          
networks:
  rabbitmq-swarm-net:

volumes:
  rabbitmq-data:

configs:
  rabbitmq-config:
    name: "rabbitmq-config-nodes-1_2_4"
    external: true

One thing I could improve further is getting rid of the hardcoded cookie and using Docker secrets.
This probably means I need to hardcode a uid/gid, which I’ve had bad experiences with in the past.

Thank you very much again. Much appreciated
Daniel

You can drop max_replicas_per_node: 1, as mode: global only deploys a single instance per node. It is only useful when mode: replicated is used.

Does the image support passing the RABBITMQ_ERLANG_COOKIE as a secret? Secrets are mounted as files into the container filesystem using tmpfs. The entrypoint script needs to read the file and either set the environment variable or render it into the config by itself. The long syntax allows setting the target to an absolute path, even though the documentation does not mention it.

Your compose file now shows good craftsmanship :slight_smile:

Thanks, I forgot to remove max_replicas_per_node when I switched to global.

The official image description on Docker Hub says it’s enough to use it like this:

docker service create ... --secret source=my-erlang-cookie,target=/var/lib/rabbitmq/.erlang.cookie ... rabbitmq

since that target location is where RabbitMQ reads the cookie from. But setting a uid/gid may be necessary, which matches my experience in the past. Also, I think the service needs user: set to the same value.
I’m not sure where the IDs are coming from. I’ve seen examples with 1000/1001.
There are also potentially some complications with the mapping from the hosts?
I need to educate myself on that topic.
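
In compose terms, I think the long secret syntax would look something like this - the uid/gid values are just my guess and would need to match the user the RabbitMQ process runs as:

services:
  rabbitmq:
    secrets:
      - source: my-erlang-cookie
        target: /var/lib/rabbitmq/.erlang.cookie
        # guessed values - must match the uid/gid of the rabbitmq process
        uid: "999"
        gid: "999"
        mode: 0400

secrets:
  my-erlang-cookie:
    external: true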

Thanks for the heads up on the long syntax

Daniel

It really depends on how the image is designed - if the RabbitMQ processes are started with an unprivileged user, you might need to tinker around with user: and set a uid/gid for the secret - you just need to make sure the user: declaration and the owner of the secret align. You should have no problem with the uid/gid when using a fresh volume - though it might be a problem if pre-existing data is owned by a different uid/gid. You could enter a running container and use ps to check which id is used to run the process - if it’s 0, there is no need to tinker around - if it’s >0, you indeed might want to set user: and the uid/gid for the secret.
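
One way to check without entering the container is docker top, which lists the processes of a running container together with the user they run as (the name filter below is just an assumption):

docker top $(docker ps -q --filter name=rabbitmq)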

The complications you refer to are the typical problem when a bind mount is used (a host path mapped to a container path), where the owner of the host path must align with the uid/gid of the process inside the container.
