Startup order in swarm

Hello,

I have read some documentation stating that Docker Swarm doesn’t support depends_on for services.
I am looking for some feedback from the community on how you are dealing with this.

My use case is a stack composed of a RabbitMQ message broker and a service (simplified for this post):

services:
  rabbitmq01:
    image: rabbitmq:3-management
    deploy:
      labels:
        - "traefik.http.routers.rabbitmq1.rule=Host(`rabbitmq.domain.com`)"
        - "traefik.http.services.rabbitmq1.loadbalancer.server.port=15672"
        - "traefik.swarm.network=traefik_default"
      placement:
        constraints: [node.role == worker]
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 60s
    environment:
      TZ: "Europe/Paris"
      RABBITMQ_DEFAULT_USER: "user"
      RABBITMQ_DEFAULT_PASS: "user"
    networks:
      - message_net
      - traefik_default
  app1:
    image: app1:2.8
    environment:
      TZ: "Europe/Paris"
    hostname: "app-{{.Node.Hostname}}-{{.Task.Slot}}"
    deploy:
      labels:
        - "traefik.http.routers.vad.rule=Host(`app.domain.com`)"
        - "traefik.http.services.vad.loadbalancer.server.port=8080"
        - "traefik.swarm.network=traefik_default"
      replicas: 2
      placement:
        constraints:
          - node.role == worker
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 60s
    networks:
      - message_net
      - traefik_default

The problem is that some app1 replicas start before RabbitMQ is ready to accept connections.
To handle this situation I see two options:

  • Use a dedicated stack file for RabbitMQ, so that I can make sure it is started before the application.
  • Check with the dev team that the code has a retry mechanism for when it fails to connect to RabbitMQ.

I think that currently the service starts, and if it fails to connect to RabbitMQ, the service (hence the container) still stays up and running.

Thank you for your input!

Thoughts on option 1: the race condition still applies; it just becomes less likely. What happens if the external service becomes unavailable while the application is running?

Thoughts on option 2: this would be the right choice to strengthen resiliency.

Another idea could be to make the entrypoint script delay the start of the main application until the required external services are available. That would fix the race condition at startup, but the question from option 1 remains.
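
To make the entrypoint idea a bit more concrete, here is a minimal sketch of such a wrapper in Python. It is only an illustration: the script name `wait_for.py` and the function `wait_for_port` are made up for this post, it assumes the app image ships a Python interpreter, and it assumes the broker listens on RabbitMQ's default AMQP port 5672. An open TCP port is also only an approximation of "ready".

```python
import os
import socket
import sys
import time


def wait_for_port(host, port, timeout=60.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection; success means the port is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, DNS not resolvable yet, or timed out.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Hypothetical usage: python wait_for.py rabbitmq01 5672 <main command and args>
# The guard does nothing when the script is run without these arguments.
if __name__ == "__main__" and len(sys.argv) >= 4:
    host, port, command = sys.argv[1], int(sys.argv[2]), sys.argv[3:]
    if not wait_for_port(host, port):
        # A non-zero exit lets the on-failure restart policy try again.
        sys.exit(f"{host}:{port} not reachable, giving up")
    # Replace this process with the main application.
    os.execvp(command[0], command)
```

The container's entrypoint would then call this wrapper with the service name from the stack file (`rabbitmq01`) and the real application command. It only covers startup; it does nothing if RabbitMQ goes away later.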

Thank you for your feedback.
Yes, I don’t like option 1, because my understanding of Docker is that you create one stack that describes the whole application and its requirements.

So I went with option 2 as a showcase, and it works. RabbitMQ needs around 15s to be ready.

Hope this helps.

It’s all about resilience when building Internet systems. Your application needs to reconnect when an internal service crashes or the network has an outage between services. Forcing a fixed startup order might work during startup, but you need to implement retries and reconnecting anyway.
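
As a rough illustration of the retry part, here is a small Python sketch of retrying a connection with exponential backoff. It is not tied to any particular RabbitMQ client: `call_with_retry` and its parameters are hypothetical names, and `connect` stands for whatever function your client library uses to open a connection. Pass the exception types your client actually raises via `retry_on`.

```python
import time


def call_with_retry(connect, attempts=10, base_delay=1.0, max_delay=30.0,
                    retry_on=(OSError,), sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on as exc:
            if attempt == attempts:
                # Out of attempts: let the error propagate so the process can exit.
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
            # Exponential backoff, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

With the defaults the waits add up to well over the roughly 15s RabbitMQ needs here. The same helper can be called again from the application's error handling when an established connection drops, which is the case a startup order alone cannot cover.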
