Unable to communicate between 2 services from different nodes with IP address or tasks.<service-name>

Hi,

I am trying to deploy a microservices architecture on AWS EC2 instances with Docker Swarm. I have 5 EC2 instances: 4 workers and 1 manager. That part works perfectly; all the workers connect to the manager without any issue.

I am using the docker-compose.yml file below to deploy my stack. It contains 4 services, and 3 of them depend on one service, so they need to communicate with each other. I set up an overlay network so the services can reach each other across different hosts with Docker Swarm, and I am using the tasks.<service-name> DNS format, as specified in the Docker Swarm documentation.
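For context, I deploy the stack under the name my_app, which is why the DNS names below carry the my_app_ prefix:

docker stack deploy -c docker-compose.yml my_app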

It starts all the services, but then 3 out of 4 can't reach the service they depend on via its DNS name, so they crash and stop.

I don't understand why my services on different hosts can't communicate over the overlay network using the DNS names. What am I doing wrong, and how can I fix it?

PS: all the EC2 instances are on the same subnet, and I can reach the private IPv4 of each instance both from the host and from inside a container, but not the 10.xx.xx.xx overlay IPs or the DNS names.
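Roughly what I tested (the addresses are placeholders):

# from the host and from inside a container: private IPv4 of another instance
ping <private-ipv4-of-another-instance>    # works in both cases
# from inside a container attached to app-network:
ping 10.xx.xx.xx                           # overlay IP of a task on another node: no reply
ping tasks.my_app_registration             # the DNS name: no reply either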

PS 2: I have a security group (which replaces a firewall for EC2 instances on AWS) where I allow inbound connections on TCP for ports 2377, 4789, and 7946, and on UDP for port 7946. For outbound I allow all protocols on every port and every IPv4 address.

version: '3.9'
services:
  test:
    container_name: test-service
    image: thomaslpro/test-service
    depends_on:
      - registration
    command: sh -c "/wait && java -server -XX:+UnlockExperimentalVMOptions -XX:+UseContainerSupport -jar test.jar"
    deploy:
      placement:
        constraints:
          - node.role == worker
    ports:
      - 8081:8081
    environment:
      - SPRING_PROFILES_ACTIVE=prod
      - EUREKA_CLIENT_SERVICEURL_DEFAULTZONE=http://tasks.my_app_registration:8761/eureka
      - WAIT_HOSTS=tasks.my_app_registration:9999
      - WAIT_HOSTS_TIMEOUT=300
      - WAIT_SLEEP_INTERVAL=30
      - WAIT_HOST_CONNECT_TIMEOUT=30
    networks:
      - app-network
  configuration:
    container_name: config-service
    image: thomaslpro/config-service
    deploy:
      placement:
        constraints:
          - node.role == worker
    ports:
      - 8888:8888
    environment:
      - SPRING_PROFILES_ACTIVE=prod
    networks:
      - app-network
  gateway:
    container_name: gateway
    image: thomaslpro/gateway-service
    depends_on:
      - registration
    command: sh -c "/wait && java -server -XX:+UnlockExperimentalVMOptions -XX:+UseContainerSupport -jar gateway.jar"
    deploy:
      placement:
        constraints:
          - node.role == worker 
    ports:
      - 9999:9999
    environment:
      - SPRING_PROFILES_ACTIVE=prod
      - EUREKA_CLIENT_SERVICEURL_DEFAULTZONE=http://tasks.my_app_registration:8761/eureka
      - WAIT_HOSTS=tasks.my_app_registration:9999
      - WAIT_HOSTS_TIMEOUT=300
      - WAIT_SLEEP_INTERVAL=30
      - WAIT_HOST_CONNECT_TIMEOUT=30
    networks:
      - app-network
  registration:
    container_name: registration
    image: thomaslpro/registration-service
    depends_on:
      - configuration
    command: sh -c "/wait && java -server -XX:+UnlockExperimentalVMOptions -XX:+UseContainerSupport -jar registration.jar"
    deploy:
      placement:
        constraints:
          - node.role == worker
    ports:
      - 8761:8761
    environment:
      - SPRING_PROFILES_ACTIVE=prod
      - EUREKA_CLIENT_SERVICEURL_DEFAULTZONE=http://tasks.my_app_registration:8761/eureka
      - WAIT_HOSTS=tasks.my_app_configuration:8888
      - WAIT_HOSTS_TIMEOUT=300
      - WAIT_SLEEP_INTERVAL=30
      - WAIT_HOST_CONNECT_TIMEOUT=30
    networks:
      - app-network
networks:
  app-network:
    name: app-network
    driver: overlay
    internal: true
    attachable: true

Port 4789 needs to be UDP instead of TCP in your SG; that port carries the VXLAN traffic of the overlay network, so with it blocked the data path between the nodes never comes up.
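For reference, the full set of swarm ports with the correct protocols would look like this via the AWS CLI (the group ID and CIDR are placeholders for your own values):

aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol tcp --port 2377 --cidr <your-vpc-cidr>   # cluster management
aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol tcp --port 7946 --cidr <your-vpc-cidr>   # node discovery/gossip
aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol udp --port 7946 --cidr <your-vpc-cidr>   # node discovery/gossip
aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol udp --port 4789 --cidr <your-vpc-cidr>   # VXLAN overlay data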

Is it really intended to use tasks.<service-name>? That name resolves to a multi-value DNS record with all the container IPs, instead of the service's VIP. If your application does not cache DNS resolution results, that should be fine. If it does, I would stick to the service name, which resolves to the VIP. The VIP distributes traffic round-robin under the hood and is suited for requests that are processed within 900 seconds. It is unsuited for long-lasting connections, like connections to a database container that remain permanently open.
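You can see the difference from inside any container attached to the overlay network, assuming nslookup is available in the image:

# multi-value record: one A record per running task
nslookup tasks.my_app_registration
# single record: the service's virtual IP
nslookup my_app_registration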

Hi,

Thanks for your response, it works! I feel a bit dumb for struggling just because I used the wrong protocol…

About tasks.<service-name>: according to the Docker Swarm documentation, that's how container discovery works in a swarm cluster. I don't cache DNS resolution results, at least not that I'm aware of, and I don't know if Spring Boot apps do it by default, but I've never heard of it, so I assume they don't.

As far as I know, using the service name doesn't work for communication between services on different hosts; it's used for single-host containers, which isn't my case. I could use IP addresses instead of tasks.<service-name>, but that wouldn't be dynamic, and I would have to change the IPs every time I use new EC2 instances with different IPs.

Of course it does, if they are in the same overlay network.

I strongly suggest not using the container IPs at all; depending on the use case, use the service name or tasks.<service-name>. By default the service name resolves to a virtual IP, which forwards traffic to the target containers round-robin. That is the preferred approach, unless there is a reason not to use it.
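If you want to check what the VIP actually is, or opt a service out of it, something along these lines works (my_app_registration being the deployed service name from your stack):

# show the virtual IP the service name resolves to on each network
docker service inspect --format '{{.Endpoint.VirtualIPs}}' my_app_registration
# switch the service to DNS round-robin, so the plain service name resolves to the task IPs
# (note: dnsrr cannot be combined with ports published through the ingress routing mesh)
docker service update --endpoint-mode dnsrr my_app_registration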