Overlay network not working between two swarm containers

I’ve had this problem for a while, and I always end up stuck at the same point, not knowing how to fix it.

I’m using docker-compose and have attempted to create an overlay network to connect two containers (running on separate VMs) within a docker swarm. I referenced the official documentation’s section “Use an overlay network for standalone containers” to set this up.

I went ahead and created a swarm and joined the two nodes:

sudo docker node ls
ID                            HOSTNAME                  STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
01bpw9tjjlzeyu3ta530piq2e     arch160.domain.com        Ready    Active                          20.10.8
5be93cjhrc5pxvmk36jt0h563 *   archZFSProxy.domain.com   Ready    Active         Leader           20.10.8
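
For completeness, the swarm itself was created with the standard init/join flow (the addresses and token below are placeholders):

sudo docker swarm init --advertise-addr <manager-ip>                 # on the manager
sudo docker swarm join --token <worker-token> <manager-ip>:2377      # on the worker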

Within the docker-compose file for the manager I have the following:

version: '3.9'
networks:
  net:
    name: net
    driver: bridge
    ipam:
      config:
        - subnet: 10.190.0.0/24
  watchtower-ubuntumc:
    name: watchtower_ubuntumc
    driver: bridge
  openldap-net:
    name: openldap-net
    driver: overlay
    attachable: true
    ipam:
      config:
        - subnet: 10.90.0.0/24

I have the following container (openldap) utilizing this network:

services:

  openldap:
    build: 
      context: .
      dockerfile: Dockerfile-openldap
#    image: osixia/openldap-backup:latest
    container_name: openldap
    labels:
      - "com.centurylinklabs.watchtower.enable=false"
      - "com.centurylinklabs.watchtower.scope=archzfsproxy"
    restart: always
    hostname: openldap
    domainname: domain.com
    networks:
      net:
      openldap-net:
        aliases:
          - openldap1
        ipv4_address: 10.90.0.2

If I inspect the network list from the manager I have the following:

❯ sudo docker network ls
NETWORK ID     NAME                  DRIVER    SCOPE
0afa863d1a38   bridge                bridge    local
c094888160f3   docker_gwbridge       bridge    local
6ea931cc3eda   host                  host      local
tsji27aqyqku   ingress               overlay   swarm
d8197a60ed27   net                   bridge    local
1037c20ae31f   none                  null      local
kqw7j9kxnkk6   openldap-net          overlay   swarm
b9e36dfe816d   watchtower_ubuntumc   bridge    local

Although this setup has worked in the past, the container on the worker node now can’t start because the overlay network can’t be found. Here are the relevant logs on the worker node:

sudo docker-compose up -d         
WARNING: The Docker Engine you're using is running in swarm mode.

Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.

To deploy your application across the swarm, use `docker stack deploy`.

Starting openldap2 ... error

ERROR: for openldap2  Cannot start service openldap2: Could not attach to network zc17lbud1gsrr7amkrj72pjvc: rpc error: code = NotFound desc = network zc17lbud1gsrr7amkrj72pjvc not found

ERROR: for openldap2  Cannot start service openldap2: Could not attach to network zc17lbud1gsrr7amkrj72pjvc: rpc error: code = NotFound desc = network zc17lbud1gsrr7amkrj72pjvc not found
ERROR: Encountered errors while bringing up the project.

So it’s trying to find a specific docker network (which I’m guessing is the overlay network, designated by zc17lbud1gsrr7amkrj72pjvc). Where is it getting this network ID??
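
My guess is that docker-compose recorded that ID on the previously created container. Inspecting the existing container should show which network ID it is still bound to (standard docker CLI; openldap2 is the container name from my compose file):

sudo docker inspect openldap2 \
  --format '{{range $name, $net := .NetworkSettings.Networks}}{{$name}}: {{$net.NetworkID}}{{println}}{{end}}'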

Here are the networks as seen by the worker node:

❯ sudo docker network ls
NETWORK ID     NAME              DRIVER    SCOPE
15ae93d56fa3   bridge            bridge    local
315bfa9f2ade   docker-net        bridge    local
03274edc9e94   docker_gwbridge   bridge    local
5969c9f024f2   host              host      local
tsji27aqyqku   ingress           overlay   swarm
bde961b8ece2   none              null      local

Here are sections of my docker-compose file for the worker node:

---
version: '3.9'

networks:
  docker-net:
    name: docker-net
    driver: bridge
    ipam:
      config:
        - subnet: 10.160.0.0/24
  openldap-net:
    external: true
    name: openldap-net
    driver: overlay

services:
  openldap2:
#    image: osixia/openldap-backup:1.4.0
    build:
      context: .
      dockerfile: Dockerfile
    container_name: openldap2
    hostname: openldap2
    domainname: domain.com
    restart: unless-stopped
    networks:
      docker-net:
      openldap-net:
        aliases:
          - openldap2
        ipv4_address: 10.90.0.4

So I’m stumped. The worker container won’t start because it’s looking for a specific network ID. This type of error usually happens when I have the VMs up and running and then manually restart the hosts or power off the hypervisors. When bringing the containers up cold, I get this type of error and the worker container can’t start.

How do I debug this issue further??

Is it possible that you declared the openldap-net network with the --attachable option…?

Furthermore, are those the complete compose files? If so, then you are missing the network declaration for your network. You cannot use a network inside a compose file without declaring it. This is necessary regardless of whether the network already exists or is created within the stack.

See Compose file version 3 reference | Docker Documentation for what is required to use an externally created network within a compose file.
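
A minimal sketch of such a declaration (network name taken from your posts; the service name is just a placeholder):

networks:
  openldap-net:
    external: true

services:
  some-service:
    networks:
      - openldap-net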

Update #3: I am unclear whether you only declared the networks block in your worker compose file, or in the manager compose file as well. The declaration must exist in every compose file that wants to use the external network.

Those are not the complete docker-compose files (rather, snippets showing the relevant sections). To the best of my knowledge, openldap-net was not declared with --attachable.

If you look at my examples, I believe I did include the network declaration in both docker-compose.yml files (there is a networks section in both files).

Here is the snippet from the manager file:

version: '3.9'
networks:
  net:
    name: net
    driver: bridge
    ipam:
      config:
        - subnet: 10.190.0.0/24
  watchtower-ubuntumc:
    name: watchtower_ubuntumc
    driver: bridge
  openldap-net:
    name: openldap-net
    driver: overlay
    attachable: true
    ipam:
      config:
        - subnet: 10.90.0.0/24

services:
  openldap:
    build: 
      context: .
      dockerfile: Dockerfile-openldap
#    image: osixia/openldap-backup:latest
    container_name: openldap
    labels:
      - "com.centurylinklabs.watchtower.enable=false"
      - "com.centurylinklabs.watchtower.scope=archzfsproxy"
    restart: always
    hostname: openldap
    domainname: domain.com
    networks:
      net:
      openldap-net:
        aliases:
          - openldap1
        ipv4_address: 10.90.0.2

And here is the snippet from the worker node:

---
version: '3.9'

networks:
  docker-net:
    name: docker-net
    driver: bridge
    ipam:
      config:
        - subnet: 10.160.0.0/24
  openldap-net:
    external: true
    name: openldap-net
    driver: overlay

services:
  openldap2:
#    image: osixia/openldap-backup:1.4.0
    build:
      context: .
      dockerfile: Dockerfile
    container_name: openldap2
    hostname: openldap2
    domainname: domain.com
    restart: unless-stopped
    networks:
      docker-net:
      openldap-net:
        aliases:
          - openldap2
        ipv4_address: 10.90.0.4

The openldap-net network is declared as external only in the worker-node file, whereas it’s defined (non-external) in the manager-node file. There are more sections in the compose files, such as ports, env declarations, etc., but nothing else really deals with the network settings.

The network declaration looks good and should work if the manager compose file is deployed first.
The attachable flag is configured on the network and thus should allow the worker node’s compose file to use the network.

Though, something does not add up here. Swarm stack deployments can only be executed on manager nodes. I can see no deploy key and no placement constraint that would pin a service to manager or worker nodes, which would result in random placement. Additionally, I can see that ipv4_address is declared for the network, which shouldn’t be a valid configuration for swarm stack deployments.

Are those compose files really deployed using docker stack deploy (as in: creating swarm services)? Or are you running docker-compose on each of your nodes separately and trying to just share the same overlay network? I am not sure docker-compose deployments are actually able to create overlay networks.
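
For reference, a swarm stack deployment would be run on a manager node like this (the stack name is arbitrary):

sudo docker stack deploy -c docker-compose.yml openldap-stack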

@meyay

Thanks for spending some time and looking at this problem.

Reference: Use an overlay network for standalone containers

You are correct. I am not technically using a true swarm deployment; rather, I’m sharing an overlay network between two standalone containers. It’s probably a bad design on my part, but there is an example in the official documentation of how to do this. They use swarm to create the overlay network, but the two communicating containers are simply started via docker run on the command line, not via docker stack deploy. I can confirm that in most cases this setup works (with a caveat). Using the compose files referenced above, I have in the past been able to communicate between the two independent containers located on different VMs over the overlay network, using TLS-encrypted transport.
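
Paraphrasing that documentation page, the flow is: create an attachable overlay from the CLI on a manager, then start plain containers on each host with docker run (names below follow the docs’ alpine example):

# on the manager
sudo docker network create --driver=overlay --attachable test-net

# on host 1
sudo docker run -dit --name alpine1 --network test-net alpine

# on host 2
sudo docker run -dit --name alpine2 --network test-net alpine
sudo docker exec -it alpine2 ping -c 2 alpine1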

This design choice, however, isn’t very robust, particularly in cases where either of the hosts is suddenly shut down or where both docker hosts are rebooted. The manager node needs to be up before the worker node (which in some cases is hard to control). In most cases, even with these timing errors, I see the docker swarm itself come up; the problem is with the overlay network. It’s my belief the worker node “holds on to” or somehow caches the network ID of the old overlay network and looks for that particular network on restart. Sometimes I just restart the containers (manager and worker) and the problem is solved; other times I need to have the worker and manager leave the swarm, create a new swarm, and then restart the containers. In other cases I need to repeat these steps 4-5 times (not knowing what else to do) until I can get the worker node to attach to the overlay network with a new ID. It’s frankly kind of a mess, and the design isn’t very robust.

The purpose of this overlay network was to connect two openldap containers over a private back-end network and have them hot-sync with each other. Perhaps a different network design could have been chosen on my part to allow this hot sync over a private network; I’ll definitely grant this is inexperience on my part about what “proper” network design looks like when dealing with hot spares or hot syncs. I’m just surprised that docker doesn’t seem to handle this situation more robustly, since it’s covered in their documentation.

I am quite sure it would be robust if you used it the way they describe in their documentation.
While in the documentation the overlay network is created on the CLI with docker network create --driver=overlay --attachable openldap-net, you create it within a compose file. The documentation’s version has a decoupled lifecycle and will exist until removed, while your version is coupled to the lifecycle of docker-compose AND adds an undesirable dependency that leaves room for race conditions.

I would suggest modifying the network declaration in the manager compose file to match the current config of the node compose file, and then actually doing what the docs do: create the network from the CLI. This should sort out most of the problems you introduced, if not all of them.
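
Concretely, that would mean creating the network once on the manager (subnet taken from your manager file; adjust as needed):

sudo docker network create --driver=overlay --attachable --subnet=10.90.0.0/24 openldap-net

and then declaring it as external in both compose files:

networks:
  openldap-net:
    external: true
    name: openldap-net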

Are you aware that custom bridge and overlay networks have built-in, DNS-based service discovery, which allows containers to reach other containers on the same network by their service name?
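
With that in place you could drop the fixed ipv4_address entries entirely; for example, from the worker side (names taken from your files, and assuming a ping binary exists in the image):

sudo docker exec -it openldap2 ping -c 2 openldap1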

@meyay

Hey thanks for pointing out a few things about my setup. I really appreciate the time you spent looking at it.

To be honest, nearly 100% of the documentation uses CLI examples rather than docker-compose, so when only CLI examples are given I’m not sure whether that means “we recommend you do it this way” or just “we aren’t showing you docker-compose examples, since we’re not really commenting on the use of that tool.”

I’m aware of the internal DNS that docker uses; however, in working with this problem for so long, I needed a way to debug it more reliably. Although not strictly a docker problem, there were times the openldap container was using IP addresses that did NOT actually correspond to the container’s current IP address, but rather old, previously assigned addresses. I needed a method that would all but guarantee the container was assigned the same IP address every single time. I didn’t delve into the container’s source code, and the openldap github isn’t really helpful on why this was happening.

I’ll definitely try the suggestion of defining the network outside the compose file and see what happens. I can envision how my current setup could create race conditions.