Overlay network not working between two containers - Part II

I’m referencing my original post to give some background on this issue: Overlay network not working between two swarm containers - #6 by meyay

My main issue is that the externally defined overlay network created on the manager node is not visible (well, at least sometimes) on the worker node, which responds with: network openldap-net declared as external, but could not be found

I’m running Docker version 20.10.9, build c2ea9bc90b on two separate VM hosts.

VM Host #1 - IP address 10.0.1.86
VM Host #2 - IP address 10.0.1.160

The hosts can ping and ssh into each other.

I’m trying to create an overlay network – essentially a private network – between two containers, one running in a Docker stack on each of these VMs. I’m using swarm to create the private network and its overlay feature, as described in the official documentation: Networking with overlay networks | Docker Docs

I created the swarm and designated manager and worker nodes:

❯ sudo docker node ls
ID                            HOSTNAME                  STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
01bpw9tjjlzeyu3ta530piq2e     arch160.domain.com        Ready     Active                          20.10.9
5be93cjhrc5pxvmk36jt0h563 *   archZFSProxy.domain.com   Ready     Active         Leader           20.10.9
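
For completeness, the swarm itself was created roughly along these lines (the placeholder stands in for the join token printed by swarm init):

# On the manager (10.0.1.86):
sudo docker swarm init --advertise-addr 10.0.1.86

# On the worker (10.0.1.160), using the join token printed by the init command:
sudo docker swarm join --token <worker-join-token> 10.0.1.86:2377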

I created the overlay network for the swarm on the manager using the following command:

sudo docker network create --driver overlay --attachable --subnet 10.90.0.0/24 --opt encrypted openldap-net
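
As a sanity check that the options took, the network's scope, driver, and attachability can be confirmed from the manager with a standard inspect --format template:

# Should report a swarm-scoped, attachable overlay network
sudo docker network inspect openldap-net \
  --format 'scope={{.Scope}} driver={{.Driver}} attachable={{.Attachable}}'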

After creating the swarm and the overlay network, the networks as seen from the manager look like this:

❯ sudo docker network ls
NETWORK ID     NAME                  DRIVER    SCOPE
3b9a33636b3b   bridge                bridge    local
c094888160f3   docker_gwbridge       bridge    local
6ea931cc3eda   host                  host      local
tsji27aqyqku   ingress               overlay   swarm
8d3b52c8124a   net                   bridge    local
1037c20ae31f   none                  null      local
bk5x5d7lhxca   openldap-net          overlay   swarm
b00e0fdb8c90   watchtower_ubuntumc   bridge    local

I’m using docker-compose to manage the stacks on both the manager and worker nodes.

The manager’s docker-compose.yml has a section like the following (host address 10.0.1.86):

---
version: '3.9'

networks:
  net:
    name: net
    driver: bridge
    ipam:
      config:
        - subnet: 10.190.0.0/24
  watchtower-ubuntumc:
    name: watchtower_ubuntumc
    driver: bridge
  openldap-net:
    external: true
    name: openldap-net
    driver: overlay

services:

  openldap:
    build:
      context: .
      dockerfile: Dockerfile-openldap
    container_name: openldap
    labels:
      - "com.centurylinklabs.watchtower.enable=false"
      - "com.centurylinklabs.watchtower.scope=archzfsproxy"
    restart: always
    hostname: openldap
    domainname: domain.com
    networks:
      net:
      openldap-net:
        aliases:
          - openldap1
        ipv4_address: 10.90.0.2
    ports:
      - 389:389
      - 636:636
    secrets:
      - authentication_backend-ldap_secret
      - openldap-config-database_secret
    environment:
      TZ: ${TZ}
      LDAP_LOG_LEVEL: 256
      LDAP_ORGANISATION: domain
      LDAP_DOMAIN: openldap.domain.com
      LDAP_BASE_DN: dc=ldap,dc=domain,dc=com
      LDAP_ADMIN_PASSWORD_FILE: /run/secrets/authentication_backend-ldap_secret
      LDAP_CONFIG_PASSWORD_FILE: /run/secrets/openldap-config-database_secret
      LDAP_TLS: "true"
      LDAP_TLS_CRT_FILENAME: cert.pem
      LDAP_TLS_KEY_FILENAME: key.pem
      LDAP_TLS_CA_CRT_FILENAME: ca.pem
      LDAP_TLS_DH_PARAM_FILENAME: "dhparam.pem"
      LDAP_TLS_ENFORCE: "false"
      LDAP_TLS_PROTOCOL_MIN: 3.4
      LDAP_TLS_VERIFY_CLIENT: try
      LDAP_REPLICATION: "true"
      LDAP_REPLICATION_HOSTS: "#PYTHON2BASH:['ldap://openldap.domain.com', 'ldap://openldap2.domain.com']"
      KEEP_EXISTING_CONFIG: "false"
      LDAP_REMOVE_CONFIG_AFTER_SETUP: "false"
      LDAP_SSL_HELPER_PREFIX: ldap
      LDAP_OPENLDAP_UID: 439
      LDAP_OPENLDAP_GID: 439
    tty: true
    command: --copy-service --loglevel debug
    stdin_open: true
    volumes:
      - /usr/share/zoneinfo:/usr/share/zoneinfo:ro
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/ldap/db:/var/lib/ldap
      - /data/ldap/config:/etc/ldap/slapd.d
      - /etc/ssl/self-signed-certs/openldap.domain.com/server:/container/service/slapd/assets/certs:ro

The worker node’s docker-compose.yml file looks like the following (host address 10.0.1.160):

---
version: '3.9'

networks:
  docker-net:
    name: docker-net
    driver: bridge
    ipam:
      config:
        - subnet: 10.160.0.0/24
  openldap-net:
    external: true
    name: openldap-net
    driver: overlay

services:

  openldap2:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: openldap2
    hostname: openldap2
    domainname: domain.com
    restart: unless-stopped
    networks:
      docker-net:
      openldap-net:
        aliases:
          - openldap2
        ipv4_address: 10.90.0.4
    ports:
      - 389:389
      - 636:636
    environment:
      TZ: America/Chicago
      LDAP_LOG_LEVEL: 256
      LDAP_ORGANISATION: domain
      LDAP_DOMAIN: openldap.domain.com
      LDAP_BASE_DN: dc=ldap,dc=domain,dc=com
      LDAP_ADMIN_PASSWORD: ***
      LDAP_CONFIG_PASSWORD: ***
      LDAP_TLS: "true"
      LDAP_TLS_CRT_FILENAME: cert.pem
      LDAP_TLS_KEY_FILENAME: key.pem
      LDAP_TLS_CA_CRT_FILENAME: ca.pem
      LDAP_TLS_DH_PARAM_FILENAME: "dhparam.pem"
      LDAP_TLS_ENFORCE: "false"
      LDAP_TLS_PROTOCOL_MIN: 3.4
      LDAP_TLS_VERIFY_CLIENT: try
      KEEP_EXISTING_CONFIG: "false"
      LDAP_REMOVE_CONFIG_AFTER_SETUP: "false"
      LDAP_SSL_HELPER_PREFIX: ldap
      LDAP_OPENLDAP_UID: 439
      LDAP_OPENLDAP_GID: 439
      LDAP_BACKUP_TTL: 15
      LDAP_REPLICATION: "true"
      LDAP_REPLICATION_HOSTS: "#PYTHON2BASH:['ldap://openldap.domain.com','ldap://openldap2.domain.com']"
    tty: true
    command: --copy-service --loglevel debug
    volumes:
      - /usr/share/zoneinfo:/usr/share/zoneinfo:ro
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/ldap/db:/var/lib/ldap
      - /data/ldap/config:/etc/ldap/slapd.d
      - /etc/ssl/self-signed-certs/openldap2.domain.com/server:/container/service/slapd/assets/certs:ro

When trying to start the docker-compose stack on the worker node, I get the following:

> sudo docker-compose up openldap2 -d    
network openldap-net declared as external, but could not be found

So I’ve tried restarting the Docker daemons on both the manager and the worker.

Looking at the Docker daemon logs on the worker node:

time="2021-10-08T17:06:20.548706597-05:00" level=info msg="scheme \"\" not registered, fallback to default scheme" module=grpc
time="2021-10-08T17:06:20.548813468-05:00" level=info msg="ccResolverWrapper: sending update to cc: {[{10.0.1.86:2377  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
time="2021-10-08T17:06:20.548847920-05:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
time="2021-10-08T17:06:20.548944897-05:00" level=info msg="manager selected by agent for new session: {5be93cjhrc5pxvmk36jt0h563 10.0.1.86:2377}" module=node/agent node.id=01bpw9tjjl>
time="2021-10-08T17:06:20.549322611-05:00" level=info msg="waiting 0s before registering session" module=node/agent node.id=01bpw9tjjlzeyu3ta530piq2e
time="2021-10-08T17:06:20.809850606-05:00" level=info msg="initialized VXLAN UDP port to 4789 "
time="2021-10-08T17:06:20.809932622-05:00" level=info msg="Daemon has completed initialization"
time="2021-10-08T17:06:20.809890898-05:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=10.0.1.160 Adv-addr=10.0.1.160 Data-addr= Remote-addr-list=[10>
time="2021-10-08T17:06:20.810056361-05:00" level=info msg="New memberlist node - Node:arch160.domain.com will use memberlist nodeID:1788da20248a with config:&{NodeID:1788da20248a H>
time="2021-10-08T17:06:20.823633347-05:00" level=info msg="Node 1788da20248a/10.0.1.160, joined gossip cluster"
time="2021-10-08T17:06:20.823798750-05:00" level=info msg="Node 1788da20248a/10.0.1.160, added to nodes list"
time="2021-10-08T17:06:20.834203576-05:00" level=info msg="The new bootstrap node list is:[10.0.1.86]"
time="2021-10-08T17:06:20.839053845-05:00" level=info msg="Node 0b9c5675fe8e/10.0.1.86, joined gossip cluster"
time="2021-10-08T17:06:20.839159881-05:00" level=info msg="Node 0b9c5675fe8e/10.0.1.86, added to nodes list"
time="2021-10-08T17:06:20.888374071-05:00" level=info msg="API listen on /var/run/docker.sock"
time="2021-10-08T17:06:20.892881851-05:00" level=info msg="API listen on [::]:2376"
time="2021-10-08T17:06:21.814871802-05:00" level=error msg="error reading the kernel parameter net.ipv4.vs.expire_nodest_conn" error="open /proc/sys/net/ipv4/vs/expire_nodest_conn: n>
time="2021-10-08T17:06:21.814928307-05:00" level=error msg="error reading the kernel parameter net.ipv4.vs.expire_quiescent_template" error="open /proc/sys/net/ipv4/vs/expire_quiesce>
time="2021-10-08T17:06:21.814952089-05:00" level=error msg="error reading the kernel parameter net.ipv4.vs.conn_reuse_mode" error="open /proc/sys/net/ipv4/vs/conn_reuse_mode: no such>

So it looks like the worker node can actually see the manager node, per the log files.
However, I have no idea how to go further with this problem, or even where to begin debugging it. Clearly restarting the Docker daemons isn’t working in this situation. Do I need to recreate the swarm setup?

I think what you are seeing here is that you’ve told the worker node’s compose file that there should already be an overlay network present on that node, but there isn’t - the overlay network has been created on the swarm, but it will only be plumbed out to a node when a service task that needs it is deployed there.

You don’t need to create the overlay network on the other nodes, because it will be automatically created when one of those nodes starts running a service task which requires it.

So as a trick, you could deploy a service (docker service create…) on your swarm that creates a container on your worker node using that overlay network - something like the sketch below. Then from the worker node you should see (docker network ls) that the network gets plumbed out, and your worker compose file that references an externally defined overlay network will find its mark.
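
For example, something along these lines (net-probe is just a hypothetical name; the constraint pins the task to the worker from your node ls output):

# Placeholder service pinned to the worker so the overlay gets plumbed out there:
sudo docker service create --name net-probe \
  --network openldap-net \
  --constraint node.hostname==arch160.domain.com \
  alpine sleep infinity

# Then, on the worker, the overlay should now be listed:
sudo docker network ls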

OK – what you described, your proposed workaround, does work.

I need to have a service using the overlay network already running on the worker node before running the docker-compose file.

I don’t know a lot about how things work under the hood, and maybe this is by design on Docker’s part, but this “trick” seems like a bad workaround hack.

Things do work from the command line – meaning I can deploy a container on the worker node even when the overlay network isn’t listed among the Docker networks there. Once a container is deployed manually from the worker node, the overlay network becomes visible.

This behavior falls down, though, when you put docker-compose in the middle. Docker Compose looks for the external network, doesn’t find it, and gives up. With a command-line deploy you specify the overlay network and things work (see the sketch below); with the docker-compose deploy scheme, you specify the overlay network and things don’t work, since the network isn’t found.
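
To be concrete, this is the kind of command-line deploy that works on the worker (probe is a hypothetical container name; it relies on openldap-net having been created with --attachable):

# Run directly on the worker: the engine asks the swarm managers for the
# attachable overlay and plumbs it out on demand.
sudo docker run -d --name probe --network openldap-net alpine sleep infinity

# The network now shows up locally, so a compose file declaring it as
# external would also find it:
sudo docker network ls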

Thanks for the reply. I spent hours scratching around looking for a solution and didn’t really find one until I played with things in a very simple, dumbed-down test setup.

Yeah - the ‘trick’ was just to test the theory of the problem - I didn’t mean to propose it as a solution. I think what you are really after is docker stack deploy with a single compose file that defines all the pieces of your app, uses placement constraints to send your different services to whichever nodes you want them on, and uses the depends_on compose setting to get your service start-up order correct (see the sketch below). That said, I don’t use docker-compose much, so it may well do exactly what you want (but I tend to think of docker-compose for use with a single node, and docker stack deploy for use with a Swarm consisting of multiple nodes).
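
A minimal sketch of that approach (the hostnames come from your node ls output; the image names are placeholders - and one caveat I should flag: docker stack deploy ignores build:, container_name:, and depends_on, so your images need to be pushed to a registry and your services need to tolerate starting in any order):

---
version: '3.9'

networks:
  openldap-net:
    external: true

services:

  openldap:
    image: registry.example.com/openldap:latest    # placeholder; stack deploy needs a pushed image
    networks:
      - openldap-net
    deploy:
      placement:
        constraints:
          - node.hostname == archZFSProxy.domain.com

  openldap2:
    image: registry.example.com/openldap2:latest   # placeholder image
    networks:
      - openldap-net
    deploy:
      placement:
        constraints:
          - node.hostname == arch160.domain.com

Deployed once from the manager with sudo docker stack deploy -c stack.yml ldap, the overlay gets plumbed out automatically on every node that receives a task.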