I’m referencing my original post to give some background on this issue: Overlay network not working between two swarm containers - #6 by meyay
My main issue is that the externally defined overlay network created on the manager node is not visible (at least not consistently) on the worker node, with docker-compose on the worker responding: network openldap-net declared as external, but could not be found
I’m running Docker version 20.10.9, build c2ea9bc90b on two separate VM hosts.
VM Host #1 - IP address 10.0.1.86
VM Host #2 - IP address 10.0.1.160
The hosts can ping and ssh into each other.
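In case it matters, I believe the ports swarm uses for cluster and overlay traffic are TCP 2377, TCP/UDP 7946, and UDP 4789, and my understanding is that the TCP ones could be spot-checked from the worker with something like:
nc -zv 10.0.1.86 2377
nc -zv 10.0.1.86 7946
(the UDP ports are harder to verify with nc, so I’m not certain about those).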
I’m trying to create an overlay network – essentially a private network between two containers, one running in the Docker stack on each of these VMs. I’m using swarm to create the private network and am attempting to use its overlay feature as described in the official documentation: Networking with overlay networks | Docker Docs
I created the swarm and designated manager and worker nodes:
❯ sudo docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
01bpw9tjjlzeyu3ta530piq2e arch160.domain.com Ready Active 20.10.9
5be93cjhrc5pxvmk36jt0h563 * archZFSProxy.domain.com Ready Active Leader 20.10.9
I created the overlay network for the swarm on the manager using the following command:
sudo docker network create --driver overlay --attachable --subnet 10.90.0.0/24 --opt encrypted openldap-net
Upon creation of the docker swarm and overlay network, the networks as seen from the manager appear as follows:
❯ sudo docker network ls
NETWORK ID NAME DRIVER SCOPE
3b9a33636b3b bridge bridge local
c094888160f3 docker_gwbridge bridge local
6ea931cc3eda host host local
tsji27aqyqku ingress overlay swarm
8d3b52c8124a net bridge local
1037c20ae31f none null local
bk5x5d7lhxca openldap-net overlay swarm
b00e0fdb8c90 watchtower_ubuntumc bridge local
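As far as I understand, the network’s options can be double-checked on the manager with something like:
sudo docker network inspect openldap-net --format '{{.Driver}} {{.Scope}} {{.Attachable}}'
which should report the overlay driver, swarm scope, and attachable true if the create command above took effect.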
I’m utilizing docker-compose to manage the stacks on both the manager and the worker node.
The manager’s docker-compose.yml has a section like the following (host address 10.0.1.86):
---
version: '3.9'
networks:
  net:
    name: net
    driver: bridge
    ipam:
      config:
        - subnet: 10.190.0.0/24
  watchtower-ubuntumc:
    name: watchtower_ubuntumc
    driver: bridge
  openldap-net:
    external: true
    name: openldap-net
    driver: overlay
services:
  openldap:
    build:
      context: .
      dockerfile: Dockerfile-openldap
    container_name: openldap
    labels:
      - "com.centurylinklabs.watchtower.enable=false"
      - "com.centurylinklabs.watchtower.scope=archzfsproxy"
    restart: always
    hostname: openldap
    domainname: domain.com
    networks:
      net:
      openldap-net:
        aliases:
          - openldap1
        ipv4_address: 10.90.0.2
    ports:
      - 389:389
      - 636:636
    secrets:
      - authentication_backend-ldap_secret
      - openldap-config-database_secret
    environment:
      TZ: ${TZ}
      LDAP_LOG_LEVEL: 256
      LDAP_ORGANISATION: domain
      LDAP_DOMAIN: openldap.domain.com
      LDAP_BASE_DN: dc=ldap,dc=domain,dc=com
      LDAP_ADMIN_PASSWORD_FILE: /run/secrets/authentication_backend-ldap_secret
      LDAP_CONFIG_PASSWORD_FILE: /run/secrets/openldap-config-database_secret
      LDAP_TLS: "true"
      LDAP_TLS_CRT_FILENAME: cert.pem
      LDAP_TLS_KEY_FILENAME: key.pem
      LDAP_TLS_CA_CRT_FILENAME: ca.pem
      LDAP_TLS_DH_PARAM_FILENAME: "dhparam.pem"
      LDAP_TLS_ENFORCE: "false"
      LDAP_TLS_PROTOCOL_MIN: 3.4
      LDAP_TLS_VERIFY_CLIENT: try
      LDAP_REPLICATION: "true"
      LDAP_REPLICATION_HOSTS: "#PYTHON2BASH:['ldap://openldap.domain.com', 'ldap://openldap2.domain.com']"
      KEEP_EXISTING_CONFIG: "false"
      LDAP_REMOVE_CONFIG_AFTER_SETUP: "false"
      LDAP_SSL_HELPER_PREFIX: ldap
      LDAP_OPENLDAP_UID: 439
      LDAP_OPENLDAP_GID: 439
    tty: true
    command: --copy-service --loglevel debug
    stdin_open: true
    volumes:
      - /usr/share/zoneinfo:/usr/share/zoneinfo:ro
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/ldap/db:/var/lib/ldap
      - /data/ldap/config:/etc/ldap/slapd.d
      - /etc/ssl/self-signed-certs/openldap.domain.com/server:/container/service/slapd/assets/certs:ro
The worker node’s docker-compose.yml looks like the following (host address 10.0.1.160):
---
version: '3.9'
networks:
  docker-net:
    name: docker-net
    driver: bridge
    ipam:
      config:
        - subnet: 10.160.0.0/24
  openldap-net:
    external: true
    name: openldap-net
    driver: overlay
services:
  openldap2:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: openldap2
    hostname: openldap2
    domainname: domain.com
    restart: unless-stopped
    networks:
      docker-net:
      openldap-net:
        aliases:
          - openldap2
        ipv4_address: 10.90.0.4
    ports:
      - 389:389
      - 636:636
    environment:
      TZ: America/Chicago
      LDAP_LOG_LEVEL: 256
      LDAP_ORGANISATION: domain
      LDAP_DOMAIN: openldap.domain.com
      LDAP_BASE_DN: dc=ldap,dc=domain,dc=com
      LDAP_ADMIN_PASSWORD: ***
      LDAP_CONFIG_PASSWORD: ***
      LDAP_TLS: "true"
      LDAP_TLS_CRT_FILENAME: cert.pem
      LDAP_TLS_KEY_FILENAME: key.pem
      LDAP_TLS_CA_CRT_FILENAME: ca.pem
      LDAP_TLS_DH_PARAM_FILENAME: "dhparam.pem"
      LDAP_TLS_ENFORCE: "false"
      LDAP_TLS_PROTOCOL_MIN: 3.4
      LDAP_TLS_VERIFY_CLIENT: try
      KEEP_EXISTING_CONFIG: "false"
      LDAP_REMOVE_CONFIG_AFTER_SETUP: "false"
      LDAP_SSL_HELPER_PREFIX: ldap
      LDAP_OPENLDAP_UID: 439
      LDAP_OPENLDAP_GID: 439
      LDAP_BACKUP_TTL: 15
      LDAP_REPLICATION: "true"
      LDAP_REPLICATION_HOSTS: "#PYTHON2BASH:['ldap://openldap.domain.com','ldap://openldap2.domain.com']"
    tty: true
    command: --copy-service --loglevel debug
    volumes:
      - /usr/share/zoneinfo:/usr/share/zoneinfo:ro
      - /etc/localtime:/etc/localtime:ro
      - /etc/timezone:/etc/timezone:ro
      - /data/ldap/db:/var/lib/ldap
      - /data/ldap/config:/etc/ldap/slapd.d
      - /etc/ssl/self-signed-certs/openldap2.domain.com/server:/container/service/slapd/assets/certs:ro
When trying to start the docker-compose stack on the worker node, I get the following:
> sudo docker-compose up openldap2 -d
network openldap-net declared as external, but could not be found
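If it helps, my understanding is that the equivalent check on the worker would be something along these lines (just what I’d expect to use to see whether the overlay network has shown up there):
sudo docker network ls --filter driver=overlay
sudo docker network inspect openldap-net
The compose error above suggests it hasn’t.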
So I’ve tried restarting the Docker daemons on both the manager and the worker.
Looking at the docker logs on the worker node:
21-10-08T17:06:20.548706597-05:00" level=info msg="scheme \"\" not registered, fallback to default scheme" module=grpc
21-10-08T17:06:20.548813468-05:00" level=info msg="ccResolverWrapper: sending update to cc: {[{10.0.1.86:2377 <nil> 0 <nil>}] <nil> <nil>}" module=grpc
21-10-08T17:06:20.548847920-05:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
21-10-08T17:06:20.548944897-05:00" level=info msg="manager selected by agent for new session: {5be93cjhrc5pxvmk36jt0h563 10.0.1.86:2377}" module=node/agent node.id=01bpw9tjjl>
21-10-08T17:06:20.549322611-05:00" level=info msg="waiting 0s before registering session" module=node/agent node.id=01bpw9tjjlzeyu3ta530piq2e
21-10-08T17:06:20.809850606-05:00" level=info msg="initialized VXLAN UDP port to 4789 "
21-10-08T17:06:20.809932622-05:00" level=info msg="Daemon has completed initialization"
21-10-08T17:06:20.809890898-05:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=0.0.0.0 Local-addr=10.0.1.160 Adv-addr=10.0.1.160 Data-addr= Remote-addr-list=[10>
21-10-08T17:06:20.810056361-05:00" level=info msg="New memberlist node - Node:arch160.domain.com will use memberlist nodeID:1788da20248a with config:&{NodeID:1788da20248a H>
21-10-08T17:06:20.823633347-05:00" level=info msg="Node 1788da20248a/10.0.1.160, joined gossip cluster"
21-10-08T17:06:20.823798750-05:00" level=info msg="Node 1788da20248a/10.0.1.160, added to nodes list"
21-10-08T17:06:20.834203576-05:00" level=info msg="The new bootstrap node list is:[10.0.1.86]"
21-10-08T17:06:20.839053845-05:00" level=info msg="Node 0b9c5675fe8e/10.0.1.86, joined gossip cluster"
21-10-08T17:06:20.839159881-05:00" level=info msg="Node 0b9c5675fe8e/10.0.1.86, added to nodes list"
cker Application Container Engine.
21-10-08T17:06:20.888374071-05:00" level=info msg="API listen on /var/run/docker.sock"
21-10-08T17:06:20.892881851-05:00" level=info msg="API listen on [::]:2376"
21-10-08T17:06:21.814871802-05:00" level=error msg="error reading the kernel parameter net.ipv4.vs.expire_nodest_conn" error="open /proc/sys/net/ipv4/vs/expire_nodest_conn: n>
21-10-08T17:06:21.814928307-05:00" level=error msg="error reading the kernel parameter net.ipv4.vs.expire_quiescent_template" error="open /proc/sys/net/ipv4/vs/expire_quiesce>
21-10-08T17:06:21.814952089-05:00" level=error msg="error reading the kernel parameter net.ipv4.vs.conn_reuse_mode" error="open /proc/sys/net/ipv4/vs/conn_reuse_mode: no such>
So, judging by the log files, the worker node does recognize the manager node.
However, I have no idea how to go further with this problem, or even where to begin debugging it. Clearly restarting the Docker daemons isn’t working in this situation. Do I need to recreate the swarm setup?
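In case it’s relevant, my rough understanding is that re-verifying the worker’s membership, or rejoining it as a last resort, would look something like the following (just a sketch of what I’d try, with <token-from-manager> being whatever join-token prints, not something I’ve confirmed helps):
sudo docker info --format '{{.Swarm.LocalNodeState}}'                # on both nodes; should say "active"
sudo docker swarm join-token worker                                  # on the manager; prints the join command
sudo docker swarm leave                                              # on the worker
sudo docker swarm join --token <token-from-manager> 10.0.1.86:2377   # on the worker
Is that the right direction, or is there a better way to debug why the overlay network isn’t propagating to the worker?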