Docker Community Forums

Network Overlay is Unstable?


(Nicolas Bihan) #1

Hi,

We are evaluating Docker Datacenter (UCP 2.1.4/DTR 2.2.5) with Docker 17.03.2-ee-4, build 1e6d71e, installed on 6 nodes.
I deployed a simple Spring Boot 1.5.4 application with a Postgres DB backend.
The stack works fine when first deployed, but after a while the Spring Boot service loses its connection to the Postgres database in the Swarm.
I tried running the same app on a single host and it is stable there; the app works without any issue.

Here is the compose file I am using to deploy the stack from UCP:

version: '3'

services:
  web:
    image: dudockv8.dev.mydomain.com/aa/score:1.0.18
    deploy:
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role != manager
    restart: always
    ports:
      - '8080:8080'
    depends_on:
      - db
    environment:
      - DB_URL=jdbc:postgresql://db:5432/agilea
  db:
    image: postgres:9.6.3-alpine
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role != manager
    restart: always
    environment:
      POSTGRES_DB: agilea
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      PGDATA: /var/lib/postgresql/data/pgdata
    ports:
      - '5432'
    volumes:
      - postgresdata:/var/lib/postgresql/data/pgdata

volumes:
  postgresdata:

The nodes are VMs in a VMware network, and I can’t find any logs that would point to a resolution.
When I restart the application I can use it again, but it loses the connection again soon after.

Also, I tried to deploy the Example Voting App, and it is not registering some votes.

So clearly something is wrong with the network, but I can’t figure out what.

Any idea on what could be going on here?


(Nicolas Bihan) #2

Alright, we tried moving everything to a new set of VMs, and after a lot of trial and error we are still experiencing this behavior.
We added monitored pings between the servers and see no issue.
In the UCP logs I see this on one node:

Overlay network configuration
Network p8y7gpbv1cz3
nsenter: cannot open /var/run/docker/netns/*-p8y7gpbv1c: No such file or directory
nsenter: cannot open /var/run/docker/netns/*-p8y7gpbv1c: No such file or directory

Network b37b32tlucab
33:33:00:00:00:01 dev br0 self permanent
01:00:5e:00:00:01 dev br0 self permanent
33:33:ff:01:bb:f2 dev br0 self permanent
fe:50:98:86:77:a7 dev vxlan1 vlan 1 master br0 permanent
fe:50:98:86:77:a7 dev vxlan1 master br0 permanent
02:42:0a:ff:00:0c dev vxlan1 dst 192.168.141.36 link-netnsid 0 self permanent
02:42:0a:ff:00:0b dev vxlan1 dst 192.168.141.36 link-netnsid 0 self permanent
02:42:0a:ff:00:07 dev vxlan1 dst 192.168.141.34 link-netnsid 0 self permanent
02:42:0a:ff:00:05 dev vxlan1 dst 192.168.141.34 link-netnsid 0 self permanent
02:42:0a:ff:00:04 dev vxlan1 dst 192.168.141.33 link-netnsid 0 self permanent
02:42:0a:ff:00:03 dev vxlan1 dst 192.168.141.32 link-netnsid 0 self permanent
02:42:0a:ff:00:02 dev vxlan1 dst 192.168.141.31 link-netnsid 0 self permanent
02:42:0a:ff:00:0e dev vxlan1 dst 192.168.141.33 link-netnsid 0 self permanent
02:42:0a:ff:00:0d dev vxlan1 dst 192.168.141.33 link-netnsid 0 self permanent
6a:6d:94:7f:44:96 dev veth2 vlan 1 master br0 permanent
6a:6d:94:7f:44:96 dev veth2 master br0 permanent
33:33:00:00:00:01 dev veth2 self permanent
01:00:5e:00:00:01 dev veth2 self permanent
33:33:ff:7f:44:96 dev veth2 self permanent
port no	mac addr		is local?	ageing timer
  2	6a:6d:94:7f:44:96	yes		   0.00
  2	6a:6d:94:7f:44:96	yes		   0.00
  1	fe:50:98:86:77:a7	yes		   0.00
  1	fe:50:98:86:77:a7	yes		   0.00

Network kxcyq7fmjluu
nsenter: cannot open /var/run/docker/netns/*-kxcyq7fmjl: No such file or directory
nsenter: cannot open /var/run/docker/netns/*-kxcyq7fmjl: No such file or directory

This "No such file or directory" error is a bit worrying for sure…


(Nicolas Bihan) #3

I solved my issue by making sure our VM cluster was NTP-synchronized and by setting proper timeouts on the Hikari connection pool…
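
For anyone landing here later, the following is a minimal sketch of what such pool timeouts might look like in a Spring Boot 1.5 `application.yml`. It is not the exact configuration used above: the values are illustrative assumptions, and it assumes HikariCP is on the classpath and selected via `spring.datasource.type` (Spring Boot 1.5 otherwise prefers the Tomcat pool). The idea is to retire pooled connections before the network can silently drop them; idle TCP connections through a Swarm service VIP are commonly reported to be cut after roughly 15 minutes (the IPVS default TCP timeout of 900 s), though this thread does not confirm that this was the cause here.

```yaml
spring:
  datasource:
    # DB_URL is the environment variable set in the compose file above
    url: ${DB_URL}
    # Assumption: HikariCP is on the classpath and chosen explicitly
    type: com.zaxxer.hikari.HikariDataSource
    hikari:
      # Max time (ms) to wait for a connection from the pool (illustrative value)
      connection-timeout: 30000
      # Retire connections idle longer than 5 minutes (illustrative value)
      idle-timeout: 300000
      # Recycle every connection after 10 minutes, below the assumed 15-minute cutoff
      max-lifetime: 600000
```

Keeping `max-lifetime` comfortably below whatever idle cutoff applies in your network means the pool closes connections itself instead of handing out ones that are already dead.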