Intermittent connection failures between Docker containers

Description

I am experiencing intermittent communication issues between containers in the same overlay network. I have been struggling to find a solution for weeks, but everything I find on Google relating to communication issues doesn't quite match what I am seeing.

So I am hoping someone here can help me figure out what is going on.

We are using Docker 17.06
We are using a standalone swarm with three masters and one node
We have multiple overlay networks

Containers attached to each overlay network

1 container running Apache Tomcat 8.5 and HAProxy 1.7 (called the controller)
1 container just running Apache Tomcat 8.5 (called the apps container)
3 containers running PostgreSQL 9.6
1 container running an FTP service
1 container running Logstash
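
For reference, which containers are attached to each network and what addresses they were given can be confirmed with docker network inspect ("app-net" below stands in for one of our network names):

docker network ls --filter driver=overlay   # list the overlay networks
docker network inspect app-net              # the "Containers" section lists names and IPv4 addresses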

Steps to reproduce the issue:

Create a new overlay network
Attach the containers
Look at the logs; after a short while you see the errors
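
A minimal sketch of this setup, with placeholder network, image, and container names (our real names differ; the daemons already have the cluster store configured, as shown in the docker info output below):

docker network create -d overlay app-net                              # create the overlay network
docker run -d --name controller --network app-net controller-image   # attach the "controller"
docker run -d --name webapp --network app-net webapp-image           # attach the "apps" container
docker logs -f controller                                             # the connect timeouts show up here after a while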

Describe the results you received:

The “controller” polls a servlet on the “apps” container every few seconds. Every 15 minutes or so we see a "connect timed out" error in the log files of the “controller”. Periodically we also see a "connection attempt failed" error when the controller tries to access its database in one of the PostgreSQL containers.
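
One way to separate name resolution from the TCP connect when a timeout happens is to exec into the controller and test both by hand. The name and port below are taken from the error underneath; "controller" stands in for the real container name, and getent/nc may need to be installed in the image first:

docker exec -it controller sh
getent hosts srvpln50-webapp_1.0-1       # what does the embedded DNS (127.0.0.11) return?
nc -zv -w 5 srvpln50-webapp_1.0-1 5050   # does a plain TCP connect succeed within 5 seconds?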

Error when polling apps container

org.apache.http.conn.ConnectTimeoutException: Connect to srvpln50-webapp_1.0-1:5050 [srvpln50-webapp_1.0-1/10.0.1.6] failed: connect timed out

Error when trying to connect to database

JavaException: com.ebasetech.xi.exceptions.FormRuntimeException: Error getting connection using Database Connection CONTROLLER, SQLException in StandardPoolDataSource:getConnection exception: java.sql.SQLException: SQLException in StandardPoolDataSource:getConnection no connection available java.sql.SQLException: Cannot get connection for URL jdbc:postgresql://srvpln50-controller-db_latest:5432/ctrldata : The connection attempt failed.

I turned on debug mode for the Docker daemon on the node.

Every time these errors occur I see the following corresponding entries in the Docker logs:

Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.422797691Z" level=debug msg="Name To resolve: srvpln50-webapp_1.0-1."
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.422905040Z" level=debug msg="Lookup for srvpln50-webapp_1.0-1.: IP [10.0.1.6]"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.648262289Z" level=debug msg="miss notification: dest IP 10.0.0.3, dest MAC 02:42:0a:00:00:03"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.716329366Z" level=debug msg="miss notification: dest IP 10.0.0.6, dest MAC 02:42:0a:00:00:06"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.716952000Z" level=debug msg="miss notification: dest IP 10.0.0.6, dest MAC 02:42:0a:00:00:06"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.802320875Z" level=debug msg="miss notification: dest IP 10.0.0.3, dest MAC 02:42:0a:00:00:03"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.944189349Z" level=debug msg="miss notification: dest IP 10.0.0.9, dest MAC 02:42:0a:00:00:09"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.944770233Z" level=debug msg="miss notification: dest IP 10.0.0.9, dest MAC 02:42:0a:00:00:09"

IP 10.0.0.3 is the "controller" container
IP 10.0.0.6 is the "apps" container
IP 10.0.0.9 is the "postgresql" container that the "controller" is trying to connect to.
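
My understanding is that these "miss notification" entries are logged by the overlay driver when the kernel has no neighbor/VXLAN forwarding entry for the destination. Those tables live in the network namespace Docker creates for the overlay and can be inspected like this ("1-<network-id>" is a placeholder; the real name starts with "1-" followed by the network ID prefix):

ls /var/run/docker/netns/                                            # find the overlay's namespace
nsenter --net=/var/run/docker/netns/1-<network-id> ip neigh show     # ARP/neighbor entries for the 10.0.0.x peers
nsenter --net=/var/run/docker/netns/1-<network-id> bridge fdb show   # VXLAN MAC forwarding entries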

Output of docker version:

Client:

Version: 17.06.1-ce
API version: 1.30
Go version: go1.8.3
Git commit: 874a737
Built: Thu Aug 17 22:51:12 2017
OS/Arch: linux/amd64

Server:

Version: 17.06.1-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 874a737
Built: Thu Aug 17 22:50:04 2017
OS/Arch: linux/amd64
Experimental: false

Output of docker info:

Containers: 19
 Running: 19
 Paused: 0
 Stopped: 0
Images: 18
Server Version: 17.06.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 385
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-108-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.784GiB
Name: swarm-node-1
ID: O5ON:VQE7:IRV6:WCB7:RQO4:RIZ4:XFHE:AUCX:ZLM2:GPZL:DXQO:BCIX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 217
 Goroutines: 371
 System Time: 2018-02-09T15:50:01.902816981Z
 EventsListeners: 2
Registry: https://index.docker.io/v1/
Labels:
 name=swarm-node-1
Experimental: false
Cluster Store: etcd://localhost:2379/store
Cluster Advertise: 10.80.120.13:2376
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details:

The swarm masters, the node, and the containers are all running Ubuntu 16.04 on bare-metal servers.

If there is anything I have missed that would aid diagnosis, please let me know.

Having read many comments from the Docker folks on Google about the number of communication-related fixes in the latest release, we upgraded our systems to 17.12 CE and the issues we were experiencing went away.

I would love to know what the issue was, but I am happy to see it gone, so I am closing this topic.