Description
I am experiencing intermittent communication issues between containers on the same overlay network. I have been struggling to find a solution for weeks, but nothing I find on Google about communication issues quite matches what I am seeing.
So I am hoping someone here can help me figure out what is going on.
We are using Docker 17.06
We are using standalone swarm with three masters and one node.
We have multiple overlay networks
Containers attached to each overlay network:
1 container running Apache Tomcat 8.5 and HAProxy 1.7 (called the controller)
1 container just running Apache Tomcat 8.5 (called the apps container)
3 containers running PostgreSQL 9.6
1 container running an FTP service
1 container running Logstash
Steps to reproduce the issue:
1. Create a new overlay network.
2. Attach containers.
3. Look at the logs; after a short while you see the errors.
Describe the results you received:
The “controller” polls a servlet on the “apps” container every few seconds. Every 15 minutes or so we see a connect timed out error in the log files of the “controller”. Periodically we also see a connection attempt failed error when the controller tries to access its database in one of the Postgresql containers.
Error when polling apps container
org.apache.http.conn.ConnectTimeoutException: Connect to srvpln50-webapp_1.0-1:5050 [srvpln50-webapp_1.0-1/10.0.1.6] failed: connect timed out
Error when trying to connect to database
JavaException: com.ebasetech.xi.exceptions.FormRuntimeException: Error getting connection using Database Connection CONTROLLER, SQLException in StandardPoolDataSource:getConnection exception: java.sql.SQLException: SQLException in StandardPoolDataSource:getConnection no connection available java.sql.SQLException: Cannot get connection for URL jdbc:postgresql://srvpln50-controller-db_latest:5432/ctrldata : The connection attempt failed.
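To timestamp these failures independently of Tomcat and the JDBC pool, a plain TCP connect probe can be run from inside the controller container. The following is only a minimal sketch under stated assumptions: the function names `probe` and `run`, the 5-second interval and the 3-second timeout are illustrative choices, not values from our setup. Only the host name and port come from the error message above.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host, port, timeout=3.0):
    """Attempt one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refusals and name-resolution errors
        return False, time.monotonic() - start, str(exc)


def run(host, port, interval=5.0, timeout=3.0, attempts=None):
    """Probe repeatedly; print one UTC-stamped line per failure.

    attempts=None loops until interrupted. Returns the number of failures.
    """
    failures = 0
    done = 0
    while attempts is None or done < attempts:
        ok, elapsed, err = probe(host, port, timeout)
        if not ok:
            failures += 1
            stamp = datetime.now(timezone.utc).isoformat()
            print(f"{stamp} FAIL {host}:{port} after {elapsed:.3f}s: {err}")
        done += 1
        if attempts is None or done < attempts:
            time.sleep(interval)
    return failures
```

Calling `run("srvpln50-webapp_1.0-1", 5050)` would print a UTC timestamp for each failed connect, which can then be lined up against the UTC timestamps in the daemon's debug log.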
I turned on debug mode for the Docker daemon on the node.
Every time these errors occur I see the following corresponding entries in the docker logs:
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.422797691Z" level=debug msg="Name To resolve: srvpln50-webapp_1.0-1."
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.422905040Z" level=debug msg="Lookup for srvpln50-webapp_1.0-1.: IP [10.0.1.6]"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.648262289Z" level=debug msg="miss notification: dest IP 10.0.0.3, dest MAC 02:42:0a:00:00:03"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.716329366Z" level=debug msg="miss notification: dest IP 10.0.0.6, dest MAC 02:42:0a:00:00:06"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.716952000Z" level=debug msg="miss notification: dest IP 10.0.0.6, dest MAC 02:42:0a:00:00:06"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.802320875Z" level=debug msg="miss notification: dest IP 10.0.0.3, dest MAC 02:42:0a:00:00:03"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.944189349Z" level=debug msg="miss notification: dest IP 10.0.0.9, dest MAC 02:42:0a:00:00:09"
Feb 09 14:27:26 swarm-node-1 dockerd[12193]: time="2018-02-09T14:27:26.944770233Z" level=debug msg="miss notification: dest IP 10.0.0.9, dest MAC 02:42:0a:00:00:09"
IP 10.0.0.3 is the "controller" container
IP 10.0.0.6 is the "apps" container
IP 10.0.0.9 is the "postgresql" container that the "controller" is trying to connect to.
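To see how often each container shows up in these entries, the debug log can be summarised with a short script. This is a rough sketch that assumes the log lines keep exactly the format shown above; the names `parse_miss_notifications`, `mac_matches_ip` and `count_by_ip` are made up for illustration. The `mac_matches_ip` helper only checks the pattern visible in the lines above, where the last four octets of the MAC are the destination IP written in hex (for example `0a:00:00:03` for 10.0.0.3).

```python
import re
from collections import Counter

# Matches only the "miss notification" debug lines in the format shown above.
MISS_RE = re.compile(
    r'time="(?P<time>[^"]+)".*?'
    r'msg="miss notification: dest IP (?P<ip>[\d.]+), '
    r'dest MAC (?P<mac>[0-9a-fA-F:]+)"'
)


def parse_miss_notifications(lines):
    """Yield (timestamp, dest_ip, dest_mac) for each miss-notification line."""
    for line in lines:
        match = MISS_RE.search(line)
        if match:
            yield match.group("time"), match.group("ip"), match.group("mac")


def mac_matches_ip(ip, mac):
    """True if the last four MAC octets equal the IPv4 octets in hex."""
    ip_hex = [f"{int(octet):02x}" for octet in ip.split(".")]
    return mac.lower().split(":")[-4:] == ip_hex


def count_by_ip(lines):
    """Count miss notifications per destination IP."""
    return Counter(ip for _, ip, _ in parse_miss_notifications(lines))
```

Passing the daemon log lines (for instance an open file or `sys.stdin`) to `count_by_ip` gives a per-IP tally, and the timestamps from `parse_miss_notifications` can be compared with the times of the application errors.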
Output of docker version:
Client:
Version: 17.06.1-ce
API version: 1.30
Go version: go1.8.3
Git commit: 874a737
Built: Thu Aug 17 22:51:12 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.1-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 874a737
Built: Thu Aug 17 22:50:04 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 19
Running: 19
Paused: 0
Stopped: 0
Images: 18
Server Version: 17.06.1-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 385
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-108-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.784GiB
Name: swarm-node-1
ID: O5ON:VQE7:IRV6:WCB7:RQO4:RIZ4:XFHE:AUCX:ZLM2:GPZL:DXQO:BCIX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 217
Goroutines: 371
System Time: 2018-02-09T15:50:01.902816981Z
EventsListeners: 2
Registry: https://index.docker.io/v1/
Labels:
name=swarm-node-1
Experimental: false
Cluster Store: etcd://localhost:2379/store
Cluster Advertise: 10.80.120.13:2376
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details:
Swarm masters, node and containers are running Ubuntu 16.04 on bare metal servers
If there is anything I have missed that would aid diagnosis, please let me know.