Service to Service Connectivity

We are currently running Docker Swarm with 3 nodes.
After about a week, one service can no longer reach another service through the service name that Docker assigns.
Docker 17.06.0-ce running on Ubuntu 16.04.2
Swarm mode with 3 manager nodes

ufw removed
iptables-persistent/netfilter-persistent installed and enabled
/etc/iptables/rules.v4:
# Generated by iptables-save v1.6.0 on Thu Jul 20 00:16:44 2017
*filter
:INPUT ACCEPT [85:5080]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [58:5277]
-A INPUT -p tcp -m tcp --dport 2377 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 7946 -j ACCEPT
-A INPUT -p udp -m udp --dport 7946 -j ACCEPT
-A INPUT -p udp -m udp --dport 4789 -j ACCEPT
COMMIT
# Completed on Thu Jul 20 00:16:44 2017

/etc/iptables/rules.v6:
# Generated by ip6tables-save v1.6.0 on Thu Jul 20 00:16:44 2017
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
COMMIT
# Completed on Thu Jul 20 00:16:44 2017
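
The IPv4 rules above open the ports that swarm mode uses between nodes: 2377/tcp for cluster management, 7946/tcp and 7946/udp for node-to-node communication, and 4789/udp for overlay (VXLAN) traffic. As a rough sketch, a saved rules file could be checked for those ports like this. The function names are made up for illustration, and the parser only understands simple `-A INPUT ... --dport N -j ACCEPT` lines like the ones shown here. It does not consider the chain policy, which in this setup is already ACCEPT.

```python
import re

# Ports swarm mode needs open between nodes: (protocol, port).
REQUIRED_SWARM_PORTS = {("tcp", 2377), ("tcp", 7946), ("udp", 7946), ("udp", 4789)}

# Matches simple single-port INPUT rules as written by iptables-save.
_RULE = re.compile(r"^-A INPUT\b.*?-p (\w+)\b.*?--dport (\d+)\b.*?-j ACCEPT\b")


def accepted_ports(rules_text):
    """Collect (protocol, port) pairs accepted by simple INPUT rules."""
    found = set()
    for line in rules_text.splitlines():
        match = _RULE.match(line.strip())
        if match:
            found.add((match.group(1), int(match.group(2))))
    return found


def missing_swarm_ports(rules_text):
    """Return the swarm ports that have no explicit ACCEPT rule, sorted."""
    return sorted(REQUIRED_SWARM_PORTS - accepted_ports(rules_text))
```

Calling `missing_swarm_ports(open("/etc/iptables/rules.v4").read())` on each node would show whether any node's saved rules have drifted from the others.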

We have some services running in global mode, and some services running in replicated mode.
The endpoint mode for all services is vip.

Eventually, a container has connection failures or timeouts when connecting to another service.
The container can resolve the target service by service name, and the target service exposes the port being connected to, but the connection fails or times out.
This happens to containers of global-mode services as well as replicated-mode services.
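
To make the symptom reproducible, the two stages (name resolution and TCP connect) can be tested separately from inside an affected container. The following is only a minimal sketch using the Python standard library. `probe` is a hypothetical helper, and it assumes Python is available in the container.

```python
import socket


def probe(host, port, timeout=3.0):
    """Resolve host, then try a TCP connect.

    Returns (addresses, error). error is None when the connect succeeds,
    otherwise a string saying which stage failed.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [], f"DNS resolution failed: {exc}"

    # Unique IP addresses the name resolved to (the service VIP in vip mode).
    addresses = sorted({info[4][0] for info in infos})

    try:
        with socket.create_connection((host, port), timeout=timeout):
            return addresses, None
    except OSError as exc:  # includes timeouts and "connection refused"
        return addresses, f"TCP connect failed: {exc}"
```

For example, `probe("some-service", 8080)` (service name and port are placeholders) returning a non-empty address list together with a "TCP connect failed" message would match the behaviour described above: the name resolves, but the connection does not go through.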

The network config is below:
[
    {
        "Name": "eieio-net",
        "Id": "ur8kvgcb7g2arlwrpfgx2cfas",
        "Created": "2017-07-31T16:10:33.570126891-05:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.0.0/24",
                    "Gateway": "10.0.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "0829734c25332c0441f9e4d0ff51aa13d5bf1fd1db77a664aa1762721ff97120": {
                "Name": "devtx_eieio-esb.t31itowkehfplk7ufii346x0t.3rrj2jnd6ngyrku5vqd1aqw4j",
                "EndpointID": "59fdb16af9373f9845ec24556c8048ba3aaf1b3bc61564e1f1679af848c6b8be",
                "MacAddress": "02:42:0a:00:00:1e",
                "IPv4Address": "10.0.0.30/24",
                "IPv6Address": ""
            },
            "474ebcc048edd445dc0aac0646a23568e98f97a36bd5222e03ee7e6e63958321": {
                "Name": "eieio-epsw-210-000-edc5cebda2-8ddb-02ededededed.1.g8dnm149hfy4lq7sjlhze5hbn-local",
                "EndpointID": "f7eeb20593bd15babe8fd55c8e4a15cca7380b750b91360900b8cf6a643c4da2",
                "MacAddress": "02:42:0a:00:00:1b",
                "IPv4Address": "10.0.0.27/24",
                "IPv6Address": ""
            },
            "4883838f239abb57bdbca4073fed3d774ad15777b94c932d510605ba2353f576": {
                "Name": "eieio-epsw-210-000-edc5cebda2-8ddb-02ededededed.1.g8dnm149hfy4lq7sjlhze5hbn",
                "EndpointID": "d5e4acc7688b640397a67fdd31652b5eac771ce1cefcbf392269b34459537d76",
                "MacAddress": "02:42:0a:00:00:10",
                "IPv4Address": "10.0.0.16/24",
                "IPv6Address": ""
            },
            "7d936ee80d79d541e5492188e03a6425c55d38cc532c6f6d800a12e8431ec90a": {
                "Name": "devtx_eieio-management-ui.t31itowkehfplk7ufii346x0t.wnac93pu1pt2rhyamjk5xs0pt",
                "EndpointID": "d723cde45c3110ea04e36f389a30bb88a2321cc4edb23cb608c3d172db2a0887",
                "MacAddress": "02:42:0a:00:00:1a",
                "IPv4Address": "10.0.0.26/24",
                "IPv6Address": ""
            },
            "e089be135c7991084d0edb31aabdae67146a21e022b7be4bd3c6c0e4a7cde233": {
                "Name": "devtx_docker-api-http.t31itowkehfplk7ufii346x0t.ed1zdn70z6og907pxo66fuxs9",
                "EndpointID": "dea24a89320fe5d89a7e7c460aa225ccd8ea03abe4bfaafc9bda116e872fe27a",
                "MacAddress": "02:42:0a:00:00:04",
                "IPv4Address": "10.0.0.4/24",
                "IPv6Address": ""
            },
            "eabd91090c8751033d836c327f41b62c647632c00b049c05a8022d2942ff843e": {
                "Name": "devtx_eieio-management-api.t31itowkehfplk7ufii346x0t.i1fcjedry67a58b83uoxfoo12",
                "EndpointID": "1ca71f08196826a6979e09cb8b97165a491392e410a4f34783954900c1344d33",
                "MacAddress": "02:42:0a:00:00:0b",
                "IPv4Address": "10.0.0.11/24",
                "IPv6Address": ""
            },
            "fa86aa144ead200745f9ed083486e86b9131cb5b24309513bb949f8b32bfd267": {
                "Name": "devtx_eieio-ipaws.1.kz939hryntffy6d30mfqkklzh",
                "EndpointID": "49d5ba5eca20ad34b6be691887308a5b4c72ef8e5810003fcf0ef7c9708a0cf0",
                "MacAddress": "02:42:0a:00:00:12",
                "IPv4Address": "10.0.0.18/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4097"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "SMCI-DOCKER-01-37ed209fcf6f",
                "IP": "192.168.34.22"
            },
            {
                "Name": "SMCI-DOCKER-02-e0e13e032f5d",
                "IP": "192.168.34.24"
            },
            {
                "Name": "SMCI-DOCKER-03-c45a092e6599",
                "IP": "192.168.34.26"
            }
        ]
    }
]
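
When comparing this output between nodes, or between a healthy and a failing state, it can help to reduce it to a list of IP/container pairs. The following is a small sketch, assuming the JSON comes from `docker network inspect`. `list_endpoints` is a hypothetical helper, and the quote normalisation is only there in case the text was pasted through an editor that converts straight quotes to typographic ones.

```python
import json


def list_endpoints(inspect_json):
    """Return (ip, container_name) pairs from `docker network inspect` JSON,
    sorted numerically by IPv4 address."""
    # Undo typographic quotes so pasted output still parses as JSON.
    cleaned = inspect_json.replace("\u201c", '"').replace("\u201d", '"')

    endpoints = []
    for network in json.loads(cleaned):
        # "Containers" may be missing or null on a node with no local tasks.
        containers = network.get("Containers") or {}
        for info in containers.values():
            ip = info["IPv4Address"].split("/")[0]  # drop the /24 prefix length
            endpoints.append((ip, info["Name"]))

    return sorted(endpoints, key=lambda e: tuple(int(p) for p in e[0].split(".")))
```

Note that each node only lists the containers it hosts locally, so the output has to be collected on every node to see the whole network.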

From what you are saying, service name resolution works, but access to the service itself fails.
At a low level, the flow is:
container -> service name -> service IP (VIP) -> IPVS -> load balancing -> container

Here are some things I would look at:

  • Are the containers running and healthy?
  • Check the service logs and the container logs.
  • Is the container IP address reachable?
  • Dump the IPVS tables and check the load-balancing part.
  • Run tcpdump in the container and on the host to see where packets are being dropped.
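
One way to narrow down the "container IP reachable" and "load balancing" checks is to compare the VIP path with the direct task path. In swarm mode the name `tasks.<service>` resolves to the individual task IPs, bypassing the VIP. The sketch below is only an illustration under that assumption. The function names are hypothetical, and it needs to run inside a container attached to the same overlay network.

```python
import socket


def resolve(name):
    """Return the sorted IPv4 addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def can_connect(ip, port, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def compare_paths(service, port, resolver=resolve, connector=can_connect):
    """Test the service VIP and each task IP separately.

    resolver and connector are injectable so the logic can be exercised
    without a swarm.
    """
    return {
        "vip": {ip: connector(ip, port) for ip in resolver(service)},
        "tasks": {ip: connector(ip, port) for ip in resolver(f"tasks.{service}")},
    }
```

If the task IPs connect but the VIP does not, the IPVS/load-balancing step is the likely suspect. If the task IPs fail too, the overlay network between nodes is worth checking first.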

Cheers
Sreenivas