Network connectivity seems unstable during parallel service deployments

Hi all, I am battling an issue involving Docker swarm and ZooKeeper, and I am wondering if anyone can provide some insight.

The setup:

  1. A typical three-node ZooKeeper cluster, with nodes named zookeeper-1, zookeeper-2, and zookeeper-3, properly configured to form a cluster.
  2. A script that launches the three nodes one by one, then launches the other services once I have verified that each of the three nodes is in either the leader or the follower state.
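
A minimal sketch of what the leader/follower check in step 2 could look like, for anyone who wants to reproduce it. It assumes the ZooKeeper client port is the default 2181 and that the `srvr` four-letter command is enabled (on newer ZooKeeper versions this may need to be allowed through `4lw.commands.whitelist`). The function names are hypothetical and this is not the exact script.

```python
import socket


def query_srvr(host, port=2181, timeout=3.0):
    """Send the 'srvr' four-letter command and return the raw text reply.

    Port 2181 is an assumption (the usual ZooKeeper client port).
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the value of the 'Mode:' line (e.g. 'leader'), or None if absent."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def node_is_settled(host, port=2181):
    """True only if the node reports itself as leader or follower."""
    try:
        return parse_mode(query_srvr(host, port)) in ("leader", "follower")
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False
```

The launch script would call `node_is_settled("zookeeper-1")` and so on in a loop, and start the remaining services only once all three return `True`.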

Testing observations:

  1. The cluster always eventually ended up in a stable state with a properly elected leader. This is good.
  2. About half of the time, the cluster took more than 20 minutes to reach consensus. This is the issue I am trying to fix. So far, my investigation points to a swarm network stability issue. The problems I found are listed below.

Problems:

  1. There is a lag of 4+ seconds between the time a service is up and running and the time its service name becomes resolvable via DNS. The consequence is this: zookeeper-2 came up and contacted zookeeper-1, and through their message exchange they elected zookeeper-2 as the leader. However, when zookeeper-1 tried to sync from zookeeper-2, it got an error saying the name zookeeper-2 could not be resolved, and this lasted for 4+ seconds. The failure to sync triggered another round of election, so the election dragged on.
  2. From time to time, I observed that the two followers’ connections to their elected leader were broken for some unknown reason (read exception). From that point on, the three nodes remained unreachable from one another for about 17 minutes. Eventually they all connected again and the election succeeded. These 17 minutes of isolation were what made the ZooKeeper election take so long.
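
To make the first problem concrete, here is a hedged sketch of one way a launch script could wait until a peer name both resolves and accepts TCP connections before moving on. It only masks the DNS lag rather than explaining it. The helper names are hypothetical, and the port 2888 used in the usage line is an assumption (the quorum port commonly used in ZooKeeper configurations), not something taken from my setup.

```python
import socket
import time


def resolves(name):
    """True if the name currently resolves to at least one address."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def accepts_tcp(name, port, timeout=1.0):
    """True if a TCP connection to name:port can be opened right now."""
    try:
        with socket.create_connection((name, port), timeout=timeout):
            return True
    except OSError:  # includes DNS failures, refusals and timeouts
        return False


def wait_until(check, deadline_s=30.0, interval_s=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or the deadline passes.

    Returns the elapsed seconds on success, or None on timeout.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)
```

Usage would be something like `wait_until(lambda: resolves("zookeeper-2") and accepts_tcp("zookeeper-2", 2888))`. The returned elapsed time also gives a rough measurement of how long the name took to become usable.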

Questions:

  1. Is it normal for a service's DNS entry to lag behind the service actually running? Can we work around this somehow, or at least reduce the lag?
  2. Are there any tools or logging options that I can turn on to confirm or rule out network partitioning issues in swarm?
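
In the absence of a built-in tool, one way to gather evidence for the second question is to run a small probe inside each container and record when its peers were unreachable. The sketch below is only an illustration under assumptions: the function names are hypothetical, and the probe is a plain TCP connect, so it cannot tell a network partition apart from a peer process that is simply down.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Return (timestamp, reachable) for a single TCP connect attempt."""
    ts = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (ts, True)
    except OSError:
        return (ts, False)


def collect(host, port, count, interval_s=5.0):
    """Take `count` probes, sleeping `interval_s` seconds between them."""
    samples = []
    for i in range(count):
        samples.append(probe(host, port))
        if i < count - 1:
            time.sleep(interval_s)
    return samples


def outage_windows(samples):
    """Collapse time-ordered (timestamp, reachable) samples into outage spans.

    Each span is (first failed probe, first successful probe after it).
    If the peer is still down at the end, the span ends at the last sample.
    """
    windows = []
    down_since = None
    last_ts = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            windows.append((down_since, ts))
            down_since = None
        last_ts = ts
    if down_since is not None:
        windows.append((down_since, last_ts))
    return windows
```

Comparing the outage windows recorded on all three nodes against the timestamps of the other deployments would show whether the isolation lines up with the image pulls.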

Notes:

  1. During the node isolation, I was usually deploying other services onto the same Docker node with my script. These deployments can cause a traffic spike, since they pull new images that are hundreds of megabytes in size. I am not sure whether this is related, though.
  2. On many occasions I observed that deployed service names were not resolvable (checked with the ping command). I usually worked around this by restarting the Docker nodes.

Thanks for any help.

BTW, all the services are deployed into a swarm-wide network called back-tier.