
Intermittent Weave connection hangs [started with last week's API/DNS outage]


(Mark Henwood) #1

Hi folks

Following last week’s outage, we seem to have unshakeable problems with our inter-container comms.

Public symptom: our HAProxy containers often return 503 ‘Service Unavailable’, indicating that they consider the downstream containers unavailable.

Inspection of the HAProxy logs shows that it has successfully found and hooked up to all 4 of the downstream containers.

SSHing onto either of the nodes hosting the two HAProxy containers and curling a known target container on its 10.7.x.x address exhibits the following behaviour: roughly 19 times in 20 it responds well, in ~20ms; roughly 1 time in 20 it takes more than 15 seconds to respond.
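
A loop like the following is enough to make the slow responses stand out (a sketch, not the exact commands we ran; 10.7.x.x stands in for the target container’s Weave IP):

```
# Time 100 consecutive requests to the target container.
# Most complete in ~0.02s; the occasional one hangs for 15s+.
for i in $(seq 1 100); do
  curl -o /dev/null -s -w '%{time_total}s\n' --max-time 20 http://10.7.x.x/
done
```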

There is nothing to suggest that this is an in-container problem in the downstream containers. It appears to be at the Weave network level.

These are the same behaviours we experienced during your long Cloud API outage on Friday. Now that your status page shows operational, we still see the same problems, albeit intermittently. We have restarted all containers, and have also restarted the nodes (EC2) on which they run. The problems persist.

HELP!!


(Fernando Mayo) #2

@mhenwood Hey Mark, can you share your username in Docker Cloud, so we can have a look at your networking setup?


(Mark Henwood) #3

Sure, the stack is owned by the Organisation ‘clearreview’. I am ‘mhenwood’, an owner of that Org.


(Fernando Mayo) #4

@mhenwood I checked your networking setup and all nodes are connected to each other just fine. Does this intermittent connectivity issue happen with all containers, or just specific ones? Or with containers on a specific node? Network latency between containers should be independent of the availability of Cloud’s API. Running this repo: https://github.com/fermayo/dockercloud-network-tester with PING_THRESHOLD_MS=100 should log any spikes in latency between your nodes.
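
Something along these lines, one instance per node (the image name below is assumed from the repo; the README has the exact build/run instructions):

```
# Hypothetical invocation: run the tester detached, logging any
# inter-node ping that exceeds the 100ms threshold.
docker run -d -e PING_THRESHOLD_MS=100 fermayo/dockercloud-network-tester
```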

As a temporary workaround until we figure out the underlying issue, HAProxy can be configured to perform health checks and redispatch requests to a healthy backend if one of them becomes unavailable.
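
As a sketch, the relevant haproxy.cfg stanza would look something like this (backend name, IPs, ports and the health-check path are all placeholders to adapt to your stack):

```
backend app_servers
    balance roundrobin
    # Actively probe each backend over HTTP (path is a placeholder)
    option httpchk GET /health
    # If the chosen server is down, re-dispatch the request to another
    option redispatch
    retries 3
    # 'check inter 2000 fall 3 rise 2': probe every 2s, mark a server
    # down after 3 failed checks, back up after 2 successful ones
    server app1 10.7.0.1:80 check inter 2000 fall 3 rise 2
    server app2 10.7.0.2:80 check inter 2000 fall 3 rise 2
```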


(Mark Henwood) #5

Thanks @fermayo. I want to get to the bottom of the problem, so I will run your tester container when time allows. In the short term, however, I am fully engaged in urgent work to move our operations to a Docker Datacenter installation, because a series of recent Docker Cloud outages means we now deem it too unreliable to host our core product service. This is a shame, because it’s a great facility in theory. [I joined in its Tutum days.]


(Mark Henwood) #6

Update for @fermayo: The move to DDC is… nontrivial. In the short term I’ve taken on board your helpful comments about health-check arguments for HAProxy and am in the process of implementing them. Thanks again for your help.


(Mark Henwood) #7

FURTHER UPDATE:

Further investigation shows that the problem seems to be specific to one of the two Load Balancer containers, and is possibly related to some AppArmor errors. See this thread: AppArmor seems to be interfering with a node's behaviour