Following last week’s outage, we still have persistent problems with our inter-container comms.
Public symptom: our HAProxy containers often return 503 ‘service unavailable’, indicating the downstream containers are unavailable.
Inspection of the HAProxy logs shows that it has successfully found and hooked up to all 4 of the downstream containers.
SSHing onto either of the nodes hosting the two HAProxy containers and curling a known target container on its 10.7.x.x address shows the following behaviour: approximately 19 times in 20 it responds well in ~20ms; approximately 1 time in 20 it takes >15 seconds to respond.
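For anyone wanting to reproduce the check above, it can be scripted roughly as follows. This is only a sketch of what we did by hand with curl: the target URL is passed on the command line (substitute the real 10.7.x.x address), and the function names, the 1-second "slow" threshold, the 30-second timeout and the default of 100 requests are our own illustrative choices, not anything taken from HAProxy or Weave.

```python
import http.client
import sys
import time
import urllib.request


def time_request(url, timeout=30.0):
    """Return elapsed seconds for one GET, or None if it failed or timed out.

    Note: an HTTP error status such as 503 raises HTTPError (an OSError
    subclass) and is therefore counted as a failure here.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except (OSError, http.client.HTTPException):
        return None
    return time.monotonic() - start


def summarize(samples, slow_after=1.0):
    """Count failed and slow requests in a list of latencies (None = failed)."""
    ok = [s for s in samples if s is not None]
    slow = [s for s in ok if s > slow_after]
    return {
        "total": len(samples),
        "failed": len(samples) - len(ok),
        "slow": len(slow),
        "max": max(ok, default=None),
    }


# Only run the probe when a URL is actually supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    url = sys.argv[1]
    count = int(sys.argv[2]) if len(sys.argv) > 2 else 100  # assumed default
    print(summarize([time_request(url) for _ in range(count)]))
```

Run from one of the HAProxy nodes, a healthy network should report zero slow and zero failed requests; with the behaviour described above we would expect roughly 1 in 20 to be counted as slow.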
There is nothing to suggest that this is an in-container problem in the downstream containers. It appears to be at the Weave network level.
These are the same behaviours we experienced during your long Cloud API outage on Friday. Although your status now shows operational, we still have the same problems, albeit intermittently. We have restarted all containers and/or restarted the nodes (EC2) on which they run. The problems persist.