Swarm mode and AWS ELB issues

cbo3 · November 22, 2016, 7:41pm

We are running Docker 1.12.3 in swarm mode on AWS. We have 3 manager nodes, all drained, and 2 worker nodes running 6 services in global mode (so all services run on both workers). We also have an ELB configured for our services. I thought it made sense for fault tolerance to list all 3 manager nodes in the ELB target groups and rely on the managers to forward traffic to the worker nodes. The worker nodes are not listed in the target groups at all. This mostly works, but we get intermittent 502 (Bad Gateway) errors in this configuration, and I don’t see any pattern to them. I can’t tell whether one manager is somehow bad; the nodes all report healthy. As a test, I removed all but one manager from the ELB target groups, and now we don’t get random 502 errors at all.

Does that make any sense? Is there any way to debug what is going on with the managers to see if one is somehow “bad”? Doesn’t it make sense to list all managers in the ELB so that if one manager goes down, the ELB won’t send it any traffic and the swarm will stay up?
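
One way to check whether a single manager is “bad” might be to bypass the ELB and request each manager directly on the service’s published port, then compare status codes per node. The following is only a minimal sketch: the function names, the manager addresses, and the port in the usage comment are hypothetical placeholders, not values from our setup.

```python
import http.client
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or an error label if no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including 502) arrive here and still carry a status code.
        return err.code
    except (OSError, http.client.HTTPException) as err:
        # Connection refused, reset, timeout, malformed response, and similar failures.
        return "error: " + type(err).__name__


def probe_managers(hosts, path="/", attempts=50, fetch=fetch_status):
    """Request each host `attempts` times and tally the results per host."""
    results = {}
    for host in hosts:
        counts = Counter()
        for _ in range(attempts):
            counts[fetch(f"http://{host}{path}")] += 1
        results[host] = counts
    return results


# Example usage (hypothetical private IPs and published port):
# for host, counts in probe_managers(["10.0.0.11:80", "10.0.0.12:80", "10.0.0.13:80"]).items():
#     print(host, dict(counts))
```

If the 502s or connection errors cluster on one address, that manager is the place to look; if every manager answers cleanly when hit directly, the problem is more likely between the ELB and the nodes.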

pier92 · January 9, 2017, 11:29am

I have the same problem. Same Docker version; I’m trying to deploy an nginx service as a proxy cache.
I have 1 manager and 2 workers, with both workers on the same host as the manager. I created the workers with docker-machine (so they are two VMs, using VirtualBox).
I get random 502 Bad Gateway errors, with no pattern.
Have you solved the problem?
Thanks.

6454 · June 13, 2017, 9:47pm

Facing the exact same problem with docker-ce. I have 2 managers and 5 workers. Everything seems to work fine when only one manager is attached to the ELB. As soon as I add the 2nd manager, it all stops working: none of my webpages load, and requests time out.

If anybody has figured this out, please help!

startdatelabs · October 29, 2017, 12:24pm

We’re using 17.09.0-ce on Ubuntu 14.04 with exactly the same trouble. The test config has 3 managers and 2 workers. The 3 managers are defined in the ALB target group; the 2 workers run the test app.

We get the same intermittent 502 Bad Gateway errors, with no detectable pattern. We’ve tried 1 manager, 2 managers and 3 managers – no difference.

Worse – we’ve seen this behavior consistently across all Docker releases over approximately the last year, matching what is being reported here.

If anyone has figured this out, we’d love to know how!