No route to host when the containerized proxy's target is on the other node of the swarm

I have two services in a stack: an nginx reverse proxy and a simple Node.js ‘hello world’ app.

There are two nodes in the swarm.

The reverse proxy deploys in ‘global’ mode, and its port 80 is mapped to the host's port 80. The reverse proxy is configured to proxy_pass requests to the simple ‘hello world’ app.

The simple ‘hello world’ app deploys with just 1 (one) replica.
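For concreteness, the stack is roughly equivalent to creating the services by hand like this (image names are placeholders; I actually deploy via a compose file):

$ docker network create --driver overlay stack-network
$ docker service create --name router --mode global \
    --network stack-network --publish 80:80 my-nginx-proxy-image
$ docker service create --name nodeapp --replicas 1 \
    --network stack-network my-hello-world-image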

In all tests, I send an HTTP request to one of the nodes only. The responses alternate between 200 and 502.

When the routing mesh sends the request to the proxy instance on the same node as the ‘hello world’ app, the proxy works: 200.
When the mesh sends it to the proxy on the other node, it fails: 502. The proxy log complains ‘no route to host’.
When I inspect the stack's network, the host being sought does not exist.

Here is the reverse proxy’s error:
2017/12/06 21:46:27 [error] 5#5: *3 connect() failed (113: No route to host) while connecting to upstream, client: 10.255.0.2, server: router, request: "GET /hello HTTP/1.1", upstream: "http://10.0.0.5:8080/hello", host: "10.133.142.132"

10.0.0.5 does not appear in the output of ‘docker network inspect stack-network’ on either node. Also, the address of the target, the ‘hello world’ app, is not 10.0.0.5.
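As far as I can tell, the VIP Docker assigned to the service can also be listed directly (service name as deployed in my stack):

$ docker service inspect --format '{{json .Endpoint.VirtualIPs}}' mesh_nodeapp   # one entry per attached network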

I've tested and confirmed that the required swarm ports are open, per https://docs.docker.com/engine/swarm/swarm-tutorial/#open-protocols-and-ports-between-the-hosts
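Roughly the kind of check I mean (exact tooling depends on the distro; <other-node-ip> is a placeholder):

$ nc -zv <other-node-ip> 2377    # cluster management (TCP)
$ nc -zv <other-node-ip> 7946    # node-to-node communication (TCP; UDP 7946 is also required)
$ sudo firewall-cmd --list-all   # or iptables -L -n, to confirm UDP 7946 and UDP 4789 (VXLAN) are allowed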

I’m seeking suggestions on how to debug.
Docker 17.06.

Thanks,
John

Still not resolved, but I can show that DNS resolution returns the unreachable address.

I discovered that in the nodeapp container, two addresses are bound to eth0. The reverse proxy can reach one but not the other, and DNS resolves to the unreachable one. WTF?

On the node where nodeapp is running:
$ docker exec mesh_nodeapp.1.o5lfzmaptt09w0r2sn53ku32d ip addr

inet 10.0.0.6/24 scope global eth0
inet 10.0.0.5/32 scope global eth0

On the node where nodeapp is not running:

$ docker exec mesh_tool.9h30zs8z30gfmb1hr6hp09sta.ndoks7mu6ykrj74v8i6en2z44 ping nodeapp
PING nodeapp (10.0.0.5): 56 data bytes
and it hangs there.

However, pinging 10.0.0.6 from the node where nodeapp is not running:
$ docker exec mesh_tool.9h30zs8z30gfmb1hr6hp09sta.ndoks7mu6ykrj74v8i6en2z44 ping 10.0.0.6
PING 10.0.0.6 (10.0.0.6): 56 data bytes
64 bytes from 10.0.0.6: seq=0 ttl=64 time=0.980 ms

Why doesn't DNS resolve to the reachable IP address in the swarm?

Note: mesh_tool is an extra service added for this scenario; it provides network tools. The router container does not have any network tools, like ping.
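For comparison, both name forms can be queried from the tool container (assuming its image ships nslookup):

$ docker exec mesh_tool.9h30zs8z30gfmb1hr6hp09sta.ndoks7mu6ykrj74v8i6en2z44 nslookup nodeapp
$ docker exec mesh_tool.9h30zs8z30gfmb1hr6hp09sta.ndoks7mu6ykrj74v8i6en2z44 nslookup tasks.nodeapp

The first returns 10.0.0.5, as in the ping above; the second should return the task IP(s), i.e. 10.0.0.6 here.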

IMHO overlay networking is totally broken in swarm (just browse the GitHub issues). I have found two things that help:

  1. Reference services with the tasks.service_name format (it resolves to a different IP than plain service_name).
  2. In the nginx config, set Docker's resolver and put the resolve directive on the upstream server, like:
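# 127.0.0.11 is Docker's embedded DNS server inside the container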
resolver 127.0.0.11 8.8.8.8 8.8.4.4 ipv6=off;

upstream api {
   server tasks.service:8080 resolve;
}

Martin,

Thanks for responding. I’d given up on this forum.

I changed the ‘endpoint_mode’ to dnsrr. I didn't make the change you suggest. The container's DNS servers are already the ones you list, and with dnsrr the DNS lookup returns the tasks.service IP addresses. These are the tasks' actual IP addresses, not the virtual IP address, and they are routed through the overlay without an issue. (I have studied Docker networking in depth since I first posted.)
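(For the record, the active mode can be confirmed with something along these lines:)

$ docker service inspect --format '{{.Spec.EndpointSpec.Mode}}' mesh_nodeapp   # prints dnsrr after the change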

The kernel plumbing is not getting set up to support the VIP's route. I'm looking for help to diagnose the plumbing, so I can describe it here. What Linux tools can I use to see the plumbing? What should the plumbing look like? What does it look like now? What logs should I look at (e.g., /var/log/messages) that could reveal the failure?
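From what I've read, the VIP plumbing should live inside the container's network namespace as iptables mangle marks plus IPVS entries, so I am guessing something like the following would expose it (assuming ipvsadm is installed on the host, and guessing the proxy task's name), but I don't know what healthy output should look like:

$ CID=$(docker ps -q --filter name=mesh_router)
$ PID=$(docker inspect --format '{{.State.Pid}}' $CID)
$ sudo nsenter -t $PID -n iptables -t mangle -nL OUTPUT   # look for MARK rules pointing at the VIP
$ sudo nsenter -t $PID -n ipvsadm -ln                     # IPVS virtual servers keyed by that mark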

The vip endpoint_mode is the default, and I assume it works; otherwise it would not be the default. If it systematically did not work, I would expect immediate pushback here. I mean, this is the default: everyone who runs swarm would hit the problem out of the box. It follows that I should easily find links saying something like 'always change the endpoint_mode because the default simply does not work.' I should also easily find discussions and such. Most of what I find is for earlier versions of Docker, so I'm not sure it applies.

Can you provide the bug IDs for this kind of problem? Is there one that says something like 'no route to the virtual IP destination'? I could not find one. If you do find one, can you provide a link? Thanks.

I think I have a server configuration that is not compatible with how Docker sets up the kernel plumbing for VIP routing.