Cannot connect via swarm published port (sometimes!?!) How to debug?

Hi Everybody,

I’m having a problem that I’m trying to track down.

We are using Docker 1.12.3 (and locally I’m using the Docker for Windows installation) in Swarm mode. Sometimes (not regularly, and not persistently when doing something specific) a published port doesn’t seem to connect through the Swarm. I’m investigating the bug on a single node, and it can happen there as well.

I’ve looked at the Docker logs and can see the call coming in and being forwarded to 172.18.0.2 (172.18 is the docker_gwbridge network and 0.2 is the ingress-sbox container). However, when looking inside the container that the published port should be connecting to, I don’t see any traffic (received or sent bytes) on the 10.255 (ingress) or the 172.18 network interfaces. The 10.13 (our own overlay) network interface is sending and receiving bytes happily, and the application is working as expected.
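For reference, this is roughly how I checked those byte counters (a sketch; the name filter and the interface-to-network mapping are from my setup and may differ on yours):

```shell
# Find the container running the service task on this node
docker ps --filter name=<myservice> --format '{{.ID}} {{.Names}}'

# Per-interface RX/TX byte counters inside the container; match the
# interfaces to the 10.255.x (ingress), 172.18.x (docker_gwbridge)
# and 10.13.x (our overlay) addresses before reading the counters
docker exec <container-id> ip -s addr
docker exec <container-id> cat /proc/net/dev
```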

The log is littered with the following errors:

[15:53:46.550][VpnKit         ][Debug  ] com.docker.slirp.exe: tcp:0.0.0.0:3483:tcp:172.18.0.2:3483: connected
[15:53:46.551][VpnKit         ][Error  ] com.docker.slirp.exe: Hvsock.read: Unix.Unix_error(Unix.ECONNABORTED, "read", "")
[15:53:46.557][VpnKit         ][Error  ] com.docker.slirp.exe: tcp:0.0.0.0:3483:tcp:172.18.0.2:3483 proxy failed with flow proxy b: write failed with Eof
[15:53:56.569][VpnKit         ][Debug  ] com.docker.slirp.exe: tcp:0.0.0.0:3483:tcp:172.18.0.2:3483: connected
[15:53:56.571][VpnKit         ][Error  ] com.docker.slirp.exe: Hvsock.read: Unix.Unix_error(Unix.ECONNABORTED, "read", "")
[15:53:56.574][VpnKit         ][Error  ] com.docker.slirp.exe: tcp:0.0.0.0:3483:tcp:172.18.0.2:3483 proxy failed with flow proxy b: write failed with Eof
[15:54:06.588][VpnKit         ][Debug  ] com.docker.slirp.exe: tcp:0.0.0.0:3483:tcp:172.18.0.2:3483: connected
[15:54:06.592][VpnKit         ][Error  ] com.docker.slirp.exe: tcp:0.0.0.0:3483:tcp:172.18.0.2:3483 proxy failed with flow proxy b: write failed with Eof

However, these errors are also logged when the application within the container is not up and running yet. As an example, I could be trying to access the database in a container and that could be failing, while another application can connect (so I know the database is up and running).

I’ve checked and rechecked the configuration, but as the problem does not happen every time, I’m convinced the configuration (of the services: names, published ports, etc.) is correct.
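For what it’s worth, this is how I’ve been double-checking that configuration (a sketch, with `<myservice>` as a placeholder):

```shell
# Published ports as Docker actually sees them
docker service inspect <myservice> --format '{{json .Endpoint.Ports}}'

# Verify the task is running and see which node it landed on
docker service ps <myservice>
```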

I’m stumped as to where to look for problems, especially because the issue can’t be reproduced easily. Does anybody have any tips on where to look next? Is there any way we can look ‘inside’ the routing mesh? Is there a log somewhere for the ingress network or the ingress containers, or something like that? Any help would be greatly appreciated.

Regards,
Robert

Using

docker run -it -v /var:/thevar alpine:3.4 /bin/sh

I could get at the logs inside the Moby VM.
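Once inside, the logs are under the mounted /var (a sketch of what I did):

```shell
# /var of the Moby VM is mounted at /thevar in the alpine container
ls /thevar/log
tail -f /thevar/log/proxy-vsockd.log
```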

The proxy-vsockd.log contains the following lines:

2016/11/25 20:33:36 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused
2016/11/25 20:33:46 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused
2016/11/25 20:33:56 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused
2016/11/25 20:34:06 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused
2016/11/25 20:34:16 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused
2016/11/25 22:23:41 error CloseRead from: HvSocket has been closed for read
2016/11/25 22:23:41 error CloseWrite to: close tcp 172.18.0.1:38002->172.18.0.2:3580: shutdown: transport endpoint is not connected
2016/11/25 22:23:42 error CloseRead from: HvSocket has been closed for read
2016/11/25 22:23:42 error CloseRead from: close tcp 172.18.0.1:33142->172.18.0.2:5432: shutdown: transport endpoint is not connected
2016/11/25 22:23:43 error CloseRead from: HvSocket has been closed for read
2016/11/25 22:23:43 error CloseWrite to: close tcp 172.18.0.1:45104->172.18.0.2:3585: shutdown: transport endpoint is not connected
2016/11/25 22:23:43 error CloseRead from: HvSocket has been closed for read
2016/11/25 22:23:43 error CloseRead from: close tcp 172.18.0.1:33156->172.18.0.2:5432: shutdown: transport endpoint is not connected
2016/11/25 22:23:44 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused
2016/11/25 22:23:54 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused
2016/11/25 22:24:04 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused
2016/11/25 22:24:14 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused

Does anybody have any ideas about those errors? (I know they refer to a different port, but after restarting my machine I have to reproduce the problem again.)

P.S. I validated that the application in the container was working correctly by running the same connection code I use outside the Docker containers from inside a container, and that works like a charm. So it seems the problem must be within the ingress sandbox, or at least near that area.

I’ve deduced that

2016/11/25 22:23:41 error CloseRead from: HvSocket has been closed for read
2016/11/25 22:23:41 error CloseWrite to: close tcp 172.18.0.1:38002->172.18.0.2:3580: shutdown: transport endpoint is not connected

is ‘normal’ for the connection. However, the

2016/11/25 22:23:44 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused
2016/11/25 22:23:54 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused
2016/11/25 22:24:04 Can't forward traffic to backend tcp/172.18.0.2:3581: dial tcp 172.18.0.2:3581: getsockopt: connection refused

lines seem to be there only when the published port cannot be connected…

And when I do the following:

docker service scale <myservice>=0
docker service scale <myservice>=1

the problem is resolved…
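For now I’ve wrapped that workaround in a small script (a sketch; the service name is passed as an argument):

```shell
#!/bin/sh
# Workaround: bounce a swarm service so its published port gets
# re-wired through the routing mesh. Usage: ./bounce.sh <service-name>
set -e
SERVICE="$1"
docker service scale "$SERVICE"=0
docker service scale "$SERVICE"=1
```

Obviously that’s a band-aid, not a fix.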

I’ve been trying to read the Docker source code, but I’m having a hard time doing that as I’m a Java developer :wink:

Is there a way we could get information on the Ingress Sandbox from the node?
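One thing I’ve tried towards that, based on what I’ve read about the ingress sandbox being a network namespace on the node (a sketch; the paths may differ per version, and on Docker for Windows this has to be run inside the Moby VM):

```shell
# The ingress sandbox should be a network namespace under /var/run/docker/netns
ls /var/run/docker/netns/

# Enter the ingress sandbox namespace and look at its interfaces,
# NAT rules and (if ipvsadm is available) the IPVS load-balancer state
nsenter --net=/var/run/docker/netns/ingress_sbox ip addr
nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t nat -L -n
nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -L -n
```

I’d still love to hear whether that is the right place to look.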
Or could there be a way to expose the port directly on the node, bypassing the routing mesh?
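Partially answering that last question myself: from what I’ve read, newer Docker releases (1.13 and up) support host-mode publishing, which binds the port directly on each node that runs a task and bypasses the ingress routing mesh. I haven’t been able to try it on 1.12.3, so treat this as a sketch:

```shell
# Publish port 3483 directly on the node, bypassing the routing mesh
# (requires Docker 1.13+; service/image names are placeholders)
docker service create \
  --name myservice \
  --publish mode=host,target=3483,published=3483 \
  myimage:latest
```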