4 minute timeout when connecting to published TCP port on Docker Swarm

I can establish a TCP connection to a Docker Swarm service published like this:

    ports:
       - target: 5454
         published: 5455
         protocol: tcp
         mode: host

…and send messages, but after 4 minutes of inactivity the connection is lost. On the next message send, I see a TCP retransmission followed by RST, ACK in Wireshark.

I could only find mention of Azure Load Balancer having a 4-minute timeout, but I’m not sure which component is responsible in my case, since I’m not using Azure; I’m using Docker Swarm on my local dev machine.

My setup is:

Local Dev Machine: Windows 11 Pro with Hyper-V
- Local VM Windows Server 2022 (Docker Swarm Manager) Node-A
- Local VM Windows Server 2022 (Docker Swarm Worker) Node-B

I’m using the NAT Switch with the newly created NAT network.

When I connect to Node-B on port 5455 from the Hyper-V host, the connection times out after 4 minutes.

@vrapolinario do you have an idea? Even though it’s not mentioned, based on the other threads of @ishnets, it is highly likely that this is about Windows containers.

If it had been Linux and the timeout were roughly 900 seconds, I would have said it could be the default lvs/vips timeout. However, I have no idea how this is implemented for Windows containers.

With Linux containers, switching endpoint_mode from vip to dnsrr would mitigate those issues for long-lived connections. The cost is that the service name then resolves to a multi-value DNS result instead of the vip, which would otherwise balance the traffic across the swarm tasks; if the client caches the resolved IP, load balancing is lost.

This seems to be the issue:

The VFP NAT rules have a default idle-connection timeout of 240 seconds. Windows will “forget” any connections from a container to an external host if no packets are sent or received for that duration, leaving a connection which appears to be open but does not actually route anywhere. The timeout for the nat network, and overlay NAT on Windows Server 2016, is 1800 seconds [citation needed], which would explain why connections are not dropped in those configurations.

…taken from Windows Server 2019 closes connection on swarm overlay network after ~6 Minutes · Issue #44082 · moby/moby (github.com)

Solution: set the TCP socket keepalive time to something less than 240000 milliseconds (240 seconds) in my Docker service, so the connection never stays idle long enough for the NAT rule to expire.
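For anyone landing here later, this is a rough sketch of what setting the keepalive time can look like in application code. It is written in Python purely for illustration; the thread does not say what language the service uses. The helper name `enable_keepalive` and the values of 120 seconds idle and 30 seconds between probes are assumptions, chosen only to stay well below the 240-second limit.

```python
import socket


def enable_keepalive(sock: socket.socket,
                     idle_seconds: int = 120,
                     interval_seconds: int = 30) -> None:
    """Enable TCP keepalive so probes start after `idle_seconds` of silence.

    Hypothetical helper; the defaults are assumed values below 240 seconds.
    """
    # Turn keepalive on for this socket (works on every platform).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: tuple is (on/off, idle time in ms, probe interval in ms).
        sock.ioctl(
            socket.SIO_KEEPALIVE_VALS,
            (1, idle_seconds * 1000, interval_seconds * 1000),
        )
    elif hasattr(socket, "TCP_KEEPIDLE"):
        # Linux and similar: the same settings, expressed in seconds.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    # On other platforms only SO_KEEPALIVE is set and the OS defaults apply.


if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(sock)
    # Connect afterwards as usual, e.g. to Node-B on published port 5455.
    sock.close()
```

Setting this on either end of the connection should be enough, since the keepalive probes count as traffic in both directions. The same options exist in most socket APIs under similar names.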
