Need help troubleshooting networking between containers

locinus · April 26, 2021, 3:02pm

Hi.

I’m having a hard time debugging the connections between my tasks.

Everything works well amongst nodes hosted within the same server provider datacenter. As soon as I add some nodes hosted in another datacenter (another server provider), things get troublesome: my servers can access one another, but the connections within the Swarm are somewhat broken.

Here’s how it goes:

I have a cluster of Swarm nodes:

N10 and N11, hosted by provider P1 (some datacenter)
N20 and N21, hosted by provider P2 (some other datacenter)

Traefik is deployed on N10:

if my service is deployed on N11: Traefik routes the calls correctly.
if my service is deployed on N20: Traefik returns a 504 Gateway Timeout.

Within my service

if my app is deployed on N11 and my database on N12, all works.
if my app is deployed on N11 and my database on N20, connection with the database fails.

Clearly, there’s a connection failure between N10/N11 and N20, and the same errors occur if I try with a fresh, clean server N21 from provider P2.

Yet:

nodes from P2 have no firewall (test servers)
nodes from P1 have ports 2377, 4789, 7946 (UDP/TCP) open for N20/N21
nodes from P2 join the swarm joyfully
I can deploy stacks to nodes from P2 without errors
Swarmpit, a swarm manager, deployed on N11 can access info and logs from services deployed on N20/N21
within the Traefik container on N10, connections to services on N20/N21 fail (timeout), but connections from N10 itself (Traefik’s host) to these services (via the N20/N21 direct IPs) succeed.
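One way to double-check the port list above from each side is a small TCP probe run on one node against another. The following is only a minimal sketch, not an official Docker tool: the function names and the address in the usage note are hypothetical, and it assumes the usual Swarm TCP ports (2377 for cluster management, 7946 for node-to-node communication). It cannot verify 7946/udp or 4789/udp (the overlay VXLAN traffic), because a TCP connect says nothing about UDP reachability.

```python
import socket

# Swarm's TCP ports: 2377 (cluster management) and 7946 (node communication).
# 7946/udp and 4789/udp (overlay VXLAN data) are NOT covered by this probe.
SWARM_TCP_PORTS = (2377, 7946)


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_node(host: str, ports=SWARM_TCP_PORTS, timeout: float = 3.0) -> dict:
    """Map each port to whether it accepted a TCP connection."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

For example, running `check_node("203.0.113.20")` on N10 (with the placeholder address replaced by N20's real IP), and then the reverse from N20, would show whether the TCP ports are reachable in both directions. If they are and the overlay still times out, the UDP side (4789) is the next thing to look at with other tools.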

I can’t work out exactly where the connections between nodes fail, and my networking knowledge needs to improve.

Could anyone give me some leads or ideas to investigate my case?

We still want to spread our swarm across different datacenters/server providers, but we currently can’t do it.

yamakasi · July 3, 2021, 4:06pm

Have you been able to resolve this? I’m running into the same issue but have no clue either. I see this question come up often without a solution. It feels pretty vague.

meyay · July 3, 2021, 6:15pm

The limiting factor for what you are trying to do is that Raft, by design, requires low-latency network connections amongst nodes.
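To get a rough idea of the latency between two nodes, one option is to time a few TCP handshakes against a port that is known to be open (for instance 2377 on a manager). This is just a sketch under that assumption; the function name is hypothetical, and the figure it returns is a TCP connect time, not a statement about what Raft itself tolerates.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int, samples: int = 5, timeout: float = 3.0) -> float:
    """Median time in milliseconds to complete a TCP handshake with host:port.

    Raises OSError if any connection attempt fails or times out.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # The handshake is done once the connection object exists.
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

Comparing the value measured between two nodes in the same datacenter with the value measured across providers gives a first indication of how much extra delay the cross-datacenter link adds.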
