I’m having a hard time debugging the connections between my tasks.
Everything works well between nodes hosted in the same provider’s datacenter. As soon as I add nodes hosted in another datacenter (a different server provider), things get troublesome: my servers can reach one another, but connections within the Swarm partially fail.
Here’s how it goes:
I have a cluster of Swarm nodes:
- N10 and N11, hosted by provider P1 (some datacenter)
- N20 and N21, hosted by provider P2 (some other datacenter)
Traefik is deployed on N10:
- if my service is deployed on N11: Traefik routes the calls correctly.
- if my service is deployed on N20: Traefik returns a 504 Gateway Timeout.
Within my service:
- if my app is deployed on N11 and my database on N12, all works.
- if my app is deployed on N11 and my database on N20, the connection to the database fails.
Clearly, there’s a connection failure between N10/N11 and N20, and the same errors occur if I try with a fresh, clean server N21 from provider P2.
- nodes from P2 have no firewall (test servers)
- nodes from P1 have ports 2377, 4789, 7946 (UDP/TCP) open for N20/N21
- nodes from P2 join the swarm joyfully
- I can deploy stacks to nodes from P2 without errors
- Swarmpit, a swarm manager, deployed on N11 can access info and logs from services deployed on N20/N21
- from within the Traefik container on N10, connections to services on N20/N21 fail (timeout), but connections from N10 itself (Traefik’s host) to these services (via the direct IP of N20/N21) succeed.
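
To double-check the firewall side of the list above, a minimal sketch of a TCP reachability test for the Swarm ports could look like the following. It is only an illustration: the function names `tcp_port_open` and `check_node` are made up for this example, and it covers only the TCP ports (2377 and 7946). The overlay data traffic on 4789/UDP and the gossip on 7946/UDP cannot be verified with a plain TCP connect, so a clean result here does not rule out a UDP problem between the datacenters. Note also that 2377 is normally only listening on manager nodes.

```python
import socket
import sys

# Swarm TCP ports from the list above: 2377 (cluster management, managers only)
# and 7946 (node-to-node communication). 4789/UDP (overlay traffic) and
# 7946/UDP are deliberately not covered: a TCP connect cannot test them.
SWARM_TCP_PORTS = (2377, 7946)


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_node(host: str, ports=SWARM_TCP_PORTS, timeout: float = 3.0) -> dict:
    """Map each port to whether it accepted a TCP connection."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Hosts are passed on the command line, e.g. the IPs of N20 and N21.
    for host in sys.argv[1:]:
        for port, ok in check_node(host).items():
            print(f"{host}:{port}/tcp {'open' if ok else 'unreachable'}")
```

Run from a P1 node against the P2 node addresses, and again in the other direction, this shows whether the TCP side is reachable both ways. A failure that appears only inside containers, as in the last bullet, would then have to be looked for elsewhere, for example on the UDP side.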
I can’t pin down exactly where connections between nodes fail, and my networking knowledge needs to improve.
Could anyone give me some leads or ideas to investigate my case?
We still want to spread our swarm across different datacenters/server providers, but we currently can’t do it.