I’m about ready to give up. I can’t seem to find the answer to my issue, and any assistance would be appreciated.
Scenario / Setup:
Home 1 has “systemA” set up and configured with GlusterFS and Docker Swarm
Home 2 has “systemB” set up and configured with GlusterFS and Docker Swarm
Home 3 has “systemC” set up and configured with GlusterFS and Docker Swarm
For the sake of simplicity, my compose file has two services in it at the moment:
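Roughly along these lines - the images, labels, and version numbers below are placeholders rather than my exact file:

```yaml
version: "3.8"

services:
  traefik:
    image: traefik:v2.10          # placeholder version
    command:
      - --providers.docker.swarmMode=true
      - --providers.docker.exposedByDefault=false
      - --entrypoints.web.address=:80
    ports:
      - "80:80"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - traefik-web
    deploy:
      mode: global
      placement:
        constraints:
          - node.role == manager  # traefik needs a manager's docker socket in swarm mode

  cats:
    image: mikesir87/cats:1.0     # placeholder image for the cat service
    networks:
      - traefik-web
    deploy:
      mode: global
      labels:
        - traefik.enable=true
        - traefik.http.routers.cats.rule=Host(`www.catcontainer.com`)
        - traefik.http.services.cats.loadbalancer.server.port=5000

networks:
  traefik-web:
    external: true
```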
I have all services set up for global deployment, and everything is working with one exception.
Currently, my DNS is set up to point to “systemA”, with a failover DNS entry pointing to “systemB”.
I can only get the services to work one time out of every three. So if I visit www.catcontainer.com, it works the first time, fails the second, fails the third, and works again on the fourth.
Please share more details: OS version, Docker version, content of the compose file, network CIDR.
When I introduce load distribution amongst replicas to others, I typically use the docker-demo container, which provides a web UI that detects the replicas of the service, fires a request every second, and indicates which replica served the response:
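For reference, a minimal sketch of how I deploy it - assuming the ehazlett/docker-demo image, which serves its UI on port 8080 (adjust the image and port to whatever you use):

```yaml
version: "3.8"

services:
  demo:
    image: ehazlett/docker-demo   # assumed image; the UI shows which replica answered each request
    ports:
      - "8080:8080"
    deploy:
      mode: global                # or replicas: 3, one per node either way
```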
Thank you for your response. Sorry, yeah, I just read the guidelines for posting; I was a bit frustrated yesterday but am more collected today. Thank you for the service suggestion. Let me fire that baby up and see if I can get some more insight as to what’s going on.
Traefik is the one that publishes the port, depending on whether traefik-web was created as an overlay or a bridge network.
As it is a global mode deployment, I prefer my Traefik to publish the ports like this (as a bonus, it retains the client IPs!):
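A sketch of the ports section I mean, assuming entrypoints on 80 and 443 (long syntax; mode: host bypasses the ingress routing mesh, so each node’s Traefik answers locally and sees the real client IP):

```yaml
services:
  traefik:
    # ... rest of the service definition
    ports:
      - target: 80
        published: 80
        protocol: tcp
        mode: host
      - target: 443
        published: 443
        protocol: tcp
        mode: host
```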
I have not checked the configuration in the commands or the labels for the UI either - it shouldn’t be relevant for the issue.
Your cats container, on the other hand, is also a global service, but it does not publish any ports. As such, it leaves the load distribution to Traefik, which does seem to fail.
I would really advise using docker-demo to test your load balancing, as it allows you to clearly identify which replica responded.
Are you using something like keepalived to get a failover IP amongst the nodes? If you don’t use it, you might want to consider it… In my homelab swarm cluster, my WAN port forwards to the keepalived failover IP, so as long as one of my nodes is reachable, Traefik and therefore my containers are reachable.
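Roughly what a node’s keepalived.conf looks like in my setup - the interface name, router id, and virtual IP below are placeholders; the other nodes run the same block with lower priorities:

```
vrrp_instance swarm_vip {
    state MASTER
    interface eth0              # placeholder NIC name
    virtual_router_id 51
    priority 150                # lower on the other nodes, e.g. 100 and 50
    advert_int 1
    virtual_ipaddress {
        192.168.178.250/24      # placeholder failover IP that the WAN port-forward targets
    }
}
```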
Is it possible that the traefik-web network is not an overlay network?
"Traefik is the one that publises the port, depending if treafik-web was created as overlay or bridge network."
traefik-web was created as an overlay network.
"As it is a global mode deployment, I prefer my traefik to publish the ports like this (as a bonus it retains the client ip’s!):"
Roger that, let me try that configuration to see if it helps anything.
"Your cats container on the other hand is also a global service, but it does not publish any ports. As such it leaves the load distribution to traefik, which does seem to fail."
Tried publishing the port (5000), but no difference.
"I would realy advise to use docker-demo to test your loadalancing, at it allows to clearly identify which replica responded."
So I did. Based on what I see, it looks like it works similarly to the cat service. Same results though: I only get one hit out of three with the docker-demo service. At this point, I think it might be an external DNS issue? My traffic only seems to be routed to systemA. When it hits systemB or systemC, it fails. Once it gets back to systemA, it works.
"Are you using something like keepalived to get a failover-ip amongst the nodes? If you don’t use it, you might want to consider it… In my homelab swarm cluster, my WAN port forwards to the keepalived failover ip, thus as long as one of my node is reachable, traefik and theirfor my containers are reachable"
I’ve looked into this several times, but I don’t know that I can make this work across multiple homes/networks? Everything I’ve read indicates that the systems need to be on the same network. I’m trying to avoid a single-point-of-failure configuration.
"I it possible that the traefik-web network is not an overlay network?"
Just ran docker network ls to confirm that it IS an overlay network.
Does this mean your swarm nodes are not in the same network? If so, that would explain a lot. Swarm and Kubernetes use the Raft consensus algorithm for quorum under the hood. Raft requires low-latency network connections, such as those you have in a multi-AZ setup within a single region of a cloud hyperscaler - if you try the same with nodes in different regions, it will fail due to high network latency. Very few consensus algorithms are actually designed to be reliable on high-latency networks (really, just Egalitarian Paxos and Hashgraph come to mind).
When I read Home 1, Home 2, Home 3, I just thought those were the names of the nodes - odd, but hey, why not. But I guess what it actually means is that those are different locations.
That said, you might be better off with single nodes, orchestrated with Portainer and their Edge Agent to get a grip on the “central management” aspect. I have never used it - and I personally dislike UIs for managing my private Swarm or Kubernetes clusters.
Thank you sooooo much for your time. I’ve been reading up a lot, and I mean a lot, on this, and the more I read and the more examples I saw, the more I came to the conclusion that it wasn’t a supported configuration. You confirmed that suspicion, and for that and your support, I thank you.
I’m not a fan of Portainer.
I think I’m just going to do a cluster at home 1 and a cluster at home 2. Everything will run on home 1, and if the cluster or the internet fails, I’ll have a DNS failover rule to route traffic to home 2.