I’m about ready to give up. I can’t seem to find the answer to my issue, and any assistance would be appreciated.
Scenario / Setup:
Home 1 has “systemA” set up and configured with GlusterFS and Docker Swarm
Home 2 has “systemB” set up and configured with GlusterFS and Docker Swarm
Home 3 has “systemC” set up and configured with GlusterFS and Docker Swarm
For the sake of simplicity, my compose file has two services in it at the moment:
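Roughly along these lines - the images, labels, and version numbers below are placeholders rather than my exact file:

```yaml
version: "3.8"

services:
  traefik:
    image: traefik:v2.10          # placeholder version
    command:
      - --providers.docker.swarmMode=true
      - --providers.docker.exposedByDefault=false
      - --entrypoints.web.address=:80
    ports:
      - "80:80"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - traefik-web
    deploy:
      mode: global
      placement:
        constraints:
          - node.role == manager  # traefik needs a manager's docker socket in swarm mode

  cats:
    image: mikesir87/cats:1.0     # placeholder image for the cat service
    networks:
      - traefik-web
    deploy:
      mode: global
      labels:
        - traefik.enable=true
        - traefik.http.routers.cats.rule=Host(`www.catcontainer.com`)
        - traefik.http.services.cats.loadbalancer.server.port=5000

networks:
  traefik-web:
    external: true
```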
I have all services set up for global deployment, and everything is working with one exception.
Currently, my DNS is set up to point to “systemA”, with a failover DNS entry pointing to “systemB”.
I can only get the services to work one time out of every three. So if I visit www.catcontainer.com, it works the first time, fails the second, fails the third, and works again on the fourth.
Please share more details: OS version, Docker version, content of the compose file, network CIDR.
When I introduce load distribution amongst replicas to others, I typically use the docker-demo container, which provides a web UI that detects the replicas of the service, fires a request every second, and indicates which replica served the response:
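For reference, a minimal sketch of how I deploy it - assuming the ehazlett/docker-demo image, which serves its UI on port 8080 (adjust the image and port to whatever you use):

```yaml
version: "3.8"

services:
  demo:
    image: ehazlett/docker-demo   # assumed image; the UI shows which replica answered each request
    ports:
      - "8080:8080"
    deploy:
      mode: global                # or replicas: 3, one per node either way
```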
Thank you for your response. Sorry, yeah, I just read the guidelines for posting; I was a bit frustrated yesterday but am more collected today. Thank you for the service suggestion. Let me fire that baby up and see if I can get some more insight as to what’s going on.
Traefik is the one that publishes the port, depending on whether traefik-web was created as an overlay or a bridge network.
As it is a global mode deployment, I prefer my Traefik to publish the ports like this (as a bonus, it retains the client IPs!):
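A sketch of the ports section I mean, assuming entrypoints on 80 and 443 (long syntax; mode: host bypasses the ingress routing mesh, so each node’s Traefik answers locally and sees the real client IP):

```yaml
services:
  traefik:
    # ... rest of the service definition
    ports:
      - target: 80
        published: 80
        protocol: tcp
        mode: host
      - target: 443
        published: 443
        protocol: tcp
        mode: host
```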
I have not checked the configuration in the commands or the labels for the UI either - it shouldn’t be relevant for the issue.
Your cats container, on the other hand, is also a global service, but it does not publish any ports. As such, it leaves the load distribution to Traefik, which does seem to fail.
I would really advise using docker-demo to test your load balancing, as it allows you to clearly identify which replica responded.
Are you using something like keepalived to get a failover IP amongst the nodes? If you don’t use it, you might want to consider it… In my homelab swarm cluster, my WAN port forwards to the keepalived failover IP, so as long as one of my nodes is reachable, Traefik and therefore my containers are reachable.
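Roughly what a node’s keepalived.conf looks like in my setup - the interface name, router id, and virtual IP below are placeholders; the other nodes run the same block with lower priorities:

```
vrrp_instance swarm_vip {
    state MASTER
    interface eth0              # placeholder NIC name
    virtual_router_id 51
    priority 150                # lower on the other nodes, e.g. 100 and 50
    advert_int 1
    virtual_ipaddress {
        192.168.178.250/24      # placeholder failover IP that the WAN port-forward targets
    }
}
```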
Is it possible that the traefik-web network is not an overlay network?
"Traefik is the one that publises the port, depending if treafik-web was created as overlay or bridge network."
traefik-web was created as an overlay network.
"As it is a global mode deployment, I prefer my traefik to publish the ports like this (as a bonus it retains the client ip’s!):"
Roger that, let me try that configuration to see if it helps anything.
"Your cats container on the other hand is also a global service, but it does not publish any ports. As such it leaves the load distribution to traefik, which does seem to fail."
Tried publishing the port (5000), but no difference.
"I would realy advise to use docker-demo to test your loadalancing, at it allows to clearly identify which replica responded."
So I did. Based on what I see, it looks like it works similarly to the cat service. Same results though: I only get one hit out of three with the docker-demo service. At this point, I think it might be an external DNS issue? My traffic only seems to be routed to systemA. When it hits systemB or systemC, it fails. Once it gets back to systemA, it works.
"Are you using something like keepalived to get a failover-ip amongst the nodes? If you don’t use it, you might want to consider it… In my homelab swarm cluster, my WAN port forwards to the keepalived failover ip, thus as long as one of my node is reachable, traefik and theirfor my containers are reachable"
I’ve looked into this several times, but I don’t know that I can make this work across multiple homes/networks? Everything I’ve read indicates that the systems need to be on the same network. I’m trying to avoid a single-point-of-failure configuration.
"I it possible that the traefik-web network is not an overlay network?"
Just ran docker network ls to confirm that it IS an overlay network.
Does this mean your swarm nodes are not in the same network? If so, that would explain a lot. Swarm and Kubernetes use the Raft consensus algorithm for quorum under the hood. Raft requires low-latency network connections, such as those you have in a multi-AZ setup within a single region of a cloud hyperscaler - if you try the same with nodes in different regions, it will fail due to high network latency. Very few consensus algorithms are actually designed to be reliable on high-latency networks (really, just Egalitarian Paxos and Hashgraph come to mind).
When I read Home 1, Home 2, Home 3, I just thought those were the names of the nodes - odd, but hey, why not. But I guess what it actually means is that those are different locations.
That said, you might be better off with single nodes, orchestrated with Portainer and their Edge Agent to get a grip on the “central management” aspect. I have never used it - and I personally dislike UIs for managing my private Swarm or Kubernetes clusters.
Thank you sooooo much for your time. I’ve been reading up a lot, and I mean a lot, on this, and the more I read and the more examples I saw, the more I came to the conclusion that it wasn’t a supported configuration. You confirmed that suspicion, and for that and your support, I thank you.
I’m not a fan of Portainer.
I think I’m just going to do a cluster at home 1 and a cluster at home 2. Everything will run on home 1, and if the cluster or the internet fails, I’ll have a DNS failover rule to route traffic to home 2.