Docker Community Forums

Docker swarm load balance network not working

Hello all,

I am following the getting started tutorial and am on part 4, where I create a load-balanced Python web app and distribute it across nodes using Docker swarm.

Instead of using docker-machine and VirtualBox, I am using a cluster of Raspberry Pis with Docker installed.
For the sake of this diagnosis, I am going to use only two replicas and two nodes in the swarm: chicago (manager) and denver.

I am able to add them to the Docker swarm and distribute the app across them. I know that port 7946 is open, and I have added firewall rules to allow access for both 7946 tcp/udp and port 4879 udp. I am able to curl port 7946 on the external interfaces (I know this is not orthodox, but it is the only immediate way I know to make sure the port is open), and I get garbled output, as I would expect.
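
For reference, Docker's swarm documentation lists 2377/tcp, 7946/tcp, 7946/udp and 4789/udp as the ports that nodes need open between each other, so the UDP port number in the firewall rule is worth double-checking against that list. Below is a minimal sketch of a reachability check, assuming `nc` (netcat) is installed; `swarm_ports` and `check_tcp_ports` are hypothetical helper names, not Docker commands. It only probes the TCP ports, because a simple connect (or curl) cannot confirm that a UDP port is open.

```shell
#!/bin/sh
# Ports that Docker documents for swarm mode, one per line.
swarm_ports() {
    cat <<'EOF'
2377/tcp cluster management (manager nodes)
7946/tcp node-to-node communication
7946/udp node-to-node communication
4789/udp overlay network (VXLAN) traffic
EOF
}

# Probe only the TCP entries on the given host.
# Assumes a netcat that supports -z (scan only) and -w (timeout in seconds).
check_tcp_ports() {
    host=$1
    swarm_ports | while read -r spec rest; do
        port=${spec%/*}
        proto=${spec#*/}
        [ "$proto" = tcp ] || continue
        if nc -z -w 3 "$host" "$port" </dev/null 2>/dev/null; then
            echo "$host $port/tcp reachable"
        else
            echo "$host $port/tcp NOT reachable"
        fi
    done
}
```

Running `check_tcp_ports denver` from chicago (and `check_tcp_ports chicago` from denver) would cover both directions; the UDP rules still have to be verified in the firewall configuration itself.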

The problem is that when I try to curl the web app (which is published on port 4000), I get a response the first time:

curl chicago:4000

Hello World!

Hostname: 4e6a6687c92e
Visits: cannot connect to Redis, counter disabled

but the second time it hangs forever.

I see the same thing from the second node:

curl denver:4000

Hello World!

Hostname: 8e04674d0757
Visits: cannot connect to Redis, counter disabled

The second request also hangs.

It appears that swarm is trying to load balance (the first response shows the container ID of the container on the node I contacted directly), but for some reason the second request is not able to reach the other node, so it times out.
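
To make the "every other request hangs" pattern easier to reproduce, here is a small sketch that sends several requests with a timeout instead of hanging forever. `probe_replicas` is a hypothetical helper name, and the 5-second limit is an arbitrary choice.

```shell
#!/bin/sh
# Send COUNT requests to URL and report which ones get an answer.
probe_replicas() {
    url=$1
    count=${2:-4}
    i=1
    while [ "$i" -le "$count" ]; do
        # -w prints the HTTP status; --max-time aborts a hanging request.
        if code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url"); then
            echo "request $i: HTTP $code"
        else
            echo "request $i: no response within 5s"
        fi
        i=$((i + 1))
    done
}
```

For example, `probe_replicas http://chicago:4000 6` would show whether the failures alternate with the successes, which is what round-robin balancing across one reachable and one unreachable replica would look like.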

I ran tcpdump on port 4000 on denver and then ran the curl against chicago twice, and I don’t see anything on either eth0 or docker0.
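
One possible reason nothing shows up with a port 4000 filter: traffic that swarm forwards between nodes over an overlay network is VXLAN-encapsulated, and Docker documents UDP port 4789 for it, so on the physical interface the outer packets would not match `port 4000`. A minimal sketch of a capture command built for that port follows; `vxlan_capture_cmd` is a hypothetical helper name and `eth0` is just the interface mentioned above.

```shell
#!/bin/sh
# Print a tcpdump command that watches for VXLAN (overlay) packets.
# Defaults to eth0 when no interface is given.
vxlan_capture_cmd() {
    iface=${1:-eth0}
    echo "tcpdump -n -i $iface udp port 4789"
}

# Example (needs root): run on denver while curling chicago from another shell.
# sudo $(vxlan_capture_cmd eth0)
```

If no UDP 4789 packets arrive on denver while chicago is being curled, that would point at the path between the nodes (for example a firewall rule) rather than at the web app.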

I am at a loss for how to further diagnose this. Can someone point me in the right direction?

Thanks

EDIT: In case it is relevant, here is the output of network inspect (192.168.4.100 is chicago and 192.168.4.101 is denver):

docker network inspect getstartedlab_webnet
[
    {
        "Name": "getstartedlab_webnet",
        "Id": "tfmnjicyoygd8unyubccqqlje",
        "Created": "2019-06-21T11:43:38.552690472+01:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.1.0/24",
                    "Gateway": "10.0.1.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "59b66568e536b86c145dce9b6c21233de3c7865d91b0b274a8cc5fc02548bddb": {
                "Name": "getstartedlab_web.2.cs04pkh1j1zrdfs89he5dptjx",
                "EndpointID": "5ba6add127a86feff4ee5dc2fd90e04ca68350684d225b28902186ffefecb87f",
                "MacAddress": "02:42:0a:00:01:03",
                "IPv4Address": "10.0.1.3/24",
                "IPv6Address": ""
            },
            "lb-getstartedlab_webnet": {
                "Name": "getstartedlab_webnet-endpoint",
                "EndpointID": "75de1dfb3a80eab7696a13faa66074e92542b1f5ad5b302227addb6e8d60fba5",
                "MacAddress": "02:42:0a:00:01:05",
                "IPv4Address": "10.0.1.5/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4098"
        },
        "Labels": {
            "com.docker.stack.namespace": "getstartedlab"
        },
        "Peers": [
            {
                "Name": "43b9b37a724a",
                "IP": "192.168.4.100"
            },
            {
                "Name": "6c38ca20f355",
                "IP": "192.168.4.101"
            }
        ]
    }
]
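
A quick way to compare the `Peers` list on each node is to pull out just the peer IPs. This is a rough sketch that relies on the pretty-printed layout shown above rather than on a real JSON parser; `peer_ips` is a hypothetical helper name.

```shell
#!/bin/sh
# Read `docker network inspect` output on stdin and print each peer IP.
# Matches only lines of the form  "IP": "…"  (not "IPv4Address").
peer_ips() {
    sed -n 's/.*"IP": "\([^"]*\)".*/\1/p'
}

# Example:
# docker network inspect getstartedlab_webnet | peer_ips
```

Running it on both chicago and denver should list the same two addresses if each node sees the other as a peer on the overlay network.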

Did you open port 2377/tcp as well? It is required for cluster management communication…

Yes, I opened that port as well. I am seeing the same behavior.

So I went on to part 5, where we add a Redis instance on the manager node, and I am finding that webnet is not even working between nodes. For example, if I curl chicago:4000 and the request is actually served on chicago (instead of being load balanced to denver), then the Redis counter gets incremented and printed, which isn’t surprising since the Redis instance is running on the manager. However, if I access denver directly, the app there isn’t even able to talk to the Redis instance. I ran tcpdump on denver for the Redis port, and I don’t even see a request going out. So I have to conclude that there is something wrong with webnet. I will post my entire docker-compose.yml here. Any help would be appreciated.

version: "3"
services:
  web:
    # replace username/repo:tag with your name and image details
    image: jusschwa/get-started:part2
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: "0.1"
          memory: 50M
      restart_policy:
        condition: on-failure
    ports:
      - "4000:80"
    networks:
      - webnet
  visualizer:
    image: alexellis2/visualizer-arm:latest
    ports:
      - "8081:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    deploy:
      placement:
        constraints: [node.role == manager]
    networks:
      - webnet
  redis:
    image: arm32v7/redis
    ports:
      - "6379:6379"
    volumes:
      - "/home/jusschwa/data:/data"
    deploy:
      placement:
        constraints: [node.role == manager]
    command: redis-server --appendonly yes
    networks:
      - webnet
networks:
  webnet:
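
With this file, the web containers should be able to reach Redis on webnet under the service name `redis`. A minimal sketch for testing that from inside the web task on whichever node it is run on follows. It assumes the image has `python` on its PATH (the tutorial image is Python-based); `probe_redis_from_web` is a hypothetical helper name.

```shell
#!/bin/sh
# Try a TCP connection to redis:6379 from inside the local web container.
probe_redis_from_web() {
    # First container on this node whose name matches the stack's web service.
    cid=$(docker ps -q -f name=getstartedlab_web | head -n 1)
    if [ -z "$cid" ]; then
        echo "no getstartedlab_web task on this node"
        return 1
    fi
    # 3-second connect timeout; fails with a Python traceback if unreachable.
    docker exec "$cid" python -c \
        "import socket; socket.create_connection(('redis', 6379), 3); print('redis reachable')"
}
```

Running it on denver and on chicago would separate a name-resolution failure (error about the host name) from a connectivity failure (timeout) on the overlay network.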

By the way, here is the output of docker network ls:

$ docker network ls
NETWORK ID          NAME                   DRIVER              SCOPE
69637f4be62c        bridge                 bridge              local
c58663393ba1        docker_gwbridge        bridge              local
bok5vq29cvr3        getstartedlab_webnet   overlay             swarm
09227b6003b2        host                   host                local
i2mmtvu7j95t        ingress                overlay             swarm
3e56e33b4714        none                   null                local