All deployed services within a swarm are unreachable, while containers deployed normally work fine

I’ve run into an issue that seems similar to this one: Can't access service in swarm. My setup is a little different, though, and I haven’t found a solution yet.

The minimal, reproducible example

  1. Build a swarm cluster from at least 3 Ubuntu 20.04 Docker Swarm managers.

  2. Deploy a service: docker service create --name test_web --replicas 3 --publish published=8080,target=80 nginxdemos/hello

  3. Check that the containers and services were created properly and observe the failure of connecting to that service:

demi-ubu01:~/stacks$ docker ps

CONTAINER ID   IMAGE                     COMMAND                  CREATED              STATUS              PORTS     NAMES
d4a12a3c5448   nginxdemos/hello:latest   "nginx -g 'daemon of…"   About a minute ago   Up About a minute   80/tcp    test_web.2.yul33wdycarig3qoxnehgrjrz
demi-ubu01:~/stacks$ docker service ls

ID             NAME      MODE         REPLICAS   IMAGE                     PORTS
0yqd7gvggwuh   test_web      replicated   3/3        nginxdemos/hello:latest   *:8080->80/tcp
# External test:
demi-ubu01:~/stacks$ curl -I 10.100.4.5:8080     
curl: (7) Failed to connect to 10.100.4.5 port 8080: Connection refused

# Inside container to published service port:
demi-ubu01:~/stacks$ docker exec -it d4a12a3c5448 wget http://test_web:8080
Connecting to test_web:8080 (10.0.4.2:8080)
wget: can't connect to remote host (10.0.4.2): Host is unreachable

# Inside container to apps exposed port:
demi-ubu01:~/stacks$ docker exec -it d4a12a3c5448 wget http://localhost:80
Connecting to localhost:80 (127.0.0.1:80)
index.html    100% |****************************|  7217   0:00:00 ETA

The first curl command should return HTTP status 200 OK.

The detailed report

My setup has 4 nodes in total. They are identical Ubuntu 20.04 KVM virtual machines, all on the same network. There are no firewalls between them. I have 3 managers and 1 worker (which I only added as a troubleshooting step).

:~/stacks$ docker node ls 
ID                            HOSTNAME     STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
kcm5v64psntjxngnqkfdj1jzh *   demi-ubu01   Ready     Active         Reachable        20.10.1
uo3rljg6ax5qkjm898pyym9t1     demi-ubu02   Ready     Active         Leader           20.10.1
pysnl8sohdp4fv67gui156z4k     demi-ubu03   Ready     Active         Reachable        20.10.1
rp2otsqpnxkgbmxbpkv21yjs6     demi-ubu04   Ready     Active                          20.10.1

I can run a container normally and reach it on the local host fine.

demi-ubu01:~/stacks$ docker run -p 8080:80 -d nginxdemos/hello
de4d0a937710acb1d6d8ae3b7eb9175860b6614dfd9ce92bc972efe619ae095f

demi-ubu01:~/stacks$ docker ps
CONTAINER ID   IMAGE              COMMAND                  CREATED         STATUS         PORTS                  NAMES
de4d0a937710   nginxdemos/hello   "nginx -g 'daemon of…"   4 seconds ago   Up 2 seconds   0.0.0.0:8080->80/tcp   pedantic_wiles

demi-ubu01:~/stacks$ curl -I 10.100.4.5:8080
HTTP/1.1 200 OK
Server: nginx/1.13.8
Date: Sat, 19 Dec 2020 17:59:23 GMT
Content-Type: text/html
Connection: keep-alive
Expires: Sat, 19 Dec 2020 17:59:22 GMT
Cache-Control: no-cache

However, I then deployed the same app as a service using the following compose file:

demi-ubu01:~/stacks$ cat test.yml 
version: "3.6"

services:
  web:
    image: nginxdemos/hello:latest
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: "0.1"
          memory: 50M
      restart_policy:
        condition: on-failure
    ports:
      - target: 80
        published: 8080
        protocol: tcp
        mode: ingress
    networks:
      - webnet

networks:
  webnet:
    driver: overlay

It does not become reachable from any of the hosts at all:

demi-ubu01:~/stacks$ docker stack deploy -c test.yml test
Creating network test_webnet
Creating service test_web

demi-ubu01:~/stacks$ docker ps
CONTAINER ID   IMAGE                     COMMAND                  CREATED          STATUS         PORTS     NAMES
05030ef897a1   nginxdemos/hello:latest   "nginx -g 'daemon of…"   10 seconds ago   Up 7 seconds   80/tcp    test_web.1.kobrpkp68f2qbs4jhd6o8aebg

# Trying on all of the hosts in the cluster. No firewalls here.

demi-ubu01:~/stacks$ curl -I 10.100.4.5:8080
curl: (7) Failed to connect to 10.100.4.5 port 8080: Connection refused
demi-ubu01:~/stacks$ curl -I 10.100.4.9:8080
curl: (7) Failed to connect to 10.100.4.9 port 8080: Connection refused
demi-ubu01:~/stacks$ curl -I 10.100.4.10:8080
curl: (7) Failed to connect to 10.100.4.10 port 8080: Connection refused
demi-ubu01:~/stacks$ curl -I 10.100.4.11:8080
curl: (7) Failed to connect to 10.100.4.11 port 8080: Connection refused

demi-ubu01:~/stacks$ docker service ls
ID             NAME       MODE         REPLICAS   IMAGE                     PORTS
elvfm7o4v4zo   test_web   replicated   3/3        nginxdemos/hello:latest   *:8080->80/tcp

I also don’t see any port bindings being made on those hosts at all, so it doesn’t look like any ports are being published.


demi-ubu01:~/stacks$ docker service inspect test_web
[
    ## https://pastebin.com/WqqyDnVS ##
]

demi-ubu01:~/stacks$ netstat -na | grep LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:24007           0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN

demi-ubu01:~/stacks$ docker network ls
NETWORK ID     NAME              DRIVER    SCOPE
6e5f7e7cebc3   bridge            bridge    local
7a1155f87a62   docker_gwbridge   bridge    local
ab32da8ac1ec   host              host      local
46id8wzw4ayf   ingress           overlay   swarm
a24a40ef78f4   none              null      local
d9l7msysdx8m   test_webnet       overlay   swarm
demi-ubu01:~/stacks$ docker network inspect 46id8wzw4ayf
[
    https://pastebin.com/JPA0ZBjE
]
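
The listening-socket check done with netstat above can be made repeatable across nodes. The following is only a minimal sketch: the helper name is_listening is hypothetical, and it assumes the column layout of the `netstat -na` output shown in this post (local address in column 4, state in column 6).

```shell
# Succeed (exit 0) if `netstat -na`-style text on stdin shows a TCP socket
# in LISTEN state on port $1. Assumes the column layout shown in this post.
is_listening() {
  awk -v port="$1" '
    $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit !found }
  '
}

# Two lines copied from the netstat output in this post.
sample='tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN'

# Check the SSH port (present) and the published service port (absent).
for p in 22 8080; do
  if printf '%s\n' "$sample" | is_listening "$p"; then
    echo "port $p: listening"
  else
    echo "port $p: not listening"
  fi
done
```

On a real node the sample text would be replaced by live output, for example by piping `netstat -na` into `is_listening 8080`.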

I also can’t reach the service from inside one of its own containers. After exec’ing into a container, I can hit the local app port, but I cannot hit the service by name, even though the container can resolve the service name.

## Testing the app's service from the local container fails:

demi-ubu01:~/stacks$ docker exec -it 05030ef897a1 wget http://test_web:8080
Connecting to test_web:8080 (10.0.4.2:8080)
wget: can't connect to remote host (10.0.4.2): Host is unreachable


## Testing the app's local port from the local container is successful:

demi-ubu01:~/stacks$ docker exec -it 05030ef897a1 wget http://localhost:80
Connecting to localhost:80 (127.0.0.1:80)
index.html    100% |****************************|  7217   0:00:00 ETA
demi-ubu01:~/stacks$ docker --version
Docker version 20.10.1, build 831ebea

I’ve made sure that I’m not using any overlapping networks that might be causing this, and have gone so far as to completely redeploy the cluster. I’ve just about exhausted all of my troubleshooting ideas. Any ideas?

Is it safe to assume that 10.100.4.5 is one of your nodes’ IPs?

The default address pool is 10.0.0.0/8, see: docker info --format '{{json .Swarm.Cluster.DefaultAddrPool}}'

If this is the case, you might find this blog post helpful. You can safely ignore that it refers to Docker EE; the problem and solution are valid for Docker CE as well. You need to alter default-addr-pool, either when initiating the swarm or by modifying each node’s /etc/docker/daemon.json configuration file and then restarting the daemon.
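
To make the suspected overlap concrete, here is a minimal sketch in POSIX shell that tests whether an IPv4 address falls inside a CIDR block. The helper names ip_to_int and in_cidr are hypothetical; the addresses are the node IP and the two pools mentioned in this thread.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed (exit 0) if address $1 lies inside CIDR block $2.
in_cidr() {
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Node address from the question against the default pool and a narrower one.
if in_cidr 10.100.4.5 10.0.0.0/8; then
  echo "10.100.4.5 is inside 10.0.0.0/8"
fi
if ! in_cidr 10.100.4.5 10.135.0.0/16; then
  echo "10.100.4.5 is outside 10.135.0.0/16"
fi
```

The same check can be run against each node address before picking a pool, so that the pool chosen for the swarm does not contain any host IP.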

demi-ubu01:~/stacks$ docker info --format '{{json .Swarm.Cluster.DefaultAddrPool}}'
["10.0.0.0/8"]

That seems to be the issue! I’ll take a look at this and get that straightened out. Thanks!

I spoke too soon. I changed the CIDR and confirmed the change by looking at the new CIDR setting after reinitializing the cluster. I also looked at the network configs for the services. The result is still the same as before.

demi-ubu01:~$ docker info --format '{{json .Swarm.Cluster.DefaultAddrPool}}' 
["10.135.0.0/16"]

I’m still not able to connect to any services, however. To change the pool, I removed all nodes from the swarm cluster and re-initialized it. After that, I redeployed the compose file from the original post.
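
For reference, the re-initialization described above can be sketched as a dry run that only prints the commands. This is a sketch under assumptions, not a record of what was actually typed: the function name reinit_plan is hypothetical, and the pool, file, and stack names are the ones used elsewhere in this thread.

```shell
# Values from this thread; adjust for another environment.
POOL="10.135.0.0/16"
STACK_FILE="test.yml"
STACK_NAME="test"

# Print the re-initialization steps instead of running them, so they can be
# reviewed first. Arguments: address pool, compose file, stack name.
reinit_plan() {
  echo "docker swarm leave --force"                  # run on every node
  echo "docker swarm init --default-addr-pool $1"    # run on the first manager
  echo "docker swarm join-token manager"             # prints the join command for the other managers
  echo "docker stack deploy -c $2 $3"                # redeploy once the nodes have rejoined
}

reinit_plan "$POOL" "$STACK_FILE" "$STACK_NAME"
```

Each remaining node then rejoins with the `docker swarm join` command printed by the join-token step.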

Update: I redeployed using Ubuntu 18.04 as my base image, and the exact same setup on that (deployed using Ansible) seems to work fine. So this looks like an issue specific to the current version of Docker on Ubuntu 20.04.