(I previously posted this on Server Fault, but I don’t think anyone saw the post, so I am reposting here.)
I am looking for help debugging the following setup:
I have 3 VPS cloud instances from a hosting company. (I believe the VPSs are VMware, but I can’t find any documentation on the hosting company’s site.)
- All are running Ubuntu 18.04.
- I have installed docker on all 3.
All the docker versions are the same:
Client: Docker Engine - Community
Version: 19.03.5
API version: 1.40
Go version: go1.12.12
Git commit: 633a0ea838
Built: Wed Nov 13 07:29:52 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Version: 19.03.5
API version: 1.40 (minimum version 1.12)
Go version: go1.12.12
Git commit: 633a0ea838
Built: Wed Nov 13 07:28:22 2019
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.4
GitCommit: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
runc:
Version: 1.0.0-rc6+dev
GitCommit: 6635b4f0c6af3810594d2770f662f34ddc15b40d
docker-init:
Version: 0.18.0
GitCommit: fec3683
On Node 1 I ran the following init command:
docker swarm init --advertise-addr NODE_1_IP --data-path-port=7789
And on Nodes 2 and 3 I ran the following join command:
docker swarm join --token XXX --advertise-addr NODE_2/3_IP NODE_1_IP:2377
The token is the value Node 1 gave me. I resolved an earlier problem by specifying --data-path-port. I think this is because the VPSs run on VMware, which conflicts with the default data path port (UDP 4789).
My cloud provider gives me a UI to apply firewall rules to individual VPSs. I have used a firewall group to apply the following rules to all 3 servers:
TCP ACCEPT to dest ports 80, 443, (and my SSH port)
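For comparison, the Docker swarm documentation lists a few ports that must be open between the nodes themselves, and encrypted overlay networks additionally need IP protocol 50 (ESP). The sketch below only prints that checklist for a given data path port. The function name swarm_required_rules is made up for illustration, and how each rule is entered depends on the provider's firewall UI.

```shell
# Print the node-to-node traffic a swarm with overlay networks needs.
# $1 = data path port (Docker's default is 4789; this cluster uses 7789).
swarm_required_rules() {
  port="${1:-4789}"
  echo "tcp 2377"    # cluster management traffic to the manager
  echo "tcp 7946"    # node-to-node communication
  echo "udp 7946"    # node-to-node communication
  echo "udp $port"   # overlay (VXLAN) data path
  echo "esp"         # IP protocol 50, only needed for --opt encrypted networks
}

swarm_required_rules 7789
```

Each printed line is one protocol/port pair to compare against the rules applied to the firewall group.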
To test this I ran the following commands on Node 1 (which is the manager):
docker network create --driver=overlay --attachable testnet
docker network create --opt encrypted --driver=overlay --attachable testnet_encrypted
docker service create --network=testnet --name web --publish 80 --replicas=1 --constraint 'node.labels.type == test' nginx:latest
docker service create --network=testnet_encrypted --name web_encrypted --publish 80 --replicas=1 --constraint 'node.labels.type == test' nginx:latest
The constraint means the services run on Node 3 only.
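The constraint only matches if Node 3 carries the label type=test. The post does not show how the label was applied, so the following is a sketch of the usual way to set it and to confirm where the tasks landed. The helper names and the DOCKER override variable are hypothetical.

```shell
# DOCKER can be overridden (for example DOCKER="sudo docker").
DOCKER="${DOCKER:-docker}"

# Attach the label that the placement constraint matches on.
# $1 = node hostname or node ID as shown by `docker node ls`.
label_test_node() {
  $DOCKER node update --label-add type=test "$1"
}

# List the tasks of a service with the node each one is scheduled on.
# $1 = service name.
service_nodes() {
  $DOCKER service ps --format '{{.Name}} {{.Node}} {{.CurrentState}}' "$1"
}
```

On the manager this would be used as `label_test_node <node 3 hostname>` followed by `service_nodes web` and `service_nodes web_encrypted`.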
Once the service is running across the cluster I do the following:
docker run --rm --name alpine --net=testnet -ti alpine:latest sh
apk add --no-cache curl
I then run curl:
curl web
Any time I run this I get a response.
Then I switch over to the encrypted network and repeat the same test:
docker run --rm --name alpine --net=testnet_encrypted -ti alpine:latest sh
apk add --no-cache curl
I then run curl:
curl web_encrypted
If I do this on Node 1 or Node 2, it hangs and then times out. If I do this on Node 3, it works.
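To make the comparison repeatable on every node, the two manual tests can be wrapped in one non-interactive helper. This is a sketch, not the exact procedure used above: it swaps curl for the BusyBox wget that already ships in the alpine image, and the function name and DOCKER override are hypothetical.

```shell
# DOCKER can be overridden (for example DOCKER="sudo docker").
DOCKER="${DOCKER:-docker}"

# Request a service over a given overlay network from a throwaway container.
# $1 = attachable overlay network, $2 = service name.
check_service() {
  net="$1"
  svc="$2"
  # -q: quiet, -T 5: 5-second network timeout, -O /dev/null: discard the body
  if $DOCKER run --rm --net="$net" alpine:latest \
       wget -q -T 5 -O /dev/null "http://$svc"; then
    echo "$net/$svc: ok"
  else
    echo "$net/$svc: FAILED"
  fi
}
```

Running `check_service testnet web` and `check_service testnet_encrypted web_encrypted` on each node in turn gives a one-line result per network.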
The ESP ACCEPT rule was added to my cloud provider firewall ruleset after some research into the issue.
I have tried rebooting the cluster but no luck.
Debug work 1
sudo tcpdump src $NODE_1_IP and dst $NODE_3_IP and port 7789
sudo tcpdump src $NODE_3_IP and dst $NODE_1_IP and port 7789
One console shows me traffic into NODE_3 and the other shows traffic out of NODE_3.
I then ran the unencrypted test again.
I see about 7 lines appear on the incoming console and 5 lines appear on the outgoing console. So there is traffic going into NODE_3 and traffic going out of NODE_3, and the test is working.
I then ran the encrypted test
This time I see a single line appear on the incoming console, and nothing on the outgoing console. So a single packet is getting to NODE_3. I am not sure if it is getting decrypted and sent back to the container.
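On an encrypted overlay the VXLAN packets are expected to travel wrapped in ESP, so a capture filter on port 7789 alone may miss part of the exchange. A sketch of a filter that matches both forms between two nodes follows. The helper name is hypothetical, and `ip proto 50` is the numeric form of ESP.

```shell
# Build a tcpdump filter for overlay traffic in one direction.
# $1 = source node IP, $2 = destination node IP, $3 = data path port.
overlay_capture_filter() {
  echo "src $1 and dst $2 and (udp port $3 or ip proto 50)"
}
```

Usage would be `sudo tcpdump -ni any "$(overlay_capture_filter "$NODE_1_IP" "$NODE_3_IP" 7789)"` in one console, and the same command with the two addresses swapped in the other.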
Debug work 2
One area of config I failed to mention is that I have the following /etc/docker/daemon.json set up:
{
"hosts": ["unix:///var/run/docker.sock", "tcp://"],
"tlscacert": "/var/docker/ca.pem",
"tlscert": "/var/docker/server-cert.pem",
"tlskey": "/var/docker/server-key.pem",
"tlsverify": true
}
This is to allow me to use TLS client certs to connect remotely. This file was set up on all nodes before I created the swarm.
As decryption of the packets looks like a possible cause, I changed my daemon.json to the following:
{
"hosts": []
}
I then rebooted each machine. The test results are the same - still not working.
I then ran the command:
docker swarm ca --rotate
and re-ran the tests. This has the same result.
I have not removed and re-inited the cluster with the new config. (I could do so if someone thinks it would help, but I have a lot of Docker secrets and configs which I would lose in the process.)
Debug work 3
I have now completely removed and re-inited the cluster. This has not solved the issue.
Some sources say that the following command, when run on the nodes, should show traffic:
sudo tcpdump -p esp
I have run this on all nodes in the cluster and repeated all tests, and there is no output anywhere.
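Docker's encrypted overlay relies on the kernel's IPsec (xfrm) framework, so when no ESP shows up on the wire, a further check is whether each node has any ESP security associations installed at all. The sketch below assumes iproute2's `ip xfrm` is available on the nodes. The helper name is hypothetical.

```shell
# Count ESP security associations in `ip xfrm state` output read from stdin.
# Each association is printed as a "src ... dst ..." line followed by an
# indented "proto esp spi ..." line, so counting the latter counts the SAs.
count_esp_sas() {
  # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
  grep -c 'proto esp' || true
}
```

Usage would be `sudo ip xfrm state | count_esp_sas` on every node, with `sudo ip xfrm policy` alongside to list the matching policies. A count of 0 on a node would suggest the encryption state was never set up there.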
ufw is inactive on all the nodes:
robert@metcaac6:/var/log$ sudo ufw status
[sudo] password for robert:
Status: inactive
but when I run iptables -L I get the same rules on every node:
robert@metcaac6:/var/log$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT udp -- anywhere anywhere policy match dir in pol ipsec udp dpt:7789 u32 "0x0>>0x16&0x3c@0xc&0xffffff00=0x100300"
DROP udp -- anywhere anywhere udp dpt:7789 u32 "0x0>>0x16&0x3c@0xc&0xffffff00=0x100300"
Chain FORWARD (policy DROP)
target prot opt source destination
DOCKER-USER all -- anywhere anywhere
DOCKER-INGRESS all -- anywhere anywhere
DOCKER-ISOLATION-STAGE-1 all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
DROP all -- anywhere anywhere
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Chain DOCKER (2 references)
target prot opt source destination
Chain DOCKER-INGRESS (1 references)
target prot opt source destination
ACCEPT tcp -- anywhere anywhere tcp dpt:https
ACCEPT tcp -- anywhere anywhere state RELATED,ESTABLISHED tcp spt:https
ACCEPT tcp -- anywhere anywhere tcp dpt:http
ACCEPT tcp -- anywhere anywhere state RELATED,ESTABLISHED tcp spt:http
ACCEPT tcp -- anywhere anywhere tcp dpt:30001
ACCEPT tcp -- anywhere anywhere state RELATED,ESTABLISHED tcp spt:30001
ACCEPT tcp -- anywhere anywhere tcp dpt:30000
ACCEPT tcp -- anywhere anywhere state RELATED,ESTABLISHED tcp spt:30000
RETURN all -- anywhere anywhere
Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target prot opt source destination
DOCKER-ISOLATION-STAGE-2 all -- anywhere anywhere
DOCKER-ISOLATION-STAGE-2 all -- anywhere anywhere
RETURN all -- anywhere anywhere
Chain DOCKER-ISOLATION-STAGE-2 (2 references)
target prot opt source destination
DROP all -- anywhere anywhere
DROP all -- anywhere anywhere
RETURN all -- anywhere anywhere
Chain DOCKER-USER (1 references)
target prot opt source destination
RETURN all -- anywhere anywhere
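The first two INPUT rules appear to accept data-path packets for the encrypted network only when they arrived through IPsec, and to drop them otherwise. Listing the chain with packet counters would show which of the two rules the test traffic is hitting. This is a sketch; the helper name and the IPTABLES override variable are hypothetical.

```shell
# IPTABLES can be overridden (for example IPTABLES="sudo iptables").
IPTABLES="${IPTABLES:-iptables}"

# List the INPUT chain with counters.
# -v: show packet and byte counters, -n: numeric output, -x: exact counts
show_input_counters() {
  $IPTABLES -L INPUT -v -n -x
}
```

Running `IPTABLES="sudo iptables"; show_input_counters` on Node 3 before and after the encrypted test shows which counter increases. If the DROP rule's count goes up, the packets would be arriving unencrypted.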
I have inspected dmesg and /var/log/syslog looking for possible issues but I can’t find any.
Now I am stuck. Are there any recommendations for how I can proceed with debugging?