Swarm 1.12 Multi-host networking help

mrjana · August 26, 2016, 5:46am

@rusher81572 Sorry to hear about your issues. You seem to be running an RPi cluster. Is that correct? I am not sure whether you are hitting an issue specific to running on a Raspberry Pi cluster. Would you mind opening an issue with detailed information about your setup and a sequence of steps to reproduce the problem?

alexellis2 · August 26, 2016, 7:52am

Some of us identified that the Raspbian kernel was missing the vxlan module. Running rpi-update, followed by a reboot, adds the module.

I put these test scenarios together to try and document what was going wrong:

After the update I was able to get through them all. I’ve now got an 8-node cluster which can run my redis hit-counter in Swarm mode.

LMK if this helps @rusher81572 @mrjana

rusher81572 · August 28, 2016, 5:03am

@mrjana @alexellis2 Thanks for the suggestions. Please note that the basic Swarm functionality testing I did when I created this forum topic was done with docker-machine on x86 hardware with VirtualBox, using 1.12.0.

I ran rpi-update and updated the kernel. However, lsmod did not show a vxlan module, so I loaded it with modprobe and now it is visible.
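The check-then-load step described above can be sketched roughly as follows. This is a minimal sketch, assuming a Linux host where /proc/modules lists the loaded modules; the helper name `module_loaded` and the `MODULES_FILE` variable are hypothetical, introduced only for illustration, and the /etc/modules hint assumes a Debian-style system such as Raspbian.

```shell
#!/bin/sh
# Assumption: loaded modules are listed one per line, name in the first field,
# in /proc/modules. MODULES_FILE is a hypothetical override used for testing.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

# Hypothetical helper: succeeds if the named module is in the listing.
module_loaded() {
    [ -r "$MODULES_FILE" ] || return 1
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$MODULES_FILE"
}

if module_loaded vxlan; then
    echo "vxlan is loaded"
else
    echo "vxlan is not loaded; try: sudo modprobe vxlan"
    # To load it at every boot (assumption: /etc/modules is read at startup):
    #   echo vxlan | sudo tee -a /etc/modules
fi
```

The script only reports the state and prints the modprobe command, so it is safe to run without root.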

Testing Procedure:

Used three Pi’s with Docker 1.12.1-rc2, build 236317f, experimental, vxlan module loaded. Kernel 4.4.19-v7+.

docker swarm init
docker swarm join.....(for each Pi as needed)

1mmb8fgd9s8m9peodyr3x6qoc *  rpi-2     Ready   Active        Leader
c26mil621fs4f0iwa1e8884ar    rpi-3     Ready   Active
em26jwtm7u3f800pkrl4fw2py    rpi-4     Ready   Active

docker network create chat -d overlay
docker service create --name mysql --network chat registry:5000/mysql
docker service create --name test --network chat -p 444:444 registry:5000/test

docker service ls

ID            NAME   REPLICAS  IMAGE                   COMMAND
4n6r4fbiu0cl  mysql  1/1       registry:5000/mysql
8h36vyo2luyy  test   1/1       registry:5000/test

We can see here that the database and the Node.js application are running on separate machines.

# docker node ps rpi-3

ID                         NAME     IMAGE                NODE   DESIRED STATE  CURRENT STATE               ERROR
1zfxjmkbqgz7c0q9ji26h7qkg  mysql.1  registry:5000/mysql  rpi-3  Running        Running about a minute ago

# docker node ps rpi-2

ID                         NAME    IMAGE               NODE   DESIRED STATE  CURRENT STATE               ERROR
5dcwkxxdy4aunbvg6sk9q3wtw  test.1  registry:5000/test  rpi-2  Running        Running about a minute ago

The service named test is running my Node.js app fine with database connectivity and is accessible from each host. More testing is needed with other apps, but it is looking good so far.

The only problem that remains now is with labels and constraints, which do not seem to be working right.

This command works and runs MySQL on rpi-3 as expected:

docker service create --name sql2 -l "node=rpi-3" --network chat registry:5000/mysql

Then I tried to run MySQL with a label naming a node that does not exist, and it just went to a random node:

docker service create --name sql2 -l "node=rpi-1" --network chat registry:5000/mysql

For my final test, I created three MySQL services using the same label, and they all went to random nodes, which is not the behavior I would expect.

docker service create --name sql3 -l "node=rpi-4" --network chat registry:5000/mysql
docker service create --name sql2 -l "node=rpi-4" --network chat registry:5000/mysql
docker service create --name sql1 -l "node=rpi-4" --network chat registry:5000/mysql

mavenugo · August 28, 2016, 6:21am

@rusher81572 thanks for confirming the network connectivity behaviour with the rpi-update and 1.12.1-rc (though I would recommend using 1.12.1 released version).

Regarding the scheduling constraints, I don’t think you are using the correct options in the docker service create command. “-l” just adds a label to an object. If you are looking to constrain the scheduling, you should use the --constraint option with one of the supported constraints, as mentioned here: https://docs.docker.com/engine/reference/commandline/service_create/#/specify-service-constraints .
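To make the difference concrete, here is a hedged sketch of how the earlier sql2 command might look with `--constraint` instead of `-l`. It reuses the node, network, and image names from the earlier post and assumes the `node.hostname` constraint key. The `DOCKER` variable and the `pin_to_host` helper are hypothetical, added so that the command is only printed by default rather than executed.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Set DOCKER=docker to actually run it against a swarm manager.
DOCKER="${DOCKER:-echo docker}"

# Hypothetical helper: create a service that may only be scheduled
# on the node with the given hostname.
pin_to_host() {
    # $1 = service name, $2 = node hostname, $3 = image
    $DOCKER service create --name "$1" \
        --constraint "node.hostname==$2" \
        --network chat "$3"
}

pin_to_host sql2 rpi-3 registry:5000/mysql
```

In dry-run mode this prints the full `docker service create` line, which can be reviewed before running it for real.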

alexellis2 · August 28, 2016, 7:22am

I used constraints in classical Swarm (where you need an external kv store) - as I remember it involved editing the daemon script to add a label. I haven’t tried scheduling by hostname, but this looked relevant: https://github.com/docker/docker/pull/24397#issuecomment-231227571

Btw, did the RC come via experimental.docker.com? If you are finding that re-running the get.docker.com command is not refreshing the Docker version to the general release, then you might want to run apt-get remove docker-engine prior to the script.

Let us know how that goes - I will also try scheduling via label on my swarm and report back.

alexellis2 · August 28, 2016, 7:56am

I’ve had a quick go on my 7 node cluster.

Set up via node hostname

$ docker service create --constraint node.hostname==pi2swarm7 --name hello1 --publish 3000:3000 --replicas=1 alexellis2/arm-alpinehello
6u0lvc8cm1d8unfasek2n6x2r

$ docker service ps hello1
ID                         NAME      IMAGE                       NODE       DESIRED STATE  CURRENT STATE          ERROR
97p4vkcgs42309srjvn6gb5d5  hello1.1  alexellis2/arm-alpinehello  pi2swarm7  Running        Running 3 minutes ago

This is an example with a custom node label:

$ docker node update pi2swarm2 --label-add db=1

$ docker service create --constraint 'node.labels.db == 1' --name hello2 --publish 3001:3000 --replicas=1 alexellis2/arm-alpinehello

$ docker service ps hello2
ID                         NAME      IMAGE                       NODE       DESIRED STATE  CURRENT STATE           ERROR
e5zb4g4myzsf35ipydt21fsvi  hello2.1  alexellis2/arm-alpinehello  pi2swarm2  Running        Running 36 seconds ago

Hope these examples help with your set-up.

rusher81572 · August 28, 2016, 7:56pm

Constraints are working, thanks @alexellis2. I moved Swarm 1.12.1 into production on my Pi’s. I did notice a few issues.

First, I am running phpfpm in a container, and TinyRSS constantly reloads over and over again after clicking on “All Articles”. I scaled phpfpm to 1 instance and that looked like it resolved the problem, but I would like to scale it out like I did before using nginx. Then I started getting 504 Gateway Time-outs when accessing TinyRSS. The WordPress blog and TinyRSS both connect to the same phpfpm container, and WordPress works fine. There must be something wrong with the internal Swarm plumbing. I removed the service and created it again and it seems to work now. The only problem is that 2 of the 3 phpfpm instances are now running on the same host.

There are also performance issues. I connect to the first Pi, which should load balance all my requests to the other Pi’s, just like my setup before. But with Swarm taking over the load balancing with the mesh networking, it is very slow. This behavior is breaking all of my Node.js apps now.

As with previous Docker versions, it is annoying that containers on a virtual network cannot resolve each other if one of them is restarted.

Lastly, I noticed that scaling containers will sometimes schedule multiple instances on the same Pi. Is this expected behavior? I would rather have 1 instance on each Pi.
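One option not discussed in this thread is the `--mode global` flag of docker service create, which asks the scheduler to run one task of the service on every node instead of a fixed replica count. The sketch below is only an illustration of that idea: the image name `registry:5000/phpfpm`, the `DOCKER` dry-run variable, and the `create_global` helper are all assumptions, not something taken from the setup described above.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Set DOCKER=docker to actually run it against a swarm manager.
DOCKER="${DOCKER:-echo docker}"

# Hypothetical helper: create a service in global mode, so the scheduler
# places one task on each node rather than N replicas anywhere.
create_global() {
    # $1 = service name, $2 = image
    $DOCKER service create --mode global --name "$1" --network chat "$2"
}

# registry:5000/phpfpm is an assumed image name used for illustration.
create_global phpfpm registry:5000/phpfpm
```

A global service cannot be scaled with a replica count, so this trades flexible scaling for a guaranteed one-per-node layout.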

mavenugo · August 28, 2016, 11:33pm

@rusher81572 Can you please confirm whether the issues that you raised earlier in this thread are addressed?

Using 1.12.1
multi-host networking issue that you raised in r-pi is resolved by rpi-update (and modprobe ?)
Constraints issue that you brought up is a human error (invalid flag usage).

Regarding the other issues that you have raised in the recent comments, I think we should take this to the docker/docker issue tracker so that we can gather more information (and correct any human errors) and also get better visibility and support from maintainers.

rusher81572 · August 29, 2016, 5:53pm

Hello @mavenugo

Yes, the earlier issues in this thread are addressed, namely:

Using 1.12.1
multi-host networking issue that you raised in r-pi is resolved by rpi-update (and modprobe ?)
Constraints issue that you brought up is a human error (invalid flag usage).

Thanks for the help.

mrjana · August 30, 2016, 9:16pm

@rusher81572 Can you please characterize what you mean by “load balancing with mesh networking is very slow”? Is it slow as in requests taking a long time, or are you seeing throughput issues? Also, is this when you expose a port and access it from outside the cluster, or is it intra-cluster traffic?
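One way to put a number on the "taking a long time" half of this question is to time repeated requests against a published port on different nodes. The following is a rough sketch, assuming curl is installed and that the test service from earlier in the thread is still published on port 444; the helper names `mean` and `time_requests` are hypothetical, and the script does not measure throughput.

```shell
#!/bin/sh
# Hypothetical helper: print the arithmetic mean of the numbers on stdin
# (one per line), formatted with three decimals.
mean() {
    LC_ALL=C awk '{ sum += $1; n++ } END { if (n) printf "%.3f\n", sum / n }'
}

# Hypothetical helper: issue N sequential requests to a URL and print the
# mean total request time in seconds, as reported by curl.
time_requests() {
    # $1 = URL, $2 = number of requests
    i=0
    while [ "$i" -lt "$2" ]; do
        curl -s -o /dev/null -w '%{time_total}\n' "$1"
        i=$((i + 1))
    done | mean
}

# Example (not run here): hit the same published port through two nodes.
#   time_requests http://rpi-2:444/ 20
#   time_requests http://rpi-3:444/ 20
```

Comparing the node that runs the task with a node that only forwards through the routing mesh gives a first indication of how much time the mesh hop adds.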

kriz · March 29, 2017, 1:26am

I ran into the same problem today with the newest Docker engine version (CE). Are the swarm mode routing mesh and internal routing production ready?